Conversation

@dlebauer dlebauer commented Feb 4, 2025

Description

  • Refactored SDA functions to accept direct R objects (lists, data.frames, and sf objects) in addition to file paths.
  • Added a helper function (.convert_coords_to_sf()) to standardize coordinate conversion.
  • Updated man pages to reflect changes in input parameters.
  • Implemented new tests to verify the updated behavior.

Motivation and Context

Existing downscaling code required rasters. In the CCMMF project we are using vector data, so this PR generalizes the downscaling code to support vector file types as well.

It also allows ensemble data to be passed as either an RDS file or an R list of dataframes.

Review Time Estimate

  • Without unnecessary delay

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

##' - A file path to a `.csv` containing site coordinates with columns `"id"`, `"lat"`, and `"lon"`.
##' - A `data.frame` with the same structure.
##' - An `sf` object with point geometries.
##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within the file or object supplied to 'ensemble_data'.
@mdietze
When this function was originally written I asked that this be fixed to just take in an actual Date. While you're fixing things could you fix that too.

@dlebauer

@mdietze I'm not sure I fully understand, or how I can make sure not to break existing code.

What I have in mind is to

  • require the `date` parameter to be a Date object and update the docs accordingly, and
  • if it is not a Date object, try to convert from the two formats (yyyy/mm/dd or yyyy-mm-dd)? (this already happens)

is that what you had in mind?

keep.forest = TRUE,
importance = TRUE)

data = train_data,
revert indentation change

keep.forest = TRUE,
importance = TRUE
)


Another suggestion I've made in the past: there may be logical/computational advantages to separating the function that builds the downscaling models from the function that makes the predictions to a map. Right now one can't just fit the model and evaluate the results without also running the (more computationally costly) mapping step.

Deprecating date arg to SDA_downscale because it is never used
put check for sf object with related checks

@infotroph infotroph left a comment


A high-level question that you likely considered already:

Accepting both files and in-memory data seems like it adds a lot of complexity to functions that were nice and clean before that, and I naively expect it would be both "easy" and desirable for error recovery to instead write data out to files before running this. What tradeoff factors make this the better approach?

@mdietze

mdietze commented Feb 6, 2025

Per @infotroph last comment, I'm inclined to push in the other direction. For SDA_downscale, I'm fairly confident that I asked the original authors of these functions to stop loading the data within the function, as you have greater flexibility if you load the constraint data first then call the function. This way you can easily control what covariate variables are part of any particular downscaling, and what domain you're doing the downscaling for, by changing what stack of variables you're passing in. In other words, for that function you could probably drop the option to pass in file paths.

@infotroph

@mdietze 's point makes sense and yes, load-first seems sensible. I was trying to question the need to support both rather than pushing for write-first specifically.

@dlebauer dlebauer added the ccmmf label (issues and PRs related to the ccmmf project) Feb 12, 2025
@dlebauer

> @mdietze 's point makes sense and yes, load-first seems sensible. I was trying to question the need to support both rather than pushing for write-first specifically.

I was maintaining the 'read from file' functionality to remain backwards-compatible. If these functions don't need to be backward compatible, they will be much easier to refactor.

@mdietze

mdietze commented Feb 19, 2025

Right now, this code has a small enough user community that backwards compatibility isn't an issue. That will change once we start publishing using this workflow and need outward-facing reproducibility.

Comment on lines +19 to +20
- Refactored and created new version of `SDA_downscale` named `ensemble_downscale`.
- Added helper function `.convert_coords_to_sf()` for consistent conversion of lat/lon data to `sf` points.

Nits:

  • refactors seem like they belong in "changed" not "fixed"
  • internal helper functions don't need to be mentioned in the changelog (and I'd argue it's confusing to do so, since the point is for users to not know or care about them)


Also: needs to move up to "Unreleased" (this section is now for the already-tagged 1.9.0)

#' @return A list the same dimension as X, with each column of each dataframe
#' modified by replacing outlier points with the column median
#' @export
#' @export outlier.detector.boxplot

I bet this is trying to prevent this name being treated as an S3 method, but I recommend renaming the function instead. Life's too short to fight with R about this.

Comment on lines +52 to +54
if (!inherits(date, "Date")) {
standard_date <- lubridate::ymd(date)
}

ymd accepts Date objects and returns them unchanged -- can skip the if and the rename
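A minimal sketch of the simplification the reviewer describes, wrapped in a hypothetical `standardize_date()` helper (not a function in the PR) to make it self-contained:

```r
library(lubridate)

# Per the reviewer's note, ymd() parses character dates and passes
# Date objects through unchanged, so no inherits() guard or
# standard_date rename is needed.
standardize_date <- function(date) {
  lubridate::ymd(date)
}

standardize_date("2024-01-02")           # Date: "2024-01-02"
standardize_date(as.Date("2024-01-02"))  # same Date, unchanged
```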

Comment on lines +83 to +90
# see https://github.com/PecanProject/pecan/pull/3431/files#r1953601359
# if (nrow(site_coordinates) > nrow(carbon_data)) {
# PEcAn.logger::logger.info("Truncating site coordinates to match carbon data rows.")
# site_coordinates <- site_coordinates[seq_len(nrow(carbon_data)), ]
# } else {
# PEcAn.logger::logger.info("Truncating carbon data to match site coordinates rows.")
# carbon_data <- carbon_data[seq_len(nrow(site_coordinates)), ]
# }

The link is taking me to a whole-PR view so I can't tell what specific comment/commit it's referring to, but the commented lines seem safe to delete -- I agree that unequal row counts should make us stop with an error rather than truncate and continue.

##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'preprocessed' from the 'data_path'.
##' @param carbon_pool Character. Carbon pool of interest. Name must match carbon pool name found within file supplied to 'preprocessed' from the 'data_path'.
##' @param covariates SpatRaster stack. Used as predictors in downscaling. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder
##' @param date *Deprecated*. This argument has never been used and will be removed after 2026-04-01

I vote drop it now, mention in NEWS, and let users figure it out

site_id %in% unique(site_coords$id),
variable == carbon_pool
) |>
select(site_id, ensemble, prediction) # use site_id instead of site

Suggested change
select(site_id, ensemble, prediction) # use site_id instead of site
dplyr::select("site_id", "ensemble", "prediction") # use site_id instead of site

not sure what the comment means -- is it needed?


ensembles <- unique(ensemble_data$ensemble)

results <- furrr::future_map(seq_along(ensembles), function(i) {

I strongly encourage pulling this anonymous function out to its own definition and giving it an informative name


Also, it appears `i` is only used to subset `ensembles` -- is there a reason not to pass each one directly (i.e. `map(ensembles, function(ens) ...)`)?
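A sketch of the map-over-values form, using a toy stand-in for the PR's `ensemble_data` (the real function body would fit and predict per ensemble member):

```r
library(furrr)

# Toy stand-in for the PR's ensemble_data
ensemble_data <- data.frame(
  ensemble   = rep(1:2, each = 3),
  prediction = rnorm(6)
)
ensembles <- unique(ensemble_data$ensemble)

# Pass each ensemble value directly instead of an index i
results <- furrr::future_map(ensembles, function(ens) {
  ensemble_data[ensemble_data$ensemble == ens, ]
})
```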

##' @export
downscale_metrics <- function(downscale_output) {

test_data_list <- lapply(downscale_output$test_data, function(x) dplyr::pull(x, prediction))

curious why the extra lambda -- does this differ from lapply(downscale_output$test_data, dplyr::pull, prediction), or indeed from lapply(downscale_output$test_data, `[[`, "prediction")?
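For plain data frames the variants are interchangeable; a quick base-R check on toy data (using `$` in place of `dplyr::pull` to avoid the dependency):

```r
test_data <- list(
  data.frame(prediction = 1:3),
  data.frame(prediction = 4:6)
)

# Extra lambda vs. passing the extractor directly
a <- lapply(test_data, function(x) x$prediction)
b <- lapply(test_data, `[[`, "prediction")
identical(a, b)  # TRUE
```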

withr::with_tempdir({
temp_ensemble_data_rds <- "ensemble_data.rds"
temp_coords_csv <- "final_design_points.csv"
file.remove(temp_ensemble_data_rds, temp_coords_csv)

Why do you need to start by removing these if the whole test happens in a fresh tempdir?

temp_coords_csv <- "final_design_points.csv"
file.remove(temp_ensemble_data_rds, temp_coords_csv)

set.seed(123)

Beware that the effects of this set.seed persist outside this test block! I think the easiest way to contain it would be to wrap all the lines that use random numbers in one big withr::with_seed(123, {...code that uses randomness here...}).

(I think supposedly you could also replace this line with withr::local_seed(123) to have the seed reset at the end of the current execution context, but in my local tests that still affected code outside the function. Dunno if I'm misunderstanding "context" or I did something else wrong)
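A small illustration of the `withr::with_seed` wrapping suggested above: the RNG state is saved and restored around each call, so seeded draws are reproducible without leaking the seed into later code.

```r
library(withr)

# Two independently wrapped calls with the same seed produce the
# same draws, and the global RNG state is restored after each call.
x <- withr::with_seed(123, runif(3))
y <- withr::with_seed(123, runif(3))
identical(x, y)  # TRUE
```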
