Update SDA Downscaling to allow sf #3431
base: develop
Conversation
##' - A file path to a `.csv` containing site coordinates with columns `"id"`, `"lat"`, and `"lon"`.
##' - A `data.frame` with the same structure.
##' - An `sf` object with point geometries.
##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within the file or object supplied to 'ensemble_data'.
When this function was originally written I asked that this be fixed to just take in an actual Date. While you're fixing things, could you fix that too?
@mdietze I'm not sure I fully understand, or how I can make sure not to break existing code.
What I have in mind is to
- require the date parameter to be a Date object and update the docs accordingly, and
- if it is not a Date object, try to convert from the two yyyymmdd options (this already happens).
Is that what you had in mind?
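A minimal sketch of that plan, assuming lubridate is available. The helper name `normalize_date` is hypothetical, not part of the package:

```r
# Hypothetical helper illustrating the proposed behavior:
# accept a Date as-is, otherwise try to parse the two known string formats.
normalize_date <- function(date) {
  if (inherits(date, "Date")) {
    return(date)
  }
  parsed <- lubridate::ymd(date, quiet = TRUE) # handles yyyy/mm/dd and yyyy-mm-dd
  if (is.na(parsed)) {
    stop("`date` must be a Date or a yyyy/mm/dd or yyyy-mm-dd string")
  }
  parsed
}
```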
keep.forest = TRUE,
importance = TRUE)

data = train_data,
revert indentation change
keep.forest = TRUE,
importance = TRUE
)
Another suggestion I've made in the past: there may be logical/computational advantages to separating the function that builds the downscaling models from a separate function that makes the predictions to a map. Right now one can't just fit the model and evaluate the results without also running the (more computationally costly) mapping.
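A rough shape of that split, so the fit can be evaluated without paying for the mapping step. Both function names and signatures are hypothetical sketches, not the package API:

```r
# Step 1: fit the downscaling model(s) and return them with held-out
# evaluation data, so results can be inspected cheaply.
fit_downscale_model <- function(ensemble_data, site_covariates) {
  # ... model fitting and train/test split would go here ...
  list(models = NULL, test_data = NULL)
}

# Step 2 (optional, computationally costly): apply the fitted models
# across the full covariate stack or vector layer to produce maps.
predict_downscale_map <- function(fitted, covariates) {
  # ... prediction over the covariate layers would go here ...
}
```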
Deprecating date arg to SDA_downscale because it is never used; add check for sf object with related checks
infotroph
left a comment
A high-level question that you likely considered already:
Accepting both files and in-memory data seems like it adds a lot of complexity to functions that were nice and clean before that, and I naively expect it would be both "easy" and desirable for error recovery to instead write data out to files before running this. What tradeoff factors make this the better approach?
|
Per @infotroph's last comment, I'm inclined to push in the other direction. For SDA_downscale, I'm fairly confident that I asked the original authors of these functions to stop loading the data within the function, as you have greater flexibility if you load the constraint data first and then call the function. This way you can easily control what covariate variables are part of any particular downscaling, and what domain you're doing the downscaling for, by changing what stack of variables you're passing in. In other words, for that function you could probably drop the option to pass in file paths.
@mdietze's point makes sense and yes, load-first seems sensible. I was trying to question the need to support both rather than pushing for write-first specifically.
Co-authored-by: Chris Black <[email protected]>
I was maintaining the 'read from file' functionality to remain backwards-compatible. If these functions don't need to be backward compatible, they will be much easier to refactor.
Right now, this code has a small enough user community that backwards compatibility isn't an issue. That will change once we start publishing using this workflow and need outward-facing reproducibility.
… but doesn't use raster covariates
… ensemble_downscaling
… ensemble_downscaling
- Refactored and created new version of `SDA_downscale` named `ensemble_downscale`.
- Added helper function `.convert_coords_to_sf()` for consistent conversion of lat/lon data to `sf` points.
Nits:
- refactors seem like they belong in "changed" not "fixed"
- internal helper functions don't need to be mentioned in the changelog (and I'd argue it's confusing to do so, since the point is for users not to know or care about them)
Also: need to move this up to "Unreleased" (this section is now for already-tagged 1.9.0).
#' @return A list the same dimension as X, with each column of each dataframe
#' modified by replacing outlier points with the column median
#' @export
#' @export outlier.detector.boxplot
I bet this is trying to prevent this name being treated as an S3 method, but I recommend renaming the function instead. Life's too short to fight with R about this.
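For illustration, the rename might look like this. The snake_case name is a suggestion, not something decided in this thread:

```r
#' @return A list the same dimension as X, with each column of each dataframe
#'   modified by replacing outlier points with the column median
#' @export
outlier_detector_boxplot <- function(X) {
  # body unchanged; with no dots in the name, roxygen/R no longer
  # risk treating it as an S3 method, so a single @export suffices
}
```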
if (!inherits(date, "Date")) {
  standard_date <- lubridate::ymd(date)
}
ymd accepts Date objects and returns them unchanged -- can skip the if and the rename
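A quick check of that claim (assuming lubridate is installed):

```r
d <- as.Date("2023-05-01")
identical(lubridate::ymd(d), d)   # Date input passes through unchanged
lubridate::ymd("2023/05/01") == d # string input parses to the same Date
# so the guard can collapse to a single unconditional line:
# date <- lubridate::ymd(date)
```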
# see https://github.com/PecanProject/pecan/pull/3431/files#r1953601359
# if (nrow(site_coordinates) > nrow(carbon_data)) {
#   PEcAn.logger::logger.info("Truncating site coordinates to match carbon data rows.")
#   site_coordinates <- site_coordinates[seq_len(nrow(carbon_data)), ]
# } else {
#   PEcAn.logger::logger.info("Truncating carbon data to match site coordinates rows.")
#   carbon_data <- carbon_data[seq_len(nrow(site_coordinates)), ]
# }
The link is taking me to a whole-PR view so I can't tell what specific comment/commit it's referring to, but the commented lines seem safe to delete -- I agree that unequal row counts should make us stop with an error rather than truncate and continue.
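A sketch of the stricter behavior, reusing the PEcAn logger already in scope (assuming `logger.severe`, which stops with an error, is the intended severity):

```r
# Stop instead of silently truncating when inputs disagree (sketch):
if (nrow(site_coordinates) != nrow(carbon_data)) {
  PEcAn.logger::logger.severe(
    "Row mismatch: ", nrow(site_coordinates), " site coordinates vs. ",
    nrow(carbon_data), " carbon data rows; inputs must align."
  )
}
```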
##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'preprocessed' from the 'data_path'.
##' @param carbon_pool Character. Carbon pool of interest. Name must match carbon pool name found within file supplied to 'preprocessed' from the 'data_path'.
##' @param covariates SpatRaster stack. Used as predictors in downscaling. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder.
##' @param date *Deprecated*. This argument has never been used and will be removed after 2026-04-01
I vote drop it now, mention in NEWS, and let users figure it out
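If the argument were instead kept through a deprecation window, a minimal guard might look like this. The surrounding signature is illustrative only, not the real one:

```r
# Illustrative soft-deprecation guard (sketch, not the actual signature):
SDA_downscale <- function(preprocessed, carbon_pool, covariates, date = NULL) {
  if (!is.null(date)) {
    warning("`date` is deprecated and ignored; it will be removed after 2026-04-01")
  }
  # ... rest of the function unchanged ...
}
```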
site_id %in% unique(site_coords$id),
variable == carbon_pool
) |>
select(site_id, ensemble, prediction) # use site_id instead of site
Suggested change:
- select(site_id, ensemble, prediction) # use site_id instead of site
+ dplyr::select("site_id", "ensemble", "prediction") # use site_id instead of site
not sure what the comment means -- is it needed?
ensembles <- unique(ensemble_data$ensemble)

results <- furrr::future_map(seq_along(ensembles), function(i) {
I strongly encourage pulling this anonymous function out to its own definition and giving it an informative name
Also, it appears i is only used to subset ensembles -- is there a reason not to pass each one directly (i.e. map(ensembles, function(ens) ...)?
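Concretely, the index-free version passes each ensemble value straight to a named function. `downscale_one_ensemble` is a hypothetical name for the extracted lambda, and the body here is a stand-in:

```r
# Hypothetical named replacement for the anonymous function:
downscale_one_ensemble <- function(ens) {
  # subset ensemble_data by value (ens) rather than by position i
  paste("downscaling ensemble", ens)
}

ensembles <- c("ens01", "ens02")
# sequential sketch; the parallel version would be
# furrr::future_map(ensembles, downscale_one_ensemble)
results <- lapply(ensembles, downscale_one_ensemble)
results[[1]] # "downscaling ensemble ens01"
```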
##' @export
downscale_metrics <- function(downscale_output) {

test_data_list <- lapply(downscale_output$test_data, function(x) dplyr::pull(x, prediction))
curious why the extra lambda -- does this differ from lapply(downscale_output$test_data, dplyr::pull, prediction), or indeed from lapply(downscale_output$test_data, `[[`, "prediction")?
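A small self-contained check of the base-R form the comment mentions; for a plain column, `dplyr::pull(x, "prediction")` would return the same unnamed vector (left out here to keep the example base-only):

```r
test_data <- list(
  a = data.frame(prediction = 1:3, other = 4:6),
  b = data.frame(prediction = 7:9, other = 1:3)
)
# the anonymous-function form...
v1 <- lapply(test_data, function(x) x[["prediction"]])
# ...matches passing the extractor directly:
v2 <- lapply(test_data, `[[`, "prediction")
identical(v1, v2) # TRUE
```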
withr::with_tempdir({
  temp_ensemble_data_rds <- "ensemble_data.rds"
  temp_coords_csv <- "final_design_points.csv"
  file.remove(temp_ensemble_data_rds, temp_coords_csv)
Why do you need to start by removing these if the whole test happens in a fresh tempdir?
temp_coords_csv <- "final_design_points.csv"
file.remove(temp_ensemble_data_rds, temp_coords_csv)

set.seed(123)
Beware that the effects of this set.seed persist outside this test block! I think the easiest way to contain it would be to wrap all the lines that use random numbers in one big withr::with_seed(123, {...code that uses randomness here...}).
(I think supposedly you could also replace this line with withr::local_seed(123) to have the seed reset at the end of the current execution context, but in my local tests that still affected code outside the function. Dunno if I'm misunderstanding "context" or I did something else wrong)
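A small illustration of the containment with_seed provides (assuming withr is installed; variable names are illustrative):

```r
set.seed(1)
outside_before <- runif(1)          # unrelated randomness, global stream
inside <- withr::with_seed(123, rnorm(5)) # deterministic under seed 123
outside_after <- runif(1)
# withr restores the RNG state on exit, so outside_after continues the
# original seed-1 stream as if the seed-123 block had never run.
```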
Description
Motivation and Context
Existing downscaling code required rasters. In the CCMMF project we are using vector data, so this PR generalizes the downscaling code to support vector file types as well.
It also allows ensemble data to be passed as either an RDS file or an R list of dataframes.
Review Time Estimate
Types of changes
Checklist: