Conversation

@dlebauer dlebauer commented Feb 4, 2025

Description

  • Refactored SDA functions to accept direct R objects (lists, data.frames, and sf objects) in addition to file paths.
  • Added a helper function (.convert_coords_to_sf()) to standardize coordinate conversion.
  • Updated man pages to reflect changes in input parameters.
  • Implemented new tests to verify the updated behavior.

Motivation and Context

Existing downscaling code required rasters. In the CCMMF project we are using vector data, so this PR generalizes the downscaling code to support vector file types as well.

It also allows ensemble data to be passed as either an RDS file or an R list of dataframes.

Review Time Estimate

  • Without unnecessary delay

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

##' - A file path to a `.csv` containing site coordinates with columns `"id"`, `"lat"`, and `"lon"`.
##' - A `data.frame` with the same structure.
##' - An `sf` object with point geometries.
##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within the file or object supplied to 'ensemble_data'.
@mdietze
When this function was originally written I asked that this be fixed to just take in an actual Date. While you're fixing things could you fix that too.

@dlebauer

@mdietze I'm not sure I fully understand, or how I can make sure not to break existing code.

What I have in mind is to

  • require the `date` parameter to be a Date object and update the docs accordingly, and
  • if it is not a Date object, try to convert from the two formats (yyyy/mm/dd or yyyy-mm-dd)? (this already happens)

is that what you had in mind?

keep.forest = TRUE,
importance = TRUE)

data = train_data,
revert indentation change

keep.forest = TRUE,
importance = TRUE
)


Another suggestion I've made in the past: there may be logical/computational advantages to separating the function that builds the downscaling models from the function that makes the predictions to a map. Right now one can't just fit the model and evaluate the results without also running the (more computationally costly) mapping step.

Deprecating date arg to SDA_downscale because it is never used
put check for sf object with related checks

@infotroph infotroph left a comment


A high-level question that you likely considered already:

Accepting both files and in-memory data seems like it adds a lot of complexity to functions that were nice and clean before that, and I naively expect it would be both "easy" and desirable for error recovery to instead write data out to files before running this. What tradeoff factors make this the better approach?

@mdietze

mdietze commented Feb 6, 2025

Per @infotroph last comment, I'm inclined to push in the other direction. For SDA_downscale, I'm fairly confident that I asked the original authors of these functions to stop loading the data within the function, as you have greater flexibility if you load the constraint data first then call the function. This way you can easily control what covariate variables are part of any particular downscaling, and what domain you're doing the downscaling for, by changing what stack of variables you're passing in. In other words, for that function you could probably drop the option to pass in file paths.

@infotroph

@mdietze 's point makes sense and yes, load-first seems sensible. I was trying to question the need to support both rather than pushing for write-first specifically.

@dlebauer dlebauer added the ccmmf label (issues and PRs related to the ccmmf project) Feb 12, 2025
@dlebauer

> @mdietze 's point makes sense and yes, load-first seems sensible. I was trying to question the need to support both rather than pushing for write-first specifically.

I was maintaining the 'read from file' functionality to remain backwards-compatible. If these functions don't need to be backward compatible, they will be much easier to refactor.

@mdietze

mdietze commented Feb 19, 2025

Right now, this code has a small enough user community that backwards compatibility isn't an issue. That will change once we start publishing using this workflow and need outward-facing reproducibility.

Comment on lines +19 to +20
- Refactored and created new version of `SDA_downscale` named `ensemble_downscale`.
- Added helper function `.convert_coords_to_sf()` for consistent conversion of lat/lon data to `sf` points.

Nits:

  • refactors seem like they belong in "changed" not "fixed"
  • internal helper functions don't need to be mentioned in the changelog (and I'd argue it's confusing to do so, since the point is for users to not know or care about them)


Also: needs to move up to "Unreleased" (this section is now for the already-tagged 1.9.0)

#' @return A list the same dimension as X, with each column of each dataframe
#' modified by replacing outlier points with the column median
#' @export
#' @export outlier.detector.boxplot

I bet this is trying to prevent this name being treated as an S3 method, but I recommend renaming the function instead. Life's too short to fight with R about this.

Comment on lines +52 to +54
if (!inherits(date, "Date")) {
standard_date <- lubridate::ymd(date)
}

ymd accepts Date objects and returns them unchanged -- can skip the if and the rename
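A minimal sketch of the simplification the reviewer describes, wrapped in a hypothetical `standardize_date()` helper (not a function in the PR) to make it self-contained:

```r
library(lubridate)

# Per the reviewer's note, ymd() parses character dates and passes
# Date objects through unchanged, so no inherits() guard or
# standard_date rename is needed.
standardize_date <- function(date) {
  lubridate::ymd(date)
}

standardize_date("2024-01-02")           # Date: "2024-01-02"
standardize_date(as.Date("2024-01-02"))  # same Date, unchanged
```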

Comment on lines +83 to +90
# see https://github.com/PecanProject/pecan/pull/3431/files#r1953601359
# if (nrow(site_coordinates) > nrow(carbon_data)) {
# PEcAn.logger::logger.info("Truncating site coordinates to match carbon data rows.")
# site_coordinates <- site_coordinates[seq_len(nrow(carbon_data)), ]
# } else {
# PEcAn.logger::logger.info("Truncating carbon data to match site coordinates rows.")
# carbon_data <- carbon_data[seq_len(nrow(site_coordinates)), ]
# }

The link is taking me to a whole-PR view so I can't tell what specific comment/commit it's referring to, but the commented lines seem safe to delete -- I agree that unequal row counts should make us stop with an error rather than truncate and continue.

##' @param date Date. If SDA site run, format is yyyy/mm/dd; if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'preprocessed' from the 'data_path'.
##' @param carbon_pool Character. Carbon pool of interest. Name must match carbon pool name found within file supplied to 'preprocessed' from the 'data_path'.
##' @param covariates SpatRaster stack. Used as predictors in downscaling. Layers within stack should be named. Recommended that this stack be generated using 'covariates' instructions in assim.sequential/inst folder
##' @param date *Deprecated*. This argument has never been used and will be removed after 2026-04-01

I vote drop it now, mention in NEWS, and let users figure it out

site_id %in% unique(site_coords$id),
variable == carbon_pool
) |>
select(site_id, ensemble, prediction) # use site_id instead of site

Suggested change
select(site_id, ensemble, prediction) # use site_id instead of site
dplyr::select("site_id", "ensemble", "prediction") # use site_id instead of site

not sure what the comment means -- is it needed?


ensembles <- unique(ensemble_data$ensemble)

results <- furrr::future_map(seq_along(ensembles), function(i) {

I strongly encourage pulling this anonymous function out to its own definition and giving it an informative name


Also, it appears `i` is only used to subset `ensembles` -- is there a reason not to pass each one directly (i.e. `map(ensembles, function(ens) ...)`)?
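A sketch of the map-over-values form, using a toy stand-in for the PR's `ensemble_data` (the real function body would fit and predict per ensemble member):

```r
library(furrr)

# Toy stand-in for the PR's ensemble_data
ensemble_data <- data.frame(
  ensemble   = rep(1:2, each = 3),
  prediction = rnorm(6)
)
ensembles <- unique(ensemble_data$ensemble)

# Pass each ensemble value directly instead of an index i
results <- furrr::future_map(ensembles, function(ens) {
  ensemble_data[ensemble_data$ensemble == ens, ]
})
```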

##' @export
downscale_metrics <- function(downscale_output) {

test_data_list <- lapply(downscale_output$test_data, function(x) dplyr::pull(x, prediction))

curious why the extra lambda -- does this differ from lapply(downscale_output$test_data, dplyr::pull, prediction), or indeed from lapply(downscale_output$test_data, `[[`, "prediction")?
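For plain data frames the variants are interchangeable; a quick base-R check on toy data (using `$` in place of `dplyr::pull` to avoid the dependency):

```r
test_data <- list(
  data.frame(prediction = 1:3),
  data.frame(prediction = 4:6)
)

# Extra lambda vs. passing the extractor directly
a <- lapply(test_data, function(x) x$prediction)
b <- lapply(test_data, `[[`, "prediction")
identical(a, b)  # TRUE
```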

withr::with_tempdir({
temp_ensemble_data_rds <- "ensemble_data.rds"
temp_coords_csv <- "final_design_points.csv"
file.remove(temp_ensemble_data_rds, temp_coords_csv)

Why do you need to start by removing these if the whole test happens in a fresh tempdir?

temp_coords_csv <- "final_design_points.csv"
file.remove(temp_ensemble_data_rds, temp_coords_csv)

set.seed(123)

Beware that the effects of this set.seed persist outside this test block! I think the easiest way to contain it would be to wrap all the lines that use random numbers in one big withr::with_seed(123, {...code that uses randomness here...}).

(I think supposedly you could also replace this line with withr::local_seed(123) to have the seed reset at the end of the current execution context, but in my local tests that still affected code outside the function. Dunno if I'm misunderstanding "context" or I did something else wrong)
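A small illustration of the `withr::with_seed` wrapping suggested above: the RNG state is saved and restored around each call, so seeded draws are reproducible without leaking the seed into later code.

```r
library(withr)

# Two independently wrapped calls with the same seed produce the
# same draws, and the global RNG state is restored after each call.
x <- withr::with_seed(123, runif(3))
y <- withr::with_seed(123, runif(3))
identical(x, y)  # TRUE
```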
