Skip to content

Conversation

@divine7022
Copy link
Collaborator

@divine7022 divine7022 commented Jun 5, 2025

Description

This PR significantly improves the extract_soil_gssurgo function for generating soil property ensembles from gSSURGO data. It introduces robust XML parsing, support for soil organic carbon (SOC) calculation, improved uncertainty handling, dynamic depth handling, and fixes for ensemble generation and NetCDF output.

Key Enhancements:

  • Added Fields: Queries chorizon.om_r (organic matter) and chorizon.dbthirdbar_r (bulk density).
  • SOC Calculation:
    • Uses the Van Bemmelen factor (SOC = OM / 1.724).
    • Computes SOC stock:
    horizon_thickness_m * (soc_percent / 100) * bulk_density * 10  
  • Simulates SOC ensembles via gamma distribution for positive-only values.
  • Avoids memory leaks and malformed XML errors by Temp File handling.
  • Extracts mukeys from raw XML text if standard parsing fails.
  • Automatically extends depths vector if observed soil exceeds user-specified layers.
  • Uses pmax/pmin to ensure valid depth indices, avoiding subscript out of bounds
  • Valid sizein Handling: Coerces ensemble size to ≥1 to prevent seq_len() errors.
  • Proper mukey_area Initialization: Creates from simulated data instead of filtering undefined variables.
  • Uses min(x, nrow(.x)) to avoid invalid row access.
  • Includes SOC by removing [1:4] subsetting.
  • extract_soil_gssurgo now supports spatial sampling using a user-defined grid (configurable size and spacing), enabling quantification of within-site spatial variability and improved representation of soil heterogeneity in extracted properties.
  • Soil property extraction and aggregation now employ area-weighted means across all map unit components and sampled grid cells, ensuring that ensemble statistics reflect the true landscape composition and spatial structure.

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@dlebauer
Copy link
Member

dlebauer commented Jun 16, 2025

@divine7022 Thank you for these improvements!!!

could you please:

  • resolve conflicts in extract_soil_nc.R?
  • Add your name to the function authors and CITATION.cff if you haven't already done so
  • Update Changelog
  • Add tests for new functionality? These don't have to comprehensively cover cornder cases, mostly aimed at identifying if any essential features are broken in the future.

Comment on lines +154 to +155
soc_mean <- mean(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)
soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what conditions would the data have NAs? I generally wouldn't expect them in gSSURGO, and if there are known exceptions it would be better to handle these separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And are there going to be cases where we'd want to keep the texture data for a layer with missing SOC ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a useful suggestion, though I think modeling soc = f(texture) to impute missing values is outside the scope of this PR.

Copy link
Collaborator Author

@divine7022 divine7022 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If om_r is missing then soc_percent becomes NA -- then soil_organic_carbon_stock becomes NA.

The dplyr::filter(stats::complete.cases(.)) filter removes this entire record. So if SOC is missing, the texture data for that same horizon is also removed.

However debugging and testing across multiple geographic regions i couldn't find any sites having incomplete records in gSSURGOS, but Nevada desert location Nevada (36.0, -115.5) among 37 total records i found 27 texture records and 27 OM records -- so this reflected me when texture is missing OM is missing too.. ( just based on this only site that i found with missing ) .

Anyways function filters records with only complete case(dplyr::filter(stats::complete.cases(.))​) we wouldn't get case where texture is missing and soc present.
but wrote na.rm here just as a standard practice
soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm not suggesting we impute missing values. I'm saying complete.cases() is a reasonable QC filter when only retrieving texture, but it may not be the right call when some users might want every SOC number they can get even in cases where texture is NA, while others might want all the textures even if SOC is missing, and so on.

@github-actions github-actions bot added the Tests label Jul 12, 2025
@divine7022
Copy link
Collaborator Author

divine7022 commented Jul 12, 2025

Alternatively we also had an option : for more spatially accurate and high-throughput workflows, we can extract MUKEYs and soil properties directly from a local gSSURGO raster (.tif) file (e.g, MapunitRaster_10m.tif):

  • this ensures perfect alignment with the official gSSURGO raster grid provided by NRCS.
  • it is computationally efficient for large areas or bulk processing.
  • however, it requires downloading and storing raster files locally, which can significantly increase disk usage.

I'm currently using a grid based sampling approach to retrieve MUKEYs and associated soil properties:

  • a grid is generated, centered on the target lat/lon.
  • for each grid cell center, lat/lon coordinates are calculated.
  • and then retrieve the corresponding MUKEYs for each location using WFS (Web Feature Service) server requests.
  • this method enables spatially flexible, sampling and is suitable when local raster data( if we use it in future) are not available or when disk space is limited

expect_false(is.null(res))

expect_type(res, "list")
expect_gte(length(res), 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why >=1 rather than exactly 3?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that we have at least one soil ensemble in case the modeling part failed
all.soil.ens <-c(all.soil.ens, list(soil.data.gssurgo))
failed ensemble generations are removed, potentially reducing the count. The function doesn't guarantee exactly size ensembles due to its area-weighted sampling

sizein <- mukey_area$Area[mukey_area$mukey == unique(soiltype.sim$mukey)] * size
        1:ceiling(sizein)

generates ensembles based on area proportion, ceiling(sizein) can create more ensembles than requested under good coverage condition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function doesn't guarantee exactly size ensembles

I had not realized this before and it seems very undesirable! Was that the existing behavior or is it newly added?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an existing behavior. And works as i have explained above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works as
suppose size=3 and two type covering 70% and 30% of the area respectively:

1st : ceiling(0.7 × 3) = 3 ens
2nd : ceiling(0.3 × 3) = 1 ens
Total: 4 + 1(reported values) = 5 ens

Copy link
Member

@infotroph infotroph Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the math makes me more rather than less sure this behavior is undesirable. Note especially that this will give at least 1 ensemble member for every map unit, even if n map units >> size and some of them are very rare.

I won't make you change the sizein calculation since it was existing code and less of an issue at typical ensemble sizes (dozens to hundreds, so the rounding doesn't change totals as much). But more relevant to the line I'm commenting on, I still recommend testing for an exact number instead of a greater-than:

  1. For unit testing purposes where we control the inputs, we should be able to predict and test against the number of ensemble members we'll get from this specific call, even if it might differ in the general case.
  2. A behavior that is surprising to the user is one that is more rather than less important to test carefully.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated test to expect the exact size number. 👍

@dlebauer dlebauer enabled auto-merge August 30, 2025 01:35
@dlebauer dlebauer dismissed stale reviews from infotroph and themself August 30, 2025 01:36

requested changes addressed

Copy link
Member

@dlebauer dlebauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, we made it!

@dlebauer dlebauer added this pull request to the merge queue Sep 2, 2025
Merged via the queue into PecanProject:develop with commit 6eddf5d Sep 2, 2025
18 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants