[WIP] gSSURGO Enhancements: SOC Integration, enhanced XML Parsing, and Ensemble improvements #3534

divine7022 · 2025-06-05T11:16:11Z

Description

This PR significantly improves the extract_soil_gssurgo function for generating soil property ensembles from gSSURGO data. It introduces robust XML parsing, support for soil organic carbon (SOC) calculation, improved uncertainty handling, dynamic depth handling, and fixes for ensemble generation and NetCDF output.

Key Enhancements:

Added Fields: Queries chorizon.om_r (organic matter) and chorizon.dbthirdbar_r (bulk density).
SOC Calculation:
- Uses the Van Bemmelen factor (SOC = OM / 1.724).
- Computes SOC stock:
```
horizon_thickness_m * (soc_percent / 100) * bulk_density * 10  
```
Simulates SOC ensembles from a gamma distribution, which yields only positive values.
Handles temp files explicitly to avoid memory leaks and malformed-XML errors.
Extracts mukeys from the raw XML text if standard parsing fails.
Automatically extends the depths vector if the observed soil profile is deeper than the user-specified layers.
Uses pmax/pmin to keep depth indices valid, avoiding "subscript out of bounds" errors.
Valid sizein Handling: Coerces ensemble size to ≥1 to prevent seq_len() errors.
Proper mukey_area Initialization: Creates mukey_area from the simulated data instead of filtering undefined variables.
Uses min(x, nrow(.x)) to avoid invalid row access.
Includes SOC in the output by removing the [1:4] subsetting.
extract_soil_gssurgo now supports spatial sampling using a user-defined grid (configurable size and spacing), enabling quantification of within-site spatial variability and improved representation of soil heterogeneity in extracted properties.
Soil property extraction and aggregation now employ area-weighted means across all map unit components and sampled grid cells, ensuring that ensemble statistics reflect the true landscape composition and spatial structure.
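
As a rough illustration of the SOC steps listed above, here is a minimal sketch; it is not the code from this PR. The helper names are hypothetical, and the gamma parameters are obtained by moment matching, which is an assumption about how the mean and standard deviation might be used. The description does not state units: the sketch assumes horizon thickness in cm and bulk density in g/cm³, under which the factor 10 yields kg C/m². If the thickness is in metres, the factor would have to be 1000 for the same output unit.

```r
# Hypothetical helpers; only the two formulas come from the PR description.
om_to_soc_percent <- function(om_percent) {
  # Van Bemmelen factor: SOC = OM / 1.724
  om_percent / 1.724
}

soc_stock <- function(thickness_cm, soc_percent, bulk_density_g_cm3) {
  # cm * (g C / g soil) * (g soil / cm^3) = g C / cm^2, and 1 g/cm^2 = 10 kg/m^2
  thickness_cm * (soc_percent / 100) * bulk_density_g_cm3 * 10
}

simulate_soc_gamma <- function(n, soc_mean, soc_sd) {
  # With no usable spread, fall back to repeating the mean
  if (is.na(soc_sd) || soc_sd <= 0) {
    return(rep(soc_mean, n))
  }
  # Moment matching: mean = shape / rate, variance = shape / rate^2
  shape <- (soc_mean / soc_sd)^2
  rate <- soc_mean / soc_sd^2
  stats::rgamma(n, shape = shape, rate = rate)
}

set.seed(42)
soc_pct <- om_to_soc_percent(c(3.448, 1.724))        # made-up OM values
stock <- soc_stock(c(20, 30), soc_pct, c(1.3, 1.5))  # kg C m^-2 per horizon
ens <- simulate_soc_gamma(5, mean(stock), stats::sd(stock))
```

Because a gamma draw is strictly positive, no clamping of the simulated values is needed afterwards.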

Motivation and Context

Review Time Estimate

Immediately
Within one week
When possible

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation.
My name is in the list of CITATION.cff
I agree that PEcAn Project may distribute my contribution under any or all of
- the same license as the existing code,
- and/or the BSD 3-clause license.
I have updated the CHANGELOG.md.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

dlebauer · 2025-06-16T19:34:08Z

@divine7022 Thank you for these improvements!!!

could you please:

resolve conflicts in extract_soil_nc.R?
Add your name to the function authors and CITATION.cff if you haven't already done so
Update Changelog
Add tests for new functionality? These don't have to comprehensively cover corner cases; they are mostly aimed at identifying whether any essential features get broken in the future.

modules/data.land/R/extract_soil_nc.R

dlebauer · 2025-06-16T19:46:55Z

modules/data.land/R/extract_soil_nc.R

+          soc_mean <- mean(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)
+          soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)


Under what conditions would the data have NAs? I generally wouldn't expect them in gSSURGO, and if there are known exceptions it would be better to handle these separately.

And are there going to be cases where we'd want to keep the texture data for a layer with missing SOC ?

That is a useful suggestion, though I think modeling soc = f(texture) to impute missing values is outside the scope of this PR.

If om_r is missing then soc_percent becomes NA -- then soil_organic_carbon_stock becomes NA.

The dplyr::filter(stats::complete.cases(.)) filter removes this entire record. So if SOC is missing, the texture data for that same horizon is also removed.

However, while debugging and testing across multiple geographic regions, I couldn't find any sites with incomplete records in gSSURGO except one Nevada desert location (36.0, -115.5): of 37 total records, I found 27 texture records and 27 OM records. This suggests that when texture is missing, OM is missing too (based only on this one site where I found missing values).

In any case, the function keeps only complete cases (dplyr::filter(stats::complete.cases(.))), so we wouldn't get a case where texture is missing and SOC is present.
I wrote na.rm here just as standard practice:
soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)

Right, I'm not suggesting we impute missing values. I'm saying complete.cases() is a reasonable QC filter when only retrieving texture, but it may not be the right call when some users might want every SOC number they can get even in cases where texture is NA, while others might want all the textures even if SOC is missing, and so on.
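
The difference between the two filtering strategies discussed in this thread can be sketched as follows. The data frame is made up for illustration and is not taken from gSSURGO; the point is only that complete.cases() drops a whole horizon when any single column is NA, whereas a per-variable filter keeps each property wherever it is available.

```r
# Made-up horizons: one with missing texture, one with missing organic matter
horizons <- data.frame(
  hzdept_r    = c(0, 20, 50),
  sandtotal_r = c(40, NA, 55),
  om_r        = c(2.5, 1.2, NA)
)
# NA in om_r propagates to soc_percent
horizons$soc_percent <- horizons$om_r / 1.724

# All-or-nothing QC: a row survives only if every column is present
complete_only <- horizons[stats::complete.cases(horizons), ]

# Per-variable QC: keep each property wherever it was reported
texture_rows <- horizons[!is.na(horizons$sandtotal_r), ]
soc_rows <- horizons[!is.na(horizons$soc_percent), ]
```

Here complete_only keeps one horizon, while texture_rows and soc_rows each keep two.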

modules/data.land/R/extract_soil_nc.R

divine7022 · 2025-07-12T22:02:27Z

Alternatively, there is another option: for more spatially accurate, high-throughput workflows, we can extract MUKEYs and soil properties directly from a local gSSURGO raster (.tif) file (e.g., MapunitRaster_10m.tif):

This ensures perfect alignment with the official gSSURGO raster grid provided by NRCS.
It is computationally efficient for large areas or bulk processing.
However, it requires downloading and storing raster files locally, which can significantly increase disk usage.

I'm currently using a grid-based sampling approach to retrieve MUKEYs and associated soil properties:

A grid is generated, centered on the target lat/lon.
For each grid cell center, lat/lon coordinates are calculated.
The corresponding MUKEY for each location is then retrieved using WFS (Web Feature Service) requests.
This method enables spatially flexible sampling and is suitable when local raster data (if we use them in the future) are not available or when disk space is limited.
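
The grid construction described above can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: make_sample_grid is a hypothetical name, the metres-to-degrees conversion is a simple spherical approximation and not a projected CRS, and the WFS lookup of MUKEYs for each point is left out entirely.

```r
# Hypothetical helper: cell centers of a square grid centered on (lat, lon)
make_sample_grid <- function(lat, lon, grid_size = 3, grid_spacing_m = 100) {
  # Offsets in metres, symmetric about the target point
  offsets <- (seq_len(grid_size) - (grid_size + 1) / 2) * grid_spacing_m
  # Approximate metres per degree (spherical-Earth assumption)
  m_per_deg_lat <- 111320
  m_per_deg_lon <- 111320 * cos(lat * pi / 180)
  # All combinations of north-south (dy) and east-west (dx) offsets
  grid <- expand.grid(dy = offsets, dx = offsets)
  data.frame(
    lat = lat + grid$dy / m_per_deg_lat,
    lon = lon + grid$dx / m_per_deg_lon
  )
}

pts <- make_sample_grid(36.0, -115.5, grid_size = 3, grid_spacing_m = 100)
```

Each row of pts would then be one query location; with an odd grid_size the middle row is the target point itself.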

modules/data.land/R/soil2netcdf.R

modules/data.land/tests/testthat/test-extract_soil_nc.R

modules/data.land/R/extract_soil_nc.R

modules/data.land/tests/testthat/test-extract_soil_nc.R

infotroph · 2025-08-27T10:16:13Z

modules/data.land/tests/testthat/test-extract_soil_nc.R

+  expect_false(is.null(res))
+
+  expect_type(res, "list")
+  expect_gte(length(res), 1)


Why >=1 rather than exactly 3?

This ensures that we have at least one soil ensemble in case the modeling part failed:
all.soil.ens <- c(all.soil.ens, list(soil.data.gssurgo))
Failed ensemble generations are removed, potentially reducing the count. The function doesn't guarantee exactly size ensembles because of its area-weighted sampling:

sizein <- mukey_area$Area[mukey_area$mukey == unique(soiltype.sim$mukey)] * size
1:ceiling(sizein)

This generates ensembles in proportion to area, so ceiling(sizein) can create more ensembles than requested when coverage is good.

The function doesn't guarantee exactly size ensembles

I had not realized this before and it seems very undesirable! Was that the existing behavior or is it newly added?

It was existing behavior, and it works as I explained above.

For example, suppose size = 3 and two soil types cover 70% and 30% of the area respectively:

1st: ceiling(0.7 × 3) = 3 ens
2nd: ceiling(0.3 × 3) = 1 ens
Total: 3 + 1 = 4 simulated ens, + 1 (reported values) = 5 ens

Seeing the math makes me more rather than less sure this behavior is undesirable. Note especially that this will give at least 1 ensemble member for every map unit, even if n map units >> size and some of them are very rare.
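
The arithmetic in this thread can be checked with a few lines of R. The helper name ensemble_counts is hypothetical and only mirrors the ceiling(area × size) rule quoted above; the second call uses made-up area fractions to show what happens when many rare map units are present.

```r
# Hypothetical helper mirroring the per-map-unit rule ceiling(area * size)
ensemble_counts <- function(area_fraction, size) {
  ceiling(area_fraction * size)
}

# The example from the thread: 70% / 30% coverage, size = 3
counts <- ensemble_counts(c(0.7, 0.3), size = 3)
total_simulated <- sum(counts)

# Made-up case: one dominant map unit plus ten rare ones at 1% each
rare_counts <- ensemble_counts(c(0.9, rep(0.01, 10)), size = 3)
total_with_rare <- sum(rare_counts)
```

In the second case every rare map unit still contributes one member, so the total is far larger than the requested size.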

I won't make you change the sizein calculation since it was existing code and less of an issue at typical ensemble sizes (dozens to hundreds, so the rounding doesn't change totals as much). But more relevant to the line I'm commenting on, I still recommend testing for an exact number instead of a greater-than:

For unit testing purposes where we control the inputs, we should be able to predict and test against the number of ensemble members we'll get from this specific call, even if it might differ in the general case.

A behavior that is surprising to the user is one that is more rather than less important to test carefully.

updated test to expect the exact size number. 👍

modules/data.land/R/soil_utils.R

modules/data.land/R/gSSURGO_Query.R

requested changes addressed

dlebauer

Great work, we made it!

gSSURGO Enhancements: SOC Integration, enhanced XML Parsing, and Ense…

29bdbe7

…mble improvements

github-actions bot added the Modules label Jun 5, 2025

divine7022 added 4 commits June 5, 2025 11:31

Fix SQL field list construction in gSSURGO.Query to prevent malformed…

becb02f

… queries and HTTP 500 errors

fixed the bug in soil2netcdf

d50b43b

refactored code and fixed some bugs

19e9ffc

update soil.units doc with supported variables

04f3cdd

dlebauer requested review from DongchenZ, infotroph and mdietze June 16, 2025 19:34

dlebauer requested changes Jun 16, 2025

dlebauer reviewed Jun 16, 2025

modules/data.land/R/extract_soil_nc.R Outdated

modules/data.land/R/extract_soil_nc.R Outdated

infotroph reviewed Jun 16, 2025

modules/data.land/R/extract_soil_nc.R Outdated

divine7022 added 7 commits July 12, 2025 21:29

supports spatial sampling using a grid

49b4083

add unit test

533502b

update .Rd

067e1b7

update extract_soil_gssurgo.Rd

2161030

udpated NEWS.md

c7c2926

update NAMESPACE

b46bf05

update CHANGELOG.md

d945162

github-actions bot added the Tests label Jul 12, 2025

divine7022 added 3 commits July 13, 2025 00:09

Merge remote-tracking branch 'origin/develop' into feat/gssurgo-soc-s…

0f411e5

…upport

fixed conflict in soil.units function

543887e

Merge remote-tracking branch 'origin/develop' into feat/gssurgo-soc-s…

d0aad8c

…upport

infotroph requested changes Aug 13, 2025

divine7022 added 4 commits August 18, 2025 19:37

removed depends from @importFrom

37e6943

correct condition check to handle cause where we get all NA's

8cb1973

modules/data.land/R/soil2netcdf.R

7828eba

update .Rd file

a11b1d7

infotroph reviewed Aug 27, 2025

modules/data.land/tests/testthat/test-extract_soil_nc.R

infotroph reviewed Aug 27, 2025

modules/data.land/tests/testthat/test-extract_soil_nc.R Outdated

infotroph reviewed Aug 27, 2025

divine7022 added 5 commits August 27, 2025 11:38

add grid_spacing parm to soil_process

fb24651

remove unnecessary pmax from rgamma

cf833fa

calc grid spacing from radius and grid size

e6e6435

add cleare commet about not using first ens member

32896a1

use proj_crs for terra crs parameter

00d7935

dlebauer reviewed Aug 27, 2025

modules/data.land/R/soil_utils.R

divine7022 and others added 3 commits August 27, 2025 22:19

Merge remote-tracking branch 'origin/develop' into feat/gssurgo-soc-s…

ecdc7fc

…upport

Enhance documentation for gSSURGO.Query function

cf051e2

add max() to ensure grid_size >= 3

2b7dc11

infotroph reviewed Aug 28, 2025

modules/data.land/R/gSSURGO_Query.R Outdated

divine7022 added 5 commits August 28, 2025 18:42

update documentation

d29f329

update .Rd

6cd483a

now test made more explicit by testing with exact number of ens length

2f0859b

correct fragment aggregation by summing within component horizons bef…

6e057ff

…ore area weighting

replace pmin with min and slice(1) with distinct() for clearer fragme…

324e6cf

…nt aggregation and added comments

dlebauer enabled auto-merge August 30, 2025 01:35

dlebauer approved these changes Aug 30, 2025

dlebauer and others added 4 commits September 1, 2025 15:40

Merge branch 'develop' into feat/gssurgo-soc-support

6ac2992

Merge branch 'develop' into feat/gssurgo-soc-support

d0576f8

fix R CMD check warnings for global variables

025c4a5

Merge branch 'feat/gssurgo-soc-support' of github.com:divine7022/peca…

2cea7df

…n into feat/gssurgo-soc-support

dlebauer added this pull request to the merge queue Sep 2, 2025

Merged via the queue into PecanProject:develop with commit 6eddf5d Sep 2, 2025
18 of 26 checks passed

infotroph mentioned this pull request Oct 31, 2025

gSSURGO: Fix spatial sampling and improve data aggregation accuracy #3643

Open

14 tasks
