-
Notifications
You must be signed in to change notification settings - Fork 273
[WIP] gSSURGO Enhancements: SOC Integration, enhanced XML Parsing, and Ensemble improvements #3534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] gSSURGO Enhancements: SOC Integration, enhanced XML Parsing, and Ensemble improvements #3534
Conversation
…mble improvements
|
@divine7022 Thank you for these improvements!!! could you please:
|
| soc_mean <- mean(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE) | ||
| soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what conditions would the data have NAs? I generally wouldn't expect them in gSSURGO, and if there are known exceptions it would be better to handle these separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And are there going to be cases where we'd want to keep the texture data for a layer with missing SOC ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a useful suggestion, though I think modeling soc = f(texture) to impute missing values is outside the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If om_r is missing then soc_percent becomes NA -- then soil_organic_carbon_stock becomes NA.
The dplyr::filter(stats::complete.cases(.)) filter removes this entire record. So if SOC is missing, the texture data for that same horizon is also removed.
However debugging and testing across multiple geographic regions i couldn't find any sites having incomplete records in gSSURGOS, but Nevada desert location Nevada (36.0, -115.5) among 37 total records i found 27 texture records and 27 OM records -- so this reflected me when texture is missing OM is missing too.. ( just based on this only site that i found with missing ) .
Anyways function filters records with only complete case(dplyr::filter(stats::complete.cases(.))) we wouldn't get case where texture is missing and soc present.
but wrote na.rm here just as a standard practice
soc_sd <- stats::sd(DepthL.Data$soil_organic_carbon_stock, na.rm = TRUE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'm not suggesting we impute missing values. I'm saying complete.cases() is a reasonable QC filter when only retrieving texture, but it may not be the right call when some users might want every SOC number they can get even in cases where texture is NA, while others might want all the textures even if SOC is missing, and so on.
|
Alternatively we also had an option : for more spatially accurate and high-throughput workflows, we can extract
I'm currently using a grid based sampling approach to retrieve
|
| expect_false(is.null(res)) | ||
|
|
||
| expect_type(res, "list") | ||
| expect_gte(length(res), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why >=1 rather than exactly 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures that we have at least one soil ensemble in case the modeling part failed
all.soil.ens <-c(all.soil.ens, list(soil.data.gssurgo))
failed ensemble generations are removed, potentially reducing the count. The function doesn't guarantee exactly size ensembles due to its area-weighted sampling
sizein <- mukey_area$Area[mukey_area$mukey == unique(soiltype.sim$mukey)] * size
1:ceiling(sizein)
generates ensembles based on area proportion, ceiling(sizein) can create more ensembles than requested under good coverage condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function doesn't guarantee exactly size ensembles
I had not realized this before and it seems very undesirable! Was that the existing behavior or is it newly added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an existing behavior. And works as i have explained above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works as
suppose size=3 and two type covering 70% and 30% of the area respectively:
1st : ceiling(0.7 × 3) = 3 ens
2nd : ceiling(0.3 × 3) = 1 ens
Total: 4 + 1(reported values) = 5 ens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing the math makes me more rather than less sure this behavior is undesirable. Note especially that this will give at least 1 ensemble member for every map unit, even if n map units >> size and some of them are very rare.
I won't make you change the sizein calculation since it was existing code and less of an issue at typical ensemble sizes (dozens to hundreds, so the rounding doesn't change totals as much). But more relevant to the line I'm commenting on, I still recommend testing for an exact number instead of a greater-than:
- For unit testing purposes where we control the inputs, we should be able to predict and test against the number of ensemble members we'll get from this specific call, even if it might differ in the general case.
- A behavior that is surprising to the user is one that is more rather than less important to test carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated test to expect the exact size number. 👍
…ore area weighting
…nt aggregation and added comments
requested changes addressed
dlebauer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, we made it!
6eddf5d
Description
This PR significantly improves the
extract_soil_gssurgofunction for generating soil property ensembles from gSSURGO data. It introduces robust XML parsing, support for soil organic carbon (SOC) calculation, improved uncertainty handling, dynamic depth handling, and fixes for ensemble generation and NetCDF output.Key Enhancements:
chorizon.om_r(organic matter) andchorizon.dbthirdbar_r(bulk density).(SOC = OM / 1.724).SOCensembles via gamma distribution for positive-only values.depthsvector if observed soil exceeds user-specified layers.pmax/pminto ensure valid depth indices, avoidingsubscript out of boundssizeinHandling: Coerces ensemble size to≥1to preventseq_len()errors.min(x, nrow(.x))to avoid invalid row access.SOCby removing [1:4] subsetting.extract_soil_gssurgonow supports spatial sampling using a user-defined grid (configurable size and spacing), enabling quantification of within-site spatial variability and improved representation of soil heterogeneity in extracted properties.Motivation and Context
Review Time Estimate
Types of changes
Checklist: