[v2.1] WeatherBench2 verification: eval CLI, scorecard plots & test suite#310
Draft
Adds full WB2-style deterministic evaluation tooling to CREDIT:
credit/verification/deterministic.py
- Latitude-weighted RMSE, bias, ACC with correct WB2 normalization
(mean over longitude first, then lat-weighted sum)
- Regional breakdowns: global / tropics / n_extratropics / s_extratropics
- load_wb2_climatology() helper for true anomaly ACC
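The normalization order noted above matters: a plain mean over a regular lat/lon grid over-weights the poles, so the squared error is first averaged over longitude, then summed with normalized cos(lat) weights. A minimal sketch of that order (illustrative only, with hypothetical names; not the CREDIT implementation):

```python
import numpy as np

def lat_weighted_rmse(forecast, obs, lats_deg):
    """Latitude-weighted RMSE in WB2 order (sketch, not CREDIT code).
    forecast, obs: arrays of shape (lat, lon); lats_deg: (lat,) in degrees."""
    sq_err = (forecast - obs) ** 2
    zonal_mean = sq_err.mean(axis=-1)        # mean over longitude first
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.sum()                          # normalized latitude weights
    return float(np.sqrt((w * zonal_mean).sum()))  # lat-weighted sum, then sqrt
```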
credit/verification/wb2_references.py
- Published baseline scores for IFS HRES, Pangu-Weather, GraphCast
(Rasp 2024, Bi 2023, Lam 2023) for reference overlays in plots
applications/eval_weatherbench.py
- --csv fast path: aggregate existing per-init CREDIT metrics CSVs
- --netcdf full path: compute RMSE/bias/ACC directly from forecast
netCDFs vs ERA5 zarr with optional climatology for true anomaly ACC
- --plot DIR: generate all WB2 scorecard figures inline
- Ensemble netCDFs supported (ensemble mean used for deterministic scores)
applications/plot_weatherbench.py
- RMSE vs lead time per variable with IFS/Pangu/GraphCast reference lines
- ACC vs lead time (true anomaly ACC when climatology provided)
- Bias vs lead time
- Scorecard heatmap: skill score vs IFS HRES (green = better)
- Regional RMSE breakdown
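The scorecard heatmap compares each (variable, lead time) cell against IFS HRES. Assuming the usual skill-score convention 1 - RMSE_model / RMSE_ref (so positive = green = model beats the reference; the exact convention used in the plot script is not shown here), a sketch:

```python
import numpy as np

def skill_score(rmse_model, rmse_ref):
    """Skill score relative to a reference model (assumed convention:
    1 - RMSE_model / RMSE_ref; positive means the model is better)."""
    return 1.0 - np.asarray(rmse_model) / np.asarray(rmse_ref)
```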
tests/test_weatherbench.py
- 38 tests covering metrics math, regional breakdown, CSV fast path,
netCDF integration, and climatology loading — all passing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##             main     #310      +/-   ##
==========================================
+ Coverage   16.13%   16.45%   +0.32%
==========================================
  Files         122      125       +3
  Lines       19604    19727     +123
  Branches     3308     3326      +18
==========================================
+ Hits         3164     3247      +83
- Misses      16148    16185     +37
- Partials      292      295      +3
- WB2_TARGET_LEVELS redesigned as a 5-tuple (era5_var, credit_var, era5_level, credit_coord, credit_level): separates ERA5 long names from CREDIT short names and handles both 'pressure' and 'level' coordinate dims
- ERA5_VAR_MAP: added upper-air identity entries (Z, T, U, V, Q) so the reverse lookup no longer returns None for all pressure-level variables (root cause of RMSE=NaN and ACC=0 in the --netcdf path)
- Regrid ERA5 to the forecast grid when lat/lon coords differ (192x288 vs 181x360)
- Fix plot_weatherbench import: add sys.path.insert so it resolves relative to the applications/ directory
- Update print_wb2_summary key_vars to match available ERA5 variables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:

1. Looped over self.acc_vars, but the anomaly tensors are indexed in ordered_acc_vars order, so variable labels were mismatched. Fix: loop over ordered_acc_vars.
2. Double-demeaning: subtracted the global spatial mean from anomalies that are already climatology-relative. Standard WMO/WB2 ACC uses weighted_corr(f - clim, obs - clim) with no further mean removal. Fix: use the anomalies directly as pred_prime / y_prime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
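For reference, a hedged sketch of the corrected ACC: the climatology-relative anomalies are used directly, with no extra spatial de-meaning (names are illustrative, not the actual CREDIT code):

```python
import numpy as np

def anomaly_acc(forecast, obs, clim, lats_deg):
    """True anomaly correlation (sketch). All fields: (lat, lon) arrays;
    lats_deg: (lat,) in degrees. Anomalies are climatology-relative and
    are NOT de-meaned again, per the WMO/WB2 convention."""
    fp = forecast - clim                      # pred_prime
    op = obs - clim                           # y_prime
    w = np.cos(np.deg2rad(lats_deg))[:, None]
    num = (w * fp * op).sum()
    den = np.sqrt((w * fp**2).sum() * (w * op**2).sum())
    return float(num / den)
```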
The filename suffix (e.g. _006.nc) is already the lead time in hours, but it was being multiplied by lead_time_hours again, so _006.nc showed as 36 h instead of 6 h.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
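The corrected parsing amounts to reading the suffix directly (an illustrative helper, not the actual code):

```python
from pathlib import Path

def lead_hours_from_name(path):
    """Parse lead time in hours straight from a suffix like _006.nc;
    no further multiplication by lead_time_hours is needed (sketch)."""
    return int(Path(path).stem.rsplit("_", 1)[-1])
```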
…steps

Each lead-time step is fully independent; run all steps concurrently with ProcessPoolExecutor(max_workers). Adds a --workers CLI arg (default os.cpu_count()). PBS script updated to 16 CPUs / 4 h walltime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
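The fan-out pattern can be sketched as follows. This is generic, not the actual CREDIT worker; the executor_cls parameter is an illustrative addition (the PR uses ProcessPoolExecutor with a --workers argument):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def score_all_steps(worker, steps, executor_cls=ProcessPoolExecutor,
                    max_workers=None):
    """Fan independent lead-time steps out across an executor and
    collect results in input order (sketch; max_workers=None lets
    ProcessPoolExecutor default to os.cpu_count())."""
    with executor_cls(max_workers=max_workers) as ex:
        return list(ex.map(worker, steps))
```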
Reshape (ensemble, time, lat, lon) to (ensemble, N) and call rankhist once instead of once per time step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
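The batched call can be sketched like this: one flatten, one ranking pass over all points, instead of a per-time-step loop. A simplified rank histogram that ignores ties, not the actual rank_histogram_apply:

```python
import numpy as np

def rank_histogram(ens, obs):
    """Rank histogram over all points at once (sketch, ties ignored).
    ens: (n_members, time, lat, lon); obs: (time, lat, lon)."""
    n = ens.shape[0]
    flat = ens.reshape(n, -1)                         # (ensemble, N)
    ranks = (flat < obs.reshape(1, -1)).sum(axis=0)   # obs rank per point
    return np.bincount(ranks, minlength=n + 1)        # counts per rank bin
```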
…gures Each variable figure (rmse, acc, regional_rmse) is independent; run all in parallel via ProcessPoolExecutor. Adds --workers CLI arg. Uses Agg backend for multiprocessing safety. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rank_histogram_apply vectorization was cherry-picked to ensemble-v2, where the file's PR lives. The weatherbench branch should not touch it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New scripts:
- applications/wxformer_wb2_verif.py: per-init ACC/RMSE in Kyle-format NetCDF
- applications/ensemble_wb2_verif.py: per-init CRPS/spread/RMSE/ACC for the 100-member CREDIT ensemble using the sorted-ensemble CRPS formula (O(n log n), memory-efficient)
- applications/hres_wb2_verif.py: HRES deterministic ACC/RMSE regridded to the CREDIT 1.25 deg grid via xesmf bilinear for direct model comparison

All three output (days, time=40) NetCDF in a standardized format with dayofyear/hour coords and per-100-init batch files matching the CREDIT-arXiv convention.

Verification:
- credit/verification/ensemble.py: add crps_spatial_avg() using the sorted-ensemble identity E[|X - X'|] = (2/n^2) sum (2i - n + 1) x_(i)

Docs:
- docs/source/WeatherBench.md: rewritten as a user-focused guide
- docs/source/PerformanceMetrics.md: new; covers all three scripts, the CRPS formula, spread-error ratio, seasonal subsetting, and the crps_spatial_avg API
- docs/source/Evaluation.md: add links to the new verification scripts
- docs/source/index.rst: add WeatherBench and PerformanceMetrics to the toctree

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
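The sorted-ensemble identity quoted above gives CRPS in O(n log n) without forming all n^2 member pairs. A hedged scalar-observation sketch, assuming the NRG convention CRPS = E|X - y| - 0.5 E|X - X'|; this is not the actual crps_spatial_avg:

```python
import numpy as np

def crps_sorted(ens, y):
    """CRPS for one observation y from ensemble values `ens` using
    E|X - X'| = (2/n^2) * sum_i (2i - n + 1) * x_(i)  (0-based i,
    x sorted ascending). Sketch only."""
    x = np.sort(np.asarray(ens, dtype=float))
    n = x.size
    mae = np.abs(x - y).mean()                           # E|X - y|
    i = np.arange(n)
    spread = (2.0 / n**2) * np.sum((2 * i - n + 1) * x)  # E|X - X'|
    return mae - 0.5 * spread
```

A quick sanity check of the identity: for ens = [0, 1] and y = 0.5, the integral definition of CRPS also gives 0.25.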
…cation

- Add TestWB2References (11 tests): model presence, variable keys, RMSE/ACC bounds, monotonicity, WB2_STYLE consistency
- Add TestLatWeightedMetrics (7 tests): init, var-list construction, call returns correct keys, perfect forecast gives zero RMSE/MSE, MAE >= 0, ACC in [-1, 1], predict-mode ensemble_size
- Add tests/test_standard.py: TestZonalSpectrum (6 tests) and TestAverageZonalSpectrum (4 tests) for credit.verification.standard, guarded by skipif when torch_harmonics is not installed
- Add `import torch` to test_weatherbench.py (missing from existing imports)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… scripts

Rename output paths/labels from wxformer_v2 to wxformer; these scripts target the v1 model. wxformer_v2 exists only in the wxf_v2 branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep by lead_time_hours

_score_step_worker was setting lead_h = step (the raw step index) instead of step * lead_time_hours. Pass lead_time_hours through the work_items tuple so the worker can compute the correct lead time in hours. Fixes the test_scores_have_lead_time_column assertion [4, 8] != [24, 48].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Adds full WeatherBench2-style deterministic verification tooling directly into CREDIT. This is a proper integration: the metrics module lives in credit/verification/ and the CLI tools are first-class applications/ scripts, not one-off analysis notebooks.

New CLI commands:
What's included

- credit/verification/deterministic.py
- credit/verification/wb2_references.py
- applications/eval_weatherbench.py
- applications/plot_weatherbench.py
- tests/test_weatherbench.py

Figures generated
Notes
- acc_* columns from legacy per-init CSVs are Pearson r; renamed to pearson_r_* to avoid confusion with true WB2 ACC
- /glade/campaign/cisl/aiml/ksha/CREDIT_data/ERA5_plevel_1deg/ covers 2020–2022

Test plan
python -m pytest tests/test_weatherbench.py: 38/38 pass

🤖 Generated with Claude Code