
[v2.1] WeatherBench2 verification: eval CLI, scorecard plots & test suite#310

Draft
jsschreck wants to merge 12 commits into main from v2.1/weatherbench

Conversation

@jsschreck
Collaborator

@jsschreck jsschreck commented Mar 27, 2026

Summary

Adds full WeatherBench2-style deterministic verification tooling directly into CREDIT. This is a proper integration — the metrics module lives in credit/verification/ and the CLI tools are first-class applications/ scripts, not one-off analysis notebooks.

New CLI commands:

# Fast path — aggregate existing per-init CREDIT metrics CSVs
python applications/eval_weatherbench.py --csv /path/to/metrics/ --out scores.csv

# Full path — compute from forecast netCDFs vs ERA5 with true anomaly ACC
python applications/eval_weatherbench.py \
  --netcdf /path/to/forecasts/ \
  --era5 /glade/campaign/cisl/aiml/ksha/CREDIT_data/ERA5_plevel_1deg/all_in_one/ERA5_plevel_1deg_6h_2022_conserve.zarr \
  --clim /glade/campaign/cisl/aiml/akn7/CREDIT_CESM/VERIF/ERA5_clim/ERA5_clim_1990_2019_6h_cesm_interp.nc \
  --out scores.csv \
  --plot figures/ \
  --label "WXFormer v2"

# Plots only (from existing scores CSV)
python applications/plot_weatherbench.py --scores scores.csv --out figures/ --label "WXFormer v2"

What's included

| File | Description |
| --- | --- |
| credit/verification/deterministic.py | Latitude-weighted RMSE, bias, true anomaly ACC with correct WB2 normalization |
| credit/verification/wb2_references.py | Published baseline scores: IFS HRES, Pangu-Weather, GraphCast |
| applications/eval_weatherbench.py | Evaluation CLI — CSV fast path + netCDF full path |
| applications/plot_weatherbench.py | Scorecard plots — RMSE, ACC, bias, heatmap, regional breakdown |
| tests/test_weatherbench.py | 38 unit + integration tests, all passing |
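For readers unfamiliar with the WB2 weighting convention, here is a minimal NumPy sketch of latitude-weighted RMSE, not the actual `credit/verification/deterministic.py` API: squared error is averaged over longitude first, then combined with normalized cos(latitude) weights.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """Minimal sketch of WB2-style latitude-weighted RMSE.

    forecast, truth: arrays of shape (lat, lon); lats_deg: 1-D latitudes.
    Squared error is averaged over longitude first, then the zonal means
    are combined with cos(latitude) weights normalized to sum to 1.
    """
    sq_err = (forecast - truth) ** 2
    zonal_mean = sq_err.mean(axis=-1)   # mean over longitude first
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.sum()                     # normalized latitude weights
    return np.sqrt((w * zonal_mean).sum())

lats = np.linspace(-90, 90, 181)
f = np.ones((181, 360))
t = np.zeros((181, 360))
print(lat_weighted_rmse(f, t, lats))  # ≈ 1.0: a uniform unit error gives unit RMSE
```

A uniform error field is a quick sanity check: any correctly normalized weighting must return the error magnitude itself.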

Figures generated

  • RMSE vs lead time per variable with IFS HRES / Pangu-Weather / GraphCast reference lines
  • ACC vs lead time — true anomaly ACC (requires climatology) not Pearson correlation
  • Bias vs lead time — global mean bias (forecast − ERA5)
  • Scorecard heatmap — skill score vs IFS HRES at days 1/2/3/5/7/10 (green = better than IFS)
  • Regional RMSE breakdown — global / tropics / N. extratropics / S. extratropics
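The scorecard heatmap convention can be sketched as follows. This is an illustrative assumption, not the plotting script itself: skill is taken as the relative RMSE improvement over the reference, so positive (green) cells mean the model beats IFS HRES. The RMSE values below are made up for the demo.

```python
import numpy as np

# Hypothetical RMSEs: rows = variables, cols = lead days (1, 3, 5, 7)
rmse_model = np.array([[380., 520., 640., 780.],    # e.g. Z500
                       [0.9,  1.3,  1.7,  2.1]])    # e.g. T850
rmse_ifs   = np.array([[400., 500., 650., 750.],
                       [1.0,  1.2,  1.8,  2.0]])

# Assumed convention: skill = 1 - RMSE_model / RMSE_ref,
# positive (green) = better than the IFS HRES reference.
skill = 1.0 - rmse_model / rmse_ifs
print(np.round(skill, 3))
```

Each cell of the printed matrix corresponds to one cell of the scorecard heatmap.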

Notes

  • Ensemble netCDFs are supported: ensemble mean is taken automatically before scoring
  • acc_* columns from legacy per-init CSVs are Pearson r, renamed to pearson_r_* to avoid confusion with true WB2 ACC
  • Reference baselines sourced from: Rasp et al. 2024 (WB2/JAMES), Bi et al. 2023 (Pangu/Nature), Lam et al. 2023 (GraphCast/Science)
  • ERA5 pressure-level data at /glade/campaign/cisl/aiml/ksha/CREDIT_data/ERA5_plevel_1deg/ covers 2020–2022

Test plan

  • python -m pytest tests/test_weatherbench.py — 38/38 pass
  • Run full 2022 eval against scheduler netCDFs on casper — verified over the weekend
  • Verify true ACC curves look sensible vs IFS HRES reference — verified over the weekend

🤖 Generated with Claude Code

Adds full WB2-style deterministic evaluation tooling to CREDIT:

credit/verification/deterministic.py
  - Latitude-weighted RMSE, bias, ACC with correct WB2 normalization
    (mean over longitude first, then lat-weighted sum)
  - Regional breakdowns: global / tropics / n_extratropics / s_extratropics
  - load_wb2_climatology() helper for true anomaly ACC

credit/verification/wb2_references.py
  - Published baseline scores for IFS HRES, Pangu-Weather, GraphCast
    (Rasp 2024, Bi 2023, Lam 2023) for reference overlays in plots

applications/eval_weatherbench.py
  - --csv fast path: aggregate existing per-init CREDIT metrics CSVs
  - --netcdf full path: compute RMSE/bias/ACC directly from forecast
    netCDFs vs ERA5 zarr with optional climatology for true anomaly ACC
  - --plot DIR: generate all WB2 scorecard figures inline
  - Ensemble netCDFs supported (ensemble mean used for deterministic scores)

applications/plot_weatherbench.py
  - RMSE vs lead time per variable with IFS/Pangu/GraphCast reference lines
  - ACC vs lead time (true anomaly ACC when climatology provided)
  - Bias vs lead time
  - Scorecard heatmap: skill score vs IFS HRES (green = better)
  - Regional RMSE breakdown

tests/test_weatherbench.py
  - 38 tests covering metrics math, regional breakdown, CSV fast path,
    netCDF integration, and climatology loading — all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 60.97561% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.45%. Comparing base (fd24d56) to head (f7b888b).

| File | Patch % | Lines missing |
| --- | --- | --- |
| credit/verification/wb2_references.py | 0.00% | 28 Missing ⚠️ |
| credit/verification/deterministic.py | 78.49% | 17 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
+ Coverage   16.13%   16.45%   +0.32%     
==========================================
  Files         122      125       +3     
  Lines       19604    19727     +123     
  Branches     3308     3326      +18     
==========================================
+ Hits         3164     3247      +83     
- Misses      16148    16185      +37     
- Partials      292      295       +3     


jsschreck and others added 11 commits March 27, 2026 16:45
- WB2_TARGET_LEVELS redesigned as 5-tuple (era5_var, credit_var, era5_level,
  credit_coord, credit_level) — separates ERA5 long names from CREDIT short
  names and handles both 'pressure' and 'level' coordinate dims
- ERA5_VAR_MAP: added upper-air identity entries (Z, T, U, V, Q) so the
  reverse lookup no longer returns None for all pressure-level variables
  (root cause of RMSE=NaN and ACC=0 in the --netcdf path)
- Regrid ERA5→forecast grid when lat/lon coords differ (192x288 vs 181x360)
- Fix plot_weatherbench import: add sys.path.insert so it resolves relative
  to the applications/ directory
- Update print_wb2_summary key_vars to match available ERA5 variables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. Loop over self.acc_vars but anomaly tensors are indexed in
   ordered_acc_vars order → variable labels were mismatched. Fix: loop
   over ordered_acc_vars.
2. Double-demeaning: subtracted global spatial mean from anomalies that
   are already climatology-relative. Standard WMO/WB2 ACC uses
   weighted_corr(f - clim, obs - clim) with no further mean removal.
   Fix: use anomalies directly as pred_prime / y_prime.
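The corrected definition can be sketched in NumPy. This is a hedged illustration of the standard WMO/WB2 formula described above, not the module's actual implementation: climatology-relative anomalies are correlated directly with cos(lat) area weights and no further mean removal.

```python
import numpy as np

def anomaly_acc(forecast, obs, clim, lats_deg):
    """Sketch of true anomaly ACC: weighted_corr(f - clim, obs - clim).

    The anomalies are already climatology-relative, so no additional
    spatial mean is subtracted (the double-demeaning bug fixed here).
    """
    w = np.cos(np.deg2rad(lats_deg))[:, None]
    w = w / (w.sum() * forecast.shape[-1])   # weights over (lat, lon), sum to 1
    fp = forecast - clim                     # pred_prime
    op = obs - clim                          # y_prime
    num = (w * fp * op).sum()
    den = np.sqrt((w * fp**2).sum() * (w * op**2).sum())
    return num / den

rng = np.random.default_rng(0)
lats = np.linspace(-90, 90, 19)
clim = rng.normal(size=(19, 36))
obs = clim + rng.normal(size=(19, 36))
print(anomaly_acc(obs, obs, clim, lats))  # perfect forecast, prints a value ≈ 1.0
```

A perfect forecast must score exactly 1; subtracting an extra spatial mean would break that only in pathological cases, but it systematically distorts ACC for real forecasts.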

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Filename suffix (e.g. _006.nc) is already the lead time in hours.
Was multiplying by lead_time_hours again → _006.nc showed as 36h
instead of 6h.
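The corrected parsing amounts to reading the suffix directly. A small sketch, assuming the hypothetical naming pattern `pred_006.nc` where `_HHH` is already the lead time in hours:

```python
from pathlib import Path

def lead_hours_from_name(path):
    """The _HHH filename suffix is already the lead time in hours,
    so parse it directly instead of multiplying by lead_time_hours."""
    return int(Path(path).stem.split("_")[-1])

print(lead_hours_from_name("forecasts/pred_006.nc"))  # → 6, not 36
```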

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…steps

Each lead-time step is fully independent; run all steps concurrently with
ProcessPoolExecutor(max_workers). Adds --workers CLI arg (default os.cpu_count()).
PBS script updated to 16 CPUs / 4h walltime.
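The parallelization pattern looks roughly like the sketch below, with a stand-in `score_step` worker in place of the real scoring function; since each lead-time step is independent, `ProcessPoolExecutor.map` preserves step order while running the steps concurrently.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def score_step(step):
    """Hypothetical per-step scoring stub: each lead-time step is
    independent, so steps can be scored in separate processes."""
    return step, step * 6  # (step index, lead hours at a 6 h cadence)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(score_step, range(4)))
    print(results)
```

`pool.map` returns results in submission order, so downstream aggregation by lead time needs no re-sorting.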

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reshape (ensemble, time, lat, lon) → (ensemble, N) and call rankhist
once instead of once per time step.
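The reshape trick can be sketched as follows; the rank computation here is an illustrative stand-in, not the actual `rankhist` API:

```python
import numpy as np

rng = np.random.default_rng(1)
ens = rng.normal(size=(10, 4, 6, 8))   # (ensemble, time, lat, lon)
obs = rng.normal(size=(4, 6, 8))

# Flatten all non-ensemble dims so the rank computation runs once.
ens2d = ens.reshape(ens.shape[0], -1)  # (ensemble, N)
obs1d = obs.reshape(-1)                # (N,)

# Illustrative rank of the observation within each ensemble column.
ranks = (ens2d < obs1d).sum(axis=0)    # values in 0..n_ensemble
hist = np.bincount(ranks, minlength=ens.shape[0] + 1)
print(hist.sum())  # → 192 (= 4*6*8 points, one rank each)
```

One vectorized comparison over the flattened axis replaces the per-time-step loop.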

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gures

Each variable figure (rmse, acc, regional_rmse) is independent; run all
in parallel via ProcessPoolExecutor. Adds --workers CLI arg. Uses Agg
backend for multiprocessing safety.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rank_histogram_apply vectorization was cherry-picked to ensemble-v2
where the file's PR lives. Weatherbench branch should not touch it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New scripts:
- applications/wxformer_wb2_verif.py: per-init ACC/RMSE in Kyle-format NetCDF
- applications/ensemble_wb2_verif.py: per-init CRPS/spread/RMSE/ACC for 100-member
  CREDIT ensemble using sorted-ensemble CRPS formula (O(n log n), memory-efficient)
- applications/hres_wb2_verif.py: HRES deterministic ACC/RMSE regridded to CREDIT
  1.25 deg grid via xesmf bilinear for direct model comparison

All three scripts output (days, time=40) NetCDF files in a standardized format, with dayofyear/hour coords and per-100-init batch files matching the CREDIT-arXiv convention.

Verification:
- credit/verification/ensemble.py: add crps_spatial_avg() using sorted-ensemble
  identity E[|X-X'|] = (2/n^2) sum (2i-n+1) x_(i)
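The sorted-ensemble identity above can be verified numerically. A minimal sketch (not the `crps_spatial_avg()` implementation itself), using 0-indexed order statistics:

```python
import numpy as np

def mean_abs_pair_diff_sorted(x):
    """E|X - X'| via the sorted-ensemble identity
    (2/n^2) * sum_i (2i - n + 1) * x_(i), 0-indexed; O(n log n)."""
    n = x.size
    xs = np.sort(x)
    i = np.arange(n)
    return (2.0 / n**2) * np.sum((2 * i - n + 1) * xs)

def crps_ensemble(x, y):
    """Empirical-CDF CRPS: mean|x_i - y| - 0.5 * E|X - X'|."""
    return np.abs(x - y).mean() - 0.5 * mean_abs_pair_diff_sorted(x)

rng = np.random.default_rng(2)
x = rng.normal(size=100)
brute = np.abs(x[:, None] - x[None, :]).mean()          # O(n^2) reference
print(np.isclose(mean_abs_pair_diff_sorted(x), brute))  # → True
```

Replacing the O(n²) pairwise term with the sorted form is what makes a 100-member ensemble affordable in both time and memory.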

Docs:
- docs/source/WeatherBench.md: rewritten user-focused guide
- docs/source/PerformanceMetrics.md: new — covers all three scripts, CRPS formula,
  spread-error ratio, seasonal subsetting, and crps_spatial_avg API
- docs/source/Evaluation.md: add links to new verification scripts
- docs/source/index.rst: add WeatherBench and PerformanceMetrics to toctree

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cation

- Add TestWB2References (11 tests): model presence, variable keys, RMSE/ACC bounds,
  monotonicity, WB2_STYLE consistency
- Add TestLatWeightedMetrics (7 tests): init, var list construction, call returns
  correct keys, perfect forecast → zero RMSE/MSE, MAE ≥ 0, ACC in [-1,1],
  predict-mode ensemble_size
- Add tests/test_standard.py: TestZonalSpectrum (6 tests) and
  TestAverageZonalSpectrum (4 tests) for credit.verification.standard,
  guarded by skipif when torch_harmonics not installed
- Add `import torch` to test_weatherbench.py (missing from existing imports)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… scripts

Rename output paths/labels from wxformer_v2 → wxformer; these scripts target
the v1 model. wxformer_v2 exists only in the wxf_v2 branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep by lead_time_hours

_score_step_worker was setting lead_h = step (raw step index) instead of
step * lead_time_hours. Pass lead_time_hours through work_items tuple so
the worker can compute the correct lead time in hours.

Fixes test_scores_have_lead_time_column assertion [4,8] != [24,48].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>