
[v2.1] WeatherBench2 verification: eval CLI, scorecard plots & test suite#310

Draft
jsschreck wants to merge 12 commits into main from v2.1/weatherbench

Conversation

@jsschreck
Collaborator

@jsschreck jsschreck commented Mar 27, 2026

Summary

Adds full WeatherBench2-style deterministic verification tooling directly into CREDIT. This is a proper integration — the metrics module lives in credit/verification/ and the CLI tools are first-class applications/ scripts, not one-off analysis notebooks.

New CLI commands:

# Fast path — aggregate existing per-init CREDIT metrics CSVs
python applications/eval_weatherbench.py --csv /path/to/metrics/ --out scores.csv

# Full path — compute from forecast netCDFs vs ERA5 with true anomaly ACC
python applications/eval_weatherbench.py \
  --netcdf /path/to/forecasts/ \
  --era5 /glade/campaign/cisl/aiml/ksha/CREDIT_data/ERA5_plevel_1deg/all_in_one/ERA5_plevel_1deg_6h_2022_conserve.zarr \
  --clim /glade/campaign/cisl/aiml/akn7/CREDIT_CESM/VERIF/ERA5_clim/ERA5_clim_1990_2019_6h_cesm_interp.nc \
  --out scores.csv \
  --plot figures/ \
  --label "WXFormer v2"

# Plots only (from existing scores CSV)
python applications/plot_weatherbench.py --scores scores.csv --out figures/ --label "WXFormer v2"

What's included

| File | Description |
| --- | --- |
| credit/verification/deterministic.py | Latitude-weighted RMSE, bias, true anomaly ACC with correct WB2 normalization |
| credit/verification/wb2_references.py | Published baseline scores: IFS HRES, Pangu-Weather, GraphCast |
| applications/eval_weatherbench.py | Evaluation CLI — CSV fast path + netCDF full path |
| applications/plot_weatherbench.py | Scorecard plots — RMSE, ACC, bias, heatmap, regional breakdown |
| tests/test_weatherbench.py | 38 unit + integration tests, all passing |
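For readers unfamiliar with the WB2 weighting convention, here is a minimal NumPy sketch of latitude-weighted RMSE, not the actual `credit/verification/deterministic.py` API: squared error is averaged over longitude first, then combined with normalized cos(latitude) weights.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """Minimal sketch of WB2-style latitude-weighted RMSE.

    forecast, truth: arrays of shape (lat, lon); lats_deg: 1-D latitudes.
    Squared error is averaged over longitude first, then the zonal means
    are combined with cos(latitude) weights normalized to sum to 1.
    """
    sq_err = (forecast - truth) ** 2
    zonal_mean = sq_err.mean(axis=-1)   # mean over longitude first
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.sum()                     # normalized latitude weights
    return np.sqrt((w * zonal_mean).sum())

lats = np.linspace(-90, 90, 181)
f = np.ones((181, 360))
t = np.zeros((181, 360))
print(lat_weighted_rmse(f, t, lats))  # ≈ 1.0: a uniform unit error gives unit RMSE
```

A uniform error field is a quick sanity check: any correctly normalized weighting must return the error magnitude itself.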

Figures generated

  • RMSE vs lead time per variable with IFS HRES / Pangu-Weather / GraphCast reference lines
  • ACC vs lead time — true anomaly ACC (requires climatology) not Pearson correlation
  • Bias vs lead time — global mean bias (forecast − ERA5)
  • Scorecard heatmap — skill score vs IFS HRES at days 1/2/3/5/7/10 (green = better than IFS)
  • Regional RMSE breakdown — global / tropics / N. extratropics / S. extratropics
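The scorecard heatmap convention can be sketched as follows. This is an illustrative assumption, not the plotting script itself: skill is taken as the relative RMSE improvement over the reference, so positive (green) cells mean the model beats IFS HRES. The RMSE values below are made up for the demo.

```python
import numpy as np

# Hypothetical RMSEs: rows = variables, cols = lead days (1, 3, 5, 7)
rmse_model = np.array([[380., 520., 640., 780.],    # e.g. Z500
                       [0.9,  1.3,  1.7,  2.1]])    # e.g. T850
rmse_ifs   = np.array([[400., 500., 650., 750.],
                       [1.0,  1.2,  1.8,  2.0]])

# Assumed convention: skill = 1 - RMSE_model / RMSE_ref,
# positive (green) = better than the IFS HRES reference.
skill = 1.0 - rmse_model / rmse_ifs
print(np.round(skill, 3))
```

Each cell of the printed matrix corresponds to one cell of the scorecard heatmap.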

Notes

  • Ensemble netCDFs are supported: ensemble mean is taken automatically before scoring
  • acc_* columns from legacy per-init CSVs are Pearson r, renamed to pearson_r_* to avoid confusion with true WB2 ACC
  • Reference baselines sourced from: Rasp et al. 2024 (WB2/JAMES), Bi et al. 2023 (Pangu/Nature), Lam et al. 2023 (GraphCast/Science)
  • ERA5 pressure-level data at /glade/campaign/cisl/aiml/ksha/CREDIT_data/ERA5_plevel_1deg/ covers 2020–2022

Test plan

  • python -m pytest tests/test_weatherbench.py — 38/38 pass
  • Run full 2022 eval against scheduler netCDFs on casper — verified over the weekend
  • Verify true ACC curves look sensible vs IFS HRES reference — verified over the weekend

🤖 Generated with Claude Code

Adds full WB2-style deterministic evaluation tooling to CREDIT:

credit/verification/deterministic.py
  - Latitude-weighted RMSE, bias, ACC with correct WB2 normalization
    (mean over longitude first, then lat-weighted sum)
  - Regional breakdowns: global / tropics / n_extratropics / s_extratropics
  - load_wb2_climatology() helper for true anomaly ACC

credit/verification/wb2_references.py
  - Published baseline scores for IFS HRES, Pangu-Weather, GraphCast
    (Rasp 2024, Bi 2023, Lam 2023) for reference overlays in plots

applications/eval_weatherbench.py
  - --csv fast path: aggregate existing per-init CREDIT metrics CSVs
  - --netcdf full path: compute RMSE/bias/ACC directly from forecast
    netCDFs vs ERA5 zarr with optional climatology for true anomaly ACC
  - --plot DIR: generate all WB2 scorecard figures inline
  - Ensemble netCDFs supported (ensemble mean used for deterministic scores)

applications/plot_weatherbench.py
  - RMSE vs lead time per variable with IFS/Pangu/GraphCast reference lines
  - ACC vs lead time (true anomaly ACC when climatology provided)
  - Bias vs lead time
  - Scorecard heatmap: skill score vs IFS HRES (green = better)
  - Regional RMSE breakdown

tests/test_weatherbench.py
  - 38 tests covering metrics math, regional breakdown, CSV fast path,
    netCDF integration, and climatology loading — all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 60.97561% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.45%. Comparing base (fd24d56) to head (f7b888b).

| File | Patch % | Lines missing |
| --- | --- | --- |
| credit/verification/wb2_references.py | 0.00% | 28 Missing ⚠️ |
| credit/verification/deterministic.py | 78.49% | 17 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
+ Coverage   16.13%   16.45%   +0.32%     
==========================================
  Files         122      125       +3     
  Lines       19604    19727     +123     
  Branches     3308     3326      +18     
==========================================
+ Hits         3164     3247      +83     
- Misses      16148    16185      +37     
- Partials      292      295       +3     


jsschreck and others added 11 commits March 27, 2026 16:45
- WB2_TARGET_LEVELS redesigned as 5-tuple (era5_var, credit_var, era5_level,
  credit_coord, credit_level) — separates ERA5 long names from CREDIT short
  names and handles both 'pressure' and 'level' coordinate dims
- ERA5_VAR_MAP: added upper-air identity entries (Z, T, U, V, Q) so the
  reverse lookup no longer returns None for all pressure-level variables
  (root cause of RMSE=NaN and ACC=0 in the --netcdf path)
- Regrid ERA5→forecast grid when lat/lon coords differ (192x288 vs 181x360)
- Fix plot_weatherbench import: add sys.path.insert so it resolves relative
  to the applications/ directory
- Update print_wb2_summary key_vars to match available ERA5 variables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. Loop over self.acc_vars but anomaly tensors are indexed in
   ordered_acc_vars order → variable labels were mismatched. Fix: loop
   over ordered_acc_vars.
2. Double-demeaning: subtracted global spatial mean from anomalies that
   are already climatology-relative. Standard WMO/WB2 ACC uses
   weighted_corr(f - clim, obs - clim) with no further mean removal.
   Fix: use anomalies directly as pred_prime / y_prime.
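The corrected definition can be sketched in NumPy. This is a hedged illustration of the standard WMO/WB2 formula described above, not the module's actual implementation: climatology-relative anomalies are correlated directly with cos(lat) area weights and no further mean removal.

```python
import numpy as np

def anomaly_acc(forecast, obs, clim, lats_deg):
    """Sketch of true anomaly ACC: weighted_corr(f - clim, obs - clim).

    The anomalies are already climatology-relative, so no additional
    spatial mean is subtracted (the double-demeaning bug fixed here).
    """
    w = np.cos(np.deg2rad(lats_deg))[:, None]
    w = w / (w.sum() * forecast.shape[-1])   # weights over (lat, lon), sum to 1
    fp = forecast - clim                     # pred_prime
    op = obs - clim                          # y_prime
    num = (w * fp * op).sum()
    den = np.sqrt((w * fp**2).sum() * (w * op**2).sum())
    return num / den

rng = np.random.default_rng(0)
lats = np.linspace(-90, 90, 19)
clim = rng.normal(size=(19, 36))
obs = clim + rng.normal(size=(19, 36))
print(anomaly_acc(obs, obs, clim, lats))  # perfect forecast, prints a value ≈ 1.0
```

A perfect forecast must score exactly 1; subtracting an extra spatial mean would break that only in pathological cases, but it systematically distorts ACC for real forecasts.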

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Filename suffix (e.g. _006.nc) is already the lead time in hours.
Was multiplying by lead_time_hours again → _006.nc showed as 36h
instead of 6h.
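The corrected parsing amounts to reading the suffix directly. A small sketch, assuming the hypothetical naming pattern `pred_006.nc` where `_HHH` is already the lead time in hours:

```python
from pathlib import Path

def lead_hours_from_name(path):
    """The _HHH filename suffix is already the lead time in hours,
    so parse it directly instead of multiplying by lead_time_hours."""
    return int(Path(path).stem.split("_")[-1])

print(lead_hours_from_name("forecasts/pred_006.nc"))  # → 6, not 36
```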

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…steps

Each lead-time step is fully independent; run all steps concurrently with
ProcessPoolExecutor(max_workers). Adds --workers CLI arg (default os.cpu_count()).
PBS script updated to 16 CPUs / 4h walltime.
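The parallelization pattern looks roughly like the sketch below, with a stand-in `score_step` worker in place of the real scoring function; since each lead-time step is independent, `ProcessPoolExecutor.map` preserves step order while running the steps concurrently.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def score_step(step):
    """Hypothetical per-step scoring stub: each lead-time step is
    independent, so steps can be scored in separate processes."""
    return step, step * 6  # (step index, lead hours at a 6 h cadence)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(score_step, range(4)))
    print(results)
```

`pool.map` returns results in submission order, so downstream aggregation by lead time needs no re-sorting.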

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reshape (ensemble, time, lat, lon) → (ensemble, N) and call rankhist
once instead of once per time step.
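The reshape trick can be sketched as follows; the rank computation here is an illustrative stand-in, not the actual `rankhist` API:

```python
import numpy as np

rng = np.random.default_rng(1)
ens = rng.normal(size=(10, 4, 6, 8))   # (ensemble, time, lat, lon)
obs = rng.normal(size=(4, 6, 8))

# Flatten all non-ensemble dims so the rank computation runs once.
ens2d = ens.reshape(ens.shape[0], -1)  # (ensemble, N)
obs1d = obs.reshape(-1)                # (N,)

# Illustrative rank of the observation within each ensemble column.
ranks = (ens2d < obs1d).sum(axis=0)    # values in 0..n_ensemble
hist = np.bincount(ranks, minlength=ens.shape[0] + 1)
print(hist.sum())  # → 192 (= 4*6*8 points, one rank each)
```

One vectorized comparison over the flattened axis replaces the per-time-step loop.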

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gures

Each variable figure (rmse, acc, regional_rmse) is independent; run all
in parallel via ProcessPoolExecutor. Adds --workers CLI arg. Uses Agg
backend for multiprocessing safety.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rank_histogram_apply vectorization was cherry-picked to ensemble-v2
where the file's PR lives. Weatherbench branch should not touch it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New scripts:
- applications/wxformer_wb2_verif.py: per-init ACC/RMSE in Kyle-format NetCDF
- applications/ensemble_wb2_verif.py: per-init CRPS/spread/RMSE/ACC for 100-member
  CREDIT ensemble using sorted-ensemble CRPS formula (O(n log n), memory-efficient)
- applications/hres_wb2_verif.py: HRES deterministic ACC/RMSE regridded to CREDIT
  1.25 deg grid via xesmf bilinear for direct model comparison

All three scripts output (days, time=40) NetCDF files in a standardized format, with dayofyear/hour coords and per-100-init batch files matching the CREDIT-arXiv convention.

Verification:
- credit/verification/ensemble.py: add crps_spatial_avg() using sorted-ensemble
  identity E[|X-X'|] = (2/n^2) sum (2i-n+1) x_(i)
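The sorted-ensemble identity above can be verified numerically. A minimal sketch (not the `crps_spatial_avg()` implementation itself), using 0-indexed order statistics:

```python
import numpy as np

def mean_abs_pair_diff_sorted(x):
    """E|X - X'| via the sorted-ensemble identity
    (2/n^2) * sum_i (2i - n + 1) * x_(i), 0-indexed; O(n log n)."""
    n = x.size
    xs = np.sort(x)
    i = np.arange(n)
    return (2.0 / n**2) * np.sum((2 * i - n + 1) * xs)

def crps_ensemble(x, y):
    """Empirical-CDF CRPS: mean|x_i - y| - 0.5 * E|X - X'|."""
    return np.abs(x - y).mean() - 0.5 * mean_abs_pair_diff_sorted(x)

rng = np.random.default_rng(2)
x = rng.normal(size=100)
brute = np.abs(x[:, None] - x[None, :]).mean()          # O(n^2) reference
print(np.isclose(mean_abs_pair_diff_sorted(x), brute))  # → True
```

Replacing the O(n²) pairwise term with the sorted form is what makes a 100-member ensemble affordable in both time and memory.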

Docs:
- docs/source/WeatherBench.md: rewritten user-focused guide
- docs/source/PerformanceMetrics.md: new — covers all three scripts, CRPS formula,
  spread-error ratio, seasonal subsetting, and crps_spatial_avg API
- docs/source/Evaluation.md: add links to new verification scripts
- docs/source/index.rst: add WeatherBench and PerformanceMetrics to toctree

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cation

- Add TestWB2References (11 tests): model presence, variable keys, RMSE/ACC bounds,
  monotonicity, WB2_STYLE consistency
- Add TestLatWeightedMetrics (7 tests): init, var list construction, call returns
  correct keys, perfect forecast → zero RMSE/MSE, MAE ≥ 0, ACC in [-1,1],
  predict-mode ensemble_size
- Add tests/test_standard.py: TestZonalSpectrum (6 tests) and
  TestAverageZonalSpectrum (4 tests) for credit.verification.standard,
  guarded by skipif when torch_harmonics not installed
- Add `import torch` to test_weatherbench.py (missing from existing imports)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… scripts

Rename output paths/labels from wxformer_v2 → wxformer; these scripts target
the v1 model. wxformer_v2 exists only in the wxf_v2 branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep by lead_time_hours

_score_step_worker was setting lead_h = step (raw step index) instead of
step * lead_time_hours. Pass lead_time_hours through work_items tuple so
the worker can compute the correct lead time in hours.

Fixes test_scores_have_lead_time_column assertion [4,8] != [24,48].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>