Context
Hi @raspstephan,
In the context of my activity, I've been given time to work on benchmarking ML-based forecast models. Since WeatherBench is a great open-source project, I would like to offer my help to improve the benchmark.
The first idea I would like to introduce is the addition of a "Performance vs Latency" plot in the benchmark. Latency would be defined as the inference time per forecast step or for a fixed lead-time (e.g. 15 days). This is interesting for people who face restrictions on GPU resources, or those who want to perform inference at a large scale (e.g. Huge Ensembles related work), and want to know the trade-off between performance and latency.
For the figures, we could draw inspiration from the LLM community. See the example at https://huggingface.co/spaces/optimum/llm-perf-leaderboard, tab "Find Your Best Model".
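To make the idea concrete, here is a minimal sketch of such a plot with matplotlib. The model names and numbers are placeholders I made up for illustration, not actual benchmark results; the metric (RMSE) and the per-step latency unit are assumptions about what the leaderboard would report.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted/CI use
import matplotlib.pyplot as plt

def plot_performance_vs_latency(results):
    """Scatter skill vs. latency, one point per model.

    `results` maps model name -> (latency_seconds, rmse).
    """
    fig, ax = plt.subplots()
    for name, (latency, rmse) in results.items():
        ax.scatter(latency, rmse)
        ax.annotate(name, (latency, rmse),
                    textcoords="offset points", xytext=(5, 5))
    ax.set_xscale("log")  # latencies can span orders of magnitude
    ax.set_xlabel("Latency per forecast step [s]")
    ax.set_ylabel("RMSE (lower is better)")
    ax.set_title("Performance vs Latency")
    return fig, ax

# Hypothetical numbers for illustration only.
fig, ax = plot_performance_vs_latency({
    "model_a": (0.05, 1.2),
    "model_b": (0.50, 0.9),
})
```

The log-scaled x-axis mirrors the LLM leaderboard linked above, where models differ in latency by orders of magnitude while skill differences are comparatively small.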
Proposed steps
- For one model, launch an `n_steps` forecast: ~10 warm-up steps, ~50-100 profiled steps. Latency is defined as the average time over the profiled steps.
- Generalize to all models, and save the results in the WeatherBench bucket.
- Plot the results in the WeatherBench front-end.
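The warm-up/profile loop above can be sketched as follows. The `step_fn` interface (state in, next state out) and the `sync` hook are my assumptions about how a model wrapper might look, not an existing WeatherBench or earth2studio API:

```python
import time

def measure_latency(step_fn, state, n_warmup=10, n_profile=50, sync=None):
    """Average per-step inference latency of an autoregressive forecast.

    step_fn: callable mapping a model state to the next state
             (hypothetical interface; adapt to the actual model wrapper).
    sync:    optional callable that flushes asynchronous device work,
             e.g. torch.cuda.synchronize on GPU. Without it, timings on
             accelerators only measure kernel *launch* time.
    """
    # Warm-up steps: exclude one-off costs (JIT compilation, cuDNN
    # autotuning, cache population) from the measurement.
    for _ in range(n_warmup):
        state = step_fn(state)
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(n_profile):
        state = step_fn(state)
    if sync:
        sync()
    return (time.perf_counter() - start) / n_profile  # seconds per step
```

Reporting latency for a fixed lead time (e.g. 15 days) then reduces to multiplying the per-step value by the number of steps that lead time requires for the model's native time step.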
As for the inference code, I personally find earth2studio very useful. It has a large model zoo, persistent cache management for input data, access to several data sources (WB2, CDS, ARCO, ...), and is released under the Apache 2.0 license. Happy to discuss alternatives if you have any.
Challenges foreseen
- Implementation code for every model should live in a single package, one package to run them all. This way, we would reduce the impact of implementation differences between models on the latency benchmark.
- Some models currently in WeatherBench are not present in earth2studio. They might need to be implemented there in the long run.
- TPU vs GPU: how do we harmonize the latency benchmark across models originally optimized for different hardware?
- Traditional NWP models are in a category of their own, due to their CPU-based workloads. They are likely better discussed in a separate issue.
Hope you'll find the contribution useful! Happy to receive feedback, hints, advice, or any pointers to existing results on this topic that I've overlooked.