Context
Hi @raspstephan,
In the context of my activity, I've been given time to work on benchmarking ML-based forecast models. Since WeatherBench is a great open-source project, I would like to offer my help to improve the benchmark.
The first idea I would like to introduce is the addition of a "Performance vs Latency" plot in the benchmark. Latency would be defined as the inference time per forecast step or for a fixed lead-time (e.g. 15 days). This is interesting for people who face restrictions on GPU resources, or those who want to perform inference at a large scale (e.g. Huge Ensembles related work), and want to know the trade-off between performance and latency.
For the figures, we could draw inspiration from the LLM community. See the example at https://huggingface.co/spaces/optimum/llm-perf-leaderboard, tab "Find Your Best Model".
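To make the idea concrete, here is a minimal sketch of such a plot with matplotlib. The model names and numbers are placeholders I made up for illustration, not actual benchmark results; the metric (RMSE) and the per-step latency unit are assumptions about what the leaderboard would report.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted/CI use
import matplotlib.pyplot as plt

def plot_performance_vs_latency(results):
    """Scatter skill vs. latency, one point per model.

    `results` maps model name -> (latency_seconds, rmse).
    """
    fig, ax = plt.subplots()
    for name, (latency, rmse) in results.items():
        ax.scatter(latency, rmse)
        ax.annotate(name, (latency, rmse),
                    textcoords="offset points", xytext=(5, 5))
    ax.set_xscale("log")  # latencies can span orders of magnitude
    ax.set_xlabel("Latency per forecast step [s]")
    ax.set_ylabel("RMSE (lower is better)")
    ax.set_title("Performance vs Latency")
    return fig, ax

# Hypothetical numbers for illustration only.
fig, ax = plot_performance_vs_latency({
    "model_a": (0.05, 1.2),
    "model_b": (0.50, 0.9),
})
```

The log-scaled x-axis mirrors the LLM leaderboard linked above, where models differ in latency by orders of magnitude while skill differences are comparatively small.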
Proposed steps
- For one model, launch an `n_steps` forecast: ~10 warm-up steps, ~50-100 profiled steps. Latency is defined as the average time over the profiled steps.
- Generalize to all models, and save the results in the WeatherBench bucket.
- Plot the results in the WeatherBench front-end.
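The warm-up/profile loop above can be sketched as follows. The `step_fn` interface (state in, next state out) and the `sync` hook are my assumptions about how a model wrapper might look, not an existing WeatherBench or earth2studio API:

```python
import time

def measure_latency(step_fn, state, n_warmup=10, n_profile=50, sync=None):
    """Average per-step inference latency of an autoregressive forecast.

    step_fn: callable mapping a model state to the next state
             (hypothetical interface; adapt to the actual model wrapper).
    sync:    optional callable that flushes asynchronous device work,
             e.g. torch.cuda.synchronize on GPU. Without it, timings on
             accelerators only measure kernel *launch* time.
    """
    # Warm-up steps: exclude one-off costs (JIT compilation, cuDNN
    # autotuning, cache population) from the measurement.
    for _ in range(n_warmup):
        state = step_fn(state)
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(n_profile):
        state = step_fn(state)
    if sync:
        sync()
    return (time.perf_counter() - start) / n_profile  # seconds per step
```

Reporting latency for a fixed lead time (e.g. 15 days) then reduces to multiplying the per-step value by the number of steps that lead time requires for the model's native time step.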
As for the inference code, I personally find earth2studio very useful. It has a large model zoo, persistent cache management for input data, access to several data sources (WB2, CDS, ARCO, ...), and is released under the Apache 2.0 license. Happy to discuss alternatives if you have any.
Challenges foreseen
- Implementation code for every model should live in a single package, one package to run them all. This way, we would reduce the impact of implementation differences between models on the latency benchmark.
- Some models currently in WeatherBench are not present in earth2studio. They might need to be implemented there in the long run.
- TPU vs GPU: how do we harmonize the latency benchmark across models originally optimized for different hardware?
- Traditional NWP models are in a category of their own, due to their CPU-based workloads. They are likely better discussed in a separate issue.
Hope you'll find the contribution useful! Happy to receive feedback, hints, advice, or any pointers to existing results on this topic that I've overlooked.