Feature Request
Motivation
vLLM's benchmark suite currently tracks throughput and latency, but not energy consumption. As sustainable AI becomes increasingly important, energy-per-token metrics would help users make informed deployment decisions.
Proposal
Add optional energy consumption tracking to vLLM's benchmark scripts using NVIDIA NVML, reporting:
- Total energy (Joules) per benchmark run
- Energy per output token (J/token)
- Average GPU power draw (W)
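For concreteness, the three proposed metrics reduce to simple ratios over a benchmark run. A minimal sketch (the function name and signature are illustrative, not an existing vLLM API):

```python
def energy_metrics(total_energy_j: float, output_tokens: int, duration_s: float) -> dict:
    """Derive the three proposed reporting metrics from per-run totals.

    Hypothetical helper for illustration: total energy comes from power
    sampling (see Implementation below), tokens/duration from the
    existing benchmark bookkeeping.
    """
    return {
        "total_energy_j": total_energy_j,            # Joules for the whole run
        "energy_per_token_j": total_energy_j / output_tokens,  # J/token
        "avg_power_w": total_energy_j / duration_s,  # W = J/s
    }
```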
Evidence
Systematic benchmarking across 12 model-precision configurations on NVIDIA RTX 4090D (Ada Lovelace) and RTX 5090 (Blackwell) shows that:
- Quantization does not always reduce energy — NF4 increases energy by 25–56% for models below 3B parameters
- Batch size has an 84–96% effect on per-token energy, often outweighing precision choice
- INT8 mixed-precision adds 17–33% energy overhead vs FP16
- These effects vary significantly across GPU architectures
Data
- Full dataset (200+ measurements): Zenodo
- Profiling toolkit: EcoCompute-AI
- Interactive dashboard: https://hongping-zh.github.io/ecocompute-dynamic-eval/
Implementation
I have an open-source NVML-based energy profiling toolkit (EcoCompute-AI) and would be happy to contribute a PR implementing this if there is interest.
The core approach:
- Use `pynvml` to sample GPU power at 10 Hz during benchmark runs
- Compute total energy via trapezoidal integration
- Report energy metrics alongside existing throughput/latency numbers
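The sampling-plus-integration approach could be sketched roughly as below. Class and helper names are hypothetical, and `pynvml` is imported lazily so benchmarks run unchanged when NVML is unavailable (consistent with the tracking being optional):

```python
import threading
import time


def trapezoid_energy(samples):
    """Integrate (timestamp_s, watts) samples to Joules via the trapezoidal rule."""
    return sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )


class PowerSampler:
    """Background thread sampling GPU power via NVML at a fixed rate.

    Hypothetical sketch of the proposal, not vLLM code.
    """

    def __init__(self, gpu_index: int = 0, sample_hz: float = 10.0):
        self.gpu_index = gpu_index
        self.interval = 1.0 / sample_hz
        self.samples = []  # (monotonic timestamp in s, power in W)
        self._stop = threading.Event()
        self._thread = None

    def _run(self, pynvml, handle):
        while not self._stop.is_set():
            # nvmlDeviceGetPowerUsage reports milliwatts
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            self.samples.append((time.monotonic(), watts))
            time.sleep(self.interval)

    def __enter__(self):
        import pynvml  # optional dependency, only needed when tracking is enabled

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(self.gpu_index)
        self._thread = threading.Thread(
            target=self._run, args=(pynvml, handle), daemon=True
        )
        self._thread.start()
        return self

    def __exit__(self, *exc):
        import pynvml

        self._stop.set()
        self._thread.join()
        pynvml.nvmlShutdown()

    def total_energy_joules(self) -> float:
        return trapezoid_energy(self.samples)
```

Usage would wrap an existing benchmark run, e.g. `with PowerSampler() as p: run_benchmark()`, then report `p.total_energy_joules()` alongside throughput and latency.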
Related
- MLPerf Inference Benchmark focuses on throughput/latency only
- CodeCarbon provides system-wide tracking but not per-model GPU-specific metrics
- Related PRs: huggingface/transformers#44407, huggingface/optimum#2410