[Feature] Add energy consumption metrics to benchmark suite #36440

@hongping-zh

Feature Request

Motivation

vLLM's benchmark suite currently tracks throughput and latency, but not energy consumption. As sustainable AI becomes increasingly important, energy-per-token metrics would help users make informed deployment decisions.

Proposal

Add optional energy consumption tracking to vLLM's benchmark scripts using NVIDIA NVML, reporting:

  • Total energy (Joules) per benchmark run
  • Energy per output token (J/token)
  • Average GPU power draw (W)
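
As a rough illustration of how the three reported values relate, the sketch below derives energy per output token and average power from a total energy figure. The names `EnergyMetrics` and `summarize_energy` are hypothetical and are not part of vLLM or EcoCompute-AI. It assumes total energy, wall-clock duration, and output token count are already known for the run.

```python
from dataclasses import dataclass


@dataclass
class EnergyMetrics:
    """Hypothetical container for the three proposed metrics."""
    total_energy_j: float      # total energy for the run, in Joules
    energy_per_token_j: float  # Joules per generated (output) token
    avg_power_w: float         # mean GPU power draw, in Watts


def summarize_energy(total_energy_j: float,
                     duration_s: float,
                     num_output_tokens: int) -> EnergyMetrics:
    """Derive per-token energy and average power from a run's totals."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if num_output_tokens <= 0:
        raise ValueError("num_output_tokens must be positive")
    return EnergyMetrics(
        total_energy_j=total_energy_j,
        energy_per_token_j=total_energy_j / num_output_tokens,
        avg_power_w=total_energy_j / duration_s,  # W = J / s
    )


# Example: 3000 J over 10 s while generating 1500 tokens.
metrics = summarize_energy(3000.0, 10.0, 1500)
print(metrics.energy_per_token_j)  # 2.0
print(metrics.avg_power_w)         # 300.0
```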

Evidence

Systematic benchmarking across 12 model-precision configurations on NVIDIA RTX 4090D (Ada Lovelace) and RTX 5090 (Blackwell) shows that:

  • Quantization does not always reduce energy: NF4 increases energy by 25–56% for models below 3B parameters
  • Batch size has 84–96% impact on per-token energy, often outweighing precision choice
  • INT8 mixed-precision adds 17–33% energy overhead vs FP16
  • These effects vary significantly across GPU architectures

Data

Implementation

I have an open-source NVML-based energy profiling toolkit (EcoCompute-AI) and would be happy to contribute a PR implementing this if there is interest.

The core approach:

  • Use pynvml to sample GPU power at 10 Hz during benchmark runs
  • Compute total energy via trapezoidal integration
  • Report energy metrics alongside existing throughput/latency numbers
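
The steps above could be sketched roughly as follows. This is a minimal illustration, not the EcoCompute-AI implementation: `PowerSampler`, `trapezoid_energy`, and `make_nvml_reader` are hypothetical names, and the sampler takes any callable that returns Watts so it can be exercised without a GPU. The NVML reader assumes `pynvml` is installed and uses `nvmlDeviceGetPowerUsage`, which reports milliwatts.

```python
import threading
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in Watts)


def trapezoid_energy(samples: List[Sample]) -> float:
    """Integrate power samples into Joules with the trapezoidal rule."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j


class PowerSampler:
    """Hypothetical helper: polls a power reader in a background thread."""

    def __init__(self, read_power_w: Callable[[], float],
                 interval_s: float = 0.1) -> None:  # 0.1 s is 10 Hz
        self._read_power_w = read_power_w
        self.interval_s = interval_s
        self.samples: List[Sample] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Sample immediately, then once per interval until asked to stop.
        self.samples.append((time.monotonic(), self._read_power_w()))
        while not self._stop.wait(self.interval_s):
            self.samples.append((time.monotonic(), self._read_power_w()))
        # Closing sample so the last partial interval is integrated too.
        self.samples.append((time.monotonic(), self._read_power_w()))

    def __enter__(self) -> "PowerSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()


def make_nvml_reader(gpu_index: int = 0) -> Callable[[], float]:
    """Return a callable that reads one GPU's power draw in Watts via NVML."""
    import pynvml  # imported lazily so the rest of the sketch runs without it

    pynvml.nvmlInit()  # a real integration would also call nvmlShutdown()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # nvmlDeviceGetPowerUsage returns milliwatts; convert to Watts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
```

A benchmark script might wrap its timed region in `with PowerSampler(make_nvml_reader()) as sampler:` and then call `trapezoid_energy(sampler.samples)` once the run finishes. Timestamps come from `time.monotonic()` so the integration is unaffected by wall-clock adjustments.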

Related
