
[Metrics] Model FLOPs Utilization estimation #30738

Merged
zhuohan123 merged 9 commits into vllm-project:main from SungMinCho:main on Dec 18, 2025

Conversation

@SungMinCho commented Dec 16, 2025

Signed-off-by: SungMinCho tjdals4565@gmail.com

TL;DR

This PR implements optional "MFU stats logging", which appends per-GPU flops/bandwidth information to the existing periodic logs, reporting the average compute/memory performance of the GPUs in each Engine over that log interval. These stats are calculated with minimal overhead by feeding the SchedulerOutput into an analytic, config-based perf calculator at every iteration.
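
A minimal sketch of the kind of analytic, config-based estimate described above, assuming a plain dense transformer and counting only the large GEMMs; the names and formula are illustrative and far simpler than the component-based calculator this PR adds in vllm/v1/metrics/perf.py.

from dataclasses import dataclass

@dataclass
class ToyModelConfig:
    num_layers: int
    hidden_size: int
    intermediate_size: int
    vocab_size: int

def estimate_iteration_flops(cfg: ToyModelConfig, num_tokens: int) -> float:
    """Very rough dense-transformer FLOPs for one forward pass over num_tokens.

    Counts only the attention projections, the MLP, and the LM head;
    attention score FLOPs, MoE routing, etc. are ignored.
    """
    per_token_per_layer = (
        2 * 4 * cfg.hidden_size * cfg.hidden_size          # Q/K/V/O projections
        + 2 * 2 * cfg.hidden_size * cfg.intermediate_size  # MLP up + down
    )
    lm_head = 2 * cfg.hidden_size * cfg.vocab_size
    return num_tokens * (cfg.num_layers * per_token_per_layer + lm_head)

# Made-up config values, purely for illustration.
cfg = ToyModelConfig(num_layers=32, hidden_size=4096, intermediate_size=14336, vocab_size=128000)
print(f"{estimate_iteration_flops(cfg, num_tokens=8192) / 1e12:.1f} TFLOPs this iteration")

Accumulating such per-iteration estimates over a log interval and dividing by the elapsed time and the number of GPUs gives the TF/s/GPU figure appended to the periodic log line.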

How to use

Pass --enable-mfu-metrics when launching the vLLM server to enable MFU logging.

Setting VLLM_DEBUG_MFU_METRICS=1 together with VLLM_LOGGING_LEVEL=DEBUG adds debug output for developers working on the metrics calculations.
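
These switches are normally set in the shell that launches vllm serve; below is a hedged sketch of setting them programmatically instead (assumption: vllm.envs reads environment variables lazily, so setting them before the server process starts has the same effect).

import os

# Enable the extra MFU debug output described above before vLLM starts.
os.environ["VLLM_DEBUG_MFU_METRICS"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"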

Purpose

To track the hardware performance over the course of vLLM serving.

Examples

(In each example, the MFU stats appear at the end of each logged line.)
(All examples were gathered on B200 GPUs.)

GPT-OSS 120B TP=8 BatchSize=256 Input=6K Output=3K NumBatch=1

with VLLM_MFU_LOGGING_LEVEL=1

INFO 12-15 17:38:18 [loggers.py:259] Engine 000: Avg prompt throughput: 3599.9 tokens/s, Avg generation throughput: 199.6 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, MFU: 6.6 TF/s/GPU 105.0 GB/s/GPU
INFO 12-15 17:38:28 [loggers.py:259] Engine 000: Avg prompt throughput: 150003.2 tokens/s, Avg generation throughput: 938.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%, MFU: 259.9 TF/s/GPU 491.7 GB/s/GPU
INFO 12-15 17:38:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18609.5 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%, MFU: 32.6 TF/s/GPU 1138.1 GB/s/GPU
INFO 12-15 17:38:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18479.1 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%, MFU: 33.3 TF/s/GPU 1191.9 GB/s/GPU
INFO 12-15 17:38:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18428.1 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%, MFU: 34.2 TF/s/GPU 1249.8 GB/s/GPU
INFO 12-15 17:39:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18201.0 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%, MFU: 34.8 TF/s/GPU 1294.4 GB/s/GPU
INFO 12-15 17:39:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2130.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 4.1 TF/s/GPU 159.4 GB/s/GPU
INFO 12-15 17:39:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

with VLLM_MFU_LOGGING_LEVEL=2
https://gist.github.com/SungMinCho/9ed5254e4bd1b3e5eb05a360df0dcc88

gives you the details behind the logged MFU stats, such as:

  • per-component breakdown of the reported flops/bytes
  • breakdown of the run context (i.e. the input to the MFU calculator for that log interval, e.g. prefill_num_tokens)
  • log duration, MFU calculation duration, and MFU calculation overhead
  • etc.
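
To put the logged numbers in context, here is a small sketch that extracts the "MFU: X TF/s/GPU Y GB/s/GPU" suffix from a log line and converts the compute figure into a utilization fraction; the peak TFLOP/s value is an assumption you must supply for your own GPU and dtype, it is not part of the log.

import re

MFU_RE = re.compile(r"MFU: ([\d.]+) TF/s/GPU ([\d.]+) GB/s/GPU")

def parse_mfu(line: str) -> tuple[float, float] | None:
    """Return (tflops_per_gpu, gbps_per_gpu) from a periodic log line, if present."""
    m = MFU_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None

def utilization(tflops_per_gpu: float, peak_tflops: float) -> float:
    """Fraction of an assumed per-GPU peak; supply the peak for your GPU and dtype."""
    return tflops_per_gpu / peak_tflops

line = ("Engine 000: Avg prompt throughput: 150003.2 tokens/s, "
        "MFU: 259.9 TF/s/GPU 491.7 GB/s/GPU")
tf, gb = parse_mfu(line)
print(f"{tf} TF/s/GPU is {utilization(tf, peak_tflops=2000.0):.1%} of an assumed 2000 TF/s peak")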

Some more examples with different parallelism setups

GPT-OSS 120B TP=4 BatchSize=256 Input=6K Output=3K NumBatch=1

(APIServer pid=2363162) INFO 12-15 18:27:00 [loggers.py:259] Engine 000: Avg prompt throughput: 101389.0 tokens/s, Avg generation throughput: 77.4 tokens/s, Running: 170 reqs, Waiting: 86 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 349.4 TF/s/GPU 391.1 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:10 [loggers.py:259] Engine 000: Avg prompt throughput: 51584.2 tokens/s, Avg generation throughput: 9711.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%, MFU: 211.3 TF/s/GPU 1339.8 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15227.4 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.8%, Prefix cache hit rate: 0.0%, MFU: 54.0 TF/s/GPU 1870.2 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15176.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.3%, Prefix cache hit rate: 0.0%, MFU: 55.1 TF/s/GPU 1947.1 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15151.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.7%, Prefix cache hit rate: 0.0%, MFU: 56.3 TF/s/GPU 2026.6 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15074.7 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.2%, Prefix cache hit rate: 0.0%, MFU: 57.4 TF/s/GPU 2098.3 GB/s/GPU

GPT-OSS 120B TP=1 BatchSize=256 Input=6K Output=3K NumBatch=1

(APIServer pid=2548211) INFO 12-15 18:32:12 [loggers.py:259] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 204.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%, MFU: 11.3 TF/s/GPU 772.8 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:22 [loggers.py:259] Engine 000: Avg prompt throughput: 29400.6 tokens/s, Avg generation throughput: 8.7 tokens/s, Running: 50 reqs, Waiting: 206 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%, MFU: 405.0 TF/s/GPU 269.3 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:32 [loggers.py:259] Engine 000: Avg prompt throughput: 37799.9 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 113 reqs, Waiting: 143 reqs, GPU KV cache usage: 17.4%, Prefix cache hit rate: 0.0%, MFU: 521.0 TF/s/GPU 344.9 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:42 [loggers.py:259] Engine 000: Avg prompt throughput: 37800.0 tokens/s, Avg generation throughput: 46.5 tokens/s, Running: 176 reqs, Waiting: 80 reqs, GPU KV cache usage: 25.6%, Prefix cache hit rate: 0.0%, MFU: 521.2 TF/s/GPU 349.3 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:52 [loggers.py:259] Engine 000: Avg prompt throughput: 37797.4 tokens/s, Avg generation throughput: 65.4 tokens/s, Running: 239 reqs, Waiting: 17 reqs, GPU KV cache usage: 33.8%, Prefix cache hit rate: 0.0%, MFU: 521.5 TF/s/GPU 353.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:02 [loggers.py:259] Engine 000: Avg prompt throughput: 10197.0 tokens/s, Avg generation throughput: 6142.2 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.7%, Prefix cache hit rate: 0.0%, MFU: 225.5 TF/s/GPU 2940.2 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6832.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.2%, Prefix cache hit rate: 0.0%, MFU: 95.6 TF/s/GPU 3238.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6807.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 37.6%, Prefix cache hit rate: 0.0%, MFU: 96.4 TF/s/GPU 3293.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6756.5 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 39.0%, Prefix cache hit rate: 0.0%, MFU: 96.7 TF/s/GPU 3334.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6755.3 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.5%, Prefix cache hit rate: 0.0%, MFU: 97.7 TF/s/GPU 3399.8 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6707.4 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.9%, Prefix cache hit rate: 0.0%, MFU: 98.1 TF/s/GPU 3440.7 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6603.0 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.3%, Prefix cache hit rate: 0.0%, MFU: 97.6 TF/s/GPU 3450.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6527.2 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 44.7%, Prefix cache hit rate: 0.0%, MFU: 97.4 TF/s/GPU 3472.5 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6576.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.1%, Prefix cache hit rate: 0.0%, MFU: 99.2 TF/s/GPU 3561.0 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6627.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.5%, Prefix cache hit rate: 0.0%, MFU: 100.9 TF/s/GPU 3651.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6551.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.9%, Prefix cache hit rate: 0.0%, MFU: 100.8 TF/s/GPU 3671.8 GB/s/GPU

GPT-OSS 120B DP=2 TP=4 EP=8 BatchSize=256 Input=6K Output=3K NumBatch=1

Note: This is to show that it works well under multiple-engine scenarios (i.e. DP > 1)

INFO 12-15 18:51:34 [loggers.py:259] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 98.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 1.7 TF/s/GPU 50.0 GB/s/GPU
INFO 12-15 18:51:34 [loggers.py:259] Engine 001: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 0.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 0.4 TF/s/GPU 0.5 GB/s/GPU
INFO 12-15 18:51:44 [loggers.py:259] Engine 000: Avg prompt throughput: 62998.4 tokens/s, Avg generation throughput: 32.0 tokens/s, Running: 106 reqs, Waiting: 22 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%, MFU: 160.6 TF/s/GPU 106.9 GB/s/GPU
INFO 12-15 18:51:44 [loggers.py:259] Engine 001: Avg prompt throughput: 62998.7 tokens/s, Avg generation throughput: 32.0 tokens/s, Running: 106 reqs, Waiting: 22 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%, MFU: 160.6 TF/s/GPU 106.9 GB/s/GPU
INFO 12-15 18:51:54 [loggers.py:259] Engine 000: Avg prompt throughput: 13197.8 tokens/s, Avg generation throughput: 6962.0 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%, MFU: 51.6 TF/s/GPU 846.9 GB/s/GPU
INFO 12-15 18:51:54 [loggers.py:259] Engine 001: Avg prompt throughput: 13197.9 tokens/s, Avg generation throughput: 6962.0 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%, MFU: 51.6 TF/s/GPU 846.9 GB/s/GPU
INFO 12-15 18:52:04 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9136.6 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%, MFU: 24.5 TF/s/GPU 1136.6 GB/s/GPU
INFO 12-15 18:52:04 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9136.6 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%, MFU: 24.5 TF/s/GPU 1136.6 GB/s/GPU
INFO 12-15 18:52:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.9 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%, MFU: 25.5 TF/s/GPU 1198.2 GB/s/GPU
INFO 12-15 18:52:14 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.9 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%, MFU: 25.5 TF/s/GPU 1198.2 GB/s/GPU
INFO 12-15 18:52:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9073.2 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 26.2 TF/s/GPU 1248.0 GB/s/GPU
INFO 12-15 18:52:24 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9073.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 26.2 TF/s/GPU 1248.0 GB/s/GPU

Test Plan

See Examples above.

Test Result

See Examples above.

Notes

This PR has been moved from #28859 due to code sync issues with Meta-internal codebase. See #28859 for some of the original discussions and reviews.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces MFU (Model Flops Utilization) stats logging, which is a valuable feature for performance monitoring. The implementation is well-structured, particularly the new vllm/v1/metrics/perf.py file which uses a modular parser chain and component-based metrics calculation.

My main feedback is on a potential correctness issue in the final rate calculation for TFLOPs/s and GB/s when pipeline parallelism is enabled. The current logic seems to inflate these metrics by the pipeline parallel size. I've provided a suggestion to correct this.

Overall, this is a great addition to vLLM's observability features.

Comment on lines 1162 to 1169
delta_time_per_gpu = delta_time / self.pp_size

avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
avg_gbps_per_gpu = (
    (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
    / delta_time_per_gpu
    / 1e9
)

high

The calculation of avg_tflops_per_gpu and avg_gbps_per_gpu appears to be incorrect when pipeline parallelism is used (pp_size > 1).

The total_num_flops_per_gpu and total_*_bytes_per_gpu values are already calculated on a per-GPU basis. Dividing delta_time by pp_size to get delta_time_per_gpu incorrectly inflates the reported TFLOPs/s and GB/s rates by a factor of pp_size.

The rate should be calculated over the total delta_time during which the metrics were accumulated. Additionally, it's good practice to handle the case where delta_time could be zero or negative to avoid a ZeroDivisionError.

Suggested change
- delta_time_per_gpu = delta_time / self.pp_size
- avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
- avg_gbps_per_gpu = (
-     (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
-     / delta_time_per_gpu
-     / 1e9
- )
+ if delta_time <= 0.0:
+     avg_tflops_per_gpu = 0.0
+     avg_gbps_per_gpu = 0.0
+ else:
+     avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time / 1e12
+     avg_gbps_per_gpu = (
+         (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
+         / delta_time
+         / 1e9
+     )
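
A quick numeric illustration of the point above, with made-up numbers: if one GPU accumulates 100 TFLOP of work over a 10 s log window, dividing the window by pp_size reports pp_size times the true rate.

# Made-up numbers, purely to illustrate the inflation described above.
pp_size = 4
delta_time = 10.0                 # seconds in the log window
total_num_flops_per_gpu = 100e12  # FLOPs accumulated per GPU in that window

inflated = total_num_flops_per_gpu / (delta_time / pp_size) / 1e12  # 40.0 TF/s
correct = total_num_flops_per_gpu / delta_time / 1e12               # 10.0 TF/s
print(inflated, correct)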

Member

@SungMinCho what's your take on this comment?

@SungMinCho commented Dec 16, 2025

Hi @markmc @zhuohan123 @bwasti I moved #28859 into this PR to bypass internal codebase sync problems. Could you guys review this one last time and land? If anything, by default this functionality is turned off so it should be fairly safe to land. Thanks!

@zhuohan123 left a comment

LGTM! I think the main todo is to add support for more complex attention types?

zhuohan123 added the ready label Dec 16, 2025
@zhuohan123 zhuohan123 enabled auto-merge (squash) December 16, 2025 03:43
mergify bot commented Dec 16, 2025

Hi @SungMinCho, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@SungMinCho

> LGTM! I think the main todo is to add support for more complex attention types?

Yes indeed (and maybe verify more sophisticated parallelism combinations etc).

auto-merge was automatically disabled December 16, 2025 03:56

Head branch was pushed to by a user without write access


SungMinCho force-pushed the main branch 2 times, most recently from af378f9 to 0b0adc5, December 16, 2025 09:31
markmc changed the title from "Add mfu stats logging" to "[Metrics] Model FLOPs Utilization estimation" Dec 16, 2025
@markmc commented Dec 16, 2025

> How to use
>
> Set VLLM_MFU_LOGGING_LEVEL=1 when launching the vLLM server to enable MFU logging.
>
> (VLLM_MFU_LOGGING_LEVEL=2 is verbose mode for debugging purposes for experts) (By default VLLM_MFU_LOGGING_LEVEL=0 which disables MFU logging).

As per #25700 I think we should add a config option for this

ObservabilityConfig is probably the right place for it - e.g. --mfu-metrics-level

However, I'd be inclined to do something more descriptive like --mfu-metrics=aggregated/per-gpu or something

And, at first glance, I think some of the verbose logging is more like debug logging for the calculation itself - e.g. we wouldn't add Prometheus metrics for most of those, probably - so I'd just log that stuff with log.debug() or add --mfu-metrics-debug
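
A rough sketch of the kind of config knob being suggested here; the class is a standalone stand-in for vLLM's ObservabilityConfig, and the field names and values are illustrative only (the PR as merged ended up with --enable-mfu-metrics plus a VLLM_DEBUG_MFU_METRICS env var instead).

from dataclasses import dataclass
from typing import Literal

@dataclass
class ObservabilityConfigSketch:
    # "off" keeps current behaviour; "aggregated"/"per-gpu" mirror the
    # --mfu-metrics=aggregated/per-gpu idea floated above.
    mfu_metrics: Literal["off", "aggregated", "per-gpu"] = "off"
    # Calculation internals would go through log.debug() or a separate
    # debug switch rather than becoming Prometheus metrics.
    mfu_metrics_debug: bool = False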

@markmc commented Dec 16, 2025

xref to my PR to add Prometheus support - SungMinCho#3 - which I guess you prefer we do as a follow-up

vllm/envs.py Outdated
@@ -244,6 +244,7 @@
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD: int = 256
VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"
VLLM_USE_V2_MODEL_RUNNER: bool = False
VLLM_MFU_LOGGING_LEVEL: int = 0 # 0: disabled, 1: enabled, 2: verbose
Collaborator

can you make this an engine arg instead?

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
And use VLLM_DEBUG_MFU_METRICS to enable debugging.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc commented Dec 17, 2025

Rebased to pick up #30878

@markmc markmc enabled auto-merge (squash) December 17, 2025 19:40
@SungMinCho commented Dec 17, 2025

Thank you @markmc for all of these follow-up commits. I think they all make sense. (I was surprised by the test file because I had that exact same updated version in my internal diff, which I just didn't care to include in this PR, but you somehow wrote the exact replica lol.)

Yes, I absolutely agree with the updated CLI arguments. Thank you for the clean refactorings too. Sorry about the numerous back-and-forths.

Let me just add one more commit on top to include these changes:

  • In the process of separating MFU logs from the main logger, I think we lost visibility into which Engine the MFU is being reported from (e.g. "Engine 001: ..."), which is important when we have DP>1. Let me bring that back.
  • As for the PP-related comment above: I gave it a second thought, and the comment may be right. My initial reasoning was that perf stats are calculated per PP rank but the duration is measured globally. But if vLLM is doing PP pipelining correctly, then the duration may also already be per PP rank. Let me include that fix too. Unfortunately, PP doesn't seem to work for gpt-oss at the moment, so I can't empirically verify it either way. (A while back it did work, but its efficiency was too far off to be conclusive.) Let me just include that fix anyway.

After that I'll import to fbsource and proceed to land.

Signed-off-by: SungMinCho <tjdals4565@gmail.com>
auto-merge was automatically disabled December 17, 2025 21:09

Head branch was pushed to by a user without write access

@SungMinCho

@markmc JFYI I pushed a new commit as foretold above.

I experimented with a DP2-TP4-EP8 setup and got the logs below, which confirm that the new Engine visibility works and that your new CLI arguments work well. I will proceed to import and merge the PR unless you object.

INFO 12-17 13:10:02 [loggers.py:257] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 113.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:02 [perf.py:1215] Engine 000: MFU: 1.8 TF/s/GPU 57.2 GB/s/GPU
INFO 12-17 13:10:02 [loggers.py:257] Engine 001: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:02 [perf.py:1215] Engine 001: MFU: 0.4 TF/s/GPU 0.4 GB/s/GPU
INFO 12-17 13:10:12 [loggers.py:257] Engine 000: Avg prompt throughput: 50394.3 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 85 reqs, Waiting: 43 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:12 [perf.py:1215] Engine 000: MFU: 128.4 TF/s/GPU 85.3 GB/s/GPU
INFO 12-17 13:10:12 [loggers.py:257] Engine 001: Avg prompt throughput: 50394.4 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 85 reqs, Waiting: 43 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:12 [perf.py:1215] Engine 001: MFU: 128.4 TF/s/GPU 85.3 GB/s/GPU
INFO 12-17 13:10:22 [loggers.py:257] Engine 000: Avg prompt throughput: 25799.1 tokens/s, Avg generation throughput: 5975.1 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:22 [perf.py:1215] Engine 000: MFU: 81.2 TF/s/GPU 748.1 GB/s/GPU
INFO 12-17 13:10:22 [loggers.py:257] Engine 001: Avg prompt throughput: 25799.3 tokens/s, Avg generation throughput: 5975.1 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:22 [perf.py:1215] Engine 001: MFU: 81.2 TF/s/GPU 748.1 GB/s/GPU
INFO 12-17 13:10:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:32 [perf.py:1215] Engine 000: MFU: 24.4 TF/s/GPU 1131.4 GB/s/GPU
INFO 12-17 13:10:32 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:32 [perf.py:1215] Engine 001: MFU: 24.4 TF/s/GPU 1131.4 GB/s/GPU
INFO 12-17 13:10:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9138.8 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:42 [perf.py:1215] Engine 000: MFU: 25.4 TF/s/GPU 1190.5 GB/s/GPU
INFO 12-17 13:10:42 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9138.8 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:42 [perf.py:1215] Engine 001: MFU: 25.4 TF/s/GPU 1190.5 GB/s/GPU
INFO 12-17 13:10:52 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9070.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:52 [perf.py:1215] Engine 000: MFU: 26.1 TF/s/GPU 1241.1 GB/s/GPU
INFO 12-17 13:10:52 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9070.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:52 [perf.py:1215] Engine 001: MFU: 26.1 TF/s/GPU 1241.1 GB/s/GPU
/usr/lib64/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown

@SungMinCho

@markmc JFYI, landing is currently blocked by a build failure in the internal trunk that is unrelated to this PR... We might have to wait until the oncall resolves that issue... It's really frustrating that I can't merge this PR to OSS without having to sync to the internal stack. @zhuohan123 is there any possible bypass?

@zhuohan123 zhuohan123 enabled auto-merge (squash) December 17, 2025 23:03
@SungMinCho

Assuming @zhuohan123 can bypass the internal sync and click the merge button,

the current CI still seems to have 2 failures.

(screenshot: the two failing CI checks)

Do we know if this is a known issue at the moment? (cc @markmc)

@zhuohan123 zhuohan123 merged commit a0b782f into vllm-project:main Dec 18, 2025
52 checks passed
@SungMinCho

Thanks @zhuohan123 for merging!

@github-project-automation github-project-automation bot moved this from Backlog to Done in Metrics & Observability Dec 19, 2025
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
@markmc markmc moved this from Done to Done - 0.14 in Metrics & Observability Feb 4, 2026

Labels

ready, v1

Projects

Status: Done - 0.14
