
[Metrics] Model FLOPs Utilization estimation #30738

Merged
zhuohan123 merged 9 commits into vllm-project:main from SungMinCho:main on Dec 18, 2025

Conversation

@SungMinCho commented Dec 16, 2025

Signed-off-by: SungMinCho tjdals4565@gmail.com

TL;DR

This PR implements optional "MFU stats logging", which appends per-GPU flops/bandwidth information to the existing periodic logs, reporting the average compute/memory performance of the GPUs in each Engine over that log interval. These stats are calculated with minimal overhead by feeding the SchedulerOutput into an analytic, config-based perf calculator at every iteration.
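
A minimal sketch of the kind of analytic, config-based estimate described above, assuming a plain dense transformer and counting only the large GEMMs; the names and formula are illustrative and far simpler than the component-based calculator this PR adds in vllm/v1/metrics/perf.py.

from dataclasses import dataclass

@dataclass
class ToyModelConfig:
    num_layers: int
    hidden_size: int
    intermediate_size: int
    vocab_size: int

def estimate_iteration_flops(cfg: ToyModelConfig, num_tokens: int) -> float:
    """Very rough dense-transformer FLOPs for one forward pass over num_tokens.

    Counts only the attention projections, the MLP, and the LM head;
    attention score FLOPs, MoE routing, etc. are ignored.
    """
    per_token_per_layer = (
        2 * 4 * cfg.hidden_size * cfg.hidden_size          # Q/K/V/O projections
        + 2 * 2 * cfg.hidden_size * cfg.intermediate_size  # MLP up + down
    )
    lm_head = 2 * cfg.hidden_size * cfg.vocab_size
    return num_tokens * (cfg.num_layers * per_token_per_layer + lm_head)

# Made-up config values, purely for illustration.
cfg = ToyModelConfig(num_layers=32, hidden_size=4096, intermediate_size=14336, vocab_size=128000)
print(f"{estimate_iteration_flops(cfg, num_tokens=8192) / 1e12:.1f} TFLOPs this iteration")

Accumulating such per-iteration estimates over a log interval and dividing by the elapsed time and the number of GPUs gives the TF/s/GPU figure appended to the periodic log line.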

How to use

Pass --enable-mfu-metrics when launching the vLLM server to enable MFU logging.

Setting VLLM_DEBUG_MFU_METRICS=1 together with VLLM_LOGGING_LEVEL=DEBUG adds debug output for developers working on the metrics calculations.
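
These switches are normally set in the shell that launches vllm serve; below is a hedged sketch of setting them programmatically instead (assumption: vllm.envs reads environment variables lazily, so setting them before the server process starts has the same effect).

import os

# Enable the extra MFU debug output described above before vLLM starts.
os.environ["VLLM_DEBUG_MFU_METRICS"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"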

Purpose

To track the hardware performance over the course of vLLM serving.

Examples

(In each example, the MFU stats appear at the end of each logged line.)
(All examples were gathered on B200 GPUs.)

GPT-OSS 120B TP=8 BatchSize=256 Input=6K Output=3K NumBatch=1

with VLLM_MFU_LOGGING_LEVEL=1

INFO 12-15 17:38:18 [loggers.py:259] Engine 000: Avg prompt throughput: 3599.9 tokens/s, Avg generation throughput: 199.6 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, MFU: 6.6 TF/s/GPU 105.0 GB/s/GPU
INFO 12-15 17:38:28 [loggers.py:259] Engine 000: Avg prompt throughput: 150003.2 tokens/s, Avg generation throughput: 938.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%, MFU: 259.9 TF/s/GPU 491.7 GB/s/GPU
INFO 12-15 17:38:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18609.5 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%, MFU: 32.6 TF/s/GPU 1138.1 GB/s/GPU
INFO 12-15 17:38:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18479.1 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%, MFU: 33.3 TF/s/GPU 1191.9 GB/s/GPU
INFO 12-15 17:38:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18428.1 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%, MFU: 34.2 TF/s/GPU 1249.8 GB/s/GPU
INFO 12-15 17:39:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18201.0 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%, MFU: 34.8 TF/s/GPU 1294.4 GB/s/GPU
INFO 12-15 17:39:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2130.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 4.1 TF/s/GPU 159.4 GB/s/GPU
INFO 12-15 17:39:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

with VLLM_MFU_LOGGING_LEVEL=2
https://gist.github.com/SungMinCho/9ed5254e4bd1b3e5eb05a360df0dcc88

gives you the details behind the logged MFU stats, such as:

  • per-component breakdown of the reported flops/bytes
  • breakdown of the run context (i.e. the input to the MFU calculator for that log interval, e.g. prefill_num_tokens)
  • log duration, MFU calculation duration, and MFU calculation overhead
  • etc.
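
To put the logged numbers in context, here is a small sketch that extracts the "MFU: X TF/s/GPU Y GB/s/GPU" suffix from a log line and converts the compute figure into a utilization fraction; the peak TFLOP/s value is an assumption you must supply for your own GPU and dtype, it is not part of the log.

import re

MFU_RE = re.compile(r"MFU: ([\d.]+) TF/s/GPU ([\d.]+) GB/s/GPU")

def parse_mfu(line: str) -> tuple[float, float] | None:
    """Return (tflops_per_gpu, gbps_per_gpu) from a periodic log line, if present."""
    m = MFU_RE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None

def utilization(tflops_per_gpu: float, peak_tflops: float) -> float:
    """Fraction of an assumed per-GPU peak; supply the peak for your GPU and dtype."""
    return tflops_per_gpu / peak_tflops

line = ("Engine 000: Avg prompt throughput: 150003.2 tokens/s, "
        "MFU: 259.9 TF/s/GPU 491.7 GB/s/GPU")
tf, gb = parse_mfu(line)
print(f"{tf} TF/s/GPU is {utilization(tf, peak_tflops=2000.0):.1%} of an assumed 2000 TF/s peak")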

Some more examples with different parallelism setups

GPT-OSS 120B TP=4 BatchSize=256 Input=6K Output=3K NumBatch=1

(APIServer pid=2363162) INFO 12-15 18:27:00 [loggers.py:259] Engine 000: Avg prompt throughput: 101389.0 tokens/s, Avg generation throughput: 77.4 tokens/s, Running: 170 reqs, Waiting: 86 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 349.4 TF/s/GPU 391.1 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:10 [loggers.py:259] Engine 000: Avg prompt throughput: 51584.2 tokens/s, Avg generation throughput: 9711.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%, MFU: 211.3 TF/s/GPU 1339.8 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15227.4 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.8%, Prefix cache hit rate: 0.0%, MFU: 54.0 TF/s/GPU 1870.2 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:30 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15176.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.3%, Prefix cache hit rate: 0.0%, MFU: 55.1 TF/s/GPU 1947.1 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:40 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15151.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.7%, Prefix cache hit rate: 0.0%, MFU: 56.3 TF/s/GPU 2026.6 GB/s/GPU
(APIServer pid=2363162) INFO 12-15 18:27:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15074.7 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.2%, Prefix cache hit rate: 0.0%, MFU: 57.4 TF/s/GPU 2098.3 GB/s/GPU

GPT-OSS 120B TP=1 BatchSize=256 Input=6K Output=3K NumBatch=1

(APIServer pid=2548211) INFO 12-15 18:32:12 [loggers.py:259] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 204.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%, MFU: 11.3 TF/s/GPU 772.8 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:22 [loggers.py:259] Engine 000: Avg prompt throughput: 29400.6 tokens/s, Avg generation throughput: 8.7 tokens/s, Running: 50 reqs, Waiting: 206 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%, MFU: 405.0 TF/s/GPU 269.3 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:32 [loggers.py:259] Engine 000: Avg prompt throughput: 37799.9 tokens/s, Avg generation throughput: 27.6 tokens/s, Running: 113 reqs, Waiting: 143 reqs, GPU KV cache usage: 17.4%, Prefix cache hit rate: 0.0%, MFU: 521.0 TF/s/GPU 344.9 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:42 [loggers.py:259] Engine 000: Avg prompt throughput: 37800.0 tokens/s, Avg generation throughput: 46.5 tokens/s, Running: 176 reqs, Waiting: 80 reqs, GPU KV cache usage: 25.6%, Prefix cache hit rate: 0.0%, MFU: 521.2 TF/s/GPU 349.3 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:32:52 [loggers.py:259] Engine 000: Avg prompt throughput: 37797.4 tokens/s, Avg generation throughput: 65.4 tokens/s, Running: 239 reqs, Waiting: 17 reqs, GPU KV cache usage: 33.8%, Prefix cache hit rate: 0.0%, MFU: 521.5 TF/s/GPU 353.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:02 [loggers.py:259] Engine 000: Avg prompt throughput: 10197.0 tokens/s, Avg generation throughput: 6142.2 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.7%, Prefix cache hit rate: 0.0%, MFU: 225.5 TF/s/GPU 2940.2 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6832.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.2%, Prefix cache hit rate: 0.0%, MFU: 95.6 TF/s/GPU 3238.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6807.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 37.6%, Prefix cache hit rate: 0.0%, MFU: 96.4 TF/s/GPU 3293.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6756.5 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 39.0%, Prefix cache hit rate: 0.0%, MFU: 96.7 TF/s/GPU 3334.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6755.3 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.5%, Prefix cache hit rate: 0.0%, MFU: 97.7 TF/s/GPU 3399.8 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:33:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6707.4 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.9%, Prefix cache hit rate: 0.0%, MFU: 98.1 TF/s/GPU 3440.7 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6603.0 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.3%, Prefix cache hit rate: 0.0%, MFU: 97.6 TF/s/GPU 3450.4 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6527.2 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 44.7%, Prefix cache hit rate: 0.0%, MFU: 97.4 TF/s/GPU 3472.5 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6576.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.1%, Prefix cache hit rate: 0.0%, MFU: 99.2 TF/s/GPU 3561.0 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6627.8 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.5%, Prefix cache hit rate: 0.0%, MFU: 100.9 TF/s/GPU 3651.6 GB/s/GPU
(APIServer pid=2548211) INFO 12-15 18:34:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6551.6 tokens/s, Running: 256 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.9%, Prefix cache hit rate: 0.0%, MFU: 100.8 TF/s/GPU 3671.8 GB/s/GPU

GPT-OSS 120B DP=2 TP=4 EP=8 BatchSize=256 Input=6K Output=3K NumBatch=1

Note: This is to show that it works well under multiple-engine scenarios (i.e. DP > 1)

INFO 12-15 18:51:34 [loggers.py:259] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 98.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 1.7 TF/s/GPU 50.0 GB/s/GPU
INFO 12-15 18:51:34 [loggers.py:259] Engine 001: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 0.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MFU: 0.4 TF/s/GPU 0.5 GB/s/GPU
INFO 12-15 18:51:44 [loggers.py:259] Engine 000: Avg prompt throughput: 62998.4 tokens/s, Avg generation throughput: 32.0 tokens/s, Running: 106 reqs, Waiting: 22 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%, MFU: 160.6 TF/s/GPU 106.9 GB/s/GPU
INFO 12-15 18:51:44 [loggers.py:259] Engine 001: Avg prompt throughput: 62998.7 tokens/s, Avg generation throughput: 32.0 tokens/s, Running: 106 reqs, Waiting: 22 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%, MFU: 160.6 TF/s/GPU 106.9 GB/s/GPU
INFO 12-15 18:51:54 [loggers.py:259] Engine 000: Avg prompt throughput: 13197.8 tokens/s, Avg generation throughput: 6962.0 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%, MFU: 51.6 TF/s/GPU 846.9 GB/s/GPU
INFO 12-15 18:51:54 [loggers.py:259] Engine 001: Avg prompt throughput: 13197.9 tokens/s, Avg generation throughput: 6962.0 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%, MFU: 51.6 TF/s/GPU 846.9 GB/s/GPU
INFO 12-15 18:52:04 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9136.6 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%, MFU: 24.5 TF/s/GPU 1136.6 GB/s/GPU
INFO 12-15 18:52:04 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9136.6 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%, MFU: 24.5 TF/s/GPU 1136.6 GB/s/GPU
INFO 12-15 18:52:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.9 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%, MFU: 25.5 TF/s/GPU 1198.2 GB/s/GPU
INFO 12-15 18:52:14 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.9 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%, MFU: 25.5 TF/s/GPU 1198.2 GB/s/GPU
INFO 12-15 18:52:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9073.2 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 26.2 TF/s/GPU 1248.0 GB/s/GPU
INFO 12-15 18:52:24 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9073.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%, MFU: 26.2 TF/s/GPU 1248.0 GB/s/GPU

Test Plan

See Examples above.

Test Result

See Examples above.

Notes

This PR has been moved from #28859 due to code sync issues with Meta-internal codebase. See #28859 for some of the original discussions and reviews.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces MFU (Model Flops Utilization) stats logging, which is a valuable feature for performance monitoring. The implementation is well-structured, particularly the new vllm/v1/metrics/perf.py file which uses a modular parser chain and component-based metrics calculation.

My main feedback is on a potential correctness issue in the final rate calculation for TFLOPs/s and GB/s when pipeline parallelism is enabled. The current logic seems to inflate these metrics by the pipeline parallel size. I've provided a suggestion to correct this.

Overall, this is a great addition to vLLM's observability features.

Comment on lines 1162 to 1169
delta_time_per_gpu = delta_time / self.pp_size

avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
avg_gbps_per_gpu = (
    (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
    / delta_time_per_gpu
    / 1e9
)

high

The calculation of avg_tflops_per_gpu and avg_gbps_per_gpu appears to be incorrect when pipeline parallelism is used (pp_size > 1).

The total_num_flops_per_gpu and total_*_bytes_per_gpu values are already calculated on a per-GPU basis. Dividing delta_time by pp_size to get delta_time_per_gpu incorrectly inflates the reported TFLOPs/s and GB/s rates by a factor of pp_size.

The rate should be calculated over the total delta_time during which the metrics were accumulated. Additionally, it's good practice to handle the case where delta_time could be zero or negative to avoid a ZeroDivisionError.

Suggested change
- delta_time_per_gpu = delta_time / self.pp_size
- avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
- avg_gbps_per_gpu = (
-     (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
-     / delta_time_per_gpu
-     / 1e9
- )
+ if delta_time <= 0.0:
+     avg_tflops_per_gpu = 0.0
+     avg_gbps_per_gpu = 0.0
+ else:
+     avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time / 1e12
+     avg_gbps_per_gpu = (
+         (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
+         / delta_time
+         / 1e9
+     )
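
A quick numeric illustration of the point above, with made-up numbers: if one GPU accumulates 100 TFLOP of work over a 10 s log window, dividing the window by pp_size reports pp_size times the true rate.

# Made-up numbers, purely to illustrate the inflation described above.
pp_size = 4
delta_time = 10.0                 # seconds in the log window
total_num_flops_per_gpu = 100e12  # FLOPs accumulated per GPU in that window

inflated = total_num_flops_per_gpu / (delta_time / pp_size) / 1e12  # 40.0 TF/s
correct = total_num_flops_per_gpu / delta_time / 1e12               # 10.0 TF/s
print(inflated, correct)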

Member

@SungMinCho what's your take on this comment?

@SungMinCho commented Dec 16, 2025

Hi @markmc @zhuohan123 @bwasti I moved #28859 into this PR to bypass internal codebase sync problems. Could you guys review this one last time and land? If anything, by default this functionality is turned off so it should be fairly safe to land. Thanks!

@zhuohan123 left a comment

LGTM! I think the main todo is to add support for more complex attention types?

zhuohan123 added the ready label Dec 16, 2025
@zhuohan123 zhuohan123 enabled auto-merge (squash) December 16, 2025 03:43
mergify bot commented Dec 16, 2025

Hi @SungMinCho, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@SungMinCho

> LGTM! I think the main todo is to add support for more complex attention types?

Yes indeed (and maybe verify more sophisticated parallelism combinations etc).

auto-merge was automatically disabled December 16, 2025 03:56

Head branch was pushed to by a user without write access


SungMinCho force-pushed the main branch 2 times, most recently from af378f9 to 0b0adc5, December 16, 2025 09:31
markmc changed the title from "Add mfu stats logging" to "[Metrics] Model FLOPs Utilization estimation" Dec 16, 2025
@markmc commented Dec 16, 2025

> How to use
>
> Set VLLM_MFU_LOGGING_LEVEL=1 when launching the vLLM server to enable MFU logging.
>
> (VLLM_MFU_LOGGING_LEVEL=2 is verbose mode for debugging purposes for experts) (By default VLLM_MFU_LOGGING_LEVEL=0 which disables MFU logging).

As per #25700 I think we should add a config option for this

ObservabilityConfig is probably the right place for it - e.g. --mfu-metrics-level

However, I'd be inclined to do something more descriptive like --mfu-metrics=aggregated/per-gpu or something

And, at first glance, I think some of the verbose logging is more like debug logging for the calculation itself - e.g. we wouldn't add Prometheus metrics for most of those, probably - so I'd just log that stuff with log.debug() or add --mfu-metrics-debug
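
A rough sketch of the kind of config knob being suggested here; the class is a standalone stand-in for vLLM's ObservabilityConfig, and the field names and values are illustrative only (the PR as merged ended up with --enable-mfu-metrics plus a VLLM_DEBUG_MFU_METRICS env var instead).

from dataclasses import dataclass
from typing import Literal

@dataclass
class ObservabilityConfigSketch:
    # "off" keeps current behaviour; "aggregated"/"per-gpu" mirror the
    # --mfu-metrics=aggregated/per-gpu idea floated above.
    mfu_metrics: Literal["off", "aggregated", "per-gpu"] = "off"
    # Calculation internals would go through log.debug() or a separate
    # debug switch rather than becoming Prometheus metrics.
    mfu_metrics_debug: bool = False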

@markmc commented Dec 16, 2025

xref to my PR to add Prometheus support - SungMinCho#3 - which I guess you prefer we do as a follow-up

vllm/envs.py Outdated
@@ -244,6 +244,7 @@
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD: int = 256
VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"
VLLM_USE_V2_MODEL_RUNNER: bool = False
VLLM_MFU_LOGGING_LEVEL: int = 0 # 0: disabled, 1: enabled, 2: verbose
Collaborator

can you make this an engine arg instead?

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
And use VLLM_DEBUG_MFU_METRICS to enable debugging.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc commented Dec 17, 2025

Rebased to pick up #30878

@markmc markmc enabled auto-merge (squash) December 17, 2025 19:40
@SungMinCho commented Dec 17, 2025

Thank you @markmc for all of these follow-up commits. I think they all make sense. (I was surprised by the test file because I had that exact same updated version in my internal diff, which I just didn't care to include in this PR, but you somehow wrote the exact replica lol.)

Yes, I absolutely agree with the updated CLI arguments. Thank you for the clean refactorings too. Sorry about the numerous back-and-forths.

Let me just add one more commit on top to include these changes:

  • In the process of separating MFU logs from the main logger, I think we lost visibility into which Engine the MFU is being reported from (e.g. "Engine 001: ..."), which is important when we have DP>1. Let me bring that back.
  • As for the PP-related comment above: I gave it a second thought, and the comment may be right. My initial reasoning was that perf stats are calculated per PP rank but the duration is measured globally. But if vLLM is doing PP pipelining correctly, then the duration may also already be per PP rank. Let me include that fix too. Unfortunately, PP doesn't seem to work for gpt-oss at the moment, so I can't empirically verify it either way. (A while back it did work, but its efficiency was too far off to be conclusive.) Let me just include that fix anyway.

After that I'll import to fbsource and proceed to land.

Signed-off-by: SungMinCho <tjdals4565@gmail.com>
auto-merge was automatically disabled December 17, 2025 21:09

Head branch was pushed to by a user without write access

@SungMinCho

@markmc JFYI I pushed a new commit as foretold above.

I experimented with a DP2-TP4-EP8 setup and got the logs below, which confirm that the new Engine visibility works and that your new CLI arguments work well. I will proceed to import and merge the PR unless you object.

INFO 12-17 13:10:02 [loggers.py:257] Engine 000: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 113.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:02 [perf.py:1215] Engine 000: MFU: 1.8 TF/s/GPU 57.2 GB/s/GPU
INFO 12-17 13:10:02 [loggers.py:257] Engine 001: Avg prompt throughput: 600.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:02 [perf.py:1215] Engine 001: MFU: 0.4 TF/s/GPU 0.4 GB/s/GPU
INFO 12-17 13:10:12 [loggers.py:257] Engine 000: Avg prompt throughput: 50394.3 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 85 reqs, Waiting: 43 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:12 [perf.py:1215] Engine 000: MFU: 128.4 TF/s/GPU 85.3 GB/s/GPU
INFO 12-17 13:10:12 [loggers.py:257] Engine 001: Avg prompt throughput: 50394.4 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 85 reqs, Waiting: 43 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:12 [perf.py:1215] Engine 001: MFU: 128.4 TF/s/GPU 85.3 GB/s/GPU
INFO 12-17 13:10:22 [loggers.py:257] Engine 000: Avg prompt throughput: 25799.1 tokens/s, Avg generation throughput: 5975.1 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:22 [perf.py:1215] Engine 000: MFU: 81.2 TF/s/GPU 748.1 GB/s/GPU
INFO 12-17 13:10:22 [loggers.py:257] Engine 001: Avg prompt throughput: 25799.3 tokens/s, Avg generation throughput: 5975.1 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.7%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:22 [perf.py:1215] Engine 001: MFU: 81.2 TF/s/GPU 748.1 GB/s/GPU
INFO 12-17 13:10:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:32 [perf.py:1215] Engine 000: MFU: 24.4 TF/s/GPU 1131.4 GB/s/GPU
INFO 12-17 13:10:32 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9147.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:32 [perf.py:1215] Engine 001: MFU: 24.4 TF/s/GPU 1131.4 GB/s/GPU
INFO 12-17 13:10:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9138.8 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:42 [perf.py:1215] Engine 000: MFU: 25.4 TF/s/GPU 1190.5 GB/s/GPU
INFO 12-17 13:10:42 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9138.8 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:42 [perf.py:1215] Engine 001: MFU: 25.4 TF/s/GPU 1190.5 GB/s/GPU
INFO 12-17 13:10:52 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9070.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:52 [perf.py:1215] Engine 000: MFU: 26.1 TF/s/GPU 1241.1 GB/s/GPU
INFO 12-17 13:10:52 [loggers.py:257] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9070.3 tokens/s, Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 0.0%
INFO 12-17 13:10:52 [perf.py:1215] Engine 001: MFU: 26.1 TF/s/GPU 1241.1 GB/s/GPU
/usr/lib64/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown

@SungMinCho

@markmc JFYI, landing is currently blocked by a build failure in the internal trunk that is unrelated to this PR... We might have to wait until the oncall resolves that issue... It's really frustrating that I can't merge this PR to OSS without having to sync to the internal stack. @zhuohan123 is there any possible bypass?

@zhuohan123 zhuohan123 enabled auto-merge (squash) December 17, 2025 23:03
@SungMinCho

Assuming @zhuohan123 can bypass the internal sync and click the merge button,

the current CI still seems to have 2 failures.

(screenshot: the two failing CI checks)

Do we know if this is a known issue at the moment? (cc @markmc)

@zhuohan123 zhuohan123 merged commit a0b782f into vllm-project:main Dec 18, 2025
52 checks passed
@SungMinCho

Thanks @zhuohan123 for merging!

@github-project-automation github-project-automation bot moved this from Backlog to Done in Metrics & Observability Dec 19, 2025
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
@markmc markmc moved this from Done to Done - 0.14 in Metrics & Observability Feb 4, 2026

Labels

ready, v1

Projects

Status: Done - 0.14
