[Metrics] Model FLOPs Utilization estimation #30738
zhuohan123 merged 9 commits into vllm-project:main
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request introduces MFU (Model FLOPs Utilization) stats logging, which is a valuable feature for performance monitoring. The implementation is well-structured, particularly the new vllm/v1/metrics/perf.py file, which uses a modular parser chain and component-based metrics calculation.
My main feedback is on a potential correctness issue in the final rate calculation for TFLOPs/s and GB/s when pipeline parallelism is enabled. The current logic seems to inflate these metrics by the pipeline parallel size. I've provided a suggestion to correct this.
Overall, this is a great addition to vLLM's observability features.
vllm/v1/metrics/perf.py
Outdated
```python
delta_time_per_gpu = delta_time / self.pp_size

avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
avg_gbps_per_gpu = (
    (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
    / delta_time_per_gpu
    / 1e9
)
```
The calculation of avg_tflops_per_gpu and avg_gbps_per_gpu appears to be incorrect when pipeline parallelism is used (pp_size > 1).
The total_num_flops_per_gpu and total_*_bytes_per_gpu values are already calculated on a per-GPU basis. Dividing delta_time by pp_size to get delta_time_per_gpu incorrectly inflates the reported TFLOPs/s and GB/s rates by a factor of pp_size.
The rate should be calculated over the total delta_time during which the metrics were accumulated. Additionally, it's good practice to handle the case where delta_time could be zero or negative to avoid a ZeroDivisionError.
```diff
-delta_time_per_gpu = delta_time / self.pp_size
-avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time_per_gpu / 1e12
-avg_gbps_per_gpu = (
-    (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
-    / delta_time_per_gpu
-    / 1e9
-)
+if delta_time <= 0.0:
+    avg_tflops_per_gpu = 0.0
+    avg_gbps_per_gpu = 0.0
+else:
+    avg_tflops_per_gpu = self.total_num_flops_per_gpu / delta_time / 1e12
+    avg_gbps_per_gpu = (
+        (self.total_read_bytes_per_gpu + self.total_write_bytes_per_gpu)
+        / delta_time
+        / 1e9
+    )
```
Hi @markmc @zhuohan123 @bwasti I moved #28859 into this PR to bypass internal codebase sync problems. Could you guys review this one last time and land? In any case, this functionality is turned off by default, so it should be fairly safe to land. Thanks!
zhuohan123 left a comment:
LGTM! I think the main todo is to add support for more complex attention types?
Hi @SungMinCho, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Yes indeed (and maybe verify more sophisticated parallelism combinations, etc.).
Head branch was pushed to by a user without write access
Hi @SungMinCho, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Hi @SungMinCho, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from af378f9 to 0b0adc5
As per #25700 I think we should add a config option for this
However, I'd be inclined to do something more descriptive. And, at first glance, I think some of the verbose logging is more like debug logging for the calculation itself - e.g. we wouldn't add Prometheus metrics for most of those, probably - so I'd just log that stuff at the debug level.
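A minimal sketch of the split being suggested (the names here are illustrative, not the PR's actual code): keep the headline rate on the periodic info log, and push the per-component breakdown to debug-level logging instead of exposing it as metrics.

```python
import logging

logger = logging.getLogger("vllm.mfu")  # hypothetical logger name

def log_mfu_stats(avg_tflops_per_gpu: float, breakdown: dict[str, float]) -> None:
    # Headline number rides along with the existing periodic log line.
    logger.info("Avg %.1f TFLOPs/s per GPU", avg_tflops_per_gpu)
    # The detailed breakdown is mainly useful when debugging the estimator
    # itself, so gate it at debug level rather than adding metrics for it.
    if logger.isEnabledFor(logging.DEBUG):
        for component, tflops in breakdown.items():
            logger.debug("  %s: %.1f TFLOPs/s", component, tflops)
```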
xref to my PR to add Prometheus support - SungMinCho#3 - which I guess you prefer we do as a follow-up
vllm/envs.py
Outdated
```diff
@@ -244,6 +244,7 @@
 VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD: int = 256
 VLLM_COMPILE_CACHE_SAVE_FORMAT: Literal["binary", "unpacked"] = "binary"
 VLLM_USE_V2_MODEL_RUNNER: bool = False
+VLLM_MFU_LOGGING_LEVEL: int = 0  # 0: disabled, 1: enabled, 2: verbose
```
can you make this an engine arg instead?
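For context, a rough sketch of the difference being asked for (this is an argparse analogy, not the actual vLLM EngineArgs plumbing): the env var is read implicitly at import time, while an engine/CLI arg is explicit and discoverable via --help.

```python
import argparse
import os

# Env-var style, as in the diff above: read implicitly, invisible in --help.
mfu_logging_level = int(os.getenv("VLLM_MFU_LOGGING_LEVEL", "0"))

# Engine-arg style, as the reviewer requests: explicit, validated, documented.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-mfu-metrics",
    action="store_true",
    help="Log Model FLOPs Utilization stats with the periodic metrics logs.",
)
args = parser.parse_args([])  # e.g. pass ["--enable-mfu-metrics"] to enable
```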
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
And use VLLM_DEBUG_MFU_METRICS to enable debugging. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Rebased to pick up #30878
Thank you @markmc for all of these follow-up commits. I think they all make sense. (I was surprised by the test file, because I had that exact same updated version in my internal diff which I just didn't care to include in this PR, but you somehow wrote the exact replica lol.) Yes, I absolutely agree with the updated CLI arguments. Thank you for the clean refactorings too. Sorry about the numerous back-and-forths. Let me just add one more commit on top to include these changes:
After that I'll import to fbsource and proceed to land.
Signed-off-by: SungMinCho <tjdals4565@gmail.com>
Head branch was pushed to by a user without write access
@markmc JFYI I pushed a new commit as foretold above. I experimented with a DP2-TP4-EP8 setup and got the logs below, which confirm that the new Engine visibility works and that your new CLI arguments work well. I will proceed to import and merge the PR unless you object.
@markmc JFYI landing is currently blocked due to a build failure in internal trunk which is irrelevant to this PR... We might have to wait until the oncall resolves that issue... It's really frustrating that I can't merge this PR to OSS without having to sync to the internal stack. @zhuohan123 is there any possible bypass?
Assuming @zhuohan123 can bypass the internal sync and click the merge button, the current CI still seems to have 2 failures.
Do we know if these are known issues at the moment? (cc @markmc)
Thanks @zhuohan123 for merging!
Signed-off-by: SungMinCho <tjdals4565@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: SungMinCho <tjdals4565@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Signed-off-by: SungMinCho tjdals4565@gmail.com
TL;DR
This PR implements optional "MFU stats logging", which appends "per-GPU flops/bandwidth" information to the existing periodic logs, reporting the average compute/memory performance of the GPUs in each Engine over that log interval. These stats are calculated with minimal overhead by feeding SchedulerOutput into the analytic, config-based perf calculator at every iteration.
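As a rough sketch of the idea (not the PR's actual vllm/v1/metrics/perf.py code): an analytic estimator derives per-iteration FLOPs from the model config and the number of scheduled tokens, accumulates them, and converts the total into a rate, and an MFU fraction against the GPU's peak, at each log interval. All names below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SimpleMfuEstimator:
    """Illustrative sketch only; not the implementation in this PR."""

    num_params: float    # parameter count taken from the model config
    peak_tflops: float   # advertised peak TFLOPs/s of a single GPU
    total_flops: float = 0.0

    def record_iteration(self, num_scheduled_tokens: int) -> None:
        # Classic analytic approximation: ~2 FLOPs per parameter per token.
        # (Ignores the attention terms that a full estimator would add.)
        self.total_flops += 2.0 * self.num_params * num_scheduled_tokens

    def report(self, delta_time: float) -> tuple[float, float]:
        """Return (achieved TFLOPs/s, MFU fraction) for the interval, then reset."""
        if delta_time <= 0.0:
            return 0.0, 0.0
        tflops_per_s = self.total_flops / delta_time / 1e12
        self.total_flops = 0.0
        return tflops_per_s, tflops_per_s / self.peak_tflops
```

On this approximation, MFU is simply the achieved TFLOPs/s divided by the hardware's peak TFLOPs/s; the actual calculator works per component and also accumulates read/write bytes to report bandwidth.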
How to use
Set `--enable-mfu-metrics` when launching the vLLM server to enable MFU logging. `VLLM_DEBUG_MFU_METRICS=1` with `VLLM_LOGGING_LEVEL=DEBUG` is for debugging purposes, for developers working on the metrics calculations.

Purpose
To track the hardware performance over the course of vLLM serving.
Examples
(for each example, please find the MFU stats at the end of each logged line)
(all examples were gathered on B200 GPUs)
GPT-OSS 120B TP=8 BatchSize=256 Input=6K Output=3K NumBatch=1
with `VLLM_MFU_LOGGING_LEVEL=1`

with `VLLM_MFU_LOGGING_LEVEL=2`

https://gist.github.com/SungMinCho/9ed5254e4bd1b3e5eb05a360df0dcc88 gives you the details behind the logged MFU stats such as
Some more examples with different parallelism setups
GPT-OSS 120B TP=4 BatchSize=256 Input=6K Output=3K NumBatch=1
GPT-OSS 120B TP=1 BatchSize=256 Input=6K Output=3K NumBatch=1
GPT-OSS 120B DP=2 TP=4 EP=8 BatchSize=256 Input=6K Output=3K NumBatch=1
Note: This is to show that it works well under multiple-engine scenarios (i.e. DP > 1)
Test Plan
See Examples above.
Test Result
See Examples above.
Notes
This PR has been moved from #28859 due to code sync issues with the Meta-internal codebase. See #28859 for some of the original discussions and reviews.