feat: update flashinfer ar oneshot params #22108

yyihuang · 2025-08-01T22:32:23Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Just for use_oneshot auto-deduction.

The flashinfer interface is updated in flashinfer-ai/flashinfer#1365 (to be included in the next release).
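While both flashinfer versions are in circulation, a caller may need to tolerate either signature. The following is a minimal sketch, assuming the only difference is whether the fused all-reduce function accepts a `use_oneshot` keyword. `filter_supported_kwargs` and the two stub functions are hypothetical stand-ins for illustration, not vLLM or flashinfer APIs.

```python
import inspect


def filter_supported_kwargs(fn, **kwargs):
    """Drop keyword arguments that `fn` does not accept (hypothetical helper)."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter means fn accepts any keyword, so keep everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs
    return {k: v for k, v in kwargs.items() if k in params}


# Stand-ins for the old and new interfaces (NOT the real flashinfer functions).
def fused_allreduce_old(hidden_dim, launch_with_pdl, use_oneshot):
    return ("old", use_oneshot)


def fused_allreduce_new(hidden_dim, launch_with_pdl):
    return ("new", "auto")


call_args = dict(hidden_dim=4096, launch_with_pdl=True, use_oneshot=True)
old_result = fused_allreduce_old(
    **filter_supported_kwargs(fused_allreduce_old, **call_args))
new_result = fused_allreduce_new(
    **filter_supported_kwargs(fused_allreduce_new, **call_args))
print(old_result, new_result)
```

This PR itself simply drops the argument, which is the simpler option once the new release is the minimum supported version.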

Test Plan

vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --no-enable-prefix-caching -tp 4 --compilation-config='{"pass_config": {"enable_flashinfer_allreduce_fusion": true}, "custom_ops": ["+rms_norm"], "level":3}'

DURATION_SECONDS=60; qps=10; vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --request-rate "$qps" --num-prompts $((DURATION_SECONDS * qps))

Test Result

-main branch
============ Serving Benchmark Result ============
Successful requests: 600
Request rate configured (RPS): 10.00
Benchmark duration (s): 60.63
Total input tokens: 305347
Total generated tokens: 90000
Request throughput (req/s): 9.90
Output token throughput (tok/s): 1484.31
Total Token throughput (tok/s): 6520.18
---------------Time to First Token----------------
Mean TTFT (ms): 18.26
Median TTFT (ms): 17.57
P99 TTFT (ms): 28.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.61
Median TPOT (ms): 4.60
P99 TPOT (ms): 4.89
---------------Inter-token Latency----------------
Mean ITL (ms): 4.61
Median ITL (ms): 4.40
P99 ITL (ms): 8.97

-this branch
============ Serving Benchmark Result ============
Successful requests: 600
Request rate configured (RPS): 10.00
Benchmark duration (s): 60.64
Total input tokens: 305347
Total generated tokens: 90000
Request throughput (req/s): 9.89
Output token throughput (tok/s): 1484.19
Total Token throughput (tok/s): 6519.67
---------------Time to First Token----------------
Mean TTFT (ms): 17.77
Median TTFT (ms): 17.51
P99 TTFT (ms): 27.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 4.60
Median TPOT (ms): 4.60
P99 TPOT (ms): 4.85
---------------Inter-token Latency----------------
Mean ITL (ms): 4.60
Median ITL (ms): 4.41
P99 ITL (ms): 8.94
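
To make the comparison easier to read, the relative change of each latency and throughput metric can be computed directly. This is a small sketch that only uses numbers copied from the two runs above; `relative_change_pct` is a helper name introduced here for illustration.

```python
# Metrics copied from the two benchmark runs above: (main branch, this branch).
results = {
    "Output token throughput (tok/s)": (1484.31, 1484.19),
    "Mean TTFT (ms)": (18.26, 17.77),
    "P99 TTFT (ms)": (28.37, 27.39),
    "Mean TPOT (ms)": (4.61, 4.60),
    "P99 TPOT (ms)": (4.89, 4.85),
    "Mean ITL (ms)": (4.61, 4.60),
    "P99 ITL (ms)": (8.97, 8.94),
}


def relative_change_pct(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {name: relative_change_pct(b, a) for name, (b, a) in results.items()}
for name, pct in changes.items():
    print(f"{name:34s} {pct:+.2f}%")
```

The TTFT differences are around 3% and every other metric changes by less than 1%. With a single run per branch, this is hard to distinguish from run-to-run noise.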

(Optional) Documentation Update

github-actions · 2025-08-01T22:32:30Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request updates the flashinfer all-reduce fusion parameters by removing the use_oneshot argument from the trtllm_allreduce_fusion function call. This change aligns with a recent update in the flashinfer library where this parameter is now auto-deduced. The change is correct and necessary to maintain compatibility with the updated dependency.

mgoin · 2025-08-02T20:54:45Z

@yyihuang do you know when the next release will be?

yyihuang · 2025-08-07T05:30:50Z

@mgoin The new flashinfer release has been available since yesterday.

zou3519 · 2025-08-07T05:44:35Z

vllm/compilation/collective_fusion.py

@@ -457,7 +457,6 @@ def call_trtllm_fused_allreduce_norm(
                hidden_dim=allreduce_in.shape[-1],
                workspace_ptrs=_FI_WORKSPACE_TENSOR,
                launch_with_pdl=launch_with_pdl,
-                use_oneshot=True,


Do you have more context on this? At the very least, the comment above might need to be updated: "For the sizes that are smaller than the max size, we only use flashinfer one shot allreduce". Is there a test we can add for this?

yyihuang · 2025-08-07

We deduce the strategy from the token count: https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/comm/trtllm_ar.py#L826.

The chosen strategy determines which kernel is called: https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/comm/trtllm_allreduce_fusion.cuh#L1388-L1400

We can add a unit test for token_num > 128 if needed. Using some general model tests would also be fine.
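
The kind of unit test suggested here could be sketched as follows. This is only an illustration: `select_strategy`, the `Strategy` enum, the 128-token cutoff, and the inclusive boundary are all assumptions drawn from the `token_num > 128` remark, not the actual flashinfer logic, which lives in the linked `trtllm_ar.py`.

```python
from enum import Enum


class Strategy(Enum):
    ONESHOT = "oneshot"
    TWOSHOT = "twoshot"


# Assumed cutoff, taken from the "token_num > 128" remark; not read from flashinfer.
ASSUMED_ONESHOT_MAX_TOKENS = 128


def select_strategy(token_num: int,
                    oneshot_max_tokens: int = ASSUMED_ONESHOT_MAX_TOKENS) -> Strategy:
    """Hypothetical stand-in for the library's auto-deduction by token count."""
    if token_num <= 0:
        raise ValueError("token_num must be positive")
    if token_num <= oneshot_max_tokens:
        return Strategy.ONESHOT
    return Strategy.TWOSHOT


def test_strategy_boundaries():
    # Cover both sides of the assumed cutoff, including the boundary itself.
    cases = {
        1: Strategy.ONESHOT,
        128: Strategy.ONESHOT,
        129: Strategy.TWOSHOT,
        4096: Strategy.TWOSHOT,
    }
    for token_num, expected in cases.items():
        assert select_strategy(token_num) is expected


test_strategy_boundaries()
```

A real test would call the fused all-reduce on both sides of the cutoff and compare the output against the unfused reference, rather than checking the selection alone.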

yyihuang · 2025-08-07

You might also be interested in the first PR in the series on flashinfer's allreduce_fusion: #20691

ilmarkov · 2025-08-07

@yyihuang Do you have benchmark results for oneshot vs. twoshot? First, the choice of two shot should depend not only on token_num but also on world_size, similar to what is done in custom_all_reduce.cuh. Second, in my benchmarking, two shot only made sense on Hopper, whereas one shot was better across all workloads on Blackwell.
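
A selection policy that takes these points into account might look like the sketch below. Everything in it is hypothetical: `should_use_oneshot`, the architecture strings, and the per-world-size cutoff table are introduced for illustration, and the cutoff values in the usage lines are placeholders rather than measurements.

```python
from typing import Mapping, Optional


def should_use_oneshot(
    token_num: int,
    world_size: int,
    arch: str,
    twoshot_min_tokens: Mapping[int, int],
) -> bool:
    """Hypothetical policy: use one shot unless measurements say two shot wins.

    `twoshot_min_tokens` maps world_size to the smallest token count at which
    two shot was measured to be faster; it must come from real benchmarks.
    """
    if arch == "blackwell":
        # Reported in this thread: one shot was better across all workloads.
        return True
    cutoff: Optional[int] = twoshot_min_tokens.get(world_size)
    if cutoff is None:
        # No measurement for this world size: keep one shot as the default.
        return True
    return token_num < cutoff


# Placeholder cutoffs for illustration only; these are NOT measured values.
example_cutoffs = {4: 512, 8: 256}
print(should_use_oneshot(64, 4, "hopper", example_cutoffs))
print(should_use_oneshot(1024, 4, "hopper", example_cutoffs))
print(should_use_oneshot(1024, 4, "blackwell", example_cutoffs))
```

Passing the cutoffs in as data keeps the policy separate from the benchmark results, so the table can be filled in per GPU generation as measurements become available.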

yyihuang · 2025-08-07

Hi @ilmarkov, I benchmarked on H200. Let me run more benchmarks on Blackwell.

yyihuang · 2025-08-10

The benchmark results are updated above. Thanks for the review, @mgoin @ilmarkov.

ilmarkov · 2025-08-12

@yyihuang I don't see any speedups with this PR. Also, in this PR, when I do isolated benchmarking against the non-fused version, FI has different performance results on H100 and B200, but on neither of them is two shot the one we'd want to use.

yyihuang · 2025-08-12

Thanks for your benchmark, @ilmarkov! Since the auto-deduction did not give a speedup, let's keep this as a draft PR until we figure out the problem shapes and use cases for each strategy across deep learning frameworks. In TensorRT-LLM we use this for the min-latency case (https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/thop/allreduceOp.cpp#L453), which might not be the target case here, or there might be some framework differences.

mergify · 2025-08-11T21:54:44Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yyihuang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

yyihuang requested review from zou3519, youkaichao and ProExpertProg as code owners August 1, 2025 22:32

gemini-code-assist bot reviewed Aug 1, 2025

yyihuang force-pushed the auto-oneshot branch from cd47002 to 59c5591 Compare August 1, 2025 22:40

mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 2, 2025

zou3519 reviewed Aug 7, 2025

yyihuang requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners August 11, 2025 02:29

mergify bot added the v1 label Aug 11, 2025

yyihuang marked this pull request as draft August 11, 2025 06:50

init

1aac329

Signed-off-by: Avery Yingyi Huang <[email protected]>

yyihuang force-pushed the auto-oneshot branch from cb460d6 to 96149ee Compare August 11, 2025 21:53

mergify bot added the rocm (Related to AMD ROCm), structured-output and speculative-decoding labels Aug 11, 2025

github-project-automation bot added this to Structured Output Aug 11, 2025

mergify bot added the tpu (Related to Google TPUs) and tool-calling labels Aug 11, 2025

github-project-automation bot added this to Tool Calling Aug 11, 2025

mergify bot added the needs-rebase label Aug 11, 2025

yyihuang force-pushed the auto-oneshot branch 2 times, most recently from 5c22ddb to 1aac329 Compare August 11, 2025 21:58

mergify bot removed the tpu (Related to Google TPUs) and needs-rebase labels Aug 11, 2025

Merge branch 'auto-oneshot' of github.com:yyihuang/vllm; branch 'main…

49a5a78

…' of github.com:vllm-project/vllm into auto-oneshot

yyihuang marked this pull request as ready for review August 11, 2025 22:05

yyihuang marked this pull request as draft August 12, 2025 14:32
