Optimize input preparation for FlashInfer [2/N] #23174

WoosukKwon · 2025-08-19T10:44:57Z

Should be merged after #23147

Signed-off-by: Woosuk Kwon <[email protected]>

github-actions · 2025-08-19T10:45:06Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request introduces significant optimizations to the input preparation logic for the FlashInfer attention backend. The key changes include refactoring metadata handling by moving static parameters from FlashInferMetadata to FlashInferMetadataBuilder, replacing CPU-intensive PyTorch operations with faster NumPy equivalents, and substituting a slow torch.masked_select with a custom Triton kernel for preparing paged KV indices. Additionally, the calculation of max_seq_len is optimized by pre-computing it once. These changes are well-implemented and should lead to noticeable performance improvements. The code quality is high, and the optimizations are sound.
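For readers unfamiliar with the paged KV indices step mentioned above, here is a minimal NumPy sketch of the idea, not the code from this PR. It assumes a padded `block_table` of shape `[num_reqs, max_blocks]` and a per-request `num_blocks` array; the function names are hypothetical. The first function mirrors a mask-and-select approach (similar in spirit to `torch.masked_select`), and the second copies each request's valid prefix into a preallocated output, which is roughly the access pattern a per-request kernel could follow.

```python
import numpy as np


def paged_kv_indices_masked(block_table, num_blocks):
    # Hypothetical helper: build a boolean mask of valid slots, then select.
    max_blocks = block_table.shape[1]
    mask = np.arange(max_blocks)[None, :] < num_blocks[:, None]
    return block_table[mask]  # row-major order, padding dropped


def paged_kv_indices_copy(block_table, num_blocks):
    # Hypothetical helper: compute per-request offsets once, then copy each
    # request's valid block IDs into its slice of the flat output.
    indptr = np.zeros(len(num_blocks) + 1, dtype=np.int32)
    np.cumsum(num_blocks, out=indptr[1:])
    out = np.empty(indptr[-1], dtype=block_table.dtype)
    for i, n in enumerate(num_blocks):
        out[indptr[i]:indptr[i + 1]] = block_table[i, :n]
    return indptr, out


# Three requests using 2, 1, and 4 blocks; zeros are padding.
block_table = np.array(
    [[3, 7, 0, 0], [5, 0, 0, 0], [1, 2, 9, 4]], dtype=np.int32
)
num_blocks = np.array([2, 1, 4], dtype=np.int32)

indptr, indices = paged_kv_indices_copy(block_table, num_blocks)
masked = paged_kv_indices_masked(block_table, num_blocks)
```

Both functions produce the same flat index array; the copy variant also yields the `indptr` offsets, so each request's slice can be written independently of the others.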

mergify · 2025-08-19T12:48:44Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-08-22T15:25:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

nvpohanh · 2025-08-25T07:38:58Z

@WoosukKwon We found that this optimization can reduce gaps between decoding steps when running with low concurrency. Do you plan to continue working on this so that this can be merged? Thanks!

WoosukKwon added 7 commits August 18, 2025 21:22

[Misc] Minor refactoring for FlashInfer backend

10116c7

Signed-off-by: Woosuk Kwon <[email protected]>

Merge branch 'main' into woosuk/flashinfer-refactor

f0e0055

opt

c5500a7

Signed-off-by: Woosuk Kwon <[email protected]>

minor

0d72371

Signed-off-by: Woosuk Kwon <[email protected]>

minor

21ca74c

Signed-off-by: Woosuk Kwon <[email protected]>

minor

142ba7e

Signed-off-by: Woosuk Kwon <[email protected]>

Optimize input preparation for FlashInfer [2/N]

b76a83f

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon requested review from robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners August 19, 2025 10:44

mergify bot added speculative-decoding v1 labels Aug 19, 2025

gemini-code-assist bot reviewed Aug 19, 2025

minor

0d724be

mergify bot added the needs-rebase label Aug 19, 2025

WoosukKwon added 7 commits August 19, 2025 10:23

Merge branch 'main' into woosuk/flashinfer-refactor

f9feaeb

Merge branch 'main' into woosuk/flashinfer-refactor

b4bb998

Merge branch 'main' into woosuk/max-seq-len

d47024f

Merge branch 'main' into woosuk/max-seq-len

ef53f25

[Misc] Add max_seq_len to CommonAttentionMetadata

22b6b3e

Signed-off-by: Woosuk Kwon <[email protected]>

fix:

7ab918b

Signed-off-by: Woosuk Kwon <[email protected]>

update

e9328c2

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon requested a review from tdoublep as a code owner August 20, 2025 00:24

WoosukKwon changed the base branch from main to woosuk/max-seq-len August 20, 2025 00:24

mergify bot removed the needs-rebase label Aug 20, 2025

minor

c7a56f9

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon added the ready label Aug 20, 2025

Base automatically changed from woosuk/max-seq-len to main August 20, 2025 16:05

mergify bot added the needs-rebase label Aug 22, 2025

Merge branch 'main' into woosuk/flashinfer-prep

211278e

Signed-off-by: Michael Goin <[email protected]>

mergify bot removed the needs-rebase label Aug 25, 2025
