
Conversation

WoosukKwon (Collaborator)

Should be merged after #23147


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces significant optimizations to the input-preparation logic for the FlashInfer attention backend. The key changes:

- Refactor metadata handling by moving static parameters from FlashInferMetadata to FlashInferMetadataBuilder.
- Replace CPU-intensive PyTorch operations with faster NumPy equivalents.
- Substitute the slow torch.masked_select with a custom Triton kernel for preparing the paged KV indices (see the sketch below).
- Optimize the calculation of max_seq_len by pre-computing it once.

The changes are well implemented, the optimizations are sound, and they should yield noticeable performance improvements.
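The PR diff itself isn't shown in this thread, so the snippet below is only a minimal sketch of the masked_select-to-Triton substitution the review describes, under the assumption that "preparing paged KV indices" means flattening each request's valid block IDs out of a padded 2D block table. All names here (prepare_paged_kv_indices, _copy_block_ids_kernel, the argument layout) are hypothetical, not the PR's actual code.

```python
# Hypothetical sketch -- not the PR's actual kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def _copy_block_ids_kernel(
    block_table_ptr,   # (num_reqs, max_blocks) padded block table
    out_ptr,           # flat output buffer of valid block IDs
    num_blocks_ptr,    # (num_reqs,) number of valid blocks per request
    indptr_ptr,        # (num_reqs + 1,) exclusive prefix sum of num_blocks
    row_stride,        # stride between block-table rows
    BLOCK: tl.constexpr,
):
    req = tl.program_id(0)      # one (request, chunk) program per tile
    chunk = tl.program_id(1)
    n = tl.load(num_blocks_ptr + req)
    start = tl.load(indptr_ptr + req)
    offs = chunk * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n             # drop the padded tail of the row
    vals = tl.load(block_table_ptr + req * row_stride + offs, mask=mask)
    tl.store(out_ptr + start + offs, vals, mask=mask)


def prepare_paged_kv_indices(block_table: torch.Tensor,
                             num_blocks: torch.Tensor) -> torch.Tensor:
    """Equivalent to torch.masked_select(block_table, mask) where
    mask[i, j] = j < num_blocks[i], but without materializing the mask."""
    num_reqs, max_blocks = block_table.shape
    indptr = torch.zeros(num_reqs + 1, dtype=torch.int32,
                         device=block_table.device)
    indptr[1:] = torch.cumsum(num_blocks, dim=0)
    out = torch.empty(int(indptr[-1]), dtype=block_table.dtype,
                      device=block_table.device)
    BLOCK = 128
    grid = (num_reqs, triton.cdiv(max_blocks, BLOCK))
    _copy_block_ids_kernel[grid](block_table, out, num_blocks, indptr,
                                 block_table.stride(0), BLOCK=BLOCK)
    return out
```

The win over torch.masked_select in this pattern is that no boolean mask is materialized and no stream compaction is needed: the prefix sum (indptr) makes each request's output offset known up front, so every program writes its valid entries directly to their final positions.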


mergify bot commented Aug 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label (Aug 19, 2025)
@WoosukKwon requested a review from tdoublep as a code owner (August 20, 2025 00:24)
@WoosukKwon changed the base branch from main to woosuk/max-seq-len (August 20, 2025 00:24)
mergify bot removed the needs-rebase label (Aug 20, 2025)
Signed-off-by: Woosuk Kwon <[email protected]>
@WoosukKwon added the ready (ONLY add when PR is ready to merge/full CI is needed) label (Aug 20, 2025)
Base automatically changed from woosuk/max-seq-len to main (August 20, 2025 16:05)

mergify bot commented Aug 22, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label (Aug 22, 2025)
@nvpohanh (Contributor)

@WoosukKwon We found that this optimization can reduce gaps between decoding steps when running with low concurrency. Do you plan to continue working on this so that it can be merged? Thanks!

mergify bot removed the needs-rebase label (Aug 25, 2025)
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1
Projects: None yet
3 participants