@dorhuri123 commented on Feb 12, 2026

Summary

  • Adds cuDNN-accelerated attention via FlashInfer's cudnn_batch_prefill_with_kv_cache for Vision Transformer encoders
  • Supports bf16 with attention masks, targeting Qwen2.5-VL and Qwen3-VL models
  • Full integration for Qwen3-VL; parameter plumbing for Qwen2.5-VL (full integration in a follow-up PR)

Key Changes

New cuDNN wrapper (vllm/v1/attention/ops/vit_attn_wrappers.py)

  • flashinfer_wrapper() calls FlashInfer's cuDNN batch prefill API
  • Handles 3-section cu_seqlens format (batch_offsets_qk, batch_offsets_v, batch_offsets_o)
  • Registered as a custom op for torch.compile compatibility (see the sketch below)
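
A minimal sketch of this pattern, not the PR's actual code: the op name, argument list, and the stand-in SDPA body are illustrative. The real wrapper forwards the packed tensors and workspace to FlashInfer's cudnn_batch_prefill_with_kv_cache, whose exact keyword arguments are not reproduced here.

```python
# Sketch only: the real wrapper hands these tensors to FlashInfer's
# cudnn_batch_prefill_with_kv_cache; since that call's exact keyword
# arguments are not shown in this PR, a per-sequence SDPA loop stands in
# for the kernel here.
import torch
import torch.nn.functional as F


@torch.library.custom_op("vit_attn::cudnn_prefill", mutates_args=())
def cudnn_prefill(
    q: torch.Tensor,            # (total_tokens, num_heads, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,   # three concatenated offset sections
    workspace_buffer: torch.Tensor,
) -> torch.Tensor:
    # The packed cu_seqlens carries three equal-length sections:
    # batch_offsets_qk, batch_offsets_v, batch_offsets_o.
    n = cu_seqlens.numel() // 3
    offsets_qk, offsets_v, offsets_o = cu_seqlens.split([n, n, n])
    out = torch.empty_like(q)
    # Stand-in computation; the real op passes the packed buffers plus
    # workspace_buffer straight to the cuDNN batch-prefill kernel.
    for i in range(n - 1):
        s, e = int(offsets_qk[i]), int(offsets_qk[i + 1])
        qi, ki, vi = (t[s:e].transpose(0, 1) for t in (q, k, v))
        out[s:e] = F.scaled_dot_product_attention(qi, ki, vi).transpose(0, 1)
    return out


@cudnn_prefill.register_fake
def _(q, k, v, cu_seqlens, workspace_buffer):
    # Shape-only implementation so torch.compile can trace through the op.
    return torch.empty_like(q)
```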

MMEncoderAttention updates (vllm/model_executor/layers/attention/mm_encoder_attention.py)

  • New _forward_flashinfer() method with FlashInfer dispatch
  • workspace_buffer parameter for a pre-allocated 128 MB cuDNN workspace
  • sequence_lengths parameter threaded through all forward methods (see the dispatch sketch below)
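
Building on the wrapper sketch above, a rough illustration of how the dispatch and the lazily allocated 128 MB workspace might be wired; the class and method names mirror the PR, but the bodies are assumptions.

```python
# Dispatch sketch; the real class is MMEncoderAttention in
# vllm/model_executor/layers/attention/mm_encoder_attention.py.
import torch

WORKSPACE_BYTES = 128 * 1024 * 1024  # pre-allocated 128 MB cuDNN workspace


class MMEncoderAttentionSketch(torch.nn.Module):

    def __init__(self, backend: str):
        super().__init__()
        self.backend = backend
        self.workspace_buffer = None  # allocated lazily on first use

    def _get_workspace(self, device: torch.device) -> torch.Tensor:
        # Allocate the workspace once and reuse it across forward calls.
        if self.workspace_buffer is None:
            self.workspace_buffer = torch.empty(
                WORKSPACE_BYTES, dtype=torch.uint8, device=device)
        return self.workspace_buffer

    def forward(self, q, k, v, cu_seqlens, max_seqlen,
                sequence_lengths=None):
        # sequence_lengths is threaded through so the FlashInfer path can
        # hand per-image lengths to the kernel; other backends ignore it.
        if self.backend == "FLASHINFER":
            return self._forward_flashinfer(q, k, v, cu_seqlens,
                                            max_seqlen, sequence_lengths)
        raise NotImplementedError(
            f"sketch only covers FLASHINFER, not {self.backend}")

    def _forward_flashinfer(self, q, k, v, cu_seqlens, max_seqlen,
                            sequence_lengths):
        workspace = self._get_workspace(q.device)
        # Calls the custom op registered in the wrapper sketch above; the
        # real method also forwards max_seqlen and sequence_lengths.
        return torch.ops.vit_attn.cudnn_prefill(q, k, v, cu_seqlens,
                                                workspace)
```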

Qwen3-VL integration (vllm/model_executor/models/qwen3_vl.py)

  • Batch bucket padding (BATCH_BUCKETS = [8, 16, 32, 64]) for cuDNN graph caching
  • compute_flashinfer_cu_seqlens() computes the 3-section cu_seqlens format
  • max_seqlen pinned at 128K to avoid cuDNN recompilation
  • FlashInfer added to the supported ViT attention backends (see the sketch below)
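
A hedged sketch of the bucketing and cu_seqlens helpers: BATCH_BUCKETS, compute_flashinfer_cu_seqlens, and the 128K cap come from the PR, while the function bodies (and the omission of padding logic) are illustrative.

```python
# Sketch of the batch-bucket padding and 3-section cu_seqlens layout; the
# real helpers live in vllm/model_executor/models/qwen3_vl.py, and the
# padding of dummy sequences up to the bucket size is omitted here.
import bisect
import torch

BATCH_BUCKETS = [8, 16, 32, 64]  # padded batch sizes, so cuDNN reuses graphs
FIXED_MAX_SEQLEN = 128 * 1024    # max_seqlen pinned to avoid recompilation


def pad_to_bucket(num_seqs: int) -> int:
    """Round the number of image sequences up to the nearest bucket so cuDNN
    sees only a handful of distinct graph shapes."""
    idx = bisect.bisect_left(BATCH_BUCKETS, num_seqs)
    return BATCH_BUCKETS[idx] if idx < len(BATCH_BUCKETS) else num_seqs


def compute_flashinfer_cu_seqlens(seqlens: torch.Tensor) -> torch.Tensor:
    """Pack [batch_offsets_qk | batch_offsets_v | batch_offsets_o] from the
    per-image sequence lengths. In this sketch the three sections share the
    same offsets, since q/k, v, and the output are laid out identically for
    ViT self-attention."""
    zero = seqlens.new_zeros(1)
    offsets = torch.cat([zero, torch.cumsum(seqlens, dim=0)])
    return torch.cat([offsets, offsets, offsets]).to(torch.int32)
```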

Platform support (vllm/platforms/cuda.py)

  • FLASHINFER added to CUDA's get_supported_vit_attn_backends()

Usage

Set the ViT attention backend to FlashInfer:

--override-mm-encoder-attn-backend FLASHINFER
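
For example, with the standard vllm serve entry point (the model name below is only a placeholder):

vllm serve Qwen/Qwen3-VL-8B-Instruct --override-mm-encoder-attn-backend FLASHINFER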

Credits

Based on CentML/vllm#30.

Test plan

  • Verify Qwen3-VL inference with --override-mm-encoder-attn-backend FLASHINFER
  • Check for regressions with the default backends (FLASH_ATTN, TORCH_SDPA)
  • Verify cuDNN graph caching with batch bucket padding


@gemini-code-assist bot left a comment


Code Review

This pull request introduces FlashInfer cuDNN backend support for Vision Transformer (ViT) attention, targeting Qwen2.5-VL and Qwen3-VL models. The changes involve adding new wrappers for FlashInfer, updating MMEncoderAttention to dispatch to the new backend, and integrating it into Qwen3-VL with batch bucket padding for cuDNN graph caching. The workspace_buffer and sequence_lengths parameters are threaded through the attention layers to support the new backend. Overall, the implementation seems to correctly integrate the FlashInfer backend. However, there are a few areas that could be improved for clarity, maintainability, and torch.compile compatibility.

Adds cuDNN-accelerated attention via FlashInfer for Vision Transformer
encoders, supporting bf16 with attention masks. Currently integrated
for Qwen3-VL, with plumbing added for Qwen2.5-VL.

Based on CentML#30.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: dorh <[email protected]>
@dorhuri123 force-pushed the add-flashinfer-cudnn-vit-attention branch from 50d61f6 to dfb49e1 on February 12, 2026 at 18:20

Labels

nvidia, qwen (Related to Qwen models), v1
