@dorhuri123 commented on Feb 12, 2026

Summary

  • Adds cuDNN-accelerated attention via FlashInfer's cudnn_batch_prefill_with_kv_cache for Vision Transformer encoders
  • Supports bf16 with attention masks, targeting Qwen2.5-VL and Qwen3-VL models
  • Full integration for Qwen3-VL; parameter plumbing for Qwen2.5-VL (full integration in a follow-up PR)

Key Changes

New cuDNN wrapper (vllm/v1/attention/ops/vit_attn_wrappers.py)

  • flashinfer_wrapper() calls FlashInfer's cuDNN batch prefill API
  • Handles 3-section cu_seqlens format (batch_offsets_qk, batch_offsets_v, batch_offsets_o)
  • Registered as a custom op for torch.compile compatibility (see the sketch below)
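
A minimal sketch of this pattern, not the PR's actual code: the op name, argument list, and the stand-in SDPA body are illustrative. The real wrapper forwards the packed tensors and workspace to FlashInfer's cudnn_batch_prefill_with_kv_cache, whose exact keyword arguments are not reproduced here.

```python
# Sketch only: the real wrapper hands these tensors to FlashInfer's
# cudnn_batch_prefill_with_kv_cache; since that call's exact keyword
# arguments are not shown in this PR, a per-sequence SDPA loop stands in
# for the kernel here.
import torch
import torch.nn.functional as F


@torch.library.custom_op("vit_attn::cudnn_prefill", mutates_args=())
def cudnn_prefill(
    q: torch.Tensor,            # (total_tokens, num_heads, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,   # three concatenated offset sections
    workspace_buffer: torch.Tensor,
) -> torch.Tensor:
    # The packed cu_seqlens carries three equal-length sections:
    # batch_offsets_qk, batch_offsets_v, batch_offsets_o.
    n = cu_seqlens.numel() // 3
    offsets_qk, offsets_v, offsets_o = cu_seqlens.split([n, n, n])
    out = torch.empty_like(q)
    # Stand-in computation; the real op passes the packed buffers plus
    # workspace_buffer straight to the cuDNN batch-prefill kernel.
    for i in range(n - 1):
        s, e = int(offsets_qk[i]), int(offsets_qk[i + 1])
        qi, ki, vi = (t[s:e].transpose(0, 1) for t in (q, k, v))
        out[s:e] = F.scaled_dot_product_attention(qi, ki, vi).transpose(0, 1)
    return out


@cudnn_prefill.register_fake
def _(q, k, v, cu_seqlens, workspace_buffer):
    # Shape-only implementation so torch.compile can trace through the op.
    return torch.empty_like(q)
```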

MMEncoderAttention updates (vllm/model_executor/layers/attention/mm_encoder_attention.py)

  • New _forward_flashinfer() method with FlashInfer dispatch
  • workspace_buffer parameter for a pre-allocated 128 MB cuDNN workspace
  • sequence_lengths parameter threaded through all forward methods (see the dispatch sketch below)
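
Building on the wrapper sketch above, a rough illustration of how the dispatch and the lazily allocated 128 MB workspace might be wired; the class and method names mirror the PR, but the bodies are assumptions.

```python
# Dispatch sketch; the real class is MMEncoderAttention in
# vllm/model_executor/layers/attention/mm_encoder_attention.py.
import torch

WORKSPACE_BYTES = 128 * 1024 * 1024  # pre-allocated 128 MB cuDNN workspace


class MMEncoderAttentionSketch(torch.nn.Module):

    def __init__(self, backend: str):
        super().__init__()
        self.backend = backend
        self.workspace_buffer = None  # allocated lazily on first use

    def _get_workspace(self, device: torch.device) -> torch.Tensor:
        # Allocate the workspace once and reuse it across forward calls.
        if self.workspace_buffer is None:
            self.workspace_buffer = torch.empty(
                WORKSPACE_BYTES, dtype=torch.uint8, device=device)
        return self.workspace_buffer

    def forward(self, q, k, v, cu_seqlens, max_seqlen,
                sequence_lengths=None):
        # sequence_lengths is threaded through so the FlashInfer path can
        # hand per-image lengths to the kernel; other backends ignore it.
        if self.backend == "FLASHINFER":
            return self._forward_flashinfer(q, k, v, cu_seqlens,
                                            max_seqlen, sequence_lengths)
        raise NotImplementedError(
            f"sketch only covers FLASHINFER, not {self.backend}")

    def _forward_flashinfer(self, q, k, v, cu_seqlens, max_seqlen,
                            sequence_lengths):
        workspace = self._get_workspace(q.device)
        # Calls the custom op registered in the wrapper sketch above; the
        # real method also forwards max_seqlen and sequence_lengths.
        return torch.ops.vit_attn.cudnn_prefill(q, k, v, cu_seqlens,
                                                workspace)
```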

Qwen3-VL integration (vllm/model_executor/models/qwen3_vl.py)

  • Batch bucket padding (BATCH_BUCKETS = [8, 16, 32, 64]) for cuDNN graph caching
  • compute_flashinfer_cu_seqlens() computes the 3-section cu_seqlens format
  • max_seqlen pinned at 128K to avoid cuDNN recompilation
  • FlashInfer added to the supported ViT attention backends (see the sketch below)
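
A hedged sketch of the bucketing and cu_seqlens helpers: BATCH_BUCKETS, compute_flashinfer_cu_seqlens, and the 128K cap come from the PR, while the function bodies (and the omission of padding logic) are illustrative.

```python
# Sketch of the batch-bucket padding and 3-section cu_seqlens layout; the
# real helpers live in vllm/model_executor/models/qwen3_vl.py, and the
# padding of dummy sequences up to the bucket size is omitted here.
import bisect
import torch

BATCH_BUCKETS = [8, 16, 32, 64]  # padded batch sizes, so cuDNN reuses graphs
FIXED_MAX_SEQLEN = 128 * 1024    # max_seqlen pinned to avoid recompilation


def pad_to_bucket(num_seqs: int) -> int:
    """Round the number of image sequences up to the nearest bucket so cuDNN
    sees only a handful of distinct graph shapes."""
    idx = bisect.bisect_left(BATCH_BUCKETS, num_seqs)
    return BATCH_BUCKETS[idx] if idx < len(BATCH_BUCKETS) else num_seqs


def compute_flashinfer_cu_seqlens(seqlens: torch.Tensor) -> torch.Tensor:
    """Pack [batch_offsets_qk | batch_offsets_v | batch_offsets_o] from the
    per-image sequence lengths. In this sketch the three sections share the
    same offsets, since q/k, v, and the output are laid out identically for
    ViT self-attention."""
    zero = seqlens.new_zeros(1)
    offsets = torch.cat([zero, torch.cumsum(seqlens, dim=0)])
    return torch.cat([offsets, offsets, offsets]).to(torch.int32)
```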

Platform support (vllm/platforms/cuda.py)

  • FLASHINFER added to CUDA's get_supported_vit_attn_backends()

Usage

Set the ViT attention backend to FlashInfer:

--override-mm-encoder-attn-backend FLASHINFER
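
For example, with the standard vllm serve entry point (the model name below is only a placeholder):

vllm serve Qwen/Qwen3-VL-8B-Instruct --override-mm-encoder-attn-backend FLASHINFER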

Credits

Based on CentML/vllm#30.

Test plan

  • Verify Qwen3-VL inference with --override-mm-encoder-attn-backend FLASHINFER
  • Check for regressions with the default backends (FLASH_ATTN, TORCH_SDPA)
  • Verify cuDNN graph caching with batch bucket padding


@gemini-code-assist bot left a comment


Code Review

This pull request introduces FlashInfer cuDNN backend support for Vision Transformer (ViT) attention, targeting Qwen2.5-VL and Qwen3-VL models. The changes involve adding new wrappers for FlashInfer, updating MMEncoderAttention to dispatch to the new backend, and integrating it into Qwen3-VL with batch bucket padding for cuDNN graph caching. The workspace_buffer and sequence_lengths parameters are threaded through the attention layers to support the new backend. Overall, the implementation seems to correctly integrate the FlashInfer backend. However, there are a few areas that could be improved for clarity, maintainability, and torch.compile compatibility.

Adds cuDNN-accelerated attention via FlashInfer for Vision Transformer
encoders, supporting bf16 with attention masks. Currently integrated
for Qwen3-VL, with plumbing added for Qwen2.5-VL.

Based on CentML#30.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: dorh <[email protected]>
@dorhuri123 force-pushed the add-flashinfer-cudnn-vit-attention branch from 50d61f6 to dfb49e1 on February 12, 2026 at 18:20

Labels

nvidia, qwen (Related to Qwen models), v1
