[Feature] Add FlashInfer cuDNN backend for ViT attention #34441
Summary
Adds a FlashInfer cuDNN attention backend built on `cudnn_batch_prefill_with_kv_cache` for Vision Transformer encoders.

Key Changes
New cuDNN wrapper (`vllm/v1/attention/ops/vit_attn_wrappers.py`)
- `flashinfer_wrapper()` calls FlashInfer's cuDNN batch prefill API
- Takes the 3-section `cu_seqlens` format (`batch_offsets_qk`, `batch_offsets_v`, `batch_offsets_o`)
- `torch.compile` compatibility (see the sketch below)
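As a rough illustration of how such a wrapper can be exposed to `torch.compile`, here is a minimal sketch. The op name, tensor layout, and the per-sequence SDPA fallback body are assumptions made for the sake of a runnable example; the actual wrapper in this PR dispatches to FlashInfer's `cudnn_batch_prefill_with_kv_cache` instead.

```python
import torch
import torch.nn.functional as F

# Sketch only: registering the wrapper as a custom op keeps its body opaque
# to torch.compile, so the compiled graph treats it as a single node.
@torch.library.custom_op("vit_attn::flashinfer_wrapper", mutates_args=())
def flashinfer_wrapper(
    q: torch.Tensor,           # (total_tokens, num_heads, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,  # (num_seqs + 1,) prefix sums of sequence lengths
) -> torch.Tensor:
    # Placeholder body: per-sequence SDPA so the sketch runs anywhere.
    # The real implementation calls FlashInfer's cuDNN batch prefill API here.
    out = torch.empty_like(q)
    for i in range(cu_seqlens.numel() - 1):
        s, e = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        qi, ki, vi = (t[s:e].transpose(0, 1) for t in (q, k, v))
        out[s:e] = F.scaled_dot_product_attention(qi, ki, vi).transpose(0, 1)
    return out

@flashinfer_wrapper.register_fake
def _(q, k, v, cu_seqlens):
    # Shape-only implementation so torch.compile can trace through the op.
    return torch.empty_like(q)
```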
MMEncoderAttention updates (`vllm/model_executor/layers/attention/mm_encoder_attention.py`)
- New `_forward_flashinfer()` method with FlashInfer dispatch
- `workspace_buffer` parameter for a pre-allocated cuDNN workspace (128 MB; see the sketch below)
- `sequence_lengths` parameter threaded through all forward methods
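The workspace allocation itself is simple; a sketch of the idea (helper name assumed, size taken from the 128 MB figure above):

```python
import torch

# Hypothetical helper: allocate one flat 128 MB byte buffer up front and pass
# it to every cuDNN prefill call as `workspace_buffer`, so the attention
# forward pass never allocates scratch memory on its own.
CUDNN_WORKSPACE_BYTES = 128 * 1024 * 1024

def make_workspace_buffer(device: str = "cuda") -> torch.Tensor:
    return torch.empty(CUDNN_WORKSPACE_BYTES, dtype=torch.uint8, device=device)
```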
Qwen3-VL integration (`vllm/model_executor/models/qwen3_vl.py`)
- Batch bucketing (`BATCH_BUCKETS = [8, 16, 32, 64]`) for cuDNN graph caching (see the sketch below)
- `compute_flashinfer_cu_seqlens()` computes the 3-section `cu_seqlens` format
- `max_seqlen` set to 128K to avoid cuDNN recompilation
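A sketch of how bucketing and the 3-section `cu_seqlens` could fit together. `BATCH_BUCKETS` comes from the PR, but the bucketing rule, the helper signature, and the exact layout of the three sections are assumptions here:

```python
import torch

BATCH_BUCKETS = [8, 16, 32, 64]

def bucket_batch_size(num_seqs: int) -> int:
    # Round the number of ViT sequences up to the nearest bucket so cuDNN only
    # ever sees a handful of distinct batch shapes and can reuse cached graphs.
    for bucket in BATCH_BUCKETS:
        if num_seqs <= bucket:
            return bucket
    return num_seqs  # very large batches fall through unbucketed

def compute_flashinfer_cu_seqlens(seqlens: torch.Tensor) -> torch.Tensor:
    # Pad the per-image sequence lengths up to the bucketed batch size
    # (padded entries have length 0), then take prefix sums.
    num_seqs = bucket_batch_size(seqlens.numel())
    padded = torch.zeros(num_seqs, dtype=torch.int32)
    padded[: seqlens.numel()] = seqlens.to(torch.int32)
    cu = torch.zeros(num_seqs + 1, dtype=torch.int32)
    cu[1:] = torch.cumsum(padded, dim=0)
    # Assumed 3-section layout: the same offsets repeated for q/k, v, and the
    # output tensors (batch_offsets_qk, batch_offsets_v, batch_offsets_o).
    return torch.cat([cu, cu, cu])
```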
Platform support (`vllm/platforms/cuda.py`)
- `FLASHINFER` added to CUDA's `get_supported_vit_attn_backends()`

Usage
Set the ViT attention backend to FlashInfer:
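For example (the command form and model path are illustrative; the flag spelling is the one used in the test plan below):

```
vllm serve <qwen3-vl-checkpoint> --override-mm-encoder-attn-backend FLASHINFER
```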
Credits
Based on CentML/vllm#30.
Test plan
`--override-mm-encoder-attn-backend FLASHINFER`