
Conversation

russellb (Member)

  • Implement CrossAttentionManager for managing encoder states in KV cache
  • Add num_encoder_tokens parameter to allocation methods for cross-attention blocks
  • Update scheduler to handle encoder token allocation for Whisper models
  • Disable prefix caching for cross-attention blocks since encoder states are request-specific
  • Add encoder-decoder compatibility checks with KV connectors

This is a subset of the changes from #21088. It includes the changes
to the KV cache manager and scheduler for supporting cross-attention
for Whisper.
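
To make the shape of these changes concrete, below is a minimal, hypothetical sketch of a manager that sizes and hands out KV cache blocks from a `num_encoder_tokens` count and treats prefix caching as a no-op. It is not the vLLM implementation; apart from the names `CrossAttentionManager`, `find_longest_cache_hit`, `cache_blocks`, and `num_encoder_tokens` that appear in this PR, everything here is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheBlock:
    """A single fixed-size block of KV cache (illustrative only)."""
    block_id: int


@dataclass
class CrossAttentionManager:
    """Illustrative manager for cross-attention (encoder) KV cache blocks.

    Encoder states are produced once per request and never shared across
    requests, so prefix caching is deliberately a no-op here.
    """
    block_size: int
    free_blocks: list[KVCacheBlock] = field(default_factory=list)

    def get_num_blocks_needed(self, num_encoder_tokens: int) -> int:
        # Round up so every encoder token has a slot in some block.
        return -(-num_encoder_tokens // self.block_size)

    def allocate(self, num_encoder_tokens: int) -> list[KVCacheBlock]:
        n = self.get_num_blocks_needed(num_encoder_tokens)
        if n > len(self.free_blocks):
            raise RuntimeError("not enough free KV cache blocks")
        allocated, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return allocated

    def find_longest_cache_hit(self, *args, **kwargs) -> list[KVCacheBlock]:
        # No prefix caching for cross-attention: always report a miss.
        return []

    def cache_blocks(self, *args, **kwargs) -> None:
        # Nothing to cache; encoder states are request-specific.
        return None


if __name__ == "__main__":
    mgr = CrossAttentionManager(
        block_size=16,
        free_blocks=[KVCacheBlock(i) for i in range(128)])
    # e.g. Whisper encodes a 30 s chunk of audio into 1500 encoder tokens.
    blocks = mgr.allocate(num_encoder_tokens=1500)
    print(len(blocks))  # 1500 tokens at 16 slots per block -> 94 blocks
```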

Signed-off-by: Russell Bryant <[email protected]>

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for cross-attention KV cache in encoder-decoder models, with an initial focus on Whisper. The changes to the KV cache coordinator and scheduler to handle encoder token allocation are logical. However, I've identified three critical issues that would lead to runtime errors. The new CrossAttentionManager incorrectly raises errors in find_longest_cache_hit and cache_blocks instead of handling the cases gracefully. Additionally, there's a call to a non-existent method get_encdec_max_encoder_len in MULTIMODAL_REGISTRY. These issues need to be addressed to ensure the functionality works as intended.
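
For context on the last point, one way to avoid relying on a nonexistent registry helper is to derive the maximum encoder length from the model config itself. The snippet below is a hypothetical illustration using Hugging Face's `WhisperConfig` (whose `max_source_positions` defaults to 1500); it is not a description of vLLM's actual registry API.

```python
from transformers import WhisperConfig


def max_encoder_len_from_config(config) -> int:
    # Whisper's encoder emits one hidden state per source position,
    # so max_source_positions bounds the encoder-output length.
    return config.max_source_positions


if __name__ == "__main__":
    cfg = WhisperConfig()  # defaults to max_source_positions=1500
    print(max_encoder_len_from_config(cfg))  # -> 1500
```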

@russellb russellb requested a review from heheda12345 August 26, 2025 15:38
@russellb russellb force-pushed the kv-cache-manager-cross-attention branch from 86dc036 to 95b2163 Compare August 26, 2025 15:41
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Aug 26, 2025
heheda12345 (Collaborator) left a comment


LGTM!

@heheda12345 heheda12345 enabled auto-merge (squash) August 26, 2025 16:23
@heheda12345 heheda12345 disabled auto-merge August 26, 2025 16:23
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 26, 2025
@heheda12345 heheda12345 changed the title v1: Add cross-attention KV cache support for encoder-decoder models [v1] Add cross-attention KV cache support for encoder-decoder models Aug 26, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) August 26, 2025 16:23
@heheda12345 heheda12345 merged commit 98aa16f into vllm-project:main Aug 26, 2025
43 of 47 checks passed
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Aug 27, 2025
UT is broken by vLLM commit vllm-project/vllm#23664.

This PR mocks the related config to recover the CI.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@6dab89b

Signed-off-by: wangxiyuan <[email protected]>
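
As a hypothetical illustration of that kind of fix (the function and attribute names below are invented for the example, not taken from vllm-ascend), a test can mock only the config fields the code under test actually reads, so it no longer breaks when upstream vLLM changes unrelated fields:

```python
from unittest import mock


def scheduler_supports_request(vllm_config) -> bool:
    """Stand-in for the plugin code under test (illustrative only)."""
    return (vllm_config.model_config.is_encoder_decoder
            and not vllm_config.cache_config.enable_prefix_caching)


def test_with_mocked_config():
    # Mock only the config attributes this code path touches.
    fake_config = mock.MagicMock()
    fake_config.model_config.is_encoder_decoder = True
    fake_config.cache_config.enable_prefix_caching = False
    assert scheduler_supports_request(fake_config)


if __name__ == "__main__":
    test_with_mocked_config()
    print("ok")
```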
tc-mb pushed a commit to tc-mb/vllm that referenced this pull request Aug 27, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
dumb0002 pushed a commit to dumb0002/vllm that referenced this pull request Aug 28, 2025
2015aroras pushed a commit to 2015aroras/vllm that referenced this pull request Aug 29, 2025
Labels: multi-modality, ready, v1