[v1] Add cross-attention KV cache support for encoder-decoder models #23664
Conversation
Code Review
This pull request introduces support for a cross-attention KV cache in encoder-decoder models, with an initial focus on Whisper. The changes to the KV cache coordinator and scheduler to handle encoder token allocation are logical. However, I've identified three critical issues that would lead to runtime errors. The new `CrossAttentionManager` incorrectly raises errors in `find_longest_cache_hit` and `cache_blocks` instead of handling those cases gracefully. Additionally, there's a call to a non-existent method `get_encdec_max_encoder_len` in `MULTIMODAL_REGISTRY`. These issues need to be addressed to ensure the functionality works as intended.
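For illustration, the graceful behavior the review asks for might look like the sketch below. The class and method names (`CrossAttentionManager`, `find_longest_cache_hit`, `cache_blocks`) come from the review itself; the signatures and the `KVCacheBlock` type are assumptions for the sake of a self-contained example, not vLLM's actual API.

```python
# Hypothetical sketch only: signatures and types are assumed, not vLLM's.
from dataclasses import dataclass, field


@dataclass
class KVCacheBlock:
    block_id: int


@dataclass
class CrossAttentionManager:
    """Manages cross-attention KV blocks that hold encoder states."""

    cached_blocks: list = field(default_factory=list)

    def find_longest_cache_hit(self, block_hashes) -> list[KVCacheBlock]:
        # Encoder states are request-specific, so cross-attention blocks
        # can never be shared across requests: report "no hit" instead of
        # raising when the coordinator probes all managers uniformly.
        return []

    def cache_blocks(self, blocks: list[KVCacheBlock]) -> None:
        # Prefix caching is disabled for cross-attention; skip silently
        # rather than erroring when this is called on every manager.
        return
```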
- Implement CrossAttentionManager for managing encoder states in KV cache
- Add num_encoder_tokens parameter to allocation methods for cross-attention blocks
- Update scheduler to handle encoder token allocation for Whisper models
- Disable prefix caching for cross-attention blocks since encoder states are request-specific
- Add encoder-decoder compatibility checks with KV connectors

This is a subset of the changes from vllm-project#21088. It includes the changes to the KV cache manager and scheduler for supporting cross-attention for Whisper.

Signed-off-by: Russell Bryant <[email protected]>
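The allocation-path change can be sketched roughly as follows. This is a minimal illustration of how a `num_encoder_tokens` parameter might be threaded through an allocation call; `allocate_slots`, the manager's `allocate` method, and their parameters are assumptions reconstructed from the commit message, not the exact vLLM signatures.

```python
# Illustrative sketch; names and signatures are assumptions.
def allocate_slots(kv_cache_manager, request, num_decoder_tokens: int,
                   num_encoder_tokens: int = 0):
    # Decoder self-attention blocks grow with generated tokens as usual.
    decoder_blocks = kv_cache_manager.allocate(request, num_decoder_tokens)

    cross_blocks = []
    if num_encoder_tokens > 0:
        # Cross-attention blocks are sized once from the encoder output
        # (fixed-length for Whisper) and, because encoder states are
        # request-specific, never participate in prefix caching.
        cross_blocks = kv_cache_manager.allocate(request, num_encoder_tokens)

    return decoder_blocks, cross_blocks
```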
Force-pushed from 86dc036 to 95b2163.
LGTM!
UT is broken by vLLM commit vllm-project/vllm#23664. This PR mocks the related config to recover the CI.
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@6dab89b

Signed-off-by: wangxiyuan <[email protected]>
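A minimal sketch of that kind of CI fix, assuming pytest-style tests and `unittest.mock`; the patched attribute path, config fields, and test name are hypothetical, not the actual vllm-ascend code.

```python
# Hypothetical example of recovering a UT by mocking the changed config.
from unittest.mock import MagicMock, patch


def test_scheduler_with_mocked_config():
    mock_config = MagicMock()
    # Provide the field the new vLLM code path expects so the test no
    # longer depends on the upstream encoder-decoder changes.
    mock_config.is_encoder_decoder = False

    with patch("vllm.config.ModelConfig", return_value=mock_config):
        ...  # run the scheduler test against the mocked config
```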
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: tc-mb <[email protected]>
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]>
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Xiao Yu <[email protected]>