[v1] Add cross-attention KV cache support for encoder-decoder models #23664
Conversation
Code Review
This pull request introduces support for a cross-attention KV cache in encoder-decoder models, with an initial focus on Whisper. The changes to the KV cache coordinator and scheduler to handle encoder token allocation are logical. However, I've identified three critical issues that would lead to runtime errors. The new `CrossAttentionManager` incorrectly raises errors in `find_longest_cache_hit` and `cache_blocks` instead of handling those cases gracefully. Additionally, there's a call to a non-existent method `get_encdec_max_encoder_len` in `MULTIMODAL_REGISTRY`. These issues need to be addressed to ensure the functionality works as intended.
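For illustration, the graceful behavior the review asks for might look like the sketch below. The class and method names (`CrossAttentionManager`, `find_longest_cache_hit`, `cache_blocks`) come from the review itself; the signatures and the `KVCacheBlock` type are assumptions for the sake of a self-contained example, not vLLM's actual API.

```python
# Hypothetical sketch only: signatures and types are assumed, not vLLM's.
from dataclasses import dataclass, field


@dataclass
class KVCacheBlock:
    block_id: int


@dataclass
class CrossAttentionManager:
    """Manages cross-attention KV blocks that hold encoder states."""

    cached_blocks: list = field(default_factory=list)

    def find_longest_cache_hit(self, block_hashes) -> list[KVCacheBlock]:
        # Encoder states are request-specific, so cross-attention blocks
        # can never be shared across requests: report "no hit" instead of
        # raising when the coordinator probes all managers uniformly.
        return []

    def cache_blocks(self, blocks: list[KVCacheBlock]) -> None:
        # Prefix caching is disabled for cross-attention; skip silently
        # rather than erroring when this is called on every manager.
        return
```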
- Implement CrossAttentionManager for managing encoder states in KV cache
- Add num_encoder_tokens parameter to allocation methods for cross-attention blocks
- Update scheduler to handle encoder token allocation for Whisper models
- Disable prefix caching for cross-attention blocks since encoder states are request-specific
- Add encoder-decoder compatibility checks with KV connectors

This is a subset of the changes from vllm-project#21088. It includes the changes to the KV cache manager and scheduler for supporting cross-attention for Whisper.

Signed-off-by: Russell Bryant <[email protected]>
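The allocation-path change can be sketched roughly as follows. This is a minimal illustration of how a `num_encoder_tokens` parameter might be threaded through an allocation call; `allocate_slots`, the manager's `allocate` method, and their parameters are assumptions reconstructed from the commit message, not the exact vLLM signatures.

```python
# Illustrative sketch; names and signatures are assumptions.
def allocate_slots(kv_cache_manager, request, num_decoder_tokens: int,
                   num_encoder_tokens: int = 0):
    # Decoder self-attention blocks grow with generated tokens as usual.
    decoder_blocks = kv_cache_manager.allocate(request, num_decoder_tokens)

    cross_blocks = []
    if num_encoder_tokens > 0:
        # Cross-attention blocks are sized once from the encoder output
        # (fixed-length for Whisper) and, because encoder states are
        # request-specific, never participate in prefix caching.
        cross_blocks = kv_cache_manager.allocate(request, num_encoder_tokens)

    return decoder_blocks, cross_blocks
```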
Force-pushed from 86dc036 to 95b2163.
LGTM!
UT is broken by vLLM commit vllm-project/vllm#23664. This PR mocks the related config to recover the CI.
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@6dab89b

Signed-off-by: wangxiyuan <[email protected]>
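A minimal sketch of that kind of CI fix, assuming pytest-style tests and `unittest.mock`; the patched attribute path, config fields, and test name are hypothetical, not the actual vllm-ascend code.

```python
# Hypothetical example of recovering a UT by mocking the changed config.
from unittest.mock import MagicMock, patch


def test_scheduler_with_mocked_config():
    mock_config = MagicMock()
    # Provide the field the new vLLM code path expects so the test no
    # longer depends on the upstream encoder-decoder changes.
    mock_config.is_encoder_decoder = False

    with patch("vllm.config.ModelConfig", return_value=mock_config):
        ...  # run the scheduler test against the mocked config
```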
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: tc-mb <[email protected]>
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]>
…llm-project#23664) Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Xiao Yu <[email protected]>