[v1] Add Whisper model support (encoder-decoder) #21088
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.
I've identified one critical issue in `_build_encoder_attn_metadata` where a missing `else` block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!
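For illustration only, here is a minimal hypothetical sketch of the failure mode that review describes: if one branch of a metadata builder pads a per-request length array and the other branch is missing, downstream code that expects a fixed batch size hits a size mismatch at runtime. The helper and variable names below are invented for the example and are not vLLM's actual `_build_encoder_attn_metadata`.

```python
import numpy as np


def build_seq_lens(encoder_seq_lens: list[int], pad_to: int) -> np.ndarray:
    """Hypothetical helper: pad per-request encoder lengths to a fixed batch
    size so they line up with the other padded metadata arrays."""
    seq_lens = np.array(encoder_seq_lens, dtype=np.int32)
    if len(seq_lens) < pad_to:
        seq_lens = np.pad(seq_lens, (0, pad_to - len(seq_lens)))
    else:
        # Without this else branch, an over-long batch would pass through
        # untruncated, and later indexing or concatenation against arrays of
        # length `pad_to` would raise a size-mismatch error.
        seq_lens = seq_lens[:pad_to]
    return seq_lens


assert build_seq_lens([3, 5], pad_to=4).shape == (4,)
assert build_seq_lens([3, 5, 7, 9, 11], pad_to=4).shape == (4,)
```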
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
There is already some work to support encoder-decoder models: #20226. Can you coordinate with @maxdebayser to avoid duplicate work?
Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
Yep, we're in contact. Did you mean to link something different than #20226? Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of the encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.
Follow-up on next steps and collaboration with @maxdebayser: we're going to combine our work and try to land it all in a few stages.

1. Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.
2. Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).
3. Add the last parts of Max's original PR, which supports token_type_ids to run the BERT classifier models that need them.
Force-pushed from 96be9ad to 4da8b7c.
nice one!
Force-pushed from 16f557d to a9e3459.
I got this caught up with `main`.
Force-pushed from 87d9bfa to f62a66e.
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for encoder models such as BERT which don't support a KV cache due to the non-causal attention. Since the KV cache spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata. This PR combines elements of PRs vllm-project#21088 and vllm-project#19988.

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
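As a rough, hypothetical sketch of the scheduler-side behavior this commit message describes (the config object and helper names here are invented, not vLLM's real ones): chunked prefill only makes sense when a KV cache lets the decoder resume from cached state, so models without a KV cache force it off.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # Hypothetical stand-in for the model/cache configuration.
    has_kv_cache: bool
    is_encoder_only: bool


def resolve_chunked_prefill(spec: ModelSpec, requested: bool) -> bool:
    """Encoder-only models (e.g. BERT) use non-causal attention and keep no
    KV cache, so a prefill cannot be split across scheduler steps."""
    if not spec.has_kv_cache or spec.is_encoder_only:
        return False
    return requested


assert resolve_chunked_prefill(ModelSpec(False, True), requested=True) is False
assert resolve_chunked_prefill(ModelSpec(True, False), requested=True) is True
```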
- Implement CrossAttentionManager for managing encoder states in the KV cache
- Add num_encoder_tokens parameter to allocation methods for cross-attention blocks
- Update scheduler to handle encoder token allocation for Whisper models
- Disable prefix caching for cross-attention blocks since encoder states are request-specific
- Add encoder-decoder compatibility checks with KV connectors

This is a subset of the changes from vllm-project#21088. It includes the changes to the KV cache manager and scheduler for supporting cross-attention for Whisper.

Signed-off-by: Russell Bryant <[email protected]>
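A hedged sketch of the allocation arithmetic described above (the helper and block size are assumptions for illustration, not vLLM's actual API): cross-attention blocks are sized from the encoder's token count rather than the decoder's, are allocated once, and are never shared via prefix caching because the encoder states differ per request.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block; value assumed for illustration


def num_cross_attn_blocks(num_encoder_tokens: int,
                          block_size: int = BLOCK_SIZE) -> int:
    """Blocks needed to hold the encoder key/value states for one request.

    Unlike decoder self-attention blocks, these are sized once from the
    encoder output length, do not grow during decoding, and are not eligible
    for prefix caching (encoder states are request-specific).
    """
    return math.ceil(num_encoder_tokens / block_size)


# Whisper's encoder emits 1500 output frames for a 30-second audio clip, so:
assert num_cross_attn_blocks(1500) == 94
```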
…ctness

Improve the performance of this test by only creating the tokenizer once instead of hundreds of times, serialized because it was done while holding a semaphore. The previous code would also frequently get rate limited by HuggingFace from requesting https://huggingface.co/openai/whisper-large-v3/resolve/main/tokenizer_config.json too many times, which would sometimes cause the test to fail.

On my laptop, here is the time difference:

Before: 5m3.389s
After: 2m5.471s

This is a piece split out from vllm-project#21088.

Signed-off-by: Russell Bryant <[email protected]>
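A small sketch of the kind of restructuring that commit describes (the fixture name and test are hypothetical, not the actual vLLM test code): build the tokenizer once per test session, outside any semaphore, so it is neither serialized nor repeatedly fetched from the Hub.

```python
import pytest
from transformers import AutoTokenizer


@pytest.fixture(scope="session")
def whisper_tokenizer():
    # Constructed once for the whole test session, avoiding hundreds of
    # serialized requests to the HuggingFace Hub.
    return AutoTokenizer.from_pretrained("openai/whisper-large-v3")


def test_tokenizer_roundtrip(whisper_tokenizer):
    text = "The quick brown fox"
    ids = whisper_tokenizer.encode(text, add_special_tokens=False)
    assert whisper_tokenizer.decode(ids) == text
```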
The PR is up-to-date with `main`. I split another piece out into its own PR here: #23854

This PR contains everything remaining to make Whisper work in V1. The major design notes I have on the current state are:
This pull request has merge conflicts that must be resolved before it can be merged.
Updated to fix more conflicts. It's working when I test manually, but one of the tests is triggering a memory error. I'm trying to debug that.
_dummy_blk_table_and_slot_mapping())
num_common_prefix_blocks = 0
causal_arg = False
elif isinstance(kv_cache_group_spec.kv_cache_spec,
Can you put all these things into the builder.build of encoder attention and cross attention? Basically, you can still pass the original `common_attn_metadata` that is only correct for decode to the `build` function, and update the attributes that are special for encoder attention and cross attention in the `build`. You can refer to the chunked_local_attention in llama4:
common_attn_metadata = make_local_attention_virtual_batches(
The necessary encoder_inputs can be passed to the builder via common_attn_metadata.
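The shape of this suggestion, as a hypothetical sketch (the dataclass and builder below are invented stand-ins, not vLLM's actual `CommonAttentionMetadata` or builder API): pass the decode-oriented metadata into `build()` and let the cross-attention builder replace only the fields that differ, rather than assembling a parallel object in the model runner.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CommonMetadata:
    # Hypothetical stand-in for the metadata shared across builders.
    seq_lens: list[int]   # per-request lengths (decoder view)
    causal: bool = True


class CrossAttentionBuilder:
    """Builds attention metadata by patching the decode-oriented metadata
    inside build(), in the spirit of make_local_attention_virtual_batches."""

    def build(self, common: CommonMetadata,
              encoder_seq_lens: list[int]) -> CommonMetadata:
        # Cross-attention attends over the encoder output, which is
        # non-causal and has its own per-request lengths.
        return replace(common, seq_lens=encoder_seq_lens, causal=False)


meta = CommonMetadata(seq_lens=[7, 3])
cross = CrossAttentionBuilder().build(meta, encoder_seq_lens=[1500, 1500])
assert cross.causal is False and cross.seq_lens == [1500, 1500]
```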
Right - I had this in my notes here: #21088 (comment)

I just hadn't gotten to it, since it took me a bit to get it all working again. I'm also fixing conflicts almost every day. Right now I'm focused on going through CI failures. I'll keep this on my TODO list.
@@ -3142,7 +3355,11 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
def may_add_encoder_only_layers_to_kv_cache_config(self) -> None:
Suggested change:
- def may_add_encoder_only_layers_to_kv_cache_config(self) -> None:
+ def may_add_encoder_layers_to_kv_cache_config(self) -> None:
This pull request has merge conflicts that must be resolved before it can be merged.
Implements Whisper model support in the V1 engine. Key changes include:

- Add encoder-decoder architecture support with cross-attention KV cache management
- Add CrossAttentionManager and CrossAttentionSpec for encoder-decoder KV cache
- Update scheduler to handle cross-attention block allocation and disable prefix caching
- Modify GPU model runner for encoder input processing and attention metadata
- Disable BART / other enc-dec tests/examples (Whisper-only support for now)
- Optimize test performance and fix various integration issues

This closes a major feature gap between V0 and V1, enabling Whisper transcription in the new engine architecture while maintaining backward compatibility.

Related to V0 deprecation (vllm-project#18571) and 2025 Q3 roadmap (vllm-project#20336).

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
`small` is less reliable and doesn't seem to add much value in making sure whisper is working.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
It causes startup to hang for me with this warning:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Signed-off-by: Russell Bryant <[email protected]>
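For reference, the second option the warning itself suggests looks like the snippet below; this is just a minimal example of setting the variable, not necessarily the fix taken in this commit, and it must run before the `tokenizers` library constructs any tokenizer in the parent process.

```python
import os

# Must be set before HuggingFace `tokenizers` spins up its thread pool,
# i.e. before any tokenizer is created in the process that later forks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```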
Signed-off-by: Russell Bryant <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
v1: Add Whisper encoder-decoder model support
Implements Whisper model support in the V1 engine. Key changes include:
This closes a major feature gap between V0 and V1, enabling Whisper transcription
in the new engine architecture while maintaining backward compatibility.
Related to V0 deprecation (#18571) and 2025 Q3 roadmap (#20336).
Closes #12761
Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>