[v1] Add Whisper model support (encoder-decoder) #21088
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.
I've identified one critical issue in `_build_encoder_attn_metadata` where a missing `else` block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!
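For illustration only, here is a minimal hypothetical sketch of the failure mode that review describes: if one branch of a metadata builder pads a per-request length array and the other branch is missing, downstream code that expects a fixed batch size hits a size mismatch at runtime. The helper and variable names below are invented for the example and are not vLLM's actual `_build_encoder_attn_metadata`.

```python
import numpy as np


def build_seq_lens(encoder_seq_lens: list[int], pad_to: int) -> np.ndarray:
    """Hypothetical helper: pad per-request encoder lengths to a fixed batch
    size so they line up with the other padded metadata arrays."""
    seq_lens = np.array(encoder_seq_lens, dtype=np.int32)
    if len(seq_lens) < pad_to:
        seq_lens = np.pad(seq_lens, (0, pad_to - len(seq_lens)))
    else:
        # Without this else branch, an over-long batch would pass through
        # untruncated, and later indexing or concatenation against arrays of
        # length `pad_to` would raise a size-mismatch error.
        seq_lens = seq_lens[:pad_to]
    return seq_lens


assert build_seq_lens([3, 5], pad_to=4).shape == (4,)
assert build_seq_lens([3, 5, 7, 9, 11], pad_to=4).shape == (4,)
```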
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
There is already some work to support encoder-decoder models: #20226. Can you coordinate with @maxdebayser to avoid duplicate work?
Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
Yep, we're in contact. Did you mean to link something different than #20226? Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of the encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.
Follow-up on next steps and collaboration with @maxdebayser: we're going to combine our work and try to land it all in a few stages.

1. Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.
2. Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).
3. Add the last parts of Max's original PR, which supports token_type_ids to run the BERT classifier models that need them.
Force-pushed from 96be9ad to 4da8b7c.
nice one!
Force-pushed from 16f557d to a9e3459.
I got this caught up with `main`.
Force-pushed from 87d9bfa to f62a66e.
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for encoder models such as BERT which don't support a KV cache due to the non-causal attention. Since the KV cache spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata. This PR combines elements of PRs vllm-project#21088 and vllm-project#19988.

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
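As a rough, hypothetical sketch of the scheduler-side behavior this commit message describes (the config object and helper names here are invented, not vLLM's real ones): chunked prefill only makes sense when a KV cache lets the decoder resume from cached state, so models without a KV cache force it off.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # Hypothetical stand-in for the model/cache configuration.
    has_kv_cache: bool
    is_encoder_only: bool


def resolve_chunked_prefill(spec: ModelSpec, requested: bool) -> bool:
    """Encoder-only models (e.g. BERT) use non-causal attention and keep no
    KV cache, so a prefill cannot be split across scheduler steps."""
    if not spec.has_kv_cache or spec.is_encoder_only:
        return False
    return requested


assert resolve_chunked_prefill(ModelSpec(False, True), requested=True) is False
assert resolve_chunked_prefill(ModelSpec(True, False), requested=True) is True
```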
- Implement CrossAttentionManager for managing encoder states in the KV cache
- Add num_encoder_tokens parameter to allocation methods for cross-attention blocks
- Update scheduler to handle encoder token allocation for Whisper models
- Disable prefix caching for cross-attention blocks since encoder states are request-specific
- Add encoder-decoder compatibility checks with KV connectors

This is a subset of the changes from vllm-project#21088. It includes the changes to the KV cache manager and scheduler for supporting cross-attention for Whisper.

Signed-off-by: Russell Bryant <[email protected]>
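A hedged sketch of the allocation arithmetic described above (the helper and block size are assumptions for illustration, not vLLM's actual API): cross-attention blocks are sized from the encoder's token count rather than the decoder's, are allocated once, and are never shared via prefix caching because the encoder states differ per request.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block; value assumed for illustration


def num_cross_attn_blocks(num_encoder_tokens: int,
                          block_size: int = BLOCK_SIZE) -> int:
    """Blocks needed to hold the encoder key/value states for one request.

    Unlike decoder self-attention blocks, these are sized once from the
    encoder output length, do not grow during decoding, and are not eligible
    for prefix caching (encoder states are request-specific).
    """
    return math.ceil(num_encoder_tokens / block_size)


# Whisper's encoder emits 1500 output frames for a 30-second audio clip, so:
assert num_cross_attn_blocks(1500) == 94
```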
…ctness

Improve the performance of this test by only creating the tokenizer once instead of hundreds of times, serialized because it was done while holding a semaphore. The previous code would also frequently get rate limited by HuggingFace from requesting https://huggingface.co/openai/whisper-large-v3/resolve/main/tokenizer_config.json too many times, which would sometimes cause the test to fail.

On my laptop, here is the time difference:

Before: 5m3.389s
After: 2m5.471s

This is a piece split out from vllm-project#21088.

Signed-off-by: Russell Bryant <[email protected]>
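A small sketch of the kind of restructuring that commit describes (the fixture name and test are hypothetical, not the actual vLLM test code): build the tokenizer once per test session, outside any semaphore, so it is neither serialized nor repeatedly fetched from the Hub.

```python
import pytest
from transformers import AutoTokenizer


@pytest.fixture(scope="session")
def whisper_tokenizer():
    # Constructed once for the whole test session, avoiding hundreds of
    # serialized requests to the HuggingFace Hub.
    return AutoTokenizer.from_pretrained("openai/whisper-large-v3")


def test_tokenizer_roundtrip(whisper_tokenizer):
    text = "The quick brown fox"
    ids = whisper_tokenizer.encode(text, add_special_tokens=False)
    assert whisper_tokenizer.decode(ids) == text
```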
The PR is up-to-date with `main`. I split another piece out into its own PR here: #23854

This PR contains everything remaining to make Whisper work in V1. The major design notes I have on the current state are:
This pull request has merge conflicts that must be resolved before it can be merged.
Updated to fix more conflicts. It's working when I test manually, but one of the tests is triggering a memory error. I'm trying to debug that.
_dummy_blk_table_and_slot_mapping())
num_common_prefix_blocks = 0
causal_arg = False
elif isinstance(kv_cache_group_spec.kv_cache_spec,
Can you put all these things into the builder.build of encoder attention and cross attention? Basically, you can still pass the original `common_attn_metadata` that is only correct for decode to the `build` function, and update the attributes that are special for encoder attention and cross attention in the `build`. You can refer to the chunked_local_attention in llama4:
common_attn_metadata = make_local_attention_virtual_batches(
The necessary encoder_inputs can be passed to the builder via common_attn_metadata.
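The shape of this suggestion, as a hypothetical sketch (the dataclass and builder below are invented stand-ins, not vLLM's actual `CommonAttentionMetadata` or builder API): pass the decode-oriented metadata into `build()` and let the cross-attention builder replace only the fields that differ, rather than assembling a parallel object in the model runner.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CommonMetadata:
    # Hypothetical stand-in for the metadata shared across builders.
    seq_lens: list[int]   # per-request lengths (decoder view)
    causal: bool = True


class CrossAttentionBuilder:
    """Builds attention metadata by patching the decode-oriented metadata
    inside build(), in the spirit of make_local_attention_virtual_batches."""

    def build(self, common: CommonMetadata,
              encoder_seq_lens: list[int]) -> CommonMetadata:
        # Cross-attention attends over the encoder output, which is
        # non-causal and has its own per-request lengths.
        return replace(common, seq_lens=encoder_seq_lens, causal=False)


meta = CommonMetadata(seq_lens=[7, 3])
cross = CrossAttentionBuilder().build(meta, encoder_seq_lens=[1500, 1500])
assert cross.causal is False and cross.seq_lens == [1500, 1500]
```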
Right - I had this in my notes here: #21088 (comment)

I just hadn't gotten to it, since it took me a bit to get it all working again. I'm also fixing conflicts almost every day. Right now I'm focused on going through CI failures. I'll keep this on my TODO list.
@@ -3142,7 +3355,11 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
def may_add_encoder_only_layers_to_kv_cache_config(self) -> None:
Suggested change:
- def may_add_encoder_only_layers_to_kv_cache_config(self) -> None:
+ def may_add_encoder_layers_to_kv_cache_config(self) -> None:
This pull request has merge conflicts that must be resolved before it can be merged.
Implements Whisper model support in the V1 engine. Key changes include:

- Add encoder-decoder architecture support with cross-attention KV cache management
- Add CrossAttentionManager and CrossAttentionSpec for encoder-decoder KV cache
- Update scheduler to handle cross-attention block allocation and disable prefix caching
- Modify GPU model runner for encoder input processing and attention metadata
- Disable BART / other enc-dec tests/examples (Whisper-only support for now)
- Optimize test performance and fix various integration issues

This closes a major feature gap between V0 and V1, enabling Whisper transcription in the new engine architecture while maintaining backward compatibility.

Related to V0 deprecation (vllm-project#18571) and 2025 Q3 roadmap (vllm-project#20336).

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
`small` is less reliable and doesn't seem to add much value in making sure whisper is working.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
It causes startup to hang for me with this warning:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Signed-off-by: Russell Bryant <[email protected]>
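For reference, the second option the warning itself suggests looks like the snippet below; this is just a minimal example of setting the variable, not necessarily the fix taken in this commit, and it must run before the `tokenizers` library constructs any tokenizer in the parent process.

```python
import os

# Must be set before HuggingFace `tokenizers` spins up its thread pool,
# i.e. before any tokenizer is created in the process that later forks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```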
Signed-off-by: Russell Bryant <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
v1: Add Whisper encoder-decoder model support
Implements Whisper model support in the V1 engine. Key changes include:
This closes a major feature gap between V0 and V1, enabling Whisper transcription
in the new engine architecture while maintaining backward compatibility.
Related to V0 deprecation (#18571) and 2025 Q3 roadmap (#20336).
Closes #12761
Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>