[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models #23154
Conversation
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a significant and well-structured refactoring for handling attention metadata in encoder-only models. By creating a dedicated `EncoderOnlyAttention` class and treating these layers as a special type of KV cache group, you've successfully eliminated special-cased logic in the `GPUModelRunner`, leading to cleaner and more maintainable code. The changes are consistent across different model implementations. I've identified one critical issue regarding a missing `self` parameter that would cause a runtime error, and a related incorrect type hint. My review comments provide details and suggestions for these issues.
```python
def patch_common_attn_metadata(cm: CommonAttentionMetadata,
                               scheduler_output: SchedulerOutput):
    return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                block_size)
```
The method signature for `patch_common_attn_metadata` is missing the `self` parameter. When this function is assigned to the `AttentionMetadataBuilder` subclass, it becomes a method. When called as `builder.patch_common_attn_metadata(...)`, the `builder` instance is passed as the first argument (`self`). Without `self` in the signature, the `cm` parameter will incorrectly receive the `AttentionMetadataBuilder` instance, which will cause a runtime `AttributeError` when `make_local_attention_virtual_batches` tries to access attributes of `CommonAttentionMetadata` on it.
Suggested change:

```python
def patch_common_attn_metadata(self,
                               cm: CommonAttentionMetadata,
                               scheduler_output: SchedulerOutput):
    return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                block_size)
```
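As a side note, the binding behaviour behind this bug can be reproduced with a few lines of plain Python. The names below are toy stand-ins, not the actual vLLM classes:

```python
class ToyBuilder:
    pass

# A plain function stored on a class becomes a bound method, so the
# instance is silently passed as the first positional argument.
def patch(cm, scheduler_output):
    return type(cm).__name__

def patch_fixed(self, cm, scheduler_output):
    return type(cm).__name__

ToyBuilder.patch = patch
ToyBuilder.patch_fixed = patch_fixed

print(ToyBuilder().patch("metadata"))              # "ToyBuilder" -- wrong object in cm
print(ToyBuilder().patch_fixed("metadata", None))  # "str" -- as intended
```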
vllm/v1/attention/backends/utils.py
Outdated
```python
patch_common_attn_metadata: Callable[
    [CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],
```
The type hint for `patch_common_attn_metadata` is incorrect. It should reflect that the callable will be used as a method on an `AttentionMetadataBuilder` instance, and thus will receive `self` as its first argument. The current type hint `Callable[[CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata]` is missing the `self` parameter. This can be misleading for developers and static analysis tools. A more accurate type hint would include `self`.
Suggested change:

```python
patch_common_attn_metadata: Callable[
    [Any, CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],
```
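For what it's worth, here is a small self-contained example (hypothetical names, unrelated to vLLM's builders) of how the extra leading parameter shows up in a `Callable` annotation for a function that is later attached as a method:

```python
from typing import Any, Callable

def describe(self: Any, suffix: str) -> str:
    # `self` is whatever instance the function gets bound to.
    return type(self).__name__ + suffix

# The annotation includes the instance slot, mirroring the suggestion above.
method_like: Callable[[Any, str], str] = describe

class Widget:
    describe = describe  # attach the module-level function as a method

print(Widget().describe("!"))  # "Widget!"
```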
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only a small subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the methods mentioned in llm2vec
Therefore, there is no clear boundary between decoder-only attention and encoder-only attention, and I think it's better not to introduce two separate AttentionMetadata paths for decoder-only attention and encoder-only attention. Perhaps some details in #20930 would be helpful for this PR. Also see #22637.
@noooop
Signed-off-by: Chen Zhang <[email protected]>
…ctor Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Ready for review now.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
# Attention layers that are only in the KVCacheConfig of the runner
# (e.g., KV sharing, encoder-only attention), but not in the
# KVCacheConfig of the scheduler.
self.runner_only_kv_layers: set[str] = set()
```
nit: maybe we should call these `runner_only_attn_layers`? `runner_only_kv_layers` makes it sound like there's special KV handling, when really these are just layers that need attention metadata to be built.
vllm/v1/worker/gpu_model_runner.py
Outdated
```diff
@@ -3048,6 +3015,8 @@ def _reshape_kv_cache_tensors(
         for kv_cache_spec, group in self._kv_cache_spec_attn_group_iterator():
             attn_backend = group.backend
             for layer_name in group.layer_names:
+                if layer_name in self.runner_only_kv_layers:
```
Actually, do we need this? Maybe we could just skip if `kv_cache_spec.page_size_bytes == 0`; then it'll naturally skip encoder-only layers, and it would make sense to me that a `kv_cache_spec` with a 0 page size means no KV cache.
Because KV sharing also needs this.
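For reference, a toy sketch of the zero-page-size check being discussed; the spec class and layer names here are made up, not vLLM's real `KVCacheSpec`:

```python
from dataclasses import dataclass

@dataclass
class FakeKVCacheSpec:
    page_size_bytes: int

def layers_with_kv_tensors(specs: dict) -> list:
    # A page size of 0 is read as "no KV cache to allocate/reshape",
    # which would naturally skip encoder-only layers.
    return [name for name, spec in specs.items() if spec.page_size_bytes != 0]

print(layers_with_kv_tensors({
    "decoder.0.attn": FakeKVCacheSpec(page_size_bytes=4096),
    "encoder_only.0.attn": FakeKVCacheSpec(page_size_bytes=0),
}))  # ['decoder.0.attn']
```

As the reply above points out, this shortcut alone would not cover KV-sharing layers, which is why the explicit set is kept.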
```python
builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    patch_common_attn_metadata=patch_common_attn_metadata)
```
I agree with #22628 (comment); I think we should just do that instead of `patch_common_attn_metadata`. It might be a bit more verbose in this case, but I agree with you that as things get more complicated the abstraction will stay cleaner.
Thanks for doing this! This is looking much, much better!
```diff
         )
-        self.attn = Attention(
+        attn_cls = (EncoderOnlyAttention
+                    if attn_type == AttentionType.ENCODER_ONLY else Attention)
+        self.attn = attn_cls(
             self.num_heads,
```
As I mentioned earlier, any model that uses a decoder-only LLM can be converted into encoder-only attention using an unsupervised method. (It is very easy to use, and the improvement is significant.) So over time, an increasing number of models will need to add this line of code.
Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the methods mentioned in llm2vec:
we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
Tencent's Conan-Embedding-V2 release, which topped the MTEB Chinese and English leaderboards:
SoftMask
...
The results show that in the initial stage, the loss with the soft mask decreases more slowly than without it; however, the final loss with the soft mask is lower. This indicates that the soft-mask approach lets the model learn more comprehensive feature representations early in training.
Do we really need to add `EncoderOnlyAttention`?
@noooop For #20930, should (decoder / encoder_only) be orthogonal to pooling? I thought encoder_only refers to layers with bidirectional attention, so we can't do prefix caching and chunked prefill. For #22637, whether sliding window is enabled is also orthogonal to the attention type. In the encoder-only case, attention backends can handle it by passing different window sizes to the attention kernels, and the engine doesn't need to be aware of the difference.
These two aspects may or may not need this PR to take care of them. Sorry for confusing you.
But during serving, should it always be either decoder or encoder-only? To make a model support both encoder_only mode and decoder mode, you can see what I did on llama and qwen in this PR.
> over time, an increasing number of models will need to add this line of code

And since the `EncoderOnlyAttention` and `Attention` interfaces should be exactly the same, why do we need to use `EncoderOnlyAttention`?
(My point is that the `EncoderOnlyAttention` functionality should become part of `Attention`, activated by `attn_type == AttentionType.ENCODER_ONLY`. This way, we only need a single `Attention` interface.)
> over time, an increasing number of models will need to add this line of code

I don't quite agree. I think the main goal of vLLM is decoder-only models, so we won't add this line to more models. If you want a specific model to be encoder-only, you can define it as an out-of-tree model.
@LucasWilkinson WDYT?
@noooop Even if we keep the attention interfaces the same, the model definitions would need to be updated to include:

vllm/vllm/model_executor/models/qwen3.py (lines 184 to 187 in 7be5d11):

```python
if getattr(config, "is_causal", True):
    attn_type = AttentionType.DECODER
else:
    attn_type = AttentionType.ENCODER_ONLY
```
@noooop The context is that we are overhauling a lot of the different attention layers in vLLM to make them more pluggable and backend-agnostic, as well as to move away from bloating the `Attention` class, attention backends, and/or the GPU model runner with all the different schemes (a source of merge conflicts and technical debt). For this reason we are moving to more specific attention subclasses instead of flags in attention; for example, #21588 moves from using a `use_irope` flag on `Attention` to a `ChunkedLocalAttention` layer.
With that being said, since we do have 3 models already (qwen2, qwen3 and llama) with this dual decoder-only / encoder-only support, and more may come, I could see how in this specific case it could make sense to roll it into the `Attention` class. I think this would be one of the few exceptions to our general preference for attention layer subclasses, though. @heheda12345 I think this would be ok; but as the author I'll ultimately leave the decision up to you. I agree with you that decoder-only models are the priority for vLLM.
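To illustrate the general direction described above (a subclass per attention variant instead of behaviour flags), here is a compact, self-contained sketch with made-up classes; it is not vLLM's actual `Attention` hierarchy:

```python
class BaseAttentionSketch:
    causal = True

    def __init__(self, num_heads: int):
        self.num_heads = num_heads

class EncoderOnlyAttentionSketch(BaseAttentionSketch):
    # Variant-specific behaviour lives in the subclass rather than in an
    # ever-growing set of flags on the base class.
    causal = False

def make_attention(bidirectional: bool, num_heads: int) -> BaseAttentionSketch:
    cls = EncoderOnlyAttentionSketch if bidirectional else BaseAttentionSketch
    return cls(num_heads)

attn = make_attention(bidirectional=True, num_heads=8)
print(type(attn).__name__, attn.causal)  # EncoderOnlyAttentionSketch False
```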
After careful consideration, introducing `EncoderOnlyAttention` does indeed have some advantages, and I am satisfied with this modification.
vLLM has too many jump wires; removing one `attn_type` jump wire is always good.
Thank you for your refactoring.
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…ctor Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Can you take another look?
Signed-off-by: Chen Zhang <[email protected]>
```python
def build(self,
          common_prefix_len: int,
          common_attn_metadata: CommonAttentionMetadata,
          fast_build: bool = False) -> AttentionMetadata:
    new_common_attn_metadata = copy(common_attn_metadata)
    new_common_attn_metadata.causal = False
    return super(self.__class__,
                 self).build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    build=build)
attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=builder_cls)
```
Suggested change:

```python
underlying_builder = underlying_attn_backend.get_builder_cls()

class Builder(underlying_builder):

    def build(self,
              common_prefix_len: int,
              common_attn_metadata: CommonAttentionMetadata,
              fast_build: bool = False) -> AttentionMetadata:
        new_common_attn_metadata = copy(common_attn_metadata)
        new_common_attn_metadata.causal = False
        return super().build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=Builder)
```
Alternative per our discussion in slack; confirmed this works fine with the caching PR
Yes, this is easier. Changed, but I prefer `EncoderOnlyAttentionBuilder` over `Builder`.
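For context, here is a standalone comparison (toy classes, not the actual vLLM builders) of the two patterns discussed above: a module-level function injected into a dynamically created subclass has no `__class__` cell and cannot use the zero-argument `super()`, whereas an explicitly defined subclass can:

```python
class BaseBuilderToy:
    def build(self) -> str:
        return "base"

# Pattern 1: function injected via type(); it must name the class in super()
# (zero-argument super() here would raise "super(): __class__ cell not found").
# Writing super(self.__class__, self) instead also works, but would recurse
# forever if this class were subclassed again.
def injected_build(self) -> str:
    return "non-causal+" + super(DynamicBuilderToy, self).build()

DynamicBuilderToy = type("DynamicBuilderToy", (BaseBuilderToy,),
                         {"build": injected_build})

# Pattern 2: explicit subclass, as in the suggestion above; plain super() works.
class ExplicitBuilderToy(BaseBuilderToy):
    def build(self) -> str:
        return "non-causal+" + super().build()

print(DynamicBuilderToy().build())   # non-causal+base
print(ExplicitBuilderToy().build())  # non-causal+base
```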
LGTM! Thank you!! Left a couple final comments
Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Can't pass the type checker, and ChatGPT suggests going back: https://chatgpt.com/share/68a6d863-b91c-800f-923b-ff8e61040cea. Will try more tomorrow.
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Chen Zhang <[email protected]>
…ctor Signed-off-by: Chen Zhang <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: root <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: FFFfff1FFFfff <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]>
Purpose
Clean up attention metadata preparation for encoder-only models. Prepare a cleaner code base for encoder-decoder models.
Test Plan
Test an attention-free model by
Test Result
Can pass
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.