
Conversation

heheda12345 (Collaborator) commented Aug 19, 2025

Purpose

Clean up the attention metadata preparation for encoder-only models, and prepare a cleaner code base for encoder-decoder support.

Test Plan

Test an encoder-only model by running:

pytest -vs test_gte.py::test_rerank_models_mteb

Test Result

The test passes.

(Optional) Documentation Update


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Chen Zhang <[email protected]>
heheda12345 (Collaborator, Author)

CC @LucasWilkinson

@mergify mergify bot added the llama (Related to Llama models), qwen (Related to Qwen models), and v1 labels Aug 19, 2025

mergify bot commented Aug 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 19, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and well-structured refactoring for handling attention metadata in encoder-only models. By creating a dedicated EncoderOnlyAttention class and treating these layers as a special type of KV cache group, you've successfully eliminated special-cased logic in the GPUModelRunner, leading to cleaner and more maintainable code. The changes are consistent across different model implementations. I've identified one critical issue regarding a missing self parameter that would cause a runtime error, and a related incorrect type hint. My review comments provide details and suggestions for these issues.

Comment on lines 28 to 31
def patch_common_attn_metadata(cm: CommonAttentionMetadata,
                               scheduler_output: SchedulerOutput):
    return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                block_size)

critical

The method signature for patch_common_attn_metadata is missing the self parameter. When this function is assigned to the AttentionMetadataBuilder subclass, it becomes a method. When called as builder.patch_common_attn_metadata(...), the builder instance is passed as the first argument (self). Without self in the signature, the cm parameter will incorrectly receive the AttentionMetadataBuilder instance, which will cause a runtime AttributeError when make_local_attention_virtual_batches tries to access attributes of CommonAttentionMetadata on it.

Suggested change

-def patch_common_attn_metadata(cm: CommonAttentionMetadata,
-                               scheduler_output: SchedulerOutput):
+def patch_common_attn_metadata(self,
+                               cm: CommonAttentionMetadata,
+                               scheduler_output: SchedulerOutput):
     return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                 block_size)
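For illustration, a minimal, self-contained sketch of the failure mode described above. The names (Builder, patch, patch_with_self) are hypothetical and not part of vLLM; the example only demonstrates how Python binds a plain function that is attached to a class:

class Builder:
    pass

def patch(cm, scheduler_output=None):
    # Intended to receive the metadata object in `cm`.
    return cm

# Attaching a plain function to a class turns it into a method, so the
# instance is passed as the first positional argument on every call.
Builder.patch = patch
b = Builder()
print(b.patch("metadata"))  # prints the Builder instance, not "metadata"

def patch_with_self(self, cm, scheduler_output=None):
    # With `self` in the signature, `cm` really is the metadata object.
    return cm

Builder.patch = patch_with_self
print(b.patch("metadata"))  # prints "metadata"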

Comment on lines 550 to 551
patch_common_attn_metadata: Callable[
    [CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],

high

The type hint for patch_common_attn_metadata is incorrect. It should reflect that it will be used as a method on an AttentionMetadataBuilder instance, and thus will receive self as its first argument. The current type hint Callable[[CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata] is missing the self parameter. This can be misleading for developers and static analysis tools. A more accurate type hint would include self.

Suggested change

 patch_common_attn_metadata: Callable[
-    [CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],
+    [Any, CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],

Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

noooop (Contributor) commented Aug 19, 2025

Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the method described in LLM2Vec:

we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.

Therefore, there is no clear boundary between decoder-only Attention and encoder-only Attention, and I think it's better not to introduce two separate AttentionMetadata sections for decoder-only Attention and encoder-only Attention.

Perhaps some details from #20930 would be helpful for this PR.

See also #22637.

heheda12345 (Collaborator, Author)

@noooop
For #20930, should (decoder/encoder_only) be orthogonal to pooling? I thought encoder_only refers to layers with bidirectional attention, so we can't do prefix caching or chunked prefill.
For #22637, whether sliding window is enabled is also orthogonal to the attention type. In the encoder-only case, attention backends can handle it by passing a different window size to the attention kernels, and the engine doesn't need to be aware of the difference.

@heheda12345 heheda12345 marked this pull request as ready for review August 19, 2025 20:54
@mergify mergify bot removed the needs-rebase label Aug 19, 2025
heheda12345 (Collaborator, Author)

@LucasWilkinson Ready for review now

# Attention layers that are only in the KVCacheConfig of the runner
# (e.g., KV sharing, encoder-only attention), but not in the
# KVCacheConfig of the scheduler.
self.runner_only_kv_layers: set[str] = set()
Collaborator

nit: maybe we should call these runner_only_attn_layers? runner_only_kv_layers makes it sound like there's special KV handling, when really these are just layers that need attention metadata to be built.

@@ -3048,6 +3015,8 @@ def _reshape_kv_cache_tensors(
for kv_cache_spec, group in self._kv_cache_spec_attn_group_iterator():
    attn_backend = group.backend
    for layer_name in group.layer_names:
        if layer_name in self.runner_only_kv_layers:
LucasWilkinson (Collaborator) commented Aug 19, 2025

Actually, do we need this? Maybe we could just skip when kv_cache_spec.page_size_bytes == 0; then it would naturally skip encoder-only layers. It would make sense to me that a kv_cache_spec with a page size of 0 means no KV cache.

heheda12345 (Collaborator, Author)

Because KV sharing also needs this.

builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    patch_common_attn_metadata=patch_common_attn_metadata)
Collaborator

I agree with #22628 (comment); I think we should just do that instead of patch_common_attn_metadata. It might be a bit more verbose in this case, but I agree with you that as things get more complicated the abstraction will stay cleaner.

LucasWilkinson (Collaborator) left a comment

Thanks for doing this! This is looking much, much better!

Comment on lines 162 to 166
         )
-        self.attn = Attention(
+        attn_cls = (EncoderOnlyAttention
+                    if attn_type == AttentionType.ENCODER_ONLY else Attention)
+        self.attn = attn_cls(
             self.num_heads,
noooop (Contributor) commented Aug 20, 2025

As I mentioned earlier, any decoder-only LLM can be converted to encoder-only attention using an unsupervised method. (It is very easy to use and the improvement is significant, so over time an increasing number of models will need to add this line of code.)

Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the method described in LLM2Vec:

we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.

Tencent's Conan-Embedding-V2 has been released and tops the MTEB Chinese and English leaderboards.
SoftMask
...
The results show that, in the initial stage, the loss with the soft mask decreases more slowly than without it. However, the final loss with the soft mask is lower. This indicates that the soft-mask method lets the model learn more comprehensive feature representations early in training.

Do we really need to add EncoderOnlyAttention?


@noooop For #20930, should (decoder/encoder_only) be orthogonal to pooling? I thought encoder_only refers to layers with bidirectional attention, so we can't do prefix caching or chunked prefill. For #22637, whether sliding window is enabled is also orthogonal to the attention type. In the encoder-only case, attention backends can handle it by passing a different window size to the attention kernels, and the engine doesn't need to be aware of the difference.

These two aspects may or may not need to be handled by this PR. Sorry for confusing you.

heheda12345 (Collaborator, Author)

But during serving, should it always be either decoder or encoder-only? To make a model support both encoder_only mode and decoder mode, you can see what I did on llama and qwen in this PR.

noooop (Contributor) commented Aug 20, 2025

Over time, an increasing number of models will need to add this line of code.

Also, if the EncoderOnlyAttention and Attention interfaces should be exactly the same, why do we need to use EncoderOnlyAttention?

(My point is that the EncoderOnlyAttention functionality should become part of Attention and be activated by attn_type == AttentionType.ENCODER_ONLY. This way, we only need a single Attention interface.)
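For illustration, a minimal sketch of the single-interface alternative being proposed here: Attention itself switches on attn_type instead of models instantiating a separate EncoderOnlyAttention subclass. This is a hypothetical, heavily simplified stand-in, not vLLM's actual Attention or AttentionType API:

from enum import Enum

class AttentionType(Enum):
    DECODER = "decoder"
    ENCODER_ONLY = "encoder_only"

class Attention:
    def __init__(self, num_heads: int,
                 attn_type: AttentionType = AttentionType.DECODER):
        self.num_heads = num_heads
        # A single flag selects causal vs. bidirectional masking; no separate
        # EncoderOnlyAttention subclass is needed at the model-definition level.
        self.causal = attn_type != AttentionType.ENCODER_ONLY

attn = Attention(num_heads=16, attn_type=AttentionType.ENCODER_ONLY)
assert attn.causal is False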

heheda12345 (Collaborator, Author)

Over time, an increasing number of models will need to add this line of code.

I don't quite agree. I think the main goal of vLLM is decoder-only models, so we won't add this line to more models. If you want a specific model to be encoder-only, you can define it as an out-of-tree model.
@LucasWilkinson WDYT?

LucasWilkinson (Collaborator) commented Aug 21, 2025

@noooop Even if we keep the attention interfaces the same, the model definitions would need to be updated to include

if getattr(config, "is_causal", True):
    attn_type = AttentionType.DECODER
else:
    attn_type = AttentionType.ENCODER_ONLY

anyway, so I don't think there's a huge difference between having to add 5 vs. 4 lines to enable this.

@noooop The context is that we are overhauling a lot of the different attention layers in vLLM to make them more pluggable and backend-agnostic, as well as to move away from bloating the Attention class, attention backends, and/or gpu-model-runner with all the different schemes (a source of merge conflicts and technical debt). For this reason we are moving to more specific attention subclasses instead of flags on Attention; for example, #21588 moves from a use_irope flag on Attention to a ChunkedLocalAttention layer.

That being said, since we already have 3 models (qwen2, qwen3, and llama) with this dual decoder-only/encoder-only support, and many more may come, I could see how in this specific case it could make sense to roll it into the Attention class. I think this would be one of the few exceptions to our general preference for attention-layer subclasses, though. @heheda12345 I think this would be OK, but as the author I'll ultimately leave the decision up to you. I agree with you that decoder-only models are the priority for vLLM.

noooop (Contributor) commented Aug 21, 2025

After careful consideration, introducing EncoderOnlyAttention does indeed have some advantages, and I am satisfied with this modification.

vLLM has too many jump wires; removing one attn_type jump wire is always good.

Thank you for your refactoring.

Signed-off-by: Chen Zhang <[email protected]>

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 20, 2025
@mergify mergify bot removed the needs-rebase label Aug 20, 2025
heheda12345 (Collaborator, Author)

@LucasWilkinson Can you take another look?

Comment on lines 25 to 42
def build(self,
          common_prefix_len: int,
          common_attn_metadata: CommonAttentionMetadata,
          fast_build: bool = False) -> AttentionMetadata:
    new_common_attn_metadata = copy(common_attn_metadata)
    new_common_attn_metadata.causal = False
    return super(self.__class__,
                 self).build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    build=build)
attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=builder_cls)
Collaborator

Suggested change

-def build(self,
-          common_prefix_len: int,
-          common_attn_metadata: CommonAttentionMetadata,
-          fast_build: bool = False) -> AttentionMetadata:
-    new_common_attn_metadata = copy(common_attn_metadata)
-    new_common_attn_metadata.causal = False
-    return super(self.__class__,
-                 self).build(common_prefix_len, new_common_attn_metadata,
-                             fast_build)
-
-builder_cls = subclass_attention_metadata_builder(
-    name_prefix=prefix,
-    builder_cls=underlying_attn_backend.get_builder_cls(),
-    build=build)
-attn_backend = subclass_attention_backend(
-    name_prefix=prefix,
-    attention_backend_cls=underlying_attn_backend,
-    builder_cls=builder_cls)
+underlying_builder = underlying_attn_backend.get_builder_cls()
+
+class Builder(underlying_builder):
+
+    def build(self,
+              common_prefix_len: int,
+              common_attn_metadata: CommonAttentionMetadata,
+              fast_build: bool = False) -> AttentionMetadata:
+        new_common_attn_metadata = copy(common_attn_metadata)
+        new_common_attn_metadata.causal = False
+        return super().build(common_prefix_len, new_common_attn_metadata,
+                             fast_build)
+
+attn_backend = subclass_attention_backend(
+    name_prefix=prefix,
+    attention_backend_cls=underlying_attn_backend,
+    builder_cls=Builder)

Alternative per our discussion on Slack; confirmed this works fine with the caching PR.

heheda12345 (Collaborator, Author)

Yes, this is easier. Changed, but I prefer EncoderOnlyAttentionBuilder over Builder.
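For reference, a sketch of the shape this lands on after the rename, assuming the same surrounding names used in the suggestion above (underlying_attn_backend, subclass_attention_backend, CommonAttentionMetadata, AttentionMetadata, prefix); only the local class name differs:

from copy import copy

underlying_builder = underlying_attn_backend.get_builder_cls()

class EncoderOnlyAttentionBuilder(underlying_builder):

    def build(self,
              common_prefix_len: int,
              common_attn_metadata: CommonAttentionMetadata,
              fast_build: bool = False) -> AttentionMetadata:
        # Encoder-only layers use bidirectional attention, so mark the shared
        # metadata as non-causal before delegating to the underlying builder.
        new_common_attn_metadata = copy(common_attn_metadata)
        new_common_attn_metadata.causal = False
        return super().build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=EncoderOnlyAttentionBuilder)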

LucasWilkinson (Collaborator) left a comment

LGTM! Thank you! I left a couple of final comments.

@heheda12345 heheda12345 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 21, 2025
heheda12345 (Collaborator, Author)

@LucasWilkinson It can't pass the type checker, and ChatGPT suggests going back (https://chatgpt.com/share/68a6d863-b91c-800f-923b-ff8e61040cea). Will try more tomorrow.

Signed-off-by: Chen Zhang <[email protected]>

mergify bot commented Aug 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 21, 2025
@mergify mergify bot removed the needs-rebase label Aug 21, 2025
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) August 22, 2025 01:22
@LucasWilkinson LucasWilkinson merged commit 17373dc into vllm-project:main Aug 22, 2025
44 checks passed
Xu-Wenqing pushed a commit to Xu-Wenqing/vllm that referenced this pull request Aug 23, 2025
FFFfff1FFFfff pushed a commit to FFFfff1FFFfff/my_vllm that referenced this pull request Aug 25, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Labels: llama (Related to Llama models), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), v1