[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models #23154
Conversation
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a significant and well-structured refactoring for handling attention metadata in encoder-only models. By creating a dedicated `EncoderOnlyAttention` class and treating these layers as a special type of KV cache group, you've successfully eliminated special-cased logic in the `GPUModelRunner`, leading to cleaner and more maintainable code. The changes are consistent across different model implementations. I've identified one critical issue regarding a missing `self` parameter that would cause a runtime error, and a related incorrect type hint. My review comments provide details and suggestions for these issues.
```python
def patch_common_attn_metadata(cm: CommonAttentionMetadata,
                               scheduler_output: SchedulerOutput):
    return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                block_size)
```
The method signature for `patch_common_attn_metadata` is missing the `self` parameter. When this function is assigned to the `AttentionMetadataBuilder` subclass, it becomes a method. When called as `builder.patch_common_attn_metadata(...)`, the `builder` instance is passed as the first argument (`self`). Without `self` in the signature, the `cm` parameter will incorrectly receive the `AttentionMetadataBuilder` instance, which will cause a runtime `AttributeError` when `make_local_attention_virtual_batches` tries to access attributes of `CommonAttentionMetadata` on it.
Suggested change:

```python
def patch_common_attn_metadata(self,
                               cm: CommonAttentionMetadata,
                               scheduler_output: SchedulerOutput):
    return make_local_attention_virtual_batches(attention_chunk_size, cm,
                                                block_size)
```
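As a side note, the binding behaviour behind this bug can be reproduced with a few lines of plain Python. The names below are toy stand-ins, not the actual vLLM classes:

```python
class ToyBuilder:
    pass

# A plain function stored on a class becomes a bound method, so the
# instance is silently passed as the first positional argument.
def patch(cm, scheduler_output):
    return type(cm).__name__

def patch_fixed(self, cm, scheduler_output):
    return type(cm).__name__

ToyBuilder.patch = patch
ToyBuilder.patch_fixed = patch_fixed

print(ToyBuilder().patch("metadata"))              # "ToyBuilder" -- wrong object in cm
print(ToyBuilder().patch_fixed("metadata", None))  # "str" -- as intended
```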
vllm/v1/attention/backends/utils.py
Outdated
```python
patch_common_attn_metadata: Callable[
    [CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],
```
The type hint for `patch_common_attn_metadata` is incorrect. It should reflect that the callable will be used as a method on an `AttentionMetadataBuilder` instance, and thus will receive `self` as its first argument. The current type hint `Callable[[CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata]` is missing the `self` parameter. This can be misleading for developers and static analysis tools. A more accurate type hint would include `self`.
Suggested change:

```python
patch_common_attn_metadata: Callable[
    [Any, CommonAttentionMetadata, "SchedulerOutput"], CommonAttentionMetadata],
```
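For what it's worth, here is a small self-contained example (hypothetical names, unrelated to vLLM's builders) of how the extra leading parameter shows up in a `Callable` annotation for a function that is later attached as a method:

```python
from typing import Any, Callable

def describe(self: Any, suffix: str) -> str:
    # `self` is whatever instance the function gets bound to.
    return type(self).__name__ + suffix

# The annotation includes the instance slot, mirroring the suggestion above.
method_like: Callable[[Any, str], str] = describe

class Widget:
    describe = describe  # attach the module-level function as a method

print(Widget().describe("!"))  # "Widget!"
```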
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only a small subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the methods mentioned in llm2vec
Therefore, there is no clear boundary between decoder-only attention and encoder-only attention, and I think it's better not to introduce two separate AttentionMetadata paths for decoder-only attention and encoder-only attention. Perhaps some details in #20930 would be helpful for this PR. Also see #22637.
@noooop
Signed-off-by: Chen Zhang <[email protected]>
…ctor Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Ready for review now.
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
# Attention layers that are only in the KVCacheConfig of the runner
# (e.g., KV sharing, encoder-only attention), but not in the
# KVCacheConfig of the scheduler.
self.runner_only_kv_layers: set[str] = set()
```
nit: maybe we should call these `runner_only_attn_layers`? `runner_only_kv_layers` makes it sound like there's special KV handling, when really these are just layers that need attention metadata to be built.
vllm/v1/worker/gpu_model_runner.py
Outdated
```diff
@@ -3048,6 +3015,8 @@ def _reshape_kv_cache_tensors(
         for kv_cache_spec, group in self._kv_cache_spec_attn_group_iterator():
             attn_backend = group.backend
             for layer_name in group.layer_names:
+                if layer_name in self.runner_only_kv_layers:
```
Actually, do we need this? Maybe we could just skip if `kv_cache_spec.page_size_bytes == 0`; then it'll naturally skip encoder-only layers, and it would make sense to me that a `kv_cache_spec` with a 0 page size means no KV cache.
Because KV sharing also needs this.
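For reference, a toy sketch of the zero-page-size check being discussed; the spec class and layer names here are made up, not vLLM's real `KVCacheSpec`:

```python
from dataclasses import dataclass

@dataclass
class FakeKVCacheSpec:
    page_size_bytes: int

def layers_with_kv_tensors(specs: dict) -> list:
    # A page size of 0 is read as "no KV cache to allocate/reshape",
    # which would naturally skip encoder-only layers.
    return [name for name, spec in specs.items() if spec.page_size_bytes != 0]

print(layers_with_kv_tensors({
    "decoder.0.attn": FakeKVCacheSpec(page_size_bytes=4096),
    "encoder_only.0.attn": FakeKVCacheSpec(page_size_bytes=0),
}))  # ['decoder.0.attn']
```

As the reply above points out, this shortcut alone would not cover KV-sharing layers, which is why the explicit set is kept.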
```python
builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    patch_common_attn_metadata=patch_common_attn_metadata)
```
I agree with #22628 (comment); I think we should just do that instead of `patch_common_attn_metadata`. It might be a bit more verbose in this case, but I agree with you that as things get more complicated the abstraction will stay cleaner.
Thanks for doing this! This is looking much, much better!
```diff
         )
-        self.attn = Attention(
+        attn_cls = (EncoderOnlyAttention
+                    if attn_type == AttentionType.ENCODER_ONLY else Attention)
+        self.attn = attn_cls(
             self.num_heads,
```
As I mentioned earlier, any model that uses a decoder-only LLM can be converted into encoder-only attention using an unsupervised method. (It is very easy to use, and the improvement is significant.) So over time, an increasing number of models will need to add this line of code.
Alibaba-NLP/gte-Qwen2-1.5B-instruct uses the methods mentioned in llm2vec:
we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
Tencent's Conan-Embedding-V2 release, which topped the MTEB Chinese and English leaderboards:
SoftMask
...
The results show that in the initial stage, the loss with the soft mask decreases more slowly than without it; however, the final loss with the soft mask is lower. This indicates that the soft-mask approach lets the model learn more comprehensive feature representations early in training.
Do we really need to add `EncoderOnlyAttention`?
@noooop For #20930, should (decoder / encoder_only) be orthogonal to pooling? I thought encoder_only refers to layers with bidirectional attention, so we can't do prefix caching and chunked prefill. For #22637, whether sliding window is enabled is also orthogonal to the attention type. In the encoder-only case, attention backends can handle it by passing different window sizes to the attention kernels, and the engine doesn't need to be aware of the difference.
These two aspects may or may not need this PR to take care of them. Sorry for confusing you.
But during serving, should it always be either decoder or encoder-only? To make a model support both encoder_only mode and decoder mode, you can see what I did on llama and qwen in this PR.
> over time, an increasing number of models will need to add this line of code

And since the `EncoderOnlyAttention` and `Attention` interfaces should be exactly the same, why do we need to use `EncoderOnlyAttention`?
(My point is that the `EncoderOnlyAttention` functionality should become part of `Attention`, activated by `attn_type == AttentionType.ENCODER_ONLY`. This way, we only need a single `Attention` interface.)
> over time, an increasing number of models will need to add this line of code

I don't quite agree. I think the main goal of vLLM is decoder-only models, so we won't add this line to more models. If you want a specific model to be encoder-only, you can define it as an out-of-tree model.
@LucasWilkinson WDYT?
@noooop Even if we keep the attention interfaces the same, the model definitions would need to be updated to include:

vllm/vllm/model_executor/models/qwen3.py (lines 184 to 187 in 7be5d11):

```python
if getattr(config, "is_causal", True):
    attn_type = AttentionType.DECODER
else:
    attn_type = AttentionType.ENCODER_ONLY
```
@noooop The context is that we are overhauling a lot of the different attention layers in vLLM to make them more pluggable and backend-agnostic, as well as to move away from bloating the `Attention` class, attention backends, and/or the GPU model runner with all the different schemes (a source of merge conflicts and technical debt). For this reason we are moving to more specific attention subclasses instead of flags in attention; for example, #21588 moves from using a `use_irope` flag on `Attention` to a `ChunkedLocalAttention` layer.
With that being said, since we do have 3 models already (qwen2, qwen3 and llama) with this dual decoder-only / encoder-only support, and more may come, I could see how in this specific case it could make sense to roll it into the `Attention` class. I think this would be one of the few exceptions to our general preference for attention layer subclasses, though. @heheda12345 I think this would be ok; but as the author I'll ultimately leave the decision up to you. I agree with you that decoder-only models are the priority for vLLM.
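To illustrate the general direction described above (a subclass per attention variant instead of behaviour flags), here is a compact, self-contained sketch with made-up classes; it is not vLLM's actual `Attention` hierarchy:

```python
class BaseAttentionSketch:
    causal = True

    def __init__(self, num_heads: int):
        self.num_heads = num_heads

class EncoderOnlyAttentionSketch(BaseAttentionSketch):
    # Variant-specific behaviour lives in the subclass rather than in an
    # ever-growing set of flags on the base class.
    causal = False

def make_attention(bidirectional: bool, num_heads: int) -> BaseAttentionSketch:
    cls = EncoderOnlyAttentionSketch if bidirectional else BaseAttentionSketch
    return cls(num_heads)

attn = make_attention(bidirectional=True, num_heads=8)
print(type(attn).__name__, attn.causal)  # EncoderOnlyAttentionSketch False
```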
After careful consideration, introducing `EncoderOnlyAttention` does indeed have some advantages, and I am satisfied with this modification.
vLLM has too many jump wires; removing one `attn_type` jump wire is always good.
Thank you for your refactoring.
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…ctor Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Can you take another look?
Signed-off-by: Chen Zhang <[email protected]>
```python
def build(self,
          common_prefix_len: int,
          common_attn_metadata: CommonAttentionMetadata,
          fast_build: bool = False) -> AttentionMetadata:
    new_common_attn_metadata = copy(common_attn_metadata)
    new_common_attn_metadata.causal = False
    return super(self.__class__,
                 self).build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

builder_cls = subclass_attention_metadata_builder(
    name_prefix=prefix,
    builder_cls=underlying_attn_backend.get_builder_cls(),
    build=build)
attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=builder_cls)
```
Suggested change:

```python
underlying_builder = underlying_attn_backend.get_builder_cls()

class Builder(underlying_builder):

    def build(self,
              common_prefix_len: int,
              common_attn_metadata: CommonAttentionMetadata,
              fast_build: bool = False) -> AttentionMetadata:
        new_common_attn_metadata = copy(common_attn_metadata)
        new_common_attn_metadata.causal = False
        return super().build(common_prefix_len, new_common_attn_metadata,
                             fast_build)

attn_backend = subclass_attention_backend(
    name_prefix=prefix,
    attention_backend_cls=underlying_attn_backend,
    builder_cls=Builder)
```
Alternative per our discussion in slack; confirmed this works fine with the caching PR
Yes, this is easier. Changed, but I prefer `EncoderOnlyAttentionBuilder` over `Builder`.
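For context, here is a standalone comparison (toy classes, not the actual vLLM builders) of the two patterns discussed above: a module-level function injected into a dynamically created subclass has no `__class__` cell and cannot use the zero-argument `super()`, whereas an explicitly defined subclass can:

```python
class BaseBuilderToy:
    def build(self) -> str:
        return "base"

# Pattern 1: function injected via type(); it must name the class in super()
# (zero-argument super() here would raise "super(): __class__ cell not found").
# Writing super(self.__class__, self) instead also works, but would recurse
# forever if this class were subclassed again.
def injected_build(self) -> str:
    return "non-causal+" + super(DynamicBuilderToy, self).build()

DynamicBuilderToy = type("DynamicBuilderToy", (BaseBuilderToy,),
                         {"build": injected_build})

# Pattern 2: explicit subclass, as in the suggestion above; plain super() works.
class ExplicitBuilderToy(BaseBuilderToy):
    def build(self) -> str:
        return "non-causal+" + super().build()

print(DynamicBuilderToy().build())   # non-causal+base
print(ExplicitBuilderToy().build())  # non-causal+base
```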
LGTM! Thank you!! Left a couple final comments
Signed-off-by: Chen Zhang <[email protected]>
@LucasWilkinson Can't pass the type checker, and ChatGPT suggests going back: https://chatgpt.com/share/68a6d863-b91c-800f-923b-ff8e61040cea. Will try more tomorrow.
Signed-off-by: Chen Zhang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Chen Zhang <[email protected]>
…ctor Signed-off-by: Chen Zhang <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: root <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: FFFfff1FFFfff <[email protected]>
…odels (vllm-project#23154) Signed-off-by: Chen Zhang <[email protected]>
Purpose
Clean up attention metadata preparation for encoder-only models. Prepare a cleaner code base for encoder-decoder models.
Test Plan
Test an attention-free model by
Test Result
Can pass
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.