
Conversation

@therealnaveenkamal commented Sep 17, 2025

Purpose

This PR implements the first step of #24620 by separating Multi-Head Latent Attention into its own dedicated AttentionLayerBase subclass.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@therealnaveenkamal changed the title from "Separate MLAAttention class from, Attention (needs Review)" to "Separate MLAAttention class from Attention (needs Review)" on Sep 17, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the Multi-Head Latent Attention (MLA) logic out of the generic Attention class and into a new, dedicated MLAAttention class. This is a good step towards better code organization and separation of concerns. The changes in vllm/attention/layer.py and vllm/model_executor/layers/mla.py correctly remove the old MLA logic and adopt the new class. However, the new MLAAttention class in vllm/model_executor/layers/mla_attention.py has critical implementation issues. It fails to properly instantiate and call the attention backend, and it lacks the necessary integration with the KV cache and attention metadata management. These issues will prevent the MLA feature from functioning. I've left detailed comments on how to address these critical problems.

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
@ProExpertProg (Collaborator) left a comment

A few minor notes

    k_pe,
    output_shape=(hidden_states.shape[0],
                  self.num_heads * self.v_head_dim))
return self.o_proj(attn_out)[0]
Collaborator:

I think we want to keep the abstraction where the MLAAttentionLayer does not handle its own rope, qkv_proj, o_proj, etc.

Author:

@ProExpertProg I've made changes to this; MLAAttention.forward() now takes care of it. Correct me if I'm wrong.

kv_c_normed = key # normalized KV cache
k_pe = value.unsqueeze(1) if value.dim() == 2 else value

attn_out = self.impl.forward(
Collaborator:

We need to wrap this in a custom op. Could you make unified_mla_attention/unified_mla_attention_with_output custom op(s) and add them to the splitting ops by default, etc.?

Collaborator:

(still respect the use_direct_call from the backend/platform)
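(For reference: the existing unified_attention ops are registered as PyTorch custom ops that look the layer up by name in the forward context. A minimal sketch of how a unified_mla_attention op could mirror that pattern is below; it uses plain torch.library for self-containedness rather than vLLM's own registration helper, and the op signature and the impl.forward call are assumptions, not this PR's final API.)

```python
import torch
from torch.library import custom_op

from vllm.forward_context import get_forward_context


@custom_op("vllm::unified_mla_attention", mutates_args=())
def unified_mla_attention(
    q: torch.Tensor,
    kv_c_normed: torch.Tensor,
    k_pe: torch.Tensor,
    layer_name: str,
) -> torch.Tensor:
    # Look up the MLA layer registered under this name, plus its per-layer
    # attention metadata and KV cache, from the current forward context.
    forward_context = get_forward_context()
    attn_metadata = forward_context.attn_metadata
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata[layer_name]
    self = forward_context.no_compile_layers[layer_name]
    kv_cache = self.kv_cache[forward_context.virtual_engine]
    # The exact impl.forward signature is an assumption in this sketch.
    return self.impl.forward(self, q, kv_c_normed, k_pe, kv_cache,
                             attn_metadata)


@unified_mla_attention.register_fake
def _(q, kv_c_normed, k_pe, layer_name):
    # Fake (meta) implementation so torch.compile can trace through the op.
    # The real output shape depends on num_heads * v_head_dim; this is only
    # a placeholder.
    return torch.empty_like(q)
```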

@MatthewBonanni (Contributor) commented Sep 18, 2025:

@LucasWilkinson Should we make a vllm/model_executor/layers/mla folder containing this file and mla.py?

Collaborator:

I think we should just put this code in mla.py; I don't think we need two files.

@mergify bot added the deepseek (Related to DeepSeek models) label on Sep 19, 2025
@therealnaveenkamal (Author):

@ProExpertProg I'm working on the unified_mla_attention ops. How would you like them structured? Any inputs would be helpful.

@ProExpertProg (Collaborator):

Yeah, to start they can just mimic the unified_attention and unified_attention_with_output ops. Also, please keep the existing MLAAttentionWrapper as-is and make the new MLAAttention layer the same in scope as Attention (no rope, no o_proj, etc.).
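(For concreteness, the requested scope could look roughly like the sketch below. It reuses the op sketched earlier; the class name and the attribute names (layer_name, use_direct_call) are modeled on the existing Attention layer and are assumptions, not this PR's code.)

```python
import torch
import torch.nn as nn


class MLAAttentionSketch(nn.Module):
    """Core MLA attention only: no rope, no q/kv/o projections.

    The wrapper (MultiHeadLatentAttentionWrapper) keeps ownership of the
    rotary embedding and projections; this layer just dispatches the
    (q, kv_c_normed, k_pe) triple to the attention backend.
    """

    def __init__(self, num_heads: int, scale: float, prefix: str = ""):
        super().__init__()
        self.num_heads = num_heads
        self.scale = scale
        # The custom op looks this layer up by name via the forward context.
        self.layer_name = prefix
        # Normally derived from the backend/platform; kept simple here.
        self.use_direct_call = False

    def forward(self, q: torch.Tensor, kv_c_normed: torch.Tensor,
                k_pe: torch.Tensor) -> torch.Tensor:
        if self.use_direct_call:
            # Direct path omitted in this sketch: it would call the backend
            # impl with the KV cache and metadata from the forward context,
            # exactly as the custom-op body does.
            raise NotImplementedError
        return torch.ops.vllm.unified_mla_attention(
            q, kv_c_normed, k_pe, self.layer_name)
```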

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
@therealnaveenkamal (Author):

Hi @ProExpertProg, thanks for the feedback.

I've added the unified_mla_attention and unified_mla_attention_with_output ops, which mimic the existing unified attention ops.

The MLAAttention layer has been created in mla.py; it is scoped similarly to the base Attention layer and does not handle projections or rotary embeddings.

The MultiHeadLatentAttentionWrapper uses the new MLAAttention layer to handle the core attention logic.

Let me know what you think. Thanks
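(Under that split, the wrapper's forward reduces to roughly the shape below. This is only a sketch: _project_and_rope is a hypothetical stand-in for the projection/rope handling that stays in the wrapper, and mla_attn is an instance of the new core layer.)

```python
class MultiHeadLatentAttentionWrapperSketch:
    """Shape of the wrapper after the split (attributes assumed to be set in
    __init__, omitted here): projections and rope stay in the wrapper, while
    the core attention call goes through the new MLAAttention layer."""

    def forward(self, positions, hidden_states):
        # _project_and_rope is a hypothetical helper for the q/kv projections
        # and rotary embedding the wrapper already owns.
        q, kv_c_normed, k_pe = self._project_and_rope(positions, hidden_states)
        # The real call also passes an output_shape, as in the excerpt
        # reviewed above; omitted here for brevity.
        attn_out = self.mla_attn(q, kv_c_normed, k_pe)
        return self.o_proj(attn_out)[0]
```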

attn_metadata = forward_context.attn_metadata
if isinstance(attn_metadata, dict):
    attn_metadata = attn_metadata[layer_name]
self = forward_context.no_compile_layers[layer_name]
Collaborator:

We should type-annotate self here
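(i.e., something along these lines, assuming the layer class is importable, or referenced as a string, in that module:)

```python
self: "MLAAttention" = forward_context.no_compile_layers[layer_name]
```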

Comment on lines +148 to 149
class MultiHeadLatentAttentionWrapper(CustomOp):
    """MLA layer registered as CustomOp.
Collaborator:

Something like this:

Suggested change:
-class MultiHeadLatentAttentionWrapper(CustomOp):
-    """MLA layer registered as CustomOp.
+class MultiHeadLatentAttentionWrapper(CustomOp):
+    """MLA layer registered as CustomOp to allow OOT backends to add custom
+    implementations of the outer MLA layer (including rope & o_proj).

q_proj: Optional[torch.nn.Module]


class MLAAttention(nn.Module, AttentionLayerBase):
Collaborator:

I think I'd rather see this in vllm/attention/layer.py or vllm/attention/mla.py - @LucasWilkinson what do you think?

Contributor:

vllm/attention/layer.py makes sense to me
