[feat]dcp pcp support aclgraph #3731
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces full aclgraph support for DCP/PCP, including MLA attention_v1. The changes modify attention_v1.py, mla_v1.py, acl_graph.py, and model_runner_v1.py to accommodate the new aclgraph features. The review focuses on identifying critical and high severity issues.
graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, self.num_heads, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank], workspace,
     attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))
The self.num_heads attribute is being passed directly to the graph parameters, but it might be modified later (e.g., in the dcp size > 1 condition). It's crucial to ensure that the correct value of num_heads is used within the graph. Consider passing the potentially modified num_heads value instead of self.num_heads to avoid inconsistencies.
If num_heads is modified after this point, the captured graph will use the original value, leading to incorrect computations. This is a critical issue because it directly affects the correctness of the attention mechanism.
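A minimal sketch of that suggestion, assuming the head count is adjusted before the tuple is captured (the effective_num_heads name and the placement are illustrative, not the PR's actual implementation):

# Compute the value the kernel will actually use before capture, so the
# captured graph does not record a stale self.num_heads (illustrative sketch only).
effective_num_heads = self.num_heads * self.dcp_size if self.dcp_size > 1 else self.num_heads
graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, effective_num_heads, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],
     workspace, attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))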
if dcp_size > 1:
    num_heads = num_heads * dcp_size
The num_heads variable is potentially modified based on dcp_size. It's crucial to ensure that the correct value of num_heads is used within the graph. Consider passing the potentially modified num_heads value instead of the original to avoid inconsistencies. This is a critical issue because it directly affects the correctness of the attention mechanism.
seq_mask_pcp = torch.where(
    torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0, 0,
    1).to(torch.uint8)
The condition torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0 could lead to incorrect masking if num_computed_tokens_of_cp_dcp_array contains very small non-zero values due to numerical precision issues. Consider an element-wise tolerance comparison such as torch.isclose to account for potential floating-point errors. This is a high severity issue because incorrect masking affects the model's accuracy.
seq_mask_pcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2), dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)
vllm_ascend/attention/mla_v1.py (outdated)
    torch.tensor(num_computed_tokens_of_cp_dcp_array[:,
                                                      self.cp_rank, :]) == 0,
    0, 1).to(torch.uint8)
Similar to the previous comment, the condition torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :]) == 0 could be susceptible to numerical precision issues. Using an element-wise tolerance comparison such as torch.isclose would be more robust. This is a high severity issue because it can lead to incorrect masking and affect the model's accuracy.
seq_mask_dcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :], dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)
vllm_ascend/compilation/acl_graph.py (outdated)
actual_seq_lengths_kv = forward_context.attn_metadata[key].decode_meta.num_computed_tokens_of_pcp_dcp[:, cp_rank, dcp_rank]
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = np.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype)
actual_seq_lengths_kv = np.concatenate([actual_seq_lengths_kv, pad_tensor])
The use of np.concatenate after converting actual_seq_lengths_kv to a NumPy array may introduce a performance bottleneck, especially if this operation is frequently executed. Consider performing the padding and concatenation directly using PyTorch tensors to leverage hardware acceleration. This is a high severity issue because it can significantly impact the overall performance of the model.
actual_seq_lengths_kv = torch.from_numpy(actual_seq_lengths_kv).to(q_nope.device)
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = torch.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype, device=q_nope.device)
actual_seq_lengths_kv = torch.cat([actual_seq_lengths_kv, pad_tensor])

if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                   positions.shape[0],
                                   self.speculative_config)
The conditional execution of update_mla_attn_dcp_pcp_params and update_mla_attn_params based on self.pcp_size * self.dcp_size > 1 introduces code duplication and potential for divergence in behavior. Consider refactoring this logic into a single function that handles both cases, or using a more generic approach to parameter updates. This is a high severity issue because it increases the complexity of the code and makes it harder to maintain.
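A possible shape for that refactor, sketched under the assumption that both update helpers keep their current signatures (the dispatcher name is hypothetical):

def _update_attn_params(self, forward_context, num_tokens):
    # Single entry point so the pcp/dcp branch lives in one place.
    if self.pcp_size * self.dcp_size > 1:
        update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                       num_tokens, self.speculative_config)
    else:
        update_mla_attn_params(self.update_stream, forward_context,
                               num_tokens, self.speculative_config)

The call site would then pass positions.shape[0] as num_tokens regardless of the parallelism configuration.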
update_mla_attn_params(self.update_stream, forward_context,
                       positions.shape[0],
                       self.speculative_config)
if self.pcp_size * self.dcp_size > 1:
The comment FIXME: Try using auto_dispatch_capture=True indicates an area where the code can be improved. It's important to address this FIXME by either implementing the suggested change or providing a clear explanation of why it cannot be done. This is a high severity issue because it indicates a potential area for optimization or bug fix.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
bbe6ebb to 17953b6
vllm_ascend/attention/mla_v1.py (outdated)
    q_nope, q_pe, k_nope, k_pe, decode_meta.block_table,
    seq_len, num_heads, self.scale, self.num_kv_heads,
    **common_kwargs)
graph_params.workspaces[num_tokens] = workspace
Add weak_ref_tensors here.
I mean:
- graph_params.workspaces[num_tokens] = workspace
+ graph_params.workspaces[num_tokens] = weak_ref_tensors(workspace)
Also please change to update_graph_params_workspaces.
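For context, a hypothetical sketch of what such a helper could look like; the actual update_graph_params_workspaces in acl_graph.py may take different arguments:

def update_graph_params_workspaces(graph_params, num_tokens, workspace):
    # Centralizes workspace bookkeeping and applies weak_ref_tensors in one place.
    graph_params.workspaces[num_tokens] = weak_ref_tensors(workspace)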
                               self.speculative_config)
if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream,
Refactor to put all extra streams into a common position later.
OK, we will do it later
Signed-off-by: weiguihua2 <[email protected]>
What this PR does / why we need it?
1. DCP/PCP support full aclgraph, including MLA attention_v1
Does this PR introduce any user-facing change?
How was this patch tested?