
Conversation

@weiguihua2 (Collaborator) commented on Oct 24, 2025

What this PR does / why we need it?

Add full aclgraph support for DCP/PCP, including MLA attention_v1.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for full aclgraph, including mla attention_v1, for dcp pcp. The changes involve modifications to attention_v1.py, mla_v1.py, acl_graph.py, and model_runner_v1.py to accommodate the new aclgraph features. The review focuses on identifying critical and high severity issues.

Comment on lines 914 to 918
graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, self.num_heads, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],
     workspace, attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))

Severity: critical

The self.num_heads attribute is passed directly into the graph parameters, but it may be modified later (e.g., in the `dcp_size > 1` branch). It is crucial that the graph uses the correct value of num_heads; consider passing the potentially modified value instead of self.num_heads to avoid inconsistencies.

If num_heads is modified after this point, the captured graph will use the original value, leading to incorrect computations. This is a critical issue because it directly affects the correctness of the attention mechanism.
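
A minimal sketch of one way to address this, following the bot's suggestion; num_heads_for_graph and the surrounding capture code are illustrative, not the actual vllm-ascend implementation:

# Hypothetical sketch: resolve the head count the graph will actually use
# *before* capturing it into graph_params, instead of self.num_heads.
num_heads_for_graph = self.num_heads
if self.dcp_size > 1:
    # mirror the adjustment applied later in the dcp_size > 1 branch
    num_heads_for_graph = self.num_heads * self.dcp_size

graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, num_heads_for_graph, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],
     workspace, attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))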

Comment on lines +323 to +325
if dcp_size > 1:
    num_heads = num_heads * dcp_size

Severity: critical

The num_heads variable is potentially modified based on dcp_size. It's crucial to ensure that the correct value of num_heads is used within the graph. Consider passing the potentially modified num_heads value instead of the original to avoid inconsistencies. This is a critical issue because it directly affects the correctness of the attention mechanism.
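
To make the concern concrete, a hedged sketch using the tuple layout captured above; the unpacking and the name block_size are illustrative, not the actual acl_graph.py code:

# Illustrative only: unpack one captured entry, apply the dcp adjustment
# once, and let every later consumer read the adjusted local value.
(q_nope, k_nope, value, num_heads, num_kv_heads, scale, block_tables,
 block_size, actual_seq_lengths_kv, workspace, attn_out, attn_lse,
 pcp_rank, dcp_rank, dcp_size) = param  # one entry of attn_params[num_tokens]

if dcp_size > 1:
    num_heads = num_heads * dcp_size

# From here on, pass this local `num_heads` to the fused attention update;
# reusing the value stored at capture time would desynchronize the graph
# from eager execution.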

Comment on lines 500 to 494
seq_mask_pcp = torch.where(
    torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0, 0,
    1).to(torch.uint8)

Severity: high

The condition torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0 could lead to incorrect masking if num_computed_tokens_of_cp_dcp_array contains very small non-zero values due to numerical precision issues. Consider an elementwise tolerance check such as torch.isclose (torch.allclose reduces to a single boolean and cannot drive torch.where) to account for potential floating-point error. This is a high severity issue because incorrect masking affects the model's accuracy.

seq_mask_pcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2), dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)
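
A small self-contained check of what this mask computes; the array below is a toy example with shape [num_reqs, pcp_size, dcp_size], since the entries are integer token counts:

import numpy as np
import torch

# Toy data: 2 requests, pcp_size=2, dcp_size=2.
num_computed_tokens_of_cp_dcp_array = np.array(
    [[[4, 0], [3, 1]],    # request 0: both pcp ranks hold some tokens
     [[0, 0], [2, 0]]])   # request 1: pcp rank 0 holds nothing

summed = torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2))
seq_mask_pcp = torch.where(summed == 0, 0, 1).to(torch.uint8)
print(seq_mask_pcp)  # tensor([[1, 1], [0, 1]], dtype=torch.uint8)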

Comment on lines 507 to 509
    torch.tensor(num_computed_tokens_of_cp_dcp_array[:,
                                                      self.cp_rank, :]) == 0,
    0, 1).to(torch.uint8)

Severity: high

Similar to the previous comment, the condition torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :]) == 0 could be susceptible to numerical precision issues. An elementwise torch.isclose with a tolerance would be more robust. This is a high severity issue because incorrect masking affects the model's accuracy.

seq_mask_dcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :],
                     dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)

Comment on lines 319 to 322
actual_seq_lengths_kv = forward_context.attn_metadata[key].decode_meta.num_computed_tokens_of_pcp_dcp[:, cp_rank, dcp_rank]
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = np.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype)
actual_seq_lengths_kv = np.concatenate([actual_seq_lengths_kv, pad_tensor])

Severity: high

The use of np.concatenate after converting actual_seq_lengths_kv to a NumPy array may introduce a performance bottleneck, especially if this operation is frequently executed. Consider performing the padding and concatenation directly using PyTorch tensors to leverage hardware acceleration. This is a high severity issue because it can significantly impact the overall performance of the model.

actual_seq_lengths_kv = torch.as_tensor(actual_seq_lengths_kv, device=q_nope.device)
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = torch.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype, device=q_nope.device)
actual_seq_lengths_kv = torch.cat([actual_seq_lengths_kv, pad_tensor])
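
If the value is kept as a tensor throughout, torch.nn.functional.pad gives an equivalent one-liner; a sketch under that assumption (pad_seq_lengths is a made-up helper name):

import torch
import torch.nn.functional as F

def pad_seq_lengths(actual_seq_lengths_kv: torch.Tensor, runtime_shape: int) -> torch.Tensor:
    # Right-pad a 1-D length tensor with zeros up to `runtime_shape` entries.
    pad_length = runtime_shape - actual_seq_lengths_kv.numel()
    return F.pad(actual_seq_lengths_kv, (0, pad_length))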

Comment on lines 2596 to 2600
if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                   positions.shape[0],
                                   self.speculative_config)

Severity: high

The conditional execution of update_mla_attn_dcp_pcp_params and update_mla_attn_params based on self.pcp_size * self.dcp_size > 1 introduces code duplication and potential for divergence in behavior. Consider refactoring this logic into a single function that handles both cases, or using a more generic approach to parameter updates. This is a high severity issue because it increases the complexity of the code and makes it harder to maintain.
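
A minimal sketch of the consolidation this comment suggests; the wrapper name _update_attn_params is hypothetical, while the two update functions and their arguments are taken from the diff above:

# Hypothetical wrapper (not part of vllm-ascend): choose the update routine
# in one place so call sites never branch on pcp/dcp themselves.
def _update_attn_params(self, forward_context, num_tokens):
    if self.pcp_size * self.dcp_size > 1:
        update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                       num_tokens, self.speculative_config)
    else:
        update_mla_attn_params(self.update_stream, forward_context,
                               num_tokens, self.speculative_config)

# Call site sketch: self._update_attn_params(forward_context, positions.shape[0])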

    update_mla_attn_params(self.update_stream, forward_context,
                           positions.shape[0],
                           self.speculative_config)
if self.pcp_size * self.dcp_size > 1:

Severity: high

The comment FIXME: Try using auto_dispatch_capture=True indicates an area where the code can be improved. Address it by either implementing the suggested change or explaining clearly why it cannot be done. This is a high severity issue because it flags a potential optimization or bug fix that is still outstanding.

@github-actions bot commented

This pull request has conflicts, please resolve those before we can evaluate the pull request.

    q_nope, q_pe, k_nope, k_pe, decode_meta.block_table,
    seq_len, num_heads, self.scale, self.num_kv_heads,
    **common_kwargs)
graph_params.workspaces[num_tokens] = workspace

Collaborator:

Add weak_ref_tensors here.


I mean:

Suggested change:
- graph_params.workspaces[num_tokens] = workspace
+ graph_params.workspaces[num_tokens] = weak_ref_tensors(workspace)


Also, please change this to use update_graph_params_workspaces.

                           self.speculative_config)
if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream,

Refactor later to put all the extra streams in a common place.
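
One possible shape for that later refactor, sketched with hypothetical names; ExtraStreams is not an existing vllm-ascend class, and the sketch assumes torch_npu is installed:

import torch
import torch_npu  # noqa: F401  (assumed available on Ascend deployments)

class ExtraStreams:
    """Hypothetical holder that owns every auxiliary NPU stream in one place."""

    def __init__(self, device: torch.device):
        # e.g. the stream used for aclgraph parameter updates
        self.update_stream = torch.npu.Stream(device=device)

# Usage sketch inside the model runner:
# self.extra_streams = ExtraStreams(self.device)
# update_mla_attn_dcp_pcp_params(self.extra_streams.update_stream, ...)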


weiguihua2 (author):

OK, we will do it later

weiguihua2 added the ready-for-test label on Oct 25, 2025.
weiguihua2 requested a review from yiz-liu on Oct 25, 2025 at 11:18.
weijinqian0 added the ready and ready-for-test labels and removed the ready-for-test label on Oct 25, 2025.
Signed-off-by: weiguihua2 <[email protected]>

Labels

module:tests, ready (read for review), ready-for-test (start test by label for PR)
