[feat]dcp pcp support aclgraph #3731
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces full aclgraph support for DCP/PCP, including MLA attention_v1. The changes modify attention_v1.py, mla_v1.py, acl_graph.py, and model_runner_v1.py to accommodate the new aclgraph features. The review focuses on identifying critical and high severity issues.
graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, self.num_heads, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank], workspace,
     attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))
The self.num_heads attribute is being passed directly to the graph parameters, but it might be modified later (e.g., in the dcp size > 1 condition). It's crucial to ensure that the correct value of num_heads is used within the graph. Consider passing the potentially modified num_heads value instead of self.num_heads to avoid inconsistencies.
If num_heads is modified after this point, the captured graph will use the original value, leading to incorrect computations. This is a critical issue because it directly affects the correctness of the attention mechanism.
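A minimal sketch of that suggestion, assuming the head count is adjusted before the tuple is captured (the effective_num_heads name and the placement are illustrative, not the PR's actual implementation):

# Compute the value the kernel will actually use before capture, so the
# captured graph does not record a stale self.num_heads (illustrative sketch only).
effective_num_heads = self.num_heads * self.dcp_size if self.dcp_size > 1 else self.num_heads
graph_params.attn_params[num_tokens].append(
    (q_nope, k_nope, value, effective_num_heads, self.num_kv_heads,
     self.scale, attn_metadata.block_tables, self.key_cache.shape[1],
     attn_metadata.decode_meta.num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],
     workspace, attn_out, attn_lse, self.pcp_rank, self.dcp_rank, self.dcp_size))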
if dcp_size > 1:
    num_heads = num_heads * dcp_size
The num_heads variable is potentially modified based on dcp_size. It's crucial to ensure that the correct value of num_heads is used within the graph. Consider passing the potentially modified num_heads value instead of the original to avoid inconsistencies. This is a critical issue because it directly affects the correctness of the attention mechanism.
seq_mask_pcp = torch.where(
    torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0, 0,
    1).to(torch.uint8)
The condition torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2)) == 0 could lead to incorrect masking if num_computed_tokens_of_cp_dcp_array contains very small non-zero values due to numerical precision issues. Consider an element-wise tolerance comparison such as torch.isclose to account for potential floating-point errors. This is a high severity issue because incorrect masking affects the model's accuracy.
seq_mask_pcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array.sum(2), dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)
vllm_ascend/attention/mla_v1.py (outdated)
    torch.tensor(num_computed_tokens_of_cp_dcp_array[:,
                                                      self.cp_rank, :]) == 0,
    0, 1).to(torch.uint8)
Similar to the previous comment, the condition torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :]) == 0 could be susceptible to numerical precision issues. Using an element-wise tolerance comparison such as torch.isclose would be more robust. This is a high severity issue because it can lead to incorrect masking and affect the model's accuracy.
seq_mask_dcp = torch.where(
    torch.isclose(
        torch.tensor(num_computed_tokens_of_cp_dcp_array[:, self.cp_rank, :], dtype=torch.float32),
        torch.tensor(0.0), atol=1e-5),
    0, 1).to(torch.uint8)
vllm_ascend/compilation/acl_graph.py (outdated)
actual_seq_lengths_kv = forward_context.attn_metadata[key].decode_meta.num_computed_tokens_of_pcp_dcp[:, cp_rank, dcp_rank]
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = np.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype)
actual_seq_lengths_kv = np.concatenate([actual_seq_lengths_kv, pad_tensor])
The use of np.concatenate after converting actual_seq_lengths_kv to a NumPy array may introduce a performance bottleneck, especially if this operation is frequently executed. Consider performing the padding and concatenation directly using PyTorch tensors to leverage hardware acceleration. This is a high severity issue because it can significantly impact the overall performance of the model.
actual_seq_lengths_kv = torch.from_numpy(actual_seq_lengths_kv).to(q_nope.device)
pad_length = runtime_shape - len(actual_seq_lengths_kv)
pad_tensor = torch.zeros(pad_length, dtype=actual_seq_lengths_kv.dtype, device=q_nope.device)
actual_seq_lengths_kv = torch.cat([actual_seq_lengths_kv, pad_tensor])

if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                   positions.shape[0],
                                   self.speculative_config)
The conditional execution of update_mla_attn_dcp_pcp_params and update_mla_attn_params based on self.pcp_size * self.dcp_size > 1 introduces code duplication and potential for divergence in behavior. Consider refactoring this logic into a single function that handles both cases, or using a more generic approach to parameter updates. This is a high severity issue because it increases the complexity of the code and makes it harder to maintain.
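A possible shape for that refactor, sketched under the assumption that both update helpers keep their current signatures (the dispatcher name is hypothetical):

def _update_attn_params(self, forward_context, num_tokens):
    # Single entry point so the pcp/dcp branch lives in one place.
    if self.pcp_size * self.dcp_size > 1:
        update_mla_attn_dcp_pcp_params(self.update_stream, forward_context,
                                       num_tokens, self.speculative_config)
    else:
        update_mla_attn_params(self.update_stream, forward_context,
                               num_tokens, self.speculative_config)

The call site would then pass positions.shape[0] as num_tokens regardless of the parallelism configuration.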
update_mla_attn_params(self.update_stream, forward_context,
                       positions.shape[0],
                       self.speculative_config)
if self.pcp_size * self.dcp_size > 1:
The comment FIXME: Try using auto_dispatch_capture=True indicates an area where the code can be improved. It's important to address this FIXME by either implementing the suggested change or providing a clear explanation of why it cannot be done. This is a high severity issue because it indicates a potential area for optimization or bug fix.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
bbe6ebb to 17953b6
vllm_ascend/attention/mla_v1.py (outdated)
    q_nope, q_pe, k_nope, k_pe, decode_meta.block_table,
    seq_len, num_heads, self.scale, self.num_kv_heads,
    **common_kwargs)
graph_params.workspaces[num_tokens] = workspace
Add weak_ref_tensors here.
I mean:
- graph_params.workspaces[num_tokens] = workspace
+ graph_params.workspaces[num_tokens] = weak_ref_tensors(workspace)
Also please change to update_graph_params_workspaces.
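For context, a hypothetical sketch of what such a helper could look like; the actual update_graph_params_workspaces in acl_graph.py may take different arguments:

def update_graph_params_workspaces(graph_params, num_tokens, workspace):
    # Centralizes workspace bookkeeping and applies weak_ref_tensors in one place.
    graph_params.workspaces[num_tokens] = weak_ref_tensors(workspace)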
                               self.speculative_config)
if self.pcp_size * self.dcp_size > 1:
    # FIXME: Try using `auto_dispatch_capture=True`
    update_mla_attn_dcp_pcp_params(self.update_stream,
Refactor to put all extra streams into a common position later.
OK, we will do it later
Signed-off-by: weiguihua2 <[email protected]>
What this PR does / why we need it?
1. DCP/PCP support full aclgraph, including MLA attention_v1
Does this PR introduce any user-facing change?
How was this patch tested?