[fix] prefill unsupport sliding window attention #2758
Conversation
Signed-off-by: nsdie <[email protected]>
Code Review

This pull request addresses a bug in the prefill phase for sliding window attention by removing the specialized code path that used `npu_fused_infer_attention_score`. While this correctly resolves the main issue, a related change to the `_repeat_kv` helper function introduces a subtle bug where it implements `torch.repeat` instead of the documented `torch.repeat_interleave`. I've provided a critical comment with a suggested fix for this function, and also noted that it appears to be unused after these changes and could potentially be removed.
```diff
         hidden_states = hidden_states[:, None, :, :].expand(
-            num_key_value_heads, n_rep, slen, head_dim)
-        return hidden_states.reshape(num_key_value_heads * n_rep, slen,
+            slen, n_rep, num_key_value_heads, head_dim)
+        return hidden_states.reshape(slen, num_key_value_heads * n_rep,
                                      head_dim)
```
The implementation of `_repeat_kv` does not match its docstring, which states it should be equivalent to `torch.repeat_interleave`. The current implementation performs a `torch.repeat` operation, not `torch.repeat_interleave`. This can lead to incorrect attention calculations in Grouped-Query Attention (GQA) scenarios where key and value states are expanded.

For a tensor with shape `(slen, num_kv_heads, head_dim)`, `repeat_interleave` on `dim=1` should result in each head being repeated `n_rep` times consecutively. The current implementation repeats the whole sequence of heads `n_rep` times.

Additionally, after this pull request's changes, this function appears to be unused and could potentially be removed.
Suggested change:

```diff
-        hidden_states = hidden_states[:, None, :, :].expand(
-            slen, n_rep, num_key_value_heads, head_dim)
+        hidden_states = hidden_states.unsqueeze(2).expand(
+            slen, num_key_value_heads, n_rep, head_dim)
         return hidden_states.reshape(slen, num_key_value_heads * n_rep,
                                      head_dim)
```
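A minimal standalone sketch (not part of the PR or this review thread) that demonstrates the difference the comment describes, assuming the `(slen, num_key_value_heads, head_dim)` layout:

```python
# Standalone illustration of the reviewer's point, assuming a
# (slen, num_key_value_heads, head_dim) layout for key/value states.
import torch

slen, num_key_value_heads, n_rep, head_dim = 4, 2, 3, 8
kv = torch.randn(slen, num_key_value_heads, head_dim)

# Suggested pattern: unsqueeze after the head axis, expand, then reshape.
# Each head is repeated n_rep times consecutively, matching
# torch.repeat_interleave on dim=1.
interleaved = kv.unsqueeze(2).expand(
    slen, num_key_value_heads, n_rep, head_dim).reshape(
        slen, num_key_value_heads * n_rep, head_dim)
assert torch.equal(interleaved, torch.repeat_interleave(kv, n_rep, dim=1))

# Pattern under review: the new axis is inserted before the head axis,
# so the whole block of heads is tiled n_rep times (torch.repeat semantics).
tiled = kv[:, None, :, :].expand(
    slen, n_rep, num_key_value_heads, head_dim).reshape(
        slen, num_key_value_heads * n_rep, head_dim)
assert not torch.equal(tiled, interleaved)
```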
Signed-off-by: nsdie <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Signed-off-by: nsdie <[email protected]>
```python
key = self._repeat_kv(key, self.num_heads // self.num_kv_heads)
value = self._repeat_kv(value, self.num_heads // self.num_kv_heads)

output, _ = torch_npu.npu_fused_infer_attention_score(
```
So it's confirmed that this op doesn't work for prefill? Can you paste a link or explain more about it?
The npu_fused_infer_attention_score operator does not support the prefill phase
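For context only, a hypothetical sketch (not the PR's replacement code; all names and shapes here are illustrative) of what a causal sliding-window mask in the prefill phase computes when applied with plain PyTorch attention:

```python
# Hypothetical illustration, not vllm-ascend code: a causal sliding-window
# mask for the prefill phase, applied via PyTorch's generic SDPA.
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Query position i may attend to key positions j with i - window < j <= i."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                # no attention to the future
    in_window = (idx[:, None] - idx[None, :]) < window   # limit lookback distance
    return causal & in_window                            # True = keep, False = mask out

# Illustrative shapes: (num_heads, seq_len, head_dim) after KV-head repetition.
num_heads, seq_len, head_dim, window = 4, 16, 64, 8
q, k, v = (torch.randn(num_heads, seq_len, head_dim) for _ in range(3))

mask = sliding_window_causal_mask(seq_len, window)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```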
Signed-off-by: nsdie <[email protected]>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

|  | main | #2758 | +/- |
| --- | --- | --- | --- |
| Coverage | 72.99% | 72.90% | -0.09% |
| Files | 153 | 153 |  |
| Lines | 21338 | 21368 | +30 |
| Hits | 15575 | 15579 | +4 |
| Misses | 5763 | 5789 | +26 |
This is a fallback PR for #2528
What this PR does / why we need it?
Fix a prefill attention bug: the prefill path does not support sliding window attention. npu_fused_infer_attention_score only supports head_dim == 128, not other values.
Does this PR introduce any user-facing change?
The prefill-phase npu_fused_infer_attention_score path is removed.
How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@e599e2c
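Not part of this PR (which removes the prefill fused path outright), but a hedged sketch of the kind of capability check a backend could use given the constraints described above; the helper name and example values are illustrative only:

```python
# Hypothetical helper, not from vllm-ascend: decide whether a fused attention
# kernel limited to head_dim == 128 and full (non-sliding) attention can be
# used for a prefill request; otherwise fall back to the generic path.
from typing import Optional

FUSED_PREFILL_HEAD_DIM = 128  # assumed constraint from the PR description

def can_use_fused_prefill(head_dim: int, sliding_window: Optional[int]) -> bool:
    if sliding_window is not None:
        return False  # fused op lacks sliding-window support in prefill
    return head_dim == FUSED_PREFILL_HEAD_DIM

# Illustrative checks: a sliding-window model must take the generic path,
# while a head_dim == 128 model without sliding window could use the fused op.
assert not can_use_fused_prefill(head_dim=256, sliding_window=4096)
assert can_use_fused_prefill(head_dim=128, sliding_window=None)
```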