[Bugfix] Fix duplicated KV cache allocation in qwen3-Next #3404
Conversation
Signed-off-by: QilaiZhang <[email protected]>
Code Review
This pull request correctly fixes a duplicated KV cache allocation bug by refactoring the tensor initialization logic. The new code is much cleaner and avoids redundant allocations. However, this refactoring introduces a critical memory alignment issue for linear_attn layers when kv_transfer_config is enabled. The use of torch.cat results in an unaligned tensor, which violates a requirement for llmdatadist and could lead to runtime errors. This needs to be addressed.
```python
else:
    cache_size = kv_cache_tensor.size // 2
    cache_size_aligned = cache_size + alignment
    k_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    v_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    k_tensor = self._align_memory(k_tensor_aligned,
                                  alignment)[:cache_size]
    v_tensor = self._align_memory(v_tensor_aligned,
                                  alignment)[:cache_size]
    tensor = torch.cat([k_tensor, v_tensor])
```
This refactoring introduces a memory alignment issue for linear_attn layers when kv_transfer_config is enabled.
The code now allocates k_tensor and v_tensor from separate 2M-aligned memory blocks, which is correct for standard attention layers. However, it then uses tensor = torch.cat([k_tensor, v_tensor]) to build the tensor for linear_attn layers. Because the inputs live in different storages, torch.cat copies their data into a freshly allocated tensor whose memory is not guaranteed to be 2M-aligned.
This appears to break the requirement for llmdatadist, which, according to the comment on line 2850, needs the cache tensor to be aligned. This can lead to runtime errors or incorrect behavior. The previous implementation correctly aligned the tensor for linear_attn.
Given the conflicting alignment requirements for a shared hybrid cache (one large aligned buffer for linear_attn vs. two separate aligned buffers for standard attn), please revisit this implementation to ensure all tensors meet their alignment requirements.
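For illustration, here is a minimal sketch (not the PR's code) of one way to reconcile the two requirements: allocate a single over-sized buffer, align its start, and slice the k/v halves out of it, so the halves and the combined buffer share the same aligned storage and no copy via torch.cat is needed. The `_align_memory` helper is a stand-in for the worker's method; `ALIGNMENT`, `allocate_hybrid_cache`, and the CPU device are assumptions made for the sketch.

```python
import torch

# Assumption: llmdatadist needs 2M-aligned cache tensors (per the review above).
ALIGNMENT = 2 * 1024 * 1024


def _align_memory(tensor: torch.Tensor, alignment: int) -> torch.Tensor:
    # Stand-in for the worker's helper: skip ahead to the first aligned byte.
    offset = (-tensor.data_ptr()) % alignment
    return tensor[offset:]


def allocate_hybrid_cache(cache_size: int, device: str = "cpu"):
    # One allocation covering both halves plus alignment slack.
    raw = torch.zeros(2 * cache_size + ALIGNMENT, dtype=torch.int8, device=device)
    combined = _align_memory(raw, ALIGNMENT)[:2 * cache_size]
    # k/v are views into the same storage, so torch.cat (and its copy) is not
    # needed: `combined` itself is the contiguous, aligned buffer linear_attn can use.
    k_tensor, v_tensor = combined[:cache_size], combined[cache_size:]
    assert combined.data_ptr() % ALIGNMENT == 0
    # v_tensor is only 2M-aligned when cache_size is itself a multiple of
    # ALIGNMENT, which is exactly the tension described in this review.
    return combined, k_tensor, v_tensor
```

Whether this also satisfies the per-tensor alignment that standard attention layers need depends on cache_size being a multiple of the alignment, which is the conflict described above.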
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: QilaiZhang <[email protected]>
```python
tensor = self._align_memory(
    tensor, alignment)[:kv_cache_tensor.size]
kv_cache_raw_tensors[layer_name_inner] = tensor
kv_cache_raw_tensors[layer_name] = tensor
```
Sharing KV cache memory between linear_attn and self_attn will cause accuracy issues in Qwen3-Next, so we should allocate memory for linear_attn and self_attn separately.
Actually, I also have a PR that fixes this issue and refactors the KV cache initialization logic, but it is not ready for review yet. Feel free to change the bug-fix logic in this PR according to #3106.
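For reference, a rough sketch of the separate-allocation approach described above (not the code from #3106; the layer names and sizes are hypothetical): each layer name gets its own buffer instead of registering one tensor under both the linear_attn and self_attn keys.

```python
import torch


def allocate_per_layer(kv_cache_sizes: dict, device: str = "cpu") -> dict:
    """Give every layer its own buffer; nothing is registered under two names."""
    kv_cache_raw_tensors = {}
    for layer_name, size in kv_cache_sizes.items():
        kv_cache_raw_tensors[layer_name] = torch.zeros(
            size, dtype=torch.int8, device=device)
    return kv_cache_raw_tensors


# Hypothetical layer names and sizes, purely for illustration.
caches = allocate_per_layer({
    "model.layers.0.linear_attn": 1 << 20,
    "model.layers.0.self_attn": 1 << 20,
})
assert caches["model.layers.0.linear_attn"].data_ptr() != \
       caches["model.layers.0.self_attn"].data_ptr()
```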
@QilaiZhang please take a look at #3760, which shares the KV cache within the same attention type, before fixing the accuracy issue of sharing a buffer across different attention types.
@MengqingCao Thanks for flagging this. Acknowledged - we'll preserve the current memory usage to guarantee accuracy. We'll look into #3760 and revisit buffer sharing once the accuracy issues are fixed.
What this PR does / why we need it?
Fix the bug described in #3368: Duplicated Allocation of Shared Hybrid Cache in Qwen3-Next
Does this PR introduce any user-facing change?
No
How was this patch tested?
The Qwen3-Next model has been tested.