
Conversation

@QilaiZhang (Contributor) commented Oct 13, 2025

What this PR does / why we need it?

Fix the bug described in #3368: Duplicated Allocation of Shared Hybrid Cache in Qwen3-Next

Does this PR introduce any user-facing change?

No

How was this patch tested?

The Qwen3-Next model has been tested with this patch.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly fixes a duplicated KV cache allocation bug by refactoring the tensor initialization logic. The new code is much cleaner and avoids redundant allocations. However, this refactoring introduces a critical memory alignment issue for linear_attn layers when kv_transfer_config is enabled. The use of torch.cat results in an unaligned tensor, which violates a requirement for llmdatadist and could lead to runtime errors. This needs to be addressed.

Comment on lines +2858 to +2871
else:
    cache_size = kv_cache_tensor.size // 2
    cache_size_aligned = cache_size + alignment
    k_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    v_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    k_tensor = self._align_memory(k_tensor_aligned,
                                  alignment)[:cache_size]
    v_tensor = self._align_memory(v_tensor_aligned,
                                  alignment)[:cache_size]
    tensor = torch.cat([k_tensor, v_tensor])
@gemini-code-assist bot (Contributor), severity: critical

This refactoring introduces a memory alignment issue for linear_attn layers when kv_transfer_config is enabled.

The code now allocates k_tensor and v_tensor from separate 2M-aligned memory blocks, which is correct for standard attention layers. However, it then uses tensor = torch.cat([k_tensor, v_tensor]) to create the tensor for linear_attn layers. torch.cat on tensors from different storage will create a new tensor by copying data, and the memory for this new tensor is not guaranteed to be 2M-aligned.

This appears to break the requirement for llmdatadist, which, according to the comment on line 2850, needs the cache tensor to be aligned. This can lead to runtime errors or incorrect behavior. The previous implementation correctly aligned the tensor for linear_attn.

Given the conflicting alignment requirements for a shared hybrid cache (one large aligned buffer for linear_attn vs. two separate aligned buffers for standard attn), please revisit this implementation to ensure all tensors meet their alignment requirements.
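To make the concern concrete, here is a minimal, hedged sketch (not the repository's code) of both the problem and one possible direction. It assumes a 2 MB alignment requirement and uses an illustrative align_view helper in place of the project's _align_memory; names such as backing, full_tensor, k_tensor and v_tensor are hypothetical.

import torch

ALIGNMENT = 2 * 1024 * 1024  # assumed 2 MB alignment required by llmdatadist


def align_view(buf: torch.Tensor, alignment: int) -> torch.Tensor:
    # Return a view into buf that starts on an address aligned to `alignment` bytes.
    offset = (-buf.data_ptr()) % alignment
    return buf[offset:]


cache_size = 3 * 1024 * 1024  # illustrative size, not the real kv_cache_tensor.size

# 1) The problem: torch.cat copies into fresh storage, so the alignment of the
#    inputs is not inherited by the result.
k = align_view(torch.zeros(cache_size + ALIGNMENT, dtype=torch.int8), ALIGNMENT)[:cache_size]
v = align_view(torch.zeros(cache_size + ALIGNMENT, dtype=torch.int8), ALIGNMENT)[:cache_size]
combined = torch.cat([k, v])
print(combined.data_ptr() % ALIGNMENT)  # may be non-zero: no alignment guarantee

# 2) One possible direction: carve both halves out of a single aligned buffer so the
#    full tensor (linear_attn) and each half (self_attn) share storage and all start
#    on aligned addresses. Rounding the half size up to a multiple of ALIGNMENT keeps
#    the second half aligned, at the cost of some extra memory.
half = (cache_size + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
backing = torch.zeros(2 * half + ALIGNMENT, dtype=torch.int8)
aligned = align_view(backing, ALIGNMENT)
full_tensor = aligned[:2 * half]                # aligned view, no copy
k_tensor = full_tensor[:cache_size]             # aligned
v_tensor = full_tensor[half:half + cache_size]  # starts on an aligned boundary

As the discussion below notes, sharing one buffer between linear_attn and self_attn was later rejected for accuracy reasons, so part 2 only illustrates the alignment mechanics, not the approach the project settled on.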

@github-actions bot

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

tensor = self._align_memory(
    tensor, alignment)[:kv_cache_tensor.size]
kv_cache_raw_tensors[layer_name_inner] = tensor
kv_cache_raw_tensors[layer_name] = tensor
A Collaborator commented on this snippet:

Sharing kv cache memory between linear_attn and self_attn will cause accuracy issues in Qwen3-Next, so we should allocate memory for linear_attn and self_attn separately.
Actually, I also have a PR that fixes this issue and refactors the kvcache initialization logic, but it is not ready for review yet. Feel free to change the bug-fix logic in this PR according to #3106.
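For illustration only, a minimal fragment-style sketch of the separate-allocation approach described above; it is not the #3106 change (which is not shown in this thread) and simply reuses the _align_memory, alignment, kv_cache_tensor, layer_name, layer_name_inner and kv_cache_raw_tensors names from the snippets in this thread, so the real initialization logic may differ.

# Hypothetical sketch: allocate an independent aligned buffer per layer name instead
# of storing one shared tensor under both entries; any per-type handling (such as
# the self_attn k/v split shown earlier) is omitted here.
for name in (layer_name, layer_name_inner):
    raw = torch.zeros(kv_cache_tensor.size + alignment,
                      dtype=torch.int8,
                      device=self.device)
    kv_cache_raw_tensors[name] = self._align_memory(
        raw, alignment)[:kv_cache_tensor.size]

This trades some memory for isolation between the two cache types, which matches the preference stated in the next comment.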

@MengqingCao (Collaborator)

@QilaiZhang please take a look at #3760, which shares the kvcache within the same attention type. Before the accuracy issue with sharing a buffer between linear_attn and self_attn is fixed, I prefer to waste some memory to guarantee accuracy.

@QilaiZhang (Contributor, Author)

> @QilaiZhang please take a look at #3760, which shares the kvcache within the same attention type. Before the accuracy issue with sharing a buffer between linear_attn and self_attn is fixed, I prefer to waste some memory to guarantee accuracy.

@MengqingCao Thanks for flagging this. Acknowledged - we'll preserve the current memory usage to guarantee accuracy. We'll look into #3760 and revisit buffer sharing once the accuracy issues are fixed.

@QilaiZhang closed this Oct 27, 2025