[Bugfix] Fix duplicated KV cache allocation in qwen3-Next #3404
Conversation
Signed-off-by: QilaiZhang <[email protected]>
Code Review
This pull request correctly fixes a duplicated KV cache allocation bug by refactoring the tensor initialization logic. The new code is much cleaner and avoids redundant allocations. However, this refactoring introduces a critical memory alignment issue for linear_attn layers when kv_transfer_config is enabled. The use of torch.cat results in an unaligned tensor, which violates a requirement for llmdatadist and could lead to runtime errors. This needs to be addressed.
```python
else:
    cache_size = kv_cache_tensor.size // 2
    cache_size_aligned = cache_size + alignment
    k_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    v_tensor_aligned = torch.zeros(cache_size_aligned,
                                   dtype=torch.int8,
                                   device=self.device)
    k_tensor = self._align_memory(k_tensor_aligned,
                                  alignment)[:cache_size]
    v_tensor = self._align_memory(v_tensor_aligned,
                                  alignment)[:cache_size]
    tensor = torch.cat([k_tensor, v_tensor])
```
This refactoring introduces a memory alignment issue for linear_attn layers when kv_transfer_config is enabled.
The code now allocates k_tensor and v_tensor from separate 2M-aligned memory blocks, which is correct for standard attention layers. However, it then uses tensor = torch.cat([k_tensor, v_tensor]) to build the tensor for linear_attn layers. Because the inputs live in different storages, torch.cat copies their data into a freshly allocated tensor whose memory is not guaranteed to be 2M-aligned.
This appears to break the requirement for llmdatadist, which, according to the comment on line 2850, needs the cache tensor to be aligned. This can lead to runtime errors or incorrect behavior. The previous implementation correctly aligned the tensor for linear_attn.
Given the conflicting alignment requirements for a shared hybrid cache (one large aligned buffer for linear_attn vs. two separate aligned buffers for standard attn), please revisit this implementation to ensure all tensors meet their alignment requirements.
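For illustration, here is a minimal sketch (not the PR's code) of one way to reconcile the two requirements: allocate a single over-sized buffer, align its start, and slice the k/v halves out of it, so the halves and the combined buffer share the same aligned storage and no copy via torch.cat is needed. The `_align_memory` helper is a stand-in for the worker's method; `ALIGNMENT`, `allocate_hybrid_cache`, and the CPU device are assumptions made for the sketch.

```python
import torch

# Assumption: llmdatadist needs 2M-aligned cache tensors (per the review above).
ALIGNMENT = 2 * 1024 * 1024


def _align_memory(tensor: torch.Tensor, alignment: int) -> torch.Tensor:
    # Stand-in for the worker's helper: skip ahead to the first aligned byte.
    offset = (-tensor.data_ptr()) % alignment
    return tensor[offset:]


def allocate_hybrid_cache(cache_size: int, device: str = "cpu"):
    # One allocation covering both halves plus alignment slack.
    raw = torch.zeros(2 * cache_size + ALIGNMENT, dtype=torch.int8, device=device)
    combined = _align_memory(raw, ALIGNMENT)[:2 * cache_size]
    # k/v are views into the same storage, so torch.cat (and its copy) is not
    # needed: `combined` itself is the contiguous, aligned buffer linear_attn can use.
    k_tensor, v_tensor = combined[:cache_size], combined[cache_size:]
    assert combined.data_ptr() % ALIGNMENT == 0
    # v_tensor is only 2M-aligned when cache_size is itself a multiple of
    # ALIGNMENT, which is exactly the tension described in this review.
    return combined, k_tensor, v_tensor
```

Whether this also satisfies the per-tensor alignment that standard attention layers need depends on cache_size being a multiple of the alignment, which is the conflict described above.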
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: QilaiZhang <[email protected]>
```python
tensor = self._align_memory(
    tensor, alignment)[:kv_cache_tensor.size]
kv_cache_raw_tensors[layer_name_inner] = tensor
kv_cache_raw_tensors[layer_name] = tensor
```
Sharing KV cache memory between linear_attn and self_attn will cause accuracy issues in Qwen3-Next, so we should allocate memory for linear_attn and self_attn separately.
Actually, I also have a PR that fixes this issue and refactors the KV cache initialization logic, but it is not ready for review yet. Feel free to change the bug-fix logic in this PR according to #3106.
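For reference, a rough sketch of the separate-allocation approach described above (not the code from #3106; the layer names and sizes are hypothetical): each layer name gets its own buffer instead of registering one tensor under both the linear_attn and self_attn keys.

```python
import torch


def allocate_per_layer(kv_cache_sizes: dict, device: str = "cpu") -> dict:
    """Give every layer its own buffer; nothing is registered under two names."""
    kv_cache_raw_tensors = {}
    for layer_name, size in kv_cache_sizes.items():
        kv_cache_raw_tensors[layer_name] = torch.zeros(
            size, dtype=torch.int8, device=device)
    return kv_cache_raw_tensors


# Hypothetical layer names and sizes, purely for illustration.
caches = allocate_per_layer({
    "model.layers.0.linear_attn": 1 << 20,
    "model.layers.0.self_attn": 1 << 20,
})
assert caches["model.layers.0.linear_attn"].data_ptr() != \
       caches["model.layers.0.self_attn"].data_ptr()
```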
@QilaiZhang please take a look at #3760, which shares the KV cache within the same attention type, before fixing the accuracy issue of sharing a buffer across different attention types.
@MengqingCao Thanks for flagging this. Acknowledged - we'll preserve the current memory usage to guarantee accuracy. We'll look into #3760 and revisit buffer sharing once the accuracy issues are fixed.
What this PR does / why we need it?
Fix the bug described in #3368: Duplicated Allocation of Shared Hybrid Cache in Qwen3-Next
Does this PR introduce any user-facing change?
No
How was this patch tested?
The Qwen3-Next model has been tested.