[HybridKV][Bugfix] Fix Hybrid kvcache sharing bug in same attention type #3760
Conversation
Code Review
This pull request addresses a bug in hybrid KV cache sharing for models with multiple attention types. The changes in vllm_ascend/worker/model_runner_v1.py correctly refactor the cache initialization logic to ensure that tensors are allocated only once per shared group and attention type, then assigned to all relevant layers. This fixes the issue where tensors were being re-allocated instead of shared. The logic is sound and effectively resolves the bug. The test case in tests/e2e/multicard/test_qwen3_next.py has been updated to use a larger batch size, which is a suitable change to validate the fix under multi-card parallelism. Overall, this is a solid bug fix.
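To make the described refactor concrete, here is a minimal sketch of the allocate-once-then-bind pattern, assuming hypothetical names (`kv_cache_groups`, `layer_specs`, `kv_caches`) rather than the actual identifiers in `vllm_ascend/worker/model_runner_v1.py`:

```python
import torch
from collections import defaultdict

def initialize_kv_cache(kv_cache_groups, device="cpu"):
    """Sketch only: allocate one buffer per (shared group, attention spec type)
    and bind it to every layer sharing that spec, instead of allocating a new
    buffer per layer. In practice the device would be the NPU device."""
    kv_caches = {}
    for group in kv_cache_groups:
        # Bucket the group's layers by attention spec type so that, e.g.,
        # full-attention and linear-attention layers never alias one buffer.
        layers_by_spec_type = defaultdict(list)
        for layer_name, spec in group.layer_specs.items():
            layers_by_spec_type[type(spec)].append((layer_name, spec))

        for layers in layers_by_spec_type.values():
            # Allocate the tensor once per attention type within the group ...
            _, first_spec = layers[0]
            shared_tensor = torch.zeros(first_spec.shape,
                                        dtype=first_spec.dtype,
                                        device=device)
            # ... then assign the same tensor to all layers of that type.
            for layer_name, _ in layers:
                kv_caches[layer_name] = shared_tensor
    return kv_caches
```

The key point is that allocation happens once per shared group and attention type; the remaining layers only receive references to the already-allocated tensor.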
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Any relationship with this PR #3404?

We both attempt to fix the kvcache allocation duplication issue, but I keep allocating buffers for different attn types separately to avoid accuracy issues.
Signed-off-by: MengqingCao <[email protected]>
What this PR does / why we need it?
Part of #3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that layers with the same attention spec can share the same buffer instead of allocating more HBM. After this PR, kvcache memory usage on Qwen3-Next is reduced by 50% compared with before (`self_attn:linear_attn = 1:3` in an `attn_group`), and `gpu_memory_utilization` can be increased to `0.8` when running Qwen3-Next on A2 64G/card with tp4.

Does this PR introduce any user-facing change?
How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.
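For illustration, a minimal e2e-style check in the spirit of the updated test (not the actual `tests/e2e/multicard/test_qwen3_next.py`; the model id, batch size, and sampling settings below are assumptions):

```python
from vllm import LLM, SamplingParams

def test_qwen3_next_large_batch_tp4():
    # Assumed settings; the real test may use a different model id and config.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.8,
    )
    # A larger batch exercises hybrid KV-cache sharing under multi-card parallelism.
    prompts = ["Hello, my name is"] * 32
    outputs = llm.generate(prompts, SamplingParams(max_tokens=16))
    assert len(outputs) == len(prompts)
```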