
Conversation

@MengqingCao (Collaborator) commented on Oct 25, 2025

What this PR does / why we need it?

Part of #3106
Fix the hybrid KV cache sharing bug for layers with the same attention type.
Change the shared_by logic so that layers with the same attention spec share the same buffer instead of allocating additional HBM.
After this PR, KV cache memory usage on Qwen3-Next is reduced by 50% compared with before (self_attn:linear_attn = 1:3 in an attn_group), and gpu_memory_utilization can be raised to 0.8 on Qwen3-Next when running on A2 (64 GB per card) with TP4. A sketch of the sharing idea is shown below.
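For readers outside this codebase, here is a minimal sketch of the allocation strategy described above; the names (init_kv_cache_buffers, layer_specs, page_bytes) are hypothetical and not the actual model_runner_v1.py API. The idea is that within one shared group a buffer is created once per distinct attention spec and the same tensor is handed to every layer using that spec, while different attention types still get their own buffers.

```python
# Hypothetical sketch (not the real vllm_ascend code): allocate one KV cache
# buffer per distinct attention spec and reuse it across all layers with that
# spec, instead of allocating a fresh buffer for every layer.
from collections import defaultdict

import torch


def init_kv_cache_buffers(layer_specs, page_bytes, num_pages, device="cpu"):
    """layer_specs maps layer_name -> a hashable attention-spec key,
    e.g. ("self_attn", block_size, head_dim) or ("linear_attn", state_size)."""
    buffer_per_spec = {}   # one tensor per distinct attention spec
    kv_caches = {}         # layer_name -> shared tensor
    layers_sharing = defaultdict(list)
    for layer_name, spec in layer_specs.items():
        if spec not in buffer_per_spec:
            # Allocate only once per spec; same-type layers share this storage.
            buffer_per_spec[spec] = torch.empty(
                num_pages * page_bytes, dtype=torch.int8, device=device)
        kv_caches[layer_name] = buffer_per_spec[spec]
        layers_sharing[spec].append(layer_name)
    return kv_caches, layers_sharing
```

With a self_attn:linear_attn ratio of 1:3 in an attn_group, same-type layers reusing one buffer is consistent with the reported ~50% HBM saving, while keeping separate buffers per attention type (as noted in the discussion below) avoids sharing state across incompatible attention kinds.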


Does this PR introduce any user-facing change?

How was this patch tested?

Tests pass with the latest e2e test case on Qwen3-Next.

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request addresses a bug in hybrid KV cache sharing for models with multiple attention types. The changes in vllm_ascend/worker/model_runner_v1.py correctly refactor the cache initialization logic to ensure that tensors are allocated only once per shared group and attention type, then assigned to all relevant layers. This fixes the issue where tensors were being re-allocated instead of shared. The logic is sound and effectively resolves the bug. The test case in tests/e2e/multicard/test_qwen3_next.py has been updated to use a larger batch size, which is a suitable change to validate the fix under multi-card parallelism. Overall, this is a solid bug fix.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@wangxiyuan (Collaborator) commented:

Any relationship with this PR #3404?

@MengqingCao (Collaborator, Author) replied:

> Any relationship with this PR #3404?

We both attempt to fix the KV cache allocation duplication issue, but I keep allocating buffers for different attn types separately to avoid accuracy issues.

@MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels on Oct 25, 2025
Signed-off-by: MengqingCao <[email protected]>
@yiz-liu merged commit 900086f into vllm-project:main on Oct 29, 2025 (39 checks passed)
@MengqingCao deleted the next_kv branch on Oct 29, 2025 at 06:23

Labels: dist-test, module:tests, ready (read for review), ready-for-test (start test by label for PR)
