[HybridKV][Bugfix] Fix Hybrid kvcache sharing bug in same attention type #3760
Conversation
Code Review
This pull request addresses a bug in hybrid KV cache sharing for models with multiple attention types. The changes in vllm_ascend/worker/model_runner_v1.py correctly refactor the cache initialization logic to ensure that tensors are allocated only once per shared group and attention type, then assigned to all relevant layers. This fixes the issue where tensors were being re-allocated instead of shared. The logic is sound and effectively resolves the bug. The test case in tests/e2e/multicard/test_qwen3_next.py has been updated to use a larger batch size, which is a suitable change to validate the fix under multi-card parallelism. Overall, this is a solid bug fix.
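To make the described refactor concrete, here is a minimal sketch of the allocate-once-then-bind pattern, assuming hypothetical names (`kv_cache_groups`, `layer_specs`, `kv_caches`) rather than the actual identifiers in `vllm_ascend/worker/model_runner_v1.py`:

```python
import torch
from collections import defaultdict

def initialize_kv_cache(kv_cache_groups, device="cpu"):
    """Sketch only: allocate one buffer per (shared group, attention spec type)
    and bind it to every layer sharing that spec, instead of allocating a new
    buffer per layer. In practice the device would be the NPU device."""
    kv_caches = {}
    for group in kv_cache_groups:
        # Bucket the group's layers by attention spec type so that, e.g.,
        # full-attention and linear-attention layers never alias one buffer.
        layers_by_spec_type = defaultdict(list)
        for layer_name, spec in group.layer_specs.items():
            layers_by_spec_type[type(spec)].append((layer_name, spec))

        for layers in layers_by_spec_type.values():
            # Allocate the tensor once per attention type within the group ...
            _, first_spec = layers[0]
            shared_tensor = torch.zeros(first_spec.shape,
                                        dtype=first_spec.dtype,
                                        device=device)
            # ... then assign the same tensor to all layers of that type.
            for layer_name, _ in layers:
                kv_caches[layer_name] = shared_tensor
    return kv_caches
```

The key point is that allocation happens once per shared group and attention type; the remaining layers only receive references to the already-allocated tensor.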
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Any relationship with this PR #3404?

We both attempt to fix the kvcache allocation duplication issue, but I keep allocating buffers for different attn types separately to avoid accuracy issues.
Signed-off-by: MengqingCao <[email protected]>
What this PR does / why we need it?
Part of #3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that layers with the same attention spec can share the same buffer instead of allocating more HBM. After this PR, kvcache memory usage on Qwen3-Next is reduced by 50% compared with before (`self_attn:linear_attn = 1:3` in an `attn_group`), and `gpu_memory_utilization` can be increased to `0.8` when running Qwen3-Next on A2 64G/card with tp4.

Does this PR introduce any user-facing change?
How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.
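For illustration, a minimal e2e-style check in the spirit of the updated test (not the actual `tests/e2e/multicard/test_qwen3_next.py`; the model id, batch size, and sampling settings below are assumptions):

```python
from vllm import LLM, SamplingParams

def test_qwen3_next_large_batch_tp4():
    # Assumed settings; the real test may use a different model id and config.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.8,
    )
    # A larger batch exercises hybrid KV-cache sharing under multi-card parallelism.
    prompts = ["Hello, my name is"] * 32
    outputs = llm.generate(prompts, SamplingParams(max_tokens=16))
    assert len(outputs) == len(prompts)
```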