
[main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers #2275


Open
zhoux77899 wants to merge 43 commits into base: main

Conversation

zhoux77899 (Contributor) commented Aug 8, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Tested on W8A8 quantized Qwen3-235B-A22B model with bs=16

  1. tp=8, dp=1, moe_tp=8, moe_ep=1: TPOP increased 21.54%, Output Token Throughput increased 27.35% (benchmark screenshot attached in the PR)
  2. tp=8, dp=1, moe_tp=1, moe_ep=8: TPOP increased 17.38%, Output Token Throughput increased 6.86% (benchmark screenshot attached in the PR)
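
For context on what "fuse GroupedMatmul, Swiglu and DynamicQuant" means computationally, below is a minimal pure-PyTorch sketch of the three steps the fused NPU operator replaces in the W8A8_DYNAMIC MoE path: a per-expert grouped matmul over the gate/up projection, a SwiGLU activation, and per-token dynamic int8 quantization of the activation. This is a reference for the dataflow only, not the kernel used by the PR; the function name and weight layout are illustrative assumptions, and weights are kept in float here for clarity even though W8A8 quantizes them as well.

```python
import torch
import torch.nn.functional as F

def grouped_gate_up_swiglu_quant_reference(hidden_states: torch.Tensor,
                                           w1: torch.Tensor,
                                           group_list: torch.Tensor):
    """Illustrative unfused reference (assumed shapes, not the PR's kernel).

    hidden_states: [num_tokens, hidden], tokens already sorted by expert
    w1:            [num_experts, hidden, 2 * intermediate], gate and up projections
    group_list:    [num_experts], cumulative token offsets (group_list_type == 0)
    """
    quantized, scales = [], []
    start = 0
    for e in range(w1.shape[0]):
        end = int(group_list[e])
        # 1) GroupedMatmul: each expert multiplies only the tokens routed to it.
        gate_up = hidden_states[start:end] @ w1[e]
        # 2) Swiglu: silu(gate) * up.
        gate, up = gate_up.chunk(2, dim=-1)
        act = F.silu(gate) * up
        # 3) DynamicQuant: per-token int8 scale from the row-wise max-abs value.
        scale = act.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        quantized.append(torch.clamp(torch.round(act / scale), -128, 127).to(torch.int8))
        scales.append(scale)
        start = end
    return torch.cat(quantized, dim=0), torch.cat(scales, dim=0)
```

The fused operator performs these three steps in a single call, so the full-precision gate_up and act intermediates never have to be written back to memory between kernels; that saved memory traffic and kernel-launch overhead is presumably where the reported TPOP and throughput gains come from.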


github-actions bot commented Aug 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Signed-off-by: zhoux77899 <[email protected]>

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 83.73494% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.92%. Comparing base (1de16ea) to head (90ea998).

Files with missing lines                      Patch %    Lines
vllm_ascend/quantization/w8a8_dynamic.py      43.33%     17 Missing ⚠️
vllm_ascend/quantization/w4a8_dynamic.py      70.58%     10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2275      +/-   ##
==========================================
+ Coverage   77.37%   77.92%   +0.54%     
==========================================
  Files         128      128              
  Lines       16455    16608     +153     
==========================================
+ Hits        12732    12941     +209     
+ Misses       3723     3667      -56     
Flag         Coverage Δ
unittests    77.92% <83.73%> (+0.54%) ⬆️

Flags with carried forward coverage won't be shown.


Signed-off-by: zhoux77899 <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>

This pull request has conflicts, please resolve those before we can evaluate the pull request.

x=hidden_states,
weight=w1,
group_list=group_list if group_list_type == 0 else group_list.cumsum(


Need to modify fused_experts_with_mc2(): pass expert_token_nums_type=1 to npu_moe_distribute_dispatch() and pass group_list_type=0 to apply_mlp_decode().
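
For readers following this review thread: the guard in the quoted snippet distinguishes two encodings of group_list. When group_list_type == 0 the tensor is passed through unchanged; otherwise it is converted with a cumulative sum, which turns per-expert token counts into cumulative offsets. A small illustrative sketch of that conversion (the values, dtype, and surrounding variables are assumptions for illustration, not the PR's exact code):

```python
import torch

# Hypothetical routing result: 4 experts receive 3, 0, 5 and 2 tokens.
token_counts = torch.tensor([3, 0, 5, 2], dtype=torch.int64)

# Cumulative form: expert e owns rows [cumulative[e-1], cumulative[e]) of the
# expert-sorted token buffer.
cumulative = token_counts.cumsum(dim=0)        # tensor([ 3,  3,  8, 10])

# Mirroring the quoted guard: pass group_list straight through when it is
# already cumulative (group_list_type == 0), otherwise convert it.
group_list_type = 1
group_list = token_counts
group_list = group_list if group_list_type == 0 else group_list.cumsum(dim=0)
```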

Signed-off-by: zhoux77899 <[email protected]>
@zhoux77899 zhoux77899 changed the title [main] Support GroupedMatmulSwigluQuant in W8A8_DYNAMIC quantized MoE layers [main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers Aug 16, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: zhoux77899 <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>