[main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers #2275
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report
❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2275      +/-   ##
==========================================
+ Coverage   77.37%   77.92%   +0.54%
==========================================
  Files         128      128
  Lines       16455    16608     +153
==========================================
+ Hits        12732    12941     +209
+ Misses       3723     3667      -56

Flags with carried forward coverage won't be shown.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
x=hidden_states,
weight=w1,
group_list=group_list if group_list_type == 0 else group_list.cumsum(
Need to modify fused_experts_with_mc2(): pass expert_token_nums_type=1 to npu_moe_distribute_dispatch(), and pass group_list_type = 0 to apply_mlp_decode().
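For context, a minimal plain-PyTorch sketch (not the Ascend kernels; tensor values are illustrative) of the two group_list conventions the diff above switches between — per-expert counts versus cumulative offsets:

import torch

# Per the diff, group_list is passed through unchanged when
# group_list_type == 0 and run through cumsum otherwise, i.e. the
# grouped-matmul kernel consumes cumulative per-expert offsets.
tokens_per_expert = torch.tensor([3, 5, 0, 2])  # count form
offsets = tokens_per_expert.cumsum(dim=0)       # tensor([3, 8, 8, 10])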
Force-pushed from a4bb618 to 4705afb.
The PR title was changed from one referencing GroupedMatmulSwigluQuant in W8A8_DYNAMIC quantized MoE layers to the current "Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers".
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Fuses the GroupedMatmul, Swiglu and DynamicQuant operations into a single GroupedMatmulSwigluQuant kernel in W8A8_DYNAMIC quantized MoE layers, so the intermediate activations between the grouped matmul, activation, and quantization steps no longer need to be materialized.
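For reference, a minimal plain-PyTorch sketch of the three steps being fused, with toy shapes; this illustrates the math only, not the Ascend GroupedMatmulSwigluQuant kernel or its API:

import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    # SwiGLU as in vLLM's SiluAndMul: the first half of the projection
    # output gates (via SiLU) the second half.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def dynamic_quant_int8(x: torch.Tensor):
    # Per-token dynamic quantization: one floating-point scale per row.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# Unfused pipeline for one expert group: grouped matmul -> swiglu ->
# dynamic quant. The PR collapses these into one kernel call.
h = torch.randn(16, 64)    # toy tokens routed to one expert
w1 = torch.randn(64, 256)  # toy gate/up projection weight
q, q_scale = dynamic_quant_int8(swiglu(h @ w1))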
Does this PR introduce any user-facing change?
How was this patch tested?
Tested on the W8A8 quantized Qwen3-235B-A22B model with bs=16:
- tp=8, dp=1, moe_tp=8, moe_ep=1: TPOT improved 21.54%, Output Token Throughput increased 27.35%
- tp=8, dp=1, moe_tp=1, moe_ep=8: TPOT improved 17.38%, Output Token Throughput increased 6.86%