[main][quantization] Adapt to the new format of ds w4a8 weight #2392
base: main
Conversation
…weights Signed-off-by: Wang Kunpeng <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adapts the w4a8 dynamic quantization method to support a new weight format, introducing version-dependent logic for weight and parameter creation and processing. The changes are mostly in vllm_ascend/quantization/w4a8_dynamic.py and are accompanied by updates to the unit tests. While the changes are generally well-structured, I found a critical bug related to a shape inconsistency in the new format's w13_scale_bias parameter, which would likely lead to runtime errors. I've provided suggestions to fix this in both the implementation and the tests.
param_dict["w13_scale_bias"] = torch.empty( | ||
num_experts, | ||
2 * intermediate_size_per_partition, | ||
1, | ||
dtype=torch.float32) |
There seems to be a shape inconsistency for w13_scale_bias in the new quantization format. For the new version, w13_weight is created with w13_output_size = intermediate_size_per_partition. However, w13_scale_bias is created with a dimension of 2 * intermediate_size_per_partition. This is inconsistent with the corresponding weight and will likely cause runtime errors or incorrect results. The dimension should probably be intermediate_size_per_partition to match w13_weight.
param_dict["w13_scale_bias"] = torch.empty( | |
num_experts, | |
2 * intermediate_size_per_partition, | |
1, | |
dtype=torch.float32) | |
param_dict["w13_scale_bias"] = torch.empty( | |
num_experts, | |
intermediate_size_per_partition, | |
1, | |
dtype=torch.float32) |
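To illustrate the per-output-channel convention this comment assumes (a generic sketch, not vllm-ascend code): a per-channel scale/bias needs one entry per output row of the weight it accompanies, so its second dimension should track the weight's.

import torch

# Generic illustration only; the expert count and sizes below are arbitrary.
num_experts, out_dim, in_dim = 4, 16, 32
w13_weight = torch.zeros(num_experts, out_dim, in_dim, dtype=torch.int8)
# One scale-bias value per output row of each expert's weight: shape (E, out_dim, 1).
w13_scale_bias = torch.zeros(num_experts, w13_weight.shape[1], 1, dtype=torch.float32)
assert w13_scale_bias.shape[1] == w13_weight.shape[1]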
w13_scale_bias = torch.zeros((self.experts, 2 * self.input_size, 1),
                             dtype=torch.float32)
new_layer.w13_scale_bias = torch.nn.Parameter(w13_scale_bias,
                                              requires_grad=False)
The setup for w13_scale_bias seems to be based on an incorrect shape definition in get_dynamic_quant_param. The dimension 2 * self.input_size is inconsistent with the corresponding w13_weight's dimension for the new quantization version. This should be self.input_size to match the weight.
Suggested change:
-w13_scale_bias = torch.zeros((self.experts, 2 * self.input_size, 1),
-                             dtype=torch.float32)
-new_layer.w13_scale_bias = torch.nn.Parameter(w13_scale_bias,
-                                              requires_grad=False)
+w13_scale_bias = torch.zeros((self.experts, self.input_size, 1),
+                             dtype=torch.float32)
+new_layer.w13_scale_bias = torch.nn.Parameter(w13_scale_bias,
+                                              requires_grad=False)
self.assertEqual(new_layer.w13_scale_bias.data.shape,
                 (self.experts, 2 * self.input_size))
This assertion checks for a shape that is inconsistent with the corresponding weight's shape. Following the correction in the w13_scale_bias setup, this assertion should be updated to check for the correct shape, which should use self.input_size instead of 2 * self.input_size.
Suggested change:
-self.assertEqual(new_layer.w13_scale_bias.data.shape,
-                 (self.experts, 2 * self.input_size))
+self.assertEqual(new_layer.w13_scale_bias.data.shape,
+                 (self.experts, self.input_size))
Codecov Report (❌ patch coverage)
Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #2392   +/-  ##
=======================================
  Coverage        ?   76.26%
=======================================
  Files           ?      120
  Lines           ?    13582
  Branches        ?        0
=======================================
  Hits            ?    10358
  Misses          ?     3224
  Partials        ?        0
Signed-off-by: Wang Kunpeng <[email protected]>
…to main-w4a8-0815
…to main-w4a8-0815
…to main-w4a8-0815
e2e passed here: https://github.com/vllm-project/vllm-ascend/actions/runs/17088776648
What this PR does / why we need it?
The DeepSeek w4a8 weights we supported before were in the MindIE format. That format uses int8 to represent int4, so the weight size is similar to w8a8, and we had to do a few extra steps to make vllm-ascend load it.
Now we can directly use the new weight format, which packs two int4 values into one int8 to store the weight. The weight size is reduced and no extra operations are needed to use it on vllm-ascend. We also remain compatible with weights in the previous MindIE format.
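As a rough illustration of the packing idea only (not the actual msmodelslim/vllm-ascend weight layout, whose nibble order and packing axis may differ), two signed int4 values can share one byte like this:

import torch

# Illustrative sketch: pack pairs of int4 values (range [-8, 7]) into one byte each,
# halving storage relative to the old one-int8-per-int4 representation.
def pack_int4_pairs(w_int4: torch.Tensor) -> torch.Tensor:
    # w_int4 holds int4 values in an integer tensor; the last dim must be even.
    w = w_int4.to(torch.int32) & 0x0F                  # keep only the low 4 bits
    packed = w[..., 0::2] | (w[..., 1::2] << 4)        # even index -> low nibble, odd -> high
    return packed.to(torch.uint8)

def unpack_int4_pairs(packed: torch.Tensor) -> torch.Tensor:
    p = packed.to(torch.int32)
    nibbles = torch.stack((p & 0x0F, (p >> 4) & 0x0F), dim=-1).flatten(-2)
    return torch.where(nibbles >= 8, nibbles - 16, nibbles).to(torch.int8)  # sign-extend

w = torch.randint(-8, 8, (2, 8), dtype=torch.int8)
assert torch.equal(unpack_int4_pairs(pack_int4_pairs(w)), w)

Because packing halves one axis of the stored tensor, the new-format tensors can have different nominal dimensions than the old layout, which is presumably why this PR adds version-dependent shape logic.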
The weight changes in the new version:
Does this PR introduce any user-facing change?
no
How was this patch tested?
Added a UT case in tests/ut/quantization/test_w4a8_dynamic.py.
1. How to get weights using Modelslim
Installation steps
We can use the branch br_release_MindStudio_8.1.RC2_TR5_20260624:
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md; execute the pre-check and the DeepSeek-R1 w4a8 mixed quantization chapter.
Reference command: python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path}
Adapt to vllm-ascend
Modification in config.json: "model_type": "deepseekv2" is changed to "model_type": "deepseek_v3".
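A small sketch of that edit (the weight path below is only an example, not a required location):

import json
import pathlib

# Hypothetical helper: switch "model_type" from "deepseekv2" to "deepseek_v3"
# in the generated weight directory's config.json.
cfg_path = pathlib.Path("/weightpath/w4a8_quantized/config.json")  # example path
cfg = json.loads(cfg_path.read_text())
if cfg.get("model_type") == "deepseekv2":
    cfg["model_type"] = "deepseek_v3"
    cfg_path.write_text(json.dumps(cfg, indent=2))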
2. How to run w4a8
a. How to run eager mode
export VLLM_ASCEND_MLA_PA=1
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager
b. How to run graph mode
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
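For readability, the --additional_config value used in graph mode, pretty-printed:

{
  "ascend_scheduler_config": { "enabled": true },
  "torchair_graph_config": { "enabled": true }
}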