support qwen25 vl w8a8 quantization #2778
Conversation
Code Review
This pull request adds support for W8A8 quantization of the Qwen2.5 Vision-Language model on Ascend hardware. The changes add new padding functions for quantized weights, update the weight-loading logic to handle the new parameters, and add corresponding unit tests. My review identified a few areas for improvement: the new padding functions can avoid redundant reshape operations, and, more importantly, the weight-loading logic for the projection-related quantization parameters appears incomplete, which may be a bug. Please see the detailed comments for suggestions.
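For orientation, here is a minimal, hypothetical sketch of the head-dim padding idea for a fused-QKV per-channel scale, based only on the reshape/split pattern visible in the diffs below. The function name, the padded head size argument, the zero pad value, and the exact output layout are illustrative assumptions, not the PR's actual implementation.

import torch
import torch.nn.functional as F

def pad_qkv_scale_sketch(data: torch.Tensor,
                         origin_head_size: int,
                         pad_head_size: int) -> torch.Tensor:
    # Split each head's scale into two halves (mirroring the half-split in the
    # PR's diffs) and pad each half up to half of the padded head size.
    half_origin = origin_head_size // 2
    half_pad = pad_head_size // 2
    reshaped = data.reshape(-1, 3, origin_head_size, 1)
    first_half = reshaped[:, :, :half_origin, :]
    second_half = reshaped[:, :, half_origin:, :]
    # F.pad pads dims from the last backwards: (0, 0) leaves the trailing dim,
    # (0, half_pad - half_origin) extends the head-size dim on the right.
    pad = (0, 0, 0, half_pad - half_origin)
    padded = torch.cat([F.pad(first_half, pad), F.pad(second_half, pad)], dim=2)
    return padded.reshape(-1, 1)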
vllm_ascend/models/qwen2_5_vl.py
Outdated
data1 = data.reshape(
    -1, 3, self.origin_hidden_size_per_attention_head, 1
)[:, :, :self.half_origin_hidden_size_per_attention_head, :]
data2 = data.reshape(
    -1, 3, self.origin_hidden_size_per_attention_head, 1
)[:, :, self.half_origin_hidden_size_per_attention_head:, :]
The data.reshape(...) operation is performed twice with the same arguments. This is inefficient. You can store the result of the reshape operation in a variable and reuse it to improve performance and readability.
reshaped_data = data.reshape(
-1, 3, self.origin_hidden_size_per_attention_head, 1
)
data1 = reshaped_data[:, :, :self.half_origin_hidden_size_per_attention_head, :]
data2 = reshaped_data[:, :, self.half_origin_hidden_size_per_attention_head:, :]
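As a quick, illustrative check (with arbitrary stand-in sizes, not the model's real dimensions), the single-reshape refactor yields exactly the same slices as reshaping twice:

import torch

origin_head, half = 80, 40
data = torch.randn(4 * 3 * origin_head, 1)

reshaped = data.reshape(-1, 3, origin_head, 1)
assert torch.equal(
    reshaped[:, :, :half, :],
    data.reshape(-1, 3, origin_head, 1)[:, :, :half, :],
)
assert torch.equal(
    reshaped[:, :, half:, :],
    data.reshape(-1, 3, origin_head, 1)[:, :, half:, :],
)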
vllm_ascend/models/qwen2_5_vl.py
Outdated
data1 = data.reshape(
    -1, 3, self.origin_hidden_size_per_attention_head
)[:, :, :self.half_origin_hidden_size_per_attention_head]
data2 = data.reshape(
    -1, 3, self.origin_hidden_size_per_attention_head
)[:, :, self.half_origin_hidden_size_per_attention_head:]
Similar to pad_qkv_weight_scale_offset, the data.reshape(...) operation is performed twice here. This is inefficient. Please store the result of the reshape operation in a variable and reuse it.
reshaped_data = data.reshape(
-1, 3, self.origin_hidden_size_per_attention_head
)
data1 = reshaped_data[:, :, :self.half_origin_hidden_size_per_attention_head]
data2 = reshaped_data[:, :, self.half_origin_hidden_size_per_attention_head:]
vllm_ascend/models/qwen2_5_vl.py
Outdated
if ("attn.proj.weight_scale" in name or "attn.proj.weight_offset" in name) and self.enable_pad: | ||
... | ||
elif ("attn.proj.deq_scale" in name or "attn.proj.quant_bias" in name) and self.enable_pad: | ||
... |
The conditional blocks for handling attn.proj.weight_scale, attn.proj.weight_offset, attn.proj.deq_scale, and attn.proj.quant_bias contain only an ellipsis (...). This suggests an incomplete implementation. If these projection-related quantization parameters require padding when self.enable_pad is true, this is a bug that could lead to incorrect model behavior. Please implement the necessary padding logic for these parameters, similar to how it is done for the attn.qkv parameters, or add a comment explaining why they are intentionally skipped.
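For illustration only, here is a minimal sketch of the second option the reviewer mentions (making the skip explicit), written under the assumption that head-dim padding widens only the projection's input dimension, so its per-output-channel quantization parameters keep their shape; whether that assumption holds for this model is for the PR author to confirm.

# Sketch (assumption): head-dim padding only changes attn.proj's input
# dimension, so per-output-channel params (scale/offset/deq_scale/quant_bias)
# are unchanged and can be loaded as-is.
if ("attn.proj.weight_scale" in name
        or "attn.proj.weight_offset" in name) and self.enable_pad:
    # Intentionally no padding: these are per output channel of the
    # projection, and the output dimension is not padded.
    pass
elif ("attn.proj.deq_scale" in name
        or "attn.proj.quant_bias" in name) and self.enable_pad:
    # Same reasoning as above; load unchanged.
    pass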
5bce20e to d1c4720 (Compare)
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: lijiaojiao <[email protected]>
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2778      +/-   ##
==========================================
- Coverage   72.99%   72.73%    -0.27%
==========================================
  Files         153      153
  Lines       21331    21343      +12
==========================================
- Hits        15571    15523      -48
- Misses       5760     5820      +60
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@62f66be

---------
Signed-off-by: lijiaojiao <[email protected]>
Co-authored-by: lijiaojiao <[email protected]>
Signed-off-by: Yizhou Liu <[email protected]>
Signed-off-by: offline0806 <[email protected]>
What this PR does / why we need it?
support qwen25 vl w8a8 quantization
Does this PR introduce any user-facing change?
N/A
How was this patch tested?