Conversation

Contributor

@wenba0 wenba0 commented Sep 5, 2025

What this PR does / why we need it?

Support W8A8 quantization for Qwen2.5-VL.
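
For context, a minimal usage sketch of how a W8A8-quantized Qwen2.5-VL checkpoint is typically served through vLLM on Ascend. This is an assumption-laden illustration, not something defined by this PR: the model path is a placeholder, and both the msModelSlim-produced checkpoint and the `quantization="ascend"` option are assumptions based on vllm-ascend's quantization documentation.

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch: serve a W8A8-quantized Qwen2.5-VL checkpoint
# (assumed to be produced by msModelSlim) with vllm-ascend's "ascend"
# quantization backend. Path and options are placeholders.
llm = LLM(
    model="/path/to/Qwen2.5-VL-7B-Instruct-w8a8",  # placeholder path
    quantization="ascend",
    max_model_len=4096,
)
outputs = llm.generate(["Describe the image."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```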

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for W8A8 quantization for the Qwen2.5 Vision-Language model on Ascend hardware. The changes include adding new padding functions for quantized weights, updating the weight loading logic to handle these new parameters, and adding corresponding unit tests. My review has identified a few areas for improvement. There are a couple of opportunities to optimize the new padding functions by avoiding redundant reshape operations. More importantly, there appears to be an incomplete implementation in the weight loading logic for projection-related quantization parameters, which could be a potential bug. Please see the detailed comments for suggestions.

Comment on lines 295 to 300
        data1 = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head, 1
        )[:, :, :self.half_origin_hidden_size_per_attention_head, :]
        data2 = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head, 1
        )[:, :, self.half_origin_hidden_size_per_attention_head:, :]
Contributor


Severity: high

The data.reshape(...) operation is performed twice with the same arguments. This is inefficient. You can store the result of the reshape operation in a variable and reuse it to improve performance and readability.

        reshaped_data = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head, 1
        )
        data1 = reshaped_data[:, :, :self.half_origin_hidden_size_per_attention_head, :]
        data2 = reshaped_data[:, :, self.half_origin_hidden_size_per_attention_head:, :]

Comment on lines 316 to 321
        data1 = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head
        )[:, :, :self.half_origin_hidden_size_per_attention_head]
        data2 = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head
        )[:, :, self.half_origin_hidden_size_per_attention_head:]
Contributor


Severity: high

Similar to pad_qkv_weight_scale_offset, the data.reshape(...) operation is performed twice here. This is inefficient. Please store the result of the reshape operation in a variable and reuse it.

        reshaped_data = data.reshape(
            -1, 3, self.origin_hidden_size_per_attention_head
        )
        data1 = reshaped_data[:, :, :self.half_origin_hidden_size_per_attention_head]
        data2 = reshaped_data[:, :, self.half_origin_hidden_size_per_attention_head:]

Comment on lines 359 to 360
if ("attn.proj.weight_scale" in name or "attn.proj.weight_offset" in name) and self.enable_pad:
...
elif ("attn.proj.deq_scale" in name or "attn.proj.quant_bias" in name) and self.enable_pad:
...
Contributor


Severity: high

The conditional blocks for handling attn.proj.weight_scale, attn.proj.weight_offset, attn.proj.deq_scale, and attn.proj.quant_bias contain only an ellipsis (...). This suggests an incomplete implementation. If these projection-related quantization parameters require padding when self.enable_pad is true, this is a bug that could lead to incorrect model behavior. Please implement the necessary padding logic for these parameters, similar to how it's done for attn.qkv parameters, or add a comment explaining why they are intentionally skipped.
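
If these parameters do turn out to need padding, a helper along the following lines could be adapted. This is a hypothetical sketch only: the function and argument names are made up for illustration, and it assumes the attn.proj parameters are laid out per head and follow the same half-split padding pattern visible in the qkv excerpts above, which this PR does not confirm.

```python
import torch
import torch.nn.functional as F


def pad_per_head_halves(data: torch.Tensor, origin_head_size: int,
                        padded_head_size: int) -> torch.Tensor:
    # Hypothetical helper: split each head's values into two halves and
    # zero-pad each half to padded_head_size // 2, mirroring the
    # half-split seen in the qkv padding excerpts above. Whether
    # attn.proj.* parameters actually use this layout is not verified.
    half_origin = origin_head_size // 2
    half_pad = padded_head_size // 2 - half_origin
    reshaped = data.reshape(-1, origin_head_size)
    first, second = reshaped[:, :half_origin], reshaped[:, half_origin:]
    padded = torch.cat(
        [F.pad(first, (0, half_pad)), F.pad(second, (0, half_pad))], dim=-1)
    return padded.reshape(-1)


# Example: 16 heads with head size 80, padded to 128.
scale = torch.randn(16 * 80)
print(pad_per_head_halves(scale, 80, 128).shape)  # torch.Size([2048])
```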

@wenba0 wenba0 force-pushed the qwen25_vl_7b branch 2 times, most recently from 5bce20e to d1c4720 on September 5, 2025 at 07:43

github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

lijiaojiao added 5 commits September 7, 2025 14:50
Signed-off-by: lijiaojiao <[email protected]>
Signed-off-by: lijiaojiao <[email protected]>
Signed-off-by: lijiaojiao <[email protected]>
Signed-off-by: lijiaojiao <[email protected]>

codecov bot commented Sep 7, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.73%. Comparing base (4c90fa7) to head (26c4524).
⚠️ Report is 41 commits behind head on main.

Files with missing lines           Patch %   Lines
vllm_ascend/models/qwen2_5_vl.py   63.33%    11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2778      +/-   ##
==========================================
- Coverage   72.99%   72.73%   -0.27%     
==========================================
  Files         153      153              
  Lines       21331    21343      +12     
==========================================
- Hits        15571    15523      -48     
- Misses       5760     5820      +60     
Flag        Coverage Δ
unittests   72.73% <83.33%> (-0.27%) ⬇️

@wangxiyuan wangxiyuan merged commit bd3dede into vllm-project:main Sep 11, 2025
25 checks passed
yiz-liu pushed a commit to linfeng-yuan/vllm-ascend that referenced this pull request Sep 12, 2025
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@62f66be

---------

Signed-off-by: lijiaojiao <[email protected]>
Co-authored-by: lijiaojiao <[email protected]>
Signed-off-by: Yizhou Liu <[email protected]>
offline893 pushed a commit to offline893/vllm-ascend that referenced this pull request Sep 16, 2025
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@62f66be

---------

Signed-off-by: lijiaojiao <[email protected]>
Co-authored-by: lijiaojiao <[email protected]>
Signed-off-by: offline0806 <[email protected]>