Refactor dense FP8 tensor/channel/block utils and add CT FP8 block #21404
Conversation
Signed-off-by: mgoin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request introduces a significant and valuable refactoring of the FP8 quantization utilities. By centralizing the logic into fp8_utils.py, the code becomes much cleaner, more modular, and easier to maintain. The addition of block quantization support for compressed tensors is also a great enhancement.
Overall, the changes are well-structured. However, I've identified a critical bug concerning a double-transpose operation that could lead to incorrect model outputs, and a high-severity regression in the block shape validation logic that might break certain models. Addressing these issues will be important for the stability and correctness of the implementation.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Relies on recent support in compressed-tensors (neuralmagic/compressed-tensors#372) and llm-compressor (vllm-project/llm-compressor#1607) to produce the models.
This PR implements dense Linear W8A8 FP8 block quantization support for compressed-tensors models. This is focused on supporting the DeepSeekV3-style format, which has 128x128 block weights and 1x128 block activations (really per-token-group).
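To make the two granularities concrete, here is a minimal pure-PyTorch sketch of the quantization layout (the function names are hypothetical and this is illustrative reference code, not the fused Triton/CUDA path vLLM actually uses):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quant_weight_128x128(w: torch.Tensor, block: int = 128):
    """One scale per 128x128 weight block (padding of ragged edges omitted)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    blocks = w.reshape(rows // block, block, cols // block, block)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    w_q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_q.reshape(rows, cols), scale.reshape(rows // block, cols // block)

def quant_act_per_token_group(x: torch.Tensor, group: int = 128):
    """One scale per 1x128 group along the hidden dim (per-token-group)."""
    tokens, hidden = x.shape
    assert hidden % group == 0
    groups = x.reshape(tokens, hidden // group, group)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    x_q = (groups / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q.reshape(tokens, hidden), scale.squeeze(-1)
```

Each 128x128 weight block and each 1x128 activation group carries its own scale, which is the layout DeepSeekV3-style checkpoints ship with.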
Most of the logic is ported directly from fp8.py into fp8_utils.py and generalized so it can be shared across schemes.
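With the shared helpers living in fp8_utils.py, a plain unfused reference like the sketch below is handy for sanity-checking the fused block kernels (the function name and shapes are assumptions for illustration, not vLLM's actual API):

```python
import torch

def ref_block_fp8_linear(x_q, x_scale, w_q, w_scale, block: int = 128):
    """Unfused reference for W8A8 block-FP8 linear: dequantize with the
    per-token-group / per-block scales, then do a normal matmul.

    x_q: (T, K) fp8 activations, x_scale: (T, K // block)
    w_q: (N, K) fp8 weights,     w_scale: (N // block, K // block)
    """
    x = x_q.to(torch.float32) * x_scale.repeat_interleave(block, dim=1)
    w = w_q.to(torch.float32) * (
        w_scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    )
    return x @ w.t()  # (T, N) output in float32
```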
Test Plan
Green CI from the many FP8 tests and some manual benchmarks.
Test Result