Refactor dense FP8 tensor/channel/block utils and add CT FP8 block #21404
Conversation
Signed-off-by: mgoin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request introduces a significant and valuable refactoring of the FP8 quantization utilities. By centralizing the logic into fp8_utils.py, the code becomes much cleaner, more modular, and easier to maintain. The addition of block quantization support for compressed tensors is also a great enhancement.
Overall, the changes are well-structured. However, I've identified a critical bug concerning a double-transpose operation that could lead to incorrect model outputs, and a high-severity regression in the block shape validation logic that might break certain models. Addressing these issues will be important for the stability and correctness of the implementation.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Relies on recent support in compressed-tensors (neuralmagic/compressed-tensors#372) and llm-compressor (vllm-project/llm-compressor#1607) to produce the models.
This PR implements dense Linear W8A8 FP8 block quantization support for compressed-tensors models. This is focused on supporting the DeepSeekV3-style format, which has 128x128 block weights and 1x128 block activations (really per-token-group).
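To make the two granularities concrete, here is a minimal pure-PyTorch sketch of the quantization layout (the function names are hypothetical and this is illustrative reference code, not the fused Triton/CUDA path vLLM actually uses):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quant_weight_128x128(w: torch.Tensor, block: int = 128):
    """One scale per 128x128 weight block (padding of ragged edges omitted)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    blocks = w.reshape(rows // block, block, cols // block, block)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    w_q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_q.reshape(rows, cols), scale.reshape(rows // block, cols // block)

def quant_act_per_token_group(x: torch.Tensor, group: int = 128):
    """One scale per 1x128 group along the hidden dim (per-token-group)."""
    tokens, hidden = x.shape
    assert hidden % group == 0
    groups = x.reshape(tokens, hidden // group, group)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    x_q = (groups / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q.reshape(tokens, hidden), scale.squeeze(-1)
```

Each 128x128 weight block and each 1x128 activation group carries its own scale, which is the layout DeepSeekV3-style checkpoints ship with.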
Most of the logic is ported directly from fp8.py into fp8_utils.py and generalized so it can be shared across schemes.
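With the shared helpers living in fp8_utils.py, a plain unfused reference like the sketch below is handy for sanity-checking the fused block kernels (the function name and shapes are assumptions for illustration, not vLLM's actual API):

```python
import torch

def ref_block_fp8_linear(x_q, x_scale, w_q, w_scale, block: int = 128):
    """Unfused reference for W8A8 block-FP8 linear: dequantize with the
    per-token-group / per-block scales, then do a normal matmul.

    x_q: (T, K) fp8 activations, x_scale: (T, K // block)
    w_q: (N, K) fp8 weights,     w_scale: (N // block, K // block)
    """
    x = x_q.to(torch.float32) * x_scale.repeat_interleave(block, dim=1)
    w = w_q.to(torch.float32) * (
        w_scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    )
    return x @ w.t()  # (T, N) output in float32
```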
Test Plan
Green CI from the many FP8 tests and some manual benchmarks.
Test Result