Block-wise quantization support #1497
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
brian-dellabetta
left a comment
Thanks for taking a look! Left a few comments. it would also be good to make sure block-wise quantization can run on vllm, beyond just making sure the script runs. Apparently some of this is set up in vllm already -- #1475 (comment)
src/llmcompressor/observers/base.py
Outdated
    self._scale, self._zero_point = self.calculate_qparams(
        observed, tensor_id=None, global_scale=global_scale
    )
I think the majority of your logic should live here, or in a helper method that this calls, for readability.
src/llmcompressor/observers/base.py
Outdated
    scale_tensor = torch.zeros_like(observed)
    zero_point_tensor = torch.zeros_like(observed, dtype=torch.int32)
These should have shape (rows, num_blocks), similar to how group-wise is set up here.
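To illustrate the shape the reviewer is asking for: instead of allocating qparams with `observed`'s full shape, each row is split into blocks and one scale/zero-point pair is computed per block. The helper below is a hypothetical sketch of asymmetric min-max block-wise qparams, not the actual llm-compressor `calculate_qparams` implementation:

```python
import torch

def blockwise_minmax_qparams(observed: torch.Tensor, block_size: int, num_bits: int = 8):
    """Hypothetical helper: per-block scale/zero_point with shape
    (rows, num_blocks), as suggested in the review -- not (rows, cols)."""
    rows, cols = observed.shape
    assert cols % block_size == 0, "cols must divide evenly into blocks"
    num_blocks = cols // block_size

    # View each row as num_blocks blocks of block_size elements.
    blocks = observed.reshape(rows, num_blocks, block_size)
    mins = blocks.amin(dim=-1)  # (rows, num_blocks)
    maxs = blocks.amax(dim=-1)  # (rows, num_blocks)

    # Asymmetric int quantization range, e.g. [0, 255] for 8 bits.
    qmin, qmax = 0, 2**num_bits - 1
    scale = (maxs - mins).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - mins / scale).round().to(torch.int32)
    return scale, zero_point

w = torch.randn(512, 1024)
scale, zp = blockwise_minmax_qparams(w, block_size=128)
print(scale.shape)  # torch.Size([512, 8])
```

Note that a 512×1024 weight with block size 128 yields qparams of shape (512, 8), the same shape that appears in the broadcast error reported later in this thread.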
dsikka
left a comment
Generally speaking, all quantization types and their applications should live in compressed tensors
@dsikka @brian-dellabetta Is there any progress?
So I tried moving all the code to get_params, but the implementation has shape mismatches that cause runtime errors when trying to update the quantization parameters.
Hi @ved1beta, thanks for the update; feel free to push the changes. This is a more difficult issue than most of our "good first issue"s. I can take a look when I have some down time.

Hi @shuxiaobo, this is lower priority given the other work going on in
Hi @ved1beta , I am going to close this in favor of the PRs to support block-wise quantization from one of the vllm maintainers. You can see how functionality was added in these PRs:
We appreciate you taking an initial stab at this, though. The implementation here is the meat of adding it to llmcompressor, but as you can see from the PRs there are a lot of other things to consider. We're still trying to figure out how best to label good first issues and encourage community involvement.
SUMMARY:
Added support for block-wise quantization; changes in `calculate_qparams`. Fixes #1475.
TEST PLAN:
The repro script from the issue passes.
EDIT:
ERROR:
RuntimeError: output with shape [1] doesn't match the broadcast shape [512, 8]