Block-wise quantization support #1497
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
brian-dellabetta
left a comment
Thanks for taking a look! Left a few comments. it would also be good to make sure block-wise quantization can run on vllm, beyond just making sure the script runs. Apparently some of this is set up in vllm already -- #1475 (comment)
src/llmcompressor/observers/base.py
Outdated
    self._scale, self._zero_point = self.calculate_qparams(
        observed, tensor_id=None, global_scale=global_scale
    )
I think the majority of your logic should live here, or in a helper method that this calls, for readability.
src/llmcompressor/observers/base.py
Outdated
    scale_tensor = torch.zeros_like(observed)
    zero_point_tensor = torch.zeros_like(observed, dtype=torch.int32)
These should have shape (rows, num_blocks), similar to how group-wise is set up here.
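To illustrate the shape the reviewer is asking for: instead of allocating qparams with `observed`'s full shape, each row is split into blocks and one scale/zero-point pair is computed per block. The helper below is a hypothetical sketch of asymmetric min-max block-wise qparams, not the actual llm-compressor `calculate_qparams` implementation:

```python
import torch

def blockwise_minmax_qparams(observed: torch.Tensor, block_size: int, num_bits: int = 8):
    """Hypothetical helper: per-block scale/zero_point with shape
    (rows, num_blocks), as suggested in the review -- not (rows, cols)."""
    rows, cols = observed.shape
    assert cols % block_size == 0, "cols must divide evenly into blocks"
    num_blocks = cols // block_size

    # View each row as num_blocks blocks of block_size elements.
    blocks = observed.reshape(rows, num_blocks, block_size)
    mins = blocks.amin(dim=-1)  # (rows, num_blocks)
    maxs = blocks.amax(dim=-1)  # (rows, num_blocks)

    # Asymmetric int quantization range, e.g. [0, 255] for 8 bits.
    qmin, qmax = 0, 2**num_bits - 1
    scale = (maxs - mins).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - mins / scale).round().to(torch.int32)
    return scale, zero_point

w = torch.randn(512, 1024)
scale, zp = blockwise_minmax_qparams(w, block_size=128)
print(scale.shape)  # torch.Size([512, 8])
```

Note that a 512×1024 weight with block size 128 yields qparams of shape (512, 8), the same shape that appears in the broadcast error reported later in this thread.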
dsikka
left a comment
Generally speaking, all quantization types and their applications should live in compressed tensors
@dsikka @brian-dellabetta Is there any progress?
So I tried moving all the code to get_params, but the implementation has shape mismatches that cause runtime errors when trying to update the quantization parameters.
Hi @ved1beta, thanks for the update; feel free to push the changes. This is a more difficult issue than most of our "good first issue"s. I can take a look when I have some down time.

Hi @shuxiaobo, this is lower priority given the other work going on in
Hi @ved1beta , I am going to close this in favor of the PRs to support block-wise quantization from one of the vllm maintainers. You can see how functionality was added in these PRs:
We appreciate you taking an initial stab at this, though. The implementation here is the meat of adding it to llmcompressor, but as you can see from the PRs there are a lot of other things to consider. We're still trying to figure out how best to label good first issues and encourage community involvement.
SUMMARY:
Added support for block-wise quantization; changes in `calculate_qparams`. Fixes #1475.
TEST PLAN:
The repro script from the issue passes.
EDIT:
ERROR:
RuntimeError: output with shape [1] doesn't match the broadcast shape [512, 8]