Use torch.compile to speed up GPTQ algo #1561
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @aladerran, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to enhance the performance of the GPTQ quantization process by integrating torch.compile. The changes primarily focus on optimizing the quantize_weight function, which is a critical component of the GPTQ algorithm, to achieve faster execution times.
Highlights
- Performance Optimization: The core change involves applying torch.compile to the quantize_weight function within the GPTQ quantization algorithm to significantly speed up its execution.
- TorchDynamo Configuration: Necessary torch._dynamo.config imports and settings, specifically torch._dynamo.config.capture_scalar_outputs = True, have been added to ensure proper compilation and avoid potential issues with scalar outputs.
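As a rough sketch of the pattern these highlights describe (simplified shapes and a hypothetical signature, not the actual llm-compressor quantize_weight), the decorated function could look like:

import torch

# Allow Dynamo to trace scalar reads (e.g. .item()) instead of falling back to eager.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def quantize_weight(W: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # Stand-in for the GPTQ weight update: round-to-nearest quantize/dequantize.
    q = torch.clamp(torch.round(W / scale) + zero, 0, 15)
    return (q - zero) * scale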
Code Review
This pull request successfully integrates torch.compile into the GPTQ quantization process, which the provided profiling results indicate leads to significant speedups. The changes involve adding the necessary import, applying the @torch.compile decorator, and setting a torch._dynamo.config flag. The core change of applying the decorator is well-placed. The only feedback is a minor suggestion regarding the global configuration setting for torch.compile to improve code clarity and maintainability.
@@ -16,6 +17,8 @@
from llmcompressor.observers.base import Observer
from llmcompressor.pytorch.utils.helpers import tensor_sparsity

torch._dynamo.config.capture_scalar_outputs = True
Setting torch._dynamo.config.capture_scalar_outputs = True at the module level applies this configuration globally to any code that imports this module. While this might be necessary for torch.compile to function correctly with the quantize_weight function, it's a broad setting that could potentially affect other parts of the codebase in unexpected ways. Consider adding a brief comment explaining why this setting is needed specifically for this module/function and acknowledging its global scope.
Suggested change:
# Enable scalar capture for torch.compile, potentially needed for control flow
torch._dynamo.config.capture_scalar_outputs = True
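An alternative to the module-level flag, sketched here under the assumption that torch._dynamo.config.patch is available in the targeted PyTorch release, is to scope the setting to the compiled call:

import torch

def _quantize(W, scale, zero):
    q = torch.clamp(torch.round(W / scale) + zero, 0, 15)
    return (q - zero) * scale

_compiled_quantize = torch.compile(_quantize)

def quantize_weight_scoped(W, scale, zero):
    # Apply the Dynamo flag only around the compiled call instead of globally at import time.
    with torch._dynamo.config.patch(capture_scalar_outputs=True):
        return _compiled_quantize(W, scale, zero)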
Benchmarking script I used:
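The script itself is attached to the PR rather than reproduced in the thread; purely as an illustration of this kind of measurement (hypothetical workload and names, not the original script), a minimal timing harness might look like:

import time
import torch

def bench(fn, *args, warmup=3, iters=10):
    # Warmup also triggers compilation for torch.compile'd functions.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def fake_quant(W, scale):
    return torch.round(W / scale).clamp_(-8, 7) * scale

W = torch.randn(4096, 4096, device="cuda")
scale = torch.full((4096, 1), 0.01, device="cuda")
print("eager   :", bench(fake_quant, W, scale))
print("compiled:", bench(torch.compile(fake_quant), W, scale))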
Hi @aladerran! Thank you for your contribution and thorough profiling data! It seems like the new runtime is about 86% of the original, a notable improvement! This change should be good to merge now, but there are a few other small modifications to the gptq_quantize method that have the potential to drastically improve runtime. Specifically, removing branching logic in the algorithm in order to reduce graph breaks. You can debug graph breaks with
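The comment is cut off at this point in the thread; a common way to surface graph breaks (an assumption about the intent, not a quote from the reviewer) is torch._dynamo.explain or the TORCH_LOGS environment variable:

import torch

def fn(x):
    # A data-dependent .item() is a typical source of a graph break under default settings.
    if x.sum().item() > 0:
        return x * 2
    return x - 1

# Prints captured graphs, graph break count, and the reasons/locations for each break.
print(torch._dynamo.explain(fn)(torch.randn(8)))

# Or from the shell: TORCH_LOGS="graph_breaks" python run_gptq.py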
Hi @kylesayrs, Thank you for the feedback! I'll look into further optimizing the runtime.
Signed-off-by: aladerran <[email protected]>
Signed-off-by: aladerran <[email protected]>
Hi @kylesayrs, I introduced quantize_weight_optimized in a new commit, which isolates the main GPTQ quantization loop into a function that can be accelerated with torch.compile. The core logic should remain functionally equivalent to the original implementation. Without torch.compile, this version already achieves ~70% of the original runtime. With torch.compile enabled, execution time drops further to ~10-20% of the original. I have updated my test script above, and some of the test results are shown here: gptq_baseline_profile.txt
However, there are a few considerations: given the compilation overhead, should we make torch.compile an optional feature? Any feedback on how best to expose this optimization would be great.
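A minimal sketch of the refactoring being described (hypothetical names and a heavily simplified update rule, not the actual quantize_weight_optimized): pull the column-wise loop into its own function and hand that function to torch.compile, leaving the surrounding bookkeeping eager.

import torch

def _quantize_loop(W: torch.Tensor, Hinv: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # Column-by-column GPTQ-style loop: quantize one column, then propagate
    # its quantization error into the remaining columns via Hinv.
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale) + zero, 0, 15)
        dq = (q - zero) * scale
        Q[:, i] = dq
        err = (w - dq) / Hinv[i, i]
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q

# Only the isolated loop is compiled; callers keep their existing logic.
quantize_loop_compiled = torch.compile(_quantize_loop)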
@aladerran Amazing work! Thank you for the contribution! I'll verify this asap so we can start quantizing faster ⚡💪
For internal testing: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/15985662662
These compile times are very long, even with gptq_log
@kylesayrs Thanks for the test. I will look into it.
Signed-off-by: aladerran <[email protected]>
@kylesayrs Could you please review this version? I think the compilation time should be reduced to tens of seconds now. I isolated the per-block quantization code from the main loop and applied torch.compile to it to get a faster kernel. This can improve speed and reduce compilation time, but it may sacrifice some memory usage. I can provide the detailed performance metrics later.
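For reference, a rough sketch of that shape of change (hypothetical helper, simplified math): the outer Python loop over blocks stays eager and only the per-block body is compiled, so the traced graph stays small and is reused for every block.

import torch

@torch.compile
def _quantize_block(Wb: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    # Per-block quantize/dequantize and the block-local error, compiled once and reused.
    q = torch.clamp(torch.round(Wb / scale) + zero, 0, 15)
    dq = (q - zero) * scale
    return dq, Wb - dq

def quantize_blocks(W: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    out = torch.empty_like(W)
    for start in range(0, W.shape[1], block_size):
        end = min(start + block_size, W.shape[1])
        out[:, start:end], _err = _quantize_block(W[:, start:end], scale, zero)
    return out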
FYI I would only recommend using torch.compile with
Hi all, any updates on this issue? I tested quantizing Qwen3-8B with GPTQModifier, and the outputs were as expected.
oneshot_baseline.log
Btw, I have also been using the AWQ feature recently. If there is any plan to accelerate quantization through torch.compile/parallelism, I would be interested in getting involved.
Hi @aladerran, regarding AWQ, we would love some help on improving it. Will sync with @kylesayrs on the GPTQ updates; we are under a crunch trying to wrap up a feature for transform-based compression (QuIP, SpinQuant), but it would be great to get this in soon.
torch._inductor.config.triton.tile_reductions = True
torch.set_float32_matmul_precision("high")
+1 on this -- you don't want to set these globally.
A blanket dynamic=True has the footgun of potentially long compile times. If we know anything about the dynamism of the model (e.g. there is a dynamic batch size and a sequence length), then we can apply torch._dynamo.mark_dynamic to specify exactly which dimensions are dynamic. This will also produce better code for torch.compile.
@brian-dellabetta Thanks for the update! I will take a look at the related PRs to see where I can help. @zou3519 Thanks very much for the suggestion! I will do more research on what you just mentioned.
See https://docs.pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html#abridged-public-api for more details around mark_dynamic.
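A small sketch of the mark_dynamic API being suggested (tensor shapes are illustrative):

import torch

x = torch.randn(4, 512, 4096)  # (batch, seq_len, hidden)

# Declare dims 0 and 1 as dynamic so the compiler specializes only on the
# hidden dimension instead of recompiling for every new batch/sequence length.
torch._dynamo.mark_dynamic(x, 0)
torch._dynamo.mark_dynamic(x, 1)

compiled = torch.compile(torch.nn.functional.gelu)
out = compiled(x)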
SUMMARY:
In response to #1496, this PR uses torch.compile to speed up the GPTQ quantization process in gptq_quantize.py, and adds simple benchmarking tools.
I tested on a single NVIDIA A100-SXM4-80GB, with:
PyTorch version: 2.7.0+cu126
CUDA version: 12.6
cuDNN version: 90501
gptq_baseline_profile.txt
gptq_tc_profile.txt
TEST PLAN:
First-time contributor here, please let me know if you have any tips!