[Performance] Parallelize modifier compression #1558
base: main
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Summary of Changes
Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces parallel processing to the OBCQ (Optimal Brain Compression Quantization) algorithm, specifically targeting the module quantization step within the GPTQ modifier. The primary goal is to improve performance by allowing multiple layers to be quantized concurrently using a thread pool. While the PR description notes that initial tests show a performance regression, this change lays the groundwork for future optimizations and ensures thread-safe execution of underlying PyTorch linear algebra operations.
Highlights
- **Parallelization of Module Quantization:** The `compress_modules` method in `GPTQModifier` has been refactored to leverage `concurrent.futures.ThreadPoolExecutor` for parallelizing the quantization of individual modules. This aims to speed up the overall compression process by distributing the workload across multiple threads (a minimal sketch of this pattern appears after this list).
- **Extraction of Module Compression Logic:** The core logic for quantizing a single module has been extracted into a new private helper method, `_compress_module`. This refactoring makes the code more modular and enables its concurrent execution within the thread pool.
- **Thread-Safe Linear Algebra Initialization:** A new function, `initialize_linalg`, has been introduced and integrated to pre-load `torch.linalg` operations on specific devices. This addresses a known issue where lazy loading of `torch.linalg` within multiple threads could lead to assertion errors, ensuring stability in the parallelized environment.
- **Logging and Performance Tracking:** The `CompressionLogger` context manager has been removed from the per-module loop, and overall timing for the parallel compression process is now explicitly measured and logged, providing a clearer picture of the total execution time.
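The sketch below illustrates the pattern described above. Only `_compress_module` (here passed in as `compress_fn`) and `initialize_linalg` are names taken from the PR; the surrounding scaffolding, the attribute access on `module.weight`, and the exact `torch.linalg` warm-up call are assumptions, not the actual diff.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from loguru import logger


def initialize_linalg(device: torch.device) -> None:
    # torch.linalg backends load lazily on first use; triggering that load
    # from several worker threads at once can raise internal assertion errors,
    # so a trivial call up front forces initialization eagerly (assumed form).
    torch.linalg.inv(torch.eye(2, device=device))


def compress_modules(modules: dict, compress_fn) -> None:
    # `modules` maps names to nn.Modules with a `.weight`; `compress_fn`
    # stands in for the extracted `_compress_module` helper described above.
    for device in {module.weight.device for module in modules.values()}:
        initialize_linalg(device)

    start = time.time()
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(compress_fn, name, module)
            for name, module in modules.items()
        ]
        for future in futures:
            future.result()  # propagate any worker exception
    logger.info(f"Compressed {len(modules)} modules in {time.time() - start:.1f}s")
```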
Code Review
This pull request aims to improve performance by parallelizing the OBCQ compression process using `ThreadPoolExecutor`. It introduces a helper function `initialize_linalg` to pre-load `torch.linalg` and prevent potential lazy loading issues in threaded environments.
While the intent is performance improvement, the PR description notes that the parallelized version is currently slower. My main feedback points revolve around this performance regression, the choice of `ThreadPoolExecutor` for potentially CPU-bound tasks, error handling in the parallel execution, and a minor style point. Addressing the performance issue is key, and further investigation into why the threaded version is slower, possibly exploring `ProcessPoolExecutor` or optimizing the `quantize_weight` function itself (as you suggested regarding `torch.compile`), will be important.
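On the error-handling point, one hedged option (not taken from the PR) is to collect the futures and surface worker exceptions explicitly, so that a single failed module is reported with context rather than silently dropped. The helper names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compress_all(modules: dict, compress_fn) -> None:
    with ThreadPoolExecutor() as executor:
        futures = {
            executor.submit(compress_fn, name, module): name
            for name, module in modules.items()
        }
        # Exceptions raised inside worker threads only surface when .result()
        # is called, so iterate completed futures and re-raise with context.
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
            except Exception as exc:
                raise RuntimeError(f"Compression failed for module {name!r}") from exc
```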
#1382
A promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:
Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and, as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized.
This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
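A minimal sketch of that idea, under the stated assumption that one layer plus its Hessians fits across the available GPUs: each module within the sequential target is assigned a device, moved there along with its Hessian, and quantized concurrently. `quantize_weight` refers to the existing per-module routine; its import path is not shown, and all other names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def quantize_on(module, hessian, device, quantize_weight):
    # Move one module's weight and Hessian to its assigned GPU, then run the
    # existing per-module routine there. CUDA kernels release the GIL, so
    # modules assigned to different devices genuinely run concurrently.
    module.to(device)
    return quantize_weight(module, hessian.to(device))


def quantize_layer(modules: dict, hessians: dict, quantize_weight) -> None:
    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(quantize_on, module, hessians[name], devices[i % len(devices)], quantize_weight)
            for i, (name, module) in enumerate(modules.items())
        ]
        for future in futures:
            future.result()
```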
Notes:
It seems like parallel compression of layers is slower (33s vs 18s). I suspect this is because the GPTQ algorithm is very instruction intensive and has lots of branching. This change may have to be preceded by a change to the `quantize_weight` function to make it more amenable to `torch.compile`.
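As a hedged illustration of that last note: once `quantize_weight` avoids data-dependent Python branching, it could be compiled once before the thread pool is spun up, so each worker calls the compiled kernel instead of the interpreter-heavy original. This is a guess at the shape of that change, not code from the PR.

```python
import torch

# Compile once at modifier setup; dynamic=True asks the compiler to tolerate
# varying weight shapes across layers instead of recompiling per shape.
# Assumes quantize_weight has been rewritten so its control flow is traceable.
compiled_quantize_weight = torch.compile(quantize_weight, dynamic=True)
```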