[Performance] Parallelize modifier compression #1558

Draft: kylesayrs wants to merge 41 commits into main

Conversation

@kylesayrs (Collaborator) commented Jun 16, 2025

#1382

A promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:

  1. Implement the option to dispatch a sequential target across N GPUs (where N is the number available). This dispatch would occur before calibration.
  2. Implement async GPTQ quantization (each quantization step kicks off an async thread which operates on the same device as the module and its Hessian); a rough sketch follows below.

Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and, as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized.
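As a rough illustration of step 2, the dispatch could look something like the minimal sketch below. `quantize_weight` stands in for the per-module GPTQ kernel; the names and signatures are illustrative, not the repo's actual API.

```python
# Hypothetical sketch of step 2: async per-device GPTQ quantization.
# `quantize_weight` stands in for the per-module GPTQ kernel and is assumed
# to run on whatever device currently holds the module and its Hessian.
from concurrent.futures import ThreadPoolExecutor

def quantize_layer_async(named_modules, hessians, quantize_weight, num_gpus):
    # one worker per GPU: CUDA kernels release the GIL while they run,
    # so quantization jobs on different devices genuinely overlap
    with ThreadPoolExecutor(max_workers=num_gpus) as executor:
        futures = {
            name: executor.submit(quantize_weight, module, hessians[name])
            for name, module in named_modules.items()
        }
        # gather per-module results once all jobs finish
        return {name: future.result() for name, future in futures.items()}
```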

This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.

Notes:

It seems like parallel compression of layers is currently slower (33s vs 18s). I suspect this is because the GPTQ algorithm is very instruction-intensive and has lots of branching. This change may have to be preceded by a change to the `quantize_weight` function to make it more amenable to `torch.compile`.
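For reference, the preparatory change hinted at here might start out as simple as wrapping the kernel, as in the hedged sketch below; the stub and the flags shown are just one plausible configuration, not the PR's actual code.

```python
import torch

def quantize_weight(weight: torch.Tensor, hessian: torch.Tensor) -> torch.Tensor:
    ...  # stand-in for the existing GPTQ quantize_weight kernel

# Hypothetical: compile the kernel so its Python-level branching is traced
# into an optimized graph instead of being re-interpreted for every module.
# dynamic=True avoids recompiling for each distinct weight shape;
# fullgraph=False tolerates graph breaks at data-dependent branches.
quantize_weight_compiled = torch.compile(
    quantize_weight, fullgraph=False, dynamic=True
)
```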

kylesayrs and others added 30 commits June 5, 2025 13:57
kylesayrs added 11 commits June 16, 2025 00:01

@gemini-code-assist (Gemini Code Assist bot) left a comment:

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces parallel processing to the OBCQ (Optimal Brain Compression Quantization) algorithm, specifically targeting the module quantization step within the GPTQ modifier. The primary goal is to improve performance by allowing multiple layers to be quantized concurrently using a thread pool. While the PR description notes that initial tests show a performance regression, this change lays the groundwork for future optimizations and ensures thread-safe execution of underlying PyTorch linear algebra operations.

Highlights

  • Parallelization of Module Quantization: The compress_modules method in GPTQModifier has been refactored to leverage concurrent.futures.ThreadPoolExecutor for parallelizing the quantization of individual modules. This aims to speed up the overall compression process by distributing the workload across multiple threads.
  • Extraction of Module Compression Logic: The core logic for quantizing a single module has been extracted into a new private helper method, _compress_module. This refactoring makes the code more modular and enables its concurrent execution within the thread pool.
  • Thread-Safe Linear Algebra Initialization: A new function, initialize_linalg, has been introduced and integrated to pre-load torch.linalg operations on specific devices. This addresses a known issue where lazy loading of torch.linalg within multiple threads could lead to assertion errors, ensuring stability in the parallelized environment.
  • Logging and Performance Tracking: The CompressionLogger context manager has been removed from the per-module loop, and overall timing for the parallel compression process is now explicitly measured and logged, providing a clearer picture of the total execution time.
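
Pieced together from these highlights, the refactored flow plausibly looks like the sketch below. This is a reconstruction under stated assumptions, not the PR's actual code: `compress_one` stands in for `_compress_module`, and the warm-up op inside `initialize_linalg` is a guess at how the lazy-loading issue is avoided.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import torch

def initialize_linalg(device: torch.device) -> None:
    # torch.linalg is lazily initialized; the first call racing across
    # several threads can trip an internal assertion, so warm it up on the
    # main thread with one tiny op per device (assumed mechanism)
    torch.linalg.inv(torch.eye(2, device=device))

def compress_modules(modules, compress_one, max_workers=8):
    # warm up linalg on every device the modules live on
    for module in modules:
        initialize_linalg(module.weight.device)

    start = time.time()
    # quantize modules concurrently instead of in a sequential loop
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(compress_one, modules))
    # a single overall timing log replaces the per-module CompressionLogger
    print(f"compressed {len(modules)} modules in {time.time() - start:.1f}s")
```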

@gemini-code-assist (Gemini Code Assist bot) left a comment:

Code Review

This pull request aims to improve performance by parallelizing the OBCQ compression process using ThreadPoolExecutor. It introduces a helper function initialize_linalg to pre-load torch.linalg and prevent potential lazy loading issues in threaded environments.

While the intent is performance improvement, the PR description notes that the parallelized version is currently slower. My main feedback points revolve around this performance regression, the choice of ThreadPoolExecutor for potentially CPU-bound tasks, error handling in the parallel execution, and a minor style point. Addressing the performance issue is key, and further investigation into why the threaded version is slower, possibly exploring ProcessPoolExecutor or optimizing the quantize_weight function itself (as you suggested regarding torch.compile), will be important.
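
On the error-handling point specifically: with a thread pool, worker exceptions are deferred until `Future.result()` is called, so they are easy to lose if results are never collected. A minimal pattern, assuming the same hypothetical `compress_fn`/`named_modules` names as in the sketches above, would be:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def compress_all(named_modules, compress_fn, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(compress_fn, module): name
            for name, module in named_modules.items()
        }
        for future in as_completed(futures):
            # result() re-raises any exception from the worker thread,
            # so failures surface instead of being silently dropped
            try:
                future.result()
            except Exception:
                print(f"compression failed for module {futures[future]!r}")
                raise
```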


Base automatically changed from kylesayrs/sequential-onloading to main June 17, 2025 20:45
@kylesayrs kylesayrs mentioned this pull request Jun 18, 2025
@kylesayrs kylesayrs changed the title [Performance] Parallelize OBCQ compression [Performance] Parallelize modifier compression Jun 24, 2025