[Performance] Parallelize modifier compression #1558
base: main
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Summary of Changes
Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces parallel processing to the OBCQ (Optimal Brain Compression Quantization) algorithm, specifically targeting the module quantization step within the GPTQ modifier. The primary goal is to improve performance by allowing multiple layers to be quantized concurrently using a thread pool. While the PR description notes that initial tests show a performance regression, this change lays the groundwork for future optimizations and ensures thread-safe execution of underlying PyTorch linear algebra operations.
Highlights
- **Parallelization of Module Quantization:** The `compress_modules` method in `GPTQModifier` has been refactored to leverage `concurrent.futures.ThreadPoolExecutor` for parallelizing the quantization of individual modules. This aims to speed up the overall compression process by distributing the workload across multiple threads (a minimal sketch of this pattern appears after this list).
- **Extraction of Module Compression Logic:** The core logic for quantizing a single module has been extracted into a new private helper method, `_compress_module`. This refactoring makes the code more modular and enables its concurrent execution within the thread pool.
- **Thread-Safe Linear Algebra Initialization:** A new function, `initialize_linalg`, has been introduced and integrated to pre-load `torch.linalg` operations on specific devices. This addresses a known issue where lazy loading of `torch.linalg` within multiple threads could lead to assertion errors, ensuring stability in the parallelized environment.
- **Logging and Performance Tracking:** The `CompressionLogger` context manager has been removed from the per-module loop, and overall timing for the parallel compression process is now explicitly measured and logged, providing a clearer picture of the total execution time.
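The sketch below illustrates the pattern described above. Only `_compress_module` (here passed in as `compress_fn`) and `initialize_linalg` are names taken from the PR; the surrounding scaffolding, the attribute access on `module.weight`, and the exact `torch.linalg` warm-up call are assumptions, not the actual diff.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from loguru import logger


def initialize_linalg(device: torch.device) -> None:
    # torch.linalg backends load lazily on first use; triggering that load
    # from several worker threads at once can raise internal assertion errors,
    # so a trivial call up front forces initialization eagerly (assumed form).
    torch.linalg.inv(torch.eye(2, device=device))


def compress_modules(modules: dict, compress_fn) -> None:
    # `modules` maps names to nn.Modules with a `.weight`; `compress_fn`
    # stands in for the extracted `_compress_module` helper described above.
    for device in {module.weight.device for module in modules.values()}:
        initialize_linalg(device)

    start = time.time()
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(compress_fn, name, module)
            for name, module in modules.items()
        ]
        for future in futures:
            future.result()  # propagate any worker exception
    logger.info(f"Compressed {len(modules)} modules in {time.time() - start:.1f}s")
```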
Code Review
This pull request aims to improve performance by parallelizing the OBCQ compression process using `ThreadPoolExecutor`. It introduces a helper function `initialize_linalg` to pre-load `torch.linalg` and prevent potential lazy loading issues in threaded environments.
While the intent is performance improvement, the PR description notes that the parallelized version is currently slower. My main feedback points revolve around this performance regression, the choice of `ThreadPoolExecutor` for potentially CPU-bound tasks, error handling in the parallel execution, and a minor style point. Addressing the performance issue is key, and further investigation into why the threaded version is slower, possibly exploring `ProcessPoolExecutor` or optimizing the `quantize_weight` function itself (as you suggested regarding `torch.compile`), will be important.
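On the error-handling point, one hedged option (not taken from the PR) is to collect the futures and surface worker exceptions explicitly, so that a single failed module is reported with context rather than silently dropped. The helper names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compress_all(modules: dict, compress_fn) -> None:
    with ThreadPoolExecutor() as executor:
        futures = {
            executor.submit(compress_fn, name, module): name
            for name, module in modules.items()
        }
        # Exceptions raised inside worker threads only surface when .result()
        # is called, so iterate completed futures and re-raise with context.
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
            except Exception as exc:
                raise RuntimeError(f"Compression failed for module {name!r}") from exc
```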
#1382
A promising approach to reduce runtime, which I scoped out with @anmarques, would be to implement the following:
Implementing these two features would allow a user to specify a sequential target (such as a decoder layer), and, as long as one layer plus its Hessians fits across their N GPUs, all quantization operations would be fully parallelized.
This would enable maximal parallelization (excluding parallelizing calibration, which is more onerous and less beneficial than parallelizing quantization). In theory, you could quantize DeepSeek-V3 in 20 minutes across 4 A100s.
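A minimal sketch of that idea, under the stated assumption that one layer plus its Hessians fits across the available GPUs: each module within the sequential target is assigned a device, moved there along with its Hessian, and quantized concurrently. `quantize_weight` refers to the existing per-module routine; its import path is not shown, and all other names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def quantize_on(module, hessian, device, quantize_weight):
    # Move one module's weight and Hessian to its assigned GPU, then run the
    # existing per-module routine there. CUDA kernels release the GIL, so
    # modules assigned to different devices genuinely run concurrently.
    module.to(device)
    return quantize_weight(module, hessian.to(device))


def quantize_layer(modules: dict, hessians: dict, quantize_weight) -> None:
    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(quantize_on, module, hessians[name], devices[i % len(devices)], quantize_weight)
            for i, (name, module) in enumerate(modules.items())
        ]
        for future in futures:
            future.result()
```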
Notes:
It seems like parallel compression of layers is slower (33s vs 18s). I suspect this is because the GPTQ algorithm is very instruction intensive and has lots of branching. This change may have to be preceded by a change to the `quantize_weight` function to make it more amenable to `torch.compile`.
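As a hedged illustration of that last note: once `quantize_weight` avoids data-dependent Python branching, it could be compiled once before the thread pool is spun up, so each worker calls the compiled kernel instead of the interpreter-heavy original. This is a guess at the shape of that change, not code from the PR.

```python
import torch

# Compile once at modifier setup; dynamic=True asks the compiler to tolerate
# varying weight shapes across layers instead of recompiling per shape.
# Assumes quantize_weight has been rewritten so its control flow is traceable.
compiled_quantize_weight = torch.compile(quantize_weight, dynamic=True)
```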