Use torch.compile to speed up GPTQ algo #1561
base: main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @aladerran, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to enhance the performance of the GPTQ quantization process by integrating torch.compile. The changes primarily focus on optimizing the quantize_weight function, which is a critical component of the GPTQ algorithm, to achieve faster execution times.
Highlights
- Performance Optimization: The core change involves applying torch.compile to the quantize_weight function within the GPTQ quantization algorithm to significantly speed up its execution.
- TorchDynamo Configuration: Necessary torch._dynamo.config imports and settings, specifically torch._dynamo.config.capture_scalar_outputs = True, have been added to ensure proper compilation and avoid potential issues with scalar outputs.
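As a rough sketch of the pattern these highlights describe (simplified shapes and a hypothetical signature, not the actual llm-compressor quantize_weight), the decorated function could look like:

import torch

# Allow Dynamo to trace scalar reads (e.g. .item()) instead of falling back to eager.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def quantize_weight(W: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # Stand-in for the GPTQ weight update: round-to-nearest quantize/dequantize.
    q = torch.clamp(torch.round(W / scale) + zero, 0, 15)
    return (q - zero) * scale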
Code Review
This pull request successfully integrates torch.compile into the GPTQ quantization process, which the provided profiling results indicate leads to significant speedups. The changes involve adding the necessary import, applying the @torch.compile decorator, and setting a torch._dynamo.config flag. The core change of applying the decorator is well-placed. The only feedback is a minor suggestion regarding the global configuration setting for torch.compile to improve code clarity and maintainability.
@@ -16,6 +17,8 @@
from llmcompressor.observers.base import Observer
from llmcompressor.pytorch.utils.helpers import tensor_sparsity

torch._dynamo.config.capture_scalar_outputs = True
Setting torch._dynamo.config.capture_scalar_outputs = True at the module level applies this configuration globally to any code that imports this module. While this might be necessary for torch.compile to function correctly with the quantize_weight function, it's a broad setting that could potentially affect other parts of the codebase in unexpected ways. Consider adding a brief comment explaining why this setting is needed specifically for this module/function and acknowledging its global scope.
Suggested change:
# Enable scalar capture for torch.compile, potentially needed for control flow
torch._dynamo.config.capture_scalar_outputs = True
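An alternative to the module-level flag, sketched here under the assumption that torch._dynamo.config.patch is available in the targeted PyTorch release, is to scope the setting to the compiled call:

import torch

def _quantize(W, scale, zero):
    q = torch.clamp(torch.round(W / scale) + zero, 0, 15)
    return (q - zero) * scale

_compiled_quantize = torch.compile(_quantize)

def quantize_weight_scoped(W, scale, zero):
    # Apply the Dynamo flag only around the compiled call instead of globally at import time.
    with torch._dynamo.config.patch(capture_scalar_outputs=True):
        return _compiled_quantize(W, scale, zero)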
Benchmarking script I used:
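The script itself is attached to the PR rather than reproduced in the thread; purely as an illustration of this kind of measurement (hypothetical workload and names, not the original script), a minimal timing harness might look like:

import time
import torch

def bench(fn, *args, warmup=3, iters=10):
    # Warmup also triggers compilation for torch.compile'd functions.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def fake_quant(W, scale):
    return torch.round(W / scale).clamp_(-8, 7) * scale

W = torch.randn(4096, 4096, device="cuda")
scale = torch.full((4096, 1), 0.01, device="cuda")
print("eager   :", bench(fake_quant, W, scale))
print("compiled:", bench(torch.compile(fake_quant), W, scale))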
Hi @aladerran! Thank you for your contribution and thorough profiling data! It seems like the new runtime is about 86% of the original, a notable improvement! This change should be good to merge now, but there are a few other small modifications to the gptq_quantize method that have the potential to drastically improve runtime. Specifically, removing branching logic in the algorithm in order to reduce graph breaks. You can debug graph breaks with
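The comment is cut off at this point in the thread; a common way to surface graph breaks (an assumption about the intent, not a quote from the reviewer) is torch._dynamo.explain or the TORCH_LOGS environment variable:

import torch

def fn(x):
    # A data-dependent .item() is a typical source of a graph break under default settings.
    if x.sum().item() > 0:
        return x * 2
    return x - 1

# Prints captured graphs, graph break count, and the reasons/locations for each break.
print(torch._dynamo.explain(fn)(torch.randn(8)))

# Or from the shell: TORCH_LOGS="graph_breaks" python run_gptq.py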
Hi @kylesayrs, Thank you for the feedback! I'll look into further optimizing the runtime.
Signed-off-by: aladerran <[email protected]>
Signed-off-by: aladerran <[email protected]>
Hi @kylesayrs, I introduced quantize_weight_optimized in a new commit, which isolates the main GPTQ quantization loop into a function that can be accelerated with torch.compile. The core logic should remain functionally equivalent to the original implementation. Without torch.compile, this version already achieves ~70% of the original runtime. With torch.compile enabled, execution time drops further to ~10-20% of the original. I have updated my test script above, and some of the test results are shown here: gptq_baseline_profile.txt
However, there are a few considerations: given the compilation overhead, should we make torch.compile an optional feature? Any feedback on how best to expose this optimization would be great.
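A minimal sketch of the refactoring being described (hypothetical names and a heavily simplified update rule, not the actual quantize_weight_optimized): pull the column-wise loop into its own function and hand that function to torch.compile, leaving the surrounding bookkeeping eager.

import torch

def _quantize_loop(W: torch.Tensor, Hinv: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    # Column-by-column GPTQ-style loop: quantize one column, then propagate
    # its quantization error into the remaining columns via Hinv.
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale) + zero, 0, 15)
        dq = (q - zero) * scale
        Q[:, i] = dq
        err = (w - dq) / Hinv[i, i]
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q

# Only the isolated loop is compiled; callers keep their existing logic.
quantize_loop_compiled = torch.compile(_quantize_loop)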
@aladerran Amazing work! Thank you for the contribution! I'll verify this asap so we can start quantizing faster ⚡💪
For internal testing: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/15985662662
These compile times are very long, even with gptq_log
@kylesayrs Thanks for the test. I will look into it.
Signed-off-by: aladerran <[email protected]>
@kylesayrs Could you please review this version? I think the compilation time should be reduced to tens of seconds now. I isolated the per-block quantization code from the main loop and applied torch.compile to it to get a faster kernel. This can improve speed and reduce compilation time, but it may sacrifice some memory usage. I can provide the detailed performance metrics later.
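For reference, a rough sketch of that shape of change (hypothetical helper, simplified math): the outer Python loop over blocks stays eager and only the per-block body is compiled, so the traced graph stays small and is reused for every block.

import torch

@torch.compile
def _quantize_block(Wb: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    # Per-block quantize/dequantize and the block-local error, compiled once and reused.
    q = torch.clamp(torch.round(Wb / scale) + zero, 0, 15)
    dq = (q - zero) * scale
    return dq, Wb - dq

def quantize_blocks(W: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    out = torch.empty_like(W)
    for start in range(0, W.shape[1], block_size):
        end = min(start + block_size, W.shape[1])
        out[:, start:end], _err = _quantize_block(W[:, start:end], scale, zero)
    return out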
FYI I would only recommend using torch.compile with
Hi all, any updates on this issue? I tested quantizing Qwen3-8B with GPTQModifier, and the outputs were as expected.
oneshot_baseline.log
Btw, I have also been using the AWQ feature recently. If there is any plan to accelerate quantization through torch.compile/parallelism, I would be interested in getting involved.
Hi @aladerran, regarding AWQ, we would love some help on improving it. Will sync with @kylesayrs on the GPTQ updates; we are under a crunch trying to wrap up a feature for transform-based compression (QuIP, SpinQuant), but it would be great to get this in soon.
torch._inductor.config.triton.tile_reductions = True
torch.set_float32_matmul_precision("high")
+1 on this -- you don't want to set these globally.
A blanket dynamic=True has the footgun of potentially long compile times. If we know anything about the dynamism of the model (e.g. there is a dynamic batch size and a sequence length), then we can apply torch._dynamo.mark_dynamic to specify exactly which dimensions are dynamic. This will also produce better code for torch.compile.
@brian-dellabetta Thanks for the update! I will take a look at the related PRs to see where I can help. @zou3519 Thanks very much for the suggestion! I will do more research on what you just mentioned.
See https://docs.pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html#abridged-public-api for more details around mark_dynamic.
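A small sketch of the mark_dynamic API being suggested (tensor shapes are illustrative):

import torch

x = torch.randn(4, 512, 4096)  # (batch, seq_len, hidden)

# Declare dims 0 and 1 as dynamic so the compiler specializes only on the
# hidden dimension instead of recompiling for every new batch/sequence length.
torch._dynamo.mark_dynamic(x, 0)
torch._dynamo.mark_dynamic(x, 1)

compiled = torch.compile(torch.nn.functional.gelu)
out = compiled(x)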
SUMMARY:
In response to #1496, this PR uses torch.compile to speed up the GPTQ quantization process in gptq_quantize.py, and adds simple benchmarking tools.
I tested on a single NVIDIA A100-SXM4-80GB, with:
PyTorch version: 2.7.0+cu126
CUDA version: 12.6
cuDNN version: 90501
gptq_baseline_profile.txt
gptq_tc_profile.txt
TEST PLAN:
First-time contributor here, please let me know if you have any tips!