Speed up nvfp4 pack/unpack w/ torch.compile #400

Open: fynnsu wants to merge 2 commits into main

Conversation

@fynnsu commented Jul 22, 2025

Applies torch.compile to the nvfp4 compressor, as suggested in vllm-project/llm-compressor#1485.

Speedups range anywhere from 3x to 25x depending on CPU/GPU.
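
For context, the change boils down to decorating the elementwise pack/unpack routines with torch.compile so PyTorch fuses them into a single kernel instead of launching many small ops. The sketch below illustrates the pattern only; the function names and exact packing layout are assumptions, not the actual compressed-tensors implementation.

import torch

@torch.compile(fullgraph=True)
def pack_uint4_pairs(codes: torch.Tensor) -> torch.Tensor:
    # codes: uint8 tensor holding 4-bit values (0-15); pack two values per byte.
    assert codes.shape[-1] % 2 == 0
    low = codes[..., 0::2]
    high = codes[..., 1::2]
    return (high << 4) | low

@torch.compile(fullgraph=True)
def unpack_uint4_pairs(packed: torch.Tensor) -> torch.Tensor:
    # Inverse of pack_uint4_pairs: recover the interleaved 4-bit values.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)

fullgraph=True asks the compiler to capture the whole function as one graph and error on graph breaks, rather than silently falling back to eager for parts of it.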

Benchmarks

Benchmark pack/unpack (new)
./benchmark_fp4_packing.sh 

Running benchmark on CPU...
Creating tensor of shape (8192, 8192) on cpu...
Benchmarking on tensor of shape torch.Size([8192, 8192]) (67,108,864 elements) on cpu
Iter 1/3 - Pack: 1.8949s
Iter 2/3 - Pack: 0.0454s
Iter 3/3 - Pack: 0.0464s
Iter 1/3 - Unpack: 0.0784s
Iter 2/3 - Unpack: 0.0374s
Iter 3/3 - Unpack: 0.0377s

Benchmark Results:
Device: cpu
Tensor shape: 8192x8192 (67,108,864 elements)
Average pack time: 0.6622s (101.34M elements/s)
Average unpack time: 0.0512s (1311.30M elements/s)
Compression ratio: 8.00x
--------------------------------
Running benchmark on GPU...
Creating tensor of shape (8192, 8192) on cuda...
Benchmarking on tensor of shape torch.Size([8192, 8192]) (67,108,864 elements) on cuda:0
Iter 1/3 - Pack: 0.4240s
Iter 2/3 - Pack: 0.0027s
Iter 3/3 - Pack: 0.0027s
Iter 1/3 - Unpack: 0.0459s
Iter 2/3 - Unpack: 0.0005s
Iter 3/3 - Unpack: 0.0005s

Benchmark Results:
Device: cuda:0
Tensor shape: 8192x8192 (67,108,864 elements)
Average pack time: 0.1431s (468.85M elements/s)
Average unpack time: 0.0156s (4299.47M elements/s)
Compression ratio: 8.00x

Benchmark pack/unpack (old)
./benchmark_fp4_packing.sh 

Running benchmark on CPU...
Creating tensor of shape (8192, 8192) on cpu...
Benchmarking on tensor of shape torch.Size([8192, 8192]) (67,108,864 elements) on cpu
Iter 1/3 - Pack: 1.1415s
Iter 2/3 - Pack: 1.0510s
Iter 3/3 - Pack: 0.9702s
Iter 1/3 - Unpack: 0.1212s
Iter 2/3 - Unpack: 0.0985s
Iter 3/3 - Unpack: 0.0991s

Benchmark Results:
Device: cpu
Tensor shape: 8192x8192 (67,108,864 elements)
Average pack time: 1.0542s (63.66M elements/s)
Average unpack time: 0.1062s (631.62M elements/s)
Compression ratio: 8.00x
--------------------------------
Running benchmark on GPU...
Creating tensor of shape (8192, 8192) on cuda...
Benchmarking on tensor of shape torch.Size([8192, 8192]) (67,108,864 elements) on cuda:0
Iter 1/3 - Pack: 0.1073s
Iter 2/3 - Pack: 0.0649s
Iter 3/3 - Pack: 0.0650s
Iter 1/3 - Unpack: 0.0081s
Iter 2/3 - Unpack: 0.0050s
Iter 3/3 - Unpack: 0.0050s

Benchmark Results:
Device: cuda:0
Tensor shape: 8192x8192 (67,108,864 elements)
Average pack time: 0.0791s (848.65M elements/s)
Average unpack time: 0.0060s (11118.17M elements/s)
Compression ratio: 8.00x
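
One thing to keep in mind when reading the per-iteration numbers: in the compiled (new) version the first iteration absorbs torch.compile's one-time compilation cost (e.g. 1.89s vs ~0.046s on CPU), so the steady-state iterations are the fair comparison, and the reported averages understate the steady-state gain. A minimal sketch of this kind of timing loop (the benchmark script itself isn't reproduced here, so names and structure are assumptions):

import time
import torch

def time_fn(fn, x: torch.Tensor, iters: int = 3) -> None:
    # Time fn(x) a few times; with torch.compile the first call also pays compilation cost.
    for i in range(iters):
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        fn(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        print(f"Iter {i + 1}/{iters}: {time.perf_counter() - start:.4f}s")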

This also translates to real usage improvements:

time python examples/quantization_w4a16_fp4/llama3_example.py

New

...
(Model preparation and test generation output excluded)
...
Compressing model: 423it [00:20, 20.30it/s]

real    2m37.888s
user    10m28.019s
sys     1m19.335s

Old

...
(Model preparation and test generation output excluded)
...
Compressing model: 423it [01:27,  4.83it/s]

real    3m59.430s
user    22m39.828s
sys     6m37.413s

@brian-dellabetta (Contributor) left a comment

Thanks for the contribution! results look good

@dsikka (Collaborator) left a comment

please test compressed model in vllm

@fynnsu (Author) commented Jul 24, 2025

please test compressed model in vllm

@dsikka Is there a particular test I should run?

So far I've tested loading a compressed model into vllm and generating text and the output seems normal. I've also done small tests to confirm that the values saved exactly match the output from the previous version.
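
A minimal version of that vLLM smoke test (the checkpoint path is a placeholder; this is a sketch, not the exact script used) might look like:

from vllm import LLM, SamplingParams

# Load the compressed checkpoint in vLLM and generate a short completion.
# "path/to/compressed-model" is a placeholder for the saved output directory.
llm = LLM(model="path/to/compressed-model")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)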

@fynnsu requested a review from dsikka on July 25, 2025 at 13:46
@kylesayrs (Contributor) commented

@fynnsu You should do at least a quick accuracy evaluation using lm_eval, comparing an old model to a newly packed model.

import lm_eval
from lm_eval.utils import make_table

def main():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args={
            "pretrained": "MODEL_ID",
            "add_bos_token": True,
            "dtype": "auto",
        },
        tasks="arc_challenge_llama",
        batch_size=128,
        apply_chat_template=True,
        fewshot_as_multiturn=True,
    )

    print(make_table(results))


if __name__ == "__main__":
    main()
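
Since the request above was specifically to test the compressed model in vLLM, the same evaluation can also be pointed at lm_eval's vLLM backend; a sketch of the changed call (assumes the vllm extra is installed, e.g. pip install lm_eval[vllm]):

import lm_eval
from lm_eval.utils import make_table

# Same evaluation through the vLLM backend instead of HF transformers.
# "MODEL_ID" remains a placeholder for the compressed checkpoint path.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={"pretrained": "MODEL_ID", "dtype": "auto"},
    tasks="arc_challenge_llama",
    batch_size=128,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(make_table(results))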

@kylesayrs (Contributor) commented Jul 29, 2025

Alternatively you can just check that the safetensor outputs are exactly the same, which is probably faster

import sys
import torch
from safetensors.torch import load_file

def compare_safetensors(file1, file2):
    data1 = load_file(file1)
    data2 = load_file(file2)

    keys1 = set(data1.keys())
    keys2 = set(data2.keys())

    all_keys = sorted(keys1.union(keys2))
    differences = []

    for key in all_keys:
        if key not in data1:
            print(f"{key} missing in {file1}")
            differences.append(key)
        elif key not in data2:
            print(f"{key} missing in {file2}")
            differences.append(key)
        else:
            tensor1 = data1[key]
            tensor2 = data2[key]
            if tensor1.shape != tensor2.shape or not torch.allclose(tensor1, tensor2, rtol=1e-5, atol=1e-8):
                print(f"Difference found in key: {key}: {torch.count_nonzero(abs(tensor1) < abs(tensor2))}")
                differences.append(key)
            else:
                print(f"{key}: match")

    return differences

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python compare_safetensors.py <file1.safetensors> <file2.safetensors>")
        sys.exit(1)

    file1, file2 = sys.argv[1], sys.argv[2]
    diff_keys = compare_safetensors(file1, file2)

    if not diff_keys:
        print("All keys match exactly.")
    else:
        print(f"{len(diff_keys)} differing keys found.")

@@ -105,6 +105,7 @@ def decompress_weight(
return decompressed_weight


@torch.compile(fullgraph=True)
@mgoin (Member) left a comment

You may want to default to dynamic=True to avoid recompilation
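
Concretely, the suggestion amounts to something like the following (the function name is a placeholder, not the actual one in the diff); dynamic=True compiles with symbolic shapes, so layers with differently shaped weights reuse one compiled graph instead of each triggering a recompile:

import torch

@torch.compile(fullgraph=True, dynamic=True)
def pack_uint4_pairs(codes: torch.Tensor) -> torch.Tensor:
    # Same packing sketch as above; dynamic=True avoids per-shape recompilation.
    return (codes[..., 1::2] << 4) | codes[..., 0::2]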

@fynnsu (Author) replied

Added

@fynnsu (Author) commented Jul 30, 2025

Alternatively you can just check that the safetensor outputs are exactly the same, which is probably faster

I compared the outputs from the new version with the old version using both your script and diff -r on saved compressed directories. The outputs are exactly the same (including the safetensors files).

Note: I also ran these tests after the most recent change adding dynamic=True to the torch.compile calls as suggested by @mgoin.

@rahul-tuli (Member) left a comment

Great addition!
