
[Transform] Serialize with tied weights #370


Merged: 96 commits into main from kylesayrs/transform_save on Aug 7, 2025

Conversation

@kylesayrs (Contributor) commented Jun 28, 2025

Purpose

  • Support saving models with transforms attached

Prerequisites

Semi-Prerequisites

These changes are required to support saving models with offloaded Transforms; models without offloading do not require them.

Changes

  • Implement _update_tied_weights, which:
    • Updates the _dynamic_tied_weights_keys attribute of the transform modules. transformers reads this attribute during saving and uses it to drop duplicate copies of the tied weights before writing the checkpoint.
    • Points the shared weights at identical meta tensors so that transformers can recognize and deduplicate them (a minimal sketch follows this list).
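
As an illustration only (not the code added in this PR), the sketch below shows the general idea: two hypothetical transform modules hold the same weight tensor, and _dynamic_tied_weights_keys marks that weight so transformers deduplicates it at save time. TransformModule is a stand-in name, not a compressed-tensors class.

```python
# Hypothetical sketch of tying transform weights for serialization.
# TransformModule is an illustrative stand-in, not the class from this PR.
import torch
from torch import nn


class TransformModule(nn.Module):
    def __init__(self, weight: nn.Parameter):
        super().__init__()
        self.weight = weight
        # transformers reads this attribute during save_pretrained and drops
        # duplicate copies of the listed keys from the serialized checkpoint
        self._dynamic_tied_weights_keys = ["weight"]


# two transforms sharing one weight: both modules hold the *same* Parameter,
# so the tie can be recognized and the tensor serialized only once
shared = nn.Parameter(torch.eye(4), requires_grad=False)
forward_transform = TransformModule(shared)
inverse_transform = TransformModule(shared)
assert forward_transform.weight is inverse_transform.weight
```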

Testing

  • Add serialization tests (a minimal round-trip sketch is shown below)
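
For orientation only, here is a hedged sketch of the kind of round-trip check such a test can perform; it is not the test code from this PR, and it assumes the model saves to the default single-shard model.safetensors file.

```python
# Hedged sketch of a serialization round-trip check (not the tests in this PR).
# Assumes a single "model.safetensors" shard in the save directory.
import os

import torch
from safetensors.torch import load_file


def saved_weights_match(model, save_dir: str) -> bool:
    model.save_pretrained(save_dir)  # tied duplicates are dropped on save
    on_disk = load_file(os.path.join(save_dir, "model.safetensors"))
    in_memory = model.state_dict()
    # every serialized tensor should equal its in-memory counterpart;
    # tied duplicates are simply absent from the file
    return all(
        torch.equal(tensor, in_memory[name].to(tensor.device))
        for name, tensor in on_disk.items()
    )
```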

kylesayrs added 30 commits May 30, 2025 13:40
@kylesayrs marked this pull request as ready for review July 8, 2025 17:18
kylesayrs added 4 commits July 8, 2025 15:42
@kylesayrs force-pushed the kylesayrs/transform_save branch from 49e04b9 to 2e362d2 on July 8, 2025 23:16
Base automatically changed from kylesayrs/transform_apply to main July 9, 2025 22:32
@dsikka dismissed brian-dellabetta’s stale review July 9, 2025 22:32

The base branch was changed.

@brian-dellabetta (Contributor) left a comment

Looks like your style checks have different behavior than what's on main. I've seen this elsewhere, not sure what's causing it. Maybe our version pin on flake8>=3.8.3 is too loose and later versions have different behavior?

@brian-dellabetta (Contributor) left a comment

This and #391 are both ready to merge, right? We can send them over to the team for review if so.

@kylesayrs (Contributor, Author) commented

@brian-dellabetta This is ready to merge. AFAICT the only thing left from the head branch is applying transforms at higher granularity, and it is still unknown whether that actually has an effect on accuracy.

@brian-dellabetta (Contributor) left a comment

LGTM!

@brian-dellabetta (Contributor) left a comment

👍

@dsikka merged commit b2df366 into main Aug 7, 2025
1 check passed
@dsikka deleted the kylesayrs/transform_save branch August 7, 2025 01:12
brian-dellabetta added a commit to vllm-project/llm-compressor that referenced this pull request Aug 13, 2025
## Purpose ##
* Enable offline spinquant-style transforms

## Prerequisites ##
* neuralmagic/compressed-tensors#370
* neuralmagic/compressed-tensors#412
* neuralmagic/compressed-tensors#414

## Changes ##
* Added `spinquant_example.py` to examples folder
* Added `SpinQuantModifier` which handles the construction of a
spinquant-style transform config

## Testing ##
* Added modifier serialization and correctness tests

## Evaluation ##
Using this branch and [the original SpinQuant
code](https://github.com/facebookresearch/SpinQuant), we see very
similar results for `meta-llama/Llama-3.2-1B-Instruct` with W4A16
quantization. Results are equivalent in hf (in-memory vs. serialized and
re-loaded), and very similar in vllm. The symmetric scales calculation
in `llm-compressor` is slightly different from the original SpinQuant
paper, which uses the original GPTQ implementation. When the SpinQuant
scales calculation is swapped in, results are consistent, with hadamard
improving results on `gsm8k_llama` and `arc_challenge_llama`:

Scheme | Impl | gsm8k | gsm8k_llama | arc_challenge_llama
-- | -- | -- | -- | --
Hadamard+W4A16 | LC | 0.2403 | 0.2835 | 0.5262
W4A16 | LC | 0.1964 | 0.1933 | 0.4781
Hadamard+W4A16 | LC+SQscales | 0.1721 | 0.2183 | 0.485
W4A16 | LC+SQscales | 0.207 | 0.1706 | 0.4498
Hadamard+W4A16 | SQ | 0.1736 | 0.2282 | 0.4807
W4A16 | SQ | 0.1986 | 0.1774 | 0.4489

To run LC+SQScales, change [this line in
CT](https://github.com/neuralmagic/compressed-tensors/blob/b2df366797b00330ec765f5891dde14e4cc74c9d/src/compressed_tensors/quantization/utils/helpers.py#L111)
from

```python
scales = max_val_pos / (float(bit_range) / 2)
```
to
```python
scales = max_val_pos / (float(bit_max))
```
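
For concreteness, the snippet below spells out the difference between the two divisors, assuming a signed int4 range of [-8, 7] (so bit_min = -8, bit_max = 7, and bit_range = 15); the values are illustrative and not taken from the code above.

```python
# Illustration of the two scale divisors, assuming a signed int4 range of [-8, 7].
bit_min, bit_max = -8, 7
bit_range = bit_max - bit_min                    # 15
max_val_pos = 1.0                                # example absolute-max weight value

ct_scale = max_val_pos / (float(bit_range) / 2)  # divides by 7.5 (line shown above)
sq_scale = max_val_pos / float(bit_max)          # divides by 7.0 (SpinQuant/GPTQ-style)
print(ct_scale, sq_scale)                        # 0.1333..., 0.1428...
```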

<details>
<summary>The following python script was used to generate these
results</summary>

Clone SpinQuant repo and paste this in the top-level directory:
```python
# coding=utf-8
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import torch
from typing import Literal
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from torch import nn
import lm_eval

from transformers import LlamaForCausalLM, AutoTokenizer
import transformers
from train_utils.main import prepare_model
from train_utils.modeling_llama_quant import LlamaForCausalLM as LlamaForCausalLMQuant
from utils.hadamard_utils import random_hadamard_matrix, hadamard_matrix
from utils.process_args import process_args_ptq

# model_id = "meta-llama/Llama-3.1-8B-Instruct"
# model_id = "meta-llama/Llama-3.2-3B-Instruct"
model_id = "meta-llama/Llama-3.2-1B-Instruct"
dtype = torch.bfloat16


class RotateModule(nn.Module):
    def __init__(self, R_init):
        super(RotateModule, self).__init__()
        self.weight = nn.Parameter(R_init.to(torch.float32).to(torch.device("cuda")))

    def forward(self, x, transpose=False):
        if transpose:
            return x @ self.weight
        else:
            return self.weight @ x


def get_sq_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"],
    w_bits: Literal[4, 16],
    w_clip: bool = False,
) -> LlamaForCausalLMQuant:
    model_args, training_args, ptq_args = process_args_ptq()
    model_args.input_model = model_id
    if w_bits == 4:
        ptq_args.w_bits = 4
        ptq_args.w_groupsize = 128
        ptq_args.w_rtn = True  # if False, GPTQ is used
        ptq_args.w_clip = w_clip
    ptq_args.a_bits = 16
    ptq_args.k_bits = 16
    ptq_args.v_bits = 16

    print("=======ARGS=======", ptq_args)

    config = transformers.AutoConfig.from_pretrained(model_args.input_model)

    # Llama v3.2 specific: SpinQuant is not compatible with tie_word_embeddings; clone lm_head from embed_tokens
    process_word_embeddings = False
    if config.tie_word_embeddings:
        config.tie_word_embeddings = False
        process_word_embeddings = True

    model = LlamaForCausalLMQuant.from_pretrained(
        pretrained_model_name_or_path=model_args.input_model,
        config=config,
        torch_dtype=dtype,
        device_map="cuda",
    )

    if process_word_embeddings:
        model.lm_head.weight.data = model.model.embed_tokens.weight.data.clone()

    model = prepare_model(ptq_args, model)
    for param in model.parameters():
        param.requires_grad = False
    match r1r2:
        case "eye":
            R1 = torch.eye(model.config.hidden_size, device="cuda")
        case "random-hadamard":
            R1 = random_hadamard_matrix(model.config.hidden_size, "cuda")
        case _:
            R1 = hadamard_matrix(model.config.hidden_size, "cuda")
    model.R1 = RotateModule(R1)
    for i in range(model.config.num_hidden_layers):
        # Each head dim = 128 for Llama model
        match r1r2:
            case "eye":
                R2 = torch.eye(
                    model.config.hidden_size // model.config.num_attention_heads,
                    device="cuda",
                )
            case "random-hadamard":
                R2 = random_hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
            case _:
                R2 = hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
        model.model.layers[i].self_attn.R2 = RotateModule(R2)

    model.config.use_cache = False

    return model


def get_lc_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"],
    w_bits: Literal[4, 16],
) -> LlamaForCausalLM:
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import SpinQuantModifier

    model = LlamaForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_id,
        torch_dtype=dtype,
        device_map="cuda",
    )

    recipe = [
        SpinQuantModifier(
            rotations=[] if r1r2 == "eye" else ["R1", "R2"],
            transform_type="hadamard",
        )
    ]
    if w_bits == 4:
        recipe.append(
            QuantizationModifier(
                targets="Linear",
                scheme="W4A16",
                ignore=["lm_head"],
            )
        )

    oneshot(
        model=model,
        recipe=recipe,
        pipeline="datafree",
        log_dir=None,
    )

    return model


if __name__ == "__main__":
    for scales_impl in ["sq_min_hack", "lc_min_hack"]:
        for r1r2 in ["eye", "hadamard"]:
            for sq_lc in ["sq", "lc"]:
                w_bits = 4

                os.environ["SCALES_IMPL"] = scales_impl

                model = (
                    get_sq_model(r1r2=r1r2, w_bits=w_bits)
                    if sq_lc == "sq"
                    else get_lc_model(r1r2=r1r2, w_bits=w_bits)
                ).to("cuda")

                SAVE_DIR = model_id.split("/")[1] + f"-{scales_impl}-{r1r2}-w4a16"
                model.save_pretrained(SAVE_DIR, save_compressed=True)
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id, trust_remote_code=True
                )
                tokenizer.save_pretrained(SAVE_DIR)

                del model
                del tokenizer
                torch.cuda.empty_cache()

                results = lm_eval.simple_evaluate(
                    # 1) hf in-memory
                    # model=lm_eval.models.huggingface.HFLM(
                    #     pretrained=model,
                    #     batch_size=32,
                    #     add_bos_token=False,
                    # ),
                    # 1/)
                    # 2) vllm serialized
                    model="vllm",
                    model_args={
                        "pretrained": SAVE_DIR,
                        "add_bos_token": False,
                        "dtype": "auto",
                        "max_model_len": 4096,
                        "gpu_memory_utilization": 0.5,
                        "enable_chunked_prefill": True,
                    },
                    # 2/)
                    # 3) hf serialized
                    # model="hf",
                    # model_args={
                    #     "pretrained": SAVE_DIR,
                    #     "add_bos_token": False,
                    #     "dtype": "auto",
                    # },
                    # device="cuda",
                    # 3/)
                    tasks=["gsm8k_llama", "gsm8k", "arc_challenge_llama"],
                    num_fewshot=8,
                    batch_size=32,
                    apply_chat_template=True,
                    fewshot_as_multiturn=True,
                )
                print(
                    f"RESULTS, {model_id} {sq_lc} R1R2 {r1r2} W_BITS {w_bits} SCALEIMPL {scales_impl}"
                )
                print(lm_eval.utils.make_table(results))
```
</details>


## Follow Ups ##
* Infer data free pipeline, even if a transform modifier is included
* Rotations R3 and R4
* Modify example to use GPTQ once basic evaluation has been performed

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>