(torchao_hf_integration)=
# Hugging Face Integration

```{contents}
:local:
:depth: 2
```

(usage-examples)=
## Quick Start: Usage Example

First, install the required packages.

```bash
pip install git+https://github.com/huggingface/transformers@main
pip install git+https://github.com/huggingface/diffusers@main
pip install torchao
pip install torch
pip install accelerate
```

(quantizing-models-transformers)=
### 1. Quantizing Models with Transformers

Below is an example of using `Float8DynamicActivationInt4WeightConfig` on the Llama-3.2-1B model.

```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationInt4WeightConfig

# Create the quantization configuration
quantization_config = TorchAoConfig(
    quant_type=Float8DynamicActivationInt4WeightConfig(group_size=128, use_hqq=True)
)

# Load and automatically quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)
```
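
The quantized model can then be used like any other `transformers` model. Below is a minimal inference sketch; the prompt and generation settings here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Run a short generation to sanity-check the quantized model
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
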
```{seealso}
For inference examples and recommended quantization methods for different hardware (e.g., A100 GPU, H100 GPU, CPU), see the [HF-TorchAO Docs (Quantization Examples)](https://huggingface.co/docs/transformers/main/en/quantization/torchao#quantization-examples).

For inference using vLLM, please see [(Part 3) Serving on vLLM, SGLang, ExecuTorch](https://docs.pytorch.org/ao/main/serving.html) for a full end-to-end tutorial.
```

(quantizing-models-diffusers)=
### 2. Quantizing Models with Diffusers

Below is an example of quantizing the Flux transformer model with Diffusers.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Quantize the transformer to int8 weight-only
quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("output.png")
```
| 84 | + |
| 85 | +```{seealso} |
| 86 | +Please refer to [HF-TorchAO-Diffuser Docs](https://huggingface.co/docs/diffusers/en/quantization/torchao) for more examples and benchmarking results. |
| 87 | +``` |

(saving-models)=
## Saving the Model

After quantizing the model, we can save it locally or push it to the Hugging Face Hub.

```python
import tempfile

from transformers import AutoTokenizer

# Save the quantized model (see below for safetensors support status)
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, safe_serialization=False)

# Optional: push the model and tokenizer to the Hub (uncomment the following lines)
# save_to = "your-username/Llama-3.2-1B-int4"
# model.push_to_hub(save_to, safe_serialization=False)
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# tokenizer.push_to_hub(save_to)
```

**Current Status of Safetensors Support**: TorchAO quantized models cannot yet be serialized with safetensors due to tensor subclass limitations. When saving quantized models, you must use `safe_serialization=False`.

```python
# Do not serialize the model with safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
```
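
Reloading the checkpoint afterwards works with the usual `from_pretrained` call. This is a minimal sketch; the directory name assumes the model saved above:

```python
from transformers import AutoModelForCausalLM

# Reload the quantized checkpoint (saved with safe_serialization=False)
reloaded_model = AutoModelForCausalLM.from_pretrained(
    "llama3-8b-int4wo-128",
    device_map="auto",
    torch_dtype="auto",
)
```
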

**Workaround**: For now, always pass `safe_serialization=False` when saving quantized models or pushing them to the Hugging Face Hub.

**Future Work**: The TorchAO team is actively working on safetensors support for tensor subclasses. Track progress [here](https://github.com/pytorch/ao/issues/2338) and [here](https://github.com/pytorch/ao/pull/2881).

(Supported-Quantization-Types)=
## Supported Quantization Types

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
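
For illustration, a weight-only configuration might look like the sketch below (it assumes `Int8WeightOnlyConfig` is available in your torchao version):

```python
from transformers import TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

# int8 weights; activations and compute stay in the higher-precision dtype (e.g. bfloat16)
quantization_config = TorchAoConfig(quant_type=Int8WeightOnlyConfig())
```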

Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory. This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations. However, it can sometimes reduce output quality, so it is recommended to evaluate quantized models thoroughly.
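
A corresponding dynamic activation sketch (again assuming the config class below exists in your torchao version) looks like:

```python
from transformers import TorchAoConfig
from torchao.quantization import Int8DynamicActivationInt8WeightConfig

# int8 weights, with activations quantized to int8 on the fly at inference time
quantization_config = TorchAoConfig(quant_type=Int8DynamicActivationInt8WeightConfig())
```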

```{note}
Please refer to the [torchao docs](https://docs.pytorch.org/ao/main/api_ref_quantization.html) for supported quantization types.
```