Commit 24b9e73

Merge branch 'main' into ko-main_classes/peft.md
2 parents: ce1d483 + 7b897fe

172 files changed: +3370 additions, −1011 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ limitations under the License.
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ja.md">日本語</a> |
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_hd.md">हिन्दी</a> |
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ru.md">Русский</a> |
-<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Рortuguês</a> |
+<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Português</a> |
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_te.md">తెలుగు</a> |
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md">Français</a> |
 <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_de.md">Deutsch</a> |

docker/transformers-quantization-latest-gpu/Dockerfile

Lines changed: 3 additions & 0 deletions
@@ -78,6 +78,9 @@ RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submod
 # RUN python3 -m pip install --no-cache-dir flute-kernel==0.4.1
 # RUN python3 -m pip install --no-cache-dir git+https://github.com/Dao-AILab/fast-hadamard-transform.git
 
+# Add fp-quant for quantization testing
+RUN python3 -m pip install --no-cache-dir "fp-quant>=0.1.6"
+
 # Add compressed-tensors for quantization testing
 RUN python3 -m pip install --no-cache-dir compressed-tensors

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -179,6 +179,8 @@
     title: FBGEMM
   - local: quantization/finegrained_fp8
     title: Fine-grained FP8
+  - local: quantization/fp_quant
+    title: FP-Quant
   - local: gguf
     title: GGUF
   - local: quantization/gptq

docs/source/en/main_classes/quantization.md

Lines changed: 4 additions & 0 deletions
@@ -93,6 +93,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
 
 [[autodoc]] QuarkConfig
 
+## FPQuantConfig
+
+[[autodoc]] FPQuantConfig
+
 ## AutoRoundConfig
 
 [[autodoc]] AutoRoundConfig

docs/source/en/model_doc/encodec.md

Lines changed: 2 additions & 1 deletion
@@ -47,7 +47,8 @@ Here is a quick example of how to encode and decode an audio using this model:
 >>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
 
 >>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
->>> audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
+>>> # `encoder_outputs.audio_codes` contains discrete codes
+>>> audio_values = model.decode(**encoder_outputs, padding_mask=inputs["padding_mask"])[0]
 >>> # or the equivalent with a forward pass
 >>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
 ```
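
The hunk above only shows the changed `decode` call. For context, a minimal end-to-end sketch of the updated usage might look like the following; the `facebook/encodec_24khz` checkpoint and the dummy LibriSpeech dataset are illustrative assumptions, not part of this commit:

```py
# Hedged sketch of the full encode/decode flow around the changed lines.
# Checkpoint and dataset names are assumptions, not taken from this diff.
from datasets import Audio, load_dataset
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Load a small speech sample and resample it to the model's sampling rate
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
audio_sample = librispeech_dummy[0]["audio"]["array"]

inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
# `decode` now receives the full encoder output (audio_codes and audio_scales) plus the padding mask
audio_values = model.decode(**encoder_outputs, padding_mask=inputs["padding_mask"])[0]
```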

docs/source/en/model_doc/opt.md

Lines changed: 66 additions & 159 deletions
Large diffs are not rendered by default.

docs/source/en/model_doc/yolos.md

Lines changed: 64 additions & 45 deletions
@@ -13,76 +13,95 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.
 
 -->
-
-# YOLOS
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
 </div>
 
-## Overview
+# YOLOS
 
-The YOLOS model was proposed in [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://huggingface.co/papers/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
-YOLOS proposes to just leverage the plain [Vision Transformer (ViT)](vit) for object detection, inspired by DETR. It turns out that a base-sized encoder-only Transformer can also achieve 42 AP on COCO, similar to DETR and much more complex frameworks such as Faster R-CNN.
+[YOLOS](https://huggingface.co/papers/2106.00666) uses a [Vision Transformer (ViT)](./vit) for object detection with minimal modifications and region priors. It can achieve performance comparable to specialized object detection models and frameworks with knowledge about 2D spatial structures.
 
-The abstract from the paper is the following:
 
-*Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS.*
+You can find all the original YOLOS checkpoints under the [HUST Vision Lab](https://huggingface.co/hustvl/models?search=yolos) organization.
 
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png"
-alt="drawing" width="600"/>
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png" alt="drawing" width="600"/>
 
 <small> YOLOS architecture. Taken from the <a href="https://huggingface.co/papers/2106.00666">original paper</a>.</small>
 
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/hustvl/YOLOS).
 
-## Using Scaled Dot Product Attention (SDPA)
+> [!TIP]
+> This model was contributed by [nielsr](https://huggingface.co/nielsr).
+> Click on the YOLOS models in the right sidebar for more examples of how to apply YOLOS to different object detection tasks.
 
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
-or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
-page for more information.
+The example below demonstrates how to detect objects with [`Pipeline`] or the [`AutoModel`] class.
 
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
-`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-```
-from transformers import AutoModelForObjectDetection
-model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-base", attn_implementation="sdpa", torch_dtype=torch.float16)
-...
+```py
+import torch
+from transformers import pipeline
+
+detector = pipeline(
+    task="object-detection",
+    model="hustvl/yolos-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
 ```
 
-For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
+</hfoption>
+<hfoption id="AutoModel">
 
-On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `hustvl/yolos-base` model, we saw the following speedups during inference.
+```py
+import torch
+from PIL import Image
+import requests
+from transformers import AutoImageProcessor, AutoModelForObjectDetection
 
-| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
-|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
-| 1 | 106 | 76 | 1.39 |
-| 2 | 154 | 90 | 1.71 |
-| 4 | 222 | 116 | 1.91 |
-| 8 | 368 | 168 | 2.19 |
+processor = AutoImageProcessor.from_pretrained("hustvl/yolos-base")
+model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-base", torch_dtype=torch.float16, attn_implementation="sdpa").to("cuda")
 
-## Resources
+url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits.softmax(-1)
+    scores, labels = logits[..., :-1].max(-1)
+    boxes = outputs.pred_boxes
+
+threshold = 0.3
+keep = scores[0] > threshold
 
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with YOLOS.
+filtered_scores = scores[0][keep]
+filtered_labels = labels[0][keep]
+filtered_boxes = boxes[0][keep]
 
-<PipelineTag pipeline="object-detection"/>
+width, height = image.size
+pixel_boxes = filtered_boxes * torch.tensor([width, height, width, height], device=boxes.device)
 
-- All example notebooks illustrating inference + fine-tuning [`YolosForObjectDetection`] on a custom dataset can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS).
-- Scripts for finetuning [`YolosForObjectDetection`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/object-detection).
-- See also: [Object detection task guide](../tasks/object_detection)
+for score, label, box in zip(filtered_scores, filtered_labels, pixel_boxes):
+    x0, y0, x1, y1 = box.tolist()
+    print(f"Label {model.config.id2label[label.item()]}: {score:.2f} at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]")
+```
+
+</hfoption>
+</hfoptions>
 
-If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
 
-<Tip>
+## Notes
+- Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](./detr), YOLOS doesn't require a `pixel_mask`.
 
-Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
+## Resources
 
-</Tip>
+- Refer to these [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS) for inference and fine-tuning with [`YolosForObjectDetection`] on a custom dataset.
 
 ## YolosConfig
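
As a possible alternative to the manual thresholding in the new `AutoModel` example, the image processor's `post_process_object_detection` helper converts raw outputs into absolute `(x0, y0, x1, y1)` boxes. A hedged sketch, reusing `processor`, `model`, `image`, and `inputs` from the example above:

```py
# Hedged sketch: post-process with the image processor instead of manual filtering.
# Reuses processor/model/image/inputs from the AutoModel example in the diff above.
import torch

with torch.no_grad():
    outputs = model(**inputs)

# target_sizes expects one (height, width) pair per image
target_sizes = torch.tensor([image.size[::-1]], device=model.device)
results = processor.post_process_object_detection(outputs, threshold=0.3, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    x0, y0, x1, y1 = box.tolist()
    print(f"{model.config.id2label[label.item()]}: {score:.2f} at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]")
```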

docs/source/en/modular_transformers.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ ValueError: You defined `RobertaEmbeddings` in the modular_roberta.py, it should
 
 ## Implementing a modular file
 
-The easiest way to start is by browsing Transformers for a model similar to yours in order to inherit from it. Some good starting points are [Mistral](./model_doc/mistral), [Qwen2](./model_doc/qwen2), [Cohere](./model_doc/cohere) and [Cohere](./model_doc/cohere2), and [Llama](./model_doc/llama). Refer to the table below for components your model might be using and where you can inherit from.
+The easiest way to start is by browsing Transformers for a model similar to yours in order to inherit from it. Some good starting points are [Mistral](./model_doc/mistral), [Qwen2](./model_doc/qwen2), [Cohere](./model_doc/cohere) and [Cohere2](./model_doc/cohere2), and [Llama](./model_doc/llama). Refer to the table below for components your model might be using and where you can inherit from.
 
 | Component | Model |
 |---|---|
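
To make the inheritance idea in the changed paragraph concrete, a minimal, hypothetical modular file inheriting from Llama could look like the sketch below; the `MyModel` names and the file name are placeholders, not from this commit:

```py
# modular_mymodel.py -- hypothetical sketch; the doc-builder would expand it into a full modeling file.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaModel


class MyModelConfig(LlamaConfig):
    model_type = "mymodel"


class MyModelModel(LlamaModel):
    pass


class MyModelForCausalLM(LlamaForCausalLM):
    pass
```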

docs/source/en/quantization/fp_quant.md (new file)

Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# FP-Quant
+
+[FP-Quant](https://github.com/IST-DASLab/FP-Quant) is a family of quantization algorithms tailored for the Blackwell generation of NVIDIA GPUs. The goal is to allow for efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs in the [MXFP4 and NVFP4 data types](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
+
+Currently, only PTQ with MXFP4 is supported. Models can either be quantized on the fly with `quantization_config=FPQuantConfig()`:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "qwen/Qwen3-8B",
+    quantization_config=FPQuantConfig(),
+    device_map="cuda",
+    torch_dtype=torch.bfloat16,
+)
+```
+
+or pre-processed with GPTQ for better quality (see the [FP Format Quantization Harness](https://github.com/IST-DASLab/FP-Quant)).
+
+A **Blackwell-generation GPU is required** to run the kernels. Runtime support for FP-Quant is implemented through the [QuTLASS](https://github.com/IST-DASLab/qutlass) library and a lightweight PyTorch interface library, [`fp_quant`](https://github.com/IST-DASLab/FP-Quant/tree/master/inference_lib). We recommend installing the former **from source** and the latter with `pip install fp_quant`.
+
+Users **without a Blackwell-generation GPU** can use the method with `quantization_config=FPQuantConfig(pseudoquant=True)` without having to install [QuTLASS](https://github.com/IST-DASLab/qutlass). This provides no speedups but fully emulates the effect of quantization.
+
+> [!TIP]
+> Find models pre-quantized with FP-Quant in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/fp-quant-6877c186103a21d3a02568ee).
+
+## torch.compile
+
+FP-Quant is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
+
+model = AutoModelForCausalLM.from_pretrained(
+    "qwen/Qwen3-8B",
+    quantization_config=FPQuantConfig(),
+    device_map="cuda",
+    torch_dtype=torch.bfloat16,
+)
+
+model.forward = torch.compile(model.forward, mode="max-autotune", fullgraph=True)
+```
+
+## Speedups
+
+FP-Quant currently performs best for very large batch size processing.
+
+See the [QuTLASS README](https://github.com/IST-DASLab/qutlass/blob/main/README.md) for speedups.
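
The pseudo-quantization path described in the new doc (for machines without a Blackwell GPU) can be sketched as follows; the prompt and generation settings are illustrative assumptions:

```python
# Hedged sketch: FPQuantConfig(pseudoquant=True) emulates MXFP4 numerics without
# QuTLASS or Blackwell kernels -- no speedup, only the quality effect of quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(pseudoquant=True),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen3-8B")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```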

docs/source/en/quantization/overview.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ Use the Space below to help you pick a quantization method depending on your har
 | [bitsandbytes](./bitsandbytes) | 🟢 | 🟡 | 🟢 | 🟡 | 🔴 | 🟡 | 🟢 | 4/8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
 | [compressed-tensors](./compressed_tensors) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 1/8 | 🟢 | 🟢 | 🟢 | https://github.com/neuralmagic/compressed-tensors |
 | [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
+| [FP-Quant](./fp_quant) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 4 | 🔴 | 🟢 | 🟢 | https://github.com/IST-DASLab/FP-Quant |
 | [GGUF / GGML (llama.cpp)](../gguf) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 1/8 | 🔴 | [See Notes](../gguf) | [See Notes](../gguf) | https://github.com/ggerganov/llama.cpp |
 | [GPTQModel](./gptq) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
 | [AutoGPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
