One file, one idea. One script, one experiment. One metric table, one truth.
Modern quantization frameworks (vLLM, TensorRT-LLM, AutoAWQ) are extraordinary engineering — CUDA kernel fusion, distributed sharding, multi-backend compatibility, hardware-specific precision formats. That engineering is production-critical, but it buries the math.
The 80/20 rule of quantization: a small number of techniques — RTN, AWQ, GPTQ — cover the vast majority of real-world weight quantization deployments. Everything else (QAT, sparsity, mixed-precision search, distillation) exists but rarely moves the needle in practice. Learn the critical few, skip the rest. nanoPTQ teaches exactly those three algorithms, the same way nanoGPT teaches the Transformer: by stripping everything non-essential. Each algorithm lives in one file. Every formula maps to one line of code.
Clean demos of only:
- nn.Linear layers only — the quantization target in every major LLM framework
- int4 / int8 weight quantization — covers W4A16, W8A16 industrial workloads
- Group-wise quantization (group_size=128) — the universal precision/size tradeoff
- Safetensors I/O — real artifacts you can load into vLLM or HF Transformers
- Perplexity + tokens/s — the two metrics that actually matter
- Bundled calibration + eval data — one-stop eval, no internet needed after setup
This project does not include:
- QAT, pruning, distillation, sparsity
- CUDA/Triton kernels — dequant on the fly, backend handles matmul
- Multi-GPU, FSDP, pipeline parallelism
- VLMs, encoder-decoders, MoE routing
Three formulas explain 90% of the code:
| Concept | Formula | Intuition |
|---|---|---|
| Symmetric quant | S = max\|W\| / (2^(b-1) - 1), Q = round(W / S) | Scale by max absolute value; signed integers |
| Asymmetric quant | S = (max - min) / (2^b - 1), Z = -round(min / S) | Zero-point shift; unsigned integers; covers skewed distributions |
| Group-wise | Apply scales per 128-weight block, not per tensor | One scale per 128 weights — precision up, storage cost tiny |
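A minimal PyTorch sketch of the two formulas. This is illustrative only: the repo's real implementations live in nanoptq/core/quant_primitives.py and the function names and signatures here are mine, not the library's.

```python
import torch

def symmetric_quant(w: torch.Tensor, bits: int = 4):
    """S = max|W| / (2^(b-1) - 1); Q = round(W / S). Signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q, scale

def asymmetric_quant(w: torch.Tensor, bits: int = 4):
    """S = (max - min) / (2^b - 1); Z = -round(min / S). Unsigned grid."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zero = (-w.min() / scale).round()
    q = ((w / scale).round() + zero).clamp(0, qmax)
    return q, scale, zero

def dequant(q, scale, zero=0.0):
    """Fake-quant reconstruction: W ≈ (Q - Z) * S."""
    return (q - zero) * scale
```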
Why int4 with group_size=128?

Bits per weight:
- fp16 = 16 bits
- int8 = 8 bits (2× compression)
- int4, g=128 ≈ 4.25 bits (nearly 4× compression; the per-group scale overhead is worked out in the sketch below)

group_size=128 is the industry consensus (AWQ, GPTQ, and torchao all default to it).
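Here is an illustrative sketch of the group-wise trick and the bits-per-weight arithmetic behind the numbers above (function name and tensor shapes are my assumptions, not the repo's API):

```python
import torch

def groupwise_symmetric_quant(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """One scale per group of `group_size` weights along the input dimension."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    g = w.view(out_f, in_f // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = g.abs().amax(dim=-1, keepdim=True) / qmax   # [out, n_groups, 1]
    q = (g / scale).round().clamp(-qmax - 1, qmax)
    return q.view(out_f, in_f), scale.squeeze(-1)

# Bits per weight = payload + per-group metadata, at group_size=128:
#   int4 + fp16 scale (symmetric):             4 + 16/128 = 4.125 bits
#   int4 + fp16 scale + zero (asymmetric):     4 + 32/128 = 4.25  bits
```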
| Algorithm | Core Idea | Calibration | Industrial Use |
|---|---|---|---|
| RTN | Round-to-Nearest. No data needed. Your baseline. | None | bitsandbytes baseline |
| AWQ-lite | Protect outlier channels by scaling weights up before quant. | ~128 samples | AutoAWQ |
| GPTQ-lite | Use activation Hessian to compensate downstream columns after each quant step. | ~128 samples | GPTQModel |
The mathematical essence:
```python
# RTN (nanoptq/algorithms/rtn.py)
W_q = (W / scale).round() * scale              # that's it

# AWQ (nanoptq/algorithms/awq_lite.py)
s = activations.abs().mean(dim=0) ** alpha     # per-channel importance
W_q = quantize(W * s)                          # scale up important channels
out = W_q @ (x / s)                            # divide back at runtime

# GPTQ (nanoptq/algorithms/gptq_lite.py)
H = X.T @ X                                    # input Hessian
H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
for j in range(in_features):
    err = W[:, j] - quantize(W[:, j])          # quantization error of column j
    W[:, j+1:] -= err[:, None] * H_inv[j, j+1:] / H_inv[j, j]   # rank-1 update: compensate remaining columns
```

On Qwen2-0.5B, int4 group_size=128 (your numbers will vary slightly):
| Method | PPL (wikitext-2) | ΔPPL | Notes |
|---|---|---|---|
| fp16 baseline | ~14.5 | — | reference |
| RTN int4 | ~16–18 | +2–4 | no calibration needed |
| AWQ int4 | ~15–16 | +0.5–2 | better outlier handling |
| GPTQ int4 | ~15–16 | +0.5–2 | similar to AWQ |
Lower perplexity is better. FP16 is the quality ceiling; RTN is the floor that the calibration-based methods should beat.
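For reference, perplexity is just the exponential of the mean per-token negative log-likelihood. A simplified, non-overlapping-window version of what nanoptq/eval/ppl.py computes (its exact interface is assumed here) might look like:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, window: int = 2048) -> float:
    """PPL = exp(mean NLL per predicted token), over non-overlapping windows."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, input_ids.size(1) - 1, window):
        chunk = input_ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)   # HF causal LMs shift labels internally
        n = chunk.size(1) - 1              # tokens actually predicted in this window
        total_nll += out.loss.item() * n   # out.loss is the mean CE over those tokens
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```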
Prerequisites:
python >= 3.10, pytorch >= 2.0, transformers, safetensors
Install:
```bash
git clone https://github.com/host452b/nanoPTQ
cd nanoPTQ
pip install -e ".[dev]"

# Prepare bundled eval data (one-time, ~30s, needs internet once)
python scripts/prepare_data.py
```

Run:
```bash
# Quantize with RTN (no calibration data needed)
nanoptq quantize --model Qwen/Qwen2-0.5B --method rtn --bits 4 --group-size 128 --output ./qwen-rtn-int4

# Evaluate perplexity (uses bundled wikitext-2, no internet needed)
nanoptq eval --model ./qwen-rtn-int4 --metric ppl

# Compare RTN vs FP16 baseline
nanoptq compare --model Qwen/Qwen2-0.5B --bits 4 --group-size 128

# End-to-end example with latency
python examples/quant_model.py --model Qwen/Qwen2-0.5B --bits 4

# Compare all three methods side by side
python examples/compare_methods.py --model Qwen/Qwen2-0.5B --bits 4
```

If you are learning, read in this order:
| Step | File | What you learn | Time |
|---|---|---|---|
| 0 | docs/Glossary.md | Every term with an analogy — read before anything else | 10 min |
| 1 | nanoptq/core/quant_primitives.py | The math: symmetric, asymmetric, fake_quant | 5 min |
| 2 | nanoptq/core/group_quant.py | Why group-wise dramatically improves int4 | 5 min |
| 3 | nanoptq/model/quant_linear.py | Unified layer abstraction; dequant-on-the-fly | 10 min |
| 4 | nanoptq/algorithms/rtn.py | Baseline: round and done | 5 min |
| 5 | nanoptq/algorithms/awq_lite.py | Activation-aware improvement | 15 min |
| 6 | nanoptq/algorithms/gptq_lite.py | Hessian-based compensation | 20 min |
| 7 | examples/compare_methods.py | See them all side by side | — |
| 8 | docs/flow.md | End-to-end lifecycle: offline quant → runtime inference | 10 min |
| Directory | README | What's inside |
|---|---|---|
| nanoptq/ | → | Core library |
| nanoptq/core/ | → | Quantization math primitives |
| nanoptq/model/ | → | QuantLinear + HF model loading |
| nanoptq/algorithms/ | → | RTN, AWQ, GPTQ implementations |
| nanoptq/io/ | → | Save/load safetensors checkpoints |
| nanoptq/eval/ | → | Perplexity + latency benchmarks |
| nanoptq/data/ | → | Dataset loader |
| examples/ | → | Runnable demos |
| data/ | → | Bundled calibration + eval datasets |
| scripts/ | → | One-time setup scripts |
| tests/ | → | Unit + integration tests |
| docs/ | → | Glossary, flow diagrams |
```
nanoptq/
├── core/
│   ├── quant_primitives.py   # symmetric/asymmetric/fake_quant math
│   └── group_quant.py        # group-wise quantization (the key trick)
├── model/
│   ├── quant_linear.py       # QuantLinear: drop-in for nn.Linear
│   └── hf_loader.py          # load HF model, replace Linear in-place
├── algorithms/
│   ├── rtn.py                # Round-to-Nearest (zero calibration)
│   ├── awq_lite.py           # AWQ-lite (activation-aware)
│   └── gptq_lite.py          # GPTQ-lite (Hessian compensation)
├── io/
│   └── safetensors_io.py     # save/load quantized checkpoints
├── eval/
│   ├── ppl.py                # sliding-window perplexity
│   └── latency.py            # prefill_ms, decode_tps, peak_mem_gb
└── data/
    └── loader.py             # load bundled calibration/eval data
data/
├── calibration/
│   └── {name}_128.jsonl      # 128 samples per dataset (7 datasets bundled)
└── eval/
    └── {name}_eval.jsonl     # full eval splits
examples/
├── quant_model.py            # end-to-end: load → quantize → eval → generate
├── compare_methods.py        # RTN vs AWQ vs GPTQ side-by-side table
├── precision_tour.py         # bf16 / fp8 / int4 / nvfp4 explained interactively
└── awq_explained.py          # AWQ step-by-step with live demos
docs/
├── Glossary.md               # every quantization term with an analogy
└── flow.md                   # flowcharts: offline quant + runtime inference
scripts/
└── prepare_data.py           # download datasets from HuggingFace (one-time)
```
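Checkpoints are plain safetensors files, so they round-trip with the standard library calls. A hypothetical per-layer layout is sketched below; the tensor names and shapes are my assumptions, and the real schema is defined in nanoptq/io/safetensors_io.py.

```python
import torch
from safetensors.torch import save_file, load_file

# Hypothetical layout for one quantized layer (names/shapes illustrative).
tensors = {
    "model.layers.0.mlp.up_proj.qweight": torch.zeros(1024, 512, dtype=torch.int8),         # int4 codes in int8 storage
    "model.layers.0.mlp.up_proj.scales": torch.ones(1024, 512 // 128, dtype=torch.float16), # per-group scales
}
save_file(tensors, "model.safetensors")
restored = load_file("model.safetensors")   # dict[str, torch.Tensor]
```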
| Decision | Rationale |
|---|---|
| Only quantize nn.Linear, skip embeddings/norms | All major frameworks target Linear; norm layers have too few params to matter |
| Dequant-on-the-fly in forward() | No CUDA kernel needed; preserves model.generate() compat |
| group_size=128 default | AWQ, GPTQ, torchao consensus; best precision/size balance for int4 |
| symmetric=True default for weights | Simpler hardware implementation; asymmetric is opt-in |
| Skip lm_head by default | Output projection is sensitive; quantizing it often hurts PPL disproportionately |
| Bundle datasets in data/ | Reproducible eval without internet; one-stop eval for students |
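To make the dequant-on-the-fly decision concrete, here is a minimal sketch of such a layer. It is illustrative only; the repo's reference implementation is QuantLinear in nanoptq/model/quant_linear.py, and this class name and constructor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Int4LinearSketch(nn.Module):
    """Symmetric int4 group-wise linear that dequantizes inside forward().
    The matmul stays a plain F.linear, so model.generate() works unmodified."""

    def __init__(self, qweight: torch.Tensor, scales: torch.Tensor,
                 bias: torch.Tensor | None = None, group_size: int = 128):
        super().__init__()
        self.register_buffer("qweight", qweight)   # [out, in], int4 codes in int8 storage
        self.register_buffer("scales", scales)     # [out, in // group_size], fp16
        self.bias = bias
        self.group_size = group_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_f, in_f = self.qweight.shape
        # Dequantize just-in-time: W ≈ Q * S, one scale per 128-weight group
        w = self.qweight.to(x.dtype).view(out_f, -1, self.group_size)
        w = (w * self.scales.to(x.dtype).unsqueeze(-1)).view(out_f, in_f)
        return F.linear(x, w, self.bias)           # backend handles the matmul
```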
| Resource | Description |
|---|---|
| docs/Glossary.md | Every quantization term with an analogy |
| docs/flow.md | Flowcharts: offline quantization + runtime inference |
| examples/precision_tour.py | Interactive tour of bf16, fp8, int4, nvfp4 |
| examples/awq_explained.py | Step-by-step AWQ with live demos |
- nanoGPT — the gold standard for educational ML repos
- llm.c — C clarity applied to deep learning
- AutoAWQ · GPTQModel · torchao — industrial reference implementations
- AngelSlim — bundled dataset eval design
License: MIT