Training-free embedding compression for local retrieval and RAG benchmark reproduction.
The cleanest single-chart summary in the repo: TurboQuant stays near FP32 retrieval quality while compressing much harder than PQ and OPQ, without fitting a corpus-specific quantizer.
turboquant-embed is a reference CPU/NumPy implementation of TurboQuant for dense embeddings. The goal of this repository is not to pretend TurboQuant is already a production vector database. The goal is narrower and more useful: compress embeddings aggressively, search them locally, and measure honestly what happens to retrieval quality against FP32, scalar quantization, product quantization, and hybrid retrieval pipelines.
At a glance:
- training-free compression; no corpus-specific codebook fitting
- packed storage reductions around 8x at 4-bit and 16x at 2-bit for 1536d embeddings
- direct compressed inner-product search plus save()/load() helpers
- BeIR-facing benchmark scripts and figure artifacts in benchmarks/results/
- reference CPU/NumPy code, not a production ANN index
The practical default in this repo is TQ-4bit: about 7.8x packed compression at d=1536, near-lossless NDCG retention on the current BeIR runs, and no learned codebooks to fit before you can use it.
Install from source:
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev]"

Optional extras:
pip install -e ".[bench]"
pip install -e ".[faiss]"
pip install -e ".[qdrant]"
pip install -e ".[chroma]"
pip install -e ".[all]"

Minimal compression and search flow:
from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings
compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)
scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)
compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

What this is, and what it is not:
- a local reference implementation of TurboQuant for dense embeddings
- a place to reproduce and inspect the retrieval figures shipped with the project
- useful for local RAG, per-tenant indexes, fast-moving corpora, and compression experiments
- not a production ANN index
- not a distributed vector store
- not a claim that TurboQuant already beats the usual HNSW stack end-to-end
A corpus of N vectors in dimension d, stored in float32, costs 4Nd bytes. For 100K documents embedded with a 1536-dimensional model, that is roughly 614 MB before metadata, replication, or vector-store overhead. That is manageable on a server. It is much less comfortable on a laptop, for per-tenant indexes, or when the corpus changes too quickly for a trained quantizer to be worth the operational cost.
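The 4Nd figure is easy to check. A quick sketch of the arithmetic, using the 100K-document, 1536-dimensional example from above:

```python
def fp32_corpus_bytes(n_vectors: int, dim: int) -> int:
    # FP32 storage for N vectors of dimension d: 4 bytes per coordinate.
    return 4 * n_vectors * dim

mb = fp32_corpus_bytes(100_000, 1536) / 1e6
print(f"{mb:.1f} MB")  # ~614 MB before metadata, replication, or store overhead
```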
Google Research introduced TurboQuant as a data-oblivious, online quantization scheme for high-dimensional vectors. The core idea is simple:
- apply a random orthogonal rotation
- quantize each rotated coordinate with a universal Lloyd-Max codebook
- optionally add a 1-bit QJL residual correction for inner-product estimation
The important word is data-oblivious. There is no corpus-specific k-means fit before compression. A new corpus can be compressed immediately, with deterministic behavior once the seed is fixed.
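The pipeline can be illustrated with a toy NumPy sketch. Note the assumptions: `compress_sketch`/`decompress_sketch` are illustrative names, and a uniform grid over a fixed range stands in for the Lloyd-Max codebook used by the real implementation; the point is that nothing here is fitted to the corpus, and fixing the seed makes it deterministic.

```python
import numpy as np

def compress_sketch(x, bits=4, seed=0):
    # Toy data-oblivious pipeline: random orthogonal rotation, then
    # per-coordinate scalar quantization with a fixed (uniform) grid.
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # seeded random rotation
    rotated = x @ q
    # For unit-norm inputs, rotated coordinates concentrate around N(0, 1/d),
    # so a fixed +-4/sqrt(d) range needs no corpus statistics.
    half = 4.0 / np.sqrt(d)
    levels = 2 ** bits
    step = 2 * half / levels
    codes = np.clip(np.floor((rotated + half) / step), 0, levels - 1)
    return codes.astype(np.uint8), (q, half, step)

def decompress_sketch(codes, params):
    q, half, step = params
    approx = (codes + 0.5) * step - half  # cell midpoints
    return approx @ q.T                   # undo the rotation

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 256))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm embeddings
codes, params = compress_sketch(x, bits=4)
err = np.linalg.norm(decompress_sketch(codes, params) - x) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.2f}")
```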
For the full theory, see:
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- TurboQuant: Redefining AI efficiency with extreme compression
At the API level, the implementation stays intentionally small:
from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings
compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)
scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)
compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

Under the hood, a few distinctions matter:
- serialized_nbytes is the packed on-disk or transferable size
- resident_nbytes is the actual in-memory NumPy footprint once search arrays are materialized
- rotation_backend="qr" is the research-faithful dense baseline
- rotation_backend="hadamard" is the engineering-oriented structured backend for larger dimensions
- rotation_backend="auto" selects conservatively between them
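The serialized-versus-resident distinction is easy to picture with 4-bit codes: two codes pack into one byte on disk, but search may operate on a wider unpacked array in memory. A self-contained sketch (an illustrative layout, not the repo's actual packing format):

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    # Two 4-bit codes per byte, low nibble first (illustrative layout only).
    flat = codes.reshape(-1)
    if flat.size % 2:
        flat = np.append(flat, 0)  # pad to an even count
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray, n: int) -> np.ndarray:
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)[:n]

codes = np.random.default_rng(0).integers(0, 16, size=(1000, 1536), dtype=np.uint8)
packed = pack_4bit(codes)
print("serialized:", packed.nbytes, "bytes")  # ~0.5 byte per coordinate
print("resident:  ", codes.nbytes, "bytes")   # 1 byte per coordinate once unpacked
```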
The repo also separates:
- pure MSE-style compression
- optional QJL residual correction
- direct in-process compressed search
- reconstruction-based compatibility paths for vector stores such as Chroma and Qdrant
That separation matters because "compressed storage", "resident search memory", and "quality after reconstruction into a FP32 vector store" are not the same deployment story.
The benchmark directory mixes two kinds of evidence:
- primary comparative evidence for RAG: relevance-labeled plots that compare TurboQuant with meaningful baselines on BeIR
- synthetic diagnostics: useful to understand the quantizer, but not the main proof for deployment claims
If you want the paper/blog-oriented guide to what each figure actually proves, start with:
- benchmarks/FIGURES_ANALYSIS.md
- benchmarks/results/PUBLICATION_FIGURES.md
- benchmarks/results/STATUS_QUO_COST_TABLE.md
Before asking whether TurboQuant retrieves well, it is worth establishing that the compression itself is real, stable, and operationally interesting.
FP32 versus packed TurboQuant storage as corpus size grows.
The first figure plots serialized storage for FP32, TQ-2bit, TQ-3bit, and TQ-4bit as the corpus grows from 10K to 100K vectors at d=1536.
| Method | Packed size | Ratio vs FP32 |
|---|---|---|
| FP32 | 614 MB | 1.0x |
| TQ-2bit | 38.8 MB | 15.8x |
| TQ-3bit | 58.0 MB | 10.6x |
| TQ-4bit | 77.2 MB | 8.0x |
This is a storage figure, not a retrieval figure. What it establishes is straightforward and important: packed bytes scale linearly with corpus size, and the compression ratio does not erode as the corpus grows.
Retrieval behavior when memory budget, not corpus size, is the binding constraint.
If memory is the binding constraint, compression is not just a storage optimization; it can be a retrieval enabler. On synthetic data, fitting more vectors into the same memory budget can matter more than keeping every vector in FP32.
| Budget | FP32 | TQ-2bit | TQ-4bit |
|---|---|---|---|
| 10 MB | 0.03 | 0.34 | 0.27 |
| 20 MB | 0.07 | 0.50 | 0.53 |
| 40 MB | 0.13 | 0.50 | 0.84 |
| 160 MB | 0.54 | 0.50 | 0.84 |
This is still a synthetic budget-intuition plot, not a public retrieval-effectiveness claim. The reliable takeaway is the shape of the trade-off under tight memory budgets.
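The budget intuition reduces to counting vectors: at d=1536, FP32 needs 6144 bytes per vector while 4-bit packed needs about 768 (ignoring per-index overhead). A quick sketch of that arithmetic:

```python
def vectors_per_budget(budget_mb: float, dim: int, bits: int) -> int:
    # How many d-dimensional vectors fit in a memory budget at a given bitrate,
    # ignoring metadata and index overhead.
    bytes_per_vector = dim * bits / 8
    return int(budget_mb * 1e6 // bytes_per_vector)

for bits, label in [(32, "FP32"), (4, "TQ-4bit"), (2, "TQ-2bit")]:
    print(f"{label}: {vectors_per_budget(10, 1536, bits):,} vectors in 10 MB")
```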
Encoding cost of training-free TurboQuant versus PQ-style pipelines.
The training-free story is not just conceptual. Relative to PQ-style pipelines that need to fit codebooks on the corpus, TurboQuant can be much lighter to initialize and encode with.
| Number of vectors | TQ | PQ | Speedup |
|---|---|---|---|
| 1K | 0.51 s | 70.8 s | 138x |
| 10K | 7.89 s | 2109 s | 267x |
| 50K | 8.58 s | 757.8 s | 88x |
This is the weakest operational figure in the slate. It supports the "no training pass" story, but the PQ timings are noisy enough that they should not be the lead public claim.
With the storage motivation established, the next step is to understand the quantizer's behavior before trusting it on real retrieval data.
Recall versus compression ratio for the MSE and product-style TurboQuant variants.
As bitrate drops, the MSE variant degrades in a predictable way. The product/QJL variant buys more aggressive compression, but pays for it sharply at low bitrates.
| Bits | MSE ratio | MSE recall | Product ratio | Product recall |
|---|---|---|---|---|
| 8 | 4.0x | 0.984 | 4.4x | 0.923 |
| 6 | 5.3x | 0.950 | 6.1x | 0.769 |
| 4 | 8.0x | 0.851 | 10.0x | 0.373 |
| 2 | 15.8x | 0.520 | 26.5x | 0.046 |
The most important negative result in the repo is visible here: the product/QJL path is not magic at 2-bit. It collapses hard in the ultra-compressed regime.
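The recall numbers above follow the usual definition: the fraction of the exact FP32 top-k that survives in the approximate (compressed) top-k. A minimal sketch of that measurement on synthetic scores:

```python
import numpy as np

def recall_at_k(exact_scores, approx_scores, k=10):
    # Fraction of the exact top-k retained by the approximate top-k,
    # averaged across queries.
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(e) & set(a)) / k for e, a in zip(exact_top, approx_top)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
exact = rng.standard_normal((50, 1000))                   # 50 queries x 1000 docs
approx = exact + 0.1 * rng.standard_normal(exact.shape)   # noisy compressed scores
print(f"recall@10: {recall_at_k(exact, approx):.2f}")
```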
Accuracy at 4-bit compression across embedding dimensions from 128d to 3072d.
One practical concern for any embedding quantizer is whether it behaves erratically across dimensions. On the current synthetic setup, TQ-4bit remains unusually stable from 128d to 3072d.
| Dimension | TQ-4bit | SQ4 | PQ |
|---|---|---|---|
| 128 | 0.856 | 0.818 | 0.193 |
| 384 | 0.863 | 0.812 | 0.190 |
| 1536 | 0.869 | 0.812 | 0.199 |
| 3072 | 0.860 | 0.799 | 0.207 |
This supports the intuition that the random-rotation story is dimension-stable, but it remains a synthetic diagnostic, not a deployment benchmark.
Whether the exact top-1 answer remains inside the approximate top-k set.
Another angle on quantization quality is whether the true top-1 result survives somewhere inside the approximate top-k set.
| k | TQ | PQ |
|---|---|---|
| 1 | 0.465 | 0.505 |
| 4 | 0.785 | 0.770 |
| 16 | 0.950 | 0.955 |
| 64 | 1.000 | 1.000 |
This metric is forgiving, but it does show that TurboQuant does not catastrophically lose the best answer.
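This "does the true best answer survive" metric is simpler than recall@k and can be sketched as:

```python
import numpy as np

def top1_survival(exact_scores, approx_scores, k):
    # Fraction of queries whose exact top-1 document appears
    # somewhere in the approximate top-k.
    true_best = np.argmax(exact_scores, axis=1)
    approx_topk = np.argsort(-approx_scores, axis=1)[:, :k]
    return float(np.mean([b in row for b, row in zip(true_best, approx_topk)]))

rng = np.random.default_rng(0)
exact = rng.standard_normal((100, 500))
approx = exact + 0.3 * rng.standard_normal(exact.shape)
for k in (1, 4, 16, 64):
    print(f"k={k}: {top1_survival(exact, approx, k):.2f}")
```

Because the approximate top-k sets are nested as k grows, the metric is monotone in k, which is why the table saturates at 1.000 by k=64.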
Scaling of fidelity and packed storage as the corpus grows from 1K to 100K vectors.
The packed-storage ratio stays fixed as the corpus grows; that part of the story is robust.
| Corpus | FP32 | TQ-4bit | Ratio | Recall@10 |
|---|---|---|---|---|
| 1K | 2 MB | 0.2 MB | 7.8x | 0.71 |
| 10K | 15 MB | 2.0 MB | 7.8x | 0.38 |
| 100K | 154 MB | 19.6 MB | 7.8x | 0.27 |
The recall values here are pessimistic synthetic diagnostics. The reliable takeaway is the linear packed-storage scaling, not the absolute recall numbers.
Everything so far helps build intuition. The figures below are the ones that matter most for RAG: real embedding models, labeled retrieval benchmarks, and baseline families that people actually use.
Hybrid dense+sparse retrieval under TurboQuant compression.
Most RAG systems do not rely on dense retrieval alone. They combine a sparse leg with a dense leg. The deployment question is simple: if you compress the dense leg, do you lose the hybrid lift?
| Comparison | SciFact ΔNDCG@10 | NFCorpus ΔNDCG@10 |
|---|---|---|
| Dense TQ - FP32 | +0.005 [-0.001, +0.010] | -0.003 [-0.007, -0.001] |
| Hybrid TQ - FP32 | -0.001 [-0.009, +0.006] | -0.001 [-0.004, +0.003] |
The honest claim here is not that TurboQuant improves hybrid retrieval. The honest claim is narrower and still useful: in this setup, TQ-4bit does not materially damage the hybrid pipeline.
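One common way to fuse a sparse leg (e.g. BM25) with a dense leg is reciprocal rank fusion. The sketch below is a generic illustration of that idea, not necessarily the fusion used by the repo's benchmark scripts:

```python
import numpy as np

def rrf_fuse(dense_scores, sparse_scores, k=60):
    # Reciprocal rank fusion: sum 1/(k + rank) across the two legs.
    # Rank-based fusion is insensitive to score scales, which is why
    # it is a common default for dense+sparse hybrids.
    def ranks(scores):
        order = np.argsort(-scores)
        r = np.empty_like(order)
        r[order] = np.arange(1, len(scores) + 1)
        return r
    return 1.0 / (k + ranks(dense_scores)) + 1.0 / (k + ranks(sparse_scores))

dense = np.array([0.9, 0.2, 0.8, 0.1])   # e.g. compressed inner products
sparse = np.array([1.2, 5.0, 0.3, 0.2])  # e.g. BM25 scores
fused = rrf_fuse(dense, sparse)
print(np.argsort(-fused))  # documents ranked by fused score
```

Because only ranks enter the fusion, quantizing the dense leg can only hurt the hybrid through rank changes, which is consistent with the small deltas in the table above.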
TurboQuant versus standard quantizers on BeIR-style retrieval.
This is the central comparison figure in the repo. It asks the question that matters in practice: why bother with TurboQuant instead of using the standard FAISS quantizers?
For SciFact with MiniLM:
| Method | Bytes / vector | NDCG@10 |
|---|---|---|
| FP32 | 1536 | 0.645 |
| TQ-4b | 196 | 0.650 |
| SQ4 | 193 | 0.649 |
| OPQ16 | 214 | 0.602 |
| PQ16 | 100 | 0.526 |
Two things stand out:
- in the ultra-compact regime, TQ-2b is materially stronger than PQ16
- around the ~8x compression point, TQ-4b is effectively tied with SQ4
That second point matters. The advantage over scalar quantization at 4-bit is not dramatic. The honest result is narrower: TurboQuant clearly beats PQ / OPQ at low byte counts, and stays competitive with SQ at moderate compression without any corpus-specific training.
Retention relative to FP32 for TurboQuant and standard quantizers.
The same story is easier to read when normalized as retention relative to FP32.
| Method | Mean retention | Compression regime |
|---|---|---|
| TQ-2b | ~100% | ~15x |
| TQ-4b | ~102% | ~8x |
| SQ4 | ~100% | ~8x |
| PQ16 | ~85% | ~15x |
| OPQ16 | ~98% | ~7x |
Retention above 100% should not be read as a real quality gain. That is ordinary evaluation noise. The right reading is "essentially equal to FP32."
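Retention is just a method's NDCG divided by the FP32 NDCG, so values hover around 100% and small excursions above it are noise. Using the SciFact numbers from the comparison table above:

```python
def retention(method_ndcg: float, fp32_ndcg: float) -> float:
    # Retention relative to FP32, in percent; ~100% means "effectively lossless".
    return 100.0 * method_ndcg / fp32_ndcg

print(f"TQ-4b: {retention(0.650, 0.645):.1f}%")  # slightly above 100%: noise, not a gain
print(f"PQ16:  {retention(0.526, 0.645):.1f}%")
```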
For RAG, the strongest argument for TurboQuant is not that it is mathematically elegant. The strongest argument is this: it reduces embedding storage by roughly one order of magnitude, stays competitive with the quantization baselines people actually use, preserves hybrid retrieval quality in the current benchmark slate, and does all of this without fitting a corpus-specific quantizer.
That is a meaningful niche. It is especially relevant for:
- local RAG
- per-tenant indexes
- fast-moving corpora
- experiments where you want a compression layer without turning the pipeline into a codebook-training project
What this repository does not justify is a broader claim such as:
TurboQuant is already a better production vector store than the usual ANN stack.
That is a different claim, and this repository is not trying to make it.
Everything discussed here is tied to local artifacts and scripts in this repo:
- benchmarks/FIGURES_ANALYSIS.md
- benchmarks/status_quo_quantization_bench.py
- benchmarks/rag_benchmarks.py
- benchmarks/beir_relevance_bench.py
- benchmarks/results/PUBLICATION_FIGURES.md
- benchmarks/results/STATUS_QUO_COST_TABLE.md
- docs/ARXIV_EXPERIMENTS.md
Additional artifacts worth reading:
- benchmarks/results/05_text_embedding_models.svg for qrel-based dense retrieval across modern embedding models
- benchmarks/results/11_beir_relevance_server.png for the direct-search vs reconstruction-based deployment split
For local setup, workflow, and commit conventions, see CONTRIBUTING.md.
Common commands:
make lint
make test
make build
make verify
make smoke

References:
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
- PolarQuant
- TurboQuant: Redefining AI efficiency with extreme compression
Apache-2.0
