Training-free embedding compression for local retrieval and RAG benchmark reproduction.
The cleanest single-chart summary in the repo: TurboQuant stays near FP32 retrieval quality while compressing much harder than PQ and OPQ, without fitting a corpus-specific quantizer.
turboquant-embed is a reference CPU/NumPy implementation of TurboQuant for dense embeddings. The goal of this repository is not to pretend TurboQuant is already a production vector database. The goal is narrower and more useful: compress embeddings aggressively, search them locally, and measure honestly what happens to retrieval quality against FP32, scalar quantization, product quantization, and hybrid retrieval pipelines.
At a glance:
- training-free compression; no corpus-specific codebook fitting
- packed storage reductions around 8x at 4-bit and 16x at 2-bit for 1536d embeddings
- direct compressed inner-product search plus save()/load() helpers
- BeIR-facing benchmark scripts and figure artifacts in benchmarks/results/
- reference CPU/NumPy code, not a production ANN index
The practical default in this repo is TQ-4bit: about 7.8x packed compression at d=1536, near-lossless NDCG retention on the current BeIR runs, and no learned codebooks to fit before you can use it.
Install from source:
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev]"

Optional extras:
pip install -e ".[bench]"
pip install -e ".[faiss]"
pip install -e ".[qdrant]"
pip install -e ".[chroma]"
pip install -e ".[all]"

Minimal compression and search flow:
from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings
compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)
scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)
compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

What this is, and what it is not:
- a local reference implementation of TurboQuant for dense embeddings
- a place to reproduce and inspect the retrieval figures shipped with the project
- useful for local RAG, per-tenant indexes, fast-moving corpora, and compression experiments
- not a production ANN index
- not a distributed vector store
- not a claim that TurboQuant already beats the usual HNSW stack end-to-end
A corpus of N vectors in dimension d, stored in float32, costs 4Nd bytes. For 100K documents embedded with a 1536-dimensional model, that is roughly 614 MB before metadata, replication, or vector-store overhead. That is manageable on a server. It is much less comfortable on a laptop, for per-tenant indexes, or when the corpus changes too quickly for a trained quantizer to be worth the operational cost.
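The 4Nd figure is easy to check. A quick sketch of the arithmetic, using the 100K-document, 1536-dimensional example from above:

```python
def fp32_corpus_bytes(n_vectors: int, dim: int) -> int:
    # FP32 storage for N vectors of dimension d: 4 bytes per coordinate.
    return 4 * n_vectors * dim

mb = fp32_corpus_bytes(100_000, 1536) / 1e6
print(f"{mb:.1f} MB")  # ~614 MB before metadata, replication, or store overhead
```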
Google Research introduced TurboQuant as a data-oblivious, online quantization scheme for high-dimensional vectors. The core idea is simple:
- apply a random orthogonal rotation
- quantize each rotated coordinate with a universal Lloyd-Max codebook
- optionally add a 1-bit QJL residual correction for inner-product estimation
The important word is data-oblivious. There is no corpus-specific k-means fit before compression. A new corpus can be compressed immediately, with deterministic behavior once the seed is fixed.
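The pipeline can be illustrated with a toy NumPy sketch. Note the assumptions: `compress_sketch`/`decompress_sketch` are illustrative names, and a uniform grid over a fixed range stands in for the Lloyd-Max codebook used by the real implementation; the point is that nothing here is fitted to the corpus, and fixing the seed makes it deterministic.

```python
import numpy as np

def compress_sketch(x, bits=4, seed=0):
    # Toy data-oblivious pipeline: random orthogonal rotation, then
    # per-coordinate scalar quantization with a fixed (uniform) grid.
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # seeded random rotation
    rotated = x @ q
    # For unit-norm inputs, rotated coordinates concentrate around N(0, 1/d),
    # so a fixed +-4/sqrt(d) range needs no corpus statistics.
    half = 4.0 / np.sqrt(d)
    levels = 2 ** bits
    step = 2 * half / levels
    codes = np.clip(np.floor((rotated + half) / step), 0, levels - 1)
    return codes.astype(np.uint8), (q, half, step)

def decompress_sketch(codes, params):
    q, half, step = params
    approx = (codes + 0.5) * step - half  # cell midpoints
    return approx @ q.T                   # undo the rotation

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 256))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm embeddings
codes, params = compress_sketch(x, bits=4)
err = np.linalg.norm(decompress_sketch(codes, params) - x) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.2f}")
```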
For the full theory, see:
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- TurboQuant: Redefining AI efficiency with extreme compression
At the API level, the implementation stays intentionally small:
from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings
compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)
scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)
compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

Under the hood, a few distinctions matter:
- serialized_nbytes is the packed on-disk or transferable size
- resident_nbytes is the actual in-memory NumPy footprint once search arrays are materialized
- rotation_backend="qr" is the research-faithful dense baseline
- rotation_backend="hadamard" is the engineering-oriented structured backend for larger dimensions
- rotation_backend="auto" selects conservatively between them
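The serialized-versus-resident distinction is easy to picture with 4-bit codes: two codes pack into one byte on disk, but search may operate on a wider unpacked array in memory. A self-contained sketch (an illustrative layout, not the repo's actual packing format):

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    # Two 4-bit codes per byte, low nibble first (illustrative layout only).
    flat = codes.reshape(-1)
    if flat.size % 2:
        flat = np.append(flat, 0)  # pad to an even count
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray, n: int) -> np.ndarray:
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)[:n]

codes = np.random.default_rng(0).integers(0, 16, size=(1000, 1536), dtype=np.uint8)
packed = pack_4bit(codes)
print("serialized:", packed.nbytes, "bytes")  # ~0.5 byte per coordinate
print("resident:  ", codes.nbytes, "bytes")   # 1 byte per coordinate once unpacked
```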
The repo also separates:
- pure MSE-style compression
- optional QJL residual correction
- direct in-process compressed search
- reconstruction-based compatibility paths for vector stores such as Chroma and Qdrant
That separation matters because "compressed storage", "resident search memory", and "quality after reconstruction into a FP32 vector store" are not the same deployment story.
The benchmark directory mixes two kinds of evidence:
- primary comparative evidence for RAG: relevance-labeled plots that compare TurboQuant with meaningful baselines on BeIR
- synthetic diagnostics: useful to understand the quantizer, but not the main proof for deployment claims
If you want the paper/blog-oriented guide to what each figure actually proves, start with:
- benchmarks/FIGURES_ANALYSIS.md
- benchmarks/results/PUBLICATION_FIGURES.md
- benchmarks/results/STATUS_QUO_COST_TABLE.md
Before asking whether TurboQuant retrieves well, it is worth establishing that the compression itself is real, stable, and operationally interesting.
FP32 versus packed TurboQuant storage as corpus size grows.
The first figure plots serialized storage for FP32, TQ-2bit, TQ-3bit, and TQ-4bit as the corpus grows from 10K to 100K vectors at d=1536.
| Method | Packed size | Ratio vs FP32 |
|---|---|---|
| FP32 | 614 MB | 1.0x |
| TQ-2bit | 38.8 MB | 15.8x |
| TQ-3bit | 58.0 MB | 10.6x |
| TQ-4bit | 77.2 MB | 8.0x |
This is a storage figure, not a retrieval figure. What it establishes is straightforward and important: packed bytes scale linearly with corpus size, and the compression ratio does not erode as the corpus grows.
Retrieval behavior when memory budget, not corpus size, is the binding constraint.
If memory is the binding constraint, compression is not just a storage optimization; it can be a retrieval enabler. On synthetic data, fitting more vectors into the same memory budget can matter more than keeping every vector in FP32.
| Budget | FP32 | TQ-2bit | TQ-4bit |
|---|---|---|---|
| 10 MB | 0.03 | 0.34 | 0.27 |
| 20 MB | 0.07 | 0.50 | 0.53 |
| 40 MB | 0.13 | 0.50 | 0.84 |
| 160 MB | 0.54 | 0.50 | 0.84 |
This is still a synthetic budget-intuition plot, not a public retrieval-effectiveness claim. The reliable takeaway is the shape of the trade-off under tight memory budgets.
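The budget intuition reduces to counting vectors: at d=1536, FP32 needs 6144 bytes per vector while 4-bit packed needs about 768 (ignoring per-index overhead). A quick sketch of that arithmetic:

```python
def vectors_per_budget(budget_mb: float, dim: int, bits: int) -> int:
    # How many d-dimensional vectors fit in a memory budget at a given bitrate,
    # ignoring metadata and index overhead.
    bytes_per_vector = dim * bits / 8
    return int(budget_mb * 1e6 // bytes_per_vector)

for bits, label in [(32, "FP32"), (4, "TQ-4bit"), (2, "TQ-2bit")]:
    print(f"{label}: {vectors_per_budget(10, 1536, bits):,} vectors in 10 MB")
```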
Encoding cost of training-free TurboQuant versus PQ-style pipelines.
The training-free story is not just conceptual. Relative to PQ-style pipelines that need to fit codebooks on the corpus, TurboQuant can be much lighter to initialize and encode with.
| Number of vectors | TQ | PQ | Speedup |
|---|---|---|---|
| 1K | 0.51 s | 70.8 s | 138x |
| 10K | 7.89 s | 2109 s | 267x |
| 50K | 8.58 s | 757.8 s | 88x |
This is the weakest operational figure in the slate. It supports the "no training pass" story, but the PQ timings are noisy enough that they should not be the lead public claim.
With the storage motivation established, the next step is to understand the quantizer's behavior before trusting it on real retrieval data.
Recall versus compression ratio for the MSE and product-style TurboQuant variants.
As bitrate drops, the MSE variant degrades in a predictable way. The product/QJL variant buys more aggressive compression, but pays for it sharply at low bitrates.
| Bits | MSE ratio | MSE recall | Product ratio | Product recall |
|---|---|---|---|---|
| 8 | 4.0x | 0.984 | 4.4x | 0.923 |
| 6 | 5.3x | 0.950 | 6.1x | 0.769 |
| 4 | 8.0x | 0.851 | 10.0x | 0.373 |
| 2 | 15.8x | 0.520 | 26.5x | 0.046 |
The most important negative result in the repo is visible here: the product/QJL path is not magic at 2-bit. It collapses hard in the ultra-compressed regime.
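The recall numbers above follow the usual definition: the fraction of the exact FP32 top-k that survives in the approximate (compressed) top-k. A minimal sketch of that measurement on synthetic scores:

```python
import numpy as np

def recall_at_k(exact_scores, approx_scores, k=10):
    # Fraction of the exact top-k retained by the approximate top-k,
    # averaged across queries.
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(e) & set(a)) / k for e, a in zip(exact_top, approx_top)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
exact = rng.standard_normal((50, 1000))                   # 50 queries x 1000 docs
approx = exact + 0.1 * rng.standard_normal(exact.shape)   # noisy compressed scores
print(f"recall@10: {recall_at_k(exact, approx):.2f}")
```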
Accuracy at 4-bit compression across embedding dimensions from 128d to 3072d.
One practical concern for any embedding quantizer is whether it behaves erratically across dimensions. On the current synthetic setup, TQ-4bit remains unusually stable from 128d to 3072d.
| Dimension | TQ-4bit | SQ4 | PQ |
|---|---|---|---|
| 128 | 0.856 | 0.818 | 0.193 |
| 384 | 0.863 | 0.812 | 0.190 |
| 1536 | 0.869 | 0.812 | 0.199 |
| 3072 | 0.860 | 0.799 | 0.207 |
This supports the intuition that the random-rotation story is dimension-stable, but it remains a synthetic diagnostic, not a deployment benchmark.
Whether the exact top-1 answer remains inside the approximate top-k set.
Another angle on quantization quality is whether the true top-1 result survives somewhere inside the approximate top-k set.
| k | TQ | PQ |
|---|---|---|
| 1 | 0.465 | 0.505 |
| 4 | 0.785 | 0.770 |
| 16 | 0.950 | 0.955 |
| 64 | 1.000 | 1.000 |
This metric is forgiving, but it does show that TurboQuant does not catastrophically lose the best answer.
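This "does the true best answer survive" metric is simpler than recall@k and can be sketched as:

```python
import numpy as np

def top1_survival(exact_scores, approx_scores, k):
    # Fraction of queries whose exact top-1 document appears
    # somewhere in the approximate top-k.
    true_best = np.argmax(exact_scores, axis=1)
    approx_topk = np.argsort(-approx_scores, axis=1)[:, :k]
    return float(np.mean([b in row for b, row in zip(true_best, approx_topk)]))

rng = np.random.default_rng(0)
exact = rng.standard_normal((100, 500))
approx = exact + 0.3 * rng.standard_normal(exact.shape)
for k in (1, 4, 16, 64):
    print(f"k={k}: {top1_survival(exact, approx, k):.2f}")
```

Because the approximate top-k sets are nested as k grows, the metric is monotone in k, which is why the table saturates at 1.000 by k=64.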
Scaling of fidelity and packed storage as the corpus grows from 1K to 100K vectors.
The packed-storage ratio stays fixed as the corpus grows; that part of the story is robust.
| Corpus | FP32 | TQ-4bit | Ratio | Recall@10 |
|---|---|---|---|---|
| 1K | 2 MB | 0.2 MB | 7.8x | 0.71 |
| 10K | 15 MB | 2.0 MB | 7.8x | 0.38 |
| 100K | 154 MB | 19.6 MB | 7.8x | 0.27 |
The recall values here are pessimistic synthetic diagnostics. The reliable takeaway is the linear packed-storage scaling, not the absolute recall numbers.
Everything so far helps build intuition. The figures below are the ones that matter most for RAG: real embedding models, labeled retrieval benchmarks, and baseline families that people actually use.
Hybrid dense+sparse retrieval under TurboQuant compression.
Most RAG systems do not rely on dense retrieval alone. They combine a sparse leg with a dense leg. The deployment question is simple: if you compress the dense leg, do you lose the hybrid lift?
| Comparison | SciFact ΔNDCG@10 | NFCorpus ΔNDCG@10 |
|---|---|---|
| Dense TQ - FP32 | +0.005 [-0.001, +0.010] | -0.003 [-0.007, -0.001] |
| Hybrid TQ - FP32 | -0.001 [-0.009, +0.006] | -0.001 [-0.004, +0.003] |
The honest claim here is not that TurboQuant improves hybrid retrieval. The honest claim is narrower and still useful: in this setup, TQ-4bit does not materially damage the hybrid pipeline.
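One common way to fuse a sparse leg (e.g. BM25) with a dense leg is reciprocal rank fusion. The sketch below is a generic illustration of that idea, not necessarily the fusion used by the repo's benchmark scripts:

```python
import numpy as np

def rrf_fuse(dense_scores, sparse_scores, k=60):
    # Reciprocal rank fusion: sum 1/(k + rank) across the two legs.
    # Rank-based fusion is insensitive to score scales, which is why
    # it is a common default for dense+sparse hybrids.
    def ranks(scores):
        order = np.argsort(-scores)
        r = np.empty_like(order)
        r[order] = np.arange(1, len(scores) + 1)
        return r
    return 1.0 / (k + ranks(dense_scores)) + 1.0 / (k + ranks(sparse_scores))

dense = np.array([0.9, 0.2, 0.8, 0.1])   # e.g. compressed inner products
sparse = np.array([1.2, 5.0, 0.3, 0.2])  # e.g. BM25 scores
fused = rrf_fuse(dense, sparse)
print(np.argsort(-fused))  # documents ranked by fused score
```

Because only ranks enter the fusion, quantizing the dense leg can only hurt the hybrid through rank changes, which is consistent with the small deltas in the table above.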
TurboQuant versus standard quantizers on BeIR-style retrieval.
This is the central comparison figure in the repo. It asks the question that matters in practice: why bother with TurboQuant instead of using the standard FAISS quantizers?
For SciFact with MiniLM:
| Method | Bytes / vector | NDCG@10 |
|---|---|---|
| FP32 | 1536 | 0.645 |
| TQ-4b | 196 | 0.650 |
| SQ4 | 193 | 0.649 |
| OPQ16 | 214 | 0.602 |
| PQ16 | 100 | 0.526 |
Two things stand out:
- in the ultra-compact regime, TQ-2b is materially stronger than PQ16
- around the ~8x compression point, TQ-4b is effectively tied with SQ4
That second point matters. The advantage over scalar quantization at 4-bit is not dramatic. The honest result is narrower: TurboQuant clearly beats PQ / OPQ at low byte counts, and stays competitive with SQ at moderate compression without any corpus-specific training.
Retention relative to FP32 for TurboQuant and standard quantizers.
The same story is easier to read when normalized as retention relative to FP32.
| Method | Mean retention | Compression regime |
|---|---|---|
| TQ-2b | ~100% | ~15x |
| TQ-4b | ~102% | ~8x |
| SQ4 | ~100% | ~8x |
| PQ16 | ~85% | ~15x |
| OPQ16 | ~98% | ~7x |
Retention above 100% should not be read as a real quality gain. That is ordinary evaluation noise. The right reading is "essentially equal to FP32."
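Retention is just a method's NDCG divided by the FP32 NDCG, so values hover around 100% and small excursions above it are noise. Using the SciFact numbers from the comparison table above:

```python
def retention(method_ndcg: float, fp32_ndcg: float) -> float:
    # Retention relative to FP32, in percent; ~100% means "effectively lossless".
    return 100.0 * method_ndcg / fp32_ndcg

print(f"TQ-4b: {retention(0.650, 0.645):.1f}%")  # slightly above 100%: noise, not a gain
print(f"PQ16:  {retention(0.526, 0.645):.1f}%")
```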
For RAG, the strongest argument for TurboQuant is not that it is mathematically elegant. The strongest argument is this: it reduces embedding storage by roughly one order of magnitude, stays competitive with the quantization baselines people actually use, preserves hybrid retrieval quality in the current benchmark slate, and does all of this without fitting a corpus-specific quantizer.
That is a meaningful niche. It is especially relevant for:
- local RAG
- per-tenant indexes
- fast-moving corpora
- experiments where you want a compression layer without turning the pipeline into a codebook-training project
What this repository does not justify is a broader claim such as:
TurboQuant is already a better production vector store than the usual ANN stack.
That is a different claim, and this repository is not trying to make it.
Everything discussed here is tied to local artifacts and scripts in this repo:
- benchmarks/FIGURES_ANALYSIS.md
- benchmarks/status_quo_quantization_bench.py
- benchmarks/rag_benchmarks.py
- benchmarks/beir_relevance_bench.py
- benchmarks/results/PUBLICATION_FIGURES.md
- benchmarks/results/STATUS_QUO_COST_TABLE.md
- docs/ARXIV_EXPERIMENTS.md
Additional artifacts worth reading:
- benchmarks/results/05_text_embedding_models.svg for qrel-based dense retrieval across modern embedding models
- benchmarks/results/11_beir_relevance_server.png for the direct-search vs reconstruction-based deployment split
For local setup, workflow, and commit conventions, see CONTRIBUTING.md.
Common commands:
make lint
make test
make build
make verify
make smoke

References:
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
- PolarQuant
- TurboQuant: Redefining AI efficiency with extreme compression
Apache-2.0
