
TurboQuant Embed

Training-free embedding compression for local retrieval and RAG benchmark reproduction.

The cleanest single-chart summary in the repo is retention relative to FP32 for TurboQuant and standard quantizers: TurboQuant stays near FP32 retrieval quality while compressing much harder than PQ and OPQ, without fitting a corpus-specific quantizer.

turboquant-embed is a reference CPU/NumPy implementation of TurboQuant for dense embeddings. The goal of this repository is not to pretend TurboQuant is already a production vector database. The goal is narrower and more useful: compress embeddings aggressively, search them locally, and measure honestly what happens to retrieval quality against FP32, scalar quantization, product quantization, and hybrid retrieval pipelines.

At a glance:

  • training-free compression; no corpus-specific codebook fitting
  • packed storage reductions around 8x at 4-bit and 16x at 2-bit for 1536d embeddings
  • direct compressed inner-product search plus save() / load() helpers
  • BeIR-facing benchmark scripts and figure artifacts in benchmarks/results/
  • reference CPU/NumPy code, not a production ANN index

The practical default in this repo is TQ-4bit: about 7.8x packed compression at d=1536, near-lossless NDCG retention on the current BeIR runs, and no learned codebooks to fit before you can use it.

Quick start

Install from source:

python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev]"

Optional extras:

pip install -e ".[bench]"
pip install -e ".[faiss]"
pip install -e ".[qdrant]"
pip install -e ".[chroma]"
pip install -e ".[all]"

Minimal compression and search flow:

import numpy as np

from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings

# Any (N, 1536) float32 embedding matrix works; random vectors stand in here.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 1536), dtype=np.float32)
query_vector = rng.standard_normal(1536, dtype=np.float32)

# Training-free: no codebook fitting before compress().
compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)

# Top-k search directly on the compressed representation.
scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)

compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

Useful examples:

What this repo is

  • a local reference implementation of TurboQuant for dense embeddings
  • a place to reproduce and inspect the retrieval figures shipped with the project
  • useful for local RAG, per-tenant indexes, fast-moving corpora, and compression experiments

What this repo is not

  • not a production ANN index
  • not a distributed vector store
  • not a claim that TurboQuant already beats the usual HNSW stack end-to-end

Why TurboQuant exists

A corpus of N vectors in dimension d, stored in float32, costs 4Nd bytes. For 100K documents embedded with a 1536-dimensional model, that is roughly 614 MB before metadata, replication, or vector-store overhead. That is manageable on a server. It is much less comfortable on a laptop, for per-tenant indexes, or when the corpus changes too quickly for a trained quantizer to be worth the operational cost.
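
A quick sanity check of that arithmetic in Python:

# FP32 embedding storage is 4 * N * d bytes before any index overhead.
n_docs, dim = 100_000, 1536
fp32_mb = 4 * n_docs * dim / 1e6
print(f"{fp32_mb:.0f} MB")  # ~614 MB for 100K documents at d=1536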

Google Research introduced TurboQuant as a data-oblivious, online quantization scheme for high-dimensional vectors. The core idea is simple:

  1. apply a random orthogonal rotation
  2. quantize each rotated coordinate with a universal Lloyd-Max codebook
  3. optionally add a 1-bit QJL residual correction for inner-product estimation

The important word is data-oblivious. There is no corpus-specific k-means fit before compression. A new corpus can be compressed immediately, with deterministic behavior once the seed is fixed.
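
A minimal NumPy sketch of the three steps, purely to fix ideas. This is not the repo's implementation: it substitutes a simple per-row uniform grid for the Lloyd-Max codebook and omits the QJL residual.

import numpy as np

rng = np.random.default_rng(0)
d = 1536

# 1. Random orthogonal rotation, fixed once the seed is fixed.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=4):
    # 2. Quantize each rotated coordinate against a fixed per-coordinate grid.
    #    TurboQuant uses Lloyd-Max codebooks; a uniform grid keeps this sketch short.
    z = x @ Q
    scale = np.abs(z).max(axis=1, keepdims=True) + 1e-12
    levels = 2 ** bits - 1
    codes = np.round((z / scale + 1) / 2 * levels).astype(np.uint8)
    return codes, scale

def dequantize(codes, scale, bits=4):
    levels = 2 ** bits - 1
    z_hat = (codes.astype(np.float32) / levels * 2 - 1) * scale
    # 3. An optional 1-bit QJL residual correction would refine inner-product
    #    estimates here; it is omitted from this sketch.
    return z_hat @ Q.T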

For the full theory, see:

What turboquant-embed implements

At the API level, the implementation stays intentionally small:

from turboquant_embed import TurboQuantEmbedCompressor, CompressedEmbeddings

compressor = TurboQuantEmbedCompressor(dim=1536, bits=4)
compressed = compressor.compress(embeddings)

scores, top_idx = compressor.topk_inner_product(compressed, query_vector, k=10)

compressed.save("index.npz")
loaded = CompressedEmbeddings.load("index.npz")

Under the hood, a few distinctions matter:

  • serialized_nbytes is the packed on-disk or transferable size
  • resident_nbytes is the actual in-memory NumPy footprint once search arrays are materialized
  • rotation_backend="qr" is the research-faithful dense baseline
  • rotation_backend="hadamard" is the engineering-oriented structured backend for larger dimensions
  • rotation_backend="auto" selects conservatively between them
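
A hedged usage sketch of those knobs, reusing the quick-start imports and embeddings. It assumes rotation_backend is a constructor keyword and that the two size attributes live on the compressed object; adjust if the actual API places them elsewhere.

compressor = TurboQuantEmbedCompressor(dim=1536, bits=4, rotation_backend="auto")
compressed = compressor.compress(embeddings)

print(compressed.serialized_nbytes)  # packed on-disk / transferable size
print(compressed.resident_nbytes)    # in-memory footprint once search arrays exist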

The repo also separates:

  • pure MSE-style compression
  • optional QJL residual correction
  • direct in-process compressed search
  • reconstruction-based compatibility paths for vector stores such as Chroma and Qdrant

That separation matters because "compressed storage", "resident search memory", and "quality after reconstruction into a FP32 vector store" are not the same deployment story.
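
The reconstruction-based path can be pictured as the sketch below. The decompress() call is hypothetical shorthand for whatever reconstruction helper the repo actually exposes; the point is only that the vector store receives FP32 vectors again, so the savings apply to the packed form rather than to the store's own copy.

# Hypothetical sketch of the reconstruction-based compatibility path.
approx_fp32 = compressor.decompress(compressed)  # hypothetical helper name
# approx_fp32 is an (N, d) float32 array again and can be upserted into Chroma or
# Qdrant like any other embeddings; retrieval quality then depends on how well the
# reconstruction preserves inner products.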

How to read the figure slate

The benchmark directory mixes two kinds of evidence:

  • primary comparative evidence for RAG: relevance-labeled plots that compare TurboQuant with meaningful baselines on BeIR
  • synthetic diagnostics: useful to understand the quantizer, but not the main proof for deployment claims

If you want the paper/blog-oriented guide to what each figure actually proves, start with:

Storage and operational motivation

Before asking whether TurboQuant retrieves well, it is worth establishing that the compression itself is real, stable, and operationally interesting.

Packed memory scaling

FP32 versus packed TurboQuant storage as corpus size grows.

The first figure plots serialized storage for FP32, TQ-2bit, TQ-3bit, and TQ-4bit as the corpus grows from 10K to 100K vectors at d=1536.

Method    Packed size  Ratio vs FP32
FP32      614 MB       1.0x
TQ-2bit   38.8 MB      15.8x
TQ-3bit   58.0 MB      10.6x
TQ-4bit   77.2 MB      8.0x

This is a storage figure, not a retrieval figure. What it establishes is straightforward and important: packed bytes scale linearly with corpus size, and the compression ratio does not erode as the corpus grows.
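
Those packed sizes track the obvious bits-per-coordinate arithmetic closely; the small remaining gap is per-vector overhead.

# Packed payload is roughly N * d * bits / 8 bytes; the reported sizes sit just above.
n, d = 100_000, 1536
for bits, reported_mb in [(2, 38.8), (3, 58.0), (4, 77.2)]:
    raw_mb = n * d * bits / 8 / 1e6
    print(f"TQ-{bits}bit: {raw_mb:.1f} MB raw vs {reported_mb} MB reported")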

Equal-memory retrieval

Retrieval quality when the memory budget, not the corpus size, is the binding constraint.

If memory is the binding constraint, compression is not just a storage optimization; it can be a retrieval enabler. On synthetic data, fitting more vectors into the same memory budget can matter more than keeping every vector in FP32.

Budget   FP32  TQ-2bit  TQ-4bit
10 MB    0.03  0.34     0.27
20 MB    0.07  0.50     0.53
40 MB    0.13  0.50     0.84
160 MB   0.54  0.50     0.84

This is still a synthetic budget-intuition plot, not a public retrieval-effectiveness claim. The reliable takeaway is the shape of the trade-off under tight memory budgets.
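
The underlying arithmetic is simple. Assuming d=1536 and the ~8x packed ratio reported above, each budget holds roughly eight times more TQ-4bit vectors than FP32 vectors:

# Vectors that fit in a fixed budget at d=1536, FP32 versus ~8x-compressed TQ-4bit.
d = 1536
for budget_mb in (10, 20, 40, 160):
    fp32_vecs = int(budget_mb * 1e6 / (4 * d))
    tq4_vecs = int(budget_mb * 1e6 / (4 * d / 8))  # assumes the ~8x packed ratio
    print(f"{budget_mb} MB: {fp32_vecs:,} FP32 vectors vs ~{tq4_vecs:,} TQ-4bit vectors")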

Encoding speed

Encoding cost of training-free TurboQuant versus PQ-style pipelines.

The training-free story is not just conceptual. Relative to PQ-style pipelines that need to fit codebooks on the corpus, TurboQuant can be much lighter to initialize and encode with.

Number of vectors  TQ      PQ       Speedup
1K                 0.51 s  70.8 s   138x
10K                7.89 s  2109 s   267x
50K                8.58 s  757.8 s  88x

This is the weakest operational figure in the slate. It supports the "no training pass" story, but the PQ timings are noisy enough that they should not be the lead public claim.
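
If you want to sanity-check this locally, a minimal harness along these lines works with the ".[faiss]" extra installed. The PQ configuration here is illustrative, not necessarily the one used for the table.

import time

import faiss
import numpy as np

from turboquant_embed import TurboQuantEmbedCompressor

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 1536)).astype(np.float32)

t0 = time.perf_counter()
TurboQuantEmbedCompressor(dim=1536, bits=4).compress(x)  # no training pass
tq_s = time.perf_counter() - t0

t0 = time.perf_counter()
pq = faiss.IndexPQ(1536, 96, 8)  # 96 subquantizers x 8 bits; illustrative config
pq.train(x)                      # codebook fitting dominates the cost
pq.add(x)
pq_s = time.perf_counter() - t0

print(f"TQ: {tq_s:.2f}s  PQ: {pq_s:.2f}s")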

What the quantizer itself is doing

With the storage motivation established, the next step is to understand the quantizer's behavior before trusting it on real retrieval data.

Fidelity vs compression

Recall versus compression ratio for the MSE and product-style TurboQuant variants.

As bitrate drops, the MSE variant degrades in a predictable way. The product/QJL variant buys more aggressive compression, but pays for it sharply at low bitrates.

Bits  MSE ratio  MSE recall  Product ratio  Product recall
8     4.0x       0.984       4.4x           0.923
6     5.3x       0.950       6.1x           0.769
4     8.0x       0.851       10.0x          0.373
2     15.8x      0.520       26.5x          0.046

The most important negative result in the repo is visible here: the product/QJL path is not magic at 2-bit. It collapses hard in the ultra-compressed regime.
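
To pin down what a recall number like this means: the usual overlap form measures how much of the exact FP32 top-k survives in the compressed top-k. A minimal version, reusing the quick-start objects (the benchmark scripts may define it slightly differently):

import numpy as np

def overlap_recall_at_k(exact_scores, approx_topk_idx, k=10):
    # Fraction of the exact FP32 top-k that also appears in the approximate top-k.
    exact_topk = set(np.argsort(-exact_scores)[:k].tolist())
    return len(exact_topk & set(np.asarray(approx_topk_idx)[:k].tolist())) / k

# exact_scores = embeddings @ query_vector          # FP32 inner products
# _, approx_idx = compressor.topk_inner_product(compressed, query_vector, k=10)
# overlap_recall_at_k(exact_scores, approx_idx, k=10)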

Dimension robustness

Accuracy at 4-bit compression across embedding dimensions from 128d to 3072d.

One practical concern for any embedding quantizer is whether it behaves erratically across dimensions. On the current synthetic setup, TQ-4bit remains unusually stable from 128d to 3072d.

Dimension  TQ-4bit  SQ4    PQ
128        0.856    0.818  0.193
384        0.863    0.812  0.190
1536       0.869    0.812  0.199
3072       0.860    0.799  0.207

This supports the intuition that the random-rotation story is dimension-stable, but it remains a synthetic diagnostic, not a deployment benchmark.

Top-1 recovery

Recall@1@k: whether the exact top-1 answer remains inside the approximate top-k set.

Another angle on quantization quality is whether the true top-1 result survives somewhere inside the approximate top-k set.

k   TQ     PQ
1   0.465  0.505
4   0.785  0.770
16  0.950  0.955
64  1.000  1.000

This metric is forgiving, but it does show that TurboQuant does not catastrophically lose the best answer.
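
Concretely, recall@1@k asks whether the single exact best match appears anywhere in the approximate top-k. A minimal version, in the same style as the overlap metric above; averaged over queries it yields per-k values like those in the table.

import numpy as np

def recall_1_at_k(exact_scores, approx_topk_idx, k):
    # 1.0 if the exact top-1 neighbour is contained in the approximate top-k, else 0.0.
    true_best = int(np.argmax(exact_scores))
    return float(true_best in set(np.asarray(approx_topk_idx)[:k].tolist()))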

Scaling with corpus size

Scaling of fidelity and packed storage as the corpus grows from 1K to 100K vectors.

The packed-storage ratio stays fixed as the corpus grows; that part of the story is robust.

Corpus  FP32    TQ-4bit  Ratio  Recall@10
1K      2 MB    0.2 MB   7.8x   0.71
10K     15 MB   2.0 MB   7.8x   0.38
100K    154 MB  19.6 MB  7.8x   0.27

The recall values here are pessimistic synthetic diagnostics. The reliable takeaway is the linear packed-storage scaling, not the absolute recall numbers.

Comparative evidence on BeIR

Everything so far helps build intuition. The figures below are the ones that matter most for RAG: real embedding models, labeled retrieval benchmarks, and baseline families that people actually use.

Hybrid retrieval under compression

Hybrid dense+sparse retrieval under TurboQuant compression.

Most RAG systems do not rely on dense retrieval alone. They combine a sparse leg with a dense leg. The deployment question is simple: if you compress the dense leg, do you lose the hybrid lift?

Comparison        SciFact ΔNDCG@10         NFCorpus ΔNDCG@10
Dense TQ - FP32   +0.005 [-0.001, +0.010]  -0.003 [-0.007, -0.001]
Hybrid TQ - FP32  -0.001 [-0.009, +0.006]  -0.001 [-0.004, +0.003]

The honest claim here is not that TurboQuant improves hybrid retrieval. The honest claim is narrower and still useful: in this setup, TQ-4bit does not materially damage the hybrid pipeline.
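
The exact fusion rule is defined by the benchmark scripts listed under Reproducibility below. The sketch here uses reciprocal rank fusion, a common choice, purely to make the structure of the question concrete: quantization touches only the dense ranking, and the sparse leg is unaffected.

def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    # Reciprocal rank fusion over two ranked lists of document ids. Only rank
    # positions matter, so quantizing the dense scores changes the fused result
    # only where it actually reorders the dense leg.
    fused = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

# dense_ranking: doc ids from topk_inner_product on the compressed dense index
# sparse_ranking: doc ids from a BM25-style sparse retriever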

Status quo against standard quantizers

TurboQuant versus standard quantizers on BeIR-style retrieval.

This is the central comparison figure in the repo. It asks the question that matters in practice: why bother with TurboQuant instead of using the standard FAISS quantizers?

For SciFact with MiniLM:

Method  Bytes / vector  NDCG@10
FP32    1536            0.645
TQ-4b   196             0.650
SQ4     193             0.649
OPQ16   214             0.602
PQ16    100             0.526

Two things stand out:

  • in the ultra-compact regime, TQ-2b is materially stronger than PQ16
  • around the ~8x compression point, TQ-4b is effectively tied with SQ4

That second point matters. The advantage over scalar quantization at 4-bit is not dramatic. The honest result is narrower: TurboQuant clearly beats PQ / OPQ at low byte counts, and stays competitive with SQ at moderate compression without any corpus-specific training.
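
For orientation, the baselines in this table are the standard FAISS quantizer families. The sketch below shows how such baselines are typically constructed; the actual configurations and byte counts used for these numbers come from the benchmark scripts under Reproducibility and may differ in detail, and corpus_embeddings is a stand-in for the MiniLM-encoded corpus.

import faiss  # requires the ".[faiss]" extra
import numpy as np

d = 384  # MiniLM embedding dimension, matching the SciFact rows above
corpus_embeddings = np.random.randn(10_000, d).astype(np.float32)  # stand-in corpus

sq4 = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_4bit)  # SQ4-style baseline
pq16 = faiss.index_factory(d, "PQ16")                               # product quantization
opq16 = faiss.index_factory(d, "OPQ16,PQ16")                        # learned rotation + PQ

for index in (sq4, pq16, opq16):
    index.train(corpus_embeddings)  # PQ/OPQ need a corpus-specific training pass
    index.add(corpus_embeddings)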

Retention relative to FP32

Retention relative to FP32 for TurboQuant and standard quantizers.

The same story is easier to read when normalized as retention relative to FP32.

Method  Mean retention  Compression regime
TQ-2b   ~100%           ~15x
TQ-4b   ~102%           ~8x
SQ4     ~100%           ~8x
PQ16    ~85%            ~15x
OPQ16   ~98%            ~7x

Retention above 100% should not be read as a real quality gain. That is ordinary evaluation noise. The right reading is "essentially equal to FP32."

What this means for RAG

For RAG, the strongest argument for TurboQuant is not that it is mathematically elegant. The strongest argument is this: it reduces embedding storage by roughly one order of magnitude, stays competitive with the quantization baselines people actually use, preserves hybrid retrieval quality in the current benchmark slate, and does all of this without fitting a corpus-specific quantizer.

That is a meaningful niche. It is especially relevant for:

  • local RAG
  • per-tenant indexes
  • fast-moving corpora
  • experiments where you want a compression layer without turning the pipeline into a codebook-training project

What this repository does not justify is a broader claim such as:

TurboQuant is already a better production vector store than the usual ANN stack.

That is a different claim, and this repository is not trying to make it.

Reproducibility

Everything discussed here is tied to local artifacts and scripts in this repo:

  1. benchmarks/FIGURES_ANALYSIS.md
  2. benchmarks/status_quo_quantization_bench.py
  3. benchmarks/rag_benchmarks.py
  4. benchmarks/beir_relevance_bench.py
  5. benchmarks/results/PUBLICATION_FIGURES.md
  6. benchmarks/results/STATUS_QUO_COST_TABLE.md
  7. docs/ARXIV_EXPERIMENTS.md

Additional artifacts worth reading:

Development

For local setup, workflow, and commit conventions, see CONTRIBUTING.md.

Common commands:

make lint
make test
make build
make verify
make smoke

References

License

Apache-2.0
