
add indexing guide for memory and time saving#158

Merged
cutecutecat merged 3 commits into tensorchord:main from cutecutecat:guide-hierarchical
Dec 31, 2025

Conversation

@cutecutecat (Member)

When to use:

- `kmeans_dimension`
- hierarchical clustering

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
@cutecutecat requested a review from Copilot on December 31, 2025 at 07:50
@vercel bot commented Dec 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| pgvecto-rs-docs | Ready | Preview, Comment | Dec 31, 2025 8:01am |


Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Copilot AI (Contributor) left a comment:

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@cutecutecat merged commit 344d8c0 into tensorchord:main on Dec 31, 2025
10 checks passed
## Tuning: Balancing query throughput and accuracy

Old: When there are less than $100,000$ rows in the table, you usually don't need to set parameters for search and query.
New: When there are less than $100,000$ rows in the table, you usually don't need to set the index options.
Collaborator:

Why?

Member Author:

The phrase "parameters for search and query" is not clear. It may mean query options like `vchordrq.probes` or index options like `build.internal.lists`. The former is still important if the user needs a recall target like >87%.
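For concreteness, the two kinds of options being contrasted here can be sketched like this (a sketch assuming the `items` table and `embedding` column used in the guide's other examples; the `lists` value is illustrative):

```sql
-- Index (build-time) option: fixed when the index is created.
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.internal]
lists = [4096]
$$);

-- Query (run-time) option: tunable per session to hit a recall target.
SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
```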

Collaborator:

Since it's important, why remove it?

| $N \in [5 \times 10^7, \infty)$ | $L \in [8 \sqrt{N}, 16\sqrt{N}]$ | `[80000]` |
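As a numeric sketch of the sizing rule in this table row (assuming $L$ here denotes the `build.internal.lists` value and $N$ the row count; the midpoint-and-round heuristic below is an illustration, not from the guide):

```python
import math

def suggest_lists(n_rows: int) -> int:
    """Pick a lists value inside the recommended [8*sqrt(N), 16*sqrt(N)] range.

    Hypothetical helper: takes the midpoint of the range and rounds to the
    nearest thousand for readability.
    """
    lo = 8 * math.sqrt(n_rows)
    hi = 16 * math.sqrt(n_rows)
    return round((lo + hi) / 2 / 1000) * 1000

# For N = 5 * 10^7 the recommended range is roughly [56569, 113137],
# so the table's example value of 80000 falls inside it.
print(suggest_lists(50_000_000))  # → 85000
```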

Old: The process of building an index involves two steps: partitioning the vector space first, and then inserting rows into the index. The first step, partitioning the vector space, can be sped up using multiple threads.
New: The process of building an index involves two steps: clustering the vectors first, and then inserting vectors into the index. The first step, clustering the vectors, can be sped up using multiple threads.
Collaborator:

This kind of description is inappropriate. Partitioning is the step, whereas clustering is merely an implementation detail.

Member Author:

Reverted in #159

Comment on lines -69 to -71

SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
Collaborator:

Why remove it?

Member Author:

It is moved to the chapter Tuning: Balancing query throughput and accuracy.


Old: The second step, inserting rows, can be parallelized using multiple processes. Refer to [PostgreSQL Tuning](performance-tuning.md).
New: The second step, inserting vectors into the index, can be parallelized using the appropriate GUC parameter. Refer to [PostgreSQL Tuning](performance-tuning.md). It's a common practice to set the value of `build.internal.build_threads` and parallel workers of PostgreSQL to the number of CPU cores.
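A sketch of the settings this wording refers to (values are illustrative; `build_threads` and `lists` come from the hunks quoted in this PR, and the parallel-worker settings are standard PostgreSQL GUCs):

```sql
-- PostgreSQL-side parallelism for the insert phase.
SET max_parallel_maintenance_workers TO 8;
SET max_parallel_workers TO 8;

-- Index-side threads for the partitioning/clustering phase.
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.internal]
lists = [80000]
build_threads = 8
$$);
```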
Collaborator:

This is inaccurate. For a maxsim index, what gets inserted is an array of vectors.

Member Author:

Reverted in #159


Old: For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall.
New: For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` may improve both QPS and recall. We recommend validating this on a representative sample of your production data in a staging or offline evaluation environment (for example, via offline recall/latency benchmarks or a controlled A/B test) before enabling it broadly in production.
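For reference, a sketch of an index that enables both options on a cosine-similarity column (the `vector_cosine_ops` opclass and the exact option placement are assumptions based on the snippets quoted in this PR, not verified against the final guide):

```sql
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = true
[build.internal]
spherical_centroids = true
$$);
```

As the thread notes, any QPS/recall gain should be validated on your own data rather than assumed.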
Collaborator:

It sounds like it's promoting a complex process rather than telling users how to tune the parameters.

Member Author:

I think it is important to tell the user that "enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall" is not guaranteed.

Maybe we can simplify it to something like: "We recommend validating this on a representative sample of your production data before enabling it in production"?


SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 10;
## Tuning: Handling ultra large vector tables
Collaborator:

The measures described here are also useful for a 1M dataset, so why restrict them to a 50M dataset?

Member Author:

Moved to the front of the 50M discussion in #159


For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let the effective vector dimension used during k-means be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, `build.internal.kmeans_iterations` be $L$, and `build.internal.build_threads` be $T$.
@usamoi (Collaborator), Dec 31, 2025:

The earlier text never even mentioned K-means, nor did it explain what effective vector dimension is.

Member Author:

I am not sure about the right description; I used "the vector dimension used for partitioning" in #159. We can discuss a better name later.



* The memory consumption is approximately $4DC(F + T + 1)$ bytes, which usually takes more than 128 GB.
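The quoted estimate can be checked numerically. A sketch, assuming the $4DC(F + T + 1)$ formula from the paragraph above; the example values for $D$, $C$, $F$, and $T$ are illustrative, not taken from this PR:

```python
def kmeans_memory_bytes(d: int, c: int, f: int, t: int) -> int:
    """Approximate k-means build memory: 4 * D * C * (F + T + 1) bytes.

    d: effective vector dimension (D)
    c: build.internal.lists[-1] (C)
    f: build.internal.sampling_factor (F)
    t: build.internal.build_threads (T)
    """
    return 4 * d * c * (f + t + 1)

# Illustrative values: D=768, C=160000, F=256, T=8.
total = kmeans_memory_bytes(768, 160_000, 256, 8)
print(f"{total / 2**30:.1f} GiB")  # → 121.3 GiB
```

Whether a given build actually crosses 128 GB depends heavily on the real dimension and lists size, which is easy to check with a calculation like this before building.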
Collaborator:

In most cases it is under 128 GB, because even for a table with 50 M vectors, the vector dimension is usually much smaller than 768.

Member Author:

Removed in #159

Comment on lines +92 to 101
To improve the build speed, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.

```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
build.pin = 2
[build.internal]
lists = [160000]
build_threads = 8
$$);
```
Collaborator:

This is incoherent. Why mix completely unrelated optimizations?

Member Author:

Use ... to replace unrelated optimizations in #159


For large tables, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.
If the build speed is still unsatisfactory, you can use the hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), the hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
Collaborator:

When reduced from 0% to 0%, query accuracy decreased by less than 1%.

Member Author:

Let's put it another way: decreased only from 95.6% to 94.9%

Comment on lines +116 to +117

If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$ or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing `D` will have the least impact on accuracy, so that could be a good starting point. Decreasing `F` is also plausible. Since `C` is much more sensitive, it should be the last thing you consider.
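A sketch of acting on that ordering by lowering $D$ first (the `kmeans_dimension` and `sampling_factor` option names appear elsewhere in this PR; the value `192` and the surrounding settings are illustrative only):

```sql
-- Reduce the k-means dimension (D) first: per the text, least accuracy impact.
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.internal]
lists = [160000]
kmeans_dimension = 192
sampling_factor = 256
$$);
```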
Collaborator:

This is disastrous. Settings should always be effective; readers shouldn't have to encounter an OOM error and then go back to adjust them. It's completely uncontrollable.

Member Author:

Let's put it another way: the user observes the memory estimate and then makes a decision.


Old: You can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
New: If you have sufficient memory, please do not set `build.internal.kmeans_dimension`, as it will reduce accuracy and may increase build time due to the dimension restoration. If the accuracy is not acceptable, you can also refer to the [External Build](external-index-precomputation) to offload the indexing workload to other machines.
Collaborator:

Using as much memory as possible just to avoid OOM is unreasonable. 100% of readers would prioritize having controllable memory usage.

Member Author:

Reverted in #159


3 participants