add indexing guide for memory and time saving #158
cutecutecat merged 3 commits into tensorchord:main
Conversation
When to use:
- kmeans_dimension
- hierarchical clustering

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
```diff
  ## Tuning: Balancing query throughput and accuracy

- When there are less than $100,000$ rows in the table, you usually don't need to set parameters for search and query.
+ When there are less than $100,000$ rows in the table, you usually don't need to set the index options.
```
The phrase 'parameters for search and query' is not clear. It may mean query options like `vchordrq.probes` or index options like `build.internal.lists`. The former is still important if the user needs a recall target like >87%.
Since it's important, why remove it?
```diff
  | $N \in [5 \times 10^7, \infty)$ | $L \in [8 \sqrt{N}, 16\sqrt{N}]$ | `[80000]` |

- The process of building an index involves two steps: partitioning the vector space first, and then inserting rows into the index. The first step, partitioning the vector space, can be sped up using multiple threads.
+ The process of building an index involves two steps: clustering the vectors first, and then inserting vectors into the index. The first step, clustering the vectors, can be sped up using multiple threads.
```
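As a quick sketch of the sizing rule quoted in the table above (only the $L \in [8\sqrt{N}, 16\sqrt{N}]$ bracket for $N \geq 5 \times 10^7$ is shown here; the other row-count brackets in the guide use different bounds, and the helper name is hypothetical):

```python
import math

def lists_range(n_rows: int) -> tuple[int, int]:
    # Recommended range for `build.internal.lists`, per the
    # L in [8*sqrt(N), 16*sqrt(N)] rule for N >= 5e7.
    root = math.sqrt(n_rows)
    return round(8 * root), round(16 * root)

print(lists_range(100_000_000))  # -> (80000, 160000)
```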
This kind of description is inappropriate. Partitioning is the step, whereas clustering is merely an implementation detail.
```sql
SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
```
It is moved to the chapter Tuning: Balancing query throughput and accuracy.
```diff
- The second step, inserting rows, can be parallelized using multiple processes. Refer to [PostgreSQL Tuning](performance-tuning.md).
+ The second step, inserting vectors into the index, can be parallelized using the appropriate GUC parameter. Refer to [PostgreSQL Tuning](performance-tuning.md). It's a common practice to set the value of `build.internal.build_threads` and parallel workers of PostgreSQL to the number of CPU cores.
```
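A sketch of that practice for a hypothetical 8-core machine (`max_parallel_maintenance_workers` is PostgreSQL's cap on parallel workers for index builds; the `lists` value here is illustrative):

```sql
-- Match the clustering threads and PostgreSQL's parallel workers
-- to the number of CPU cores (8 in this sketch).
SET max_parallel_maintenance_workers = 8;

CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.internal]
lists = [80000]
build_threads = 8
$$);
```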
This is inaccurate. For a maxsim index, what gets inserted is an array of vectors.
```diff
- For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall.
+ For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` may improve both QPS and recall. We recommend validating this on a representative sample of your production data in a staging or offline evaluation environment (for example, via offline recall/latency benchmarks or a controlled A/B test) before enabling it broadly in production.
```
It sounds like it's promoting a complex process rather than telling users how to tune the parameters.
I think it is important to tell the user that enabling `residual_quantization` and `build.internal.spherical_centroids` does not guarantee improved QPS and recall.
Maybe we can simplify it to: "We recommend validating this on a representative sample of your production data before enabling it in production"?
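For illustration, a sketch of a build that enables both options (the option placement is an assumption based on the other snippets in this PR, and the `lists` value is illustrative; `vector_cosine_ops` is the cosine-distance operator class):

```sql
-- Sketch: cosine-similarity index with both options enabled.
-- Validate recall/QPS on a representative sample before relying on this.
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [80000]
spherical_centroids = true
$$);
```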
```sql
SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 10;
```

## Tuning: Handling ultra large vector tables
The measures described here are also useful for a 1M dataset, so why restrict them to a 50M dataset?
Moved to the front of the 50M discussion in #159
For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let the effective vector dimension used during k-means be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, `build.internal.kmeans_iterations` be $L$, and `build.internal.build_threads` be $T$.
The earlier text never even mentioned K-means, nor did it explain what effective vector dimension is.
I am not sure about an adequate description; I picked "the vector dimension used for partition" in #159. We can discuss a better name later.
* The memory consumption is approximately $4DC(F + T + 1)$ bytes, which usually takes more than 128 GB.
In most cases it is under 128 GB, because even for a table with 50 M vectors, the vector dimension is usually much smaller than 768.
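For a concrete sense of scale, the formula is easy to check numerically; a small sketch (the parameter values below are illustrative assumptions, not the extension's defaults):

```python
def kmeans_memory_gib(d: int, c: int, f: int, t: int) -> float:
    # Approximate peak memory of build.internal, per the
    # 4*D*C*(F + T + 1)-byte formula quoted above.
    return 4 * d * c * (f + t + 1) / 2**30

# Illustrative values: D=768, C=160000 (lists[-1]), F=16, T=8
print(round(kmeans_memory_gib(768, 160_000, 16, 8), 1))  # -> 11.4
```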
To improve the build speed, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.

```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
build.pin = 2
[build.internal]
lists = [160000]
build_threads = 8
$$);
```
This is incoherent. Why mix completely unrelated optimizations?
Used `...` to replace the unrelated optimizations in #159.
```diff
- For large tables, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.
+ If the build speed is still unsatisfactory, you can use the hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), the hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
```
Even a reduction from 0% to 0% would satisfy "query accuracy decreased by less than 1%".
Let's put it another way: decreased only from 95.6% to 94.9%
If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$, or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing $D$ will have the least impact on accuracy, so that could be a good starting point. Decreasing $F$ is also plausible. Since $C$ is much more sensitive, it should be the last thing you consider.
This is disastrous. Settings should always be effective; readers shouldn't have to encounter an OOM error and then go back to adjust them. It's completely uncontrollable.
Let's put it another way: the user observes the memory estimate and then makes a decision.
```diff
- You can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
+ If you have sufficient memory, please do not set `build.internal.kmeans_dimension`, as it will reduce accuracy and may increase build time due to the dimension restoration. If the accuracy is not acceptable, you can also refer to the [External Build](external-index-precomputation) to offload the indexing workload to other machines.
```
Using as much memory as possible just to avoid OOM is unreasonable. 100% of readers would prioritize having controllable memory usage.