add indexing guide for memory and time saving (#158)

cutecutecat · web-flow · commit 344d8c0c8cdb · 2025-12-31T16:18:35.000+08:00
When to use:
- kmeans_dimension
- hierarchical clustering

---------

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
diff --git a/src/pgvecto_rs/admin/kubernetes.md b/src/pgvecto_rs/admin/kubernetes.md
@@ -94,7 +94,7 @@ spec:
     - "vectors.so"
 ```
 
-You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/documentation/1.22/kubectl-plugin/) to manage your PostgreSQL cluster. Now we can check the status of the cluster. 
+You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/docs/1.28/kubectl-plugin) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
 
 ```shell
 $ sudo kubectl get pod
diff --git a/src/vectorchord/admin/kubernetes.md b/src/vectorchord/admin/kubernetes.md
@@ -87,7 +87,7 @@ spec:
     - "vchord"
 ```
 
-You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/documentation/1.25/kubectl-plugin/) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
+You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/docs/1.28/kubectl-plugin) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
 
 ```shell
 $ sudo kubectl get pod
@@ -170,7 +170,7 @@ For Kubernetes, the [ImageVolume feature](https://kubernetes.io/blog/2024/08/16/
 Based on these two features, we can create lightweight `vectorchord` extension image [vchord-scratch](https://github.com/tensorchord/VectorChord-images/pkgs/container/vchord-scratch).
 
 :::tip
-If you want to use [`Image Volume Extensions`](https://cloudnative-pg.io/documentation/current/imagevolume_extensions/), you need to meet the following requirements:
+If you want to use [`Image Volume Extensions`](https://cloudnative-pg.io/docs/1.28/imagevolume_extensions), you need to meet the following requirements:
 - Use Kubernetes version 1.31.0 or above (1.33.0 is recommended), and make sure the `ImageVolume` feature gate is enabled.
 - Use CloudNative-PG helm chart version 0.26.0 or above.
 :::
diff --git a/src/vectorchord/admin/scalability.md b/src/vectorchord/admin/scalability.md
@@ -25,7 +25,7 @@ Therefore, we deploy VectorChord services on AWS EKS with OpenTofu and CloudNati
 
 * [CloudNativePG](https://cloudnative-pg.io/): Manage the full lifecycle of a highly available PostgreSQL database cluster with a primary/standby architecture
 
-* [PGBouncer](https://www.pgbouncer.org/): Lightweight connection pooler for PostgreSQL, provided by module [Pooler](https://cloudnative-pg.io/documentation/1.25/connection_pooling/) of CloudNativePG
+* [PGBouncer](https://www.pgbouncer.org/): Lightweight connection pooler for PostgreSQL, provided by module [Pooler](https://cloudnative-pg.io/docs/1.28/connection_pooling) of CloudNativePG
 
 <img src="../images/scalability-architecture.png" alt="Architecture" width="100%" />
 
diff --git a/src/vectorchord/getting-started/installation.md b/src/vectorchord/getting-started/installation.md
@@ -90,7 +90,7 @@ CMD ["postgres", "-c" ,"shared_preload_libraries=vchord,vector"]
 
 This image can also be used as a CloudNativePG image volume extension. See also
 
-* [Image Volume Extensions](https://cloudnative-pg.io/documentation/current/imagevolume_extensions/)
+* [Image Volume Extensions](https://cloudnative-pg.io/docs/1.28/imagevolume_extensions)
 * [Kubernetes](../admin/kubernetes)
 
 ## Source
diff --git a/src/vectorchord/usage/indexing.md b/src/vectorchord/usage/indexing.md
@@ -27,9 +27,9 @@ You can also add filters to vector search queries as needed.
 SELECT * FROM items WHERE id % 7 <> 0 ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
 ```
 
-## Tuning
+## Tuning: Balancing query throughput and accuracy
 
-When there are less than $100,000$ rows in the table, you usually don't need to set parameters for search and query.
+When there are fewer than $100,000$ rows in the table, you usually don't need to set the index options.
 
 ```sql
 CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops);
@@ -58,22 +58,19 @@ The parameter `lists` should be tuned based on the number of rows. The following
 | $N \in [2 \times 10^6, 5 \times 10^7)$ | $L \in [4 \sqrt{N}, 8 \sqrt{N}]$     | `[10000]`       |
 | $N \in [5 \times 10^7, \infty)$        | $L \in [8 \sqrt{N}, 16\sqrt{N}]$     | `[80000]`       |
 
-The process of building an index involves two steps: partitioning the vector space first, and then inserting rows into the index. The first step, partitioning the vector space, can be sped up using multiple threads.
+The process of building an index involves two steps: clustering the vectors first, and then inserting vectors into the index. The first step, clustering the vectors, can be sped up using multiple threads.
 
 ```sql
 CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
 [build.internal]
 lists = [1000]
 build_threads = 8
 $$);
-
-SET vchordrq.probes TO '10';
-SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
 ```
 
-The second step, inserting rows, can be parallelized using multiple processes. Refer to [PostgreSQL Tuning](performance-tuning.md).
+The second step, inserting vectors into the index, can be parallelized through PostgreSQL's parallel-worker GUC parameters. Refer to [PostgreSQL Tuning](performance-tuning.md). It is common practice to set both `build.internal.build_threads` and the number of PostgreSQL parallel workers to the number of CPU cores.
 
-For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall.
+For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` may improve both QPS and recall. We recommend validating this on a representative sample of your production data in a staging or offline evaluation environment (for example, via offline recall/latency benchmarks or a controlled A/B test) before enabling it broadly in production.
 
 ```sql
 CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
@@ -83,27 +80,59 @@ lists = [1000]
 spherical_centroids = true
 build_threads = 8
 $$);
+```
 
-SET vchordrq.probes TO '10';
-SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 10;
+## Tuning: Handling ultra large vector tables
+
+For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let the effective vector dimension used during k-means be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, `build.internal.kmeans_iterations` be $L$, and `build.internal.build_threads` be $T$.
+
+* The memory consumption is approximately $4DC(F + T + 1)$ bytes, which usually exceeds 128 GB.
+* The build time complexity is $O(FC^2DL)$, so a build usually takes more than one day.
+
+To speed up the build, you can set `build.pin` to `2`, which uses more shared memory.
+
+```sql
+CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
+build.pin = 2
+[build.internal]
+lists = [160000]
+build_threads = 8
+$$);
 ```
 
-For large tables, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.
+If the build speed is still unsatisfactory, you can use hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
 
 ```sql
 CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
-residual_quantization = true
 build.pin = 2
 [build.internal]
-lists = [1000]
-spherical_centroids = true
+lists = [160000]
 build_threads = 8
+kmeans_algorithm.hierarchical = {}
 $$);
 ```
 
-For large tables, the `build.internal` process costs significant time and memory. Let `build.internal.kmeans_dimension` or the dimension be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, and `build.internal.build_threads` be $T$. The memory consumption is approximately $4CD(F + T + 1)$ bytes. You can moderately reduce these options for lower memory usage.
+---
+
+If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$, or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing $D$ has the least impact on accuracy, so it is a good starting point. Reducing $F$ is also a reasonable option. Accuracy is much more sensitive to $C$, so reduce it only as a last resort.
+
+For reference, the following configuration has little impact on query accuracy (less than 1%):
+* Reduce $D$ from 768 to 100
+* Reduce $F$ from 256 to 64
+
+```sql
+CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
+build.pin = 2
+[build.internal]
+lists = [160000]
+build_threads = 8
+kmeans_algorithm.hierarchical = {}
+kmeans_dimension = 100
+sampling_factor = 64
+$$);
+```
 
-You can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
+If you have sufficient memory, do not set `build.internal.kmeans_dimension`: it reduces accuracy and may increase build time because of the extra dimension-restoration step. If the accuracy is not acceptable, you can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
 
 ## Reference