Commit 344d8c0

add indexing guide for memory and time saving (#158)
When to use:
- kmeans_dimension
- hierarchical clustering

---------

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
1 parent 608b0f6 commit 344d8c0

5 files changed

+50
-21
lines changed

src/pgvecto_rs/admin/kubernetes.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ spec:
9494
- "vectors.so"
9595
```
9696
97-
You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/documentation/1.22/kubectl-plugin/) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
97+
You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/docs/1.28/kubectl-plugin) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
9898

9999
```shell
100100
$ sudo kubectl get pod

src/vectorchord/admin/kubernetes.md

Lines changed: 2 additions & 2 deletions
@@ -87,7 +87,7 @@ spec:
8787
- "vchord"
8888
```
8989
90-
You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/documentation/1.25/kubectl-plugin/) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
90+
You can install `cnpg` [kubectl plugin](https://cloudnative-pg.io/docs/1.28/kubectl-plugin) to manage your PostgreSQL cluster. Now we can check the status of the cluster.
9191

9292
```shell
9393
$ sudo kubectl get pod
@@ -170,7 +170,7 @@ For Kubernetes, the [ImageVolume feature](https://kubernetes.io/blog/2024/08/16/
170170
Based on these two features, we can create lightweight `vectorchord` extension image [vchord-scratch](https://github.com/tensorchord/VectorChord-images/pkgs/container/vchord-scratch).
171171

172172
:::tip
173-
If you want to use [`Image Volume Extensions`](https://cloudnative-pg.io/documentation/current/imagevolume_extensions/), you need to meet the following requirements:
173+
If you want to use [`Image Volume Extensions`](https://cloudnative-pg.io/docs/1.28/imagevolume_extensions), you need to meet the following requirements:
174174
- Use Kubernetes version 1.31.0 or above (1.33.0 is recommended), and make sure the `ImageVolume` feature gate is enabled.
175175
- Use CloudNative-PG helm chart version 0.26.0 or above.
176176
:::

src/vectorchord/admin/scalability.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ Therefore, we deploy VectorChord services on AWS EKS with OpenTofu and CloudNati
2525

2626
* [CloudNativePG](https://cloudnative-pg.io/): Manage the full lifecycle of a highly available PostgreSQL database cluster with a primary/standby architecture
2727

28-
* [PGBouncer](https://www.pgbouncer.org/): Lightweight connection pooler for PostgreSQL, provided by module [Pooler](https://cloudnative-pg.io/documentation/1.25/connection_pooling/) of CloudNativePG
28+
* [PGBouncer](https://www.pgbouncer.org/): Lightweight connection pooler for PostgreSQL, provided by module [Pooler](https://cloudnative-pg.io/docs/1.28/connection_pooling) of CloudNativePG
2929

3030
<img src="../images/scalability-architecture.png" alt="Architecture" width="100%" />
3131

src/vectorchord/getting-started/installation.md

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ CMD ["postgres", "-c" ,"shared_preload_libraries=vchord,vector"]
9090

9191
This image can also be used as a CloudNativePG image volume extension. See also
9292

93-
* [Image Volume Extensions](https://cloudnative-pg.io/documentation/current/imagevolume_extensions/)
93+
* [Image Volume Extensions](https://cloudnative-pg.io/docs/1.28/imagevolume_extensions)
9494
* [Kubernetes](../admin/kubernetes)
9595

9696
## Source

src/vectorchord/usage/indexing.md

Lines changed: 45 additions & 16 deletions
@@ -27,9 +27,9 @@ You can also add filters to vector search queries as needed.
2727
SELECT * FROM items WHERE id % 7 <> 0 ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
2828
```
2929

30-
## Tuning
30+
## Tuning: Balancing query throughput and accuracy
3131

32-
When there are less than $100,000$ rows in the table, you usually don't need to set parameters for search and query.
32+
When there are fewer than $100,000$ rows in the table, you usually don't need to set the index options.
3333

3434
```sql
3535
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops);
@@ -58,22 +58,19 @@ The parameter `lists` should be tuned based on the number of rows. The following
5858
| $N \in [2 \times 10^6, 5 \times 10^7)$ | $L \in [4 \sqrt{N}, 8 \sqrt{N}]$ | `[10000]` |
5959
| $N \in [5 \times 10^7, \infty)$ | $L \in [8 \sqrt{N}, 16\sqrt{N}]$ | `[80000]` |
6060

61-
The process of building an index involves two steps: partitioning the vector space first, and then inserting rows into the index. The first step, partitioning the vector space, can be sped up using multiple threads.
61+
The process of building an index involves two steps: clustering the vectors first, and then inserting vectors into the index. The first step, clustering the vectors, can be sped up using multiple threads.
6262

6363
```sql
6464
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
6565
[build.internal]
6666
lists = [1000]
6767
build_threads = 8
6868
$$);
69-
70-
SET vchordrq.probes TO '10';
71-
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
7269
```
7370

74-
The second step, inserting rows, can be parallelized using multiple processes. Refer to [PostgreSQL Tuning](performance-tuning.md).
71+
The second step, inserting vectors into the index, can be parallelized using the appropriate GUC parameters. Refer to [PostgreSQL Tuning](performance-tuning.md). A common practice is to set both `build.internal.build_threads` and the number of PostgreSQL parallel workers to the number of CPU cores.
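
As a minimal sketch of the PostgreSQL side, the session settings below assume an 8-core machine and assume that the index build honors PostgreSQL's standard parallel maintenance settings. The value `8` is illustrative, and the linked tuning page remains the authoritative reference for which parameters apply.

```sql
-- Assumption: 8 CPU cores, matching build_threads = 8 in the example above.
-- Workers that a single maintenance command such as CREATE INDEX may use.
SET max_parallel_maintenance_workers = 8;
-- Upper bound on parallel workers active at the same time.
SET max_parallel_workers = 8;
-- Both settings are capped by max_worker_processes, which can only be
-- changed with a server restart; check its current value first.
SHOW max_worker_processes;
```

Run these statements in the same session as `CREATE INDEX`, since `SET` only affects the current session.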
7572

76-
For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall.
73+
For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` may improve both QPS and recall. We recommend validating this on a representative sample of your production data in a staging or offline evaluation environment (for example, via offline recall/latency benchmarks or a controlled A/B test) before enabling it broadly in production.
7774

7875
```sql
7976
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
@@ -83,27 +80,59 @@ lists = [1000]
8380
spherical_centroids = true
8481
build_threads = 8
8582
$$);
83+
```
8684

87-
SET vchordrq.probes TO '10';
88-
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 10;
85+
## Tuning: Handling ultra large vector tables
86+
87+
For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let the effective vector dimension used during k-means be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, `build.internal.kmeans_iterations` be $L$, and `build.internal.build_threads` be $T$.
88+
89+
* Memory consumption is approximately $4DC(F + T + 1)$ bytes, which usually exceeds 128 GB at this scale.
90+
* Build time grows approximately as $O(FC^2DL)$, which usually means more than one day at this scale.
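
As a worked example of the memory formula, the query below plugs in values that appear in the examples of this guide ($D = 768$, $C = 160000$, $F = 256$, $T = 8$). Treat them as illustrative assumptions rather than recommendations. The product is 130,252,800,000 bytes, roughly 130 GB.

```sql
-- Rough memory estimate for the clustering step: 4 * D * C * (F + T + 1) bytes.
-- All four values are illustrative assumptions taken from this guide's examples.
SELECT pg_size_pretty(
    4::bigint          -- factor 4 from the formula, cast to bigint to avoid int overflow
    * 768              -- D: effective k-means dimension
    * 160000           -- C: build.internal.lists[-1]
    * (256 + 8 + 1)    -- F + T + 1: sampling_factor + build_threads + 1
) AS estimated_memory;
```

Note that `pg_size_pretty` reports sizes in binary units (multiples of 1024), so the displayed figure is somewhat lower than the decimal value above.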
91+
92+
To speed up the build, you can set `build.pin` to `2`, which uses more shared memory to accelerate the process.
93+
94+
```sql
95+
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
96+
build.pin = 2
97+
[build.internal]
98+
lists = [160000]
99+
build_threads = 8
100+
$$);
89101
```
90102

91-
For large tables, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.
103+
If the build speed is still unsatisfactory, you can use hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
92104

93105
```sql
94106
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
95-
residual_quantization = true
96107
build.pin = 2
97108
[build.internal]
98-
lists = [1000]
99-
spherical_centroids = true
109+
lists = [160000]
100110
build_threads = 8
111+
kmeans_algorithm.hierarchical = {}
101112
$$);
102113
```
103114

104-
For large tables, the `build.internal` process costs significant time and memory. Let `build.internal.kmeans_dimension` or the dimension be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, and `build.internal.build_threads` be $T$. The memory consumption is approximately $4CD(F + T + 1)$ bytes. You can moderately reduce these options for lower memory usage.
115+
---
116+
117+
If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$ or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing $D$ has the least impact on accuracy, so it is a good starting point. Decreasing $F$ is also reasonable. Since accuracy is much more sensitive to $C$, reduce it only as a last resort.
118+
119+
For reference, the following changes have little impact on query accuracy (less than 1%):
120+
* Reduce `D` from 768 to 100
121+
* Reduce `F` from 256 to 64
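
To make the saving concrete, the sketch below evaluates the memory formula $4DC(F + T + 1)$ for both configurations. It assumes $C = 160000$ and $T = 8$, borrowed from the examples in this guide. The column aliases `d`, `c`, `f` and `t` are names chosen for illustration only. Under those assumptions the estimate drops from 130,252,800,000 bytes to 4,672,000,000 bytes, roughly 28 times smaller.

```sql
-- Compare the estimated clustering memory before and after the reduction.
-- C = 160000 and T = 8 are assumptions taken from this guide's examples.
SELECT label,
       4::bigint * d * c * (f + t + 1)                 AS estimated_bytes,
       pg_size_pretty(4::bigint * d * c * (f + t + 1)) AS estimated_memory
FROM (VALUES
    ('original (D=768, F=256)', 768, 160000, 256, 8),
    ('reduced  (D=100, F=64)',  100, 160000,  64, 8)
) AS cfg(label, d, c, f, t);
```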
122+
123+
```sql
124+
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
125+
build.pin = 2
126+
[build.internal]
127+
lists = [160000]
128+
build_threads = 8
129+
kmeans_algorithm.hierarchical = {}
130+
kmeans_dimension = 100
131+
sampling_factor = 64
132+
$$);
133+
```
105134

106-
You can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
135+
If you have sufficient memory, do not set `build.internal.kmeans_dimension`: it reduces accuracy and may increase build time because of the dimension restoration. If the resulting accuracy is not acceptable, you can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
107136

108137
## Reference
109138
