add indexing guide for memory and time saving #158
cutecutecat merged 3 commits into tensorchord:main
Conversation
When to use:
- kmeans_dimension
- hierarchical clustering

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
```diff
  ## Tuning: Balancing query throughput and accuracy

- When there are less than $100,000$ rows in the table, you usually don't need to set parameters for search and query.
+ When there are less than $100,000$ rows in the table, you usually don't need to set the index options.
```
The phrase 'parameters for search and query' is not clear. It may mean query options like `vchordrq.probes` or index options like `build.internal.lists`. The former is still important if the user needs a recall target like >87%.
Since it's important, why remove it?
```diff
  | $N \in [5 \times 10^7, \infty)$ | $L \in [8 \sqrt{N}, 16\sqrt{N}]$ | `[80000]` |

- The process of building an index involves two steps: partitioning the vector space first, and then inserting rows into the index. The first step, partitioning the vector space, can be sped up using multiple threads.
+ The process of building an index involves two steps: clustering the vectors first, and then inserting vectors into the index. The first step, clustering the vectors, can be sped up using multiple threads.
```
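As a quick sketch of the sizing rule quoted in the table above (only the $L \in [8\sqrt{N}, 16\sqrt{N}]$ bracket for $N \geq 5 \times 10^7$ is shown here; the other row-count brackets in the guide use different bounds, and the helper name is hypothetical):

```python
import math

def lists_range(n_rows: int) -> tuple[int, int]:
    # Recommended range for `build.internal.lists`, per the
    # L in [8*sqrt(N), 16*sqrt(N)] rule for N >= 5e7.
    root = math.sqrt(n_rows)
    return round(8 * root), round(16 * root)

print(lists_range(100_000_000))  # -> (80000, 160000)
```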
This kind of description is inappropriate. Partitioning is the step, whereas clustering is merely an implementation detail.
```sql
SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
```
It is moved to the chapter Tuning: Balancing query throughput and accuracy.
```diff
- The second step, inserting rows, can be parallelized using multiple processes. Refer to [PostgreSQL Tuning](performance-tuning.md).
+ The second step, inserting vectors into the index, can be parallelized using the appropriate GUC parameter. Refer to [PostgreSQL Tuning](performance-tuning.md). It's a common practice to set the value of `build.internal.build_threads` and parallel workers of PostgreSQL to the number of CPU cores.
```
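A sketch of that practice for a hypothetical 8-core machine (`max_parallel_maintenance_workers` is PostgreSQL's cap on parallel workers for index builds; the `lists` value here is illustrative):

```sql
-- Match the clustering threads and PostgreSQL's parallel workers
-- to the number of CPU cores (8 in this sketch).
SET max_parallel_maintenance_workers = 8;

CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.internal]
lists = [80000]
build_threads = 8
$$);
```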
This is inaccurate. For a maxsim index, what gets inserted is an array of vectors.
```diff
- For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall.
+ For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` may improve both QPS and recall. We recommend validating this on a representative sample of your production data in a staging or offline evaluation environment (for example, via offline recall/latency benchmarks or a controlled A/B test) before enabling it broadly in production.
```
It sounds like it's promoting a complex process rather than telling users how to tune the parameters.
I think it is important to tell the user that enabling `residual_quantization` and `build.internal.spherical_centroids` does not guarantee improved QPS and recall.
Maybe we can simplify it to: "We recommend validating this on a representative sample of your production data before enabling it in production"?
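For illustration, a sketch of a build that enables both options (the option placement is an assumption based on the other snippets in this PR, and the `lists` value is illustrative; `vector_cosine_ops` is the cosine-distance operator class):

```sql
-- Sketch: cosine-similarity index with both options enabled.
-- Validate recall/QPS on a representative sample before relying on this.
CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [80000]
spherical_centroids = true
$$);
```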
```sql
SET vchordrq.probes TO '10';
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 10;
```

## Tuning: Handling ultra large vector tables
The measures described here are also useful for a 1M dataset, so why restrict them to a 50M dataset?
Moved to the front of the 50M discussion in #159
For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let the effective vector dimension used during k-means be $D$, `build.internal.lists[-1]` be $C$, `build.internal.sampling_factor` be $F$, `build.internal.kmeans_iterations` be $L$, and `build.internal.build_threads` be $T$.
The earlier text never even mentioned K-means, nor did it explain what effective vector dimension is.
I am not sure about an adequate description; I picked "the vector dimension used for partition" in #159. We can discuss a better name later.
* The memory consumption is approximately $4DC(F + T + 1)$ bytes, which usually takes more than 128 GB.
In most cases it is under 128 GB, because even for a table with 50 M vectors, the vector dimension is usually much smaller than 768.
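For a concrete sense of scale, the formula is easy to check numerically; a small sketch (the parameter values below are illustrative assumptions, not the extension's defaults):

```python
def kmeans_memory_gib(d: int, c: int, f: int, t: int) -> float:
    # Approximate peak memory of build.internal, per the
    # 4*D*C*(F + T + 1)-byte formula quoted above.
    return 4 * d * c * (f + t + 1) / 2**30

# Illustrative values: D=768, C=160000 (lists[-1]), F=16, T=8
print(round(kmeans_memory_gib(768, 160_000, 16, 8), 1))  # -> 11.4
```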
To improve the build speed, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.

```sql
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
build.pin = 2
[build.internal]
lists = [160000]
build_threads = 8
$$);
```
This is incoherent. Why mix completely unrelated optimizations?
Used `...` to replace the unrelated optimizations in #159.
```diff
- For large tables, you may opt to use more shared memory to accelerate the process by setting `build.pin` to `2`.
+ If the build speed is still unsatisfactory, you can use the hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), the hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
```
Even a reduction from 0% to 0% would satisfy "query accuracy decreased by less than 1%".
Let's put it another way: decreased only from 95.6% to 94.9%
If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$, or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing $D$ will have the least impact on accuracy, so that could be a good starting point. Decreasing $F$ is also plausible. Since $C$ is much more sensitive, it should be the last thing you consider.
This is disastrous. Settings should always be effective; readers shouldn't have to encounter an OOM error and then go back to adjust them. It's completely uncontrollable.
Let's put it another way: the user observes the memory estimate and then makes a decision.
```diff
- You can also refer to [External Build](external-index-precomputation) to offload the indexing workload to other machines.
+ If you have sufficient memory, please do not set `build.internal.kmeans_dimension`, as it will reduce accuracy and may increase build time due to the dimension restoration. If the accuracy is not acceptable, you can also refer to the [External Build](external-index-precomputation) to offload the indexing workload to other machines.
```
Using as much memory as possible just to avoid OOM is unreasonable. 100% of readers would prioritize having controllable memory usage.