fix: hierarchical document by review#159

Closed
cutecutecat wants to merge 1 commit into tensorchord:main from cutecutecat:fix-hierarchical
cutecutecat (Member) commented Jan 4, 2026

Addresses most of the review comments from #158.

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
vercel bot commented Jan 4, 2026

The latest updates on your projects.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| pgvecto-rs-docs | Ready | Preview, Comment | Jan 4, 2026 2:16am |

@cutecutecat cutecutecat requested a review from usamoi January 4, 2026 02:50
If the build speed is still unsatisfactory, you can use the hierarchical clustering to accelerate the process at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), the hierarchical clustering was 100 times faster than the default algorithm, while query accuracy decreased by less than 1%.
For large tables with more than 50 million rows, the `build.internal` process requires significant time and memory. Let $D$ be the vector dimension used for partition, $C$ be `build.internal.lists[-1]`, $F$ be `build.internal.sampling_factor`, $L$ be `build.internal.kmeans_iterations`, and $T$ be `build.internal.build_threads`. The build time is approximately $O(FC^2DL)$, which usually takes more than one day.

If this applies to you, you can use the hierarchical clustering to speed up the process, albeit at the expense of some accuracy. In our [benchmark](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-hierarchical-k-means), hierarchical clustering was 100 times faster than the default algorithm, while query recall decreased only from 95.6% to 94.9%.
Collaborator commented:

What does "If this applies to you," refer to? Also, the speedup should be 400 times, not 100 times.
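To make the build-time estimate quoted in the diff, $O(FC^2DL)$, more concrete, here is a minimal sketch that compares relative costs for different settings. The function name and every parameter value below are hypothetical and chosen only for illustration; the result is a unitless quantity proportional to the estimate, not a prediction of wall-clock time.

```python
def relative_build_cost(f: int, c: int, d: int, l: int) -> int:
    """Unitless cost proportional to F * C^2 * D * L.

    f: sampling factor (build.internal.sampling_factor)
    c: number of lists in the last level (build.internal.lists[-1])
    d: vector dimension used for partitioning
    l: number of k-means iterations (build.internal.kmeans_iterations)
    """
    return f * c * c * d * l


# Hypothetical settings, for illustration only.
baseline = relative_build_cost(f=256, c=160_000, d=768, l=10)
halved_c = relative_build_cost(f=256, c=80_000, d=768, l=10)
halved_d = relative_build_cost(f=256, c=160_000, d=384, l=10)

# C enters quadratically, so halving it cuts the estimate by 4x,
# while halving D (a linear term) cuts it by 2x.
print(baseline / halved_c)  # 4.0
print(baseline / halved_d)  # 2.0
```

Because $C$ is the only quadratic term, it dominates the estimate for large tables, which is consistent with the diff's claim that builds above 50 million rows become slow.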

---
## Tuning: Optimize the memory usage with indexing

When the indexing process starts, VectorChord shows the estimated amount of memory that will be allocated, such as:
Collaborator commented:

It does not specify where this will be displayed. In addition, due to settings, users may not see this message at all.


When the indexing process starts, VectorChord shows the estimated amount of memory that will be allocated, such as:

```shell
INFO: clustering: estimated memory usage is 1.49 GiB
```

Collaborator commented:

Why shell?

If the value exceeds your expectations or the physical memory constraint, it is wise to cancel and check this chapter. There are some options that can help reduce memory usage.
Collaborator commented:

How to cancel?

* C: `build.internal.lists[-1]`.

If you encounter an Out-of-Memory (OOM) error, reducing $D$, $C$ or $F$ will lower the memory usage. Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing `D` will have the least impact on accuracy, so that could be a good starting point. Decreasing `F` is also plausible. Since `C` is much more sensitive, it should be the last thing you consider.
Based on our [experience](https://blog.vectorchord.ai/how-we-made-100m-vector-indexing-in-20-minutes-possible-on-postgresql#heading-dimensionality-reduction), reducing `D` will have the least impact on accuracy, so that could be a good starting point. Decreasing `F` is also plausible. Since `C` is much more sensitive, it should be the last thing you consider.
Collaborator commented:

This is highly suspicious.
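As a rough illustration of why reducing $D$, $C$, or $F$ lowers memory usage, the sketch below assumes that memory is dominated by a sample matrix of $F \times C$ vectors, each holding $D$ 32-bit floats. This model is an assumption made here for illustration, not a formula stated in the diff, and the parameter values are hypothetical. It says nothing about which parameter affects accuracy least, which is the point under dispute in the comment above.

```python
BYTES_PER_FLOAT32 = 4


def sample_bytes(f: int, c: int, d: int) -> int:
    """Bytes needed for F * C sampled vectors of D float32 values each.

    ASSUMPTION: this models only the sample matrix; VectorChord's real
    estimate may include other buffers.
    """
    return f * c * d * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical settings, for illustration only.
base = sample_bytes(f=256, c=2_000, d=768)
print(f"baseline: {to_gib(base):.2f} GiB")

# Under this model every parameter is linear, so halving any one of them
# halves the footprint.
for name, value in [
    ("half D", sample_bytes(f=256, c=2_000, d=384)),
    ("half C", sample_bytes(f=256, c=1_000, d=768)),
    ("half F", sample_bytes(f=128, c=2_000, d=768)),
]:
    print(f"{name}: {value / base:.2f}x of baseline")
```

Under this simplified model the three parameters are interchangeable for memory, so the choice among them would have to be driven by their effect on recall, which should be measured on the actual dataset.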
