Commit 50a575f

update external build
Signed-off-by: usamoi <usamoi@outlook.com>
1 parent a9786d8 commit 50a575f

1 file changed: 67 additions and 25 deletions
@@ -1,36 +1,78 @@
11
# External Build
22

3-
Unlike pure SQL, external build performs clustering externally before inserting centroids into a PostgreSQL table. While this process may be more complex, it significantly speeds up indexing for larger datasets (>5M). We showed some benchmarks in the [blog post](https://blog.vectorchord.ai/vectorchord-store-400k-vectors-for-1-in-postgresql). It takes around 3 minutes to build an index for 1M vectors, 16x faster than standard indexing in pgvector.
3+
In addition to using the methods in [Partitioning Tuning](partitioning-tuning) to achieve shorter build time and lower memory consumption at the cost of QPS and recall, you can also choose to run the partitioning phase offline on other machines or on GPUs. This feature is not supported by `vchordg`, since it does not have a partitioning phase.
44

5-
To get started, you need to do a clustering of vectors using:
6-
- [faiss](https://github.com/facebookresearch/faiss)
7-
- [scikit-learn](https://github.com/scikit-learn/scikit-learn)
8-
- [fastkmeans](https://github.com/AnswerDotAI/fastkmeans)
9-
- or any other clustering library
5+
Assume the table is `t` and the column to be indexed is `val`.
106

11-
The centroids should be preset in a table of any name with 3 columns:
12-
- `id(integer)`: id of each centroid, should be unique
13-
- `parent(integer, nullable)`: parent id of each centroid, could be `NULL` for normal clustering
14-
- `vector(vector)`: representation of each centroid, `vector` type
7+
```sql
8+
CREATE TABLE t (val vector(3));
9+
```
10+
11+
First, sample the column on which the index will be built.
12+
13+
```sql
14+
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
15+
SELECT val FROM t TABLESAMPLE SYSTEM_ROWS(1000);
16+
```
17+
18+
Based on these samples, you can partition the vector space into Voronoi cells. After that, you will create a table and insert your partitions.
19+
20+
```sql
21+
CREATE TABLE public.t_build (id INTEGER NOT NULL UNIQUE, parent INTEGER, vector vector NOT NULL);
22+
INSERT INTO public.t_build (id, parent, vector) VALUES (0, NULL, '[0.1, 0.2, 0.3]');
23+
INSERT INTO public.t_build (id, parent, vector) VALUES (1, 0, '[0.1, 0.2, 0.3]');
24+
INSERT INTO public.t_build (id, parent, vector) VALUES (2, 0, '[0.4, 0.5, 0.6]');
25+
INSERT INTO public.t_build (id, parent, vector) VALUES (3, 0, '[0.7, 0.8, 0.9]');
26+
```
1527

16-
And example could be like this:
28+
The index can be created using the following syntax.
1729

1830
```sql
19-
-- Create table of centroids
20-
CREATE TABLE public.centroids (id integer NOT NULL UNIQUE, parent integer, vector vector(768));
21-
-- Insert centroids into it
22-
INSERT INTO public.centroids (id, parent, vector) VALUES (1, NULL, '{0.1, 0.2, 0.3, ..., 0.768}');
23-
INSERT INTO public.centroids (id, parent, vector) VALUES (2, NULL, '{0.4, 0.5, 0.6, ..., 0.768}');
24-
INSERT INTO public.centroids (id, parent, vector) VALUES (3, NULL, '{0.7, 0.8, 0.9, ..., 0.768}');
25-
-- ...
26-
27-
-- Create index using the centroid table
28-
CREATE INDEX ON gist_train USING vchordrq (embedding vector_l2_ops) WITH (options = $$
29-
[build.external]
30-
table = 'public.centroids'
31+
CREATE INDEX ON t USING vchordrq (val vector_l2_ops) WITH (options = $$
32+
build.external.table = 'public.t_build'
3133
$$);
3234
```
3335

34-
To simplify the workflow, we provide end-to-end scripts for external index pre-computation, refer to [Run External Index Precomputation Toolkit](https://github.com/tensorchord/VectorChord/tree/main/scripts#run-external-index-precomputation-toolkit).
36+
## Format
37+
38+
The table that stores the partition information must follow this schema exactly:
39+
40+
- The first column must be `id`. Its type is `integer`. It must not be null. It must be unique.
41+
- The second column must be `parent`. Its type is `integer`. It may be null, as it is for the root in the example above.
42+
- The third column must be `vector`. Its type is `vector`. It must not be null. `halfvec` or other types are not supported yet.
43+
44+
Logically, this forms a tree. The distance from the root to each leaf must be the same.
45+
46+
If any of the above properties are not satisfied, an error will be reported.
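
The structural rules above can also be checked on the client before the rows are inserted. The following is a minimal sketch, not VectorChord's own validation: `check_partition_rows` is a hypothetical helper, and it assumes the rows are held in memory as `(id, parent, vector)` tuples.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# One row of the partition table: (id, parent, vector).
Row = Tuple[int, Optional[int], Sequence[float]]


def check_partition_rows(rows: List[Row]) -> int:
    """Return the common root-to-leaf depth, or raise ValueError."""
    parent_of: Dict[int, Optional[int]] = {}
    for node_id, parent, vector in rows:
        if node_id is None or node_id in parent_of:
            raise ValueError(f"id must be non-null and unique, got {node_id!r}")
        if vector is None:
            raise ValueError(f"vector of node {node_id} must not be null")
        parent_of[node_id] = parent

    # Every non-null parent must refer to an existing id.
    parents = {p for p in parent_of.values() if p is not None}
    missing = parents - set(parent_of)
    if missing:
        raise ValueError(f"unknown parent ids: {sorted(missing)}")

    def depth(node_id: int) -> int:
        # Walk up to the root, counting edges; bail out if we loop.
        steps = 0
        while parent_of[node_id] is not None:
            node_id = parent_of[node_id]
            steps += 1
            if steps > len(parent_of):
                raise ValueError("parent links contain a cycle")
        return steps

    depths = {node_id: depth(node_id) for node_id in parent_of}
    # A leaf is a node that no other node names as its parent.
    leaf_depths = {d for node_id, d in depths.items() if node_id not in parents}
    if len(leaf_depths) != 1:
        raise ValueError(f"leaves sit at different depths: {sorted(leaf_depths)}")
    return leaf_depths.pop()
```

For the four rows inserted into `public.t_build` earlier, this returns a depth of 1: one root and three leaves directly below it.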
47+
48+
## Partitioning
49+
50+
To get started, here is a minimal code example for performing partitioning using [faiss](https://github.com/facebookresearch/faiss).
51+
52+
```python
53+
from typing import List
54+
from faiss import Kmeans
55+
import numpy
56+

57+
def partition(
58+
    samples: numpy.ndarray, lists: List[int]
59+
) -> List[numpy.ndarray]:
60+
    dim = samples.shape[1]  # samples is a 2-D array of shape (n, dim)
61+
    results = []
62+
    for i in range(len(lists) + 1):
63+
        kmeans = Kmeans(
64+
            dim,
65+
            lists[len(lists) - 1 - i] if len(lists) - 1 - i >= 0 else 1,  # leaf level first, then a single root
66+
            gpu=False,
67+
            verbose=True,
68+
            niter=10,
69+
            seed=42,
70+
            spherical=False,
71+
        )
72+
        kmeans.train(samples)
73+
        results.append(kmeans.centroids)
74+
        samples = kmeans.centroids  # the next level up clusters these centroids
75+
    return results
76+
```
3577

36-
This feature is not supported by `vchordg`, since this step does not exist in it.
78+
This computes the generators of all Voronoi cells across different levels. The parent-child relationships in the tree are determined by computing the shortest distances, but the details are omitted here. Any algorithm capable of generating multi-level Voronoi cells can be used, such as spherical K-means, balanced K-means or GPU K-means.
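
As an illustration of the omitted step, the sketch below assigns each centroid to its nearest centroid one level up and produces `(id, parent, vector)` rows in the layout described under Format. It is one possible approach under stated assumptions, not the project's own implementation: `build_rows` is a hypothetical helper, ids are numbered from 0 starting at the root, Euclidean distance is used, and the input is assumed to be ordered leaf level first and root level last, as returned by `partition`.

```python
from typing import List, Optional, Tuple

import numpy as np

# One row of the partition table: (id, parent, vector).
Row = Tuple[int, Optional[int], List[float]]


def build_rows(levels: List[np.ndarray]) -> List[Row]:
    """Turn per-level centroids (leaf level first, root last) into table rows."""
    rows: List[Row] = []
    parent_ids: Optional[List[int]] = None   # ids of the level above
    parent_vecs: Optional[np.ndarray] = None  # centroids of the level above
    next_id = 0
    for centroids in reversed(levels):  # walk from the root down to the leaves
        ids = list(range(next_id, next_id + len(centroids)))
        next_id += len(centroids)
        for node_id, vec in zip(ids, centroids):
            if parent_vecs is None:
                parent = None  # the root has no parent
            else:
                # The nearest centroid of the level above becomes the parent.
                dists = np.linalg.norm(parent_vecs - vec, axis=1)
                parent = parent_ids[int(np.argmin(dists))]
            rows.append((node_id, parent, [float(x) for x in vec]))
        parent_ids, parent_vecs = ids, centroids
    return rows
```

Each row can then be inserted into the partition table, with the vector written in pgvector's text form such as `'[0.1, 0.2, 0.3]'`. Note that this sketch does not handle an intermediate centroid that ends up with no children; such a node would become a leaf at a shallower depth and break the equal-depth requirement, so a real pipeline would need to detect or avoid that case.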
