# External Build

In addition to the methods in [Partitioning Tuning](partitioning-tuning), which trade QPS and recall for a shorter build time and lower memory consumption, you can run the partitioning phase offline, on other machines or on GPUs. This feature is not supported by `vchordg`, since it has no partitioning phase.

Assume the table is `t` and the column to be indexed is `val`.

```sql
CREATE TABLE t (val vector(3));
```

Specifically, you need to sample the column on which the index will be built.

```sql
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
SELECT val FROM t TABLESAMPLE SYSTEM_ROWS(1000);
```
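
On the client side, the sampled `val` values arrive in pgvector's text format, e.g. `[0.1,0.2,0.3]`. A minimal sketch for turning such rows into a NumPy matrix, assuming your driver returns the values as plain strings:

```python
import numpy as np

def rows_to_matrix(rows):
    # Each row is a 1-tuple holding a pgvector text literal like '[0.1,0.2,0.3]'.
    # Strip the brackets, split on commas, and stack into an (n, dim) array.
    return np.array(
        [[float(x) for x in text.strip()[1:-1].split(",")] for (text,) in rows]
    )
```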

Based on these samples, partition the vector space into Voronoi cells. Then create a table and insert your partitions. Note that pgvector literals use square brackets.

```sql
CREATE TABLE public.t_build (id INTEGER NOT NULL UNIQUE, parent INTEGER, vector vector NOT NULL);
INSERT INTO public.t_build (id, parent, vector) VALUES (0, NULL, '[0.1, 0.2, 0.3]');
INSERT INTO public.t_build (id, parent, vector) VALUES (1, 0, '[0.1, 0.2, 0.3]');
INSERT INTO public.t_build (id, parent, vector) VALUES (2, 0, '[0.4, 0.5, 0.6]');
INSERT INTO public.t_build (id, parent, vector) VALUES (3, 0, '[0.7, 0.8, 0.9]');
```
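
When generating these `INSERT` statements programmatically, a small helper can render each row; the function name here is illustrative, not part of any API:

```python
def insert_statement(id_, parent, vector):
    # Render one row for the t_build table; `parent` is None for root centroids.
    vec = "[" + ",".join(repr(float(x)) for x in vector) + "]"
    par = "NULL" if parent is None else str(parent)
    return (
        f"INSERT INTO public.t_build (id, parent, vector) "
        f"VALUES ({id_}, {par}, '{vec}');"
    )
```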

The index can be created with the following statement.

```sql
CREATE INDEX ON t USING vchordrq (val vector_l2_ops) WITH (options = $$
build.external.table = 'public.t_build'
$$);
```

## Format

The table that stores the partition information must strictly follow this schema:

- The first column must be `id`. Its type is `integer`. It must not be null and must be unique.
- The second column must be `parent`. Its type is `integer`. It may be `NULL`: centroids at the root level have no parent.
- The third column must be `vector`. Its type is `vector`. It must not be null; `halfvec` and other types are not supported yet.

Logically, the rows form a tree, and every leaf must be at the same depth from the root.

If any of these properties is not satisfied, an error is reported.
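
You can sanity-check these properties client-side before creating the index. A sketch, assuming the rows are `(id, parent, vector)` tuples fetched from the build table:

```python
def check_build_table(rows):
    # rows: list of (id, parent, vector) tuples from the build table.
    ids = [r[0] for r in rows]
    assert len(ids) == len(set(ids)), "id must be unique"
    by_id = {r[0]: r[1] for r in rows}

    def depth(i, seen=()):
        assert i not in seen, "cycle detected"
        parent = by_id[i]
        return 0 if parent is None else 1 + depth(parent, seen + (i,))

    # Leaves are ids that never appear as another row's parent; all leaves
    # must sit at the same depth from the root.
    parents = {r[1] for r in rows if r[1] is not None}
    assert parents <= set(ids), "parent must reference an existing id"
    leaf_depths = {depth(i) for i in ids if i not in parents}
    assert len(leaf_depths) == 1, "all leaves must be at the same depth"
```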

## Partitioning

To get started, here is a minimal code example for performing partitioning using [faiss](https://github.com/facebookresearch/faiss).

```python
from typing import List

import numpy
from faiss import Kmeans

def partition(samples: numpy.ndarray, lists: List[int]) -> List[numpy.ndarray]:
    # `samples` is an (n, dim) array; `lists` gives the number of cells per
    # level. Clustering runs bottom-up: first the finest level over the
    # samples, then each coarser level over the centroids of the level
    # below, ending with a single root centroid.
    samples = numpy.ascontiguousarray(samples, dtype=numpy.float32)  # faiss expects float32
    dim = samples.shape[1]
    results = []
    for k in list(reversed(lists)) + [1]:
        kmeans = Kmeans(
            dim,
            k,
            gpu=False,
            verbose=True,
            niter=10,
            seed=42,
            spherical=False,
        )
        kmeans.train(samples)
        results.append(kmeans.centroids)
        samples = kmeans.centroids
    return results
```

This computes the generators of all Voronoi cells across the levels. The parent-child relationships in the tree are determined by nearest-centroid (shortest-distance) assignment, whose details are omitted here. Any algorithm capable of generating multi-level Voronoi cells can be used, such as spherical K-means, balanced K-means, or GPU K-means.
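
As one possible way to fill in those omitted details, each centroid can be assigned to its nearest centroid in the level above:

```python
import numpy as np

def assign_parents(children, parents):
    # For each child centroid, return the index of the nearest parent
    # centroid under squared Euclidean distance; these indices become
    # the `parent` column values (after mapping indices to ids).
    d = ((children[:, None, :] - parents[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```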