|
| 1 | +# Guide |
| 2 | + |
| 3 | +## Define the table |
| 4 | + |
| 5 | +Inherit from the {py:class}`~vechord.spec.Table` class and define the columns as attributes with |
| 6 | +type hints. Advanced configuration can be done with {py:class}`typing.Annotated`. |
| 7 | + |
| 8 | +### Choose a primary key |
| 9 | + |
| 10 | +- {py:class}`~vechord.spec.PrimaryKeyAutoIncrease`: generate an auto-incrementing integer as the primary key |
| 11 | +- {py:class}`~vechord.spec.PrimaryKeyUUID`: use `uuid7` as the primary key, suitable for distributed systems or general purposes |
| 12 | +- `int` or `str`: insert the key manually |
| 13 | + |
| 14 | +### Vector and Keyword search |
| 15 | + |
| 16 | +- {py:class}`~vechord.spec.Vector`: define a vector column with a fixed number of dimensions. It's recommended to define an alias such as `DenseVector = Vector[768]` and use it in all tables. This accepts `list[float]` or `numpy.ndarray` as the input. For now, only the `f32` type is supported. |
| 17 | +  - for a multivector, use `list[DenseVector]` as the type hint |
| 18 | +- {py:class}`~vechord.spec.Keyword`: define a keyword column whose `str` value is tokenized and stored as the `bm25vector` type. This accepts `str` as the input. |
| 19 | + |
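Since only `f32` is supported, it may help to see what that means for ordinary Python floats, which are 64-bit. The following is a minimal standard-library sketch, not part of vechord: the helper names `to_f32` and `check_vector` are hypothetical, and it assumes values are rounded to the nearest 32-bit float when stored.

```python
import struct

DIM = 768  # matches the `Vector[768]` example above


def to_f32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def check_vector(values: list[float], dim: int = DIM) -> list[float]:
    """Hypothetical helper: verify the dimension and round every entry to f32."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return [to_f32(v) for v in values]


vec = check_vector([0.1] * DIM)
print(len(vec))       # number of dimensions kept
print(vec[0] == 0.1)  # False: 0.1 is not exactly representable in f32
```

In practice this means values read back from a vector column can differ slightly from the Python floats that were inserted, so compare them with a tolerance rather than with `==`.
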
| 20 | +### Configure the Index |
| 21 | + |
| 22 | +The default index is suitable for small datasets (less than 100k). For larger datasets, you can |
| 23 | +customize the index configuration by using the {py:class}`typing.Annotated` with: |
| 24 | + |
| 25 | +- {py:class}`~vechord.spec.VectorIndex`: configure the `lists` and `distance` operators. |
| 26 | +- {py:class}`~vechord.spec.MultiVectorIndex`: configure the `lists`. |
| 27 | + |
| 28 | +```python |
| 29 | +DenseVector = Vector[768] |
| 30 | + |
| 31 | +class MyTable(Table, kw_only=True): |
| 32 | +    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory) |
| 33 | +    vec: Annotated[DenseVector, VectorIndex(lists=128)] |
| 34 | +    text: str |
| 35 | +``` |
| 36 | + |
| 37 | +:::{tip} |
| 38 | +If you need to use a customized tokenizer, please refer to the [VectorChord-bm25 document](https://github.com/tensorchord/VectorChord-bm25/?tab=readme-ov-file#more-examples). |
| 39 | +::: |
| 40 | + |
| 41 | +### Use the foreign key to link tables |
| 42 | + |
| 43 | +By default, the foreign key will add `REFERENCES ON DELETE CASCADE`. |
| 44 | + |
| 45 | +```python |
| 46 | +class SubTable(Table, kw_only=True): |
| 47 | +    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory) |
| 48 | +    text: str |
| 49 | +    mytable_uid: Annotated[UUID, ForeignKey[MyTable.uid]] |
| 50 | +``` |
| 51 | + |
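To illustrate what `ON DELETE CASCADE` implies, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module instead of PostgreSQL and hand-written table definitions instead of the schema vechord generates, so the table and column names are illustrative assumptions only. The cascade semantics are the same: deleting a parent row also deletes the rows that reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE mytable (uid INTEGER PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE subtable ("
    "uid INTEGER PRIMARY KEY, text TEXT, "
    "mytable_uid INTEGER REFERENCES mytable(uid) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO mytable VALUES (1, 'parent')")
conn.execute("INSERT INTO subtable VALUES (10, 'child', 1)")

# Deleting the parent row removes the child row that references it.
conn.execute("DELETE FROM mytable WHERE uid = 1")
remaining = conn.execute("SELECT COUNT(*) FROM subtable").fetchone()[0]
print(remaining)  # 0
```
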
| 52 | +### JSONB |
| 53 | + |
| 54 | +To store a JSONB column, define it like this: |
| 55 | + |
| 56 | +```python |
| 57 | +from psycopg.types.json import Jsonb |
| 58 | + |
| 59 | +class MyJsonTable(Table, kw_only=True): |
| 60 | +    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory) |
| 61 | +    json: Jsonb |
| 62 | + |
| 63 | +item = MyJsonTable(json=Jsonb({"key": "value"})) |
| 64 | +``` |
| 65 | + |
| 66 | +## Inject with decorator |
| 67 | + |
| 68 | +The decorator {py:meth}`~vechord.registry.VechordRegistry.inject` can be used to load the |
| 69 | +function arguments from the database and dump the return values to the database. |
| 70 | + |
| 71 | +To use this decorator, you need to specify at least one of the `input` or `output` with |
| 72 | +the table class you have defined. |
| 73 | + |
| 74 | +- `input=Type[Table]`: loads the specified columns from the database and injects the data into the decorated function's arguments |
| 75 | +  - if `input=None`, the caller must pass the arguments manually |
| 76 | +- `output=Type[Table]`: dumps the return values to the database (the return type must also be annotated with the provided table class or a list of the table class) |
| 77 | +  - if `output=None`, you can get the return value from the function call |
| 78 | + |
| 79 | +The following example uses the pre-defined tables: |
| 80 | + |
| 81 | +- {py:class}`~vechord.spec.DefaultDocument` |
| 82 | +- {py:func}`~vechord.spec.create_chunk_with_dim` |
| 83 | + |
| 84 | +```python |
| 85 | +from uuid import UUID |
| 86 | +import httpx |
| 87 | +from vechord.registry import VechordRegistry |
| 88 | +from vechord.extract import SimpleExtractor |
| 89 | +from vechord.embedding import GeminiDenseEmbedding |
| 90 | +from vechord.spec import DefaultDocument, create_chunk_with_dim |
| 91 | + |
| 92 | +DefaultChunk = create_chunk_with_dim(768) |
| 93 | +vr = VechordRegistry(namespace="test", url="postgresql://postgres:postgres@127.0.0.1:5432/") |
| 94 | +vr.register([DefaultDocument, DefaultChunk]) |
| 95 | +extractor = SimpleExtractor() |
| 96 | +emb = GeminiDenseEmbedding() |
| 97 | + |
| 98 | + |
| 99 | +@vr.inject(output=DefaultDocument) |
| 100 | +def add_document(url: str) -> DefaultDocument: |
| 101 | +    with httpx.Client() as client: |
| 102 | +        resp = client.get(url) |
| 103 | +    text = extractor.extract_html(resp.text) |
| 104 | +    return DefaultDocument(title=url, text=text) |
| 105 | + |
| 106 | + |
| 107 | +@vr.inject(input=DefaultDocument, output=DefaultChunk) |
| 108 | +def add_chunk(uid: UUID, text: str) -> list[DefaultChunk]: |
| 109 | +    chunks = text.split("\n") |
| 110 | +    return [DefaultChunk(doc_id=uid, vec=emb.vectorize_chunk(t), text=t) for t in chunks] |
| 111 | + |
| 112 | + |
| 113 | +for url in ["https://paulgraham.com/best.html", "https://paulgraham.com/read.html"]: |
| 114 | +    add_document(url) |
| 115 | +add_chunk() |
| 116 | +``` |
| 117 | + |
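To make the data flow of `input` and `output` more concrete, the toy model below mimics it with an in-memory dictionary in place of the database. It is a sketch under stated assumptions and not vechord's implementation: `STORE`, `Doc`, `Chunk`, and this standalone `inject` are all hypothetical, and the sketch assumes that the columns to load are chosen by matching parameter names and that nothing is returned when `output` is set.

```python
import inspect
from dataclasses import dataclass
from functools import wraps

STORE: dict[type, list] = {}  # hypothetical in-memory stand-in for the database


def inject(input=None, output=None):
    """Toy decorator: load arguments from STORE and save return values to it."""

    def decorator(func):
        params = list(inspect.signature(func).parameters)

        @wraps(func)
        def wrapper(*args, **kwargs):
            if input is None:
                # No input table: the caller passes the arguments manually.
                results = [func(*args, **kwargs)]
            else:
                # One call per stored row, passing only the attributes
                # named by the function's parameters.
                rows = list(STORE.get(input, []))
                results = [func(**{p: getattr(row, p) for p in params}) for row in rows]
            if output is None:
                # No output table: hand the return value(s) back to the caller.
                return results[0] if input is None else results
            for res in results:
                items = res if isinstance(res, list) else [res]
                STORE.setdefault(output, []).extend(items)
            return None

        return wrapper

    return decorator


@dataclass
class Doc:
    uid: int
    text: str


@dataclass
class Chunk:
    doc_id: int
    text: str


@inject(output=Doc)
def add_doc(uid: int, text: str) -> Doc:
    return Doc(uid=uid, text=text)


@inject(input=Doc, output=Chunk)
def add_chunks(uid: int, text: str) -> list[Chunk]:
    return [Chunk(doc_id=uid, text=t) for t in text.split("\n")]


add_doc(1, "first line\nsecond line")
add_chunks()  # no arguments: they are loaded from the stored Doc rows
print(STORE[Chunk])
```

As in the real example above, the second function is called without arguments because its inputs come from the rows written by the first one.
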
| 118 | +### Select/Insert/Delete |
| 119 | + |
| 120 | +The registry also provides methods to select, insert, and delete data in the database. |
| 121 | + |
| 122 | +- {py:meth}`~vechord.registry.VechordRegistry.select_by` |
| 123 | +- {py:meth}`~vechord.registry.VechordRegistry.insert` |
| 124 | +- {py:meth}`~vechord.registry.VechordRegistry.copy_bulk` |
| 125 | +- {py:meth}`~vechord.registry.VechordRegistry.remove_by` |
| 126 | + |
| 127 | +```python |
| 128 | +docs = vr.select_by(DefaultDocument.partial_init()) |
| 129 | +vr.insert(DefaultDocument(text="hello world")) |
| 130 | +vr.copy_bulk([DefaultDocument(text="hello world"), DefaultDocument(text="hello vector")]) |
| 131 | +vr.remove_by(DefaultDocument.partial_init()) |
| 132 | +``` |
| 133 | + |
| 134 | +## Transaction |
| 135 | + |
| 136 | +Use the {py:class}`~vechord.registry.VechordPipeline` to run multiple functions in a transaction. |
| 137 | + |
| 138 | +This also guarantees that the decorated functions only load the data from the current |
| 139 | +transaction instead of the whole table, so users can focus on the data processing part. |
| 140 | + |
| 141 | +```python |
| 142 | +pipeline = vr.create_pipeline([add_document, add_chunk]) |
| 143 | +pipeline.run("https://paulgraham.com/best.html") |
| 144 | +``` |
| 145 | + |
| 146 | +## Search |
| 147 | + |
| 148 | +We provide search interfaces for different types of queries: |
| 149 | + |
| 150 | +- {py:meth}`~vechord.registry.VechordRegistry.search_by_vector` |
| 151 | +- {py:meth}`~vechord.registry.VechordRegistry.search_by_keyword` |
| 152 | +- {py:meth}`~vechord.registry.VechordRegistry.search_by_multivec` |
| 153 | + |
| 154 | +```python |
| 155 | +vr.search_by_vector(DefaultChunk, emb.vectorize_query("hey"), topk=10) |
| 156 | +``` |
| 157 | + |
| 158 | +## Access the cursor |
| 159 | + |
| 160 | +If you need to change some settings or use the cursor directly: |
| 161 | + |
| 162 | +```python |
| 163 | +vr.client.get_cursor().execute("SET vchordrq.probes = 100;") |
| 164 | +``` |