Commit b4ebadc

feat: support multivec maxsim (#11)
* feat: support multivec maxsim
* accept suggestion from copilot review
* allow check if something is NULL
* align the maxsim op with upstream
* test with latest image
* add simple example, fix the ci check with latest image, fix logo url
* address comments

---------

Signed-off-by: Keming <kemingyang@tensorchord.ai>
1 parent 84b5b4c commit b4ebadc

13 files changed: +263 -34 lines changed

.github/workflows/check.yml

Lines changed: 15 additions & 1 deletion
@@ -3,8 +3,21 @@ name: Python Check
 on:
   push:
     branches: [ "main" ]
+    paths:
+      - 'vechord/**'
+      - 'examples/**'
+      - '.github/workflows/check.yml'
+      - 'pyproject.toml'
+      - 'Makefile'
   pull_request:
     branches: [ "main" ]
+    paths:
+      - 'vechord/**'
+      - 'examples/**'
+      - '.github/workflows/check.yml'
+      - 'pyproject.toml'
+      - 'Makefile'
+  workflow_dispatch:

 permissions:
   contents: read
@@ -26,8 +39,9 @@ jobs:
       - name: Test
         env:
           PYTEST_ADDOPTS: -s
+          IMAGE: kemingy/vechord:latest
         run: |
-          docker run --rm -d -p 5432:5432 --name vdb -e POSTGRES_PASSWORD=postgres --health-cmd="pg_isready -U postgres" --health-interval=1s --health-timeout=1s --health-retries=5 ghcr.io/tensorchord/vchord_bm25-postgres:pg17-v0.1.1
+          docker run --rm -d -p 5432:5432 --name vdb -e POSTGRES_PASSWORD=postgres --health-cmd="pg_isready -U postgres" --health-interval=1s --health-timeout=1s --health-retries=5 ${IMAGE}

           # Wait for the container to be healthy
           for i in {1..10}; do

.github/workflows/pages.yml

Lines changed: 0 additions & 4 deletions
@@ -7,15 +7,13 @@ on:
       - 'docs/**'
       - '.github/workflows/pages.yml'
       - 'examples/**'
-      - '**.md'
   push:
     branches: [ main ]
     paths:
       - 'vechord/**'
       - 'docs/**'
       - '.github/workflows/pages.yml'
       - 'examples/**'
-      - '**.md'
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:

@@ -35,8 +33,6 @@ jobs:
         with:
           enable-cache: true
           python-version: "3.12"
-      - name: Set up Rust
-        uses: dtolnay/rust-toolchain@stable
       - name: Install dependencies
         run: |
           make sync

README.md

Lines changed: 30 additions & 12 deletions
@@ -1,19 +1,13 @@
 <div align="center">
-<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="200" height="128" fill="none" viewBox="0 0 200 206">
-<defs><path id="a" stroke="#EAB711" d="M0-8h40"/></defs>
-<path stroke="#EAB711" stroke-width="16" d="M8 6v200M0 8h40M0 198h40M192 6v200"/>
-<use xlink:href="#a" stroke-width="16" transform="matrix(-1 0 0 1 200 16)"/>
-<use xlink:href="#a" stroke-width="16" transform="matrix(-1 0 0 1 200 206)"/>
-<path fill="#3776AB" d="m75.91 67.91 22.5 70.726h.863l22.545-70.727h21.818L111.545 161H86.182L54.045 67.91z"/>
-</svg>
+<img src="https://github.com/user-attachments/assets/7b2819bb-1a7d-4b84-9ff9-d0c4d5340da9">

 <p>

-[![Python Check](https://github.com/tensorchord/vechord/actions/workflows/check.yml/badge.svg)](https://github.com/tensorchord/vechord/actions/workflows/check.yml)
-[![Pages](https://github.com/tensorchord/vechord/actions/workflows/pages.yml/badge.svg)]( tensorchord.github.io/vechord/)
-![GitHub License](https://img.shields.io/github/license/tensorchord/vechord)
-![PyPI - Version](https://img.shields.io/pypi/v/vechord)
-[![Discord](https://img.shields.io/discord/974584200327991326?&logoColor=white&color=5865F2&style=flat&logo=discord&cacheSeconds=60)](https://discord.gg/KqswhpVgdU)
+[![Python Check][ci-check-badge]][ci-check-file]
+[![Pages][ci-page-badge]][document-link]
+![GitHub License][license-badge]
+![PyPI - Version][pypi-badge]
+[![Discord][discord-badge]][discord-link]

 </p>
 <p><em>Turn PostgreSQL into your search engine in a Pythonic way.</em></p>
@@ -25,12 +19,23 @@
 pip install vechord
```

+## Features
+
+- [x] vector search with [RaBitQ][rabitq] (powered by [VectorChord][vectorchord])
+- [x] multivec search with [WARP][xtr-warp] (powered by [VectorChord][vectorchord])
+- [x] keyword search with BM25 score (powered by [VectorChord-bm25][vectorchord-bm25])
+- [x] guarantee the data consistency with transaction (use the `VechordRegistry.run`)
+- [x] provide decorator to inject the data from/to the database
+- [x] auto-generate the web service
+
 ## Examples

+- [simple.py](examples/simple.py): for people that are familiar with specialized vector database APIs
 - [beir.py](examples/beir.py): the most flexible way to use the library (loading, indexing, querying and evaluation)
 - [web.py](examples/web.py): build a web application with from the defined tables and pipeline
 - [essay.py](examples/essay.py): extract the content from Paul Graham's essays and evaluate the search results from LLM generated queries
 - [contextual.py](examples/contextual.py): contextual retrieval example
+- [hybrid.py](examples/hybrid.py): hybrid search that rerank the results from vector search with keyword search

 ## Development
@@ -42,3 +47,16 @@ make sync
 # format the code
 make format
```
+
+[vectorchord]: https://github.com/tensorchord/VectorChord/
+[vectorchord-bm25]: https://github.com/tensorchord/VectorChord-bm25
+[rabitq]: https://github.com/gaoj0017/RaBitQ
+[xtr-warp]:https://github.com/jlscheerer/xtr-warp
+[ci-check-badge]: https://github.com/tensorchord/vechord/actions/workflows/check.yml/badge.svg
+[ci-check-file]: https://github.com/tensorchord/vechord/actions/workflows/check.yml
+[ci-page-badge]: https://github.com/tensorchord/vechord/actions/workflows/pages.yml/badge.svg
+[document-link]: https://tensorchord.github.io/vechord/
+[license-badge]: https://img.shields.io/github/license/tensorchord/vechord
+[pypi-badge]: https://img.shields.io/pypi/v/vechord
+[discord-badge]: https://img.shields.io/discord/974584200327991326?&logoColor=white&color=5865F2&style=flat&logo=discord&cacheSeconds=60
+[discord-link]: https://discord.gg/KqswhpVgdU

docs/source/api.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
```{eval-rst}
 .. automodule:: vechord.spec
-    :members: Vector,ForeignKey,PrimaryKeyAutoIncrease,Table
+    :members: Vector,ForeignKey,PrimaryKeyAutoIncrease,Table,Keyword
```
## Augment

examples/hybrid.py

Lines changed: 3 additions & 3 deletions
@@ -4,14 +4,14 @@
 import httpx

 from vechord.chunk import RegexChunker
-from vechord.embedding import SpacyDenseEmbedding
+from vechord.embedding import GeminiDenseEmbedding
 from vechord.registry import VechordRegistry
 from vechord.rerank import CohereReranker
 from vechord.spec import ForeignKey, Keyword, PrimaryKeyAutoIncrease, Table, Vector

 URL = "https://paulgraham.com/{}.html"
-DenseVector = Vector[96]
-emb = SpacyDenseEmbedding()
+DenseVector = Vector[768]
+emb = GeminiDenseEmbedding()
 chunker = RegexChunker(size=1024, overlap=0)
 reranker = CohereReranker()

examples/simple.py

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+from vechord.embedding import GeminiDenseEmbedding
+from vechord.registry import VechordRegistry
+from vechord.spec import PrimaryKeyAutoIncrease, Table, Vector
+
+DenseVector = Vector[768]
+
+
+class Document(Table, kw_only=True):
+    uid: PrimaryKeyAutoIncrease | None = None
+    title: str = ""
+    text: str
+    vec: DenseVector
+
+
+if __name__ == "__main__":
+    vr = VechordRegistry("simple", "postgresql://postgres:postgres@172.17.0.1:5432/")
+    vr.register([Document])
+    emb = GeminiDenseEmbedding()
+
+    # add a document
+    text = "my personal long note"
+    doc = Document(title="note", text=text, vec=DenseVector(emb.vectorize_chunk(text)))
+    vr.insert(doc)
+
+    # load
+    docs = vr.select_by(Document.partial_init(), limit=1)
+    print(docs)
+
+    # query
+    res = vr.search_by_vector(Document, emb.vectorize_query("note"), topk=1)
+    print(res)

examples/web.py

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ def chunk_document(uid: int, text: str) -> list[Chunk]:


 if __name__ == "__main__":
+    # this pipeline will be used in the web app, or you can run it with `vr.run()`
     vr.set_pipeline([load_document, chunk_document])
     app = create_web_app(vr)

tests/test_spec.py

Lines changed: 14 additions & 2 deletions
@@ -5,7 +5,7 @@
 import numpy as np
 import pytest

-from vechord.spec import ForeignKey, PrimaryKeyAutoIncrease, Table, Vector
+from vechord.spec import ForeignKey, Keyword, PrimaryKeyAutoIncrease, Table, Vector


 class Document(Table, kw_only=True):
@@ -20,9 +20,16 @@ class Chunk(Table, kw_only=True):
     doc_id: Annotated[int, ForeignKey[Document.uid]]
     text: str
     vec: Vector[128]
+    multivec: list[Vector[128]]
+    keyword: Keyword


-@pytest.mark.parametrize("table", [Document, Chunk])
+class Simple(Table):
+    uid: int
+    text: str
+
+
+@pytest.mark.parametrize("table", [Document, Chunk, Simple])
 def test_storage_cls_methods(table: type[Table]):
     assert table.name() == table.__name__.lower()
     assert "uid" in table.fields()
@@ -31,13 +38,18 @@ def test_storage_cls_methods(table: type[Table]):
     for field in t.fields():
         assert getattr(t, field) is msgspec.UNSET

+    # UNSET won't appear in the `todict` result
+    assert t.todict() == {}
+

 def test_table_cls_methods():
     assert Document.primary_key() == "uid", Document
     assert Chunk.primary_key() == "uid", Chunk

     assert Document.vector_column() is None
     assert Chunk.vector_column() == "vec"
+    assert Chunk.multivec_column() == "multivec"
+    assert Chunk.keyword_column() == "keyword"

 def find_schema_by_name(schema, name):
     for n, t in schema:
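
The new assertions in `test_storage_cls_methods` depend on two behaviours: a partially initialised table leaves every field as `msgspec.UNSET`, and `todict` drops those fields. The following is a rough, stdlib-only illustration of that idea. The `UNSET` sentinel, the `PartialRow` class, and its `todict` method are hypothetical stand-ins, not vechord's or msgspec's actual implementation.

```python
from dataclasses import dataclass, fields


# Hypothetical sentinel standing in for the role msgspec.UNSET plays in the tests.
class _Unset:
    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


@dataclass
class PartialRow:
    """Hypothetical row type in which every field defaults to UNSET."""

    uid: object = UNSET
    text: object = UNSET

    def todict(self) -> dict:
        # Keep only the fields that were explicitly provided.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not UNSET
        }


empty = PartialRow()             # nothing set
by_id = PartialRow(uid=1)        # one field set
no_text = PartialRow(text=None)  # None is a real value, not "unset"
```

In this sketch `None` survives `todict` while `UNSET` does not, which is the distinction a caller needs in order to express "this column is NULL" separately from "no condition on this column".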

tests/test_table.py

Lines changed: 38 additions & 1 deletion
@@ -38,6 +38,12 @@ class Chunk(Table, kw_only=True):
     keyword: Keyword


+class Sentence(Table, kw_only=True):
+    uid: PrimaryKeyAutoIncrease | None = None
+    text: str
+    vector: list[DenseVector]
+
+
 @pytest.fixture(name="registry")
 def fixture_registry(request):
     registry = VechordRegistry(request.node.name, TEST_POSTGRES)
@@ -60,6 +66,10 @@ def test_insert_select_remove(registry):
     assert inserted[0].text == "hello world"
     assert inserted[1].text == "hello there"

+    # select with limit
+    one = registry.select_by(Document.partial_init(), limit=1)
+    assert len(one) == 1
+
     # select by id
     first = registry.select_by(Document.partial_init(uid=1))
     assert len(first) == 1
@@ -113,15 +123,42 @@ def create_chunk(uid: int, text: str) -> list[Chunk]:
     chunks = registry.select_by(Chunk.partial_init())
     assert len(chunks) == len(text.split())

-    # test search
     topk = 3
+    # vector search
     vec_res = registry.search_by_vector(Chunk, gen_vector(), topk=topk)
     assert len(vec_res) == topk
     assert all(chunk.text in text for chunk in vec_res)
+    # keyword search
     text_res = registry.search_by_keyword(Chunk, "vector", topk=topk)
     assert len(text_res) == 1


+@pytest.mark.db
+def test_multi_vec_maxsim(registry):
+    registry.register([Sentence])
+
+    @registry.inject(output=Sentence)
+    def create_sentence(text: str) -> Sentence:
+        return Sentence(
+            text=text, vector=[gen_vector() for _ in range(len(text.split()))]
+        )
+
+    text = "the quick brown fox jumps over the lazy dog"
+    num = 32
+    for _ in range(num):
+        create_sentence(text)
+    sentence = registry.select_by(Sentence.partial_init())
+    assert len(sentence) == num
+    assert len(sentence[0].vector) == len(text.split())
+
+    topk = 3
+    for dim in range(1, 10):
+        res = registry.search_by_multivec(
+            Sentence, [gen_vector() for _ in range(dim)], topk=topk
+        )
+        assert len(res) == topk
+
+
 @pytest.mark.db
 def test_pipeline(registry):
     @registry.inject(output=Document)

vechord/client.py

Lines changed: 56 additions & 2 deletions
@@ -91,6 +91,23 @@ def create_vector_index(self, name: str, column: str):
                 )
             )

+    def create_multivec_index(self, name: str, column: str):
+        config = "build.internal.lists = []"
+        with self.transaction():
+            cursor = self.get_cursor()
+            cursor.execute(
+                sql.SQL(
+                    "CREATE INDEX IF NOT EXISTS {index} ON "
+                    "{table} USING vchordrq ({column} vector_maxsim_ops) WITH "
+                    "(options = $${config}$$);"
+                ).format(
+                    table=sql.Identifier(f"{self.ns}_{name}"),
+                    index=sql.Identifier(f"{self.ns}_{name}_{column}_multivec_idx"),
+                    column=sql.Identifier(column),
+                    config=sql.SQL(config),
+                )
+            )
+
     def _keyword_index_name(self, name: str, column: str):
         return f"{self.ns}_{name}_{column}_bm25_idx"

@@ -114,6 +131,7 @@ def select(
         raw_columns: Sequence[str],
         kvs: Optional[dict[str, Any]] = None,
         from_buffer: bool = False,
+        limit: Optional[int] = None,
     ):
         """Select from db table with optional key-value condition or from un-committed
         transaction buffer.
@@ -129,12 +147,18 @@ def select(
             )
             if kvs:
                 condition = sql.SQL(" AND ").join(
-                    sql.SQL("{} = {}").format(sql.Identifier(col), sql.Placeholder(col))
-                    for col in kvs
+                    sql.SQL("{} IS NULL").format(sql.Identifier(col))
+                    if val is None
+                    else sql.SQL("{} = {}").format(
+                        sql.Identifier(col), sql.Placeholder(col)
+                    )
+                    for col, val in kvs.items()
                 )
                 query += sql.SQL(" WHERE {condition}").format(condition=condition)
             elif from_buffer:
                 query += sql.SQL(" WHERE xmin = pg_current_xact_id()::xid;")
+            if limit:
+                query += sql.SQL(" LIMIT {}").format(sql.Literal(limit))
             cursor.execute(query, kvs)
             return [row for row in cursor.fetchall()]
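
The `select` change above swaps `col = %(col)s` for `col IS NULL` whenever the filter value is `None`, because an equality comparison against NULL never matches in SQL, and then appends an optional `LIMIT`. A minimal sketch of the same branching follows. It uses plain strings rather than `psycopg.sql` composables so that it runs without a database driver, and `build_select` is a hypothetical helper, not part of vechord. It is for illustration only and should not be used to build real queries from untrusted identifiers.

```python
from typing import Any, Optional


def build_select(
    table: str,
    kvs: Optional[dict[str, Any]] = None,
    limit: Optional[int] = None,
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper mirroring the shape of the query built in `select`."""
    query = f'SELECT * FROM "{table}"'
    params: dict[str, Any] = {}
    if kvs:
        parts = []
        for col, val in kvs.items():
            if val is None:
                # `col = NULL` never matches in SQL, so emit IS NULL instead.
                parts.append(f'"{col}" IS NULL')
            else:
                # Named placeholder; the value is passed separately.
                parts.append(f'"{col}" = %({col})s')
                params[col] = val
        query += " WHERE " + " AND ".join(parts)
    if limit:
        # Truthiness check, as in the diff: None and 0 both mean "no LIMIT".
        query += f" LIMIT {int(limit)}"
    return query, params


query, params = build_select("simple_document", {"title": None, "uid": 1}, limit=1)
print(query)
print(params)
```

Unlike the diff, which passes the whole `kvs` mapping to `cursor.execute`, this sketch only forwards the values that actually have a placeholder.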

@@ -199,6 +223,36 @@ def query_vec(
             )
             return [row for row in cursor.fetchall()]

+    def query_multivec(  # noqa: PLR0913
+        self,
+        name: str,
+        multivec_col: str,
+        vec: np.ndarray,
+        max_maxsim_tuples: int,
+        return_fields: list[str],
+        topk: int = 10,
+    ):
+        columns = sql.SQL(", ").join(map(sql.Identifier, return_fields))
+        with self.transaction():
+            cursor = self.get_cursor()
+            cursor.execute("SET vchordrq.probes = '';")
+            cursor.execute(
+                sql.SQL("SET vchordrq.max_maxsim_tuples = {};").format(
+                    sql.Literal(max_maxsim_tuples)
+                )
+            )
+            cursor.execute(
+                sql.SQL(
+                    "SELECT {columns} FROM {table} ORDER BY {multivec_col} @# %s LIMIT %s;"
+                ).format(
+                    table=sql.Identifier(f"{self.ns}_{name}"),
+                    columns=columns,
+                    multivec_col=sql.Identifier(multivec_col),
+                ),
+                (vec, topk),
+            )
+            return [row for row in cursor.fetchall()]
+
     def query_keyword(  # noqa: PLR0913
         self,
         name: str,
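
`query_multivec` orders rows with the `@#` operator over a `vector_maxsim_ops` index. This commit does not show how that operator scores a row. The sketch below assumes the usual late-interaction MaxSim definition: for each query vector take the best dot product against any document vector, then sum those maxima. The exact formula, sign convention, and any normalisation used by VectorChord are assumptions here, and `dot` and `maxsim` are hypothetical helper names.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any document vector."""
    return sum(max(dot(q, d) for d in doc) for q in query)


# Two query vectors scored against two small "documents" (lists of vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.9]]

scores = {"a": maxsim(query, doc_a), "b": maxsim(query, doc_b)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Under this definition the query and the document do not need the same number of vectors, which fits `test_multi_vec_maxsim` searching with 1 to 9 query vectors against sentences that each store 9.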
