Commits
243 commits
9b0752f
fix include path for usearch
cpegeric Feb 5, 2026
0c2f15b
fix ut
cpegeric Feb 5, 2026
7da35f9
add stream
cpegeric Feb 6, 2026
6818845
add worker
cpegeric Feb 7, 2026
ba0b62e
gofmt
cpegeric Feb 7, 2026
f6e3cdc
worker pool
cpegeric Feb 7, 2026
863c93c
remove init()
cpegeric Feb 7, 2026
1d9b72e
close channel in Stop
cpegeric Feb 7, 2026
75c8ae6
sigterm and sigint
cpegeric Feb 7, 2026
2d42d79
bug fix sigterm thread not stop
cpegeric Feb 7, 2026
4c25e19
sigterm test case
cpegeric Feb 7, 2026
e2f98ef
keepalive
cpegeric Feb 7, 2026
a29b0d3
cleanup
cpegeric Feb 7, 2026
acd31bd
stopfn
cpegeric Feb 7, 2026
2f0faef
brute-force with cuvs worker
cpegeric Feb 7, 2026
0d0c771
two ivf index will crash
cpegeric Feb 7, 2026
2ad2267
better error handling
cpegeric Feb 9, 2026
a7e3bce
bug fix check error
cpegeric Feb 9, 2026
7766829
better error handling
cpegeric Feb 9, 2026
a3bdddd
error handling
cpegeric Feb 9, 2026
9bef5ba
task result store use per-job channel to wait
cpegeric Feb 10, 2026
0cba40a
bug fix test
cpegeric Feb 10, 2026
87abd66
setting nthread to brute-force search
cpegeric Feb 10, 2026
1ae4c33
Merge branch 'main' into gpu_cuvsworker
cpegeric Feb 10, 2026
ee4cff7
always return result first even stopped
cpegeric Feb 10, 2026
be5a74e
cleanup
cpegeric Feb 10, 2026
3e172c3
cuvs must be LockOSThread with go routine
cpegeric Feb 11, 2026
2ceac55
gpu clusterer with cuvsworker
cpegeric Feb 11, 2026
3c39129
update
cpegeric Feb 11, 2026
9ca3046
disable gpu brute force index
cpegeric Feb 11, 2026
4e00002
bug fix
cpegeric Feb 12, 2026
1941d88
bug fix
cpegeric Feb 12, 2026
27b20d6
add cuvs cpp
cpegeric Feb 13, 2026
d2af75d
relocation
cpegeric Feb 13, 2026
51ecdb3
destructor
cpegeric Feb 13, 2026
6f2e395
change namespace
cpegeric Feb 13, 2026
4041cb1
change namespace
cpegeric Feb 13, 2026
973cc2a
change namespace
cpegeric Feb 13, 2026
5849207
suppress compiler warning
cpegeric Feb 13, 2026
3b47589
cleanup
cpegeric Feb 13, 2026
3a85f56
flatten vector
cpegeric Feb 13, 2026
38885d0
search with flattened vector
cpegeric Feb 16, 2026
2ee37e8
flattened vector in hostdataset
cpegeric Feb 16, 2026
be5efd7
shared mutex
cpegeric Feb 16, 2026
66a4f34
go and c interface
cpegeric Feb 16, 2026
267e516
bug fix shared mutex in Submit
cpegeric Feb 16, 2026
f149703
generate .a and .so
cpegeric Feb 16, 2026
349e415
merge fix
cpegeric Mar 2, 2026
c75c1ec
brute force index
cpegeric Mar 2, 2026
ddb2da8
refactor with flattened array
cpegeric Mar 2, 2026
a8d62a3
refactor cusv_worker
cpegeric Mar 2, 2026
39abb25
errmsg
cpegeric Mar 2, 2026
0cda120
errmsg
cpegeric Mar 2, 2026
aef6a37
ivfflat
cpegeric Mar 2, 2026
f355f6f
sync
cpegeric Mar 2, 2026
810607e
sharded ivfflat index
cpegeric Mar 2, 2026
348a87a
bug fix raft resource
cpegeric Mar 2, 2026
ab05746
sharded ivfflat index
cpegeric Mar 2, 2026
90abc79
add tests
cpegeric Mar 2, 2026
05610d0
helper
cpegeric Mar 2, 2026
74d6203
cagra
cpegeric Mar 2, 2026
3aacb04
support multiple data type
cpegeric Mar 2, 2026
6401eab
cleanup
cpegeric Mar 2, 2026
ac422f6
cleanup
cpegeric Mar 2, 2026
44b0a31
convert float32 to float16
cpegeric Mar 2, 2026
9096492
better float32 to float16 convsersion
cpegeric Mar 2, 2026
dac3337
extend and merge for cagra
cpegeric Mar 2, 2026
6808488
change package cuvs to mocuvs
cpegeric Mar 2, 2026
fbf0840
runtime.KeepAlive
cpegeric Mar 2, 2026
9945fcb
rename function to lowercase
cpegeric Mar 2, 2026
34eddc3
rename function to lowercase
cpegeric Mar 2, 2026
e14eac2
merge sharded and single gpu index
cpegeric Mar 2, 2026
1ba6f93
better checking snmg_handle
cpegeric Mar 2, 2026
1f02e69
rename gpu_ivf_flat_index to gpu_ivf_flat
cpegeric Mar 2, 2026
01d7e1e
rename
cpegeric Mar 2, 2026
24250b9
kmeans
cpegeric Mar 3, 2026
4879e9a
balanced kmeans
cpegeric Mar 3, 2026
4b48a60
build_params and search_params
cpegeric Mar 3, 2026
417e6bd
cpp cuvs_types
cpegeric Mar 3, 2026
34e1a98
add params
cpegeric Mar 3, 2026
f6e9616
include cpp for header
cpegeric Mar 3, 2026
cb171d3
remove ../cpp
cpegeric Mar 3, 2026
471f3f3
relocate
cpegeric Mar 3, 2026
1540501
fix test error
cpegeric Mar 3, 2026
77052b3
integrate to use cgo cuvs index
cpegeric Mar 3, 2026
54894e9
add tests
cpegeric Mar 3, 2026
8b68171
compile
cpegeric Mar 3, 2026
7ebe95a
copy .so
cpegeric Mar 4, 2026
c09ee19
rename to libmo_c
cpegeric Mar 4, 2026
8823616
fix linker in darwin
cpegeric Mar 4, 2026
50a2266
bug fix save the dataset pointer and only delete at the end. index o…
cpegeric Mar 4, 2026
ae27f31
Merge branch 'main' into gpu_cuvsworker
cpegeric Mar 4, 2026
e53a3a6
update distance type
cpegeric Mar 4, 2026
7815430
use moerr
cpegeric Mar 4, 2026
260ae1a
benchmark for bruteforce index
cpegeric Mar 4, 2026
3994578
enable gpu brute force index
cpegeric Mar 4, 2026
3adb0de
fix Makefile
cpegeric Mar 4, 2026
1ec5323
default params
cpegeric Mar 4, 2026
5f0cf17
default params in test
cpegeric Mar 4, 2026
ef0c7cd
update README
cpegeric Mar 4, 2026
1163c48
add license and comment
cpegeric Mar 4, 2026
da2dfce
license
cpegeric Mar 4, 2026
974cbdf
bug fix revert to lmo
cpegeric Mar 4, 2026
29e2e3f
remove test
cpegeric Mar 5, 2026
45c498c
ld library path
cpegeric Mar 5, 2026
1132bc5
add rapids_logger
cpegeric Mar 5, 2026
9ccf911
remove cuvs from async worker pool
cpegeric Mar 6, 2026
a402856
async worker pool
cpegeric Mar 6, 2026
bab3b88
check nil callback function
cpegeric Mar 6, 2026
2e40c5b
darwin support
cpegeric Mar 6, 2026
24f7048
remove cuvs
cpegeric Mar 6, 2026
222b56b
remove cuvs
cpegeric Mar 6, 2026
5c1e7cc
bug fix ivfflat search slow table scan
cpegeric Mar 6, 2026
6327006
sample
Mar 6, 2026
c724dee
lmo
cpegeric Mar 9, 2026
7d9fb3a
cherry pick
cpegeric Mar 9, 2026
09d02c5
Merge branch 'main' into ivf_escapeheap
mergify[bot] Mar 9, 2026
b035516
merge fix
cpegeric Mar 9, 2026
18a65ff
update gpu
cpegeric Mar 9, 2026
b49ea80
bug fix kmeans
cpegeric Mar 9, 2026
e47845e
fix select count with version
cpegeric Mar 9, 2026
5f23108
merge fix
cpegeric Mar 9, 2026
f79efb3
zero out the memory before put to sync.Pool
cpegeric Mar 9, 2026
3b7254f
sca test
cpegeric Mar 9, 2026
3675de6
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 9, 2026
9af53b8
bug fix u16 pool
cpegeric Mar 9, 2026
843ba67
limit sample percent between 0 and 100
cpegeric Mar 9, 2026
226e2c5
Merge branch 'gpu_cuvsworker' of github.com:cpegeric/matrixone into g…
cpegeric Mar 9, 2026
1af76c5
limit sample percent between 0 and 100
cpegeric Mar 9, 2026
7cc5bb3
go fmt
cpegeric Mar 9, 2026
d4630e2
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 9, 2026
b69d512
go fmt
cpegeric Mar 9, 2026
caa06ba
sca
cpegeric Mar 9, 2026
deb4202
revise test
cpegeric Mar 9, 2026
0a4574e
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 9, 2026
60dc973
fix make ut
cpegeric Mar 9, 2026
81b05bd
ld library path
cpegeric Mar 9, 2026
5079e67
async worker pool race condition
cpegeric Mar 9, 2026
55402f3
run_ut.sh
cpegeric Mar 10, 2026
dec4d7f
use CAllocator
cpegeric Mar 10, 2026
473642f
default to use go brute force index
cpegeric Mar 10, 2026
22ea33a
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 10, 2026
2c3f367
gpu remove sync.pool
cpegeric Mar 10, 2026
84ceb5f
remove partial
cpegeric Mar 10, 2026
ae54552
merge fix
cpegeric Mar 10, 2026
e09b50f
go fmt
cpegeric Mar 10, 2026
64cdb1a
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 10, 2026
82c0d89
merge
cpegeric Mar 10, 2026
4b8cc4c
cleanup malloc
cpegeric Mar 10, 2026
f43965f
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 10, 2026
f6e2b60
remove signal handler from C++
cpegeric Mar 10, 2026
88fc7d1
fast max heap
cpegeric Mar 10, 2026
96d8387
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 10, 2026
cce3bd0
go fmt
cpegeric Mar 10, 2026
877ff41
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 10, 2026
7d3de29
go fmt
cpegeric Mar 11, 2026
1c9151b
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 11, 2026
43751b2
Merge branch 'main' into ivf_escapeheap
cpegeric Mar 11, 2026
bb50776
Merge branch 'main' into gpu_cuvsworker
cpegeric Mar 11, 2026
50564fb
split by dataset
cpegeric Mar 11, 2026
ee7a75d
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 11, 2026
f996a50
Revert "split by dataset"
cpegeric Mar 11, 2026
227e82f
Merge branch 'ivf_escapeheap' into gpu_cuvsworker
cpegeric Mar 11, 2026
a6022c8
single thread run in current thread
cpegeric Mar 11, 2026
ea90d7c
add centroid search test
cpegeric Mar 11, 2026
b3ee5fc
ivf_pq
cpegeric Mar 12, 2026
fbf907b
ivf_pq load from datafile
cpegeric Mar 12, 2026
5b1242f
support float32 -> half, int8 and half -> int8
cpegeric Mar 12, 2026
337cf9c
start and then load
cpegeric Mar 12, 2026
43ae027
cagra and ivf_flat start and load
cpegeric Mar 12, 2026
c2978ae
remove row_offset
cpegeric Mar 12, 2026
584583a
brute force index and kmeans
cpegeric Mar 12, 2026
6077344
sync after quanitzer train
cpegeric Mar 12, 2026
d518bda
better quantizer and search float with auto quantization
cpegeric Mar 12, 2026
4296809
bug fix sync_stream
cpegeric Mar 12, 2026
15ea845
bug fix memory leak
cpegeric Mar 12, 2026
2016983
len and cap
cpegeric Mar 13, 2026
a73b50f
rename Load to Build
cpegeric Mar 13, 2026
0fbc37b
brute force index in blockio reader
cpegeric Mar 13, 2026
8a71f1b
Merge branch 'main' into gpu_ivfpq
cpegeric Mar 13, 2026
abeb726
adhoc brute force search in gpu
cpegeric Mar 13, 2026
56f4084
thread_local resource
cpegeric Mar 13, 2026
47c5c60
flattened adhoc search
cpegeric Mar 13, 2026
ef6d255
flattend
cpegeric Mar 13, 2026
08fc2e4
revert to main
cpegeric Mar 16, 2026
b5060c3
merge fix
cpegeric Mar 16, 2026
4bb6fc3
quantizer
cpegeric Mar 16, 2026
15047f1
bug fix misalign memory
cpegeric Mar 16, 2026
ae15dc3
get/set quantizer
cpegeric Mar 16, 2026
38ea4a9
pairwise
cpegeric Mar 16, 2026
9be0c39
pairwise
cpegeric Mar 16, 2026
4ede16e
hybrid
cpegeric Mar 16, 2026
a816435
pairwise distance in blockio/read.go
cpegeric Mar 16, 2026
622dc5e
bvt fix
cpegeric Mar 16, 2026
dde7275
cagra merged index need explicit call Start() before search
cpegeric Mar 17, 2026
588f756
Merge branch 'gpu_ivfpq' of github.com:cpegeric/matrixone into gpu_ivfpq
cpegeric Mar 17, 2026
d4cd5b5
remove compiler warning
cpegeric Mar 17, 2026
94ab8b4
benchmark
cpegeric Mar 17, 2026
922472c
optimize for replicated mode
cpegeric Mar 17, 2026
d45a0b5
dynamic batching
cpegeric Mar 17, 2026
2849c87
run false and then true
cpegeric Mar 17, 2026
d2a1833
run old path when useBatching = false
cpegeric Mar 17, 2026
bcc938e
go fmt
cpegeric Mar 17, 2026
4b428cc
set_use_batch and set_per_thread_device
cpegeric Mar 17, 2026
1347342
index_base class
cpegeric Mar 17, 2026
016702a
introduce main thread queue to make sure build in main thread
cpegeric Mar 17, 2026
8db3f72
bug fix thread safe queue with capacity limit
cpegeric Mar 18, 2026
db4b2e9
info
cpegeric Mar 18, 2026
c892823
bug fix thread safe queue stopped
cpegeric Mar 18, 2026
b28d746
build_internal refactor
cpegeric Mar 18, 2026
04b0bab
add chunk benchmark
cpegeric Mar 18, 2026
d306e77
tests for cuvs_worker
cpegeric Mar 18, 2026
01b9093
thread safe queue stress test
cpegeric Mar 18, 2026
4158505
search_batch_internal
cpegeric Mar 18, 2026
5f21f0e
clean up include headers
cpegeric Mar 18, 2026
559399b
get centroids return []T
cpegeric Mar 18, 2026
b587cc6
recall rate
cpegeric Mar 18, 2026
87fe7ea
recall rate shown
cpegeric Mar 18, 2026
1fc6dbc
info in JSON
cpegeric Mar 18, 2026
8b26d2b
info in JSON
cpegeric Mar 18, 2026
e01bc66
info test
cpegeric Mar 18, 2026
0533f7f
comment out Info
cpegeric Mar 18, 2026
2305f42
go fmt
cpegeric Mar 18, 2026
2fe384a
readme
cpegeric Mar 18, 2026
64c9acb
auto quantization
cpegeric Mar 18, 2026
172de9b
add blog.md
cpegeric Mar 19, 2026
1b63918
Merge branch 'main' into gpu_ivfpq
cpegeric Mar 19, 2026
bc40d61
more log
cpegeric Mar 19, 2026
44bf29a
more log
cpegeric Mar 19, 2026
daa10b8
bug fix assign wrong device id in single gpu mode
cpegeric Mar 19, 2026
8cd64c5
bug fix device id
cpegeric Mar 19, 2026
2bcfc08
sharded mode use int64 id in cagra
cpegeric Mar 19, 2026
9b2922e
cagra id use int64
cpegeric Mar 19, 2026
9324e56
bug fix deallocate
cpegeric Mar 19, 2026
914352d
inner scope to free temp memory
cpegeric Mar 19, 2026
9d132c3
Revert "inner scope to free temp memory"
cpegeric Mar 20, 2026
a90480a
Revert "bug fix deallocate"
cpegeric Mar 20, 2026
aa41f3f
Revert "cagra id use int64"
cpegeric Mar 20, 2026
6666e27
Revert "sharded mode use int64 id in cagra"
cpegeric Mar 20, 2026
af88138
Revert "bug fix device id"
cpegeric Mar 20, 2026
4b776cc
Revert "bug fix assign wrong device id in single gpu mode"
cpegeric Mar 20, 2026
e445ab0
Revert "more log"
cpegeric Mar 20, 2026
6bcb10d
Revert "more log"
cpegeric Mar 20, 2026
9 changes: 5 additions & 4 deletions Makefile
@@ -178,6 +178,7 @@ pb: vendor-build generate-pb fmt

VERSION_INFO :=-X '$(GO_MODULE)/pkg/version.GoVersion=$(GO_VERSION)' -X '$(GO_MODULE)/pkg/version.BranchName=$(BRANCH_NAME)' -X '$(GO_MODULE)/pkg/version.CommitID=$(LAST_COMMIT_ID)' -X '$(GO_MODULE)/pkg/version.BuildTime=$(BUILD_TIME)' -X '$(GO_MODULE)/pkg/version.Version=$(MO_VERSION)'
THIRDPARTIES_INSTALL_DIR=$(ROOT_DIR)/thirdparties/install
CGO_DIR=$(ROOT_DIR)/cgo
RACE_OPT :=
DEBUG_OPT :=
CGO_DEBUG_OPT :=
@@ -188,7 +189,7 @@ ifeq ($(MO_CL_CUDA),1)
$(error CONDA_PREFIX env variable not found.)
endif
CUVS_CFLAGS := -I$(CONDA_PREFIX)/include
CUVS_LDFLAGS := -L$(CONDA_PREFIX)/envs/go/lib -lcuvs -lcuvs_c
CUVS_LDFLAGS := -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c
CUDA_CFLAGS := -I/usr/local/cuda/include $(CUVS_CFLAGS)
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart $(CUVS_LDFLAGS) -lstdc++
TAGS += -tags "gpu"
@@ -198,11 +199,11 @@ ifeq ($(TYPECHECK),1)
TAGS += -tags "typecheck"
endif

CGO_OPTS :=CGO_CFLAGS="-I$(THIRDPARTIES_INSTALL_DIR)/include $(CUDA_CFLAGS)"
GOLDFLAGS=-ldflags="-extldflags '$(CUDA_LDFLAGS) -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,\$${ORIGIN}/lib -fopenmp' $(VERSION_INFO)"
CGO_OPTS :=CGO_CFLAGS="-I$(CGO_DIR) -I$(THIRDPARTIES_INSTALL_DIR)/include $(CUDA_CFLAGS)"
GOLDFLAGS=-ldflags="-extldflags '$(CUDA_LDFLAGS) -L$(CGO_DIR) -lmo -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,\$${ORIGIN}/lib -fopenmp' $(VERSION_INFO)"

ifeq ("$(UNAME_S)","darwin")
GOLDFLAGS:=-ldflags="-extldflags '-L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,@executable_path/lib' $(VERSION_INFO)"
GOLDFLAGS:=-ldflags="-extldflags '-L$(CGO_DIR) -lmo -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,@executable_path/lib' $(VERSION_INFO)"
endif

ifeq ($(GOBUILD_OPT),)
65 changes: 47 additions & 18 deletions cgo/Makefile
@@ -1,48 +1,77 @@
DEBUG_OPT :=
UNAME_M := $(shell uname -m)
UNAME_S := $(shell uname -s)
CC ?= gcc

# Yeah, fast math. We want it to be fast, for all xcall,
# IEEE compliance should not be an issue.
OPT_LV := -O3 -ffast-math -ftree-vectorize -funroll-loops
CFLAGS=-std=c99 -g ${OPT_LV} -Wall -Werror -I../thirdparties/install/include
OBJS=mo.o arith.o compare.o logic.o xcall.o usearchex.o bloom.o
CUDA_OBJS=
COMMON_CFLAGS := -g $(OPT_LV) -Wall -Werror -fPIC -I../thirdparties/install/include
CFLAGS := -std=c99 $(COMMON_CFLAGS)
OBJS := mo.o arith.o compare.o logic.o xcall.o usearchex.o bloom.o
CUDA_OBJS :=
LDFLAGS := -L../thirdparties/install/lib -lusearch_c
TARGET_LIB := libmo.so

ifeq ($(UNAME_S),Darwin)
TARGET_LIB := libmo.dylib
LDFLAGS += -dynamiclib -undefined dynamic_lookup -install_name @rpath/$(TARGET_LIB)
else
LDFLAGS += -shared
endif

ifeq ($(UNAME_M), x86_64)
CFLAGS+= -march=haswell
CFLAGS += -march=haswell
endif

ifeq ($(MO_CL_CUDA),1)
ifeq ($(CONDA_PREFIX),)
$(error CONDA_PREFIX env variable not found. Please activate your conda environment.)
endif
CC = /usr/local/cuda/bin/nvcc
CFLAGS = -ccbin g++ -m64 --shared -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90
CFLAGS = -ccbin g++ -m64 -Xcompiler -fPIC -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90
CFLAGS += -I../thirdparties/install/include -DMO_CL_CUDA
CUDA_OBJS += cuda/cuda.o
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart -lstdc++
# Explicitly include all needed libraries for shared library linking
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c -ldl -lrmm -lstdc++
LDFLAGS += $(CUDA_LDFLAGS)
endif

all: libmo.a
.PHONY: all clean test debug

all: $(TARGET_LIB) libmo.a

libmo.a: $(OBJS)
$(TARGET_LIB): $(OBJS)
ifeq ($(MO_CL_CUDA),1)
make -C cuda
$(MAKE) -C cuda
$(MAKE) -C cuvs
$(CC) $(LDFLAGS) -o $@ $(OBJS) $(CUDA_OBJS) cuvs/*.o
else
$(CC) $(LDFLAGS) -o $@ $(OBJS)
endif
ar -rcs libmo.a $(OBJS) $(CUDA_OBJS)

#
# $(CC) -o libmo.a $(OBJS) $(CUDA_OBJS) $(CUDA_LDFLAGS)
libmo.a: $(OBJS)
ifeq ($(MO_CL_CUDA),1)
$(MAKE) -C cuda
$(MAKE) -C cuvs
ar -rcs $@ $(OBJS) $(CUDA_OBJS) cuvs/*.o
else
ar -rcs $@ $(OBJS)
endif

%.o: %.c
$(CC) $(CFLAGS) -c $< -o $@

test: libmo.a
make -C test
test: $(TARGET_LIB)
$(MAKE) -C test

.PHONY: debug
debug: override OPT_LV := -O0
debug: override DEBUG_OPT := debug
debug: all

.PHONY: clean
clean:
rm -f *.o *.a *.so
rm -f *.o *.a *.so *.dylib
ifeq ($(MO_CL_CUDA),1)
make -C cuda clean
$(MAKE) -C cuda clean
$(MAKE) -C cuvs clean
endif
33 changes: 18 additions & 15 deletions cgo/README.md
@@ -1,25 +1,28 @@
MatrixOne CGO Kernel
===============================

This directory contains cgo source code for MO. Running
make should produce two files to be used by go code.
On go side, go will `include "mo.h"` and `-lmo`.
This directory contains CGO source code for MatrixOne. Running `make` produces the core library files used by Go code.

On the Go side, the integration typically uses `mo.h` and links against the generated libraries:
```
mo.h
libmo.a
libmo.a / libmo.so
```

`mo.h` should be pristine, meaning it only contains C function
prototype used by go. The only datatypes that can be passed
between go and c code are int and float/double and pointer.
Always explicitly specify int size such as `int32_t`, `uint64_t`.
Do not use `int`, `long`, etc.
`mo.h` should remain pristine, containing only C function prototypes for Go to consume. Data passed between Go and C should be limited to standard types (int, float, double, pointers). Always specify explicit integer sizes (e.g., `int32_t`, `uint64_t`) and avoid platform-dependent types like `int` or `long`.
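The convention above can be illustrated with a minimal sketch. The function name and error-code scheme here are hypothetical (not from the actual `mo.h`); the point is the shape: `extern "C"` linkage, explicit integer widths, and only scalars and pointers crossing the boundary.

```cpp
#include <cstdint>

// Hypothetical prototype in the style described above: explicit widths
// (int32_t, uint64_t), no exceptions, status code returned to Go.
extern "C" int32_t mo_vec_l2sq_f32(const float *a, const float *b,
                                   uint64_t dim, float *out) {
    if (a == nullptr || b == nullptr || out == nullptr) {
        return -1;  // error code instead of an exception across the C boundary
    }
    float acc = 0.0f;
    for (uint64_t i = 0; i < dim; i++) {
        float d = a[i] - b[i];
        acc += d * d;  // squared L2 distance accumulated in C
    }
    *out = acc;
    return 0;
}
```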

GPU Support (CUDA & cuVS)
-------------------------
The kernel supports GPU acceleration for certain operations (e.g., vector search) via NVIDIA CUDA and the cuVS library.

- **Build Flag:** GPU support is enabled by setting `MO_CL_CUDA=1` during the build.
- **Environment:** Requires a working CUDA installation and a Conda environment with `cuvs` and `rmm` installed.
- **Source Code:** GPU-specific code resides in the `cuda/` and `cuvs/` subdirectories.

Implementation Notes
--------------------------------
--------------------

1. Pure C.
2. Use memory passed from go. Try not allocate memory in C code.
3. Only depends on libc and libm.
4. If 3rd party lib is absolutely necessary, import source code
and build from source. If 3rd party lib is C++, wrap it completely in C.
1. **Language:** Core kernel is Pure C. GPU extensions use C++ and CUDA, wrapped in a C-compatible interface.
2. **Memory Management:** Prefer using memory allocated and passed from Go. Minimize internal allocations in C/C++ code.
3. **Dependencies:** The base kernel depends only on `libc`, `libm`, and `libusearch`. GPU builds introduce dependencies on CUDA, `cuvs`, and `rmm`.
4. **Third-party Libraries:** If a third-party library is necessary, it should be built from source (see `thirdparties/` directory). C++ libraries must be fully wrapped in C before being exposed to Go.
2 changes: 1 addition & 1 deletion cgo/cuda/Makefile
@@ -395,7 +395,7 @@ $(FATBIN_FILE): mocl.cu
$(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -fatbin $<

cuda.o: cuda.cpp
$(EXEC) $(NVCC) $(INCLUDES) -O3 --shared $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<
$(EXEC) $(NVCC) $(INCLUDES) -O3 --shared -Xcompiler -fPIC $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<

mytest.o: cuda.cpp $(FATBIN_FILE)
$(EXEC) $(NVCC) $(INCLUDES) -DTEST_RUN -g -O0 $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<
75 changes: 75 additions & 0 deletions cgo/cuvs/Makefile
@@ -0,0 +1,75 @@
# Makefile for MatrixOne cuVS C Wrapper

UNAME_M := $(shell uname -m)
CUDA_PATH ?= /usr/local/cuda
NVCC := $(CUDA_PATH)/bin/nvcc

ifeq ($(CONDA_PREFIX),)
$(error CONDA_PREFIX env variable not found. Please activate your conda environment.)
endif

# Compilation flags
# Added --extended-lambda because raft/core/copy.cuh requires it for some internal headers
NVCC_FLAGS := -std=c++17 -x cu -Xcompiler "-Wall -Wextra -fPIC -O2" --extended-lambda --expt-relaxed-constexpr
NVCC_FLAGS += -I. -I$(CUDA_PATH)/include -I$(CONDA_PREFIX)/include -I$(CONDA_PREFIX)/include/rapids -I$(CONDA_PREFIX)/include/raft -I$(CONDA_PREFIX)/include/cuvs
NVCC_FLAGS += -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DRAFT_SYSTEM_LITTLE_ENDIAN=1

# Linking flags
LDFLAGS := -shared
LDFLAGS += -L$(CUDA_PATH)/lib64/stubs -lcuda -L$(CUDA_PATH)/lib64 -lcudart
LDFLAGS += -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c -ldl -lrmm -lrapids_logger
LDFLAGS += -Xlinker -lpthread -Xlinker -lm

# Target library
TARGET := libmocuvs.so

# Source files
SRCS := brute_force_c.cpp ivf_flat_c.cpp ivf_pq_c.cpp cagra_c.cpp kmeans_c.cpp helper.cpp adhoc_c.cpp distance_c.cpp
OBJS := $(SRCS:.cpp=.o)

# Test configuration
TESTDIR := test
OBJDIR := obj
TEST_EXE := test_cuvs_worker
TEST_SRCS := $(TESTDIR)/main_test.cu \
$(TESTDIR)/brute_force_test.cu \
$(TESTDIR)/ivf_flat_test.cu \
$(TESTDIR)/ivf_pq_test.cu \
$(TESTDIR)/cagra_test.cu \
$(TESTDIR)/kmeans_test.cu \
$(TESTDIR)/quantize_test.cu \
$(TESTDIR)/distance_test.cu \
$(TESTDIR)/batching_test.cu

TEST_OBJS := $(patsubst $(TESTDIR)/%.cu, $(OBJDIR)/test/%.o, $(TEST_SRCS))

.PHONY: all clean test

all: $(OBJS)

$(TARGET): $(OBJS)
@echo "Linking shared library $@"
$(NVCC) $(LDFLAGS) $^ -o $@

%.o: %.cpp
@echo "Compiling $< with NVCC"
$(NVCC) $(NVCC_FLAGS) -c $< -o $@

# Test targets
test: $(TEST_EXE)
@echo "Running tests..."
./$(TEST_EXE)

$(TEST_EXE): $(TEST_OBJS) helper.o
@echo "NVCCLD $@"
$(NVCC) $(subst -x cu,,$(NVCC_FLAGS)) $^ $(subst -shared,,$(LDFLAGS)) -o $@

$(OBJDIR)/test/%.o: $(TESTDIR)/%.cu
@mkdir -p $(@D)
@echo "NVCC $<"
$(NVCC) -std=c++17 -Xcompiler "-Wall -Wextra -fPIC -O2" --extended-lambda --expt-relaxed-constexpr -I. -I$(CUDA_PATH)/include -I$(CONDA_PREFIX)/include -I$(CONDA_PREFIX)/include/rapids -I$(CONDA_PREFIX)/include/raft -I$(CONDA_PREFIX)/include/cuvs -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DRAFT_SYSTEM_LITTLE_ENDIAN=1 -c $< -o $@

clean:
@echo "Cleaning up..."
rm -f $(TARGET) *.o $(TEST_EXE)
rm -rf $(OBJDIR)
119 changes: 119 additions & 0 deletions cgo/cuvs/README.md
@@ -0,0 +1,119 @@
Architecture Design: cuVS-Accelerated Vector Indexing

1. Overview
The MatrixOne cuvs package provides a high-performance, GPU-accelerated vector search and clustering infrastructure. It acts as
a bridge between the Go-based database kernel and NVIDIA's cuVS and RAFT libraries. The architecture is designed to solve three
primary challenges:
1. Impedance Mismatch: Reconciling Go’s concurrent goroutine scheduler with CUDA’s thread-specific resource requirements.
2. Scalability: Supporting datasets that exceed single-GPU memory (Sharding) or high-concurrency search requirements
(Replicated).
3. Efficiency: Minimizing CUDA kernel launch overhead via dynamic query batching.

---

2. Core Component: cuvs_worker_t
The cuvs_worker_t is the foundational engine of the architecture.

Implementation Details:
* Persistent C++ Thread Pool: Instead of executing CUDA calls directly from CGO (which could be scheduled on any OS thread),
the worker maintains a dedicated pool of long-lived C++ threads. Each thread is pinned to a specific GPU device.
* Job Queuing: Requests from the Go layer are submitted as "Jobs" to an internal thread-safe queue. The worker returns a
std::future, allowing the Go layer to perform other tasks while the GPU processes the request.
* Context Stability: By using dedicated threads, we ensure that CUDA context and RAFT resource handles remain stable and
cached, avoiding the expensive overhead of context creation or handle re-initialization.
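The thread-pool-plus-future pattern described above can be sketched as follows. This is illustrative host-side logic only (the class and method names are not the actual `cuvs_worker_t` API); in the real worker, each thread would additionally pin itself to its GPU (e.g. via `cudaSetDevice`) before entering the loop.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Minimal sketch: one long-lived worker thread drains a job queue;
// callers receive a std::future to wait on, as described above.
class Worker {
public:
    Worker() : stop_(false), thread_([this] { Run(); }) {}
    ~Worker() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }
    std::future<int> Submit(std::function<int()> job) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(job));
        std::future<int> fut = task->get_future();
        {
            std::lock_guard<std::mutex> lk(mu_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return fut;
    }

private:
    void Run() {
        // In the real worker: cudaSetDevice(device_id) here, once per thread,
        // so the CUDA context and RAFT handles stay stable for its lifetime.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_;
    std::thread thread_;
};
```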

---

3. Distribution Modes
The system supports three distinct modes to leverage multi-GPU hardware:

A. Single GPU Mode
* Design: The index resides entirely on one device.
* Use Case: Small to medium datasets where latency is the priority.

B. Replicated Mode (Scaling Throughput)
* Design: The full index is loaded onto multiple GPUs simultaneously.
* Mechanism: The cuvs_worker implements a load-balancing strategy (typically round-robin). Incoming queries are dispatched to
the next available GPU.
* Benefit: Linearly scales the Queries Per Second (QPS) by utilizing the compute power of all available GPUs.
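The round-robin dispatch mentioned above amounts to a single atomic counter, sketched here (illustrative, not the actual worker code): concurrent callers spread evenly across replicas without taking a lock.

```cpp
#include <atomic>
#include <cstddef>

// Lock-free round-robin replica selection for Replicated mode.
class RoundRobin {
public:
    explicit RoundRobin(std::size_t n_replicas) : n_(n_replicas), next_(0) {}
    // Returns the GPU/replica index that should serve the next query.
    std::size_t Pick() {
        return next_.fetch_add(1, std::memory_order_relaxed) % n_;
    }

private:
    std::size_t n_;
    std::atomic<std::size_t> next_;
};
```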

C. Sharded Mode (Scaling Capacity)
* Design: The dataset is partitioned into $N$ shards across $N$ GPUs.
* Mechanism:
1. Broadcast: A search request is sent to all GPUs.
2. Local Search: Each GPU searches its local shard independently using RAFT resources.
3. Top-K Merge: The worker aggregates the results ($N \times K$ candidates) and performs a final merge-sort (often on the
CPU or via a fast GPU kernel) to return the global top-K.
* Benefit: Enables indexing of massive datasets (e.g., 100M+ vectors) that would not fit in the memory of a single GPU.
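The final Top-K merge step can be sketched as a host-side merge over the $N \times K$ candidates (the real worker may instead merge with a GPU kernel, as noted above; the function name here is illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Each of the N shards returns K (distance, id) candidates; keep the K
// globally smallest distances. partial_sort is O(NK log K).
std::vector<std::pair<float, int64_t>> MergeTopK(
    const std::vector<std::vector<std::pair<float, int64_t>>> &per_shard,
    std::size_t k) {
    std::vector<std::pair<float, int64_t>> all;
    for (const auto &shard : per_shard)
        all.insert(all.end(), shard.begin(), shard.end());
    if (k > all.size()) k = all.size();
    // Pairs compare by distance first, so the k best land at the front.
    std::partial_sort(all.begin(), all.begin() + k, all.end());
    all.resize(k);
    return all;
}
```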

---

4. RAFT Resource Management
The package relies on RAFT (raft::resources) for all CUDA-accelerated operations.

* Resource Caching: raft::resources objects (containing CUDA streams, cuBLAS handles, and workspace memory) are held within the
cuvs_worker threads. They are created once at Start() and reused for the lifetime of the index.
* Stream-Based Parallelism: Every index operation is executed asynchronously on a RAFT-managed CUDA stream. This allows the
system to overlap data transfers (Host-to-Device) with kernel execution, maximizing hardware utilization.
* Memory Layout: Leveraging raft::mdspan and raft::mdarray ensures that memory is handled in a layout-aware manner
(C-contiguous or Fortran-contiguous), matching the requirements of optimized BLAS and LAPACK kernels.

---

5. Dynamic Batching: The Throughput Key
In a database environment, queries often arrive one by one from different users. Processing these as individual CUDA kernels is
inefficient due to launch overhead and under-utilization of GPU warps.

The Dynamic Batching Mechanism:
* Aggregation Window: When multiple search requests arrive at the worker within a small time window (microseconds), the worker
stalls briefly to aggregate them.
* Matrix Consolidation: Individual query vectors are packed into a single large query matrix.
* Consolidated Search: A single cuvs::neighbors::search call is made. GPUs are significantly more efficient at processing one
$64 \times D$ matrix than 64 individual $1 \times D$ vectors.
* Automatic Fulfillment: Once the batch search completes, the worker de-multiplexes the results and fulfills the specific
std::future for each individual Go request.
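The pack/de-multiplex steps of this mechanism can be sketched as below. This is host-side illustration only (function names are hypothetical); the consolidated search itself runs in cuVS on the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack B pending queries of dimension D into one (B x D) row-major
// matrix so a single search call can process them all.
std::vector<float> PackBatch(const std::vector<std::vector<float>> &queries,
                             std::size_t dim) {
    std::vector<float> batch;
    batch.reserve(queries.size() * dim);
    for (const auto &q : queries)
        batch.insert(batch.end(), q.begin(), q.end());
    return batch;
}

// De-multiplex the flat (B x K) neighbor-id result back to the
// per-request owners, whose futures are then fulfilled.
std::vector<std::vector<int64_t>> UnpackResults(
    const std::vector<int64_t> &flat_ids, std::size_t n_queries,
    std::size_t k) {
    std::vector<std::vector<int64_t>> per_query(n_queries);
    for (std::size_t i = 0; i < n_queries; i++)
        per_query[i].assign(flat_ids.begin() + i * k,
                            flat_ids.begin() + (i + 1) * k);
    return per_query;
}
```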

---

6. Automatic Type Quantization
To optimize memory footprint and search speed, the architecture features an automated quantization pipeline that converts
high-precision float32 vectors into compressed formats.

* Transparent Conversion: The Go layer can consistently provide float32 data. The system automatically handles the conversion
to the index's internal type (half, int8, or uint8) directly on the GPU.
* FP16 (Half Precision):
* Mechanism: Uses raft::copy to perform bit-level conversion from 32-bit to 16-bit floating point.
* Benefit: 2x memory reduction with negligible impact on search recall.
* 8-Bit Integer (int8/uint8):
* Mechanism: Implements a learned Scalar Quantizer. The system samples the dataset to determine optimal min and max
clipping bounds.
* Training: Before building, the quantizer is "trained" on a subset of the data to ensure the 256 available integer levels
are mapped to the most significant range of the distribution.
* Benefit: 4x memory reduction, enabling massive datasets to reside in VRAM.
* GPU-Accelerated: All quantization kernels are executed on the device. This minimizes CPU usage and avoids the latency of
converting data before sending it over the PCIe bus.
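A minimal host-side sketch of the learned scalar quantizer idea, under simplifying assumptions: bounds are taken as the raw min/max of the sample (the real trainer may use quantiles or other clipping), and the actual kernels run on the GPU.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// "Train" finds clipping bounds on a sample, then each float is mapped
// onto one of the 256 int8 levels, as described above.
struct ScalarQuantizer {
    float lo = 0.0f, hi = 1.0f;

    void Train(const std::vector<float> &sample) {
        auto mm = std::minmax_element(sample.begin(), sample.end());
        lo = *mm.first;
        hi = *mm.second;
        if (hi <= lo) hi = lo + 1.0f;  // avoid a degenerate range
    }

    int8_t Quantize(float v) const {
        float clipped = std::min(std::max(v, lo), hi);
        float scaled = (clipped - lo) / (hi - lo) * 255.0f;  // 256 levels
        return static_cast<int8_t>(std::lround(scaled) - 128);
    }
};
```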

7. Supported Index Types
The following indexes are fully integrated into the MatrixOne GPU architecture:


| Index | Algorithm | Strengths |
| --- | --- | --- |
| CAGRA | Hardware-accelerated Graph | Best-in-class search speed and high recall. Optimized for hardware graph traversal. |
| IVF-Flat | Inverted File Index | High accuracy and fast search. Excellent for general-purpose use. |
| IVF-PQ | Product Quantization | Extreme compression. Supports billions of vectors via lossy code compression. |
| Brute Force | Exact Flat Search | 100% recall. Ideal for small datasets or generating ground-truth for benchmarks. |
| K-Means | Clustering | High-performance centroid calculation for data partitioning and unsupervised learning. |


8. Operational Telemetry
All indexes implement a unified Info() method that returns a JSON-formatted string. This allows the database to programmatically
verify:
* Hardware Mapping: Which GPU devices are holding which shards.
* Data Layout: Element sizes, dimensions, and current vector counts.
* Hyper-parameters: Internal tuning values like NLists, GraphDegree, or PQBits.