Hybrid Query #55

amithbhat1 · 2025-04-05T22:53:53Z

No description provided.

Build with attributes

Add and remove to attribute tables working!

JasonMoho

A good start. Main concern is that columns are hardcoded. Make it general.

Also, you should ignore removals for now, since Arrow tables are immutable. We can consider how to handle them later. Focus on the query processing implementation and performance.

JasonMoho · 2025-04-06T19:19:38Z

src/cpp/include/common.h

 constexpr float DEFAULT_INITIAL_SEARCH_FRACTION = 0.02f; ///< Default initial fraction of partitions to search.
 constexpr float DEFAULT_RECOMPUTE_THRESHOLD = 0.001f;    ///< Default threshold to trigger recomputation of search parameters.
 constexpr int DEFAULT_APS_FLUSH_PERIOD_US = 100;         ///< Default period (in microseconds) for flushing the APS buffer.
+constexpr int DEFAULT_PRICE_THRESHOLD = INT_MAX;


what is this?

This is the default price threshold, used when the user does not supply a value. I will change this while making the filtering agnostic to column names.

JasonMoho · 2025-04-06T19:22:25Z

src/cpp/include/index_partition.h

+     *
+     * @param index Index of the vector to remove.
+     */
+    void removeAttribute(int64_t index);


is this function needed? seems unnecessary

JasonMoho · 2025-04-06T19:24:16Z

src/cpp/include/list_scanning.h

        partitions_scanned_.fetch_add(1, std::memory_order_relaxed);
    }

+    void remove(int rejected_index) {


I would avoid modifying the TopkBuffer class. Just make sure that the elements you add to the buffer pass the filter (in the pre-filtering case).

I added this for the post-filtering case. After we get the top-k buffer from one partition, we need to remove whatever doesn't pass the filter. This function serves that purpose.

JasonMoho · 2025-04-06T19:25:08Z

src/cpp/include/list_scanning.h

                                        int d,
-                                        TopkBuffer &buffer) {
+                                        TopkBuffer &buffer,
+                                        bool* bitmap = nullptr) {


switch to a vector so we avoid memory leaks
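A minimal sketch of what a vector-based bitmap could look like in the scan path for the pre-filtering case. `SimpleTopk` and `scan_list_filtered` are hypothetical stand-ins, not the actual `TopkBuffer` or scan function from `list_scanning.h`, and squared L2 distance is assumed. An empty vector means "no filter", so no raw pointer or manual ownership is involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for TopkBuffer: keeps the k smallest distances.
struct SimpleTopk {
    explicit SimpleTopk(std::size_t k) : k_(k) {}

    void add(float dist, std::int64_t id) {
        if (k_ == 0) {
            return;
        }
        if (heap_.size() < k_) {
            heap_.emplace(dist, id);
        } else if (dist < heap_.top().first) {
            heap_.pop();
            heap_.emplace(dist, id);
        }
    }

    // Returns (distance, id) pairs sorted by ascending distance.
    std::vector<std::pair<float, std::int64_t>> sorted() const {
        auto copy = heap_;
        std::vector<std::pair<float, std::int64_t>> out(copy.size());
        for (std::size_t i = out.size(); i-- > 0;) {
            out[i] = copy.top();
            copy.pop();
        }
        return out;
    }

private:
    std::size_t k_;
    // Max-heap on distance, so the worst kept candidate is on top.
    std::priority_queue<std::pair<float, std::int64_t>> heap_;
};

// Scans n vectors of dimension d. An empty bitmap means "no filter";
// otherwise bitmap[i] != 0 means row i passes the filter.
inline void scan_list_filtered(const float* query,
                               const float* list_vecs,
                               const std::int64_t* list_ids,
                               std::size_t n,
                               std::size_t d,
                               SimpleTopk& buffer,
                               const std::vector<std::uint8_t>& bitmap = {}) {
    for (std::size_t i = 0; i < n; ++i) {
        if (!bitmap.empty() && bitmap[i] == 0) {
            continue;  // pre-filter: rejected rows never reach the buffer
        }
        float dist = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            const float diff = query[j] - list_vecs[i * d + j];
            dist += diff * diff;
        }
        buffer.add(dist, list_ids[i]);
    }
}
```

`std::vector<std::uint8_t>` is used here rather than `std::vector<bool>` so each entry is a plain byte; either would remove the leak risk.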

JasonMoho · 2025-04-06T19:27:07Z

src/cpp/src/dynamic_inverted_list.cpp

        const idx_t *ids,
-        const uint8_t *codes) {
+        const uint8_t *codes,
+        shared_ptr<arrow::Table> attributes_table


nit: shared_ptr<arrow::Table> attributes_table) {

JasonMoho · 2025-04-06T19:38:14Z

src/cpp/src/partition_manager.cpp

+        throw runtime_error("[PartitionManager] add: mismatch in attributes_table and vector_ids size.");
+    }
+
+    if(attributes_table!=nullptr && !attributes_table->GetColumnByName("id")){


Ideally, we shouldn't keep track of ids in the arrow table since we already keep track of them in the index partitions.

JasonMoho · 2025-04-06T19:40:31Z

src/cpp/src/partition_manager.cpp

            id_ptr + i,
-            code_ptr + i * code_size_bytes
+            code_ptr + i * code_size_bytes,
+            filtered_table_result


This might be a significant performance slowdown since we do this filtering per entry.

JasonMoho · 2025-04-06T19:42:11Z

src/cpp/src/query_coordinator.cpp

-shared_ptr<SearchResult> QueryCoordinator::serial_scan(Tensor x, Tensor partition_ids,
-                                                       shared_ptr<SearchParams> search_params) {
+
+bool* create_bitmap(std::unordered_map<int64_t, int64_t> id_to_price, int64_t* list_ids, 


Return a vector instead of bool * to avoid memory leaks.

What is price? Why are we hardcoded to have a price column?

Why do we need an unordered_map to produce the bitmap?

Okay, will change

So, id is a sample column we added. We are looking to make the user queries and columns agnostic to the names and the type of filter.

It's not strictly necessary, but it makes the filter much faster. Otherwise, for every vector_id, we'd have to scan the entire Arrow table. Especially in the pre-filtering case, there might be a lot of vectors, which would slow down the search.
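For illustration, a sketch of a `create_bitmap` that returns a vector and needs no `unordered_map`. It rests on an assumption that is not confirmed anywhere in this PR: that the attribute rows of a partition are stored in the same order as that partition's ids, so row `i` describes the `i`-th vector. Under that assumption the bitmap is one linear pass over the column. The signature is hypothetical, and the `<=` comparison simply mirrors the `> price_threshold` rejection in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: values[i] is the attribute of the i-th vector in the partition,
// i.e. attribute rows are positionally aligned with the partition's id list.
inline std::vector<std::uint8_t> create_bitmap(const std::vector<std::int64_t>& values,
                                               std::int64_t threshold) {
    std::vector<std::uint8_t> bitmap(values.size(), 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        // A row passes when its value does not exceed the threshold.
        bitmap[i] = values[i] <= threshold ? 1 : 0;
    }
    return bitmap;
}
```

If the rows are not positionally aligned, some id-to-row lookup is still needed, which is the case the reply above describes.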

JasonMoho · 2025-04-06T19:43:15Z

src/cpp/src/query_coordinator.cpp

+                            partition_manager_->partition_store_->partitions_[pi]->attributes_table_;
            int64_t list_size = partition_manager_->partition_store_->list_size(pi);

+            std::shared_ptr<arrow::Int64Array> id_array = nullptr;


fix so we don't have hardcoded column names
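One possible shape for a column-agnostic filter, sketched with standard containers only. `AttributeFilter`, `CompareOp`, `ColumnMap`, and `evaluate_filter` are all hypothetical names; `ColumnMap` is a stand-in for the partition's Arrow table restricted to int64 columns, and a real version would look the column up in the table by name instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

enum class CompareOp { kLess, kLessEqual, kEqual, kGreaterEqual, kGreater };

// Hypothetical filter description carried by the search parameters.
struct AttributeFilter {
    std::string column;  // any column name, nothing hardcoded
    CompareOp op;
    std::int64_t value;
};

// Stand-in for an attribute table: column name -> int64 column values.
using ColumnMap = std::unordered_map<std::string, std::vector<std::int64_t>>;

inline bool passes(std::int64_t x, CompareOp op, std::int64_t v) {
    switch (op) {
        case CompareOp::kLess:         return x < v;
        case CompareOp::kLessEqual:    return x <= v;
        case CompareOp::kEqual:        return x == v;
        case CompareOp::kGreaterEqual: return x >= v;
        case CompareOp::kGreater:      return x > v;
    }
    return false;
}

// Returns one byte per row: 1 if the row passes the filter, 0 otherwise.
inline std::vector<std::uint8_t> evaluate_filter(const ColumnMap& table,
                                                 const AttributeFilter& filter) {
    const auto it = table.find(filter.column);
    if (it == table.end()) {
        throw std::invalid_argument("unknown filter column: " + filter.column);
    }
    const std::vector<std::int64_t>& col = it->second;
    std::vector<std::uint8_t> bitmap(col.size(), 0);
    for (std::size_t i = 0; i < col.size(); ++i) {
        bitmap[i] = passes(col[i], filter.op, filter.value) ? 1 : 0;
    }
    return bitmap;
}
```

With something like this, the earlier price example becomes `AttributeFilter{"price", CompareOp::kLessEqual, threshold}`, and a default threshold constant is no longer needed.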

JasonMoho · 2025-04-06T19:44:31Z

src/cpp/src/query_coordinator.cpp

+                for (int i = 0;i < buffer_size; i++) {
+                    auto vector_id = scanned_vectors[i].second;
+                    if (id_to_price.count(vector_id) and id_to_price[vector_id] > search_params->price_threshold) {
+                        topk_buf->remove(i);


No need for this topk_buffer->remove(i). Just do the filtering on the final topk for post-filter

This is the local post-filtering step. After the closest partition has been scanned, we filter out the vectors that don't pass the filter, so that quake can scan the next partition if fewer than k vectors pass.
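A rough sketch of how the local post-filter described above might work without a `remove` method on the buffer: the candidates read out of each partition's top-k are filtered into a separate result list, and scanning continues until k of them pass. `Candidate` and `post_filter_search` are hypothetical, and the per-partition candidate lists are assumed to be already extracted from the buffer, with nearest partitions first. Stopping as soon as k results pass is a heuristic, since a later partition could still hold a closer vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Candidate {
    float distance;
    std::int64_t id;
};

// per_partition[p] holds the candidates produced by scanning partition p,
// ordered so that the partitions nearest to the query come first.
template <typename Pred>
std::vector<Candidate> post_filter_search(
        const std::vector<std::vector<Candidate>>& per_partition,
        std::size_t k,
        Pred passes,
        std::size_t* partitions_scanned = nullptr) {
    std::vector<Candidate> results;
    std::size_t scanned = 0;
    for (const auto& candidates : per_partition) {
        ++scanned;
        // Keep only the candidates that pass the filter; the buffer is untouched.
        for (const Candidate& c : candidates) {
            if (passes(c.id)) {
                results.push_back(c);
            }
        }
        std::sort(results.begin(), results.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.distance < b.distance;
                  });
        if (results.size() > k) {
            results.resize(k);
        }
        if (results.size() >= k) {
            break;  // enough passing results, no need to scan further
        }
    }
    if (partitions_scanned != nullptr) {
        *partitions_scanned = scanned;
    }
    return results;
}
```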

Sujan242 and others added 25 commits March 14, 2025 12:36

Added arrow as third party dependency

3a46c22

Added build with arrow tables functionality

2063b5d

updated stress tests to include attributes

e1aa5ac

renamed duplicate function

c0aece3

added libarrow as conda dependency

147dbe1

updated data structure for storing tables

29e4614

Merge pull request #1 from Sujan242/build-with-attributes

fbedcac

Build with attributes

modified gitignore

b8012db

index partition bug fix

8281295

wip

3db5f45

Merge branch 'main' of https://github.com/Sujan242/quake

35dfe11

written code for add & remove

318bb8c

compile works

3bc3e5e

index partition bug fix

bf953a2

handled null attributes table

e70db43

Adding some search things

1d8a5b5

Older tests working fine with filter search

f7eb486

Tests for search added

f12d21e

fix remove logical bug - attr table can be NULL

ac50b7f

allow attr_table to be null - add vector

1288863

fixed tests

711e414

Merge pull request #3 from Sujan242/attr-manip

d530d46

Add and remove to attribute tables working!

Merge branch 'main' into hybrid-query

332ea7a

fixed conflicts

68706a6

added conda to quake_env

f6305c9

JasonMoho changed the base branch from main to attribute_filter April 6, 2025 19:11

JasonMoho changed the base branch from attribute_filter to main April 6, 2025 19:17

JasonMoho changed the base branch from main to attribute_filter April 6, 2025 19:18

JasonMoho requested changes Apr 6, 2025


4 participants