-
Notifications
You must be signed in to change notification settings - Fork 14
Hybrid Query #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: attribute_filter
Are you sure you want to change the base?
Hybrid Query #55
Conversation
Build with attributes
Add and remove to attribute tables working!
JasonMoho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good start. Main concern is that columns are hardcoded. Make it general.
Also you should ignore removals for now, since arrow tables are immutable. We can consider how to handle them later. Focus on the query processing implementation and performance.
| constexpr float DEFAULT_INITIAL_SEARCH_FRACTION = 0.02f; ///< Default initial fraction of partitions to search. | ||
| constexpr float DEFAULT_RECOMPUTE_THRESHOLD = 0.001f; ///< Default threshold to trigger recomputation of search parameters. | ||
| constexpr int DEFAULT_APS_FLUSH_PERIOD_US = 100; ///< Default period (in microseconds) for flushing the APS buffer. | ||
| constexpr int DEFAULT_PRICE_THRESHOLD = INT_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the price threshold in case no value is given by the user. Will modify this while making the filtering column name agnostic
| * | ||
| * @param index Index of the vector to remove. | ||
| */ | ||
| void removeAttribute(int64_t index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this function needed? seems unnecessary
| partitions_scanned_.fetch_add(1, std::memory_order_relaxed); | ||
| } | ||
|
|
||
| void remove(int rejected_index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid modifying the topkbuffer class. Just make sure that elements you add to the buffer pass the filter (in the case of pre-filtering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this for post-filtering case. So after we get topk buffer from one partition, we need to remove whatever doesn't pass the filter. This function serves that purpose
| int d, | ||
| TopkBuffer &buffer) { | ||
| TopkBuffer &buffer, | ||
| bool* bitmap = nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch to a vector so we avoid memory leaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
| const idx_t *ids, | ||
| const uint8_t *codes) { | ||
| const uint8_t *codes, | ||
| shared_ptr<arrow::Table> attributes_table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shared_ptr<arrow::Table> attributes_table) {
| throw runtime_error("[PartitionManager] add: mismatch in attributes_table and vector_ids size."); | ||
| } | ||
|
|
||
| if(attributes_table!=nullptr && !attributes_table->GetColumnByName("id")){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we shouldn't keep track of ids in the arrow table since we already keep track of them in the index partitions.
| id_ptr + i, | ||
| code_ptr + i * code_size_bytes | ||
| code_ptr + i * code_size_bytes, | ||
| filtered_table_result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a significant performance slowdown since we do this filtering per entry.
| shared_ptr<SearchResult> QueryCoordinator::serial_scan(Tensor x, Tensor partition_ids, | ||
| shared_ptr<SearchParams> search_params) { | ||
|
|
||
| bool* create_bitmap(std::unordered_map<int64_t, int64_t> id_to_price, int64_t* list_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Return a vector instead of bool * to avoid memory leaks.
- What is price? Why are we hardcoded to have a price column?
- Why do we need an unordered_map to produce the bitmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Okay, will change
- So, id is like a sample column we added. We are looking to make the user queries and columns agnostic to the names and type of filter.
- Its not necessary, but it makes the filter much faster. Otherwise, for every vector_id, we'd have to scan the entire arrow table. Specially for pre_filtering case, there might be a lot of vectors, slowing down the search.
| partition_manager_->partition_store_->partitions_[pi]->attributes_table_; | ||
| int64_t list_size = partition_manager_->partition_store_->list_size(pi); | ||
|
|
||
| std::shared_ptr<arrow::Int64Array> id_array = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix so we don't have hardcoded column names
| for (int i = 0;i < buffer_size; i++) { | ||
| auto vector_id = scanned_vectors[i].second; | ||
| if (id_to_price.count(vector_id) and id_to_price[vector_id] > search_params->price_threshold) { | ||
| topk_buf->remove(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this topk_buffer->remove(i). Just do the filtering on the final topk for post-filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the local post-filtering. So after the closest partition has been scanned, we filter out the vectors which don't pass the filter, so quake can scan the next partition if k-vectors don't pass the filter
No description provided.