Skip to content

Conversation

@amithbhat1
Copy link

No description provided.

@JasonMoho JasonMoho changed the base branch from main to attribute_filter April 6, 2025 19:11
@JasonMoho JasonMoho changed the base branch from attribute_filter to main April 6, 2025 19:17
@JasonMoho JasonMoho changed the base branch from main to attribute_filter April 6, 2025 19:18
Copy link
Contributor

@JasonMoho JasonMoho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good start. Main concern is that columns are hardcoded. Make it general.

Also you should ignore removals for now, since arrow tables are immutable. We can consider how to handle them later. Focus on the query processing implementation and performance.

constexpr float DEFAULT_INITIAL_SEARCH_FRACTION = 0.02f; ///< Default initial fraction of partitions to search.
constexpr float DEFAULT_RECOMPUTE_THRESHOLD = 0.001f; ///< Default threshold to trigger recomputation of search parameters.
constexpr int DEFAULT_APS_FLUSH_PERIOD_US = 100; ///< Default period (in microseconds) for flushing the APS buffer.
constexpr int DEFAULT_PRICE_THRESHOLD = INT_MAX;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the price threshold in case no value is given by the user. Will modify this while making the filtering column name agnostic

*
* @param index Index of the vector to remove.
*/
void removeAttribute(int64_t index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this function needed? seems unnecessary

partitions_scanned_.fetch_add(1, std::memory_order_relaxed);
}

void remove(int rejected_index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid modifying the topkbuffer class. Just make sure that elements you add to the buffer pass the filter (in the case of pre-filtering)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this for post-filtering case. So after we get topk buffer from one partition, we need to remove whatever doesn't pass the filter. This function serves that purpose

int d,
TopkBuffer &buffer) {
TopkBuffer &buffer,
bool* bitmap = nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch to a vector so we avoid memory leaks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

const idx_t *ids,
const uint8_t *codes) {
const uint8_t *codes,
shared_ptr<arrow::Table> attributes_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shared_ptr<arrow::Table> attributes_table) {

throw runtime_error("[PartitionManager] add: mismatch in attributes_table and vector_ids size.");
}

if(attributes_table!=nullptr && !attributes_table->GetColumnByName("id")){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we shouldn't keep track of ids in the arrow table since we already keep track of them in the index partitions.

id_ptr + i,
code_ptr + i * code_size_bytes
code_ptr + i * code_size_bytes,
filtered_table_result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a significant performance slowdown since we do this filtering per entry.

shared_ptr<SearchResult> QueryCoordinator::serial_scan(Tensor x, Tensor partition_ids,
shared_ptr<SearchParams> search_params) {

bool* create_bitmap(std::unordered_map<int64_t, int64_t> id_to_price, int64_t* list_ids,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Return a vector instead of bool * to avoid memory leaks.
  2. What is price? Why are we hardcoded to have a price column?
  3. Why do we need an unordered_map to produce the bitmap?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Okay, will change
  2. So, id is like a sample column we added. We are looking to make the user queries and columns agnostic to the names and type of filter.
  3. Its not necessary, but it makes the filter much faster. Otherwise, for every vector_id, we'd have to scan the entire arrow table. Specially for pre_filtering case, there might be a lot of vectors, slowing down the search.

partition_manager_->partition_store_->partitions_[pi]->attributes_table_;
int64_t list_size = partition_manager_->partition_store_->list_size(pi);

std::shared_ptr<arrow::Int64Array> id_array = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix so we don't have hardcoded column names

for (int i = 0;i < buffer_size; i++) {
auto vector_id = scanned_vectors[i].second;
if (id_to_price.count(vector_id) and id_to_price[vector_id] > search_params->price_threshold) {
topk_buf->remove(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this topk_buffer->remove(i). Just do the filtering on the final topk for post-filter

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the local post-filtering. So after the closest partition has been scanned, we filter out the vectors which don't pass the filter, so quake can scan the next partition if k-vectors don't pass the filter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants