-
Notifications
You must be signed in to change notification settings - Fork 0
fix: apply distinct constraint in hybrid search #5
base: main
Are you sure you want to change the base?
Conversation
Hybrid search was not respecting the distinct attribute configuration, returning duplicate documents from the same distinct group. This fix ensures distinct filtering is applied after merging vector and keyword results but before pagination. Changes: - Apply distinct rule in hybrid search merge logic - Add comprehensive test covering the distinct constraint bug - Make distinct module public to enable reuse in hybrid search - Update function signature to handle distinct filtering errors The fix maintains the correct order of operations: merge → distinct → paginate, ensuring hybrid search behaves consistently with regular keyword search regarding distinct attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, as written, the fix breaks the order of results of the hybrid search in the presence of distinct
// remove documents we already saw | ||
.filter(|((docid, _), _)| documents_seen.insert(*docid)) | ||
// start skipping **after** the filter | ||
.skip(from) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the filters at this point breaks the semantic_hit_count
, which is no longer representative of the number of semantic hits in these results.
}, | ||
semantic_hit_count, | ||
) | ||
let mut final_result = SearchResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: we usually build an immutable object from its final constituents, rather than a mutable object and then edit its fields.
if let Some(distinct_fid) = fields_ids_map.id(f) { | ||
let mut ctx = SearchContext::new(search.index, search.rtxn)?; | ||
let all_candidates: RoaringBitmap = final_result.documents_ids.iter().copied().collect(); | ||
let remaining_candidates = apply_distinct_rule(&mut ctx, distinct_fid, &all_candidates)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this call will give priority to distinct documents with a lower document id, breaking the order of documents returned by the search.
if let Some(f) = distinct_field { | ||
let fields_ids_map = search.index.fields_ids_map(search.rtxn)?; | ||
if let Some(distinct_fid) = fields_ids_map.id(f) { | ||
let mut ctx = SearchContext::new(search.index, search.rtxn)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
performance: SearchContext
is a slightly heavyweight object that we'd rather not rebuild here.
.await; | ||
snapshot!(code, @"200 OK"); | ||
// Should only return 2 hits: one from "superhero" category and one from "action" category | ||
// But due to the bug, it returns more than expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment, the bug is on the semantic search, not on the hybrid search
// First test: pure keyword search with distinct - should only return 2 documents (one per category) | ||
let (response, code) = index | ||
.search_post( | ||
json!({"q": "Captain", "showRankingScore": true}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will return a single document because the one in the action
category is not matched by the search
// Should only return 2 hits: one from "superhero" category and one from "action" category | ||
// But due to the bug, it returns more than expected | ||
let keyword_hits = response["hits"].as_array().unwrap().len(); | ||
println!("Keyword search hits: {}", keyword_hits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not print results in tests; please use snapshots to assert the number of results
snapshot!(code, @"200 OK"); | ||
|
||
let hybrid_hits = response["hits"].as_array().unwrap().len(); | ||
println!("Hybrid search hits: {}", hybrid_hits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same remark as above
} | ||
|
||
// This assertion will fail due to the bug - hybrid search ignores distinct constraint | ||
assert!(hybrid_hits <= 2, "Hybrid search should respect distinct constraint and return at most 2 documents (one per category), but got {}", hybrid_hits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests in Meilisearch are written to avoid accidental changes, please snapshot on the exact number of hits rather than asserting the logical condition here.
} | ||
} | ||
final_result.documents_ids = filtered_docs; | ||
final_result.document_scores = filtered_scores; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the candidates
from the final result are not updated, meaning that the field distribution and the total hits will be wrong
Hybrid search was not respecting the distinct attribute configuration, returning duplicate documents from the same distinct group. This fix ensures distinct filtering is applied after merging vector and keyword results but before pagination.
Changes:
Apply distinct rule in hybrid search merge logic
Add comprehensive test covering the distinct constraint bug
Make distinct module public to enable reuse in hybrid search
Update function signature to handle distinct filtering errors
The fix maintains the correct order of operations: merge → distinct → paginate, ensuring hybrid search behaves consistently with regular keyword search regarding distinct attributes.
Related issue
Fixes meilisearch#5526
PR checklist
Please check if your PR fulfills the following requirements: