Skip to content
This repository was archived by the owner on Jul 31, 2025. It is now read-only.

Conversation

priyankc
Copy link
Member

Hybrid search was not respecting the distinct attribute configuration, returning duplicate documents from the same distinct group. This fix ensures distinct filtering is applied after merging vector and keyword results but before pagination.

Changes:

  • Apply distinct rule in hybrid search merge logic

  • Add comprehensive test covering the distinct constraint bug

  • Make distinct module public to enable reuse in hybrid search

  • Update function signature to handle distinct filtering errors

The fix maintains the correct order of operations: merge → distinct → paginate, ensuring hybrid search behaves consistently with regular keyword search regarding distinct attributes.

Related issue

Fixes meilisearch#5526

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Hybrid search was not respecting the distinct attribute configuration, returning duplicate documents from the same distinct group. This fix ensures distinct filtering is applied after merging vector and keyword results but before pagination.

Changes:

- Apply distinct rule in hybrid search merge logic

- Add comprehensive test covering the distinct constraint bug

- Make distinct module public to enable reuse in hybrid search

- Update function signature to handle distinct filtering errors

The fix maintains the correct order of operations: merge → distinct → paginate, ensuring hybrid search behaves consistently with regular keyword search regarding distinct attributes.
Copy link

@dureuill dureuill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, as written, the fix breaks the order of results of the hybrid search in the presence of distinct

// remove documents we already saw
.filter(|((docid, _), _)| documents_seen.insert(*docid))
// start skipping **after** the filter
.skip(from)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the filters at this point breaks the semantic_hit_count, which is no longer representative of the number of semantic hits in these results.

},
semantic_hit_count,
)
let mut final_result = SearchResult {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: we usually build an immutable object from its final constituents, rather than a mutable object and then edit its fields.

if let Some(distinct_fid) = fields_ids_map.id(f) {
let mut ctx = SearchContext::new(search.index, search.rtxn)?;
let all_candidates: RoaringBitmap = final_result.documents_ids.iter().copied().collect();
let remaining_candidates = apply_distinct_rule(&mut ctx, distinct_fid, &all_candidates)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this call will give priority to distinct documents with a lower document id, breaking the order of documents returned by the search.

if let Some(f) = distinct_field {
let fields_ids_map = search.index.fields_ids_map(search.rtxn)?;
if let Some(distinct_fid) = fields_ids_map.id(f) {
let mut ctx = SearchContext::new(search.index, search.rtxn)?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

performance: SearchContext is a slightly heavyweight object that we'd rather not rebuild here.

.await;
snapshot!(code, @"200 OK");
// Should only return 2 hits: one from "superhero" category and one from "action" category
// But due to the bug, it returns more than expected

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment, the bug is on the semantic search, not on the hybrid search

// First test: pure keyword search with distinct - should only return 2 documents (one per category)
let (response, code) = index
.search_post(
json!({"q": "Captain", "showRankingScore": true}),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will return a single document because the one in the action category is not matched by the search

// Should only return 2 hits: one from "superhero" category and one from "action" category
// But due to the bug, it returns more than expected
let keyword_hits = response["hits"].as_array().unwrap().len();
println!("Keyword search hits: {}", keyword_hits);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not print results in tests; please use snapshots to assert the number of results

snapshot!(code, @"200 OK");

let hybrid_hits = response["hits"].as_array().unwrap().len();
println!("Hybrid search hits: {}", hybrid_hits);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same remark as above

}

// This assertion will fail due to the bug - hybrid search ignores distinct constraint
assert!(hybrid_hits <= 2, "Hybrid search should respect distinct constraint and return at most 2 documents (one per category), but got {}", hybrid_hits);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests in Meilisearch are written to avoid accidental changes, please snapshot on the exact number of hits rather than asserting the logical condition here.

}
}
final_result.documents_ids = filtered_docs;
final_result.document_scores = filtered_scores;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the candidates from the final result are not updated, meaning that the field distribution and the total hits will be wrong

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

distinct is not applied in all cases when using hybrid or federated search

2 participants