[HOTFIX] applying PR #5767 to rc/2025-11-21 #5900
base: rc/2025-11-21
Conversation
…5767)

## Description of changes
This is an attempt to put the tokens for a sparse vector in said sparse vector.

## Test plan
CI

## Migration plan
N/A

## Observability plan
N/A

## Documentation Changes
N/A
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Hot-fix: Back-port sparse-vector token metadata support to rc/2025-11-21

This PR cherry-picks the sparse-vector enhancement originally delivered in PR #5767 onto the November 21, 2025 release branch. The change introduces an optional metadata key that stores the original token list alongside the sparse vector, allowing all sparse-search operators (BM25, IDF, KNN ranking, etc.) to retrieve token information end-to-end. The patch touches the Rust core, persistence layer, operator implementations, public protobuf/API surfaces, and JS/Python client code so that the new field is transparently propagated without breaking existing clients.

Because metadata is schemaless JSON, the addition is backward-compatible. Older clients ignore the new key, while newer clients and operators can rely on it to achieve better relevance scores and consistency across storage layers. No explicit data migrations are required.

Key Changes
• Added optional

Affected Areas
• Metadata schema & persistence

This summary was automatically generated by @propel-code-bot
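The backward-compatibility claim above rests on readers only looking up metadata keys they know about. A minimal std-only sketch of that contract, assuming a hypothetical key name `sparse_vector_tokens` (the actual key used by the PR is not shown here):

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical metadata record; the key name is an illustration,
    // not necessarily the actual key introduced by the PR.
    let mut metadata: HashMap<String, Vec<String>> = HashMap::new();
    metadata.insert(
        "sparse_vector_tokens".to_string(),
        vec!["run".to_string(), "fast".to_string()],
    );

    // Newer readers can use the token list when it is present...
    if let Some(tokens) = metadata.get("sparse_vector_tokens") {
        println!("tokens available: {:?}", tokens);
    }

    // ...while older readers only fetch the keys they already know,
    // so they simply never observe the new entry and nothing breaks.
    assert!(metadata.get("some_legacy_key").is_none());
}
```

Because the lookup is by key rather than by position or schema version, adding the key requires no migration, matching the "no explicit data migrations" note above.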
```rust
let sparse_triples = token_ids.chunk_by(|a, b| a.0 == b.0).map(|chunk| {
    let id = chunk[0].0;
    let tk = chunk[0].1.clone();
```
[CriticalError]
Logic error in token storage: When store_tokens is true, you're storing chunk[0].1.clone() which is the first token, but for duplicate token IDs after hashing, this loses information about which specific token variant was used. If "running" and "run" both hash to ID 42, you'll only store the first one encountered.
Consider if this is the intended behavior or if you need to store a representative token (e.g., most frequent variant).
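One way to act on this suggestion is to count variants within each hash-collision group and keep the most frequent one. A minimal sketch, assuming the input is sorted by id; the function name and the collision of "running"/"run" on id 42 are illustrative, not taken from the PR:

```rust
use std::collections::HashMap;

/// For each run of equal ids (input sorted by id), pick the most frequent
/// token variant instead of the first one encountered.
fn representative_tokens(token_ids: &[(u32, String)]) -> Vec<(u32, String)> {
    token_ids
        .chunk_by(|a, b| a.0 == b.0)
        .map(|chunk| {
            // Count how often each variant appears within this id group.
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for (_, tk) in chunk {
                *counts.entry(tk.as_str()).or_insert(0) += 1;
            }
            // Keep the variant with the highest count as the representative.
            let best = counts
                .into_iter()
                .max_by_key(|&(_, c)| c)
                .map(|(tk, _)| tk.to_string())
                .unwrap();
            (chunk[0].0, best)
        })
        .collect()
}

fn main() {
    // "running" and "run" are assumed to collide on id 42 for illustration.
    let token_ids = vec![
        (7, "fast".to_string()),
        (42, "running".to_string()),
        (42, "run".to_string()),
        (42, "run".to_string()),
    ];
    assert_eq!(
        representative_tokens(&token_ids),
        vec![(7, "fast".to_string()), (42, "run".to_string())]
    );
}
```

Note that when two variants tie on frequency, `max_by_key` breaks the tie by `HashMap` iteration order, which is unspecified; a production version would want a deterministic tie-break (e.g. lexicographic).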
File: rust/chroma/src/embed/bm25.rs
Line: 92

```rust
}
let scaled_query = SparseVector::from_triples(self.query.iter().enumerate().map(
    |(token_position, (index, value))| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (tokens[token_position].clone(), index, scale * value)
```
[BestPractice]
Potential panic on index out of bounds: If tokens.len() != self.query.indices.len() at runtime (despite earlier validation), tokens[token_position] will panic. The enumerate() iterator doesn't guarantee bounds safety here.
```rust
let scaled_query = SparseVector::from_triples(self.query.iter().zip(tokens.iter()).map(
    |((index, value), token)| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (token.clone(), index, scale * value)
    },
));
```
File: rust/worker/src/execution/operators/idf.rs
Line: 185
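The behavioral difference behind this suggestion: indexing past the end of `tokens` panics, while `zip` stops at the shorter of the two iterators. A simplified, self-contained sketch (the scaling logic is omitted; the function shape is illustrative, not the actual operator code):

```rust
/// Pair each (index, value) with its token via `zip`, which truncates to the
/// shorter iterator instead of panicking on an out-of-bounds index.
fn paired_triples(query: &[(u32, f32)], tokens: &[String]) -> Vec<(String, u32, f32)> {
    query
        .iter()
        .zip(tokens.iter())
        .map(|(&(index, value), token)| (token.clone(), index, value))
        .collect()
}

fn main() {
    // Hypothetical mismatch: three query entries but only two tokens.
    let query: Vec<(u32, f32)> = vec![(1, 0.5), (2, 0.25), (3, 0.75)];
    let tokens = vec!["alpha".to_string(), "beta".to_string()];

    // With direct indexing, `tokens[2]` would panic here; with `zip`
    // the third query entry is silently dropped instead.
    let triples = paired_triples(&query, &tokens);
    assert_eq!(triples.len(), 2);
}
```

The trade-off is worth stating: `zip` trades a panic for silent truncation, which can mask a length mismatch. Keeping a `debug_assert_eq!(query.len(), tokens.len())` alongside the `zip` makes the invariant visible in debug builds without risking a production panic.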
This PR cherry-picks the commit 2eca285 onto rc/2025-11-21. If there are unresolved conflicts, please resolve them manually.