
Conversation

@chroma-droid

This PR cherry-picks the commit 2eca285 onto rc/2025-11-21. If there are unresolved conflicts, please resolve them manually.

…5767)

## Description of changes

This is an attempt to put the tokens for a sparse vector in said sparse
vector.

## Test plan

CI

## Migration plan

N/A

## Observability plan

N/A

## Documentation Changes

N/A
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@propel-code-bot
Contributor

Hot-fix: Back-port sparse-vector token metadata support to rc/2025-11-21

This PR cherry-picks the sparse-vector enhancement originally delivered in PR #5767 onto the November 21, 2025 release branch. The change introduces an optional metadata key that stores the original token list alongside the sparse vector, allowing all sparse-search operators (BM25, IDF, KNN ranking, etc.) to retrieve token information end-to-end. The patch touches the Rust core, persistence layer, operator implementations, public protobuf/API surfaces, and JS/Python client code so that the new field is transparently propagated without breaking existing clients.

Because metadata is schemaless JSON, the addition is backward-compatible. Older clients ignore the new key, while newer clients and operators can rely on it to achieve better relevance scores and consistency across storage layers. No explicit data migrations are required.
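
As a rough sketch of the shape this describes (the actual definitions live in rust/types/src/metadata.rs and collection_schema.rs and differ in detail; all names here are hypothetical, not the PR's code):

```rust
// Hypothetical sketch of a sparse vector that optionally carries its
// source tokens; the real types in rust/types/src/metadata.rs may
// differ in naming and layout.
pub struct SparseVector {
    /// Hashed token IDs, sorted ascending.
    pub indices: Vec<u32>,
    /// Weight for each index (same length as `indices`).
    pub values: Vec<f32>,
    /// Optional original tokens, aligned with `indices`. Older writers
    /// omit this, so readers must treat it as best-effort metadata.
    pub tokens: Option<Vec<String>>,
}

impl SparseVector {
    /// Look up the stored token for a given index, if tokens were kept.
    pub fn token_for(&self, index: u32) -> Option<&str> {
        let pos = self.indices.binary_search(&index).ok()?;
        self.tokens.as_ref()?.get(pos).map(String::as_str)
    }
}
```

Keeping the token list optional is what makes the layout backward-compatible: readers that predate the field simply see nothing there and fall back to index-only behavior.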

Key Changes

• Added optional tokens field to metadata model (rust/types/src/metadata.rs, collection_schema.rs)
• Updated execution operators (idf.rs, rank.rs, sparse_log_knn.rs, bm25.rs) to read/write token lists from metadata
• Extended public API definitions (chroma.proto, api_types.rs, TypeScript and Python bindings) to expose the new field
• Persisted new metadata layout in block-file storage
• Adjusted server code and property-based tests to verify round-trip of token data
• Updated JS client packages to consume and emit the new format

Affected Areas

• Metadata schema & persistence
• Sparse-vector execution operators
• Public protobuf & client SDKs (TypeScript, Python, JS)
• Server request/response handling
• Automated tests

This summary was automatically generated by @propel-code-bot

Comment on lines +90 to +92
```rust
let sparse_triples = token_ids.chunk_by(|a, b| a.0 == b.0).map(|chunk| {
    let id = chunk[0].0;
    let tk = chunk[0].1.clone();
```
Contributor

[CriticalError]

Logic error in token storage: When store_tokens is true, you're storing chunk[0].1.clone() which is the first token, but for duplicate token IDs after hashing, this loses information about which specific token variant was used. If "running" and "run" both hash to ID 42, you'll only store the first one encountered.

Consider whether this is the intended behavior or whether you need to store a representative token (e.g., the most frequent variant); one possible approach is sketched after this comment.


File: rust/chroma/src/embed/bm25.rs
Line: 92
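
If the representative-token route is taken, here is a minimal sketch of picking the most frequent variant per hashed ID, assuming (as in the snippet under review) a slice of `(u32, String)` pairs sorted by ID; this is illustrative, not the PR's code:

```rust
use std::collections::HashMap;

// For each group of pairs sharing a hashed ID, keep the most frequent
// token string instead of whichever token happened to appear first.
fn representative_tokens(token_ids: &[(u32, String)]) -> Vec<(u32, String)> {
    token_ids
        .chunk_by(|a, b| a.0 == b.0)
        .map(|chunk| {
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for (_, tk) in chunk {
                *counts.entry(tk.as_str()).or_insert(0) += 1;
            }
            // Tie-break on the token text so the result is deterministic.
            let (best, _) = counts
                .into_iter()
                .max_by(|(ta, ca), (tb, cb)| ca.cmp(cb).then(ta.cmp(tb)))
                .expect("chunk_by never yields an empty chunk");
            (chunk[0].0, best.to_string())
        })
        .collect()
}
```

The tie-break on token text keeps the choice deterministic when two variants occur equally often within a chunk.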

Comment on lines +180 to +185
```rust
}
let scaled_query = SparseVector::from_triples(self.query.iter().enumerate().map(
    |(token_position, (index, value))| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (tokens[token_position].clone(), index, scale * value)
```
Contributor

[BestPractice]

Potential panic on index out of bounds: If tokens.len() != self.query.indices.len() at runtime (despite earlier validation), tokens[token_position] will panic. The enumerate() iterator doesn't guarantee bounds safety here.

```rust
let scaled_query = SparseVector::from_triples(self.query.iter().zip(tokens.iter()).map(
    |((index, value), token)| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (token.clone(), index, scale * value)
    },
));
```

File: rust/worker/src/execution/operators/idf.rs
Line: 185
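
One caveat with the zip-based suggestion: `zip` stops at the shorter side, so a length mismatch would be silently truncated rather than reported. If surfacing the mismatch is preferable, a minimal sketch of an explicit guard (names hypothetical, not the PR's code):

```rust
// Validate the alignment invariant up front so a mismatch becomes a
// recoverable error instead of a panic or silent data loss downstream.
fn check_aligned(tokens: &[String], indices: &[u32]) -> Result<(), String> {
    if tokens.len() != indices.len() {
        return Err(format!(
            "token/index length mismatch: {} tokens vs {} indices",
            tokens.len(),
            indices.len()
        ));
    }
    Ok(())
}
```

Calling such a guard before building the scaled query makes the invariant checkable at the point where it is assumed.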

