
Conversation

@chroma-droid

This PR cherry-picks the commit 2eca285 onto rc/2025-11-21. If there are unresolved conflicts, please resolve them manually.

…5767)

## Description of changes

This is an attempt to put the tokens for a sparse vector in said sparse
vector.

## Test plan

CI

## Migration plan

N/A

## Observability plan

N/A

## Documentation Changes

N/A
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@propel-code-bot
Contributor

Hot-fix: Back-port sparse-vector token metadata support to rc/2025-11-21

This PR cherry-picks the sparse-vector enhancement originally delivered in PR #5767 onto the November 21, 2025 release branch. The change introduces an optional metadata key that stores the original token list alongside the sparse vector, allowing all sparse-search operators (BM25, IDF, KNN ranking, etc.) to retrieve token information end-to-end. The patch touches the Rust core, persistence layer, operator implementations, public protobuf/API surfaces, and JS/Python client code so that the new field is transparently propagated without breaking existing clients.

Because metadata is schemaless JSON, the addition is backward-compatible. Older clients ignore the new key, while newer clients and operators can rely on it to achieve better relevance scores and consistency across storage layers. No explicit data migrations are required.
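
As a rough sketch of the shape this describes (the actual definitions live in rust/types/src/metadata.rs and collection_schema.rs and differ in detail; all names here are hypothetical, not the PR's code):

```rust
// Hypothetical sketch of a sparse vector that optionally carries its
// source tokens; the real types in rust/types/src/metadata.rs may
// differ in naming and layout.
pub struct SparseVector {
    /// Hashed token IDs, sorted ascending.
    pub indices: Vec<u32>,
    /// Weight for each index (same length as `indices`).
    pub values: Vec<f32>,
    /// Optional original tokens, aligned with `indices`. Older writers
    /// omit this, so readers must treat it as best-effort metadata.
    pub tokens: Option<Vec<String>>,
}

impl SparseVector {
    /// Look up the stored token for a given index, if tokens were kept.
    pub fn token_for(&self, index: u32) -> Option<&str> {
        let pos = self.indices.binary_search(&index).ok()?;
        self.tokens.as_ref()?.get(pos).map(String::as_str)
    }
}
```

Keeping the token list optional is what makes the layout backward-compatible: readers that predate the field simply see nothing there and fall back to index-only behavior.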

Key Changes

• Added optional tokens field to metadata model (rust/types/src/metadata.rs, collection_schema.rs)
• Updated execution operators (idf.rs, rank.rs, sparse_log_knn.rs, bm25.rs) to read/write token lists from metadata
• Extended public API definitions (chroma.proto, api_types.rs, TypeScript and Python bindings) to expose the new field
• Persisted new metadata layout in block-file storage
• Adjusted server code and property-based tests to verify round-trip of token data
• Updated JS client packages to consume and emit the new format

Affected Areas

• Metadata schema & persistence
• Sparse-vector execution operators
• Public protobuf & client SDKs (TypeScript, Python, JS)
• Server request/response handling
• Automated tests

This summary was automatically generated by @propel-code-bot

Comment on lines +90 to +92
```rust
let sparse_triples = token_ids.chunk_by(|a, b| a.0 == b.0).map(|chunk| {
    let id = chunk[0].0;
    let tk = chunk[0].1.clone();
```
Contributor

[CriticalError]

Logic error in token storage: When store_tokens is true, you're storing chunk[0].1.clone() which is the first token, but for duplicate token IDs after hashing, this loses information about which specific token variant was used. If "running" and "run" both hash to ID 42, you'll only store the first one encountered.

Consider whether this is the intended behavior or whether you need to store a representative token (e.g., the most frequent variant); one possible approach is sketched after this comment.


File: rust/chroma/src/embed/bm25.rs
Line: 92
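
If the representative-token route is taken, here is a minimal sketch of picking the most frequent variant per hashed ID, assuming (as in the snippet under review) a slice of `(u32, String)` pairs sorted by ID; this is illustrative, not the PR's code:

```rust
use std::collections::HashMap;

// For each group of pairs sharing a hashed ID, keep the most frequent
// token string instead of whichever token happened to appear first.
fn representative_tokens(token_ids: &[(u32, String)]) -> Vec<(u32, String)> {
    token_ids
        .chunk_by(|a, b| a.0 == b.0)
        .map(|chunk| {
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for (_, tk) in chunk {
                *counts.entry(tk.as_str()).or_insert(0) += 1;
            }
            // Tie-break on the token text so the result is deterministic.
            let (best, _) = counts
                .into_iter()
                .max_by(|(ta, ca), (tb, cb)| ca.cmp(cb).then(ta.cmp(tb)))
                .expect("chunk_by never yields an empty chunk");
            (chunk[0].0, best.to_string())
        })
        .collect()
}
```

The tie-break on token text keeps the choice deterministic when two variants occur equally often within a chunk.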

Comment on lines +180 to +185
```rust
}
let scaled_query = SparseVector::from_triples(self.query.iter().enumerate().map(
    |(token_position, (index, value))| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (tokens[token_position].clone(), index, scale * value)
```
Contributor

[BestPractice]

Potential panic on index out of bounds: If tokens.len() != self.query.indices.len() at runtime (despite earlier validation), tokens[token_position] will panic. The enumerate() iterator doesn't guarantee bounds safety here.

```rust
let scaled_query = SparseVector::from_triples(self.query.iter().zip(tokens.iter()).map(
    |((index, value), token)| {
        let nt = nts.get(&index).cloned().unwrap_or_default() as f32;
        let scale = scale(n as f32, nt);
        (token.clone(), index, scale * value)
    },
));
```

File: rust/worker/src/execution/operators/idf.rs
Line: 185
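
One caveat with the zip-based suggestion: `zip` stops at the shorter side, so a length mismatch would be silently truncated rather than reported. If surfacing the mismatch is preferable, a minimal sketch of an explicit guard (names hypothetical, not the PR's code):

```rust
// Validate the alignment invariant up front so a mismatch becomes a
// recoverable error instead of a panic or silent data loss downstream.
fn check_aligned(tokens: &[String], indices: &[u32]) -> Result<(), String> {
    if tokens.len() != indices.len() {
        return Err(format!(
            "token/index length mismatch: {} tokens vs {} indices",
            tokens.len(),
            indices.len()
        ));
    }
    Ok(())
}
```

Calling such a guard before building the scaled query makes the invariant checkable at the point where it is assumed.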

