-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: Implement keyword search
and delete_chunk
at ChromaDB
#3057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
bwook00
wants to merge
15
commits into
llamastack:main
Choose a base branch
from
bwook00:chroma
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+327
−11
Open
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
554c78b
add delete_chunk feature at chroma
bwook00 26fb208
add query_keyword function
bwook00 abd4562
apply pre-commit
bwook00 5f2de49
add test code
bwook00 80dc2a6
Merge branch 'refs/heads/main' into chroma
bwook00 4809b33
Merge branch 'main' into chroma
bwook00 7690de9
Merge branch 'main' into chroma
bwook00 d460fd6
Merge branch 'main' into chroma
bwook00 0d3646c
Merge branch 'main' into chroma
bwook00 6b5ff1c
Merge branch 'main' into chroma
bwook00 c66ebae
Merge branch 'main' into chroma
bwook00 db0ce0d
apply Reranker class at chromaDB
bwook00 f23ea04
Merge branch 'main' into chroma
bwook00 d3958fa
Merge branch 'main' into chroma
bwook00 581608d
Merge branch 'main' into chroma
bwook00 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,124 @@ | ||
# Copyright (c) Meta Platforms, Inc. and affiliates. | ||
# All rights reserved. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As mentioned in #3264, I don't think we should have remote provider unit tests. I think we should leave the testing for remote providers to our integrations tests. |
||
# | ||
# This source code is licensed under the terms described in the LICENSE file in | ||
# the root directory of this source tree. | ||
|
||
import json | ||
from unittest.mock import MagicMock, patch | ||
|
||
import numpy as np | ||
import pytest | ||
|
||
from llama_stack.apis.vector_io import QueryChunksResponse | ||
|
||
# Mock the entire chromadb module | ||
chromadb_mock = MagicMock() | ||
chromadb_mock.AsyncHttpClient = MagicMock | ||
chromadb_mock.PersistentClient = MagicMock | ||
|
||
# Apply the mock before importing ChromaIndex | ||
with patch.dict("sys.modules", {"chromadb": chromadb_mock}): | ||
from llama_stack.providers.remote.vector_io.chroma.chroma import ChromaIndex | ||
|
||
# This test is a unit test for the ChromaVectorIOAdapter class. This should only contain | ||
# tests which are specific to this class. More general (API-level) tests should be placed in | ||
# tests/integration/vector_io/ | ||
# | ||
# How to run this test: | ||
# | ||
# pytest tests/unit/providers/vector_io/test_chroma.py \ | ||
# -v -s --tb=short --disable-warnings --asyncio-mode=auto | ||
|
||
CHROMA_PROVIDER = "chromadb" | ||
|
||
|
||
@pytest.fixture | ||
async def mock_chroma_collection() -> MagicMock: | ||
"""Create a mock Chroma collection with common method behaviors.""" | ||
collection = MagicMock() | ||
collection.name = "test_collection" | ||
|
||
# Mock add operation | ||
collection.add.return_value = None | ||
|
||
# Mock query operation for vector search | ||
collection.query.return_value = { | ||
"distances": [[0.1, 0.2]], | ||
"documents": [ | ||
[ | ||
json.dumps({"content": "mock chunk 1", "metadata": {"document_id": "doc1"}}), | ||
json.dumps({"content": "mock chunk 2", "metadata": {"document_id": "doc2"}}), | ||
] | ||
], | ||
} | ||
|
||
# Mock delete operation | ||
collection.delete.return_value = None | ||
|
||
return collection | ||
|
||
|
||
@pytest.fixture | ||
async def mock_chroma_client(mock_chroma_collection): | ||
"""Create a mock Chroma client with common method behaviors.""" | ||
client = MagicMock() | ||
|
||
# Mock collection operations | ||
client.get_or_create_collection.return_value = mock_chroma_collection | ||
client.get_collection.return_value = mock_chroma_collection | ||
client.delete_collection.return_value = None | ||
|
||
return client | ||
|
||
|
||
@pytest.fixture | ||
async def chroma_index(mock_chroma_client, mock_chroma_collection): | ||
"""Create a ChromaIndex with mocked client and collection.""" | ||
index = ChromaIndex(client=mock_chroma_client, collection=mock_chroma_collection) | ||
yield index | ||
# No real cleanup needed since we're using mocks | ||
|
||
|
||
async def test_add_chunks(chroma_index, sample_chunks, sample_embeddings, mock_chroma_collection): | ||
await chroma_index.add_chunks(sample_chunks, sample_embeddings) | ||
|
||
# Verify data was inserted | ||
mock_chroma_collection.add.assert_called_once() | ||
|
||
# Verify the add call had the right number of chunks | ||
add_call = mock_chroma_collection.add.call_args | ||
assert len(add_call[1]["documents"]) == len(sample_chunks) | ||
|
||
|
||
async def test_query_chunks_vector( | ||
chroma_index, sample_chunks, sample_embeddings, embedding_dimension, mock_chroma_collection | ||
): | ||
# Setup: Add chunks first | ||
await chroma_index.add_chunks(sample_chunks, sample_embeddings) | ||
|
||
# Test vector search | ||
query_embedding = np.random.rand(embedding_dimension).astype(np.float32) | ||
response = await chroma_index.query_vector(query_embedding, k=2, score_threshold=0.0) | ||
|
||
assert isinstance(response, QueryChunksResponse) | ||
assert len(response.chunks) == 2 | ||
mock_chroma_collection.query.assert_called_once() | ||
|
||
|
||
async def test_query_chunks_keyword_search(chroma_index, sample_chunks, sample_embeddings, mock_chroma_collection): | ||
await chroma_index.add_chunks(sample_chunks, sample_embeddings) | ||
|
||
# Test keyword search | ||
query_string = "Sentence 5" | ||
response = await chroma_index.query_keyword(query_string=query_string, k=2, score_threshold=0.0) | ||
|
||
assert isinstance(response, QueryChunksResponse) | ||
assert len(response.chunks) == 2 | ||
|
||
|
||
async def test_delete_collection(chroma_index, mock_chroma_client): | ||
# Test collection deletion | ||
await chroma_index.delete() | ||
|
||
mock_chroma_client.delete_collection.assert_called_once_with(chroma_index.collection.name) |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase your PR to exclude llama_stack/providers/utils/vector_io/vector_utils.py from your PR. Rebase from main since #3064 was merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1