
Commit e1c8254

baasitsharief, mliu-cloudera, and actions-user authored
feat: Add ChromaDB support (#314)
Squashed commit message:

- WIP Add ChromaDB support and update dependencies
- WIP Add ChromaDB configuration support and update UI components
- WIP Enhance ChromaDB integration by adding SSL, token, tenant, and database configurations
- fix: mypy errors
- feat: Enhance ChromaDB support by adding database configuration and updating local development script
- feat: Disable input fields and switch for ChromaDB in case of enableModification as False
- Use ChromaDB constants for default tenant and database
- fix: Removed SSL configuration and infer SSL from host URL, add chromadb to localhost script
- fix: mypy errors
- feat: Add SSL configuration for ChromaDB client with conditional settings based on host URL
- refactor: Comment out SSL certificate configuration for ChromaDB client, update instructions for future implementation
- feat: Enhance ChromaDB configuration in .env and README, add SSL cert path support in settings
- Remove https check for setting port and ssl_verify
- Log port parsing errors
- More cleanups
- Fix my check oops
- Update publish_release.yml workflow to trigger on bs/chromadb branch, refine startup_app.sh script comments for clarity
- Update release version to dev-chromadb
- Implement support for ChromaDB as an alternative local vector DB provider
- Update release version to dev-chromadb
- Add .cursor to gitignore, update startup_app.sh to use uvx to start chroma
- Update release version to dev-chromadb
- Add support for controlling anonymized telemetry in ChromaDB client
- Update release version to dev-chromadb
- Remove ChromaDB anonymized telemetry configuration option from UI
- Update release version to dev-chromadb
- bug fix remove undefined arg from chromadb_config
- Fix: Refactor ChromaVectorStore visualize
- fix: ruff and mypy errors
- Flatten metadata in EmbeddingIndexer when vector store has flat_metadata enabled
- Move flat_metadata to VectorStore and flatten metadata for summary indexer

Co-authored-by: Michael Liu <[email protected]>
Co-authored-by: actions-user <[email protected]>
1 parent c35a20d commit e1c8254

File tree: 26 files changed, +1188 −28 lines

.env.example

Lines changed: 25 additions & 1 deletion
```diff
@@ -1,7 +1,19 @@
 AWS_DEFAULT_REGION=us-west-2
 
+# H2 or PostgreSQL (RDS) (H2 is default)
+DB_TYPE=H2
+
+# H2
 DB_URL=jdbc:h2:../databases/rag
 
+# RDS
+# DB_URL= "jdbc:postgresql://<host>:<port>/<database>"
+DB_USERNAME=
+DB_PASSWORD=
+
+# Model Provider
+MODEL_PROVIDER=Bedrock
+
 # CAII
 CAII_DOMAIN=
 
@@ -10,7 +22,7 @@ AZURE_OPENAI_API_KEY=
 AZURE_OPENAI_ENDPOINT=
 OPENAI_API_VERSION=
 
-# QDRANT or OPENSEARCH
+# QDRANT or OPENSEARCH or CHROMADB
 VECTOR_DB_PROVIDER=QDRANT
 
 # OpenSearch
@@ -19,6 +31,18 @@ OPENSEARCH_USERNAME=
 OPENSEARCH_PASSWORD=
 OPENSEARCH_NAMESPACE=
 
+# ChromaDB
+CHROMADB_HOST=http://localhost
+CHROMADB_PORT=8000
+CHROMADB_TOKEN=
+# Tenant and database defaults to the Chroma default values
+CHROMADB_TENANT=
+CHROMADB_DATABASE=
+# If CHROMADB_HOST starts with "https://" and your server uses a private CA,
+# set it to the path of your PEM bundle so Python can verify TLS connections to ChromaDB:
+CHROMADB_SERVER_SSL_CERT_PATH=/absolute/path/to/ca-bundle.pem
+CHROMADB_ENABLE_ANONYMIZED_TELEMETRY=false
+
 # AWS
 AWS_ACCESS_KEY_ID=
 AWS_SECRET_ACCESS_KEY=
```
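The commit notes above mention inferring SSL from the host URL and logging port-parsing errors rather than failing. The snippet below is an illustrative sketch of how these `CHROMADB_*` variables could be resolved into connection settings — it is not the commit's actual code, and `chroma_connection_settings` is a hypothetical name:

```python
from urllib.parse import urlparse


def chroma_connection_settings(env: dict) -> tuple:
    """Derive (host, port, ssl) from CHROMADB_* variables.

    SSL is inferred from the scheme of CHROMADB_HOST; an unparseable
    port falls back to the Chroma default of 8000 (and would be logged).
    """
    raw_host = env.get("CHROMADB_HOST", "http://localhost")
    # Accept both bare hostnames ("localhost") and full URLs
    parsed = urlparse(raw_host if "://" in raw_host else "http://" + raw_host)
    try:
        port = int(env.get("CHROMADB_PORT") or 8000)
    except ValueError:
        port = 8000  # log the parsing error, then fall back
    return parsed.hostname or "localhost", port, parsed.scheme == "https"
```

With `CHROMADB_HOST=https://chroma.example.com`, for example, SSL comes out enabled without any separate flag, which is why the earlier explicit SSL option could be removed.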

.github/workflows/publish_release.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -18,6 +18,7 @@ on:
       - mob/main
       - release/1
       - customer-hotfix
+      - bs/chromadb
 jobs:
   build:
     runs-on: ubuntu-latest
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,6 +1,7 @@
 .env
 .idea/*
 .vscode/*
+.cursor/*
 !.idea/copyright/
 !.idea/prettier.xml
 !.idea/google-java-format.xml
```

README.md

Lines changed: 47 additions & 9 deletions
````diff
@@ -52,6 +52,32 @@ RAG Studio can utilize the local file system or an S3 bucket for storing documents
 
 S3 will also require providing the AWS credentials for the bucket.
 
+### Vector Database Options
+
+RAG Studio supports Qdrant (default), OpenSearch (Cloudera Semantic Search), and ChromaDB.
+
+- To choose the vector DB, set `VECTOR_DB_PROVIDER` to one of `QDRANT`, `OPENSEARCH`, or `CHROMADB` in your `.env`.
+
+#### ChromaDB Setup
+
+If you select ChromaDB, configure the following environment variables in `.env`:
+
+- `CHROMADB_HOST` - Hostname or URL for ChromaDB. Use `localhost` for local Docker.
+- `CHROMADB_PORT` - Port for ChromaDB (default `8000`). Not required if `CHROMADB_HOST` starts with `https://` and the server infers the port.
+- `CHROMADB_TENANT` - Optional. Defaults to the Chroma default tenant.
+- `CHROMADB_DATABASE` - Optional. Defaults to the Chroma default database.
+- `CHROMADB_TOKEN` - Optional. Include if your Chroma server requires an auth token.
+- `CHROMADB_SERVER_SSL_CERT_PATH` - Optional. Path to a PEM bundle for TLS verification when using HTTPS with a private CA.
+- `CHROMADB_ENABLE_ANONYMIZED_TELEMETRY` - Optional. Enables anonymized telemetry in the ChromaDB client; defaults to `false`.
+
+Notes:
+
+- When `VECTOR_DB_PROVIDER=CHROMADB` and `CHROMADB_HOST=localhost`, the local-dev script will automatically start a ChromaDB Docker container on `CHROMADB_PORT=8000`.
+- ChromaDB collections are automatically namespaced using the tenant and database values to avoid conflicts between different RAG Studio instances.
+- For production deployments, consider using a dedicated ChromaDB server with authentication enabled via `CHROMADB_TOKEN`.
+- When using HTTPS endpoints, ensure your certificate chain is properly configured or provide the CA bundle path via `CHROMADB_SERVER_SSL_CERT_PATH`.
+- Anonymized telemetry is disabled by default; enable it by setting `CHROMADB_ENABLE_ANONYMIZED_TELEMETRY=true`.
+
 ### Enhanced Parsing Options:
 
 RAG Studio can optionally enable enhanced parsing by providing the `USE_ENHANCED_PDF_PROCESSING` environment variable. Enabling this will allow RAG Studio to parse images and tables from PDFs. When enabling this feature, we strongly recommend using this with a GPU and at least 16GB of memory.
@@ -82,7 +108,7 @@ This variable can be set from the project settings for the AMP in CML.
 ## Air-gapped Environments
 
 If you are using an air-gapped environment, you will need to whitelist at the minimum the following domains in order to use the AMP.
-There may be other domains that need to be whitelisted depending on your environment and the model service provider you select.
+There may be other domains that need to be whitelisted depending on your environment and the model service provider you select.
 
 - `https://github.com`
 - `https://raw.githubusercontent.com`
@@ -150,17 +176,29 @@ the Node service locally, you can do so by following these steps:
 docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/databases/qdrant_storage:/qdrant/storage:z qdrant/qdrant
 ```
 
+#### To run ChromaDB locally
+
+```
+docker run --name chromadb_dev --rm -d -p 8000:8000 -v $(pwd)/databases/chromadb_storage:/data chromadb/chroma
+```
+
+#### Use ChromaDB with local-dev.sh
+
+- Copy `.env.example` to `.env`.
+- Set `VECTOR_DB_PROVIDER=CHROMADB` in `.env` (defaults assume `CHROMADB_HOST=localhost` and `CHROMADB_PORT=8000`).
+- Run `./local-dev.sh` from the repo root. When `CHROMADB_HOST=localhost`, the script will auto-start a ChromaDB Docker container.
+
 #### Modifying UI in CML
 
-* This is an unsupported workflow, but it is possible to modify the UI code in CML.
+- This is an unsupported workflow, but it is possible to modify the UI code in CML.
 
-- Start a CML Session from a CML Project that has the RAG Studio AMP installed.
-- Open the terminal in the CML Session and navigate to the `ui` directory.
-- Run `source ~/.bashrc` to ensure the Node environment variables are loaded.
-- Install PNPM using `npm install -g pnpm`. Docs on PNPM can be found here: https://pnpm.io/installation#using-npm
-- Run `pnpm install` to install the dependencies.
-- Make your changes to the UI code in the `ui` directory.
-- Run `pnpm build` to build the new UI bundle.
+* Start a CML Session from a CML Project that has the RAG Studio AMP installed.
+* Open the terminal in the CML Session and navigate to the `ui` directory.
+* Run `source ~/.bashrc` to ensure the Node environment variables are loaded.
+* Install PNPM using `npm install -g pnpm`. Docs on PNPM can be found here: https://pnpm.io/installation#using-npm
+* Run `pnpm install` to install the dependencies.
+* Make your changes to the UI code in the `ui` directory.
+* Run `pnpm build` to build the new UI bundle.
 
 ## The Fine Print
 
````
llm-service/app/ai/indexing/base.py

Lines changed: 14 additions & 2 deletions
```diff
@@ -1,9 +1,12 @@
+import json
 import logging
 import os
 from abc import abstractmethod
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Dict, Type, Optional
+from typing import Dict, Type, Optional, TypeVar
+
+from llama_index.core.schema import BaseNode
 
 from .readers.base_reader import BaseReader, ReaderConfig
 from .readers.csv import CSVReader
@@ -26,7 +29,6 @@
     ".docx": DocxReader,
     ".pptx": PptxReader,
     ".pptm": PptxReader,
-    ".ppt": PptxReader,
     ".csv": CSVReader,
     ".json": JSONReader,
     ".jpg": ImagesReader,
@@ -40,6 +42,9 @@
 }
 
 
+TNode = TypeVar("TNode", bound=BaseNode)
+
+
 @dataclass
 class NotSupportedFileExtensionError(Exception):
     file_extension: str
@@ -54,6 +59,13 @@ def __init__(
         self.data_source_id = data_source_id
         self.reader_config = reader_config
 
+    @staticmethod
+    def _flatten_metadata(chunk: TNode) -> TNode:
+        for key, value in chunk.metadata.items():
+            if isinstance(value, list) or isinstance(value, dict):
+                chunk.metadata[key] = json.dumps(value)
+        return chunk
+
     @abstractmethod
     def index_file(self, file_path: Path, doc_id: str) -> None:
         pass
```
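The `_flatten_metadata` helper added above JSON-encodes any list or dict metadata values so that vector stores which only accept scalar ("flat") metadata can ingest them. A standalone sketch of the same transformation on a plain dict, for illustration only:

```python
import json


def flatten_metadata(metadata: dict) -> dict:
    """JSON-encode list/dict values; leave scalar values untouched."""
    return {
        key: json.dumps(value) if isinstance(value, (list, dict)) else value
        for key, value in metadata.items()
    }
```

Scalars pass through unchanged, and the encoded values can be recovered later with `json.loads` when the node is read back.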

llm-service/app/ai/indexing/embedding_indexer.py

Lines changed: 6 additions & 0 deletions
```diff
@@ -108,6 +108,12 @@ def index_file(self, file_path: Path, document_id: str) -> None:
         # we're capturing "text".
         converted_chunks: List[BaseNode] = [chunk for chunk in chunk_batch]
 
+        # flatten metadata if the vector store has flat_metadata enabled
+        if self.chunks_vector_store.flat_metadata:
+            converted_chunks = [
+                self._flatten_metadata(chunk) for chunk in converted_chunks
+            ]
+
         chunks_vector_store = self.chunks_vector_store.llama_vector_store()
         chunks_vector_store.add(converted_chunks)
```

llm-service/app/ai/indexing/summary_indexer.py

Lines changed: 11 additions & 5 deletions
```diff
@@ -70,6 +70,7 @@
 from qdrant_client.http.exceptions import UnexpectedResponse
 
 from app.services import models
+from app.ai.vector_stores.vector_store import VectorStore
 from .base import BaseTextIndexer
 from .readers.base_reader import ReaderConfig, ChunksResult
 from ..vector_stores.vector_store_factory import VectorStoreFactory
@@ -101,6 +102,7 @@ def __init__(
         self.splitter = splitter
         self.llm = llm
         self.embedding_model = embedding_model
+        self.summary_vector_store = VectorStoreFactory.for_summaries(data_source_id)
 
     @staticmethod
     def __database_dir(data_source_id: int) -> str:
@@ -177,19 +179,20 @@ def __summary_indexer(
             return SummaryIndexer.__summary_indexer_with_config(
                 persist_dir=persist_dir,
                 index_configuration=self.__index_kwargs(embed_summaries),
+                summary_vector_store=self.summary_vector_store,
             )
         except (ValueError, FileNotFoundError):
             doc_summary_index = self.__init_summary_store(persist_dir)
         return doc_summary_index
 
     @staticmethod
     def __summary_indexer_with_config(
-        persist_dir: str, index_configuration: Dict[str, Any]
+        persist_dir: str, index_configuration: Dict[str, Any],
+        summary_vector_store: VectorStore,
     ) -> DocumentSummaryIndex:
-        data_source_id: int = index_configuration.get("data_source_id")
         storage_context = SummaryIndexer.create_storage_context(
             persist_dir,
-            VectorStoreFactory.for_summaries(data_source_id).llama_vector_store(),
+            summary_vector_store.llama_vector_store(),
         )
         doc_summary_index: DocumentSummaryIndex = cast(
             DocumentSummaryIndex,
@@ -293,6 +296,8 @@ def index_file(self, file_path: Path, document_id: str) -> None:
         with _write_lock:
             persist_dir = self.__persist_dir()
             summary_store: DocumentSummaryIndex = self.__summary_indexer(persist_dir)
+            if self.summary_vector_store.flat_metadata:
+                nodes = [self._flatten_metadata(node) for node in nodes]
             summary_store.insert_nodes(nodes)
             summary_store.storage_context.persist(persist_dir=persist_dir)
@@ -311,7 +316,7 @@
         # and re-index it with the addition/removal.
         global_persist_dir = self.__persist_root_dir()
         global_summary_store = self.__summary_indexer(
-            global_persist_dir, embed_summaries=False
+            global_persist_dir, embed_summaries=False,
         )
         data_source_node = Document(doc_id=str(self.data_source_id))
@@ -493,7 +498,8 @@ def delete_data_source_by_id(data_source_id: int) -> None:
                 embed_summaries=False,
             )
             global_summary_store = SummaryIndexer.__summary_indexer_with_config(
-                global_persist_dir, configuration
+                global_persist_dir, configuration,
+                summary_vector_store=vector_store,
             )
         except FileNotFoundError:
             ## global summary store doesn't exist, nothing to do
```
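The `summary_indexer.py` changes above move vector-store creation out of the static helper (which previously re-derived it from the `index_configuration` dict via `VectorStoreFactory`) into `__init__`, so the store is resolved once and passed down explicitly. A minimal sketch of that dependency-passing pattern, using invented stand-in classes rather than the commit's real ones:

```python
class StubVectorStore:
    """Stand-in for the real VectorStore; not the commit's code."""

    def __init__(self, data_source_id: int) -> None:
        self.data_source_id = data_source_id
        self.flat_metadata = True  # e.g. a store that only accepts flat metadata


class StubSummaryIndexer:
    def __init__(self, data_source_id: int) -> None:
        # resolved once here, instead of inside every helper call
        self.summary_vector_store = StubVectorStore(data_source_id)

    def build(self) -> int:
        # thread the already-constructed store through to the helper
        return self._with_config(self.summary_vector_store)

    @staticmethod
    def _with_config(summary_vector_store: StubVectorStore) -> int:
        # the helper now receives the store directly rather than a raw id
        return summary_vector_store.data_source_id
```

Passing the store as an argument also removes the fragile `index_configuration.get("data_source_id")` lookup that the old static method relied on.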
