
Commit 6e4703f

Authored by baasitsharief, mliu-cloudera, and actions-user
feat: Add support for Excel files (#323)
* feat: add support for reading XLSX files
* feat: add pandas excel dependencies
* fix: prevent lambda function from capturing loop variable in EmbeddingIndexer
* Use Executor.submit() args
* feat: renamed XlsxReader to ExcelReader for broader Excel file support
* refactor: renaming XlsxSplitter and fixing mypy errors
* refactor: rename config classes for consistency
* Update release version to dev-testing
* Cast TextNodes directly
* Simplify model_source if-else
* Remove implicit port conversion in config_to_env to stringify of None
* Improve Qdrant configuration and performance with environment variables and gRPC support
* Remove unused environment variables and hardcode embedding concurrency and boto3 max pool connections
* minor fixes for configuration
* Reduce EmbeddingIndexer batch size and add botocore config to BedrockModelProvider
* Adjust batch size in EmbeddingIndexer based on reader type to prevent Qdrant timeouts
* Add support for CSVReader in EmbeddingIndexer and adjust batch size accordingly
* Refactor batch sizes and sampling for EmbeddingIndexer and SummaryIndexer to improve performance with tabular documents
* Enhance ExcelReader to handle empty workbooks and ensure JSON serialization compatibility
* Enhance ExcelReader to handle null dataframes and improve JSON serialization
* fix: Use non-deprecated `map` function over `applymap` in ExcelReader
* Refactor batch sizes in EmbeddingIndexer and SummaryIndexer to use Qdrant-safe batches
* Adjust batch sizes in LlamaIndexQdrantVectorStore
* fix: mypy errors
* Update release version to dev-testing
* Refactor Qdrant configuration and ExcelReader for improved performance and compatibility
* fix: more mypy issues
* Update release version to dev-testing
* Enable Git LFS for prebuilt artifacts
* merge origin/main
* Update prebuilt artifacts with new versions
* Update batch sizes for Qdrant vector store and indexing
* fix: Increase memory for application to allow excel use cases
* Update llm-service/app/ai/indexing/readers/base_reader.py
  Co-authored-by: mliu-cloudera <[email protected]>
* Update llm-service/app/config.py
  Co-authored-by: mliu-cloudera <[email protected]>
* Update llm-service/app/ai/vector_stores/qdrant.py
  Co-authored-by: mliu-cloudera <[email protected]>
* Update llm-service/app/ai/indexing/embedding_indexer.py
  Co-authored-by: mliu-cloudera <[email protected]>
* fix: minor fixes and adjustments for consistency
* refactor: simplify batch size logic in embedding and summary indexers
* Update llm-service/app/ai/indexing/readers/base_reader.py
  Co-authored-by: mliu-cloudera <[email protected]>
* Update llm-service/app/ai/indexing/readers/base_reader.py
  Co-authored-by: mliu-cloudera <[email protected]>
* Update llm-service/app/ai/indexing/summary_indexer.py
  Co-authored-by: mliu-cloudera <[email protected]>
* refactor: reverting variable name batch_size to max_samples
* Update .DS_Store file in llm-service directory

---------

Co-authored-by: Michael Liu <[email protected]>
Co-authored-by: actions-user <[email protected]>
1 parent 64e1a14 commit 6e4703f
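
Several bullets in the commit message describe how the new ExcelReader behaves: it reads every sheet of a workbook, tolerates empty or null dataframes, keeps the extracted values JSON-serializable, and uses the non-deprecated `DataFrame.map` instead of `applymap`. The snippet below is only a minimal sketch of that approach with pandas; the function name `read_sheets` and the exact serialization rules are illustrative assumptions, not the repository's actual ExcelReader.

```python
# Minimal sketch (assumed, not the repository's ExcelReader): read every sheet
# of a workbook with pandas and make the cell values JSON-serializable.
from typing import Any, Dict, List

import pandas as pd


def read_sheets(path: str) -> Dict[str, List[Dict[str, Any]]]:
    # sheet_name=None returns a {sheet_name: DataFrame} dict for all sheets.
    sheets = pd.read_excel(path, sheet_name=None)
    result: Dict[str, List[Dict[str, Any]]] = {}
    for name, df in sheets.items():
        if df is None or df.empty:
            # Empty workbooks/sheets still produce a valid (empty) entry.
            result[name] = []
            continue
        # DataFrame.map replaces the deprecated applymap; NaN/NaT become None
        # so the rows can be serialized to JSON.
        cleaned = df.map(lambda v: None if pd.isna(v) else v)
        result[name] = cleaned.to_dict(orient="records")
    return result
```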

File tree

21 files changed: +2356 −1822 lines

.project-metadata.yaml

Lines changed: 5 additions & 5 deletions
@@ -1,16 +1,16 @@
 name: RAG Studio
-description: |
+description: |
   "Build a RAG application to ask questions about your documents. Configuration for access to models will be available inside the application itself once it has been deployed."
 author: "Cloudera"
 date: "2024-09-10"
 specification_version: 1.0
 prototype_version: 1.0

 environment_variables:
-  UV_HTTP_TIMEOUT:
-    description: "Timeout for UV processing in seconds."
-    default: "60000"
-    required: false
+  UV_HTTP_TIMEOUT:
+    description: "Timeout for UV processing in seconds."
+    default: "60000"
+    required: false

 runtimes:
   - editor: JupyterLab

(The removed and added lines differ only in whitespace.)

backend/src/main/resources/application.properties

Lines changed: 8 additions & 0 deletions
@@ -55,3 +55,11 @@ otel.traces.exporter=none

 server.address=${API_HOST:127.0.0.1}
 server.port=${METADATA_APP_PORT:8080}
+
+# HikariCP Database Connection Pool Configuration
+spring.datasource.hikari.maximum-pool-size=10
+spring.datasource.hikari.minimum-idle=5
+spring.datasource.hikari.connection-timeout=30000
+spring.datasource.hikari.idle-timeout=300000
+spring.datasource.hikari.max-lifetime=1800000
+spring.datasource.hikari.leak-detection-threshold=60000

docker-compose.yaml

Lines changed: 2 additions & 0 deletions
@@ -52,6 +52,7 @@ services:
       - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
       - S3_RAG_DOCUMENT_BUCKET=cloudera-ai-rag-dev-us-west-2
       - QDRANT_HOST=qdrant
+      - QDRANT_GRPC_PORT=6334
       - API_URL=http://api:8080
       - MLFLOW_RECONCILER_DATA_PATH=/tmp
     depends_on:
@@ -66,5 +67,6 @@ services:
     image: qdrant/qdrant
     ports:
       - "6333:6333"
+      - "6334:6334" # gRPC port for better performance
     environment:
       - RUST_LOG=info
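
The QDRANT_GRPC_PORT variable and the 6334 port mapping expose Qdrant's gRPC endpoint alongside the default 6333 HTTP port. As a hedged sketch (using the public qdrant-client package, not the project's own QdrantVectorStore wrapper), a client could pick these values up from the environment like this:

```python
# Sketch: connect to Qdrant over gRPC using the docker-compose environment
# variables above. This is illustrative wiring, not the repository's code.
import os

from qdrant_client import QdrantClient

client = QdrantClient(
    host=os.environ.get("QDRANT_HOST", "localhost"),
    grpc_port=int(os.environ.get("QDRANT_GRPC_PORT", "6334")),
    prefer_grpc=True,  # use the faster gRPC transport for uploads and searches
)

# Simple smoke test: list existing collections over the gRPC channel.
print(client.get_collections())
```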

llm-service/.DS_Store

0 Bytes
Binary file not shown.

llm-service/app/ai/indexing/base.py

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,7 @@
 from .readers.pdf import PDFReader
 from .readers.pptx import PptxReader
 from .readers.simple_file import SimpleFileReader
+from .readers.excel import ExcelReader
 from ...config import settings

 logger = logging.getLogger(__name__)
@@ -30,6 +31,11 @@
     ".pptx": PptxReader,
     ".pptm": PptxReader,
     ".csv": CSVReader,
+    ".xlsx": ExcelReader,
+    ".xlsb": ExcelReader,
+    ".xlsm": ExcelReader,
+    ".xls": ExcelReader,
+    ".ods": ExcelReader,
     ".json": JSONReader,
     ".jpg": ImagesReader,
     ".jpeg": ImagesReader,

llm-service/app/ai/indexing/embedding_indexer.py

Lines changed: 24 additions & 5 deletions
@@ -49,6 +49,9 @@

 from .base import BaseTextIndexer
 from .readers.base_reader import ReaderConfig, ChunksResult
+from .readers.excel import ExcelReader
+from .readers.csv import CSVReader
+from ...ai.vector_stores.qdrant import QdrantVectorStore
 from ...ai.vector_stores.vector_store import VectorStore
 from ...services.utils import batch_sequence, flatten_sequence

@@ -78,6 +81,8 @@ def index_file(self, file_path: Path, document_id: str) -> None:

         reader_cls = self._get_reader_class(file_path)

+        is_tabular_document = reader_cls in (ExcelReader, CSVReader)
+
         reader = reader_cls(
             splitter=self.splitter,
             document_id=document_id,
@@ -99,7 +104,14 @@ def index_file(self, file_path: Path, document_id: str) -> None:
         chunks_with_embeddings = flatten_sequence(self._compute_embeddings(nodes))

         acc = 0
-        for chunk_batch in batch_sequence(chunks_with_embeddings, 1000):
+        use_qdrant_safe_batches = isinstance(
+            self.chunks_vector_store, QdrantVectorStore
+        )
+        if use_qdrant_safe_batches and is_tabular_document:
+            batch_size = 256
+        else:
+            batch_size = 1000
+        for chunk_batch in batch_sequence(chunks_with_embeddings, batch_size):
             acc += len(chunk_batch)
             logger.debug(f"Adding {acc}/{len(nodes)} chunks to vector store")

@@ -125,13 +137,20 @@ def _compute_embeddings(
         batched_chunks = list(batch_sequence(chunks, 100))
         batched_texts = [[chunk.text for chunk in batch] for batch in batched_chunks]

-        with ThreadPoolExecutor(max_workers=20) as executor:
+        max_workers = 15
+        logger.debug("Using %s workers for embedding generation", max_workers)
+
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(
-                    lambda b: (i, self.embedding_model.get_text_embedding_batch(b)),
-                    batch,
+                    lambda batch_text, batch_index: (
+                        batch_index,
+                        self.embedding_model.get_text_embedding_batch(batch_text),
+                    ),
+                    b,
+                    i,
                )
-                for i, batch in enumerate(batched_texts)
+                for i, b in enumerate(batched_texts)
            ]
            logger.debug(f"Waiting for {len(futures)} futures")
            for future in as_completed(futures):
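
The rewritten executor.submit() call above passes the batch and its index as arguments instead of closing over the loop variables, because a lambda defined in a loop reads those variables when it runs, not when it is created. A minimal, standalone illustration of the pitfall and the fix (illustrative code, not taken from the repository):

```python
# Demonstrates late binding of loop variables in lambdas vs. submit() arguments.
from concurrent.futures import ThreadPoolExecutor

items = ["a", "b", "c"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Buggy pattern: each lambda looks up `i` and `item` when it executes,
    # so tasks that run after the loop advances can all see the last values.
    buggy = [executor.submit(lambda: (i, item)) for i, item in enumerate(items)]

    # Fixed pattern: pass the current values to submit(), which binds them to
    # the lambda's parameters at submission time.
    fixed = [
        executor.submit(lambda idx, it: (idx, it), i, item)
        for i, item in enumerate(items)
    ]

print([f.result() for f in buggy])          # non-deterministic; may repeat (2, 'c')
print(sorted(f.result() for f in fixed))    # always [(0, 'a'), (1, 'b'), (2, 'c')]
```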

llm-service/app/ai/indexing/readers/base_reader.py

Lines changed: 30 additions & 10 deletions
@@ -1,3 +1,4 @@
+import functools
 import os
 import tempfile
 from abc import ABC, abstractmethod
@@ -13,6 +14,25 @@
 from presidio_anonymizer import AnonymizerEngine


+@functools.cache
+def _get_analyzer() -> AnalyzerEngine:
+    """Cached analyzer engine to reuse compiled regex patterns."""
+    return AnalyzerEngine()
+
+
+@functools.cache
+def _get_anonymizer() -> AnonymizerEngine:
+    """Cached anonymizer engine to reuse compiled patterns."""
+    return AnonymizerEngine()  # type: ignore[no-untyped-call]
+
+
+@functools.cache
+def _get_secret_collection() -> SecretsCollection:
+    """Cached secrets collection to reuse compiled regex patterns."""
+    with default_settings():
+        return SecretsCollection()
+
+
 @dataclass
 class ReaderConfig:
     block_secrets: bool = False
@@ -70,19 +90,19 @@ def _block_secrets(self, chunks: List[str]) -> Optional[Set[str]]:
         if not self.config.block_secrets:
             return None

+        # Create a fresh collection each time since clear() doesn't exist
+        # but still benefit from cached settings/plugins via default_settings()
         with tempfile.TemporaryDirectory() as tmpdir:
+            paths = []
             for i, chunk in enumerate(chunks):
-                with open(os.path.join(tmpdir, f"chunk_{i}.txt"), "w") as f:
+                path = os.path.join(tmpdir, f"chunk_{i}.txt")
+                with open(path, "w") as f:
                     f.write(chunk)
+                paths.append(path)

-            secrets_collection = SecretsCollection()
+            secrets_collection = _get_secret_collection()
             with default_settings():
-                secrets_collection.scan_files(
-                    *[
-                        os.path.join(tmpdir, f"chunk_{i}.txt")
-                        for i in range(len(chunks))
-                    ]
-                )
+                secrets_collection.scan_files(*paths)

             secrets_json = secrets_collection.json()

@@ -97,12 +117,12 @@ def _anonymize_pii(self, text: str) -> Optional[str]:
         if not self.config.anonymize_pii:
             return None

-        analyzer = AnalyzerEngine()
+        analyzer = _get_analyzer()

         # TODO: support other languages
         results = analyzer.analyze(text=text, entities=None, language="en")

-        anonymizer = AnonymizerEngine()  # type: ignore[no-untyped-call]
+        anonymizer = _get_anonymizer()

         anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)  # type: ignore[arg-type]
         if anonymized_text.text == text:
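
The helpers above wrap the Presidio AnalyzerEngine and AnonymizerEngine and the detect-secrets SecretsCollection in functools.cache, so their relatively expensive setup (loading models, compiling regexes) happens once per process instead of once per chunk. A tiny standalone sketch of the pattern, with a stand-in class rather than Presidio itself:

```python
# Sketch: functools.cache memoizes a zero-argument factory, so the expensive
# object is built on the first call and the same instance is reused afterwards.
import functools
import time


class ExpensiveEngine:
    def __init__(self) -> None:
        time.sleep(0.5)  # stands in for model loading / regex compilation


@functools.cache
def get_engine() -> ExpensiveEngine:
    return ExpensiveEngine()


start = time.perf_counter()
first = get_engine()   # pays the construction cost
second = get_engine()  # returns the cached instance immediately
assert first is second
print(f"two lookups took {time.perf_counter() - start:.2f}s")  # roughly 0.5s
```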

llm-service/app/ai/indexing/readers/csv.py

Lines changed: 2 additions & 0 deletions
@@ -53,6 +53,8 @@

 class _CsvSplitter(MetadataAwareTextSplitter):
     def split_text_metadata_aware(self, text: str, metadata_str: str) -> List[str]:
+        # metadata_str is kept as an argument to satisfy the interface, but it is not used
+        # because metadata is added to the chunks later.
         return self.split_text(text)

     def split_text(self, text: str) -> List[str]:
def split_text(self, text: str) -> List[str]:
