112 changes: 112 additions & 0 deletions README.md
@@ -235,6 +235,67 @@ For the `S3FileSystem`, the following parameters are needed:

The `base_path` is used when downloading files from S3. The files have to be stored locally in order to be used in the retrieval system.

#### VLM Clients (new)
Visual Language Models (VLMs) now follow the same class-based abstraction pattern (an abstract base class, or ABC) as the LLM, Embedding, and Reranker components. You can supply a first-class VLM client instance to the `KnowledgeBase`, or override it per document by passing a serialized client in `file_parsing_config["vlm"]`. Backward compatibility is maintained: legacy provider/model dictionaries in `file_parsing_config['vlm_config']` still work.
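
As a rough illustration of the class-based pattern described above, the sketch below shows how an abstract base class can round-trip a client through a plain dictionary. The names `VLMSketch`, `EchoVLM`, `from_dict`, and the `subclass_name` key are hypothetical stand-ins chosen for this example, not dsRAG's actual classes or serialization format; only `to_dict()` and `make_llm_call` are named in this PR.

```python
from abc import ABC, abstractmethod


class VLMSketch(ABC):
    """Hypothetical stand-in for a VLM base class (not dsRAG's real ABC)."""

    # Registry of subclasses keyed by class name, used for deserialization.
    _registry: dict = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        VLMSketch._registry[cls.__name__] = cls

    @abstractmethod
    def make_llm_call(self, image_path: str, system_message: str) -> str:
        ...

    @abstractmethod
    def to_dict(self) -> dict:
        ...

    @classmethod
    def from_dict(cls, config: dict) -> "VLMSketch":
        # Copy so the caller's dictionary is not mutated.
        params = dict(config)
        subclass = cls._registry[params.pop("subclass_name")]
        return subclass(**params)


class EchoVLM(VLMSketch):
    """Toy client that returns a string instead of calling a model."""

    def __init__(self, model: str):
        self.model = model

    def make_llm_call(self, image_path: str, system_message: str) -> str:
        return f"{self.model}:{image_path}"

    def to_dict(self) -> dict:
        return {"subclass_name": "EchoVLM", "model": self.model}


# A serialized client is a plain dict, so it can travel inside a config dictionary.
serialized = EchoVLM(model="toy-model").to_dict()
restored = VLMSketch.from_dict(serialized)
```

Passing a dictionary rather than a live object keeps the per-document config JSON-friendly, which is presumably why the per-document override takes the output of `to_dict()`.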

- Class-based usage (KB default):
```python
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

kb = KnowledgeBase(
    kb_id="my_kb",
    vlm_client=GeminiVLM(model="gemini-2.0-flash"),  # default VLM used for VLM parsing
)
```

- Per-document override (serialized client):
```python
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

kb.add_document(
    doc_id="doc1",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),  # per-document override
        "vlm_config": {"max_pages": 10},
    },
    auto_context_config={
        "use_generated_title": False,
        "get_document_summary": False,
        "get_section_summaries": False,
    },
)
```

- Fallback configuration (preferred, class-based):
```python
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()

kb.add_document(
    doc_id="doc2",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": primary,
        "vlm_fallback": fallback,
        "vlm_config": {"max_pages": 10},
    },
)
```
Legacy dict-based fallback remains supported via `vlm_config["fallback_provider"]` and `vlm_config["fallback_model"]`.

- Backward compatibility and precedence
  - The legacy dict path (e.g., `provider`, `model`, `images_already_exist`) continues to work.
  - When both a serialized client (`vlm`) and a legacy `provider`/`model` are supplied, the system prefers `vlm`/`vlm_fallback`.

- Environment variables
  - `GEMINI_API_KEY` is required for `GeminiVLM`; a clear error is raised if it is missing.
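
The precedence rule above can be sketched as a small resolver. This is only an illustration of the documented ordering: `resolve_vlm_source` is a hypothetical helper, not a dsRAG function, and the final `"kb_default"` branch assumes that the KB-level `vlm_client` is used when the document config names no VLM at all.

```python
def resolve_vlm_source(file_parsing_config: dict) -> str:
    """Return which configuration path wins: 'vlm', 'legacy', or 'kb_default'.

    Hypothetical helper that mirrors the documented precedence only.
    """
    # A serialized client takes priority over everything else.
    if file_parsing_config.get("vlm") is not None:
        return "vlm"

    # Otherwise fall back to the legacy provider/model keys, if present.
    vlm_config = file_parsing_config.get("vlm_config") or {}
    if "provider" in vlm_config or "model" in vlm_config:
        return "legacy"

    # Assumption: with neither supplied, the KB-level default client applies.
    return "kb_default"


both = {
    "vlm": {"model": "gemini-2.0-flash"},
    "vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"},
}
legacy_only = {"vlm_config": {"provider": "gemini", "model": "gemini-2.0-flash"}}

print(resolve_vlm_source(both))         # vlm
print(resolve_vlm_source(legacy_only))  # legacy
```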

## Config dictionaries
Since there are a lot of configuration parameters available, they're organized into a few config dictionaries. There are four config dictionaries that can be passed in to `add_document` (`auto_context_config`, `file_parsing_config`, `semantic_sectioning_config`, and `chunking_config`) and one that can be passed in to `query` (`rse_params`).

@@ -258,6 +319,57 @@ file_parsing_config
- save_path: the path to save intermediate files created during VLM processing
- exclude_elements: a list of element types to exclude from the parsed text. Default is ["Header", "Footer"].

VLM class-based clients and fallback
- You can pass a first-class VLM client instance to KnowledgeBase for default usage:

```python
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

kb = KnowledgeBase(
    kb_id="my_kb",
    vlm_client=GeminiVLM(model="gemini-2.0-flash"),  # used by default for VLM parsing
)
```

- You can also override the VLM on a per-document basis by passing a serialized client via `file_parsing_config["vlm"]`. This is useful when you want different models per document:

```python
vlm_override = GeminiVLM(model="gemini-2.0-flash").to_dict()
kb.add_document(
    doc_id="doc1",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": vlm_override,  # per-document VLM client
        "vlm_config": {"images_already_exist": False},
    },
    auto_context_config={
        "use_generated_title": False,
        "get_document_summary": False,
        "get_section_summaries": False,
    },
)
```

- Fallback configuration: you can provide a serialized fallback client via `file_parsing_config["vlm_fallback"]`. When requests keep failing, the system alternates between the primary and fallback clients after the first few retries. Legacy fallback using `vlm_config["fallback_provider"]` and `vlm_config["fallback_model"]` is also supported.

```python
primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()

kb.add_document(
    doc_id="doc2",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": primary,
        "vlm_fallback": fallback,
        "vlm_config": {"max_pages": 10},
    },
)
```
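
The README does not spell out how the primary and fallback clients alternate, so the following is only one plausible shape for that retry loop. `call_with_fallback`, `max_attempts`, and `primary_only_attempts` are hypothetical names and values invented for this sketch; the real retry count, threshold, and any backoff delay may differ.

```python
from typing import Callable, Optional


def call_with_fallback(
    primary: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
    max_attempts: int = 6,
    primary_only_attempts: int = 2,  # assumed value; the real threshold is not stated
) -> str:
    """Try the primary first, then alternate fallback/primary on later attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        # After the primary-only attempts, every other attempt uses the fallback.
        use_fallback = (
            fallback is not None
            and attempt >= primary_only_attempts
            and (attempt - primary_only_attempts) % 2 == 0
        )
        call = fallback if use_fallback else primary
        try:
            return call()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

With the defaults, the attempt order is primary, primary, fallback, primary, fallback, primary. The sketch omits sleeping between attempts to stay short.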

semantic_sectioning_config
- llm_provider: the LLM provider to use for semantic sectioning - "openai", "anthropic", and "gemini" are supported
- model: the LLM model to use for semantic sectioning (e.g., "gpt-4.1-mini", "claude-3-5-haiku-latest", "gemini-2.0-flash")
98 changes: 92 additions & 6 deletions dsrag/dsparse/README.md
@@ -34,6 +34,91 @@ kb.add_document(
)
```

## VLM clients
VLMs now support a class-based client abstraction (similar to LLM/Embedding/Reranker) that you can pass either at the KB level or per document. Legacy dict-based `vlm_config` remains fully supported.

- Quickstart with class-based client (serialized) and LocalFileSystem
```python
from dsrag.dsparse.main import parse_and_chunk
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
from dsrag.dsparse.file_parsing.file_system import LocalFileSystem

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),
        "vlm_config": {"max_pages": 5, "vlm_max_concurrent_requests": 2},
    },
    file_system=LocalFileSystem(base_path="~/dsParse"),
)
```

- Fallback (preferred, class-based):
```python
from dsrag.dsparse.main import parse_and_chunk
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()

sections, chunks = parse_and_chunk(
    kb_id="kb",
    doc_id="doc",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm": primary,
        "vlm_fallback": fallback,
        "vlm_config": {"max_pages": 5},
    },
)
```

- Legacy path (still valid):
```python
from dsrag.dsparse.main import parse_and_chunk

sections, chunks = parse_and_chunk(
    kb_id="kb",
    doc_id="doc",
    file_path="/path/to/file.pdf",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "gemini",
            "model": "gemini-2.0-flash",
            "max_pages": 5,
            # Optional legacy fallback
            "fallback_provider": "gemini",
            "fallback_model": "gemini-2.5-flash",
        },
    },
)
```

- Images already exist
If you’ve pre-extracted page images into the configured FileSystem directory structure, you can reuse them:
```python
from dsrag.dsparse.main import parse_and_chunk
from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM

sections, chunks = parse_and_chunk(
    kb_id="kb",
    doc_id="doc",
    file_path="/path/to/file.pdf",  # path still required for metadata, but images won’t be regenerated
    file_parsing_config={
        "use_vlm": True,
        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),
        "vlm_config": {"images_already_exist": True},
    },
)
```

- Notes
  - Parallelism controls and DPI are in `vlm_config` (e.g., `vlm_max_concurrent_requests`, `dpi`).
  - Page images and `elements.json` are saved via the configured `FileSystem`.
  - Environment variable `GEMINI_API_KEY` is required for `GeminiVLM`. Clear errors are raised if missing.
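
If you would rather fail before any parsing starts, a pre-flight check along these lines may help. `require_env` is a hypothetical helper written for this note, not part of dsParse, and the error dsParse itself raises may be of a different type or wording.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise EnvironmentError(
            f"{name} is not set; export it before using the client that needs it."
        )
    return value


# Example usage (commented out so the snippet runs without a key configured):
# api_key = require_env("GEMINI_API_KEY")
```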

## Installation
If you want to use dsParse on its own, without installing the full `dsrag` package, there is a standalone Python package available for dsParse, which can be installed with `pip install dsparse`. If you already have `dsrag` installed, you DO NOT need to separately install `dsparse`.

@@ -81,13 +166,14 @@ The default model for semantic sectioning is `gpt-4o-mini`, but similar or stron
An obvious concern with using a VLM to parse documents is the cost. Let's run the numbers:

VLM file parsing cost calculation (`gemini-2.0-flash`)
- Text input (prompt) + image input: 400 (text) + 258 (image) tokens x $0.10/10^6 per token = $0.000066
- Text output: 600 tokens x $0.40/10^6 per token = $0.000240
- Total: $0.000306/page or **$0.31 per 1000 pages**
- Input tokens for images are calculated based on the number of 768x768 tiles needed. At the standard dpi of 100 (or even up to around 150), this usually means 4 tiles. Each tile is counted as 258 tokens.
- Text input (prompt) + image input: 500 (text) + 4x258 (image) tokens x $0.10/10^6 per token = $0.0001532
- Text output: 700 tokens x $0.40/10^6 per token = $0.0002800
- Total: $0.0004332/page or **$0.43 per 1000 pages**
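
The updated figures can be reproduced with a few lines of arithmetic. All token counts and prices below are taken from the list above; only the variable names are made up for this sketch.

```python
PROMPT_TOKENS = 500        # text input (prompt)
TILES_PER_PAGE = 4         # 768x768 tiles at roughly 100-150 DPI
TOKENS_PER_TILE = 258
OUTPUT_TOKENS = 700

INPUT_PRICE_PER_TOKEN = 0.10 / 1_000_000   # $0.10 per million input tokens
OUTPUT_PRICE_PER_TOKEN = 0.40 / 1_000_000  # $0.40 per million output tokens

input_tokens = PROMPT_TOKENS + TILES_PER_PAGE * TOKENS_PER_TILE  # 1532 tokens
input_cost = input_tokens * INPUT_PRICE_PER_TOKEN
output_cost = OUTPUT_TOKENS * OUTPUT_PRICE_PER_TOKEN
cost_per_page = input_cost + output_cost
cost_per_1000_pages = cost_per_page * 1000

print(f"input:          ${input_cost:.7f}")           # $0.0001532
print(f"output:         ${output_cost:.7f}")          # $0.0002800
print(f"per page:       ${cost_per_page:.7f}")        # $0.0004332
print(f"per 1000 pages: ${cost_per_1000_pages:.2f}")  # $0.43
```

Note that output tokens account for roughly two thirds of the per-page cost, so the output length matters more than the image resolution here.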

This is substantially cheaper than commercially available OCR/PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages.
This is substantially cheaper than commercially available OCR/PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages. Reducto is generally $10-20 per 1000 pages.

What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor then is the rate limits imposed by the VLM provider. The current rate limit for `gemini-2.0-flash` is 2000 requests per minute. Since dsParse uses one request per page, that means the limit is 2000 pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor is then the rate limit imposed by the VLM provider. The current rate limit for `gemini-2.0-flash` on the highest tier is 30k requests per minute. Since dsParse uses one request per page, that means the limit is 30k pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
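
To make the throughput reasoning concrete, the sketch below estimates how many requests would need to be in flight to reach the quoted rate limit, using Little's law (concurrency = arrival rate x time in system). The function names are hypothetical, and the numbers are the ones quoted above; real throughput also depends on `vlm_max_concurrent_requests` and on your actual tier.

```python
import math


def min_batch_minutes(pages: int, requests_per_minute: int) -> float:
    """Lower bound on wall-clock minutes imposed by the rate limit alone."""
    return pages / requests_per_minute


def concurrency_to_saturate(requests_per_minute: int, seconds_per_page: float) -> int:
    """In-flight requests needed to reach the rate limit (Little's law)."""
    return math.ceil(requests_per_minute / 60 * seconds_per_page)


RATE_LIMIT_RPM = 30_000  # highest-tier limit quoted above
SECONDS_PER_PAGE = 20    # upper end of the 15-20 second range

print(concurrency_to_saturate(RATE_LIMIT_RPM, SECONDS_PER_PAGE))  # 10000
print(f"{min_batch_minutes(1_000_000, RATE_LIMIT_RPM):.1f} minutes for 1M pages")
```

In practice the rate limit only becomes the bottleneck at very high concurrency; with a handful of concurrent requests, per-page latency dominates.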

### Semantic sectioning
Semantic sectioning produces far fewer output tokens, so it ends up being a bit cheaper than the file parsing step.
@@ -97,4 +183,4 @@ Semantic sectioning cost calculation (`gpt-4o-mini`)
- Output: 50 tokens x $0.60/10^6 per token = $0.00003
- Total: $0.00015/page or **$0.15 per 1000 pages**

Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. But these mega-chunks have to be processed sequentially for each document. Processing each mega-chunk only takes a couple seconds, though, so even a large document of a few hundred pages will only take 20-60 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. These mega-chunks are processed in parallel for each document, and each one takes only a few seconds, so even a large document of a few hundred pages should take only 5-10 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
101 changes: 21 additions & 80 deletions dsrag/dsparse/file_parsing/vlm.py
@@ -1,95 +1,36 @@
import PIL.Image
import os
import io
from ..utils.imports import vertexai, genai_new
from .vlm_clients import GeminiVLM, VertexAIVLM

def make_llm_call_gemini(image_path: str, system_message: str, model: str = "gemini-2.0-flash", response_schema: dict = None, max_tokens: int = 4000, temperature: float = 0.5) -> str:
# With the newer Google GenAI SDK, we need to create a client
client = genai_new.Client(api_key=os.environ["GEMINI_API_KEY"])
"""
Backward-compatible free function that delegates to GeminiVLM.

# Create generation config with the correct GenerateContentConfig type
config = genai_new.types.GenerateContentConfig(
Signature and behavior are preserved for compatibility.
"""
client = GeminiVLM(model=model)
return client.make_llm_call(
image_path=image_path,
system_message=system_message,
response_schema=response_schema,
max_tokens=max_tokens,
temperature=temperature,
max_output_tokens=max_tokens,
response_mime_type="application/json"
)

# Add response schema if provided
if response_schema is not None:
config.response_schema = response_schema

try:
# Open and compress the image
image = PIL.Image.open(image_path)
compressed_image_bytes, _ = compress_image(image) # Quality is returned but not used here

# Close the original image object now that compression is done
if image:
image.close()
# The 'image' variable still exists and will be handled by the finally block,
# PIL's close() is typically safe to call multiple times.

# Create content parts using bytes
image_part = genai_new.types.Part.from_bytes(data=compressed_image_bytes, mime_type='image/jpeg')
content_parts = [image_part, system_message]

# For Gemini 2.5 models, disable thinking
if model.startswith("gemini-2.5"):
# Create a new config with thinking disabled by setting thinking_config
gemini25_config = genai_new.types.GenerateContentConfig(
temperature=temperature,
max_output_tokens=max_tokens,
response_mime_type="application/json",
thinking_config=genai_new.types.ThinkingConfig(thinking_budget=0)
)

# Add response schema if provided
if response_schema is not None:
gemini25_config.response_schema = response_schema

# Generate content with thinking disabled
response = client.models.generate_content(
model=model,
contents=content_parts,
config=gemini25_config
)
else:
# Standard call for other Gemini models
response = client.models.generate_content(
model=model,
contents=content_parts,
config=config
)

return response.text
finally:
# Ensure image is closed even if an error occurs
if 'image' in locals() and image: # Check if image was defined and not None
try:
image.close() # Attempt to close; safe if already closed
except Exception:
pass # Ignore errors if it fails (e.g., trying to close a None object or already closed and problematic)

def make_llm_call_vertex(image_path: str, system_message: str, model: str, project_id: str, location: str, response_schema: dict = None, max_tokens: int = 4000, temperature: float = 0.5) -> str:
"""
This function calls the Vertex AI Gemini API (not to be confused with the Gemini API) with an image and a system message and returns the response text.
Backward-compatible free function that delegates to VertexAIVLM.

Signature and behavior are preserved for compatibility.
"""
vertexai.init(project=project_id, location=location)
model = vertexai.generative_models.GenerativeModel(model)

if response_schema is not None:
generation_config = vertexai.generative_models.GenerationConfig(temperature=temperature, max_output_tokens=max_tokens, response_mime_type="application/json", response_schema=response_schema)
else:
generation_config = vertexai.generative_models.GenerationConfig(temperature=temperature, max_output_tokens=max_tokens)

response = model.generate_content(
[
vertexai.generative_models.Part.from_image(vertexai.generative_models.Image.load_from_file(image_path)),
system_message,
],
generation_config=generation_config,
client = VertexAIVLM(model=model, project_id=project_id, location=location)
return client.make_llm_call(
image_path=image_path,
system_message=system_message,
response_schema=response_schema,
max_tokens=max_tokens,
temperature=temperature,
)
return response.text

def compress_image(image: PIL.Image.Image, max_size_bytes: int = 1097152, quality: int = 95) -> tuple[bytes, int]:
"""