D-Star-AI · NickGator1 · Nov 10, 2025 · Nov 5, 2025
diff --git a/README.md b/README.md
@@ -235,6 +235,67 @@ For the `S3FileSystem`, the following parameters are needed:
 
 The `base_path` is used when downloading files from S3. The files have to be stored locally in order to be used in the retrieval system. 
 
+#### VLM Clients (new)
+Vision Language Models (VLMs) now follow the same class-based abstraction pattern (ABC) as the LLM, Embedding, and Reranker components. You can pass a VLM client instance to the `KnowledgeBase` as the default, or override it per document by passing a serialized client in `file_parsing_config['vlm']`. Backward compatibility is maintained: legacy provider/model dictionaries in `file_parsing_config['vlm_config']` still work.
+
+- Class-based usage (KB default):
+```python
+from dsrag.knowledge_base import KnowledgeBase
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+
+kb = KnowledgeBase(
+    kb_id="my_kb",
+    vlm_client=GeminiVLM(model="gemini-2.0-flash"),  # default VLM used for VLM parsing
+)
+```
+
+- Per-document override (serialized client):
+```python
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+
+kb.add_document(
+    doc_id="doc1",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),  # per-document override
+        "vlm_config": {"max_pages": 10},
+    },
+    auto_context_config={
+        "use_generated_title": False,
+        "get_document_summary": False,
+        "get_section_summaries": False,
+    },
+)
+```
+
+- Fallback configuration (preferred, class-based):
+```python
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+
+primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
+fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()
+
+kb.add_document(
+    doc_id="doc2",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": primary,
+        "vlm_fallback": fallback,
+        "vlm_config": {"max_pages": 10},
+    },
+)
+```
+The legacy dict-based fallback remains supported via `vlm_config["fallback_provider"]` and `vlm_config["fallback_model"]`.
+
+- Backward compatibility and precedence
+  - The legacy dict path (e.g., `provider`, `model`, `images_already_exist`, etc.) continues to work.
+  - When both a serialized client (`vlm`) and a legacy `provider`/`model` are supplied, the system prefers `vlm`/`vlm_fallback`.
+
+- Environment variables
+  - `GEMINI_API_KEY` is required for `GeminiVLM`; a clear error is raised if it is missing.
+
 ## Config dictionaries
 Since there are a lot of configuration parameters available, they're organized into a few config dictionaries. There are four config dictionaries that can be passed in to `add_document` (`auto_context_config`, `file_parsing_config`, `semantic_sectioning_config`, and `chunking_config`) and one that can be passed in to `query` (`rse_params`).
 
@@ -258,6 +319,57 @@ file_parsing_config
     - save_path: the path to save intermediate files created during VLM processing
     - exclude_elements: a list of element types to exclude from the parsed text. Default is ["Header", "Footer"].
 
+VLM class-based clients and fallback
+- You can pass a first-class VLM client instance to KnowledgeBase for default usage:
+
+```python
+from dsrag.knowledge_base import KnowledgeBase
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+
+kb = KnowledgeBase(
+    kb_id="my_kb",
+    vlm_client=GeminiVLM(model="gemini-2.0-flash"),  # used by default for VLM parsing
+)
+```
+
+- You can also override the VLM on a per-document basis by passing a serialized client via `file_parsing_config["vlm"]`. This is useful when different documents need different models:
+
+```python
+vlm_override = GeminiVLM(model="gemini-2.0-flash").to_dict()
+kb.add_document(
+    doc_id="doc1",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": vlm_override,  # per-document VLM client
+        "vlm_config": {"images_already_exist": False},
+    },
+    auto_context_config={
+        "use_generated_title": False,
+        "get_document_summary": False,
+        "get_section_summaries": False,
+    },
+)
+```
+
+- Fallback configuration: you can provide a serialized fallback client via `file_parsing_config["vlm_fallback"]`. When calls keep failing, the system alternates between the primary and the fallback after the first few retries. Legacy fallback using `vlm_config["fallback_provider"]` and `vlm_config["fallback_model"]` is also supported.
+
+```python
+primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
+fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()
+
+kb.add_document(
+    doc_id="doc2",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": primary,
+        "vlm_fallback": fallback,
+        "vlm_config": {"max_pages": 10},
+    },
+)
+```
+
 semantic_sectioning_config
 - llm_provider: the LLM provider to use for semantic sectioning - "openai", "anthropic", and "gemini" are supported
 - model: the LLM model to use for semantic sectioning (e.g., "gpt-4.1-mini", "claude-3-5-haiku-latest", "gemini-2.0-flash")
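
The precedence rule documented in the README hunks above (a serialized `vlm` wins over a legacy `provider`/`model` pair) can be sketched as follows. This is a minimal illustration, not the dsrag implementation: `resolve_vlm_source` is a hypothetical helper, the `vlm` dict is a placeholder rather than real `to_dict()` output, and falling back to the KB-level client when neither is given is an assumption.

```python
from typing import Any


def resolve_vlm_source(file_parsing_config: dict[str, Any]) -> str:
    """Hypothetical helper: report which VLM specification would be used."""
    vlm_config = file_parsing_config.get("vlm_config", {})
    # A serialized client under "vlm" takes precedence (documented in the README).
    if file_parsing_config.get("vlm"):
        return "serialized_client"
    # Otherwise the legacy provider/model keys are honored (documented).
    if vlm_config.get("provider") or vlm_config.get("model"):
        return "legacy_dict"
    # Assumption: with neither present, the KB-level vlm_client is used.
    return "kb_default"


legacy = {"provider": "gemini", "model": "gemini-2.0-flash"}
both = {
    "use_vlm": True,
    "vlm": {"model": "gemini-2.0-flash"},  # placeholder, not real to_dict() output
    "vlm_config": dict(legacy),
}
legacy_only = {"use_vlm": True, "vlm_config": dict(legacy)}
neither = {"use_vlm": True, "vlm_config": {"max_pages": 10}}

for name, cfg in [("both", both), ("legacy_only", legacy_only), ("neither", neither)]:
    print(name, "->", resolve_vlm_source(cfg))
```

Writing the rule as a single function makes it easy to see that supplying both forms is harmless: the legacy keys are simply ignored for client selection.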

diff --git a/dsrag/dsparse/README.md b/dsrag/dsparse/README.md
@@ -34,6 +34,91 @@ kb.add_document(
 )
 ```
 
+## VLM clients
+VLMs now support a class-based client abstraction (similar to LLM/Embedding/Reranker) that you can pass either at the KB level or per document. Legacy dict-based `vlm_config` remains fully supported.
+
+- Quickstart with class-based client (serialized) and LocalFileSystem
+```python
+from dsrag.dsparse.main import parse_and_chunk
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+from dsrag.dsparse.file_parsing.file_system import LocalFileSystem
+
+sections, chunks = parse_and_chunk(
+    kb_id="sample_kb",
+    doc_id="sample_doc",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),
+        "vlm_config": {"max_pages": 5, "vlm_max_concurrent_requests": 2},
+    },
+    file_system=LocalFileSystem(base_path="~/dsParse"),
+)
+```
+
+- Fallback (preferred, class-based):
+```python
+from dsrag.dsparse.main import parse_and_chunk
+from dsrag.dsparse.file_parsing.vlm_clients import GeminiVLM
+
+primary = GeminiVLM(model="gemini-2.0-flash").to_dict()
+fallback = GeminiVLM(model="gemini-2.5-flash").to_dict()
+
+sections, chunks = parse_and_chunk(
+    kb_id="kb",
+    doc_id="doc",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": primary,
+        "vlm_fallback": fallback,
+        "vlm_config": {"max_pages": 5},
+    },
+)
+```
+
+- Legacy path (still valid):
+```python
+from dsrag.dsparse.main import parse_and_chunk
+
+sections, chunks = parse_and_chunk(
+    kb_id="kb",
+    doc_id="doc",
+    file_path="/path/to/file.pdf",
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm_config": {
+            "provider": "gemini",
+            "model": "gemini-2.0-flash",
+            "max_pages": 5,
+            # Optional legacy fallback
+            "fallback_provider": "gemini",
+            "fallback_model": "gemini-2.5-flash",
+        },
+    },
+)
+```
+
+- Images already exist
+If you’ve pre-extracted page images into the configured FileSystem directory structure, you can reuse them:
+```python
+sections, chunks = parse_and_chunk(
+    kb_id="kb",
+    doc_id="doc",
+    file_path="/path/to/file.pdf",  # path still required for metadata, but images won’t be regenerated
+    file_parsing_config={
+        "use_vlm": True,
+        "vlm": GeminiVLM(model="gemini-2.0-flash").to_dict(),
+        "vlm_config": {"images_already_exist": True},
+    },
+)
+```
+
+- Notes
+  - Parallelism controls and DPI are in `vlm_config` (e.g., `vlm_max_concurrent_requests`, `dpi`).
+  - Page images and `elements.json` are saved via the configured `FileSystem`.
+  - The environment variable `GEMINI_API_KEY` is required for `GeminiVLM`; a clear error is raised if it is missing.
+
 ## Installation
 If you want to use dsParse on its own, without installing the full `dsrag` package, there is a standalone Python package available for dsParse, which can be installed with `pip install dsparse`. If you already have `dsrag` installed, you DO NOT need to separately install `dsparse`.
 
@@ -81,13 +166,14 @@ The default model for semantic sectioning is `gpt-4o-mini`, but similar or stron
 An obvious concern with using a VLM to parse documents is the cost. Let's run the numbers:
 
 VLM file parsing cost calculation (`gemini-2.0-flash`)
-- Text input (prompt) + image input: 400 (text) + 258 (image) tokens x $0.10/10^6 per token = $0.000066
-- Text output: 600 tokens x $0.40/10^6 per token = $0.000240
-- Total: $0.000306/page or **$0.31 per 1000 pages**
+- Input tokens for images are calculated based on the number of 768x768 tiles needed. At the standard DPI of 100 (or even up to around 150), this usually means 4 tiles. Each tile is counted as 258 tokens.
+- Text input (prompt) + image input: 500 (text) + 4 x 258 (image) = 1532 tokens x $0.10/10^6 per token = $0.0001532
+- Text output: 700 tokens x $0.40/10^6 per token = $0.0002800
+- Total: $0.0004332/page or **$0.43 per 1000 pages**
 
-This is substantially cheaper than commercially available OCR/PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages. 
+This is substantially cheaper than commercially available OCR/PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages. Reducto is generally $10-20 per 1000 pages.
 
-What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor then is the rate limits imposed by the VLM provider. The current rate limit for `gemini-2.0-flash` is 2000 requests per minute. Since dsParse uses one request per page, that means the limit is 2000 pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
+What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor is then the rate limit imposed by the VLM provider. The current rate limit for `gemini-2.0-flash` on the highest tier is 30k requests per minute. Since dsParse uses one request per page, that means the limit is 30k pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
 
 ### Semantic sectioning
 Semantic sectioning produces far fewer output tokens, so it ends up being a bit cheaper than the file parsing step.
@@ -97,4 +183,4 @@ Semantic sectioning cost calculation (`gpt-4o-mini`)
 - Output: 50 tokens x $0.60/10^6 per token = $0.00003
 - Total: $0.00015/page or **$0.15 per 1000 pages**
 
-Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. But these mega-chunks have to be processed sequentially for each document. Processing each mega-chunk only takes a couple seconds, though, so even a large document of a few hundred pages will only take 20-60 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
+Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. These mega-chunks are processed in parallel for each document. Processing each mega-chunk only takes a few seconds, so even a large document of a few hundred pages should only take 5-10 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
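
The updated VLM cost figures in the hunk above can be checked with a few lines of arithmetic. This is only a sketch that reuses the numbers quoted in the README (500 prompt tokens, 4 tiles of 258 tokens, 700 output tokens, $0.10 and $0.40 per million tokens); `cost_per_page` is a hypothetical helper, and the prices are whatever the README states, not values fetched from the provider.

```python
TOKENS_PER_TILE = 258  # per 768x768 tile, as stated in the README


def cost_per_page(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Hypothetical helper: dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# 500 prompt tokens plus 4 image tiles on the input side, 700 tokens of output.
input_tokens = 500 + 4 * TOKENS_PER_TILE
vlm_cost = cost_per_page(input_tokens, 700, 0.10, 0.40)

print(f"input tokens: {input_tokens}")
print(f"per page: ${vlm_cost:.7f}")
print(f"per 1000 pages: ${vlm_cost * 1000:.2f}")
```

With these inputs the result agrees with the README's $0.0004332 per page, or about $0.43 per 1000 pages. Changing the tile count shows how sensitive the total is to rendering DPI.
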
diff --git a/dsrag/dsparse/file_parsing/vlm.py b/dsrag/dsparse/file_parsing/vlm.py
@@ -1,95 +1,36 @@
 import PIL.Image
-import os
 import io
-from ..utils.imports import vertexai, genai_new
+from .vlm_clients import GeminiVLM, VertexAIVLM
 
 def make_llm_call_gemini(image_path: str, system_message: str, model: str = "gemini-2.0-flash", response_schema: dict = None, max_tokens: int = 4000, temperature: float = 0.5) -> str:
-    # With the newer Google GenAI SDK, we need to create a client
-    client = genai_new.Client(api_key=os.environ["GEMINI_API_KEY"])
+    """
+    Backward-compatible free function that delegates to GeminiVLM.
 
-    # Create generation config with the correct GenerateContentConfig type
-    config = genai_new.types.GenerateContentConfig(
+    Signature and behavior are preserved for compatibility.
+    """
+    client = GeminiVLM(model=model)
+    return client.make_llm_call(
+        image_path=image_path,
+        system_message=system_message,
+        response_schema=response_schema,
+        max_tokens=max_tokens,
         temperature=temperature,
-        max_output_tokens=max_tokens,
-        response_mime_type="application/json"
     )
 
-    # Add response schema if provided
-    if response_schema is not None:
-        config.response_schema = response_schema
-
-    try:
-        # Open and compress the image
-        image = PIL.Image.open(image_path)
-        compressed_image_bytes, _ = compress_image(image) # Quality is returned but not used here
-
-        # Close the original image object now that compression is done
-        if image:
-            image.close()
-            # The 'image' variable still exists and will be handled by the finally block,
-            # PIL's close() is typically safe to call multiple times.
-
-        # Create content parts using bytes
-        image_part = genai_new.types.Part.from_bytes(data=compressed_image_bytes, mime_type='image/jpeg')
-        content_parts = [image_part, system_message]
-
-        # For Gemini 2.5 models, disable thinking
-        if model.startswith("gemini-2.5"):
-            # Create a new config with thinking disabled by setting thinking_config
-            gemini25_config = genai_new.types.GenerateContentConfig(
-                temperature=temperature,
-                max_output_tokens=max_tokens,
-                response_mime_type="application/json",
-                thinking_config=genai_new.types.ThinkingConfig(thinking_budget=0)
-            )
-
-            # Add response schema if provided
-            if response_schema is not None:
-                gemini25_config.response_schema = response_schema
-
-            # Generate content with thinking disabled
-            response = client.models.generate_content(
-                model=model,
-                contents=content_parts,
-                config=gemini25_config
-            )
-        else:
-            # Standard call for other Gemini models
-            response = client.models.generate_content(
-                model=model,
-                contents=content_parts,
-                config=config
-            )
-
-        return response.text
-    finally:
-        # Ensure image is closed even if an error occurs
-        if 'image' in locals() and image: # Check if image was defined and not None
-            try:
-                image.close() # Attempt to close; safe if already closed
-            except Exception:
-                pass # Ignore errors if it fails (e.g., trying to close a None object or already closed and problematic)
-
 def make_llm_call_vertex(image_path: str, system_message: str, model: str, project_id: str, location: str, response_schema: dict = None, max_tokens: int = 4000, temperature: float = 0.5) -> str:
     """
-    This function calls the Vertex AI Gemini API (not to be confused with the Gemini API) with an image and a system message and returns the response text.
+    Backward-compatible free function that delegates to VertexAIVLM.
+
+    Signature and behavior are preserved for compatibility.
     """
-    vertexai.init(project=project_id, location=location)
-    model = vertexai.generative_models.GenerativeModel(model)
-
-    if response_schema is not None:
-        generation_config = vertexai.generative_models.GenerationConfig(temperature=temperature, max_output_tokens=max_tokens, response_mime_type="application/json", response_schema=response_schema)
-    else:
-        generation_config = vertexai.generative_models.GenerationConfig(temperature=temperature, max_output_tokens=max_tokens)
-
-    response = model.generate_content(
-        [
-            vertexai.generative_models.Part.from_image(vertexai.generative_models.Image.load_from_file(image_path)),
-            system_message,
-        ],
-        generation_config=generation_config,
+    client = VertexAIVLM(model=model, project_id=project_id, location=location)
+    return client.make_llm_call(
+        image_path=image_path,
+        system_message=system_message,
+        response_schema=response_schema,
+        max_tokens=max_tokens,
+        temperature=temperature,
     )
-    return response.text
 
 def compress_image(image: PIL.Image.Image, max_size_bytes: int = 1097152, quality: int = 95) -> tuple[bytes, int]:
     """