PDF Q&A with RAG - Chunker Comparison

A Streamlit application that demonstrates how the choice of document chunking method affects Retrieval-Augmented Generation (RAG) systems. It lets you ask questions about a PDF document and compares the results produced with two chunking strategies:

Docling's HybridChunker: An intelligent chunking strategy that preserves semantic coherence
RecursiveCharacterTextSplitter: A standard text splitting approach based on character counts
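
To make the character-based approach concrete, here is a minimal, simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: the function name `recursive_split` and its defaults are hypothetical, and chunk overlap is omitted. It only illustrates the idea of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Simplified illustration: prefers the coarsest separator that works
    and recurses with finer separators for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Because the splitter only looks at characters, a boundary can fall in the middle of a table or between a heading and its paragraph, which is the kind of difference the side-by-side comparison in this application is meant to expose.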

Features

Interactive chat interface to ask questions about your PDF documents
Side-by-side comparison of retrieved document chunks from different chunking methods
Comparison of answers generated using both chunking methods
Debugging tools to examine the full prompt and context sent to the language model
Support for custom embedding models and language models

Prerequisites

Python 3.11 or higher
One or more PDF documents for analysis

Installation

1. Clone the repository

git clone https://github.com/williamcaban/odsc-east-2025.git
cd odsc-east-2025

2. Install dependencies

If you don't have uv installed, you can install it first:

# Install uv (on macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install uv (on Windows PowerShell)
irm https://astral.sh/uv/install.ps1 | iex

The project uses uv for fast dependency installation:

# Sync all dependencies from pyproject.toml
uv sync

3. Create an environment file

Create a .env file in the project root with the following variables. You can use env.example as a template.

# OpenAI API and compatible endpoints configuration

# OpenAI API credentials
OPENAI_API_KEY=your_key
OPENAI_MODEL_NAME=phi4:latest

# For custom OpenAI-compatible endpoints (like vLLM, local API servers, etc.)
# Comment out for official OpenAI API
# OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_BASE=http://localhost:11434/v1  # example using Ollama

# LLM parameters
LLM_TEMPERATURE=0.05
LLM_MAX_TOKENS=1024

# Embeddings configuration 
EMBED_MODEL_ID=sentence-transformers/all-MiniLM-L6-v2 # default embedding model

# PDF configuration
PDF_FILE_PATH=./pdfs/2501.17887v1.pdf # default pdf
TOP_K=3  # Optional, number of chunks to retrieve
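
As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment and converts them to typed values. The function name `load_settings` and the fallback defaults (copied from the sample values above) are assumptions, not the application's actual code. The sketch also assumes the .env file has already been loaded into the environment, for example by a dotenv loader.

```python
import os


def load_settings(env=os.environ):
    """Collect the RAG settings from environment variables.

    Hypothetical helper: defaults mirror the sample .env values above.
    """
    return {
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model_name": env.get("OPENAI_MODEL_NAME", "phi4:latest"),
        # None means "use the official OpenAI endpoint".
        "api_base": env.get("OPENAI_API_BASE") or None,
        "temperature": float(env.get("LLM_TEMPERATURE", "0.05")),
        "max_tokens": int(env.get("LLM_MAX_TOKENS", "1024")),
        "embed_model_id": env.get(
            "EMBED_MODEL_ID", "sentence-transformers/all-MiniLM-L6-v2"
        ),
        # None means "no PDF configured".
        "pdf_file_path": env.get("PDF_FILE_PATH") or None,
        "top_k": int(env.get("TOP_K", "3")),
    }
```

Calling `load_settings({"TOP_K": "5"})` overrides only the retrieval depth and leaves every other setting at its default, which is convenient when experimenting with how many chunks each method retrieves.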

Running the Application

Start the Streamlit server:

uv run streamlit run streamlit_rag.py

The application will open in your default web browser at http://localhost:8501.

If you didn't specify a PDF file in the .env file, the application will search for any PDF in the current directory and its subdirectories.
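
The fallback described above could look roughly like the following sketch. The helper name `find_pdf`, its parameters, and the choice of returning the first match in sorted order are assumptions made for illustration. The application may pick a different file when several PDFs are present.

```python
from pathlib import Path


def find_pdf(configured_path=None, search_root="."):
    """Return the configured PDF path, or the first PDF found under search_root.

    Hypothetical helper: returns None when nothing is configured or found.
    """
    if configured_path:
        return Path(configured_path)
    # rglob walks search_root and all of its subdirectories.
    candidates = sorted(Path(search_root).rglob("*.pdf"))
    return candidates[0] if candidates else None
```

Note that the `*.pdf` pattern is case-sensitive on most Linux file systems, so a file named `REPORT.PDF` would not be matched by this sketch.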

Using the Application

When the application loads, you'll see information about the loaded PDF in the sidebar
Use the chat input at the bottom to ask questions about the document
The system will display:
- Retrieved document chunks from both chunking methods
- An answer generated using the HybridChunker in the chat
- A comparison of answers from both chunking methods below
- Expandable debugging sections to examine the full context and prompt
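
The retrieved chunks shown for each method come from a top-k similarity search. The sketch below shows that idea in plain Python, assuming cosine similarity over embedding vectors. The function names are hypothetical, and the application itself presumably delegates this step to a vector store rather than computing it by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

Here `k` plays the role of the TOP_K setting: both chunking methods are queried with the same question and the same k, so any difference in the retrieved text comes from how the document was chunked.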

Troubleshooting

Common Issues

Token length errors: If you see token length warnings, some chunks are likely longer than your embedding model's maximum input length; adjust the chunking parameters for that model
MPS warnings on macOS: These are informational and don't affect functionality
PDF loading errors: Ensure your PDF is readable and not password-protected
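
When chasing token length warnings, it can help to list which chunks exceed a given budget. The following is a minimal sketch with a hypothetical helper, `oversized_chunks`. By default it counts whitespace-separated words, which is only a crude proxy that usually undercounts real tokens. Pass your embedding model's tokenizer as `count_tokens` for accurate numbers.

```python
def oversized_chunks(chunks, max_tokens, count_tokens=None):
    """Return (index, token_count) for every chunk longer than max_tokens.

    count_tokens defaults to a whitespace word count, a rough approximation.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    report = []
    for index, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            report.append((index, n))
    return report
```

An empty result under your model's limit suggests the warnings come from somewhere other than chunk size. A non-empty one points to the chunks worth inspecting first.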

Debugging

The application includes debugging tools that help diagnose RAG pipeline issues:

Click on the "🔍 Debug: HybridChunker Full Prompt" or "🔍 Debug: RecursiveCharacterTextSplitter Full Prompt" expanders
Examine the retrieved documents and the full prompt sent to the model
Compare retrieval quality between the two chunking methods
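
For orientation when reading the debug expanders, a full prompt in a RAG pipeline is typically the retrieved chunks followed by the user's question. The sketch below shows one way such a prompt could be assembled. The template wording and the function name `build_prompt` are illustrative assumptions and will differ from what the application actually sends.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into a single prompt string.

    Hypothetical template for illustration only.
    """
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Since the question and the template are the same for both methods, differences between the two prompts come down to the context section, which makes it the first place to look when the answers disagree.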

License

Apache License 2.0

Copyright 2025 ODSC East

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
