🔍 FinSight RAG

AI-powered financial document analysis on SEC 10-K filings

Benchmarks 7 strategy combinations drawn from 4 chunking methods, 5 embedding models, and 3 retrieval modes on S&P 500 filings. The best configuration scores 4.50 / 5.00 on LLM-as-judge evaluation.

📊 Results at a Glance

Strategy Benchmark

| Rank | Chunking | Embedding | Retrieval | Score | Latency (ms) |
|---|---|---|---|---|---|
| 🥇 1 | hybrid | BGE | hybrid_07 | 0.877 | 19,236 |
| 🥈 2 | hybrid | MiniLM | hybrid_07 | 0.865 | 11,175 ⚡ |
| 🥉 3 | recursive | BGE | hybrid_07 | 0.856 | 20,553 |
| 4 | semantic | BGE | hybrid_05 | 0.826 | 16,666 |

MiniLM at hybrid_07 offers the best latency-performance tradeoff: 42% lower latency than BGE (11,175 ms vs 19,236 ms) for a 1.4% lower score (0.865 vs 0.877).
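
The tradeoff figures follow from the first two rows of the benchmark table. A short check of the arithmetic, with both percentages taken relative to the BGE row (the variable and function names are illustrative, not from the repository):

```python
# Values copied from ranks 1 and 2 of the Strategy Benchmark table.
bge_score, bge_latency_ms = 0.877, 19_236
minilm_score, minilm_latency_ms = 0.865, 11_175


def relative_drop(baseline: float, value: float) -> float:
    """Return how far `value` falls below `baseline`, as a percentage of baseline."""
    return (baseline - value) / baseline * 100


latency_reduction = relative_drop(bge_latency_ms, minilm_latency_ms)
score_drop = relative_drop(bge_score, minilm_score)

print(f"latency reduction: {latency_reduction:.1f}%")  # latency reduction: 41.9%
print(f"score drop: {score_drop:.1f}%")                # score drop: 1.4%
```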

RAG Quality (LLM-as-Judge, 1-5 scale)

| Dimension | Score |
|---|---|
| 🎯 Relevance | 4.00 |
| 🔗 Groundedness | 5.00 |
| 📋 Completeness | 4.00 |
| 💬 Coherence | 5.00 |
| Overall | 4.50 / 5.00 |
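
The overall figure is consistent with an unweighted mean of the four dimension scores. That aggregation is an assumption inferred from the numbers above, not something this README states; a quick check:

```python
from statistics import mean

# Dimension scores copied from the RAG Quality table.
judge_scores = {
    "relevance": 4.00,
    "groundedness": 5.00,
    "completeness": 4.00,
    "coherence": 5.00,
}

# Assumption: the overall score is the plain average of the four dimensions.
overall = mean(list(judge_scores.values()))
print(f"overall: {overall:.2f} / 5.00")  # overall: 4.50 / 5.00
```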

🧠 What This Does

FinSight RAG downloads SEC 10-K filings for S&P 500 companies, extracts Item 1A (Risk Factors), and runs them through a fully configurable chunking-embedding-retrieval pipeline. The system benchmarks 7 strategy combinations to find the optimal configuration for financial document QA, then runs end-to-end RAG with Llama 3.2 via the HuggingFace Inference API.

Results are evaluated using an LLM-as-judge scoring rubric across four quality dimensions. A Streamlit dashboard lets you query any processed ticker, inspect citations, and compare strategy performance interactively.
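
To make the extraction step concrete, here is a minimal sketch of isolating Item 1A from a filing's plain text. It assumes the section runs from an "Item 1A ... Risk Factors" heading to the next "Item 1B" or "Item 2" heading. The function name and regular expressions are illustrative and are not taken from item_extractor.py; real filings usually need more handling than this.

```python
import re
from typing import Optional

# Hypothetical heading patterns; actual 10-K formatting varies by filer.
START = re.compile(r"^\s*item\s+1a\.?\s*[:\-.]?\s*risk\s+factors",
                   re.IGNORECASE | re.MULTILINE)
END = re.compile(r"^\s*item\s+(1b|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the text between the Item 1A heading and the next item heading."""
    starts = list(START.finditer(text))
    if not starts:
        return None
    # Use the last match so a table-of-contents entry is not mistaken
    # for the section itself.
    start = starts[-1]
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


sample = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Supply chain disruption could harm results.
Item 1B. Unresolved Staff Comments
None.
"""
print(extract_item_1a(sample))  # Supply chain disruption could harm results.
```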

🏗️ Architecture

SEC EDGAR
    │
    ▼
📥 data_pipeline/
   ├── sec_downloader.py     → pulls 10-K filings via EDGAR full-text search
   ├── item_extractor.py     → isolates Item 1A (Risk Factors) section
   └── text_cleaner.py       → normalizes whitespace, removes boilerplate
    │
    ▼
✂️  chunking/                  4 strategies
   ├── fixed_chunker.py      → fixed token windows (512 / 1000 tokens)
   ├── semantic_chunker.py   → sentence-boundary aware splits
   ├── recursive_chunker.py  → hierarchical splitting with overlap
   └── hybrid_chunker.py     → semantic first, fixed fallback ★ BEST
    │
    ▼
🧬 embedding/                  5 models
   ├── BGEEmbedder           → BAAI/bge-small-en-v1.5 (384d)  ★ BEST
   ├── MiniLMEmbedder        → all-MiniLM-L6-v2 (384d)
   ├── FinBERTEmbedder       → ProsusAI/finbert (768d)
   ├── E5Embedder            → intfloat/e5-small-v2 (384d)
   └── vector_store.py       → ChromaDB persistence
    │
    ▼
🔍 retrieval/                  3 modes
   ├── dense_retriever.py    → vector similarity (ChromaDB)
   ├── sparse_retriever.py   → BM25 keyword match
   └── hybrid_retriever.py   → RRF fusion, alpha ∈ {0.3,0.5,0.7,0.9} ★ BEST: 0.7
    │
    ▼
🤖 rag/
   ├── rag_pipeline.py       → orchestrates retrieve → build_context → generate
   ├── llm_client.py         → Llama 3.2 via HuggingFace Inference API
   ├── citation_manager.py   → tracks chunk provenance in generated answers
   └── evaluator.py          → latency, coverage, token metrics
    │
    ▼
📊 app/qa_app.py              → Streamlit dashboard
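
As an illustration of the "semantic first, fixed fallback" idea behind hybrid_chunker.py, the sketch below packs whole sentences into chunks and falls back to fixed-size windows only when a single sentence exceeds the budget. It is an assumption about how such a chunker could work, not the repository's code: it counts whitespace-separated words instead of model tokens, and `hybrid_chunk` and `max_words` are hypothetical names.

```python
import re


def hybrid_chunk(text: str, max_words: int = 50) -> list[str]:
    """Group sentences into chunks of at most `max_words` words."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for sentence in sentences:
        words = sentence.split()
        if len(words) > max_words:
            # Fixed fallback: flush, then cut the long sentence into windows.
            if current:
                chunks.append(" ".join(current))
                current = []
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
        elif len(current) + len(words) > max_words:
            # Semantic path: close the chunk at a sentence boundary.
            chunks.append(" ".join(current))
            current = words
        else:
            current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Rates may rise. Demand may fall. " + " ".join(["word"] * 12) + "."
for chunk in hybrid_chunk(text, max_words=6):
    print(repr(chunk))
```

In this example the two short sentences share one chunk, while the 12-word sentence is split into two 6-word windows.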

🗂️ Project Structure

FinSight-RAG/
├── 📁 app/                         # Streamlit dashboard
├── 📁 chunking/                    # 4 chunking strategies
├── 📁 config/                      # All strategy config files
├── 📁 data_pipeline/               # SEC ingestion + cleaning
├── 📁 embedding/                   # 5 embedding models + ChromaDB
├── 📁 rag/                         # Pipeline, LLM client, evaluator
├── 📁 retrieval/                   # Dense, sparse, hybrid retrievers
├── 📁 scripts/                     # Experiment runner scripts
├── 📄 run_focused_evaluation.py    # Strategy benchmark runner
├── 📄 run_final_rag_evaluation.py  # End-to-end RAG evaluation
├── 📄 visualize_evaluation.py      # Chart + report generator
└── 📄 requirements.txt

🚀 Getting Started

Prerequisites: Python 3.10+, free HuggingFace account, ~5GB disk space.

# 1. Clone and install
git clone https://github.com/Darsh29/FinSight-RAG.git
cd FinSight-RAG
pip install -r requirements.txt

# 2. Set your HuggingFace token
echo "HF_TOKEN=your_token_here" > .env

Get a free token at huggingface.co/settings/tokens

# 3. Download and process SEC filings
python scripts/run_data_pipeline.py

# Run for specific tickers only
python scripts/run_data_pipeline.py --tickers AAPL MSFT NVDA GOOGL

# 4. Benchmark all retrieval strategies
python run_focused_evaluation.py --max-tickers 10

# 5. Run full end-to-end RAG evaluation
python run_final_rag_evaluation.py --max-tickers 5

# 6. Generate result visualizations
python visualize_evaluation.py

# 7. Launch the interactive dashboard
streamlit run app/qa_app.py
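
Step 2 stores the token in a .env file. How the repository loads it is not shown in this README; the sketch below is one stdlib-only way a script could read it. `load_env_file` is a hypothetical helper name, and projects often use the python-dotenv package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    # Variables already set in the environment take precedence.
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


load_env_file()
if not os.environ.get("HF_TOKEN"):
    print("HF_TOKEN not set: add it to .env before running the RAG steps")
```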

⚙️ Configuration

All strategy switches live in config/ and take effect without touching pipeline code:

# config/chunking_config.py
ACTIVE_STRATEGY = "hybrid"       # fixed_512 | fixed_1000 | semantic | recursive | hybrid

# config/embedding_config.py
ACTIVE_EMBEDDING = "bge"         # bge | minilm | mpnet | finbert | e5

# config/retrieval_config.py
ACTIVE_RETRIEVAL = "hybrid_07"   # dense_only | sparse_only | hybrid_03 | hybrid_05 | hybrid_07 | hybrid_09

The hybrid retriever merges dense and sparse results with Reciprocal Rank Fusion (RRF). Alpha sets the weight given to the dense ranking, with the remainder going to BM25: hybrid_07 means 70% dense and 30% BM25.
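
Standard RRF scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. This README does not show how alpha enters that formula, so the sketch below assumes the simplest weighted variant: alpha multiplies the dense contribution and (1 - alpha) the BM25 contribution. `weighted_rrf`, `alpha_from_name`, and the constant k = 60 are illustrative choices, not names or values from hybrid_retriever.py.

```python
def alpha_from_name(name: str) -> float:
    """Map a config name such as 'hybrid_07' to its dense weight (0.7)."""
    return int(name.split("_")[1]) / 10


def weighted_rrf(dense: list[str], sparse: list[str],
                 alpha: float, k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs and return them best first."""
    scores: dict[str, float] = {}
    for weight, ranking in ((alpha, dense), (1 - alpha, sparse)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["c3", "c1", "c7"]    # ranked by vector similarity
sparse_hits = ["c1", "c9", "c3"]   # ranked by BM25
fused = weighted_rrf(dense_hits, sparse_hits, alpha_from_name("hybrid_07"))
print(fused)  # ['c3', 'c1', 'c7', 'c9']
```

With alpha at 0.7 the top dense hit (c3) edges out the top BM25 hit (c1); lowering alpha far enough flips that order.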

💡 Key Findings

- Hybrid chunking beats fixed windows on financial text, where section lengths are irregular.
- BGE outperforms FinBERT even though FinBERT is finance-specific, likely because of BGE's larger training corpus.
- RRF at alpha=0.7 consistently beats both dense-only and sparse-only retrieval across all chunking strategies.
- MiniLM is the production choice: 42% lower latency than BGE at a cost of only 1.4% in score.

🛠️ Troubleshooting

| Issue | Fix |
|---|---|
| `No cleaned data available` | Run `python scripts/run_data_pipeline.py` first |
| `HF_TOKEN not set` | Add token to `.env` file |
| HuggingFace API rate limits | Built-in exponential backoff handles this automatically |
| Out of memory during embedding | Reduce `batch_size` in `config/embedding_config.py` |
| ChromaDB collection not found | Re-run embedding pipeline for that strategy/model combo |
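
For readers unfamiliar with the backoff behaviour mentioned in the table, the following is a generic sketch of retrying a call with exponentially growing delays. It is not the repository's implementation: `with_backoff`, `RateLimitError`, and the delay parameters are hypothetical, and the real client may detect rate limits differently.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for whatever error the API client raises on a rate limit."""


def with_backoff(call: Callable[[], T], max_retries: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on RateLimitError with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # A little jitter spreads out retries from concurrent workers.
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            sleep(delay)
    raise RuntimeError("max_retries must be at least 1")


attempts = {"n": 0}


def flaky() -> str:
    """Fail twice with a rate limit, then succeed."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return "ok"


print(with_backoff(flaky, base_delay=0.01))  # ok
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.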

🧰 Tech Stack

📄 License

MIT License. See LICENSE for details.
