🔍 FinSight RAG

AI-powered financial document analysis on SEC 10-K filings

Benchmarks 7 chunking-embedding-retrieval strategy combinations, drawn from 4 chunking methods and 5 embedding models, on S&P 500 10-K filings. The best configuration scores 4.50 / 5.00 on an LLM-as-judge evaluation.


📊 Results at a Glance

Strategy Benchmark

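| Rank | Chunking | Embedding | Retrieval | Score | Latency |
|------|----------|-----------|-----------|-------|---------|
| 🥇 1 | hybrid | BGE | hybrid_07 | 0.877 | 19,236 ms |
| 🥈 2 | hybrid | MiniLM | hybrid_07 | 0.865 | 11,175 ms ⚡ |
| 🥉 3 | recursive | BGE | hybrid_07 | 0.856 | 20,553 ms |
| 4 | semantic | BGE | hybrid_05 | 0.826 | 16,666 ms |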

MiniLM at hybrid_07 offers the best latency-performance tradeoff: 42% lower latency than BGE (11,175 ms vs. 19,236 ms) for only a 1.4% relative score drop (0.865 vs. 0.877).
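
The percentages quoted above can be re-derived from the benchmark table. A minimal sketch, with illustrative variable names and the numbers copied from the BGE and MiniLM hybrid_07 rows:

```python
# Values copied from the benchmark table (hybrid chunking, hybrid_07 retrieval).
bge_score, bge_latency_ms = 0.877, 19_236
minilm_score, minilm_latency_ms = 0.865, 11_175

# Relative latency reduction and relative score drop of MiniLM versus BGE.
latency_reduction = (bge_latency_ms - minilm_latency_ms) / bge_latency_ms
score_drop = (bge_score - minilm_score) / bge_score

print(f"latency reduction: {latency_reduction:.1%}")  # 41.9%
print(f"score drop: {score_drop:.1%}")                # 1.4%
```

Both figures are relative to the BGE row, which is why the 0.012 absolute score gap shows up as 1.4%.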

RAG Quality (LLM-as-Judge, 1-5 scale)

| Dimension | Score |
|-----------|-------|
| 🎯 Relevance | 4.00 |
| 🔗 Groundedness | 5.00 |
| 📋 Completeness | 4.00 |
| 💬 Coherence | 5.00 |
| Overall | 4.50 / 5.00 |
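
The overall figure is consistent with an unweighted mean of the four dimensions. The sketch below assumes that aggregation; the repository's evaluation script may compute it differently:

```python
from statistics import mean

# Per-dimension judge scores copied from the table above (1-5 scale).
scores = {
    "relevance": 4.0,
    "groundedness": 5.0,
    "completeness": 4.0,
    "coherence": 5.0,
}

# Assumption: the overall score is the unweighted mean of the four dimensions.
overall = mean(list(scores.values()))
print(f"{overall:.2f} / 5.00")  # 4.50 / 5.00
```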

🧠 What This Does

FinSight RAG downloads SEC 10-K filings for S&P 500 companies, extracts Item 1A (Risk Factors), and runs them through a fully configurable chunking-embedding-retrieval pipeline. The system benchmarks 7 strategy combinations to find the optimal configuration for financial document QA, then runs end-to-end RAG with Llama 3.2 via the HuggingFace Inference API.
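
To illustrate the Item 1A extraction step, here is a rough sketch rather than the logic in item_extractor.py. It assumes headings of the form "Item 1A. Risk Factors" and treats the next "Item 1B" or "Item 2" heading as the end of the section; the function name and both patterns are hypothetical, and real filings vary widely in formatting.

```python
import re
from typing import Optional

# Assumed heading shapes; real 10-K filings are far less regular than this.
START = re.compile(
    r"^\s*item\s+1a\.?\s*[-:.]?\s*risk\s+factors",
    re.IGNORECASE | re.MULTILINE,
)
END = re.compile(r"^\s*item\s+(1b|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the body of Item 1A, or None if no heading is found."""
    starts = list(START.finditer(text))
    if not starts:
        return None
    # The table of contents usually mentions Item 1A first, so the last
    # match is more likely to be the real section heading.
    start = starts[-1]
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()
```

Picking the last heading match is a heuristic for skipping the table of contents; a filing that refers back to "Item 1A. Risk Factors" at the start of a later line would defeat it.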

Results are evaluated using an LLM-as-judge scoring rubric across four quality dimensions. A Streamlit dashboard lets you query any processed ticker, inspect citations, and compare strategy performance interactively.


🏗️ Architecture

SEC EDGAR
    │
    ▼
📥 data_pipeline/
   ├── sec_downloader.py     → pulls 10-K filings via EDGAR full-text search
   ├── item_extractor.py     → isolates Item 1A (Risk Factors) section
   └── text_cleaner.py       → normalizes whitespace, removes boilerplate
    │
    ▼
✂️  chunking/                  4 strategies
   ├── fixed_chunker.py      → fixed token windows (512 / 1000 tokens)
   ├── semantic_chunker.py   → sentence-boundary aware splits
   ├── recursive_chunker.py  → hierarchical splitting with overlap
   └── hybrid_chunker.py     → semantic first, fixed fallback ★ BEST
    │
    ▼
🧬 embedding/                  5 models
   ├── BGEEmbedder           → BAAI/bge-small-en-v1.5 (384d)  ★ BEST
   ├── MiniLMEmbedder        → all-MiniLM-L6-v2 (384d)
   ├── FinBERTEmbedder       → ProsusAI/finbert (768d)
   ├── E5Embedder            → intfloat/e5-small-v2 (384d)
   └── vector_store.py       → ChromaDB persistence
    │
    ▼
🔍 retrieval/                  3 modes
   ├── dense_retriever.py    → vector similarity (ChromaDB)
   ├── sparse_retriever.py   → BM25 keyword match
   └── hybrid_retriever.py   → RRF fusion, alpha ∈ {0.3,0.5,0.7,0.9} ★ BEST: 0.7
    │
    ▼
🤖 rag/
   ├── rag_pipeline.py       → orchestrates retrieve → build_context → generate
   ├── llm_client.py         → Llama 3.2 via HuggingFace Inference API
   ├── citation_manager.py   → tracks chunk provenance in generated answers
   └── evaluator.py          → latency, coverage, token metrics
    │
    ▼
📊 app/qa_app.py              → Streamlit dashboard
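
The build_context and citation-tracking stages of the rag/ layer could look roughly like the sketch below. Every name and signature here (Chunk, build_context, build_prompt, cited_chunk_ids) is hypothetical and is not taken from rag_pipeline.py or citation_manager.py. The idea is to number the retrieved chunks in the prompt so that [n] markers in the generated answer can be mapped back to chunk ids.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # hypothetical identifier for a stored chunk
    ticker: str
    text: str


def build_context(chunks: list[Chunk]) -> tuple[str, dict[int, str]]:
    """Number each retrieved chunk so the answer can cite it as [1], [2], ..."""
    citations = {i: c.chunk_id for i, c in enumerate(chunks, start=1)}
    context = "\n\n".join(
        f"[{i}] ({c.ticker}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return context, citations


def build_prompt(question: str, context: str) -> str:
    """Assemble a grounded-answer prompt around the numbered excerpts."""
    return (
        "Answer using only the numbered excerpts below and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def cited_chunk_ids(answer: str, citations: dict[int, str]) -> list[str]:
    """Map [n] markers in a generated answer back to chunk ids, in first-use order."""
    seen: list[str] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        chunk_id = citations.get(int(match.group(1)))
        if chunk_id is not None and chunk_id not in seen:
            seen.append(chunk_id)
    return seen
```

Markers that point at a number outside the context (for example a hallucinated [9]) are dropped in this sketch rather than raising an error.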

🗂️ Project Structure

FinSight-RAG/
├── 📁 app/                         # Streamlit dashboard
├── 📁 chunking/                    # 4 chunking strategies
├── 📁 config/                      # All strategy config files
├── 📁 data_pipeline/               # SEC ingestion + cleaning
├── 📁 embedding/                   # 5 embedding models + ChromaDB
├── 📁 rag/                         # Pipeline, LLM client, evaluator
├── 📁 retrieval/                   # Dense, sparse, hybrid retrievers
├── 📁 scripts/                     # Experiment runner scripts
├── 📄 run_focused_evaluation.py    # Strategy benchmark runner
├── 📄 run_final_rag_evaluation.py  # End-to-end RAG evaluation
├── 📄 visualize_evaluation.py      # Chart + report generator
└── 📄 requirements.txt

🚀 Getting Started

Prerequisites: Python 3.10+, a free HuggingFace account, and ~5 GB of disk space.

# 1. Clone and install
git clone https://github.com/Darsh29/FinSight-RAG.git
cd FinSight-RAG
pip install -r requirements.txt

# 2. Set your HuggingFace token
echo "HF_TOKEN=your_token_here" > .env

Get a free token at huggingface.co/settings/tokens

# 3. Download and process SEC filings
python scripts/run_data_pipeline.py

# Run for specific tickers only
python scripts/run_data_pipeline.py --tickers AAPL MSFT NVDA GOOGL

# 4. Benchmark all retrieval strategies
python run_focused_evaluation.py --max-tickers 10

# 5. Run full end-to-end RAG evaluation
python run_final_rag_evaluation.py --max-tickers 5

# 6. Generate result visualizations
python visualize_evaluation.py

# 7. Launch the interactive dashboard
streamlit run app/qa_app.py
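
Step 2 writes HF_TOKEN to a .env file. As a rough illustration of how a script could pick it up using only the standard library, consider the sketch below. The helper names load_env_file and get_hf_token are hypothetical, and the project may load the token differently (for instance through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE lines from a .env file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_hf_token() -> str:
    """Return HF_TOKEN, loading .env first, or fail with a clear message."""
    load_env_file()
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN not set: add it to the .env file")
    return token
```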

⚙️ Configuration

All strategy switches live in config/ and take effect without touching pipeline code:

# config/chunking_config.py
ACTIVE_STRATEGY = "hybrid"       # fixed_512 | fixed_1000 | semantic | recursive | hybrid

# config/embedding_config.py
ACTIVE_EMBEDDING = "bge"         # bge | minilm | mpnet | finbert | e5

# config/retrieval_config.py
ACTIVE_RETRIEVAL = "hybrid_07"   # dense_only | sparse_only | hybrid_03 | hybrid_05 | hybrid_07 | hybrid_09

The hybrid retriever merges dense and sparse results with Reciprocal Rank Fusion (RRF). Alpha is the weight given to the dense ranking, with the remaining 1 - alpha going to BM25, so hybrid_07 means 70% dense + 30% BM25.
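
One common way to apply such a weight is to scale each list's reciprocal-rank contribution. The sketch below assumes that formulation and the conventional smoothing constant k = 60; the function name, the ALPHAS mapping, and the exact formula are illustrative and may differ from hybrid_retriever.py.

```python
from collections import defaultdict

# Assumed mapping from config names to the dense weight alpha.
ALPHAS = {"hybrid_03": 0.3, "hybrid_05": 0.5, "hybrid_07": 0.7, "hybrid_09": 0.9}


def weighted_rrf(dense_ids, sparse_ids, alpha=0.7, k=60, top_n=5):
    """Fuse two ranked id lists; alpha weights dense, (1 - alpha) weights sparse."""
    scores = defaultdict(float)
    for weight, ranking in ((alpha, dense_ids), (1.0 - alpha, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for the documents it ranks.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense = ["a", "b", "c"]    # ids ranked by vector similarity
sparse = ["c", "a", "d"]   # ids ranked by BM25
print(weighted_rrf(dense, sparse, alpha=ALPHAS["hybrid_07"]))  # ['a', 'c', 'b', 'd']
```

Because RRF works on ranks rather than raw scores, the dense similarities and BM25 scores never need to be normalized onto a common scale.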


💡 Key Findings

  • Hybrid chunking beats fixed windows on financial text with irregular section lengths
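  • BGE outperforms FinBERT despite FinBERT being finance-specific, likely due to BGE's larger training corpus; ProsusAI/finbert is also fine-tuned for sentiment classification rather than retrieval, which may contribute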
  • RRF at alpha=0.7 consistently beats both dense-only and sparse-only retrieval across all chunking strategies
  • MiniLM is the production choice: 42% lower latency than BGE for only a 1.4% relative score drop
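
To make the "semantic first, fixed fallback" idea behind hybrid chunking concrete, here is a minimal sketch. It assumes whitespace-separated words as a stand-in for tokens and a simple punctuation-based sentence splitter; hybrid_chunks is a hypothetical name, and hybrid_chunker.py may work quite differently.

```python
import re


def hybrid_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole sentences into chunks; cut oversized sentences into fixed windows."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()  # assumption: one word is roughly one token
        if len(words) > max_tokens:
            # Fixed fallback: flush the pending chunk, then window the long sentence.
            if current:
                chunks.append(" ".join(current))
                current = []
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if len(current) + len(words) > max_tokens:
            # The sentence does not fit, so close the chunk at a sentence boundary.
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Irregular section lengths are where this should help: short sentences are grouped without being cut mid-thought, while a single run-on risk paragraph still gets bounded chunk sizes.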

🛠️ Troubleshooting

| Issue | Fix |
|-------|-----|
| No cleaned data available | Run python scripts/run_data_pipeline.py first |
| HF_TOKEN not set | Add the token to the .env file |
| HuggingFace API rate limits | Built-in exponential backoff handles this automatically |
| Out of memory during embedding | Reduce batch_size in config/embedding_config.py |
| ChromaDB collection not found | Re-run the embedding pipeline for that strategy/model combo |
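
For readers unfamiliar with the rate-limit handling mentioned in the table, exponential backoff generally looks like the sketch below. It is not the code in llm_client.py: with_backoff, its parameters, and the use of RuntimeError as a stand-in for the client's rate-limit exception are all assumptions.

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry `call`, doubling the wait after each failure and adding a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RuntimeError:  # stand-in for the API client's rate-limit error
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter of up to 10% spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep function is injectable so the retry schedule can be checked in tests without actually waiting.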

🧰 Tech Stack

Sentence Transformers · ChromaDB · BM25 · Streamlit · HuggingFace · Llama · pandas · SEC EDGAR


πŸ“„ License

MIT License. See LICENSE for details.
