A Retrieval-Augmented Generation (RAG) system that lets you ask questions about Reinforcement Learning research papers and get grounded, cited answers — powered by a local LLM running entirely on your machine.
- PDF Ingestion — Extracts and chunks text from RL papers with smart paragraph merging, reference filtering, and PDF artifact cleanup
- Semantic Search — FAISS vector index with sentence-transformer embeddings for fast retrieval
- IDF-Weighted Reranking — Two-stage retrieval: FAISS top-k → keyword reranking with stopword removal, Porter stemming, and rare-term boosting
- Local LLM Generation — TinyLlama 1.1B (GGUF Q4) via `llama-cpp-python` with Metal GPU acceleration on Apple Silicon
- REST API — FastAPI server with `/ask` and `/health` endpoints
- Dockerized — Ready to containerize for deployment
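The two-stage retrieval above can be sketched as follows. This is a simplified, illustrative version: the stopword list and the crude suffix stripper are stand-ins for the real stopword removal and Porter stemming in `src/retriever.py`, and the actual scoring there may differ.

```python
import math

STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for", "with", "what"}

def stem(word):
    # Crude suffix stripper standing in for Porter stemming
    for suf in ("ization", "ations", "ation", "ings", "ing", "ies", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def tokenize(text):
    tokens = [t.strip(".,;:()'\"").lower() for t in text.split()]
    return [stem(t) for t in tokens if t and t.lower() not in STOPWORDS]

def rerank(query, chunks, top_n=3):
    """Rescore FAISS candidates by IDF-weighted keyword overlap with the query."""
    docs = [set(tokenize(c)) for c in chunks]
    n = len(docs)
    # IDF over the candidate pool: rare terms get a larger boost
    idf = {t: math.log((n + 1) / (1 + sum(t in d for d in docs))) + 1
           for d in docs for t in d}
    q_terms = set(tokenize(query))
    scores = [sum(idf.get(t, 0.0) for t in q_terms & d) for d in docs]
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_n]]
```

In the real pipeline, `chunks` would be the FAISS top-k candidates, so the IDF statistics are computed only over that small pool rather than the whole corpus.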
```
User Query
    │
    ▼
┌──────────┐     ┌───────────┐     ┌───────────┐
│ FastAPI  │────▶│ Retriever │────▶│ Generator │
│ (api.py) │     │           │     │           │
└──────────┘     │  FAISS    │     │ TinyLlama │
                 │ + Rerank  │     │  (GGUF)   │
                 └───────────┘     └───────────┘
                       │                 │
                       ▼                 ▼
                 ┌───────────┐     ┌──────────┐
                 │ Embeddings│     │  Answer  │
                 │   Index   │     │ + Cites  │
                 └───────────┘     └──────────┘
```
```
RL Research Paper Assistant/
├── data/papers/             # PDF research papers (19 RL papers)
├── models/
│   ├── tinyllama.gguf       # TinyLlama 1.1B Q4 model (~608MB)
│   ├── faiss_index.bin      # FAISS vector index
│   └── chunk_metadata.pkl   # Chunk text + source metadata
├── src/
│   ├── ingest.py            # PDF → chunks → embeddings → FAISS index
│   ├── retriever.py         # Semantic search + IDF reranking
│   ├── generator.py         # LLM prompt building + generation
│   ├── utils.py             # Shared utilities (tokenization, stemming, logging)
│   ├── api.py               # FastAPI REST endpoints
│   └── test.py              # Quick test script
├── requirements.txt
├── dockerfile
├── .dockerignore
└── .gitignore
```
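The chunking stage of `src/ingest.py` (paragraph merging plus a sliding window, as described in the features above) can be approximated as follows. This is a hypothetical sketch: the function name, thresholds, and overlap size are illustrative, not the project's actual values.

```python
def chunk_paragraphs(text, min_chars=200, max_chars=1200, overlap=150):
    """Merge short paragraphs, then split oversized ones with a sliding window."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    # Pass 1: merge adjacent paragraphs until each chunk is long enough
    merged, buffer = [], ""
    for p in paragraphs:
        buffer = f"{buffer} {p}".strip() if buffer else p
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)

    # Pass 2: sliding window over chunks that exceed max_chars
    chunks = []
    for m in merged:
        if len(m) <= max_chars:
            chunks.append(m)
            continue
        step = max_chars - overlap
        for start in range(0, len(m), step):
            chunks.append(m[start:start + max_chars])
            if start + max_chars >= len(m):
                break
    return chunks
```

Each resulting chunk is then embedded with the sentence-transformer model and added to the FAISS index, with its source paper recorded in `chunk_metadata.pkl`.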
- Python 3.10+
- ~1GB free disk space (for model + index)
Create a virtual environment and install dependencies:

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Download the model:

```
curl -L -o models/tinyllama.gguf \
  "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
```

Place your PDF papers in `data/papers/`.
Build the index:

```
cd src
python ingest.py
```

Start the API server:

```
cd src
uvicorn api:app --reload
```

Check health:

```
curl http://localhost:8000/health
```

Ask a question:

```
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Proximal Policy Optimization?"}'
```

Response:
```json
{
  "answer": "Proximal Policy Optimization (PPO) is a family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a 'surrogate' objective function...",
  "latency_seconds": 6.374
}
```

Build and run with Docker:

```
docker build -t rl-rag .
docker run -p 8000:8000 rl-rag
```

Note: Metal GPU acceleration is not available inside Docker (Linux VM). The LLM will run CPU-only, which is slower but functional.
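The same `/ask` call can be made from Python using only the standard library. A convenience sketch — `ask()` assumes the server from the steps above is running on `localhost:8000`:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # default uvicorn host/port

def build_ask_request(query, url=API_URL):
    """Build the POST request the /ask endpoint expects."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(query):
    """Send the question and return the parsed JSON response."""
    with urllib.request.urlopen(build_ask_request(query)) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(ask("What is Proximal Policy Optimization?")["answer"])
```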
| Component | Choice | Reason |
|---|---|---|
| Embeddings | all-MiniLM-L6-v2 | Fast, lightweight, good quality |
| Vector DB | FAISS (IndexFlatL2) | Simple, no server needed |
| LLM | TinyLlama 1.1B Q4 | Runs locally, no API keys |
| Chunking | Paragraph-merge + sliding window | Semantic coherence vs. fixed-size |
| Reranking | IDF-weighted keyword + vector similarity | Better precision than vector-only |
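For intuition on the vector-DB choice: `IndexFlatL2` is exact, brute-force L2 search — no approximation, no server. The dependency-free sketch below mimics its behavior with NumPy to show what the retriever gets back (the project itself uses the real `faiss` index stored in `faiss_index.bin`):

```python
import numpy as np

class FlatL2Index:
    """Exact L2 nearest-neighbor search, mirroring faiss.IndexFlatL2's behavior."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        # Append new embedding vectors to the flat store
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, queries, k):
        # Squared L2 distance from each query to every stored vector
        q = np.asarray(queries, dtype=np.float32)
        dists = ((q[:, None, :] - self.vectors[None, :, :]) ** 2).sum(-1)
        idx = np.argsort(dists, axis=1)[:, :k]
        return np.take_along_axis(dists, idx, axis=1), idx
```

Because the search is exhaustive, results are exact; FAISS makes the same trade-off in `IndexFlatL2`, which is why it suits a corpus of 19 papers where approximate indexes would add complexity for no gain.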
The system comes pre-configured with 19 foundational RL papers including:
- PPO — Proximal Policy Optimization (Schulman et al., 2017)
- TRPO — Trust Region Policy Optimization (Schulman et al., 2015)
- DDPG — Deep Deterministic Policy Gradient (Lillicrap et al., 2015)
- A3C — Asynchronous Advantage Actor-Critic (Mnih et al., 2016)
- SAC — Soft Actor-Critic (Haarnoja et al., 2018)
- AlphaGo — Mastering Go with Neural Networks (Silver et al., 2016)
- DQN — Playing Atari with Deep RL (Mnih et al., 2013)
- And more...
This project is for educational and portfolio purposes.