A Local Retrieval-Augmented Generation (RAG) System for IPCC Climate Reports
This system helps researchers, policymakers, and students quickly find and understand information from IPCC (Intergovernmental Panel on Climate Change) reports. Think of it as a smart research assistant that can:
- 📖 Load IPCC chapters from your computer
- 🔍 Answer questions about climate science
- 📍 Show you exactly where information comes from (paragraph IDs)
- 🤖 Run entirely on your computer (no internet needed after setup)
First, make sure you have Python 3.12+ installed:
- Windows: Download from python.org
- Mac: Usually pre-installed, or use Homebrew:

  ```bash
  brew install python@3.12
  ```
- Linux (Debian/Ubuntu):

  ```bash
  sudo apt install python3.12 python3.12-venv
  ```
Note: This system is developed and tested on Python 3.12; older versions may run into compatibility issues.
For Windows users experiencing slow installation:
```bash
# Download the project
git clone https://github.com/yourusername/llmrag.git
cd llmrag

# Use the optimized installation script
# Windows:
install_fast.bat
# Unix/Linux/Mac:
./install_fast.sh
```
```bash
# Download the project
git clone https://github.com/yourusername/llmrag.git
cd llmrag

# Install, preferring pre-built wheels (faster)
pip install --upgrade pip setuptools wheel
pip install --prefer-binary -e .

# Or install required packages
pip install -r requirements.txt

# Start the web interface
streamlit run streamlit_app.py

# Or use the command line
python -m llmrag.cli list-chapters
python -m llmrag.cli ask "What are the main findings about temperature trends?" --chapter wg1/chapter02
```
- What is RAG?: LangChain RAG Tutorial
- Climate Science: IPCC FAQ
- Python Basics: Python.org Tutorial
- RAG Systems: Retrieval-Augmented Generation Paper
- Vector Databases: ChromaDB Documentation
- Embeddings: Sentence Transformers Guide
- Streamlit: Streamlit Documentation
- HuggingFace: Transformers Tutorial
- Vector Search: FAISS Tutorial
- Run `streamlit run streamlit_app.py`
- Open your browser to `http://localhost:8501`
- Select a chapter and start asking questions!
```bash
# See available chapters
python -m llmrag.cli list-chapters

# Ask a question
python -m llmrag.cli ask "What causes global warming?" --chapter wg1/chapter02

# Interactive mode
python -m llmrag.cli interactive --chapter wg1/chapter02
```
```python
from llmrag.chapter_rag import ask_chapter

# Ask a question about a chapter
result = ask_chapter(
    question="What are the main climate change impacts?",
    chapter_name="wg1/chapter02"
)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['paragraph_ids']}")
```
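For scripting, it can help to batch several questions and collect every cited paragraph in one pass. The `ask_many` helper below is a hypothetical wrapper, not part of the llmrag API; it works with any callable that returns an `answer`/`paragraph_ids` dict, such as `ask_chapter`:

```python
# Sketch: batch several questions through an ask_chapter-style function
# and aggregate the paragraph IDs cited across all answers.
# `ask_fn` is any callable returning {"answer": ..., "paragraph_ids": [...]}.
def ask_many(ask_fn, questions, chapter_name):
    results, cited = [], set()
    for q in questions:
        r = ask_fn(question=q, chapter_name=chapter_name)
        results.append((q, r["answer"]))
        cited.update(r["paragraph_ids"])
    return results, sorted(cited)

# Usage with a stub standing in for llmrag.chapter_rag.ask_chapter:
def fake_ask(question, chapter_name):
    return {"answer": f"stub answer for {question!r}", "paragraph_ids": ["4.1_p3"]}

results, sources = ask_many(fake_ask, ["Q1?", "Q2?"], "wg1/chapter02")
print(sources)  # → ['4.1_p3']
```

In real use you would pass `ask_chapter` itself in place of the `fake_ask` stub.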
```
llmrag/
├── 📖 IPCC Chapters          # Climate report data
├── 🤖 RAG System             # Question answering engine
├── 🌐 Web Interface          # User-friendly browser app
├── 💻 Command Line Tools     # Power user interface
├── 🔧 Processing Pipeline    # Data preparation tools
└── 📊 Documentation          # Guides and tutorials
```
- Document Loading: Processes IPCC HTML chapters
- Text Chunking: Breaks documents into searchable pieces
- Vector Search: Finds relevant information quickly
- Answer Generation: Creates coherent responses
- Source Tracking: Shows exactly where answers come from
- Streamlit Web App: Beautiful, interactive interface
- Command Line: Fast, scriptable interface
- Python API: For integration with other tools
- HTML Cleaning: Removes formatting, keeps content
- Paragraph IDs: Tracks information sources
- Semantic Chunking: Keeps related information together
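As an illustration of the cleaning-plus-ID-tracking step, here is a minimal stdlib sketch (not the package's actual splitter) that strips tags from IPCC-style HTML while keeping each `<p id="...">` attribute attached to its text:

```python
# Sketch: extract (paragraph_id, text) pairs from IPCC-style HTML using
# only the standard library. Illustrates the idea of keeping <p id="...">
# attributes alongside the cleaned text; llmrag uses its own splitter.
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._current_id = None
        self._buffer = []
        self.paragraphs = []  # list of (id, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_id = dict(attrs).get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current_id is not None:
            # Normalize whitespace left behind by inline markup
            text = " ".join("".join(self._buffer).split())
            self.paragraphs.append((self._current_id, text))
            self._current_id = None

html_doc = ('<p id="4.1_p1">Warming is <b>unequivocal</b>.</p>'
            '<p id="4.1_p2">Projections vary.</p>')
extractor = ParagraphExtractor()
extractor.feed(html_doc)
print(extractor.paragraphs)
# → [('4.1_p1', 'Warming is unequivocal.'), ('4.1_p2', 'Projections vary.')]
```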
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Language Model: GPT-2 Large (774M parameters)
- Vector Database: ChromaDB (local storage)
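Conceptually, retrieval boils down to ranking stored chunks by vector similarity, which is what ChromaDB does over the Sentence-Transformer embeddings. A toy sketch with made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions):

```python
# Sketch: nearest-neighbour retrieval by cosine similarity, the core
# operation behind vector search. Vectors and IDs here are invented
# for illustration only.
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """Return the paragraph IDs of the k most similar chunks."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]

chunks = [
    ("4.1_p3", [0.9, 0.1, 0.0]),
    ("4.3.2.2_p2", [0.7, 0.6, 0.1]),
    ("4.5_p1", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], chunks))  # → ['4.1_p3', '4.3.2.2_p2']
```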
- Speed: Answers in 2-5 seconds
- Accuracy: Based on IPCC content only
- Memory: ~2GB RAM for full system
- Storage: ~500MB for all chapters
We welcome contributions! Here's how to help:
- 📝 Report bugs or suggest improvements
- 📚 Test the system with your research questions
- 📖 Improve documentation or write tutorials
- 🌍 Share with colleagues who might find it useful
- 🔧 Fix bugs or add features
- 🧪 Add tests to ensure quality
- 📦 Improve packaging or deployment
- 🚀 Optimize performance
See CONTRIBUTING.md for detailed guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- IPCC: For providing the climate science reports
- HuggingFace: For the language models and tools
- ChromaDB: For the vector database
- Streamlit: For the web interface framework
- Open Source Community: For all the amazing tools we build upon
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
- More IPCC Chapters: Add WG2 and WG3 reports
- Better Models: Upgrade to larger language models
- Multi-language: Support for non-English reports
- Collaborative Features: Share questions and answers
- Mobile App: iOS and Android versions
Made with ❤️ for climate science research
- Python 3.9 or higher (Python 3.12 recommended)
- Git
```bash
git clone https://github.com/semanticClimate/llmrag.git
cd llmrag
```
Windows (Command Prompt):

```bash
python -m venv venv
venv\Scripts\activate
```

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```

macOS/Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```
```bash
# Install the package in editable mode
pip install -e .

# Or just install the dependencies
pip install -r requirements.txt
```
```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
coverage run --source=llmrag -m pytest tests/
coverage report -m
```
```bash
python test_ipcc_ingestion.py
```
This will:
- Ingest the IPCC Chapter 4 HTML file with paragraph IDs
- Test the RAG pipeline with climate-related queries
- Show which paragraph IDs were used to generate answers
The system now supports ingesting HTML documents and tracking paragraph IDs for source attribution:
- HTML Splitter: Extracts text while preserving paragraph IDs from HTML elements
- RAG Pipeline: Returns paragraph IDs used in generating answers
- Test Script: `test_ipcc_ingestion.py` demonstrates the feature with IPCC content
```
Query: What are the main scenarios used in climate projections?
Answer: [Generated answer]
Paragraph IDs found: ['4.1_p3', '4.3.2.2_p2']
```
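The IDs above combine a section number with a paragraph index. In llmrag they are read from the source HTML, but the same scheme could be generated during chunking for documents that lack IDs; a hypothetical sketch:

```python
# Sketch: assign "section_pN"-style paragraph IDs during chunking.
# llmrag reads these IDs from the source HTML; this shows how the same
# scheme could be generated for documents that lack them.
def assign_ids(section, paragraphs):
    """Pair each paragraph with a '<section>_p<N>' identifier."""
    return [(f"{section}_p{i}", text) for i, text in enumerate(paragraphs, start=1)]

print(assign_ids("4.1", ["First paragraph.", "Second paragraph."]))
# → [('4.1_p1', 'First paragraph.'), ('4.1_p2', 'Second paragraph.')]
```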
If `pip install -e .` takes more than 10 minutes:

- Use the fast installation scripts:

  ```bash
  # Windows
  install_fast.bat
  # Unix/Linux/Mac
  ./install_fast.sh
  ```

- Try staged installation:

  ```bash
  pip install --upgrade pip setuptools wheel
  pip install pyyaml lxml pytest rich streamlit toml
  pip install chromadb langchain
  pip install --only-binary=:all: transformers sentence-transformers
  pip install -e .
  ```

- Use conda for heavy packages (Windows):

  ```bash
  conda install -c conda-forge transformers sentence-transformers
  pip install -e .
  ```
- Virtual Environment: Make sure to use the correct activation script for your shell
- Dependencies: If you encounter issues with `lxml` or `transformers`, try `pip install lxml transformers`
- DLL Errors: Ensure you have the latest Python and pip versions
- Visual Studio Build Tools: Install Visual Studio Build Tools 2019+ for compilation
- Python Version: We recommend Python 3.12. Some libraries may have compatibility issues with older versions
- Virtual Environment: ALWAYS USE A VIRTUAL ENVIRONMENT to avoid conflicts
- NumPy Conflicts: If you have NumPy in your global environment, it may cause issues. Use a clean virtual environment
- Network Issues: Large model downloads may time out; use `pip install --timeout 300` to allow more time
```
llmrag/
├── llmrag/                    # Main package
│   ├── chunking/              # Text splitting (including HTML)
│   ├── embeddings/            # Embedding models
│   ├── models/                # LLM models
│   ├── pipelines/             # RAG pipeline
│   └── retrievers/            # Vector stores
├── tests/                     # Test suite
│   └── ipcc/                  # IPCC test data
├── test_ipcc_ingestion.py     # IPCC ingestion test script
└── requirements.txt           # Dependencies
```
For chat history and development notes, see:
- `./project.md` - Project documentation
- `./all_code.py` - Development history (messy)
The test suite includes:
- Unit tests for all components
- Integration tests for the RAG pipeline
- HTML ingestion tests with paragraph ID tracking
- IPCC content tests
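A paragraph-ID tracking test might look like the sketch below; `extract_paragraphs` is a simplified stand-in, not the package's real splitter API:

```python
# Sketch: a unit test for paragraph-ID tracking. The extract_paragraphs
# helper is hypothetical; the real suite tests llmrag's own HTML splitter.
import re
import unittest

def extract_paragraphs(html):
    """Return (id, text) pairs from simple <p id="..."> elements."""
    return re.findall(r'<p id="([^"]+)">([^<]*)</p>', html)

class TestParagraphIdTracking(unittest.TestCase):
    def test_ids_are_preserved(self):
        html = '<p id="4.1_p1">Warming is unequivocal.</p>'
        self.assertEqual(extract_paragraphs(html),
                         [("4.1_p1", "Warming is unequivocal.")])

# Run the case programmatically so it works outside a test runner
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParagraphIdTracking)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```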
Run tests with:

```bash
python -m pytest tests/ -v
```

Expected result:

```
=========================================== 10 passed in 27.61s ============================================
```
```bash
git clone https://github.com/semanticClimate/llmrag/
cd llmrag
```

Then set up and activate a virtual environment (on Mac):

```bash
python3.12 -m venv venv
source venv/bin/activate
```

Install the dependencies and run the tests (should take about half a minute):

```bash
pip install -r requirements.txt
coverage run --source=llmrag -m unittest discover -s tests
```

Result:

```
..Device set to use cpu
..Retrieved: [('Paris is the capital of France.', 0.2878604531288147)]
.Retrieved: [('Paris is the capital of France.', 0.37026578187942505)]
.
----------------------------------------------------------------------
Ran 6 tests in 20.267s
OK
```

To print the coverage:

```bash
coverage report -m
```
We run on Python 3.12. This can cause problems with some libraries, such as NumPy. Although `numpy` is not currently a direct dependency of `llmrag`, it may be present in your environment. ALWAYS USE A VIRTUAL ENVIRONMENT. (PMR found behavioural differences in NumPy between Python 3.11 and 3.12.)