A Local Retrieval-Augmented Generation (RAG) System for IPCC Climate Reports
This system helps researchers, policymakers, and students quickly find and understand information from IPCC (Intergovernmental Panel on Climate Change) reports. Think of it as a smart research assistant that can:
- 📖 Load IPCC chapters from your computer
- 🔍 Answer questions about climate science
- 📍 Show you exactly where information comes from (paragraph IDs)
- 🤖 Run entirely on your computer (no internet needed after setup)
First, make sure you have Python 3.12+ installed:
- Windows: Download from python.org
- Mac: Usually pre-installed, or use Homebrew:

  ```bash
  brew install python@3.12
  ```
- Linux (Debian/Ubuntu):

  ```bash
  sudo apt install python3.12 python3.12-venv
  ```
Note: This system is developed and tested on Python 3.12; older versions may run into compatibility issues.
For Windows users experiencing slow installation:
```bash
# Download the project
git clone https://github.com/yourusername/llmrag.git
cd llmrag

# Use the optimized installation script
# Windows:
install_fast.bat
# Unix/Linux/Mac:
./install_fast.sh
```
```bash
# Download the project
git clone https://github.com/yourusername/llmrag.git
cd llmrag

# Install, preferring pre-built wheels (faster)
pip install --upgrade pip setuptools wheel
pip install --prefer-binary -e .

# Or install required packages
pip install -r requirements.txt

# Start the web interface
streamlit run streamlit_app.py

# Or use the command line
python -m llmrag.cli list-chapters
python -m llmrag.cli ask "What are the main findings about temperature trends?" --chapter wg1/chapter02
```
- What is RAG?: LangChain RAG Tutorial
- Climate Science: IPCC FAQ
- Python Basics: Python.org Tutorial
- RAG Systems: Retrieval-Augmented Generation Paper
- Vector Databases: ChromaDB Documentation
- Embeddings: Sentence Transformers Guide
- Streamlit: Streamlit Documentation
- HuggingFace: Transformers Tutorial
- Vector Search: FAISS Tutorial
- Run `streamlit run streamlit_app.py`
- Open your browser to `http://localhost:8501`
- Select a chapter and start asking questions!
```bash
# See available chapters
python -m llmrag.cli list-chapters

# Ask a question
python -m llmrag.cli ask "What causes global warming?" --chapter wg1/chapter02

# Interactive mode
python -m llmrag.cli interactive --chapter wg1/chapter02
```
```python
from llmrag.chapter_rag import ask_chapter

# Ask a question about a chapter
result = ask_chapter(
    question="What are the main climate change impacts?",
    chapter_name="wg1/chapter02"
)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['paragraph_ids']}")
```
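For scripting, it can help to batch several questions and collect every cited paragraph in one pass. The `ask_many` helper below is a hypothetical wrapper, not part of the llmrag API; it works with any callable that returns an `answer`/`paragraph_ids` dict, such as `ask_chapter`:

```python
# Sketch: batch several questions through an ask_chapter-style function
# and aggregate the paragraph IDs cited across all answers.
# `ask_fn` is any callable returning {"answer": ..., "paragraph_ids": [...]}.
def ask_many(ask_fn, questions, chapter_name):
    results, cited = [], set()
    for q in questions:
        r = ask_fn(question=q, chapter_name=chapter_name)
        results.append((q, r["answer"]))
        cited.update(r["paragraph_ids"])
    return results, sorted(cited)

# Usage with a stub standing in for llmrag.chapter_rag.ask_chapter:
def fake_ask(question, chapter_name):
    return {"answer": f"stub answer for {question!r}", "paragraph_ids": ["4.1_p3"]}

results, sources = ask_many(fake_ask, ["Q1?", "Q2?"], "wg1/chapter02")
print(sources)  # → ['4.1_p3']
```

In real use you would pass `ask_chapter` itself in place of the `fake_ask` stub.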
```
llmrag/
├── 📖 IPCC Chapters          # Climate report data
├── 🤖 RAG System             # Question answering engine
├── 🌐 Web Interface          # User-friendly browser app
├── 💻 Command Line Tools     # Power user interface
├── 🔧 Processing Pipeline    # Data preparation tools
└── 📊 Documentation          # Guides and tutorials
```
- Document Loading: Processes IPCC HTML chapters
- Text Chunking: Breaks documents into searchable pieces
- Vector Search: Finds relevant information quickly
- Answer Generation: Creates coherent responses
- Source Tracking: Shows exactly where answers come from
- Streamlit Web App: Beautiful, interactive interface
- Command Line: Fast, scriptable interface
- Python API: For integration with other tools
- HTML Cleaning: Removes formatting, keeps content
- Paragraph IDs: Tracks information sources
- Semantic Chunking: Keeps related information together
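As an illustration of the cleaning-plus-ID-tracking step, here is a minimal stdlib sketch (not the package's actual splitter) that strips tags from IPCC-style HTML while keeping each `<p id="...">` attribute attached to its text:

```python
# Sketch: extract (paragraph_id, text) pairs from IPCC-style HTML using
# only the standard library. Illustrates the idea of keeping <p id="...">
# attributes alongside the cleaned text; llmrag uses its own splitter.
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._current_id = None
        self._buffer = []
        self.paragraphs = []  # list of (id, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_id = dict(attrs).get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current_id is not None:
            # Normalize whitespace left behind by inline markup
            text = " ".join("".join(self._buffer).split())
            self.paragraphs.append((self._current_id, text))
            self._current_id = None

html_doc = ('<p id="4.1_p1">Warming is <b>unequivocal</b>.</p>'
            '<p id="4.1_p2">Projections vary.</p>')
extractor = ParagraphExtractor()
extractor.feed(html_doc)
print(extractor.paragraphs)
# → [('4.1_p1', 'Warming is unequivocal.'), ('4.1_p2', 'Projections vary.')]
```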
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Language Model: GPT-2 Large (774M parameters)
- Vector Database: ChromaDB (local storage)
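Conceptually, retrieval boils down to ranking stored chunks by vector similarity, which is what ChromaDB does over the Sentence-Transformer embeddings. A toy sketch with made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions):

```python
# Sketch: nearest-neighbour retrieval by cosine similarity, the core
# operation behind vector search. Vectors and IDs here are invented
# for illustration only.
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """Return the paragraph IDs of the k most similar chunks."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]

chunks = [
    ("4.1_p3", [0.9, 0.1, 0.0]),
    ("4.3.2.2_p2", [0.7, 0.6, 0.1]),
    ("4.5_p1", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], chunks))  # → ['4.1_p3', '4.3.2.2_p2']
```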
- Speed: Answers in 2-5 seconds
- Accuracy: Based on IPCC content only
- Memory: ~2GB RAM for full system
- Storage: ~500MB for all chapters
We welcome contributions! Here's how to help:
- 📝 Report bugs or suggest improvements
- 📚 Test the system with your research questions
- 📖 Improve documentation or write tutorials
- 🌍 Share with colleagues who might find it useful
- 🔧 Fix bugs or add features
- 🧪 Add tests to ensure quality
- 📦 Improve packaging or deployment
- 🚀 Optimize performance
See CONTRIBUTING.md for detailed guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- IPCC: For providing the climate science reports
- HuggingFace: For the language models and tools
- ChromaDB: For the vector database
- Streamlit: For the web interface framework
- Open Source Community: For all the amazing tools we build upon
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
- More IPCC Chapters: Add WG2 and WG3 reports
- Better Models: Upgrade to larger language models
- Multi-language: Support for non-English reports
- Collaborative Features: Share questions and answers
- Mobile App: iOS and Android versions
Made with ❤️ for climate science research
- Python 3.9 or higher (Python 3.12 recommended)
- Git
```bash
git clone https://github.com/semanticClimate/llmrag.git
cd llmrag
```
Windows (Command Prompt):

```bash
python -m venv venv
venv\Scripts\activate
```

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```

macOS/Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```
```bash
# Install the package in editable mode
pip install -e .

# Or just install the dependencies
pip install -r requirements.txt
```
```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
coverage run --source=llmrag -m pytest tests/
coverage report -m
```
```bash
python test_ipcc_ingestion.py
```
This will:
- Ingest the IPCC Chapter 4 HTML file with paragraph IDs
- Test the RAG pipeline with climate-related queries
- Show which paragraph IDs were used to generate answers
The system now supports ingesting HTML documents and tracking paragraph IDs for source attribution:
- HTML Splitter: Extracts text while preserving paragraph IDs from HTML elements
- RAG Pipeline: Returns paragraph IDs used in generating answers
- Test Script: `test_ipcc_ingestion.py` demonstrates the feature with IPCC content
```
Query: What are the main scenarios used in climate projections?
Answer: [Generated answer]
Paragraph IDs found: ['4.1_p3', '4.3.2.2_p2']
```
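The IDs above combine a section number with a paragraph index. In llmrag they are read from the source HTML, but the same scheme could be generated during chunking for documents that lack IDs; a hypothetical sketch:

```python
# Sketch: assign "section_pN"-style paragraph IDs during chunking.
# llmrag reads these IDs from the source HTML; this shows how the same
# scheme could be generated for documents that lack them.
def assign_ids(section, paragraphs):
    """Pair each paragraph with a '<section>_p<N>' identifier."""
    return [(f"{section}_p{i}", text) for i, text in enumerate(paragraphs, start=1)]

print(assign_ids("4.1", ["First paragraph.", "Second paragraph."]))
# → [('4.1_p1', 'First paragraph.'), ('4.1_p2', 'Second paragraph.')]
```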
If `pip install -e .` takes more than 10 minutes:

- Use the fast installation scripts:

  ```bash
  # Windows
  install_fast.bat
  # Unix/Linux/Mac
  ./install_fast.sh
  ```

- Try staged installation:

  ```bash
  pip install --upgrade pip setuptools wheel
  pip install pyyaml lxml pytest rich streamlit toml
  pip install chromadb langchain
  pip install --only-binary=:all: transformers sentence-transformers
  pip install -e .
  ```

- Use conda for heavy packages (Windows):

  ```bash
  conda install -c conda-forge transformers sentence-transformers
  pip install -e .
  ```
- Virtual Environment: Make sure to use the correct activation script for your shell
- Dependencies: If you encounter issues with `lxml` or `transformers`, try `pip install lxml transformers`
- DLL Errors: Ensure you have the latest Python and pip versions
- Visual Studio Build Tools: Install Visual Studio Build Tools 2019+ for compilation
- Python Version: We recommend Python 3.12. Some libraries may have compatibility issues with older versions
- Virtual Environment: ALWAYS USE A VIRTUAL ENVIRONMENT to avoid conflicts
- NumPy Conflicts: If you have NumPy in your global environment, it may cause issues. Use a clean virtual environment
- Network Issues: Large model downloads may time out; use `pip install --timeout 300` to allow more time
```
llmrag/
├── llmrag/                    # Main package
│   ├── chunking/              # Text splitting (including HTML)
│   ├── embeddings/            # Embedding models
│   ├── models/                # LLM models
│   ├── pipelines/             # RAG pipeline
│   └── retrievers/            # Vector stores
├── tests/                     # Test suite
│   └── ipcc/                  # IPCC test data
├── test_ipcc_ingestion.py     # IPCC ingestion test script
└── requirements.txt           # Dependencies
```
For chat history and development notes, see:
- `./project.md` - Project documentation
- `./all_code.py` - Development history (messy)
The test suite includes:
- Unit tests for all components
- Integration tests for the RAG pipeline
- HTML ingestion tests with paragraph ID tracking
- IPCC content tests
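A paragraph-ID tracking test might look like the sketch below; `extract_paragraphs` is a simplified stand-in, not the package's real splitter API:

```python
# Sketch: a unit test for paragraph-ID tracking. The extract_paragraphs
# helper is hypothetical; the real suite tests llmrag's own HTML splitter.
import re
import unittest

def extract_paragraphs(html):
    """Return (id, text) pairs from simple <p id="..."> elements."""
    return re.findall(r'<p id="([^"]+)">([^<]*)</p>', html)

class TestParagraphIdTracking(unittest.TestCase):
    def test_ids_are_preserved(self):
        html = '<p id="4.1_p1">Warming is unequivocal.</p>'
        self.assertEqual(extract_paragraphs(html),
                         [("4.1_p1", "Warming is unequivocal.")])

# Run the case programmatically so it works outside a test runner
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParagraphIdTracking)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```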
Run tests with:

```bash
python -m pytest tests/ -v
```

Expected result:

```
=========================================== 10 passed in 27.61s ============================================
```
```bash
git clone https://github.com/semanticClimate/llmrag/
cd llmrag
```

Then set up and activate a virtual environment (on Mac):

```bash
python3.12 -m venv venv
source venv/bin/activate
```

Install the dependencies and run the tests (should take about half a minute):

```bash
pip install -r requirements.txt
coverage run --source=llmrag -m unittest discover -s tests
```

Result:

```
..Device set to use cpu
..Retrieved: [('Paris is the capital of France.', 0.2878604531288147)]
.Retrieved: [('Paris is the capital of France.', 0.37026578187942505)]
.
----------------------------------------------------------------------
Ran 6 tests in 20.267s
OK
```

To print the coverage:

```bash
coverage report -m
```
We run on Python 3.12. This can cause problems with some libraries, such as NumPy. Although `numpy` is not currently a direct dependency of `llmrag`, it may be present in your environment. ALWAYS USE A VIRTUAL ENVIRONMENT. (PMR found behavioural differences in NumPy between Python 3.11 and 3.12.)