IPCC RAG System 🌍📚

Cite as: DOI

A Local Retrieval-Augmented Generation (RAG) System for IPCC Climate Reports

Python 3.9+ License: MIT PRs Welcome

🎯 What is this?

This system helps researchers, policymakers, and students quickly find and understand information from IPCC (Intergovernmental Panel on Climate Change) reports. Think of it as a smart research assistant that can:

  • 📖 Load IPCC chapters from your computer
  • 🔍 Answer questions about climate science
  • 📍 Show you exactly where information comes from (paragraph IDs)
  • 🤖 Run entirely on your computer (no internet needed after setup)

🚀 Quick Start (5 minutes)

1. Install Python

First, make sure you have Python 3.9 or higher installed:

Note: Python 3.12 is recommended for the best performance and library compatibility.

2. Download and Setup

Fast Installation (Recommended)

For Windows users experiencing slow installation:

# Download the project
git clone https://github.com/semanticClimate/llmrag.git
cd llmrag

# Use the optimized installation script
# Windows:
install_fast.bat

# Unix/Linux/Mac:
./install_fast.sh

Standard Installation

# Download the project
git clone https://github.com/semanticClimate/llmrag.git
cd llmrag

# Install prebuilt wheels only (avoids slow source builds)
pip install --upgrade pip setuptools wheel
pip install --only-binary=:all: -e .

# Or install required packages
pip install -r requirements.txt

3. Try it out!

# Start the web interface
streamlit run streamlit_app.py

# Or use the command line
python -m llmrag.cli list-chapters
python -m llmrag.cli ask "What are the main findings about temperature trends?" --chapter wg1/chapter02

📚 Learning Resources

For Beginners

For Researchers

For Developers

🎮 How to Use

Web Interface (Recommended for beginners)

  1. Run streamlit run streamlit_app.py
  2. Open your browser to http://localhost:8501
  3. Select a chapter and start asking questions!

Command Line (For power users)

# See available chapters
python -m llmrag.cli list-chapters

# Ask a question
python -m llmrag.cli ask "What causes global warming?" --chapter wg1/chapter02

# Interactive mode
python -m llmrag.cli interactive --chapter wg1/chapter02

Python Code (For developers)

from llmrag.chapter_rag import ask_chapter

# Ask a question about a chapter
result = ask_chapter(
    question="What are the main climate change impacts?",
    chapter_name="wg1/chapter02"
)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['paragraph_ids']}")

📁 What's Included

llmrag/
├── 📖 IPCC Chapters          # Climate report data
├── 🤖 RAG System            # Question answering engine
├── 🌐 Web Interface         # User-friendly browser app
├── 💻 Command Line Tools    # Power user interface
├── 🔧 Processing Pipeline   # Data preparation tools
└── 📊 Documentation         # Guides and tutorials

🛠️ System Components

Core RAG System

  • Document Loading: Processes IPCC HTML chapters
  • Text Chunking: Breaks documents into searchable pieces
  • Vector Search: Finds relevant information quickly
  • Answer Generation: Creates coherent responses
  • Source Tracking: Shows exactly where answers come from
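To make the flow concrete, here is a minimal, self-contained sketch of the chunk → retrieve → prompt stages. The keyword-overlap scoring is a toy stand-in for vector search; this is not the llmrag implementation, which uses ChromaDB and sentence-transformers embeddings:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def chunk(text: str, size: int = 80) -> list[str]:
    """Split text into roughly size-character chunks on sentence boundaries."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > size:
            chunks.append(current.strip())
            current = ""
        current += " " + s
    if current.strip():
        chunks.append(current.strip())
    return chunks

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the question (toy stand-in for embeddings)."""
    q = tokenize(question)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble retrieved context and the question into a prompt for the LLM."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"

doc = ("Global surface temperature has risen. Warming is driven by greenhouse gases. "
       "Ice sheets are losing mass. Sea level continues to rise.")
chunks = chunk(doc)
context = retrieve("What drives warming?", chunks)
print(build_prompt("What drives warming?", context))
```

In the real system, retrieval uses embedding similarity against ChromaDB and the assembled prompt is passed to the language model.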

User Interfaces

  • Streamlit Web App: Beautiful, interactive interface
  • Command Line: Fast, scriptable interface
  • Python API: For integration with other tools

Data Processing

  • HTML Cleaning: Removes formatting, keeps content
  • Paragraph IDs: Tracks information sources
  • Semantic Chunking: Keeps related information together
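As an illustration of how paragraph IDs can be preserved during HTML cleaning, here is a minimal sketch using only the standard library (the input is a hypothetical two-paragraph snippet; the actual system uses lxml and its own HTML splitter):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect (id, text) pairs from <p id="..."> elements."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []        # list of (paragraph_id, text)
        self._current_id = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_id = dict(attrs).get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current_id is not None:
            self.paragraphs.append((self._current_id, "".join(self._buffer).strip()))
            self._current_id = None

html = ('<p id="4.1_p3">Scenarios describe plausible futures.</p>'
        '<p id="4.3.2.2_p2">SSPs are used in AR6.</p>')
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)
```

Because each chunk keeps its source IDs, answers can cite exactly which paragraphs they came from.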

🔬 Technical Details

Models Used

  • Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
  • Language Model: GPT-2 Large (774M parameters)
  • Vector Database: ChromaDB (local storage)
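At retrieval time, the question and every chunk are embedded as vectors, and chunks are ranked by cosine similarity. A toy illustration with hand-made 3-dimensional vectors (real embeddings from all-MiniLM-L6-v2 are 384-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings"; in practice these come from the embedding model.
question_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "temperature trends": [0.8, 0.2, 0.1],
    "ocean acidification": [0.1, 0.9, 0.3],
}
best = max(chunk_vecs, key=lambda name: cosine(question_vec, chunk_vecs[name]))
print(best)  # the chunk whose vector points most in the question's direction
```

ChromaDB performs this nearest-neighbour search over all stored chunk vectors and returns the top matches.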

Performance

  • Speed: Answers in 2-5 seconds
  • Accuracy: Based on IPCC content only
  • Memory: ~2GB RAM for full system
  • Storage: ~500MB for all chapters

🤝 Contributing

We welcome contributions! Here's how to help:

For Non-Developers

  • 📝 Report bugs or suggest improvements
  • 📚 Test the system with your research questions
  • 📖 Improve documentation or write tutorials
  • 🌍 Share with colleagues who might find it useful

For Developers

  • 🔧 Fix bugs or add features
  • 🧪 Add tests to ensure quality
  • 📦 Improve packaging or deployment
  • 🚀 Optimize performance

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • IPCC: For providing the climate science reports
  • HuggingFace: For the language models and tools
  • ChromaDB: For the vector database
  • Streamlit: For the web interface framework
  • Open Source Community: For all the amazing tools we build upon

📞 Support

📈 Roadmap

  • More IPCC Chapters: Add WG2 and WG3 reports
  • Better Models: Upgrade to larger language models
  • Multi-language: Support for non-English reports
  • Collaborative Features: Share questions and answers
  • Mobile App: iOS and Android versions

Made with ❤️ for climate science research

🚀 Quickstart for Collaborators

Prerequisites

  • Python 3.9 or higher (Python 3.12 recommended)
  • Git

1. Clone the Repository

git clone https://github.com/semanticClimate/llmrag.git
cd llmrag

2. Set Up Virtual Environment

Windows (Command Prompt):

python -m venv venv
venv\Scripts\activate

Windows (PowerShell):

python -m venv venv
.\venv\Scripts\Activate.ps1

macOS/Linux:

python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -e .
pip install -r requirements.txt

4. Run Tests

# Run all tests
python -m pytest tests/ -v

# Run with coverage
coverage run --source=llmrag -m pytest tests/
coverage report -m

5. Test IPCC HTML Ingestion (New Feature)

python test_ipcc_ingestion.py

This will:

  • Ingest the IPCC Chapter 4 HTML file with paragraph IDs
  • Test the RAG pipeline with climate-related queries
  • Show which paragraph IDs were used to generate answers

🔧 Features

HTML Ingestion with Paragraph ID Tracking

The system now supports ingesting HTML documents and tracking paragraph IDs for source attribution:

  • HTML Splitter: Extracts text while preserving paragraph IDs from HTML elements
  • RAG Pipeline: Returns paragraph IDs used in generating answers
  • Test Script: test_ipcc_ingestion.py demonstrates the feature with IPCC content

Example Output

Query: What are the main scenarios used in climate projections?
Answer: [Generated answer]
Paragraph IDs found: ['4.1_p3', '4.3.2.2_p2']

🐛 Troubleshooting

Slow Installation Issues

If pip install -e . takes more than 10 minutes:

  1. Use the fast installation scripts:

    # Windows
    install_fast.bat
    
    # Unix/Linux/Mac
    ./install_fast.sh
  2. Try staged installation:

    pip install --upgrade pip setuptools wheel
    pip install pyyaml lxml pytest rich streamlit toml
    pip install chromadb langchain
    pip install --only-binary=:all: transformers sentence-transformers
    pip install -e .
  3. Use conda for heavy packages (Windows):

    conda install -c conda-forge transformers sentence-transformers
    pip install -e .

Windows-Specific Issues

  • Virtual Environment: Make sure to use the correct activation script for your shell
  • Dependencies: If you encounter issues with lxml or transformers, try:
    pip install lxml transformers
  • DLL Errors: Ensure you have the latest Python and pip versions
  • Visual Studio Build Tools: Install Visual Studio Build Tools 2019+ for compilation

General Issues

  • Python Version: We recommend Python 3.12. Some libraries may have compatibility issues with older versions
  • Virtual Environment: ALWAYS USE A VIRTUAL ENVIRONMENT to avoid conflicts
  • NumPy Conflicts: If you have NumPy in your global environment, it may cause issues. Use a clean virtual environment
  • Network Issues: Large model downloads may timeout. Use pip install --timeout 300 for longer timeouts

📁 Project Structure

llmrag/
├── llmrag/                    # Main package
│   ├── chunking/             # Text splitting (including HTML)
│   ├── embeddings/           # Embedding models
│   ├── models/               # LLM models
│   ├── pipelines/            # RAG pipeline
│   └── retrievers/           # Vector stores
├── tests/                    # Test suite
│   └── ipcc/                # IPCC test data
├── test_ipcc_ingestion.py   # IPCC ingestion test script
└── requirements.txt          # Dependencies

📝 Development

For chat history and development notes, see:

  • ./project.md - Project documentation
  • ./all_code.py - Development history (Messy)

🧪 Testing

The test suite includes:

  • Unit tests for all components
  • Integration tests for the RAG pipeline
  • HTML ingestion tests with paragraph ID tracking
  • IPCC content tests

Run tests with:

python -m pytest tests/ -v

Expected result:

=========================================== 10 passed in 27.61s ============================================

TEST

git clone https://github.com/semanticClimate/llmrag/
cd llmrag

Set up and activate a virtual environment (on macOS):

python3.12 -m venv venv
source venv/bin/activate

Install the dependencies and run the tests (the test run itself takes about half a minute):

pip install -r requirements.txt
coverage run --source=llmrag -m unittest discover -s tests
coverage report -m

Expected result:

..Device set to use cpu
..Retrieved: [('Paris is the capital of France.', 0.2878604531288147)]
.Retrieved: [('Paris is the capital of France.', 0.37026578187942505)]
.
----------------------------------------------------------------------
Ran 6 tests in 20.267s

OK

To print the coverage report:

coverage report -m

BUGS

We run on Python 3.12, which can cause problems with some libraries, such as NumPy. Although NumPy is not currently a direct dependency of llmrag, it may already be present in your environment. ALWAYS USE A VIRTUAL ENVIRONMENT (PMR found differences in NumPy behaviour between Python 3.11 and 3.12).

About

A template GitHub repository for running your own AI LLM RAG system for a literature-review project.
