Skip to content

semanticClimate/encyclopedia

Repository files navigation

Encyclopedia Project

A comprehensive toolset for extracting and analyzing keywords from scientific documents, with a focus on climate change research and IPCC reports.

Project Overview

This project consists of two main subprojects that work together to process scientific documents and extract meaningful insights:

  1. Keyword_extraction - AI-powered keyword extraction from text documents
  2. Dictionary - Structured storage and analysis of extracted keywords and document content

Subprojects

Keyword_extraction

A Python-based tool that uses state-of-the-art Natural Language Processing (NLP) models to extract the most important keywords and keyphrases from scientific text documents. Built with Hugging Face transformers and optimized for academic content.

Key Features:

  • AI-powered keyword extraction using pre-trained models
  • Support for multiple text processing methods (sentence-based, chunk-based)
  • Batch processing for large documents
  • CSV output format for easy analysis
  • Configurable top-N keyword extraction

Use Cases:

  • Academic paper analysis
  • Research document summarization
  • Content indexing and search
  • Literature review automation

Dictionary

A structured storage system for organizing extracted keywords, document content, and metadata. Currently contains processed IPCC Working Group 1 reports with extracted keywords and full text content.

Key Features:

  • Organized storage of document chapters
  • Keyword frequency analysis
  • HTML and plain text document versions
  • CSV exports for data analysis
  • Structured directory organization

Current Content:

  • IPCC WG1 Chapter 1: Introduction
  • IPCC WG1 Chapter 5: Global Carbon and Other Biogeochemical Cycles
  • IPCC WG1 Chapter 6: Short-lived Climate Forcers

Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Virtual environment (recommended)

Installation

# Clone the repository
git clone <repository-url>
cd encyclopedia

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies for keyword extraction
cd Keyword_extraction
pip install -r requirements.txt

Basic Usage

Extract Keywords from a Document

cd Keyword_extraction
python Keyword_extraction.py -i your_document.txt -s results/ -o keywords.csv -n 500

Parameters:

  • -i: Input text file path
  • -s: Output directory for results
  • -o: Output CSV filename
  • -n: Number of top keywords to extract

View Extracted Keywords

# Navigate to Dictionary directory to view processed content
cd Dictionary/ipcc_wg1/ipcc_wg1_ch1
# View top keywords
cat top_keywords_only.txt
# Or open CSV file in Excel/Google Sheets
open top_keywords.csv

Project Structure

encyclopedia/
├── README.md                    # This file
├── Keyword_extraction/          # Keyword extraction tool
│   ├── README.md               # Subproject documentation
│   ├── Keyword_extraction.py   # Main extraction script
│   ├── requirements.txt        # Python dependencies
│   └── workflow.md            # Usage workflow
├── Dictionary/                  # Document storage and analysis
│   ├── README.md               # Subproject documentation
│   └── ipcc_wg1/              # IPCC Working Group 1 content
│       ├── ipcc_wg1_ch1/      # Chapter 1 content and keywords
│       ├── ipcc_wg1_ch5/      # Chapter 5 content and keywords
│       └── ipcc_wg1_ch6/      # Chapter 6 content and keywords
└── LICENSE                     # Project license

Technology Stack

  • Python: Core programming language
  • Transformers: Hugging Face NLP models for keyword extraction
  • BeautifulSoup: HTML parsing and processing
  • Pandas: Data manipulation and CSV export
  • PyTorch: Deep learning backend for NLP models

Contributing

This project follows established style guidelines:

  • Use absolute imports with module prefixes
  • Keep __init__.py files empty unless explicitly agreed
  • Follow established naming conventions (alphanumeric + underscores only)
  • Always propose changes before implementation
  • Work in small, testable steps

License

See LICENSE file for details.

Support

For questions or issues:

  1. Check the subproject-specific README files
  2. Review the workflow documentation in Keyword_extraction/workflow.md
  3. Examine existing examples in the Dictionary directory

Development Notes

  • All output files are stored in designated directories to maintain project structure
  • The project follows climate change research examples for demonstrations
  • Keywords are extracted using the ml6team/keyphrase-extraction-kbir-inspec model
  • Document processing supports both HTML and plain text formats

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •