Archive HR Newsletters

Several tools for email archiving at CCA.

Apps Script

Google Apps Script to search a Gmail inbox for particular emails and save them to a Drive folder. See apps_script/readme.md for details.

Entity Extractor

Python CLI tool that extracts named entities (people, organizations, locations) from emails using spaCy NER. To work on the emails, first download the ones the Apps Script saved to Drive to your local machine. The tool processes EML (preferred), HTML, and PDF files and outputs structured JSON with entity information. Entities can optionally be linked to Wikidata for enrichment. See entity_extractor/readme.md for details.
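
As a rough illustration of the extraction step described above, here is a minimal sketch, not the tool's actual code. The helper names `eml_to_text` and `extract_entities` and the output shape are hypothetical; the sketch assumes `nlp` is a loaded spaCy pipeline such as `spacy.load("en_core_web_sm")` and that the labels of interest are PERSON, ORG, GPE, and LOC.

```python
from email import message_from_string
from email.policy import default

# Assumed label set for people, organizations, and locations.
WANTED_LABELS = {"PERSON", "ORG", "GPE", "LOC"}


def eml_to_text(raw: str) -> str:
    """Return the body of an EML message, preferring plain text over HTML.

    Hypothetical helper; the real tool may parse messages differently.
    """
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""


def extract_entities(text: str, nlp) -> list[dict]:
    """Run a spaCy pipeline over text and keep only the wanted entity labels.

    `nlp` is expected to be a loaded pipeline, for example
    spacy.load("en_core_web_sm"). The dict shape is an assumption.
    """
    doc = nlp(text)
    return [
        {"text": ent.text, "label": ent.label_}
        for ent in doc.ents
        if ent.label_ in WANTED_LABELS
    ]
```

With spaCy and the model installed, `extract_entities(eml_to_text(raw), spacy.load("en_core_web_sm"))` would return a list of dictionaries that can be written out with `json.dumps`.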

Setup

# Install dependencies & spaCy model
uv sync
uv run spacy download en_core_web_sm
# Extract entities from emails
uv run extract-entities extract data/
# Compile all entities into a single CSV
uv run extract-entities compile data/
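
For readers curious what a compile step like the one above might involve, the following is a hedged sketch rather than the tool's implementation. It assumes each per-email JSON file holds a list of `{"text": ..., "label": ...}` objects; that schema, the function name `compile_entities`, and the CSV columns are all hypothetical.

```python
import csv
import json
from collections import Counter
from pathlib import Path


def compile_entities(data_dir: str, out_csv: str) -> int:
    """Merge per-email JSON files into one CSV of (text, label, count) rows.

    Returns the number of distinct (text, label) pairs written.
    """
    counts = Counter()
    for path in sorted(Path(data_dir).glob("*.json")):
        # Assumed schema: a JSON list of {"text": ..., "label": ...} objects.
        for ent in json.loads(path.read_text(encoding="utf-8")):
            counts[(ent["text"], ent["label"])] += 1

    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["text", "label", "count"])
        # Most frequent entities first.
        for (text, label), n in counts.most_common():
            writer.writerow([text, label, n])
    return len(counts)
```

Counting identical (text, label) pairs across files gives a quick view of which names recur most often in the archive.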

License

ECL-2.0