Extraction of Zurich death register entries using OCR + layout analysis + LLM post-processing.
Project Background
This project operationalizes the workflow described in the poster "Informationsextraktion serieller Quellen: Die Zürcher Sterberegister (1876–1925)". The goal is to convert scanned serial sources (e.g., civil and church registers) into structured tables without manual correction by combining layout analysis, automatic text recognition (ATR), and large language models (LLMs). The project evaluates closed (Transkribus, OpenAI) and open (YOLO, TrOCR) toolchains to deliver a reproducible pipeline for historical sources.
Poster Workflow Summary
- Layout and line segmentation to isolate entries and separate name vs. content areas.
- ATR for handwritten and printed text (Transkribus and TrOCR models).
- LLM-based reconstruction and extraction to normalize order and identify fields via keyword windows.
- Validation and correction to quantify errors and improve models or apply rule-based fixes.
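The keyword-window idea from the extraction step can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the function names, the sample entry text, and the start/stop keywords are invented for the example, and real presets depend on the Jahrgang.

```python
import re

def extract_between(text, start_kw, stop_kw=None):
    """Return the text after start_kw and before stop_kw (case-insensitive).

    Returns None when start_kw does not occur. If stop_kw is None or
    does not occur, everything after start_kw is returned.
    """
    start = re.search(re.escape(start_kw), text, flags=re.IGNORECASE)
    if start is None:
        return None
    rest = text[start.end():]
    if stop_kw:
        stop = re.search(re.escape(stop_kw), rest, flags=re.IGNORECASE)
        if stop is not None:
            rest = rest[:stop.start()]
    # Collapse line breaks and repeated spaces left over from OCR lines.
    return " ".join(rest.split())

def extract_fields(text, windows):
    """Apply one (start, stop) keyword window per output field."""
    return {field: extract_between(text, start, stop)
            for field, (start, stop) in windows.items()}

# Invented sample entry and hypothetical keyword windows.
sample = "Gestorben in Zürich\nwohnhaft in Aussersihl\nBeruf Schreiner"
windows = {
    "Todesort": ("gestorben in", "wohnhaft in"),
    "Wohnort": ("wohnhaft in", "Beruf"),
    "Beruf": ("Beruf", None),
}
fields = extract_fields(sample, windows)
print(fields)
```

Plain substring matching like this is brittle when OCR garbles a keyword, which is one reason the workflow pairs it with LLM-based reconstruction and a later validation step.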
Poster Figures (From Project Documentation)
Figure 1: Document with layout regions marked (Transkribus screenshot, Fields model, Zurich 1876 register).
Figure 2: Layout/line segmentation + OCR output showing disrupted line order (Transkribus screenshot).
Figure 3: CSV output after segmentation, OCR, extraction, and post-processing (VSC screenshot).
What This Repo Does
This repo contains several standalone scripts that form a loose pipeline for:
- Segmenting page images/PDFs into regions and lines (YOLO).
- Performing OCR on line crops (TrOCR).
- Converting OCR output into PAGE-XML for downstream tools.
- Extracting structured fields from PAGE-XML and/or raw CSV via LLMs.
The scripts are currently configured with hardcoded local paths and hardcoded API keys. You will need to change those constants before running.
Scripts Overview
- regions_lines_trocr.py
  - Runs YOLO text-region + text-line segmentation and TrOCR.
  - Inputs: images/PDFs in IMAGE_FOLDER.
  - Outputs: overlays, per-page JSON with polygons/lines, and regions_ocr.csv (semicolon-delimited).
  - CSV columns: page, region_id, x1, y1, x2, y2, text, where text is all line texts joined by newlines.
- csv_to_pagexml.py
  - Converts regions_ocr.csv to PAGE-XML.
  - Expects exactly 4 regions per page and assigns roles by left/right + top/bottom ordering: IDField_1, ContentField_1, IDField_2, ContentField_2.
  - Writes PAGE-XML per page plus a skip log for pages not matching 4 regions.
  - Note: imageWidth/imageHeight are placeholders (2000x3000).
- transkr_xml_gui_YOLO_V1.py
  - Tkinter GUI to extract segments from PAGE-XML using keyword pairs and optional GPT-based normalization.
  - Supports both Transkribus-style XML (TextLine/Unicode) and YOLO+TrOCR XML (TextEquiv/Unicode).
  - Presets are defined for multiple Jahrgänge with start/stop keywords and tag labels.
  - Presets can be saved to and loaded from keywords_tags_storage.json.
  - The GPT workflow uses predefined_texts_text1 as a template and fills in info from extracted text.
  - Output: CSV with Datei, XML_Block_Index, and one column per tag for the chosen Jahrgang.
  - The “Transkribus Download” UI options are currently not wired to any download function.
- csv_gpt-oss.py
  - Reads a CSV and uses a local OpenAI-compatible API (Ollama) to extract structured fields: death place + causes, name/occupation/family, and residence/birthdate.
  - Adds new columns like Todesort, Todesursachen, Name, Beruf, Wohnort, Geburtsdatum, etc.
  - This file executes a test run at import time (bottom of file), so adjust paths before running.
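Reading text from both XML flavours can be sketched with the standard library. This is an illustration under assumptions, not the GUI's actual code: it assumes the Transkribus-style files carry text per TextLine (TextLine/TextEquiv/Unicode) and the YOLO+TrOCR files carry it once per region (TextRegion/TextEquiv/Unicode). The helper names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def local_name(elem):
    """Tag name without the '{namespace}' prefix."""
    return elem.tag.rsplit("}", 1)[-1]

def children(elem, name):
    """Direct children of elem with the given local tag name."""
    return [c for c in elem if local_name(c) == name]

def own_text(elem):
    """Text of the element's direct TextEquiv/Unicode child, or ''."""
    for equiv in children(elem, "TextEquiv"):
        for uni in children(equiv, "Unicode"):
            if uni.text:
                return uni.text.strip()
    return ""

def region_texts(xml_string):
    """Map each TextRegion id to its text.

    Line-level text is preferred; if a region has no line text,
    the region-level TextEquiv is used instead.
    """
    root = ET.fromstring(xml_string)
    result = {}
    for region in root.iter():
        if local_name(region) != "TextRegion":
            continue
        lines = [own_text(line) for line in children(region, "TextLine")]
        lines = [t for t in lines if t]
        result[region.get("id")] = "\n".join(lines) if lines else own_text(region)
    return result

# Invented sample documents using the PAGE 2013-07-15 namespace.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
line_level = f"""<PcGts xmlns="{NS}"><Page><TextRegion id="r1">
  <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
  <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
</TextRegion></Page></PcGts>"""
region_level = f"""<PcGts xmlns="{NS}"><Page><TextRegion id="ContentField_1">
  <TextEquiv><Unicode>first line
second line</Unicode></TextEquiv>
</TextRegion></Page></PcGts>"""

print(region_texts(line_level))
print(region_texts(region_level))
```

Matching on the local tag name keeps the sketch independent of which PAGE schema version a file declares.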
Pipeline (Typical Usage)
- Run regions_lines_trocr.py to generate regions_ocr.csv, overlays, and JSON.
- Run csv_to_pagexml.py to convert the CSV into PAGE-XML.
- Use transkr_xml_gui_YOLO_V1.py to extract structured fields from PAGE-XML.
- Optionally run csv_gpt-oss.py to enrich CSV columns with LLM-extracted fields.
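Between the first two steps it can help to inspect regions_ocr.csv directly. The sketch below reads a semicolon-delimited file with the documented columns and groups rows by page. It assumes the file has a header row with exactly those column names and that multi-line text values are quoted; the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

def load_regions(handle):
    """Group rows of a semicolon-delimited regions CSV by page.

    Assumes a header row: page;region_id;x1;y1;x2;y2;text
    """
    pages = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=";"):
        # Coordinates arrive as strings; convert them for later sorting.
        for key in ("x1", "y1", "x2", "y2"):
            row[key] = float(row[key])
        pages[row["page"]].append(row)
    return dict(pages)

# Invented sample data; the quoted text field contains a line break.
sample_csv = (
    "page;region_id;x1;y1;x2;y2;text\n"
    'page_001;0;100;200;400;900;"Meier\nAnna"\n'
    "page_001;1;450;200;1900;900;gestorben in Zürich\n"
    "page_002;0;100;200;400;900;Keller\n"
)
pages = load_regions(io.StringIO(sample_csv))
for page, rows in pages.items():
    print(page, len(rows), "regions")
```

For a real file, pass open(path, newline="", encoding="utf-8") instead of the StringIO object; the encoding is an assumption and may need adjusting.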
Configuration You Must Edit
- regions_lines_trocr.py
  - REGION_MODEL_PATH, LINE_MODEL_PATH
  - IMAGE_FOLDER, OUTPUT_DIR
  - PDF_DPI, SAVE_RENDERED_PDF_PAGES
- csv_to_pagexml.py
  - CSV_PATH, OUTPUT_DIR
- transkr_xml_gui_YOLO_V1.py
  - base_url, api_key in OpenAI client
  - DEST_DIR
- csv_gpt-oss.py
  - base_url and model
  - input CSV paths in the “testing” block at the bottom
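One way to avoid editing constants in each script is to read them from environment variables. The sketch below is a suggestion, not something the scripts currently do: the REGISTER_* variable names and all default values are hypothetical, and the default base URL is Ollama's usual local OpenAI-compatible endpoint.

```python
import os
from pathlib import Path

def load_settings(env=None):
    """Build a settings dict from environment variables.

    All REGISTER_* names and the defaults are hypothetical.
    The API key has no default so it never lands in the source code.
    """
    env = os.environ if env is None else env
    api_key = env.get("REGISTER_API_KEY")
    if not api_key:
        raise RuntimeError("Set REGISTER_API_KEY instead of hardcoding the key")
    return {
        "IMAGE_FOLDER": Path(env.get("REGISTER_IMAGE_FOLDER", "images")),
        "OUTPUT_DIR": Path(env.get("REGISTER_OUTPUT_DIR", "output")),
        "PDF_DPI": int(env.get("REGISTER_PDF_DPI", "300")),
        "base_url": env.get("REGISTER_BASE_URL", "http://localhost:11434/v1"),
        "api_key": api_key,
    }

# Example with an explicit mapping instead of the real environment.
settings = load_settings({"REGISTER_API_KEY": "dummy", "REGISTER_PDF_DPI": "200"})
print(settings["PDF_DPI"], settings["OUTPUT_DIR"])
```

Using pathlib.Path for folders also sidesteps the Windows-style path separators when the scripts run on another operating system.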
Dependencies
Install the Python packages used by each script:
- Core: pandas, tqdm, openai, requests
- OCR/vision: ultralytics, opencv-python, numpy, Pillow, torch, transformers
- PDF rendering: pymupdf (imported as fitz)
- XML: lxml
- GUI: tkinter (usually bundled with Python)
Notes / Caveats
- csv_to_pagexml.py assumes 4 regions per page. All other pages are skipped and logged.
- regions_lines_trocr.py uses YOLO masks; if no lines are detected, it still emits the region with empty text.
- csv_gpt-oss.py and transkr_xml_gui_YOLO_V1.py include hardcoded API keys. Move these to environment variables before sharing the repo.
- Most paths are Windows-style and must be updated to your local environment.
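The 4-region assumption and the skip log can be illustrated as follows. This is one possible reading of "left/right + top/bottom ordering", not the script's confirmed logic: it assumes each page holds two entries stacked vertically, each with an ID field on the left and a content field on the right. The function names and box coordinates are invented.

```python
ROLES = ["IDField_1", "ContentField_1", "IDField_2", "ContentField_2"]

def assign_roles(regions):
    """Map the four roles to regions, or return None if the count is not 4.

    Assumed layout: two rows (top entry, bottom entry); within a row the
    left box is the ID field and the right box is the content field.
    """
    if len(regions) != 4:
        return None
    # Sort by vertical centre, then split into top pair and bottom pair.
    by_row = sorted(regions, key=lambda r: (r["y1"] + r["y2"]) / 2)
    top, bottom = by_row[:2], by_row[2:]
    ordered = (sorted(top, key=lambda r: r["x1"])
               + sorted(bottom, key=lambda r: r["x1"]))
    return dict(zip(ROLES, ordered))

def process_pages(pages):
    """Return (assigned, skipped): role maps per page and a skip log."""
    assigned, skipped = {}, []
    for page, regions in pages.items():
        roles = assign_roles(regions)
        if roles is None:
            skipped.append((page, len(regions)))
        else:
            assigned[page] = roles
    return assigned, skipped

def box(name, x1, y1, x2, y2):
    return {"name": name, "x1": x1, "y1": y1, "x2": x2, "y2": y2}

# Invented boxes, deliberately out of order.
demo_pages = {
    "page_001": [box("D", 450, 1020, 1900, 1680), box("A", 100, 200, 400, 900),
                 box("C", 100, 1000, 400, 1700), box("B", 450, 220, 1900, 880)],
    "page_002": [box("A", 100, 200, 400, 900)],
}
assigned, skipped = process_pages(demo_pages)
print({role: r["name"] for role, r in assigned["page_001"].items()})
print(skipped)
```

Splitting on the vertical centre breaks down if the two entries overlap heavily in height, which is a reasonable case to send to the skip log as well.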