hipe-eval/HIPE-OCRepair-2026-data

License: AGPL-3.0

HIPE-OCRepair-2026-data

Data for the HIPE OCRepair-2026 shared task (ICDAR 2026 Competition)

HIPE OCRepair 2026 - Impresso-OCR-Bench

HIPE-OCRepair-2026 is an ICDAR 2026 Competition focused on LLM-assisted OCR post-correction of historical documents, with a particular, though not exclusive, emphasis on historical newspapers.

With renewed interest driven by large language models (LLMs), OCR post-correction has (re)gained momentum, resulting in a growing number of models and experimental approaches. However, these efforts often rely on heterogeneous legacy datasets that come with important limitations, making systematic evaluation and meaningful comparison across approaches difficult.

A central question motivating this competition is:

To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?

The HIPE OCRepair competition aims to support the development and assessment of new models and methods in this area by evaluating LLM-based OCR post-correction approaches on Impresso-OCR-Bench, which provides:

  • a curated and unified multilingual ground truth dataset for OCR post-correction on historical documents;
  • a standardised evaluation protocol to ensure comparability and reproducibility.

Key information

  • 💻 Visit the website for general information on the shared task and registration.
  • 📓 Read the Participation Guidelines for detailed information about the tasks, datasets and evaluation. The guidelines will be published in the first two weeks of January 2026.
  • Where to find the data:
  • Release history:
    • 19.12.2025: Release of data sample.
    • 12.01.2026: Release of JSON schema for input data and predictions.
    • 19.01.2026: Release of training and development data + scorer + JSON schema.
    • 31.03.2026: Masked test data release, start of evaluation phase.
    • 08.04.2026: Publication of results and unmasked test data release.

HIPE-2026 data

Coming early January 2026: More detailed information on all sections below.

In the meantime, please consult the dataset page on the competition website.

Contents and preparation

Format and data representation

Directory structure and naming convention

Versioning

Dataset statistics

About and Acknowledgements

The HIPE-OCRepair-2026 organising team expresses its sincere appreciation to the ICDAR-2026 Competition Committee for the overall coordination and support.

HIPE-OCRepair-2026 is part of the HIPE-eval series of shared tasks on historical document and information processing and evaluation.

HIPE-eval editions are organised within the framework of the Impresso - Media Monitoring of the Past project, funded by the Swiss National Science Foundation under grant No. CRSII5_213585 and by the Luxembourg National Research Fund under grant No. 17498891.
