Explicit file name convention missing for train/dev/test files and resolution of file name inconsistency #9

@simon-clematide

@e-maud We must specify exactly what the files of the submission have to look like with respect to dataset versions.

In this repository and in the guidelines, we refer to source datasets by name only, without dataset version tags. For the evaluation, and for submissions as well, it is crucial to know whether we always mean a specific version of a dataset.

In the participation guidelines we have:

hipe-ocrepair-bench_<version>_<dataset>_<split>_<language>.jsonl

and the datasets are described at:
https://github.com/hipe-eval/HIPE-OCRepair-2026-data/blob/main/README-Participation-Guidelines.md#32-dataset-descriptions

dataset := icdar2017 | overproof | impresso-nzz | dta19 | impresso-snippets

But the actual filenames include a dataset version, which is not specified properly in the guidelines. For the benchmark evaluation repository to work well, and for participants to know the exact filenames to deliver to us, the specification should be as exact as possible.

├── dta19
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train-unmatched_de.jsonl
├── icdar2017
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_en.jsonl
│   └── fr
│       └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_fr.jsonl
├── impresso-nzz
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_test_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_train_de.jsonl
├── impresso-snippets
│   ├── de
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_de.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_de.jsonl
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_en.jsonl
│   └── fr
│       ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_fr.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_fr.jsonl
└── overproof
    └── en
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_dev_en.jsonl
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_test_en.jsonl
        └── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_train_en.jsonl

14 directories, 26 files

and the dataset identifiers, including their version tags, are currently:

dataset := icdar2017_v1.1 | overproof_v1.0 | impresso-nzz_v1.1 | dta19-l0_v0.1 | dta19-l1_v0.1 | dta19-l2_v0.1 | impresso-snippets_v1.0
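
To make the proposal concrete, here is a minimal sketch of how a submission checker could validate file names against a versioned convention. It is not the evaluation repository's actual code: the names `DATASETS`, `SPLITS`, `LANGUAGES`, `FILENAME_RE` and `parse_filename` are hypothetical, the split and language values are simply read off the directory listing above, and the benchmark version is assumed to have the form `v<major>.<minor>`.

```python
import re

# Assumption: versioned dataset identifiers, copied from the list above.
DATASETS = (
    "icdar2017_v1.1",
    "overproof_v1.0",
    "impresso-nzz_v1.1",
    "dta19-l0_v0.1",
    "dta19-l1_v0.1",
    "dta19-l2_v0.1",
    "impresso-snippets_v1.0",
)
# Assumption: splits and languages as they occur in the directory listing.
SPLITS = ("train", "dev", "test", "train-unmatched", "dev-unmatched")
LANGUAGES = ("de", "en", "fr")

# hipe-ocrepair-bench_<version>_<dataset>_<split>_<language>.jsonl
FILENAME_RE = re.compile(
    r"^hipe-ocrepair-bench_(?P<version>v\d+\.\d+)_"
    r"(?P<dataset>" + "|".join(re.escape(d) for d in DATASETS) + r")_"
    r"(?P<split>" + "|".join(re.escape(s) for s in SPLITS) + r")_"
    r"(?P<language>" + "|".join(LANGUAGES) + r")\.jsonl$"
)


def parse_filename(name):
    """Return the filename components as a dict, or None if the name is invalid."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None


# A name with a dataset version tag is accepted.
print(parse_filename("hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl"))
# A name without the dataset version tag is rejected.
print(parse_filename("hipe-ocrepair-bench_v0.9_icdar2017_dev_en.jsonl"))
# The overproof files in the listing use "overproof-combined", so they are rejected too.
print(parse_filename("hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_dev_en.jsonl"))
```

With the identifiers exactly as listed, the overproof files from the directory listing (`overproof-combined_v1.0`) would not validate. Whether the identifier should be `overproof` or `overproof-combined` is left open in this sketch.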
