Explicit file name convention missing for train/dev/test files and resolution of file name inconsistency #9
Description
@e-maud We must specify exactly what the files of the submission have to look like with respect to dataset versions.
In this repository and in the guidelines, we refer to source datasets by name only, without dataset version tags. For the evaluation, it is crucial to know whether a name always refers to a specific version of a dataset. The same holds for submissions.
In the participation guidelines we have:
hipe-ocrepair-bench_<version>_<dataset>_<split>_<language>.jsonl
and the datasets are described as:
https://github.com/hipe-eval/HIPE-OCRepair-2026-data/blob/main/README-Participation-Guidelines.md#32-dataset-descriptions
dataset := icdar2017 | overproof | impresso-nzz | dta19 | impresso-snippets
But the actual filenames include a dataset version, which is not specified properly in the guidelines. For the benchmark evaluation repository to work well, and for participants to know exactly which filenames to deliver to us, the specification should be as exact as possible.
├── dta19
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train-unmatched_de.jsonl
├── icdar2017
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_en.jsonl
│   └── fr
│       └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_fr.jsonl
├── impresso-nzz
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_test_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_train_de.jsonl
├── impresso-snippets
│   ├── de
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_de.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_de.jsonl
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_en.jsonl
│   └── fr
│       ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_fr.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_fr.jsonl
└── overproof
    └── en
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_dev_en.jsonl
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_test_en.jsonl
        └── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_train_en.jsonl
14 directories, 26 files
and the dataset identifiers, including their version tags, are currently:
dataset := icdar2017_v1.1 | overproof_v1.0 | impresso-nzz_v1.1 | dta19-l0_v0.1 | dta19-l1_v0.1 | dta19-l2_v0.1 | impresso-snippets_v1.0
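To make the gap concrete, here is a minimal sketch of how an evaluation script might parse the filenames listed in this issue. It is not an agreed convention. The pattern is inferred from the file listing only, and the field names (`bench_version`, `dataset_version`, `unmatched`) as well as the function `parse_filename` are hypothetical. It also assumes that `-unmatched` is an optional suffix of the split and that language codes are two lowercase letters.

```python
import re

# Hypothetical pattern inferred from the file listing in this issue;
# NOT an official specification of the naming convention.
FILENAME_RE = re.compile(
    r"^hipe-ocrepair-bench_"
    r"(?P<bench_version>v\d+(?:\.\d+)*)_"       # e.g. v0.9
    r"(?P<dataset>[a-z0-9]+(?:-[a-z0-9]+)*)_"   # e.g. dta19-l0, impresso-nzz
    r"(?P<dataset_version>v\d+(?:\.\d+)*)_"     # e.g. v0.1
    r"(?P<split>train|dev|test)"                # split name
    r"(?P<unmatched>-unmatched)?_"              # assumed optional split suffix
    r"(?P<language>[a-z]{2})"                   # assumed two-letter code
    r"\.jsonl$"
)


def parse_filename(name):
    """Return the filename components as a dict, or None if the name
    does not follow the (assumed) versioned pattern."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn the optional suffix into a boolean flag.
    parts["unmatched"] = parts["unmatched"] is not None
    return parts


if __name__ == "__main__":
    print(parse_filename(
        "hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl"
    ))
```

A name without a dataset version, such as `hipe-ocrepair-bench_v0.9_icdar2017_dev_en.jsonl`, is rejected by this pattern (the function returns `None`), which is the kind of mismatch between the guidelines and the actual files described above.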