Explicit file name convention missing for train/dev/test files and resolution of file name inconsistency

@e-maud We must specify exactly what the submission files have to look like with respect to dataset versions.

In this repository and the guidelines, we only refer to source dataset names without the dataset version tags. For the evaluation, and also for submission, it is crucial to know whether we are always referring to a specific version of a dataset.

In the participation guidelines we have:

`hipe-ocrepair-bench_<version>_<dataset>_<split>_<language>.jsonl`

and the datasets are described at:
https://github.com/hipe-eval/HIPE-OCRepair-2026-data/blob/main/README-Participation-Guidelines.md#32-dataset-descriptions

`dataset := icdar2017 | overproof | impresso-nzz | dta19 | impresso-snippets`

But the actual filenames include a **dataset version**, which is not properly specified in the guidelines. For the benchmark evaluation repository to work reliably, and for participants to know the exact filenames they must deliver to us, the convention should be as exact as possible.

```
├── dta19
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l1_v0.1_train-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_dev-unmatched_de.jsonl
│       ├── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_dta19-l2_v0.1_train-unmatched_de.jsonl
├── icdar2017
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_en.jsonl
│   └── fr
│       └── hipe-ocrepair-bench_v0.9_icdar2017_v1.1_train_fr.jsonl
├── impresso-nzz
│   └── de
│       ├── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_test_de.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-nzz_v1.1_train_de.jsonl
├── impresso-snippets
│   ├── de
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_de.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_de.jsonl
│   ├── en
│   │   ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_en.jsonl
│   │   └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_en.jsonl
│   └── fr
│       ├── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_dev_fr.jsonl
│       └── hipe-ocrepair-bench_v0.9_impresso-snippets_v1.0_train_fr.jsonl
└── overproof
    └── en
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_dev_en.jsonl
        ├── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_test_en.jsonl
        └── hipe-ocrepair-bench_v0.9_overproof-combined_v1.0_train_en.jsonl

14 directories, 26 files
```

and the datasets are currently:

`dataset := icdar2017_v1.1 | overproof_v1.0 | impresso-nzz_v1.1 | dta19-l0_v0.1 | dta19-l1_v0.1 | dta19-l2_v0.1 | impresso-snippets_v1.0`
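To make the discussion concrete, here is a minimal sketch of how a versioned filename could be validated and split into its parts. It is only an illustration, not the evaluation repository's code: the names `FILENAME_RE`, `ParsedName` and `parse_filename` are hypothetical, and the pattern assumes that versions always look like `v<major>.<minor>`, that dataset names are lowercase tokens joined by hyphens, and that the only split suffix is `-unmatched`, as in the listing above.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (derived from the file listing, not from an official spec):
# hipe-ocrepair-bench_<bench_version>_<dataset>_<dataset_version>_<split>_<language>.jsonl
FILENAME_RE = re.compile(
    r"^hipe-ocrepair-bench_"
    r"(?P<bench_version>v\d+\.\d+)_"
    r"(?P<dataset>[a-z0-9]+(?:-[a-z0-9]+)*)_"
    r"(?P<dataset_version>v\d+\.\d+)_"
    r"(?P<split>(?:train|dev|test)(?:-unmatched)?)_"
    r"(?P<language>[a-z]{2})\.jsonl$"
)


class ParsedName(NamedTuple):
    bench_version: str
    dataset: str
    dataset_version: str
    split: str
    language: str


def parse_filename(name: str) -> Optional[ParsedName]:
    """Return the filename components, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return ParsedName(**match.groupdict())


if __name__ == "__main__":
    print(parse_filename("hipe-ocrepair-bench_v0.9_dta19-l0_v0.1_dev-unmatched_de.jsonl"))
    # A name following the current guideline pattern (no dataset version) is rejected:
    print(parse_filename("hipe-ocrepair-bench_v0.9_icdar2017_dev_en.jsonl"))
```

With a check like this, a name that omits the dataset version is rejected instead of being silently accepted, which is the ambiguity this issue is about.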



