Split humanities_data_benchmark into three repositories

Proposal: split the repository into three to improve manageability and make room for future development.

- **REPO 1: (this one) Benchmarks/Datasets**
  - Contains benchmark definitions only:
    - ground truths, prompts, context data (images/texts)
    - benchmark-local Python logic (`benchmark.py`)
    - response schemas (`dataclass.py`)
    - namespaced `meta.json` (`benchmark` / `evaluation` / `presentation`)
  - Benchmarks subclass a stable Benchmark API and must not import system code
  - Versioned per benchmark; each version is immutable and snapshotted to Zenodo

- **REPO 2: Benchmark system**
  - Contains all execution and infrastructure logic:
    - runner, CLI, loaders
    - scoring orchestration and result collection
    - benchmark API implementation
    - Django + MongoDB + Plotly visualization frontend
  - Treats benchmarks as external, versioned inputs (local path, git ref, or Zenodo DOI)
  - Reads only:
    - `benchmark.*` and `evaluation.*` metadata for execution
    - `presentation.*` metadata for first-party frontend rendering

- **REPO 3: Results (raw and collected)**
  - Stores curated system runs:
    - raw results
    - aggregated / collected results
    - figures and tables
  - Manual Zenodo releases after a full evaluation cycle (e.g. multi-day runs)
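
One way REPO 2 could treat benchmarks as external, versioned inputs is to classify a reference string before loading anything. The sketch below is illustrative only: `BenchmarkSource`, `parse_source`, and the `doi:` and `git+<url>@<ref>` prefixes are hypothetical and not an agreed format.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class BenchmarkSource:
    """Hypothetical description of where a benchmark version comes from."""
    kind: str                  # "local", "git", or "zenodo"
    location: str              # filesystem path, repository URL, or DOI
    ref: Optional[str] = None  # git ref (tag, branch, commit); None otherwise


def parse_source(spec: str) -> BenchmarkSource:
    """Classify a benchmark reference string (assumed syntax, see above)."""
    if spec.startswith("doi:"):
        return BenchmarkSource("zenodo", spec[len("doi:"):])
    if spec.startswith("git+"):
        # Split on the last "@" so that the ref is everything after it.
        url, sep, ref = spec[len("git+"):].rpartition("@")
        if not sep or not ref:
            raise ValueError(f"git source needs an explicit ref: {spec!r}")
        return BenchmarkSource("git", url, ref)
    # Anything else is treated as a local path.
    return BenchmarkSource("local", str(Path(spec)))
```

Requiring an explicit ref for git sources keeps a run tied to one benchmark version rather than to whatever a branch currently points at.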

 

**Metadata contract**
- `meta.json` is namespaced:
  - `benchmark`: identity, description, contributors (scientific, immutable)
  - `evaluation`: primary metric, ranking semantics (affects comparability)
  - `presentation`: display and visualization hints (frontend only)
- Only changes to `benchmark` or `evaluation` require a benchmark version bump
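
The version-bump rule can be checked mechanically by comparing the versioned namespaces of two `meta.json` files. This is a minimal sketch under assumptions: the field names inside each namespace (`name`, `primary_metric`, `color`, and so on) are placeholders, and `requires_version_bump` is a hypothetical helper.

```python
import json

# Namespaces whose changes affect scientific identity or comparability.
VERSIONED_NAMESPACES = ("benchmark", "evaluation")


def requires_version_bump(old_meta: dict, new_meta: dict) -> bool:
    """Return True if any versioned namespace differs between two metadata dicts."""
    return any(old_meta.get(ns) != new_meta.get(ns) for ns in VERSIONED_NAMESPACES)


# Placeholder meta.json content; the keys inside each namespace are assumptions.
old = json.loads("""
{
  "benchmark": {"name": "example_benchmark", "description": "Placeholder", "contributors": []},
  "evaluation": {"primary_metric": "f1", "higher_is_better": true},
  "presentation": {"color": "#1f77b4"}
}
""")

# Changing only presentation hints leaves the benchmark version untouched.
restyled = {**old, "presentation": {"color": "#ff7f0e"}}
# Changing the primary metric affects comparability.
remetric = {**old, "evaluation": {**old["evaluation"], "primary_metric": "accuracy"}}

print(requires_version_bump(old, restyled))  # False
print(requires_version_bump(old, remetric))  # True
```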

**Reproducibility guarantees**
- Any change to benchmark code, data, prompts, or schemas ⇒ new benchmark version
- Results reference:
  - benchmark DOI(s)
  - system version
  - model configuration
- Execution logic and visualization can evolve independently of benchmark validity
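
A stored result could carry these references as a small provenance record. The sketch below assumes a hypothetical `RunProvenance` type and field names; the DOI, version, and model values in the usage example are placeholders, not real identifiers.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical provenance record attached to every stored result."""
    benchmark_dois: tuple[str, ...]
    system_version: str
    model_config: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A result without a benchmark DOI or system version cannot be reproduced.
        if not self.benchmark_dois:
            raise ValueError("at least one benchmark DOI is required")
        if not self.system_version:
            raise ValueError("system version is required")

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


provenance = RunProvenance(
    benchmark_dois=("10.5281/zenodo.0000000",),  # placeholder DOI
    system_version="0.1.0",                      # placeholder version
    model_config={"model": "example-model", "temperature": 0.0},
)
print(provenance.to_json())
```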


TODO:
- [ ] Discuss this proposal (@MHindermann)
- [ ] Plan implementation
- [ ] Plan publication

Split humanities_data_benchmark into three repositories #82
