
Split humanities_data_benchmark into three repositories #82

@sorinmarti

Proposal: split this repository into three, for better manageability and to ease future development.

  • REPO 1 (this one): Benchmarks/Datasets

    • Contains benchmark definitions only:
      • ground truths, prompts, context data (images/texts)
      • benchmark-local Python logic (benchmark.py)
      • response schemas (dataclass.py)
      • namespaced meta.json (benchmark / evaluation / presentation)
    • Benchmarks subclass a stable Benchmark API and must not import system code
    • Versioned per benchmark; each version is immutable and snapshotted to Zenodo
  • REPO 2: Benchmark system

    • Contains all execution and infrastructure logic:
      • runner, CLI, loaders
      • scoring orchestration and result collection
      • benchmark API implementation
      • Django + MongoDB + Plotly visualization frontend
    • Treats benchmarks as external, versioned inputs (local path, git ref, or Zenodo DOI)
    • Reads only:
      • benchmark.* and evaluation.* metadata for execution
      • presentation.* metadata for first-party frontend rendering
  • REPO 3: Results (raw and collected)

    • Stores curated system runs:
      • raw results
      • aggregated / collected results
      • figures and tables
    • Manual Zenodo releases after a full evaluation cycle (e.g. multi-day runs)
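To make the proposed boundary between REPO 1 and REPO 2 concrete, here is a minimal sketch of how a benchmark might subclass the stable API and how the system might classify a benchmark reference. Everything in it is an assumption for illustration: the `score` method, `classify_source`, the `git+` prefix convention, and the `LetterResponse` / `LetterBenchmark` names do not come from the existing code. Both sides are shown in one file only so the example runs on its own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# --- Stable API that benchmarks subclass (hypothetical shape) ---
class Benchmark(ABC):
    """Base class that every benchmark definition subclasses."""

    @abstractmethod
    def score(self, response: object, ground_truth: dict) -> float:
        """Return a score in [0, 1] for a single model response."""


# --- System side (REPO 2): treat benchmarks as external inputs ---
def classify_source(spec: str) -> str:
    """Classify a benchmark reference as Zenodo DOI, git ref, or local path.

    Assumes Zenodo DOIs start with "10.5281/zenodo." and that git refs are
    written with a pip-style "git+" prefix (an illustrative convention).
    """
    if spec.startswith("10.5281/zenodo."):
        return "zenodo_doi"
    if spec.startswith("git+"):
        return "git_ref"
    return "local_path"


# --- Benchmark side (REPO 1): dataclass.py and benchmark.py ---
@dataclass(frozen=True)
class LetterResponse:
    """Hypothetical response schema for one benchmark item."""

    sender: str
    year: int


class LetterBenchmark(Benchmark):
    """Benchmark-local scoring; depends on the Benchmark API only."""

    def score(self, response: LetterResponse, ground_truth: dict) -> float:
        fields = ("sender", "year")
        # Fraction of schema fields that match the ground truth exactly.
        hits = sum(getattr(response, f) == ground_truth[f] for f in fields)
        return hits / len(fields)
```

In this sketch the benchmark class never touches the runner, loaders, or frontend, which is the property that would let the two repositories be versioned independently.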

Metadata contract

  • meta.json is namespaced:
    • benchmark: identity, description, contributors (scientific, immutable)
    • evaluation: primary metric, ranking semantics (affects comparability)
    • presentation: display and visualization hints (frontend only)
  • Only changes to benchmark or evaluation require a benchmark version bump
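The version-bump rule above could be checked mechanically. The following is a sketch under assumptions: the three top-level namespaces come from this proposal, but every key inside them (`primary_metric`, `higher_is_better`, `display_name`, `color`, and so on) and the function name `requires_version_bump` are made up for illustration.

```python
import json

# Namespaces whose changes affect scientific identity or comparability.
VERSIONED_NAMESPACES = ("benchmark", "evaluation")


def requires_version_bump(old_meta: dict, new_meta: dict) -> bool:
    """Return True if any versioned namespace differs between two meta.json contents."""
    return any(old_meta.get(ns) != new_meta.get(ns) for ns in VERSIONED_NAMESPACES)


# Hypothetical meta.json content; all keys inside the namespaces are illustrative.
old = json.loads("""
{
  "benchmark": {"name": "example_benchmark", "contributors": []},
  "evaluation": {"primary_metric": "f1", "higher_is_better": true},
  "presentation": {"display_name": "Example", "color": "#336699"}
}
""")

# A presentation-only change: no bump needed.
restyled = {**old, "presentation": {**old["presentation"], "color": "#993333"}}

# A change to ranking semantics: bump needed.
rescored = {**old, "evaluation": {**old["evaluation"], "primary_metric": "accuracy"}}

print(requires_version_bump(old, restyled))  # False
print(requires_version_bump(old, rescored))  # True
```

A check like this could run in CI on the benchmarks repository, so that a change to `benchmark` or `evaluation` without a new version is caught before release.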

Reproducibility guarantees

  • Any change to benchmark code, data, prompts, or schemas ⇒ new benchmark version
  • Results reference:
    • benchmark DOI(s)
    • system version
    • model configuration
  • Execution logic and visualization can evolve independently of benchmark validity
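One possible way to back these guarantees is a content fingerprint per benchmark folder plus a result record that carries the three references listed above. This is a sketch, not a proposed implementation: `fingerprint`, `ResultRecord`, and its field names are hypothetical, and the file contents are invented for the demonstration.

```python
import hashlib
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


def fingerprint(root: Path) -> str:
    """SHA-256 over the relative path and bytes of every file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()


@dataclass(frozen=True)
class ResultRecord:
    """What a stored result would reference (hypothetical field names)."""

    benchmark_dois: tuple[str, ...]
    system_version: str
    model_config: dict = field(default_factory=dict)


# Demonstration: editing a prompt changes the fingerprint.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    prompt = root / "prompt.txt"
    prompt.write_text("Transcribe the letter.", encoding="utf-8")
    before = fingerprint(root)
    prompt.write_text("Transcribe the letter verbatim.", encoding="utf-8")
    after = fingerprint(root)

print(before != after)  # True
```

Note that this naive fingerprint also changes on presentation-only edits to meta.json, so a real check would need to exclude the presentation namespace to stay consistent with the metadata contract.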

TODO:
[] Discuss this proposal (@MHindermann)
[] Plan implementation
[] Plan publication

Labels: enhancement
