Open
Description
Proposal: split the project into three repositories to improve manageability and ease future development.
REPO 1: (this one) Benchmarks/Datasets
- Contains benchmark definitions only:
- ground truths, prompts, context data (images/texts)
- benchmark-local Python logic (benchmark.py)
- response schemas (dataclass.py)
- namespaced meta.json (benchmark / evaluation / presentation)
- Benchmarks subclass a stable Benchmark API and must not import system code
- Versioned per benchmark; each version is immutable and snapshotted to Zenodo
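A minimal sketch of what a benchmark in REPO 1 could look like under this split. The `Benchmark` base class is defined inline only to keep the example self-contained; its `score` method, the `Response` fields, and `ExampleBenchmark` are all hypothetical names, not the actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Benchmark(ABC):
    """Hypothetical stand-in for the stable Benchmark API (assumption)."""

    @abstractmethod
    def score(self, response, ground_truth: dict) -> float:
        """Return a score in [0, 1] for a single model response."""


@dataclass(frozen=True)
class Response:
    """Hypothetical response schema, as it might appear in dataclass.py."""

    title: str
    year: int


class ExampleBenchmark(Benchmark):
    """Benchmark-local logic, as it might appear in benchmark.py.

    Note that nothing here imports runner, CLI, or frontend code.
    """

    def score(self, response: Response, ground_truth: dict) -> float:
        fields = ("title", "year")
        # Fraction of fields that match the ground truth exactly.
        hits = sum(getattr(response, f) == ground_truth[f] for f in fields)
        return hits / len(fields)


result = ExampleBenchmark().score(
    Response(title="Annual report", year=1923),
    {"title": "Annual report", "year": 1924},
)
```

Here `result` is the fraction of matching fields; the benchmark only depends on the base class, so the system can change without touching benchmark code.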
REPO 2: Benchmark system
- Contains all execution and infrastructure logic:
- runner, CLI, loaders
- scoring orchestration and result collection
- benchmark API implementation
- Django + MongoDB + Plotly visualization frontend
- Treats benchmarks as external, versioned inputs (local path, git ref, or Zenodo DOI)
- Reads only:
- benchmark.* and evaluation.* metadata for execution
- presentation.* metadata for first-party frontend rendering
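One way the system could distinguish the three kinds of benchmark input is sketched below. The spec syntax (`git+<url>@<ref>`), the `BenchmarkSource` type, and `parse_source` are assumptions made for illustration; only the `10.5281` prefix used by Zenodo DOIs is taken as given.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkSource:
    """Hypothetical description of where a benchmark version comes from."""

    kind: str  # "zenodo", "git", or "local"
    location: str  # DOI, repository URL, or filesystem path
    ref: Optional[str] = None  # git ref; only set when kind == "git"


# Zenodo DOIs use the 10.5281 prefix, e.g. 10.5281/zenodo.<record id>.
_ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.\d+$")


def parse_source(spec: str) -> BenchmarkSource:
    """Classify a source spec; the 'git+<url>@<ref>' syntax is an assumption."""
    if _ZENODO_DOI.match(spec):
        return BenchmarkSource("zenodo", spec)
    if spec.startswith("git+"):
        url, sep, ref = spec[len("git+"):].rpartition("@")
        if not sep or not url or not ref:
            raise ValueError(f"git source needs an explicit ref: {spec!r}")
        return BenchmarkSource("git", url, ref)
    # Anything else is treated as a local path.
    return BenchmarkSource("local", spec)
```

Requiring an explicit ref for git sources keeps a run tied to one benchmark version rather than to whatever a branch currently points at.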
REPO 3: Results (raw and collected)
- Stores curated system runs:
- raw results
- aggregated / collected results
- figures and tables
- Manual Zenodo releases after a full evaluation cycle (e.g. multi-day runs)
Metadata contract
- meta.json is namespaced:
- benchmark: identity, description, contributors (scientific, immutable)
- evaluation: primary metric, ranking semantics (affects comparability)
- presentation: display and visualization hints (frontend only)
- Only changes to benchmark or evaluation require a benchmark version bump
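The contract above could be checked mechanically, roughly as follows. The field names inside each namespace (`name`, `primary_metric`, `higher_is_better`, `color`) are invented for the example; only the three namespace names come from the proposal.

```python
import json

# Namespaces whose changes affect scientific identity or comparability.
VERSIONED_NAMESPACES = ("benchmark", "evaluation")


def requires_version_bump(old_meta: dict, new_meta: dict) -> bool:
    """True if any versioned namespace differs between two meta.json files."""
    return any(
        old_meta.get(ns) != new_meta.get(ns) for ns in VERSIONED_NAMESPACES
    )


# Hypothetical meta.json content; the keys inside each namespace are assumptions.
old_meta = json.loads("""
{
  "benchmark": {"name": "example", "description": "Example benchmark"},
  "evaluation": {"primary_metric": "accuracy", "higher_is_better": true},
  "presentation": {"color": "#1f77b4"}
}
""")

# A presentation-only change: no version bump needed.
restyled = {**old_meta, "presentation": {"color": "#ff7f0e"}}

# A change to the primary metric: affects comparability, so a bump is needed.
remetriced = {
    **old_meta,
    "evaluation": {"primary_metric": "f1", "higher_is_better": True},
}

bump_for_restyle = requires_version_bump(old_meta, restyled)
bump_for_metric = requires_version_bump(old_meta, remetriced)
```

A check like this could run in CI on REPO 1 to flag edits that touch `benchmark` or `evaluation` without a new version.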
Reproducibility guarantees
- Any change to benchmark code, data, prompts, or schemas ⇒ new benchmark version
- Results reference:
- benchmark DOI(s)
- system version
- model configuration
- Execution logic and visualization can evolve independently of benchmark validity
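As an illustration of the three references a result should carry, a result record might be shaped like this. `ResultRecord`, its field names, and all values (including the placeholder DOI and model name) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical result record; field names are assumptions."""

    benchmark_dois: tuple  # DOI of each immutable benchmark version used
    system_version: str  # version of the benchmark system (REPO 2)
    model_config: dict  # model name and parameters used for the run
    scores: dict = field(default_factory=dict)


record = ResultRecord(
    benchmark_dois=("10.5281/zenodo.0000000",),  # placeholder, not a real DOI
    system_version="0.1.0",
    model_config={"model": "example-model", "temperature": 0.0},
    scores={"accuracy": 0.5},
)

# Sorted keys give a stable serialization that diffs cleanly in REPO 3.
serialized = json.dumps(asdict(record), sort_keys=True, indent=2)
```

Because the record points at benchmark DOIs rather than copying benchmark content, the results repository stays small and each result remains traceable to an immutable benchmark snapshot.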
TODO:
- [ ] Discuss this proposal (@MHindermann)
- [ ] Plan implementation
- [ ] Plan publication