# LAB.md: Lab protocol

This file is the entry point for any Claude Code session running experiments. Read this FIRST before doing anything.
This is an autonomous research lab. Each experiment follows a strict template so findings accumulate and future sessions can build on past work without re-reading everything.
## Quick start

1. Read this file (LAB.md)
2. Read DISCOVERIES.md for what's known so far
3. Pick an open question from DISCOVERIES.md or TODO.md
4. Create an experiment using the template below
5. Run, record, commit

## Repository layout

```
LAB.md                      # You are here — lab protocol
DISCOVERIES.md              # Accumulated knowledge (READ THIS)
TODO.md                     # Open research tasks
src/sparse_parity/
  experiments/
    _template.py            # Copy this to start a new experiment
    exp1_*.py               # Completed experiments
    exp_a_*.py
    ...
findings/
  {exp_name}.md             # One file per experiment, strict format
results/
  {exp_name}/
    results.json            # Machine-readable metrics
    *.png                   # Plots (optional)
```
## Experiment loop

1. HYPOTHESIS → What do you expect and why?
2. SETUP → Config, code, what you're measuring
3. RUN → Execute, capture all output
4. RESULTS → Numbers in a table
5. ANALYSIS → Why did it work or fail? What's surprising?
6. NEXT → What should be tried next, based on this?
7. COMMIT → findings/ + results/ + experiments/
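The SETUP step builds the task data. For orientation, here is a minimal sketch of sparse-parity data generation, where the label is the XOR of a hidden subset of `k_sparse` out of `n_bits` input bits. `make_sparse_parity` is a hypothetical helper for illustration only, not the lab's actual `data.py` (which must never be modified):

```python
import numpy as np

def make_sparse_parity(n_train, n_bits, k_sparse, seed=0):
    """Generate a sparse-parity dataset: inputs are random bit vectors of
    length n_bits; the label is the parity (XOR) of a fixed, hidden subset
    of k_sparse bit positions."""
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k_sparse, replace=False)  # hidden subset
    X = rng.integers(0, 2, size=(n_train, n_bits))
    y = X[:, support].sum(axis=1) % 2  # parity over the hidden bits
    return X, y, support

X, y, support = make_sparse_parity(n_train=1000, n_bits=20, k_sparse=3, seed=42)
```

The learner sees only `X` and `y`; recovering `support` is what makes the task hard.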
## Findings format

Every finding MUST follow this format exactly:
```markdown
# Experiment {ID}: {Title}

**Date**: YYYY-MM-DD
**Status**: SUCCESS | PARTIAL | FAILED
**Answers**: {Which question from DISCOVERIES.md does this address?}

## Hypothesis
{One sentence: "If we do X, then Y will happen because Z."}

## Config
| Parameter | Value |
|-----------|-------|
| n_bits | |
| k_sparse | |
| hidden | |
| lr | |
| wd | |
| batch_size | |
| max_epochs | |
| n_train | |
| seed | |
| method | {standard/fused/perlayer/forward-forward/sign-sgd/...} |

## Results
| Metric | Value |
|--------|-------|
| Best test accuracy | |
| Epochs to >90% | |
| Wall time | |
| Weighted ARD | |
| ARD improvement vs baseline | |

## Key Table
{The ONE comparison table that tells the story}

## Analysis
### What worked
{Bullet points}

### What didn't work
{Bullet points}

### Surprise
{The one thing you didn't expect}

## Open Questions (for next experiment)
- {Question 1 — specific enough to be an experiment}
- {Question 2}

## Files
- Experiment: `src/sparse_parity/experiments/{exp_name}.py`
- Results: `results/{exp_name}/results.json`
```

See `src/sparse_parity/experiments/_template.py` for the code template.
## Rules

- Read DISCOVERIES.md first — don't repeat what's known
- One hypothesis per experiment — change one thing at a time
- Always compare against a baseline — never report absolute numbers alone
- Record failures — a failed hypothesis is still a finding
- Update DISCOVERIES.md — add your finding to the knowledge base
- Keep runtime < 5 minutes — reduce hidden/epochs if needed
- Commit locally, don't push — the human decides when to push
- Leave a "Next" section — so the next session knows what to try
- Metric isolation — never modify measurement code (`tracker.py`, `cache_tracker.py`, `data.py`, `config.py`, `harness.py`). Agents that rewrite evaluation code to get better scores are gaming the metric, not improving the algorithm.
- Two-phase results — see "Two-Phase Results Pipeline" below.
- Reproducibility — see `.claude/rules/experiment-reproducibility.md`. Every experiment must record seed, config, environment, and git commit hash.
## Two-Phase Results Pipeline

Separate the evidence bundle (machine output) from the findings narrative (human interpretation). This prevents the common failure mode where an agent writes conclusions before verifying the numbers.
### Phase 1: Evidence bundle

Experiment code writes ONLY raw data. No interpretation, no "this shows that," no impact claims.
```
results/{exp_id}/
  results.json   # raw numbers, full config, environment (python/numpy
                 # version, platform, git commit), seed(s)
  figures/       # plots (optional)
  stats.md       # statistical summary across seeds (optional;
                 # means, stds, per-seed values — no prose)
  run.log        # captured stdout/stderr (optional)
```
Rules for Phase 1:

- Never write prose conclusions here.
- Always dump the full config, even if every parameter is at its default.
- Always record the seed. If the experiment ran multiple seeds, record every one.
- Record the environment (see `.claude/rules/experiment-reproducibility.md`).
### Phase 2: Findings narrative

Only after Phase 1 is written, open `docs/findings/{exp_id}.md` and write the interpretation.
`docs/findings/{exp_id}.md` must contain:

- Hypothesis
- Link to `results/{exp_id}/results.json` (do NOT inline raw data)
- Key table (the one comparison that tells the story)
- Classification: WIN / LOSS / INVALID / INCONCLUSIVE / BASELINE
- Analysis: what worked, what didn't, the surprise
- Impact on DISCOVERIES.md (if any)
- Next experiment
Rules for Phase 2:
- Reference the results JSON by path. Do not paste raw arrays or log dumps.
- Must include the classification. "COMPLETED" is not a classification.
- If you change your interpretation later, edit Phase 2 only. Phase 1 stays immutable.
The separation is enforced in code review. If a findings doc contains numbers that do not appear in the matching results/{exp_id}/results.json, that is a review block. If a results.json contains prose fields, same block. This is the cheapest way to catch agents (and humans) writing conclusions ahead of their data.
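The cross-check described above can be approximated mechanically. A hypothetical sketch (not the `run-experiment` skill, and intentionally crude: it only compares numeric tokens, so headings like "Phase 1" may false-positive):

```python
import json
import re

def check_findings_numbers(findings_md: str, results_json: str) -> list:
    """Return numeric tokens that appear in the findings narrative but not
    anywhere in the evidence bundle; any non-empty return is a review block."""
    reported = json.loads(results_json)

    def numbers(obj):
        # Recursively yield every numeric value in the JSON structure.
        if isinstance(obj, dict):
            for v in obj.values():
                yield from numbers(v)
        elif isinstance(obj, list):
            for v in obj:
                yield from numbers(v)
        elif isinstance(obj, (int, float)):
            yield obj

    evidence = {f"{v:g}" for v in numbers(reported)}
    # Pull numeric tokens out of the markdown.
    claimed = re.findall(r"\d+(?:\.\d+)?", findings_md)
    return [c for c in claimed if f"{float(c):g}" not in evidence]
```

Run against a findings doc and its matching `results.json`; every number it returns is a claim with no backing data.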
The run-experiment skill automates this flow; see .claude/skills/run-experiment/.
## Known results

| Config | Method | Accuracy | ARD | DMC | Time | Reference |
|---|---|---|---|---|---|---|
| n=20, k=3 | numpy SGD (fast.py) | 100% | — | — | 0.12s | fast.py |
| n=20, k=3 | standard (LR=0.1, batch=32) | 100% | 17,976 | — | — | exp_a |
| n=20, k=3 | standard (single sample, tracked) | 100% | 4,104 | 300,298 | 1.78s | baseline |
| n=20, k=3 | perlayer (LR=0.1) | 99.5% | 17,299 | — | — | exp_c |
| n=20, k=3 | forward-forward | 58.5% | 277,256 | — | — | exp_e |
| n=20, k=5 | sign SGD (n_train=5000) | >90% | — | — | — | exp_sign_sgd |
| n=20, k=5 | standard (n_train=5000) | >90% | — | — | — | exp_sign_sgd |
| n=30, k=3 | standard (LR=0.1, batch=32) | 94.5% | — | — | — | exp_d |
| n=50, k=3 | curriculum (n=10→30→50) | >90% | — | — | — | exp_curriculum |
| n=50, k=3 | standard (direct) | 54% (FAIL) | — | — | — | exp_d |
| n=3, k=3 | standard | 100% | 10,640 | — | — | run_20260303_200353 |
| n=3, k=3 | perlayer | 100% | 9,674 | — | — | run_20260303_200353 |
| n=20, k=3 | fourier (Walsh-Hadamard) | 100% | 1,147,375 | — | 0.009s | exp_fourier |
| n=50, k=3 | fourier (Walsh-Hadamard) | 100% | — | — | 0.16s | exp_fourier |
| n=20, k=5 | fourier (Walsh-Hadamard) | 100% | — | — | 0.14s | exp_fourier |