# Autoresearch Agent
Leo edited this page Mar 13, 2026 · 1 revision
Autonomous experiment loop inspired by Karpathy's autoresearch. Optimizes any file by a measurable metric — code speed, prompt quality, content CTR, bundle size — through continuous iteration.
```
┌─────────────────────────┐
│ 1. Read target file     │
│ 2. Make ONE change      │
│ 3. Run evaluation       │
│ 4. Score improved?      │
│    YES → git commit     │
│    NO  → git reset      │
│ 5. Loop to step 1       │
└─────────────────────────┘
```
The agent edits a target file, runs a fixed evaluation command, keeps improvements (git commit), discards failures (git reset), and loops indefinitely.
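The commit-or-reset loop above can be sketched in plain Python. This is a minimal sketch, not the agent's actual implementation: `make_one_change` is a hypothetical hook for the agent's edit step, and the evaluation command, metric name, and metric direction are assumed to come from the experiment config.

```python
import subprocess

def run_eval(eval_cmd, metric_name):
    """Run the fixed evaluation command and parse 'metric_name: value' from stdout."""
    proc = subprocess.run(eval_cmd, shell=True, capture_output=True, text=True, check=True)
    for line in proc.stdout.splitlines():
        if line.startswith(metric_name + ":"):
            return float(line.split(":", 1)[1])
    raise ValueError(f"metric '{metric_name}' not found in evaluator output")

def iteration(target, eval_cmd, metric, best, higher_is_better=True):
    """One loop step: edit, evaluate, then keep (commit) or discard (reset)."""
    make_one_change(target)  # hypothetical hook: the agent edits the target file
    score = run_eval(eval_cmd, metric)
    improved = score > best if higher_is_better else score < best
    if improved:
        subprocess.run(["git", "commit", "-am", f"{metric}={score}"], check=True)
        return score
    subprocess.run(["git", "reset", "--hard"], check=True)  # discard the failed change
    return best
```

The direction flag matters: for metrics like `p50_ms` a lower score is an improvement, so the comparison flips.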
| Command | What It Does |
|---|---|
| `/ar:setup` | Create a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Run continuous optimization loop |
| `/ar:status` | Show experiment dashboard and results |
| `/ar:resume` | Resume a paused experiment |
```shell
# Set up a new experiment
/ar:setup

# You'll be asked:
# - Domain (engineering, marketing, content, prompts, custom)
# - Target file to optimize
# - Evaluation command
# - Metric name and direction (higher/lower is better)
# - Evaluator type

# Run it
/ar:loop
```

| Domain | Example Target | Example Metric | Evaluator |
|---|---|---|---|
| engineering | `src/api/search.py` | `p50_ms` (lower) | `benchmark_speed` |
| engineering | `Dockerfile` | `image_size_mb` (lower) | `benchmark_size` |
| marketing | `content/titles.md` | `ctr_score` (higher) | `llm_judge_content` |
| prompts | `system_prompt.txt` | `accuracy` (higher) | `llm_judge_prompt` |
| content | `blog/draft.md` | `engagement_score` (higher) | `llm_judge_copy` |
- `benchmark_speed` — Execution time measurement
- `benchmark_size` — File/bundle/image size
- `test_pass_rate` — Test suite pass percentage
- `build_speed` — Build time measurement
- `memory_usage` — Peak memory consumption

- `llm_judge_content` — Content quality scoring
- `llm_judge_prompt` — Prompt effectiveness rating
- `llm_judge_copy` — Marketing copy evaluation
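As an illustration of the benchmark family, a `benchmark_speed`-style evaluator could time a command repeatedly and report the median. This is a sketch only: the default command and the `p50_ms` metric name are assumptions, not the built-in evaluator's actual defaults.

```python
#!/usr/bin/env python3
"""Sketch of a benchmark_speed-style evaluator: prints 'p50_ms: value' to stdout."""
import statistics
import subprocess
import time

def evaluate(cmd="python3 -c pass", runs=9):
    # Time the command several times and report the median in milliseconds,
    # which smooths out one-off scheduling noise between iterations.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True)
        samples.append((time.perf_counter() - start) * 1000.0)
    print(f"p50_ms: {statistics.median(samples):.2f}")

if __name__ == "__main__":
    evaluate()
```

Using the median rather than the mean keeps a single slow outlier run from flipping the loop's commit/reset decision.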
```
.autoresearch/
├── config.yaml                # Global settings
└── {domain}/
    └── {experiment-name}/
        ├── config.cfg         # Experiment config
        ├── program.md         # Strategy document
        ├── results.tsv        # Experiment log
        └── target_*.txt       # Target file backup
```
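Given the two-level layout above, experiments can be discovered by walking the tree. A small sketch, assuming exactly the directory scheme shown (a `config.cfg` marks a valid experiment directory):

```python
from pathlib import Path

def list_experiments(root=".autoresearch"):
    """Yield (domain, experiment_name) pairs found under the .autoresearch/ tree."""
    base = Path(root)
    for domain_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        for exp_dir in sorted(p for p in domain_dir.iterdir() if p.is_dir()):
            if (exp_dir / "config.cfg").exists():  # treat config.cfg as the marker
                yield domain_dir.name, exp_dir.name
```

Something like this is what a dashboard command such as `/ar:status` would need before it can render per-experiment results.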
Evaluators are Python scripts that output a single metric:
```python
#!/usr/bin/env python3
"""Custom evaluator — must print 'metric_name: value' to stdout."""

def evaluate():
    score = run_your_benchmark()  # your evaluation logic here
    print(f"my_metric: {score}")

if __name__ == "__main__":
    evaluate()
```

Requirements:
- stdlib-only Python
- Print exactly one line: `metric_name: value`
- Exit code 0 on success
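On the caller's side, those three requirements can be enforced when the evaluator runs. A sketch assuming the contract above; the `my_eval.py` path is hypothetical:

```python
import subprocess

def read_metric(evaluator="./my_eval.py"):
    """Run an evaluator script and enforce its contract:
    exit code 0, exactly one non-empty 'metric_name: value' line."""
    proc = subprocess.run(["python3", evaluator], capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"evaluator failed with exit code {proc.returncode}")
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    if len(lines) != 1 or ":" not in lines[0]:
        raise ValueError(f"expected one 'metric_name: value' line, got: {proc.stdout!r}")
    name, value = lines[0].split(":", 1)
    return name.strip(), float(value)
```

Failing loudly here is deliberate: a malformed evaluator should abort the loop rather than silently commit a change against a bogus score.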
We use autoresearch internally to optimize LINDERA's medical recommendation prompts:
- Target: System prompt for MedGemma 4B medical LLM
- Evaluator: 5-dimension scorer (including keyword accuracy, clinical appropriateness, hallucination rate, tone accuracy)
- Results: Composite score improved from 62.5 → 63.7 in 6 iterations
- Cost: $0 (local model inference + free evaluator)
See LINDERA ML project for details.