Autoresearch Agent

Leo edited this page Mar 13, 2026 · 1 revision

An autonomous experiment loop inspired by Karpathy's autoresearch. It optimizes any file against a measurable metric — code speed, prompt quality, content CTR, bundle size — through continuous iteration.

How It Works

┌─────────────────────────┐
│  1. Read target file    │
│  2. Make ONE change     │
│  3. Run evaluation      │
│  4. Score improved?     │
│     YES → git commit    │
│     NO  → git reset     │
│  5. Loop to step 1      │
└─────────────────────────┘

The agent edits a target file, runs a fixed evaluation command, keeps improvements (git commit), discards failures (git reset), and loops indefinitely.
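The keep-or-discard step can be sketched in Python. This is a minimal illustration, not the agent's actual implementation; the evaluation command and its single-line `metric_name: value` output are the only assumptions carried over from the evaluator contract below:

```python
import subprocess

def run_eval(eval_cmd: str) -> float:
    """Run the fixed evaluation command and parse the score from its
    single 'metric_name: value' output line."""
    out = subprocess.run(eval_cmd, shell=True, capture_output=True,
                         text=True, check=True)
    _name, _, raw = out.stdout.strip().partition(":")
    return float(raw)

def iterate(eval_cmd: str, best: float, higher_is_better: bool = True) -> float:
    """One loop iteration: evaluate the already-edited target file,
    keep the change (git commit) on improvement, discard it (git reset) otherwise."""
    score = run_eval(eval_cmd)
    improved = score > best if higher_is_better else score < best
    if improved:
        subprocess.run(["git", "commit", "-am", f"autoresearch: score {score}"],
                       check=True)
        return score
    subprocess.run(["git", "reset", "--hard"], check=True)  # drop the failed edit
    return best
```

The direction flag matters: for metrics like p50_ms or image_size_mb, lower is better, so the comparison flips.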

Slash Commands

Command      What It Does
/ar:setup    Create a new experiment interactively
/ar:run      Run a single experiment iteration
/ar:loop     Run continuous optimization loop
/ar:status   Show experiment dashboard and results
/ar:resume   Resume a paused experiment

Quick Start

# Set up a new experiment
/ar:setup

# You'll be asked:
# - Domain (engineering, marketing, content, prompts, custom)
# - Target file to optimize
# - Evaluation command
# - Metric name and direction (higher/lower is better)
# - Evaluator type

# Run it
/ar:loop

Experiment Domains

Domain       Example Target       Example Metric             Evaluator
engineering  src/api/search.py    p50_ms (lower)             benchmark_speed
engineering  Dockerfile           image_size_mb (lower)      benchmark_size
marketing    content/titles.md    ctr_score (higher)         llm_judge_content
prompts      system_prompt.txt    accuracy (higher)          llm_judge_prompt
content      blog/draft.md        engagement_score (higher)  llm_judge_copy

Built-in Evaluators

Free (no API costs)

  • benchmark_speed — Execution time measurement
  • benchmark_size — File/bundle/image size
  • test_pass_rate — Test suite pass percentage
  • build_speed — Build time measurement
  • memory_usage — Peak memory consumption
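For instance, an evaluator in the spirit of benchmark_size reduces to a few lines. This sketch is illustrative only: the metric name size_bytes and the command-line interface are assumptions, not the built-in evaluator's actual code:

```python
#!/usr/bin/env python3
"""Minimal size evaluator: prints the single 'metric_name: value' line."""
import os
import sys

def evaluate(path: str) -> None:
    # Size of the target artifact on disk, in bytes
    print(f"size_bytes: {os.path.getsize(path)}")

if __name__ == "__main__" and len(sys.argv) > 1:
    evaluate(sys.argv[1])  # invoked with the path of the file to measure
    sys.exit(0)
```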

LLM Judge (uses your existing AI subscription)

  • llm_judge_content — Content quality scoring
  • llm_judge_prompt — Prompt effectiveness rating
  • llm_judge_copy — Marketing copy evaluation

File Structure

.autoresearch/
├── config.yaml              # Global settings
└── {domain}/
    └── {experiment-name}/
        ├── config.cfg       # Experiment config
        ├── program.md       # Strategy document
        ├── results.tsv      # Experiment log
        └── target_*.txt     # Target file backup
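The exact schema of config.cfg is defined by /ar:setup; as a rough sketch, its fields mirror the setup questions. All keys below are illustrative assumptions, not the actual format:

```ini
target_file  = src/api/search.py
eval_command = python evaluators/benchmark_speed.py
metric_name  = p50_ms
direction    = lower
evaluator    = benchmark_speed
```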

Writing Custom Evaluators

Evaluators are Python scripts that output a single metric:

#!/usr/bin/env python3
"""Custom evaluator — must print 'metric_name: value' to stdout."""
import sys
import time

def run_your_benchmark():
    # Placeholder: replace with your own measurement logic
    start = time.perf_counter()
    sum(range(1_000_000))
    return time.perf_counter() - start

def evaluate():
    score = run_your_benchmark()
    print(f"my_metric: {score}")

if __name__ == "__main__":
    evaluate()
    sys.exit(0)

Requirements:

  • stdlib-only Python
  • Print exactly one line: metric_name: value
  • Exit code 0 on success
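This contract keeps the agent-side parsing trivial. A sketch of how the output line might be consumed (the helper name is hypothetical, not part of autoresearch):

```python
def parse_metric(stdout: str) -> tuple[str, float]:
    """Split an evaluator's 'metric_name: value' line into its parts."""
    line = stdout.strip().splitlines()[-1]  # the contract says exactly one line
    name, _, raw = line.partition(":")
    return name.strip(), float(raw)
```

For example, parse_metric("my_metric: 12.5") returns ("my_metric", 12.5).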

Real-World Example

We use autoresearch internally to optimize LINDERA's medical recommendation prompts:

  • Target: System prompt for MedGemma 4B medical LLM
  • Evaluator: 5-dimension scorer (including keyword accuracy, clinical appropriateness, hallucination rate, and tone accuracy)
  • Results: Composite score improved from 62.5 → 63.7 in 6 iterations
  • Cost: $0 (local model inference + free evaluator)

See LINDERA ML project for details.
