Autoresearch Agent

Leo edited this page Mar 13, 2026 · 1 revision

An autonomous experiment loop inspired by Karpathy's autoresearch. It optimizes any file against a measurable metric — code speed, prompt quality, content CTR, bundle size — through continuous iteration.

How It Works

┌─────────────────────────┐
│  1. Read target file    │
│  2. Make ONE change     │
│  3. Run evaluation      │
│  4. Score improved?     │
│     YES → git commit    │
│     NO  → git reset     │
│  5. Loop to step 1      │
└─────────────────────────┘

The agent edits a target file, runs a fixed evaluation command, keeps improvements (git commit), discards failures (git reset), and loops indefinitely.
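The keep-or-discard step can be sketched in Python. This is a minimal illustration, not the agent's actual implementation; the evaluation command and its single-line `metric_name: value` output are the only assumptions carried over from the evaluator contract below:

```python
import subprocess

def run_eval(eval_cmd: str) -> float:
    """Run the fixed evaluation command and parse the score from its
    single 'metric_name: value' output line."""
    out = subprocess.run(eval_cmd, shell=True, capture_output=True,
                         text=True, check=True)
    _name, _, raw = out.stdout.strip().partition(":")
    return float(raw)

def iterate(eval_cmd: str, best: float, higher_is_better: bool = True) -> float:
    """One loop iteration: evaluate the already-edited target file,
    keep the change (git commit) on improvement, discard it (git reset) otherwise."""
    score = run_eval(eval_cmd)
    improved = score > best if higher_is_better else score < best
    if improved:
        subprocess.run(["git", "commit", "-am", f"autoresearch: score {score}"],
                       check=True)
        return score
    subprocess.run(["git", "reset", "--hard"], check=True)  # drop the failed edit
    return best
```

The direction flag matters: for metrics like p50_ms or image_size_mb, lower is better, so the comparison flips.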

Slash Commands

Command      What It Does
/ar:setup    Create a new experiment interactively
/ar:run      Run a single experiment iteration
/ar:loop     Run continuous optimization loop
/ar:status   Show experiment dashboard and results
/ar:resume   Resume a paused experiment

Quick Start

# Set up a new experiment
/ar:setup

# You'll be asked:
# - Domain (engineering, marketing, content, prompts, custom)
# - Target file to optimize
# - Evaluation command
# - Metric name and direction (higher/lower is better)
# - Evaluator type

# Run it
/ar:loop

Experiment Domains

Domain       Example Target       Example Metric             Evaluator
engineering  src/api/search.py    p50_ms (lower)             benchmark_speed
engineering  Dockerfile           image_size_mb (lower)      benchmark_size
marketing    content/titles.md    ctr_score (higher)         llm_judge_content
prompts      system_prompt.txt    accuracy (higher)          llm_judge_prompt
content      blog/draft.md        engagement_score (higher)  llm_judge_copy

Built-in Evaluators

Free (no API costs)

  • benchmark_speed — Execution time measurement
  • benchmark_size — File/bundle/image size
  • test_pass_rate — Test suite pass percentage
  • build_speed — Build time measurement
  • memory_usage — Peak memory consumption
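For instance, an evaluator in the spirit of benchmark_size reduces to a few lines. This sketch is illustrative only: the metric name size_bytes and the command-line interface are assumptions, not the built-in evaluator's actual code:

```python
#!/usr/bin/env python3
"""Minimal size evaluator: prints the single 'metric_name: value' line."""
import os
import sys

def evaluate(path: str) -> None:
    # Size of the target artifact on disk, in bytes
    print(f"size_bytes: {os.path.getsize(path)}")

if __name__ == "__main__" and len(sys.argv) > 1:
    evaluate(sys.argv[1])  # invoked with the path of the file to measure
    sys.exit(0)
```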

LLM Judge (uses your existing AI subscription)

  • llm_judge_content — Content quality scoring
  • llm_judge_prompt — Prompt effectiveness rating
  • llm_judge_copy — Marketing copy evaluation

File Structure

.autoresearch/
├── config.yaml              # Global settings
└── {domain}/
    └── {experiment-name}/
        ├── config.cfg       # Experiment config
        ├── program.md       # Strategy document
        ├── results.tsv      # Experiment log
        └── target_*.txt     # Target file backup
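The exact schema of config.cfg is defined by /ar:setup; as a rough sketch, its fields mirror the setup questions. All keys below are illustrative assumptions, not the actual format:

```ini
target_file  = src/api/search.py
eval_command = python evaluators/benchmark_speed.py
metric_name  = p50_ms
direction    = lower
evaluator    = benchmark_speed
```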

Writing Custom Evaluators

Evaluators are Python scripts that output a single metric:

#!/usr/bin/env python3
"""Custom evaluator — must print 'metric_name: value' to stdout."""
import sys
import time

def run_your_benchmark():
    # Placeholder: replace with your own measurement logic
    start = time.perf_counter()
    sum(range(1_000_000))
    return time.perf_counter() - start

def evaluate():
    score = run_your_benchmark()
    print(f"my_metric: {score}")

if __name__ == "__main__":
    evaluate()
    sys.exit(0)

Requirements:

  • stdlib-only Python
  • Print exactly one line: metric_name: value
  • Exit code 0 on success
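This contract keeps the agent-side parsing trivial. A sketch of how the output line might be consumed (the helper name is hypothetical, not part of autoresearch):

```python
def parse_metric(stdout: str) -> tuple[str, float]:
    """Split an evaluator's 'metric_name: value' line into its parts."""
    line = stdout.strip().splitlines()[-1]  # the contract says exactly one line
    name, _, raw = line.partition(":")
    return name.strip(), float(raw)
```

For example, parse_metric("my_metric: 12.5") returns ("my_metric", 12.5).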

Real-World Example

We use autoresearch internally to optimize LINDERA's medical recommendation prompts:

  • Target: System prompt for MedGemma 4B medical LLM
  • Evaluator: 5-dimension scorer (including keyword accuracy, clinical appropriateness, hallucination rate, and tone accuracy)
  • Results: Composite score improved from 62.5 → 63.7 in 6 iterations
  • Cost: $0 (local model inference + free evaluator)

See LINDERA ML project for details.
