237 changes: 237 additions & 0 deletions docs/auto-improve-skill-v1.3-design.md
@@ -0,0 +1,237 @@
# auto-improve-skill v1.3 — design proposal

**Status:** draft, written 2026-05-12 during the v1.2.1 PR-prep session.
**Audience:** team review before implementation.
**Tracking:** the in-flight v1.2.1 pilot work (web-design-guidelines /
agent-browser / supabase) is the empirical basis for this proposal.

## Executive summary

v1.3 adds two structural phases to the auto-improve-skill pipeline,
both motivated by failure modes observed across 4 v1.2.1 pilots:

1. **Phase 0 — Research-first context.** A research subagent reads the
target upstream repo's contribution conventions, frontmatter spec,
prefix taxonomy, and merged-PR shape patterns, and writes a context
file that v1.2.1's `--context` flag consumes. Without this, the
auto-pilot produces output that requires manual reformulation
before submission.
2. **Phase 3.5 — Eval-readiness loop.** The pipeline iterates on the
eval (seed harder/simpler cases) until baseline lands in the
"interesting zone" `(0.50, 0.95)`. Without this, baselines saturate
at 1.00 (no headroom to demonstrate uplift) or floor at <0.50 (skill
shape blocks measurement).

The skill-iteration loop (current Phase 4) is unchanged.

## Lesson 1 — Research-first context is mandatory

### Evidence (4 pilots this session)

| Skill | Without context | With researched context |
|---|---|---|
| web-design-guidelines | Manual proposal needed retargeting (SKILL.md→command.md), reformulation across 3 stylistic siblings, frontmatter mismatch. Manual labor: ~2 hours per PR. | Auto-pilot produced a clean, mergeable diff to the right file in the right voice. Manual labor: ~10 min mirror to AGENTS.md/README.md. |
| agent-browser | Auto-pilot proposed editing `skills/agent-browser/SKILL.md`. Per upstream `AGENTS.md`, that file is intentionally a discovery stub; the real content lives at `skill-data/core/SKILL.md`. Manual retarget required. | (Pending; pilot in flight. The context file says to edit `agent-browser-core.md`, and the output produced so far references `before-skill-data-core-SKILL.md`.) |
| supabase (batch-1) | Produced shape-novel `references/review-...md` with non-existent prefix (`review-`), missing `impactDescription` frontmatter field, philosophical-style content (MEDIUM-HIGH rejection risk per CONTRIBUTING patterns). | Auto-pilot reshaped into convention-perfect SQL anti-pattern under correct prefix (`monitor-`), full 4-field frontmatter, `**Incorrect**`/`**Correct**` SQL blocks per `_template.md`, trailing `Reference:` link. Zero manual reformulation needed. |

### Generalization

The auto-pilot is good at *finding what to change* (which rules, which
files, which absence-type gaps). It is bad at *fitting upstream
conventions*: frontmatter schemas, file-location norms, prefix
taxonomies, additive-only rules, "Discussion-first" gates, voice
consistency. Conventions are repo-specific tribal knowledge that
cannot be inferred from reading the SKILL.md alone.

### Phase 0 design

```text
Phase 0 — Research upstream (NEW, runs before Phase 1)

Inputs:
  - target slug <owner>/<repo>/<skill-id>

Subtasks (executed by a research subagent):
  1. Repo metadata: license, CLA, default branch, recent activity
  2. Read CONTRIBUTING.md, AGENTS.md, .github/PULL_REQUEST_TEMPLATE.md,
     CODEOWNERS, .github/workflows/*.yml
  3. Read skill-specific convention files: _contributing.md,
     _template.md, _sections.md (or equivalents)
  4. Read sanity-test source if present (don't trust prior assumptions
     about what CI validates)
  5. Sample last 10 merged PRs to the target skill (or repo) for shape:
     file count, body shape, conventional-commit usage, scope sizing
  6. Sample last 5 closed-without-merge PRs for rejection signals:
     "Discussion-first gate violated", "shape-novel content rejected",
     etc.
  7. Identify other consumers (gh search for raw URL references; check
     for install scripts; check repo's own README for distribution
     channels)

Output:
  tools/auto-improve-contexts/<owner>-<skill>.md
    - Repository facts (license, CI, maintainers, merge style)
    - Hard constraints (additive-only, file-location, prefix taxonomy,
      forbidden modifications)
    - Frontmatter spec (exact required fields + allowed values)
    - Content shape template (copy-and-fill)
    - Optimization target file (where the skill change should land)
    - Risk profile (LOW/MEDIUM/HIGH + reasons)
    - Pre-submit checklist (what auto-pilot must verify)
    - Useful URLs

Cost: ~$0.50–$1.00 per skill (single subagent invocation).

Caching: context files are committed to the repo. Re-running on the
same skill within 30 days: skip Phase 0, reuse cached context (with
explicit `--refresh-context` flag to force re-research).

Operator override: `--context <path>` flag continues to work; if
provided, Phase 0 is skipped.
```
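
The caching and override rules above can be read as a small decision function. The following is a minimal sketch, not the wrapper's actual code: the function name, the option names, and the choice to treat "within 30 days" as inclusive are all assumptions made for illustration.

```javascript
// Sketch only: names and option shape are hypothetical, not the wrapper's real API.
const CACHE_MAX_AGE_DAYS = 30; // from the "within 30 days" caching rule

/**
 * Decide where the context file for a run comes from.
 * @param {{contextPath?: string, refreshContext?: boolean, cachedAgeDays?: number|null}} opts
 * @returns {"operator-context"|"cached"|"research"}
 */
function resolveContextSource(opts) {
  const { contextPath, refreshContext = false, cachedAgeDays = null } = opts;
  // Operator override wins: an explicit --context path skips Phase 0.
  if (contextPath) return "operator-context";
  // --refresh-context forces re-research even when a fresh cache exists.
  if (refreshContext) return "research";
  // Reuse a committed context file if it is recent enough (inclusive bound is an assumption).
  if (cachedAgeDays !== null && cachedAgeDays <= CACHE_MAX_AGE_DAYS) return "cached";
  // No cache, or the cache is stale: run the research subagent.
  return "research";
}

console.log(resolveContextSource({ cachedAgeDays: 12 }));
console.log(resolveContextSource({ cachedAgeDays: 45 }));
```

One design point this makes explicit: when both `--context` and `--refresh-context` are passed, the sketch lets `--context` win, since the text says Phase 0 is skipped whenever a context path is provided.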

## Lesson 2 — Two-loop iteration: eval AND skill

### Evidence

| Skill | Initial baseline | Failure mode | Manual fix |
|---|---|---|---|
| agent-browser (Tier-0 only) | 0.97 | Shallow eval — only graded command-presence, not the skill's actual value prop (ref-based interaction, snapshot interpretation, multi-step state) | Built Tier-1 cases via subagent (~half-day): pre-recorded fixtures, stateful fake CLI, 4 new cases targeting the differentiator |
| supabase (calibrated graders, frontier models) | 1.00 | Eval saturated; calibrated graders + capable models perfect-detect the 9 seeded violations | Built deeper eval via subagent (~30 min): 3 new cases with absence-type violations requiring enumeration across multi-statement files |

In both cases, the **eval was the bug, not the skill**. The
skill-iteration loop in Phase 4 can't escape the dead zone; it just
exits with "baseline >= 0.95, success" without measuring any uplift.

### Phase 3.5 design

```text
Phase 3.5 — Eval-readiness loop (NEW, between Phase 3 and Phase 4)

while baseline NOT IN (0.50, 0.95):
  if baseline >= 0.95:
    dispatch eval-iteration subagent with prompt:
      "Add 2-3 cases targeting absence-type rules / failure modes
       not yet exercised. Realistic seedings, force enumeration.
       Don't touch existing cases."
  elif baseline < 0.50:
    options (operator-decided or auto-judged):
      a) Grader miscalibrated → run grader-vs-skill check (existing
         in Phase 4); if grader bug, fix and re-baseline
      b) Cases too contrived → simplify (remove ambiguous violations,
         tighten task descriptions)
      c) Skill genuinely doesn't address this shape → exit
         "blocked-by-skill-shape" honestly
  re-measure baseline
  abort if iteration count > 3 (eval is harder to converge than skill)

Then proceed to Phase 4 unchanged.

Cost: ~$1.00 per eval iteration (subagent + smoke check). Bounded at
3 iterations.

Convergence criterion: baseline in (0.50, 0.95). The interesting zone.
```
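
As a rough illustration of the loop's control flow, here is a minimal sketch in JavaScript. `measureBaseline` and `iterateEval` are hypothetical injected callbacks standing in for the real measurement and subagent dispatch, and they are synchronous here only to keep the example short. Two assumptions are worth flagging: a baseline of exactly 0.50 is treated as inside the zone (the pseudocode leaves that value unassigned), and "bounded at 3 iterations" is read as at most three eval iterations.

```javascript
// Sketch only: callbacks and return shape are hypothetical.
const FLOOR = 0.5;
const CEILING = 0.95;
const MAX_EVAL_ITERATIONS = 3;

function classifyBaseline(baseline) {
  if (baseline >= CEILING) return "ceiling";
  if (baseline < FLOOR) return "floor";
  return "interesting"; // assumption: exactly 0.50 counts as in the zone
}

/**
 * @param {() => number} measureBaseline  re-runs the suite and returns the baseline score
 * @param {(zone: "ceiling"|"floor") => boolean} iterateEval  adjusts the eval; returns
 *        false when the skill cannot address the seeded shape (option c above)
 */
function evalReadinessLoop(measureBaseline, iterateEval) {
  let baseline = measureBaseline();
  let iterations = 0;
  while (classifyBaseline(baseline) !== "interesting") {
    // Stop once the iteration budget is spent.
    if (iterations >= MAX_EVAL_ITERATIONS) {
      return { status: "aborted", baseline, iterations };
    }
    const canContinue = iterateEval(classifyBaseline(baseline));
    if (!canContinue) {
      return { status: "blocked-by-skill-shape", baseline, iterations };
    }
    iterations += 1;
    baseline = measureBaseline();
  }
  return { status: "ready", baseline, iterations };
}
```

For example, a suite whose baseline moves 1.00 → 0.97 → 0.80 across re-measurements would return `status: "ready"` after two eval iterations, and Phase 4 would then start from 0.80.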

### Why these bounds?

- **>= 0.95**: ceiling effect; uplift can't be measured because there's
no headroom. From a baseline of 0.96, even a perfect final score is
only +0.04, which wouldn't clear our existing 0.05 success threshold.
- **< 0.50**: floor effect; either the eval is broken (grader bugs,
ambiguous tasks) or the skill genuinely doesn't address the seeded
rules. In either case, the optimizer can't reliably improve.
- **(0.50, 0.95)**: the optimizer has clear signal. Both successful
iteration and lack-of-improvement are interpretable.
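
The ceiling argument is simple arithmetic on headroom, sketched below. The function names are hypothetical; the 0.05 threshold is the existing success threshold mentioned above, and scores are assumed to be capped at 1.00.

```javascript
// Sketch only: function names are hypothetical.
const SUCCESS_THRESHOLD = 0.05; // the existing uplift threshold for "success"

// Largest uplift still achievable from a baseline, given scores cap at 1.00.
// Computed in integer hundredths to avoid floating-point noise such as 1 - 0.96.
function maxUplift(baseline) {
  return (100 - Math.round(baseline * 100)) / 100;
}

// True if a perfect final score would reach the success threshold.
function canClearThreshold(baseline) {
  return maxUplift(baseline) >= SUCCESS_THRESHOLD;
}

console.log(maxUplift(0.96), canClearThreshold(0.96)); // 0.04 false
console.log(maxUplift(0.56), canClearThreshold(0.56)); // 0.44 true
```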

## Combined v1.3 architecture

```text
0. Research upstream → context file (NEW)
1. Discover skill, classify
2. Build initial suite
3. Measure baseline
3.5 Eval-readiness loop (NEW):
      while baseline NOT IN (0.50, 0.95): iterate eval
4. Skill-iteration loop (existing):
      while uplift < 0.05 AND iterations < 2: iterate skill
5. Re-check baseline (did eval drift after skill change?)
6. Package
```

## Implementation cost

| Component | Effort | Cost per pilot run |
|---|---|---|
| Phase 0 research subagent | ~1 day to write the prompt template + repo-detection logic | +$0.50–$1.00 |
| Phase 3.5 eval-iteration subagent | ~2 days to write the subagent prompt + integration into the wrapper loop | +$1.00 per eval iteration (bounded at 3) |
| Wrapper integration | ~1 day for new flags (`--refresh-context`, `--max-eval-iterations`), result aggregation, telemetry | n/a |
| Testing on 5 representative skills | ~1 day | ~$10 total |

**Total v1.3 build cost:** ~5 days of work + ~$15 of pilot runs to
validate.

**Per-pilot incremental cost:** ~$0.50–$4.00 over v1.2.1 (Phase 0 plus
0–3 eval iterations at ~$1.00 each). Most skills should converge in
0–1 eval iterations, i.e. ~$0.50–$2.00.
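
To make the per-pilot arithmetic easy to re-check, here is a minimal sketch that sums only the per-component figures from the table above. The function and constant names are hypothetical, and amounts are kept in cents to avoid floating-point drift.

```javascript
// Sketch only: figures are the table's estimates; names are hypothetical.
const PHASE0_COST_CENTS = { min: 50, max: 100 }; // +$0.50 to $1.00 per skill
const EVAL_ITERATION_COST_CENTS = 100;           // +$1.00 per eval iteration
const MAX_EVAL_ITERATIONS = 3;                   // Phase 3.5 bound

// Incremental cost over v1.2.1, in USD, for a given number of eval iterations.
function incrementalCostUsd(evalIterations) {
  if (!Number.isInteger(evalIterations) || evalIterations < 0 || evalIterations > MAX_EVAL_ITERATIONS) {
    throw new RangeError(`evalIterations must be an integer in 0..${MAX_EVAL_ITERATIONS}`);
  }
  const evalCents = evalIterations * EVAL_ITERATION_COST_CENTS;
  return {
    min: (PHASE0_COST_CENTS.min + evalCents) / 100,
    max: (PHASE0_COST_CENTS.max + evalCents) / 100,
  };
}

console.log(incrementalCostUsd(0)); // { min: 0.5, max: 1 }
console.log(incrementalCostUsd(3)); // { min: 3.5, max: 4 }
```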

## Migration / backwards compatibility

- v1.2.1 wrapper continues to work standalone (`--context` flag is
preserved).
- v1.3 is opt-in via a new flag, e.g. `--research` to enable Phase 0
and `--auto-eval` to enable Phase 3.5. Default off until validated.
- Once validated, defaults flip to on; operator can opt out via
`--no-research` / `--no-auto-eval`.
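
One way the opt-in/opt-out rollout could be resolved from the command line is sketched below. The flag names are the ones proposed above; the `resolvePhases` helper and the `validated` switch are hypothetical, and the sketch assumes an explicit `--no-*` flag always wins over its positive counterpart.

```javascript
// Sketch only: the helper and its options are hypothetical.
function resolvePhases(argv, { validated = false } = {}) {
  const has = (flag) => argv.includes(flag);
  // Before validation the new phases default to off; afterwards they default to on.
  const byDefault = validated;
  return {
    research: has("--no-research") ? false : has("--research") ? true : byDefault,
    autoEval: has("--no-auto-eval") ? false : has("--auto-eval") ? true : byDefault,
  };
}

console.log(resolvePhases(["--research"]));
console.log(resolvePhases(["--no-auto-eval"], { validated: true }));
```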

## Open questions

1. **Research-subagent prompt template** — should the Phase 0 subagent
prompt be skill-classification-aware? E.g. ask different questions
for code-reviewer vs tool-use vs document-producer skills. Probably
yes, but adds template branching complexity.
2. **Eval-iteration subagent prompt template** — same question. The
"what makes a harder case" guidance differs sharply by skill type.
3. **When to refuse eval iteration** — if baseline is at 1.00 because
the skill genuinely is excellent at its job, we shouldn't fabricate
harder cases. How does Phase 3.5 distinguish "ceiling because skill
is good" from "ceiling because eval is shallow"?
- One heuristic: if the existing eval already exercises the skill's
stated value prop (per the SKILL.md description), assume good. If
it tests only mechanical command presence, assume shallow.
- This needs a "value-prop coverage" check in Phase 3.5, ideally
read from the skill's frontmatter description.
4. **Cost ceiling** — Phase 0 + Phase 3.5 each cost ~$1; Phase 4 costs
$1–3. v1.3 raises typical pilot cost from ~$2 (v1.2.1) to ~$3–6.
Still within the $10 wrapper budget but worth keeping under
observation.
5. **When to accept lossy reshape** — supabase v1.2.1 forced reshape
from "two-pass meta-workflow" into "concrete SQL anti-pattern with
`**Incorrect**`/`**Correct**` blocks". Worked beautifully. Will
this transfer to other skills, or did we get lucky with supabase's
tight `_template.md`? Probably needs more pilots before generalizing.

## Open architectural questions (longer-term)

- **Should the auto-pilot also produce the AGENTS.md/README.md mirrors
for repos with multi-file convention (PR #23 shape)?** Currently
manual at PR-draft time. Could be a separate "packaging" subagent.
- **Should we treat upstream PR-submission as a phase too (Phase 6)?**
That is, fork-clone-push-create-PR automation. It would need operator
gating for high-visibility actions, but is otherwise plausible.
- **Can the research subagent be made repo-agnostic?** Right now we
assume a "skill repo" structure. For repos with a non-standard layout
(vendored skills, monorepos, etc.), the research needs different
patterns.

## Provenance

This design is grounded in the v1.2.1 pilot session captured in:

- `docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md`
- (pending) `docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-*.md`
- (pending) `docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-*.md`
- `tools/auto-improve-contexts/{vercel-web-interface-guidelines,
vercel-agent-browser, supabase-postgres-best-practices}.md`
- Eval branches: `eval/auto-pilot/web-design-guidelines`,
`eval/auto-pilot/agent-browser` (in flight),
`eval/auto-pilot/supabase-postgres-best-practices-v2` (in flight).
109 changes: 109 additions & 0 deletions docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md
@@ -0,0 +1,109 @@
# Auto-improve-skill pilot summary — 2026-05-08

## Setup

Built a `tools/auto-improve-skill.mjs` wrapper + `tools/auto-improve-skill-prompt.md` template.
Operator says "optimize `<slug>`"; orchestrator runs the wrapper via `Bash run_in_background`,
the inner `claude -p` agent does the entire find → eval → diagnose → improve → package loop,
writes `examples/workbench/<skill-id>/analysis.md`, exits.

Branch: `feat/auto-improve-skill` (wrapper + prompt). Per-pilot output on `eval/auto-pilot/<skill-id>`.

## Three pilot runs

Run sequentially-ish: pilot #1 in main worktree, pilots #2 and #3 in parallel via `git worktree`
in separate working folders. Three providers × three trials × N cases per pilot.

| Skill | Classification | Status | Baseline | Final | Uplift | Iter | Plan-cost | OpenRouter |
|---|---|---|---|---|---|---|---|---|
| `vercel-labs/agent-browser/agent-browser` | tool-use | success | 0.56 | 1.00 | +0.44 | 1 | $3.15 | ~$2.80 |
| `supabase/agent-skills/supabase-postgres-best-practices` | code-reviewer | success | 0.54 | 0.86 | +0.32 | 1 | $0 | ~$2.40 |
| `anthropics/skills/pdf` | document-producer | success | 1.00 | 1.00 | +0.00 | 0 | $0 | ~$1.40 |

3/3 succeeded. Each surfaced a distinct success path:

- **agent-browser**: auto-pilot diagnosed that its own grader was over-specified (required `snapshot` for non-interactive ops, but the skill says CSS selectors are valid). Demoted the grader, +0.44 uplift mostly from grader correction. Also proposed a small additive "Quick task reference" section to upstream SKILL.md.
- **supabase**: 9 SQL violations seeded (FK indexes, RLS, covering indexes, etc.). Auto-pilot first self-corrected its grader (line tolerance ±3 → ±8, added keyword variants), then independently rediscovered the same **two-pass workflow** pattern we found manually for web-design-guidelines (pass 1 = visible token misuse, pass 2 = absence checks). Real upstream proposal generated.
- **pdf**: baseline already 1.00, auto-pilot triggered the "≥0.95 → exit clean, no proposal" path correctly. Did NOT manufacture problems. Noticed and noted that upstream's REFERENCE.md / FORMS.md links are 404.

## Costs

- OpenRouter (matrix runs): ~$6.60 total across 3 pilots.
- Plan budget (the `total_cost_usd` self-reported by the inner `claude -p`): only pilot #1 hit the cap.
Pilot #1's first attempt was blocked at $3.42 by the docker-permissions issue; pilot #1c, run with
`--budget 15`, settled at $3.15. Pilots #2 and #3 reported $0 (likely under the tracking floor,
or they didn't iterate enough to register).
- Wall clock: ~50 min for 3 parallel pilots (vs ~150 min sequential).

## Auto-pilot capabilities validated

1. **Correct skill-shape classification** in all 3 cases (`tool-use`, `code-reviewer`, `document-producer`).
2. **Self-correction of own grader bugs** before diagnosing the underlying skill — happened in 2 of 3 pilots without operator nudging. Same patterns we manually applied (line-tolerance widening, hyphenated regex variants, keyword alternations).
3. **Pattern transfer**: the auto-pilot rediscovered the "two-pass workflow for absence-type rules" insight on supabase — a different skill in a different rule space — confirming the pattern generalizes.
4. **Clean exit on already-good skills**: pdf ran 36/36 trials passing at baseline; auto-pilot did not manufacture changes.
5. **Distinguishing skill problem from grader problem**: agent-browser caught grader-over-specification, separated it from skill quality.

## Issues found in v1 of the auto-pilot

1. **"Always: commit" step unreliable.** Pilots #1b and #2 didn't reach it — case files were left untracked in the worktree. Fix: hoist the commit step earlier (right after analysis.md is written), or split the prompt into two `claude -p` invocations (build + analyze).
2. **`--max-budget-usd 3.50` is too tight** for runs that need any real iteration. Pilot #1's first real-data attempt hit the cap mid-modification. Bumping to $15 worked. Sensible default for v2: $7-10.
3. **Phase 4 grader-fix iteration eats one of the two iteration slots.** The agent often spends iteration 1 fixing graders and only has one shot at modifying the skill. Fix: pre-bake known grader-tuning patterns into `_grader-utils.mjs` so the agent doesn't have to discover them, or count grader-only fixes separately from skill-modification iterations.
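
The second fix for issue 3 (counting grader-only fixes separately) could look roughly like the sketch below. Everything here is hypothetical: the helper name, the `"skill"`/`"grader"` kinds, and especially the grader limit of 2, which is an illustrative value rather than a number from the pilots. Only the skill limit of 2 reflects the existing Phase 4 bound.

```javascript
// Sketch only: helper name, kinds, and the grader limit are hypothetical.
function createIterationBudget(limits = { skill: 2, grader: 2 }) {
  const used = { skill: 0, grader: 0 };
  return {
    // Try to consume one iteration of the given kind; returns false if exhausted.
    spend(kind) {
      if (!(kind in used)) throw new Error(`unknown iteration kind: ${kind}`);
      if (used[kind] >= limits[kind]) return false;
      used[kind] += 1;
      return true;
    },
    // Iterations of this kind still available.
    remaining(kind) {
      if (!(kind in used)) throw new Error(`unknown iteration kind: ${kind}`);
      return limits[kind] - used[kind];
    },
  };
}

const budget = createIterationBudget();
budget.spend("grader"); // a grader-only fix no longer uses a skill slot
console.log(budget.remaining("skill")); // 2
```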

## Patterns we should bake into v2

From pilots and prior manual runs, these recurring techniques are stable enough to embed as defaults:

**Optimizing patterns** (bake into prompt as Phase-4 priors):

- Two-pass workflow (pass 1 visible / pass 2 absence) for code-reviewer skills
- Per-element checklists for skills with rule-by-element structure
- BAD/GOOD examples for anti-pattern and absence-type rules
- "Verify-tool-installed" nudge for tool-use skills (agents fall back to `curl`/`npm i`)

**Grader-reliability patterns** (bake into `_grader-utils.mjs`):

- Default `±5–8` line tolerance
- Hyphen-tolerant regex (`/empty[-\s]+state/`)
- Per-finding-line keyword matching
- Multiple keyword variants (`/cover/i` for both "covering" and "does not cover")
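
A minimal sketch of what these grader helpers might look like follows. It is not the contents of `_grader-utils.mjs`: the function names and the finding/expectation object shapes are assumptions, the default tolerance of 8 is simply the upper end of the range above, and "per-finding-line keyword matching" is read here as requiring the keyword on the same finding whose line number matched.

```javascript
// Sketch only: helper names and object shapes are hypothetical.
const DEFAULT_LINE_TOLERANCE = 8; // upper end of the ±5–8 default

// True if the reported line is within the tolerance of the seeded line.
function lineMatches(reportedLine, expectedLine, tolerance = DEFAULT_LINE_TOLERANCE) {
  return Math.abs(reportedLine - expectedLine) <= tolerance;
}

// Escape regex metacharacters so arbitrary phrases can be embedded safely.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a hyphen-tolerant, case-insensitive pattern:
// "empty state" matches "empty state", "empty-state", and "Empty  State".
function phrasePattern(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegExp);
  return new RegExp(words.join("[-\\s]+"), "i");
}

// A finding counts only if it is near the seeded line AND its own text
// contains at least one of the accepted keyword variants.
function findingMatches(finding, expected) {
  return (
    lineMatches(finding.line, expected.line) &&
    expected.keywords.some((re) => re.test(finding.text))
  );
}

console.log(phrasePattern("empty state").test("Missing empty-state handling")); // true
```

Note that the patterns are built without the `g` flag, so repeated `.test()` calls do not carry `lastIndex` state between findings.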

**Default seeded violation types** (bake into Phase-2 instructions):

- For code-reviewer: ≥1 visible-token, ≥1 missing-attribute, ≥1 missing-branch, ≥1 anti-pattern, ≥1 state-machine
- For tool-use: ≥1 reaches-for-fallback, ≥1 wrong-flag, ≥1 missing-step
- For document-producer: ≥1 missing-field, ≥1 wrong-format, ≥1 edge-case-input
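
These defaults lend themselves to a lookup table plus a coverage check that Phase 2 could run before measuring a baseline. The sketch below is illustrative only: the type names are taken from the list above, while the `violationType` field on each case and the function name are assumptions about a suite format this document does not specify.

```javascript
// Sketch only: the case shape ({ violationType }) is a hypothetical suite format.
const REQUIRED_VIOLATION_TYPES = {
  "code-reviewer": ["visible-token", "missing-attribute", "missing-branch", "anti-pattern", "state-machine"],
  "tool-use": ["reaches-for-fallback", "wrong-flag", "missing-step"],
  "document-producer": ["missing-field", "wrong-format", "edge-case-input"],
};

// Return the required violation types that no case in the suite exercises yet.
function missingViolationTypes(classification, cases) {
  const required = REQUIRED_VIOLATION_TYPES[classification];
  if (!required) throw new Error(`unknown classification: ${classification}`);
  const present = new Set(cases.map((c) => c.violationType));
  return required.filter((type) => !present.has(type));
}

console.log(missingViolationTypes("tool-use", [{ violationType: "wrong-flag" }]));
// [ 'reaches-for-fallback', 'missing-step' ]
```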

## Decision points for the team

1. **Continue scaling.** With these results, "optimize 10 skills" is a sequential loop the
orchestrator already supports (just call the wrapper N times). With worktrees, N=3 in
parallel is also straightforward. Cost per skill ~$2-3 OpenRouter + plan-tokens.

2. **Tighten the prompt before scaling.** The "Always: commit" issue and the budget-too-tight
issue are real and would cost a fraction of one pilot to fix. ~30 min of work for v2.

3. **Build the lessons doc.** A `tools/auto-improve-skill-lessons.md` referenced by the
prompt as Phase-4 prior, updated after every pilot. Compounds: pilot N benefits from
patterns 1..N-1. Not started; sub-project for after the next batch.

4. **Skill-batch parallelism.** Worktree-per-pilot worked. For 10 skills, 3-way parallel
would take 4 batches (~3–3.5 hours at ~50 min per batch). 5-way is also feasible if the
dev machine has the resources.
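
The batch estimate is a ceiling division, sketched here under the assumption that every batch takes about as long as the ~50 min observed for the 3-pilot parallel run. The function name and the constant per-batch duration are illustrative simplifications.

```javascript
// Sketch only: assumes each batch takes roughly the observed ~50 minutes.
function planBatches(skillCount, parallelism, minutesPerBatch = 50) {
  if (skillCount < 0 || parallelism < 1) {
    throw new RangeError("skillCount must be >= 0 and parallelism >= 1");
  }
  const batches = Math.ceil(skillCount / parallelism);
  return { batches, wallClockMinutes: batches * minutesPerBatch };
}

console.log(planBatches(10, 3)); // { batches: 4, wallClockMinutes: 200 }
console.log(planBatches(10, 5)); // { batches: 2, wallClockMinutes: 100 }
```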

## Reproducing the pilots

```bash
cd /path/to/skill-optimizer   # your local checkout of the repo
git checkout feat/auto-improve-skill
node tools/auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--budget 15]

# Output: examples/workbench/<skill-id>/{analysis.md, suite.yml, ...}
# Branch: eval/auto-pilot/<skill-id>
```

For parallel runs, use git worktrees:

```bash
git worktree add ../wt-pilot-2 -b auto-pilot/wt-2 feat/auto-improve-skill
cd ../wt-pilot-2 && node tools/auto-improve-skill.mjs <slug-2> --budget 15
```