diff --git a/docs/auto-improve-skill-v1.3-design.md b/docs/auto-improve-skill-v1.3-design.md new file mode 100644 index 0000000..33f7aa4 --- /dev/null +++ b/docs/auto-improve-skill-v1.3-design.md @@ -0,0 +1,237 @@ +# auto-improve-skill v1.3 — design proposal + +**Status:** draft, written 2026-05-12 during the v1.2.1 PR-prep session. +**Audience:** team review before implementation. +**Tracking:** the in-flight v1.2.1 pilot work (web-design-guidelines / +agent-browser / supabase) is the empirical basis for this proposal. + +## Executive summary + +v1.3 adds two structural phases to the auto-improve-skill pipeline, +both motivated by failure modes observed across 4 v1.2.1 pilots: + +1. **Phase 0 — Research-first context.** A research subagent reads the + target upstream repo's contribution conventions, frontmatter spec, + prefix taxonomy, and merged-PR shape patterns, and writes a context + file that v1.2.1's `--context` flag consumes. Without this, the + auto-pilot produces output that requires manual reformulation + before submission. +2. **Phase 3.5 — Eval-readiness loop.** The pipeline iterates on the + eval (seed harder/simpler cases) until baseline lands in the + "interesting zone" `(0.50, 0.95)`. Without this, baselines saturate + at 1.00 (no headroom to demonstrate uplift) or floor at <0.50 (skill + shape blocks measurement). + +The skill-iteration loop (current Phase 4) is unchanged. + +## Lesson 1 — Research-first context is mandatory + +### Evidence (4 pilots this session) + +| Skill | Without context | With researched context | +|---|---|---| +| web-design-guidelines | Manual proposal needed retargeting (SKILL.md→command.md), reformulation across 3 stylistic siblings, frontmatter mismatch. Manual labor: ~2 hours per PR. | Auto-pilot produced a clean, mergeable diff to the right file in the right voice. Manual labor: ~10 min mirror to AGENTS.md/README.md. | +| agent-browser | Auto-pilot proposed editing `skills/agent-browser/SKILL.md`. 
Per upstream `AGENTS.md`, that file is intentionally a discovery stub; real content lives at `skill-data/core/SKILL.md`. Manual retarget required. | (Pending — pilot in flight; context file says edit `agent-browser-core.md` and produced output names `before-skill-data-core-SKILL.md`.) | +| supabase (batch-1) | Produced shape-novel `references/review-...md` with non-existent prefix (`review-`), missing `impactDescription` frontmatter field, philosophical-style content (MEDIUM-HIGH rejection risk per CONTRIBUTING patterns). | Auto-pilot reshaped into convention-perfect SQL anti-pattern under correct prefix (`monitor-`), full 4-field frontmatter, `**Incorrect**`/`**Correct**` SQL blocks per `_template.md`, trailing `Reference:` link. Zero manual reformulation needed. | + +### Generalization + +The auto-pilot is good at *finding what to change* (which rules, which +files, which absence-type gaps). It is bad at *fitting upstream +conventions*: frontmatter schemas, file-location norms, prefix +taxonomies, additive-only rules, "Discussion-first" gates, voice +consistency. Conventions are repo-specific tribal knowledge that +cannot be inferred from reading the SKILL.md alone. + +### Phase 0 design + +```text +Phase 0 — Research upstream (NEW, runs before Phase 1) + +Inputs: + - target slug // + +Subtasks (executed by a research subagent): + 1. Repo metadata: license, CLA, default branch, recent activity + 2. Read CONTRIBUTING.md, AGENTS.md, .github/PULL_REQUEST_TEMPLATE.md, + CODEOWNERS, .github/workflows/*.yml + 3. Read skill-specific convention files: _contributing.md, + _template.md, _sections.md (or equivalents) + 4. Read sanity-test source if present (don't trust prior assumptions + about what CI validates) + 5. Sample last 10 merged PRs to the target skill (or repo) for shape: + file count, body shape, conventional-commit usage, scope sizing + 6. 
Sample last 5 closed-without-merge PRs for rejection signals: + "Discussion-first gate violated", "shape-novel content rejected", + etc. + 7. Identify other consumers (gh search for raw URL references; check + for install scripts; check repo's own README for distribution + channels) + +Output: + tools/auto-improve-contexts/-.md + - Repository facts (license, CI, maintainers, merge style) + - Hard constraints (additive-only, file-location, prefix taxonomy, + forbidden modifications) + - Frontmatter spec (exact required fields + allowed values) + - Content shape template (copy-and-fill) + - Optimization target file (where the skill change should land) + - Risk profile (LOW/MEDIUM/HIGH + reasons) + - Pre-submit checklist (what auto-pilot must verify) + - Useful URLs + +Cost: ~$0.50–$1.00 per skill (single subagent invocation). + +Caching: context files are committed to the repo. Re-running on the +same skill within 30 days: skip Phase 0, reuse cached context (with +explicit `--refresh-context` flag to force re-research). + +Operator override: `--context ` flag continues to work; if +provided, Phase 0 is skipped. +``` + +## Lesson 2 — Two-loop iteration: eval AND skill + +### Evidence + +| Skill | Initial baseline | Failure mode | Manual fix | +|---|---|---|---| +| agent-browser (Tier-0 only) | 0.97 | Shallow eval — only graded command-presence, not the skill's actual value prop (ref-based interaction, snapshot interpretation, multi-step state) | Built Tier-1 cases via subagent (~half-day): pre-recorded fixtures, stateful fake CLI, 4 new cases targeting the differentiator | +| supabase (calibrated graders, frontier models) | 1.00 | Eval saturated; calibrated graders + capable models perfect-detect the 9 seeded violations | Built deeper eval via subagent (~30 min): 3 new cases with absence-type violations requiring enumeration across multi-statement files | + +In both cases, the **eval was the bug, not the skill**. 
The skill- +iteration loop in Phase 4 can't escape the dead zone — it just exits +"baseline >= 0.95, success" with no measurement. + +### Phase 3.5 design + +```text +Phase 3.5 — Eval-readiness loop (NEW, between Phase 3 and Phase 4) + +while baseline NOT IN (0.50, 0.95): + if baseline >= 0.95: + dispatch eval-iteration subagent with prompt: + "Add 2-3 cases targeting absence-type rules / failure modes + not yet exercised. Realistic seedings, force enumeration. + Don't touch existing cases." + elif baseline < 0.50: + options (operator-decided or auto-judged): + a) Grader miscalibrated → run grader-vs-skill check (existing + in Phase 4); if grader bug, fix and re-baseline + b) Cases too contrived → simplify (remove ambiguous violations, + tighten task descriptions) + c) Skill genuinely doesn't address this shape → exit + "blocked-by-skill-shape" honestly + re-measure baseline + abort if iteration count > 3 (eval is harder to converge than skill) + +Then proceed to Phase 4 unchanged. + +Cost: ~$1.00 per eval iteration (subagent + smoke check). Bounded at +3 iterations. + +Convergence criterion: baseline in (0.50, 0.95). The interesting zone. +``` + +### Why these bounds? + +- **>= 0.95**: ceiling effect; can't measure uplift because there's no + headroom. Even +0.04 wouldn't clear our existing 0.05 success + threshold. +- **< 0.50**: floor effect; either the eval is broken (grader bugs, + ambiguous tasks) or the skill genuinely doesn't address the seeded + rules. In either case, the optimizer can't reliably improve. +- **(0.50, 0.95)**: the optimizer has clear signal. Both successful + iteration and lack-of-improvement are interpretable. + +## Combined v1.3 architecture + +```text +0. Research upstream → context file (NEW) +1. Discover skill, classify +2. Build initial suite +3. Measure baseline +3.5 Eval-readiness loop (NEW): + while baseline NOT IN (0.50, 0.95): iterate eval +4. 
Skill-iteration loop (existing): + while uplift < 0.05 AND iterations < 2: iterate skill +5. Re-check baseline (did eval drift after skill change?) +6. Package +``` + +## Implementation cost + +| Component | Effort | Cost per pilot run | +|---|---|---| +| Phase 0 research subagent | ~1 day to write the prompt template + repo-detection logic | +$0.50–$1.00 | +| Phase 3.5 eval-iteration subagent | ~2 days to write the subagent prompt + integration into the wrapper loop | +$1.00 per eval iteration (bounded at 3) | +| Wrapper integration | ~1 day for new flags (`--refresh-context`, `--max-eval-iterations`), result aggregation, telemetry | n/a | +| Testing on 5 representative skills | ~1 day | ~$10 total | + +**Total v1.3 build cost:** ~5 days of work + ~$15 of pilot runs to +validate. + +**Per-pilot incremental cost:** ~$1.50–$5.00 over v1.2.1, depending on +how many eval iterations are needed (most skills will converge in 0–1). + +## Migration / backwards compatibility + +- v1.2.1 wrapper continues to work standalone (`--context` flag is + preserved). +- v1.3 is opt-in via a new flag, e.g. `--research` to enable Phase 0 + and `--auto-eval` to enable Phase 3.5. Default off until validated. +- Once validated, defaults flip to on; operator can opt out via + `--no-research` / `--no-auto-eval`. + +## Open questions + +1. **Research-subagent prompt template** — should the Phase 0 subagent + prompt be skill-classification-aware? E.g. ask different questions + for code-reviewer vs tool-use vs document-producer skills. Probably + yes, but adds template branching complexity. +2. **Eval-iteration subagent prompt template** — same question. The + "what makes a harder case" guidance differs sharply by skill type. +3. **When to refuse eval iteration** — if baseline is at 1.00 because + the skill genuinely is excellent at its job, we shouldn't fabricate + harder cases. How does Phase 3.5 distinguish "ceiling because skill + is good" from "ceiling because eval is shallow"? 
+ - One heuristic: if the existing eval already exercises the skill's + stated value prop (per the SKILL.md description), assume good. If + it tests only mechanical command presence, assume shallow. + - This needs a "value-prop coverage" check in Phase 3.5, ideally + read from the skill's frontmatter description. +4. **Cost ceiling** — Phase 0 + Phase 3.5 each cost ~$1; Phase 4 costs + $1–3. v1.3 raises typical pilot cost from ~$2 (v1.2.1) to ~$3–6. + Still within the $10 wrapper budget but worth keeping under + observation. +5. **When to accept lossy reshape** — supabase v1.2.1 forced reshape + from "two-pass meta-workflow" into "concrete SQL anti-pattern with + `**Incorrect**`/`**Correct**` blocks". Worked beautifully. Will + this transfer to other skills, or did we get lucky with supabase's + tight `_template.md`? Probably needs more pilots before generalizing. + +## Open architectural questions (longer-term) + +- **Should the auto-pilot also produce the AGENTS.md/README.md mirrors + for repos with multi-file convention (PR #23 shape)?** Currently + manual at PR-draft time. Could be a separate "packaging" subagent. +- **Should we treat upstream PR-submission as a phase too (Phase 6)?** + i.e. fork-clone-push-create-PR automation. Operator-gated for high- + visibility actions, but otherwise plausible. +- **Can the research subagent be made repo-agnostic?** Right now we + assumed a "skill repo" structure. For repos with non-standard layout + (vendored skills, monorepos, etc.) the research needs different + patterns. 
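For reference, the Phase 3.5 eval-readiness loop proposed in this document could be sketched in JavaScript roughly as follows. This is a minimal sketch under stated assumptions, not the wrapper's implementation: `measureBaseline`, `hardenEval`, and `repairEval` are hypothetical callbacks, the `eval-did-not-converge` status name is invented for illustration, and a baseline of exactly 0.50 is treated as inside the interesting zone because the design's branches only act on `< 0.50` and `>= 0.95`.

```javascript
// Sketch of the Phase 3.5 eval-readiness loop (all names are hypothetical).
const ZONE_MIN = 0.5; // below this: floor effect
const ZONE_MAX = 0.95; // at or above this: ceiling effect
const MAX_EVAL_ITERATIONS = 3; // the design bounds the loop at 3 iterations

// Maps a baseline score to the zone the design doc describes.
function classifyBaseline(baseline) {
  if (baseline >= ZONE_MAX) return "ceiling";
  if (baseline < ZONE_MIN) return "floor";
  return "interesting";
}

// Iterates on the eval (not the skill) until the baseline is measurable.
// measureBaseline(): Promise<number>
// hardenEval(): Promise<void>    adds harder cases (ceiling branch)
// repairEval(): Promise<string>  fixes graders or simplifies cases; may
//                                resolve to "blocked-by-skill-shape"
async function evalReadinessLoop({ measureBaseline, hardenEval, repairEval }) {
  let baseline = await measureBaseline();
  let iterations = 0;
  while (classifyBaseline(baseline) !== "interesting") {
    if (iterations >= MAX_EVAL_ITERATIONS) {
      return { status: "eval-did-not-converge", baseline, iterations };
    }
    if (classifyBaseline(baseline) === "ceiling") {
      await hardenEval();
    } else {
      const outcome = await repairEval();
      if (outcome === "blocked-by-skill-shape") {
        return { status: outcome, baseline, iterations };
      }
    }
    iterations += 1;
    baseline = await measureBaseline();
  }
  return { status: "ready", baseline, iterations };
}
```

Keeping the zone classification in its own function would let the same check be reused when the baseline is re-measured after skill changes.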
+ +## Provenance + +This design is grounded in the v1.2.1 pilot session captured in: + +- `docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md` +- (pending) `docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-*.md` +- (pending) `docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-*.md` +- `tools/auto-improve-contexts/{vercel-web-interface-guidelines, + vercel-agent-browser, supabase-postgres-best-practices}.md` +- Eval branches: `eval/auto-pilot/web-design-guidelines`, + `eval/auto-pilot/agent-browser` (in flight), + `eval/auto-pilot/supabase-postgres-best-practices-v2` (in flight). diff --git a/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md b/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md new file mode 100644 index 0000000..e639c1f --- /dev/null +++ b/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md @@ -0,0 +1,109 @@ +# Auto-improve-skill pilot summary — 2026-05-08 + +## Setup + +Built a `tools/auto-improve-skill.mjs` wrapper + `tools/auto-improve-skill-prompt.md` template. +Operator says "optimize ``"; orchestrator runs the wrapper via `Bash run_in_background`, +the inner `claude -p` agent does the entire find → eval → diagnose → improve → package loop, +writes `examples/workbench//analysis.md`, exits. + +Branch: `feat/auto-improve-skill` (wrapper + prompt). Per-pilot output on `eval/auto-pilot/`. + +## Three pilot runs + +Run sequentially-ish: pilot #1 in main worktree, pilots #2 and #3 in parallel via `git worktree` +in separate working folders. Three providers × three trials × N cases per pilot. 
+ +| Skill | Classification | Status | Baseline | Final | Uplift | Iter | Plan-cost | OpenRouter | +|---|---|---|---|---|---|---|---|---| +| `vercel-labs/agent-browser/agent-browser` | tool-use | success | 0.56 | 1.00 | +0.44 | 1 | $3.15 | ~$2.80 | +| `supabase/agent-skills/supabase-postgres-best-practices` | code-reviewer | success | 0.54 | 0.86 | +0.32 | 1 | $0 | ~$2.40 | +| `anthropics/skills/pdf` | document-producer | success | 1.00 | 1.00 | +0 | 0 | $0 | ~$1.40 | + +3/3 succeeded. Each surfaced a distinct success path: + +- **agent-browser**: auto-pilot diagnosed that its own grader was over-specified (required `snapshot` for non-interactive ops, but the skill says CSS selectors are valid). Demoted the grader, +0.44 uplift mostly from grader correction. Also proposed a small additive "Quick task reference" section to upstream SKILL.md. +- **supabase**: 9 SQL violations seeded (FK indexes, RLS, covering indexes, etc.). Auto-pilot first self-corrected its grader (line tolerance ±3 → ±8, added keyword variants), then independently rediscovered the same **two-pass workflow** pattern we found manually for web-design-guidelines (pass 1 = visible token misuse, pass 2 = absence checks). Real upstream proposal generated. +- **pdf**: baseline already 1.00, auto-pilot triggered the "≥0.95 → exit clean, no proposal" path correctly. Did NOT manufacture problems. Noticed and noted that upstream's REFERENCE.md / FORMS.md links are 404. + +## Costs + +- OpenRouter (matrix runs): ~$6.60 total across 3 pilots. +- Plan budget (the inner `claude -p` self-reported `total_cost_usd`): only #1 hit the cap. + Pilot #1 first attempt blocked at $3.42 from the docker-permissions issue. Pilot #1c with + `--budget 15` settled at $3.15. Pilots #2 and #3 reported $0 (likely under tracking floor + or didn't iterate enough to register). +- Wall clock: ~50 min for 3 parallel pilots (vs ~150 min sequential). + +## Auto-pilot capabilities validated + +1. 
**Correct skill-shape classification** in all 3 cases (`tool-use`, `code-reviewer`, `document-producer`). +2. **Self-correction of own grader bugs** before diagnosing the underlying skill — happened in 2 of 3 pilots without operator nudging. Same patterns we manually applied (line-tolerance widening, hyphenated regex variants, keyword alternations). +3. **Pattern transfer**: the auto-pilot rediscovered the "two-pass workflow for absence-type rules" insight on supabase — a different skill in a different rule space — confirming the pattern generalizes. +4. **Clean exit on already-good skills**: pdf ran 36/36 trials passing at baseline; auto-pilot did not manufacture changes. +5. **Distinguishing skill problem from grader problem**: agent-browser caught grader-over-specification, separated it from skill quality. + +## Issues found in v1 of the auto-pilot + +1. **"Always: commit" step unreliable.** Pilots #1b and #2 didn't reach it — case files were left untracked in the worktree. Fix: hoist the commit step earlier (right after analysis.md is written), or split the prompt into two `claude -p` invocations (build + analyze). +2. **`--max-budget-usd 3.50` is too tight** for runs that need any real iteration. Pilot #1's first real-data attempt hit the cap mid-modification. Bumping to $15 worked. Sensible default for v2: $7-10. +3. **Phase 4 grader-fix iteration eats one of the two iteration slots.** The agent often spends iteration 1 fixing graders and only has one shot at modifying the skill. Fix: pre-bake known grader-tuning patterns into `_grader-utils.mjs` so the agent doesn't have to discover them, or count grader-only fixes separately from skill-modification iterations. 
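The grader fixes mentioned above (line-tolerance widening, hyphenated regex variants, keyword alternations) could be pre-baked roughly as follows. This is a minimal sketch, not the contents of `_grader-utils.mjs`: the function names `lineWithin`, `hyphenTolerant`, and `matchesAnyKeyword`, their signatures, and the default tolerance of 8 lines are all assumptions made for illustration.

```javascript
// Hypothetical grader helpers; names and signatures are illustrative only.

// True when a reported line number is within `tolerance` lines of the
// line where the violation was seeded.
function lineWithin(reported, expected, tolerance = 8) {
  return Math.abs(reported - expected) <= tolerance;
}

// Escapes regex metacharacters so a plain phrase can be embedded in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a case-insensitive regex that accepts hyphens or whitespace
// between the words of a phrase ("empty state", "empty-state").
function hyphenTolerant(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegExp);
  return new RegExp(words.join("[-\\s]+"), "i");
}

// True when any keyword variant matches the text of a single finding line.
function matchesAnyKeyword(findingLine, variants) {
  return variants.some((variant) => variant.test(findingLine));
}
```

A grader built on these might accept a finding when `lineWithin` passes and `matchesAnyKeyword` matches that finding's own line, so it does not have to match against the whole response.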
+ +## Patterns we should bake into v2 + +From pilots and prior manual runs, these recurring techniques are stable enough to embed as defaults: + +**Optimizing patterns** (bake into prompt as Phase-4 priors): + +- Two-pass workflow (pass 1 visible / pass 2 absence) for code-reviewer skills +- Per-element checklists for skills with rule-by-element structure +- BAD/GOOD examples for anti-pattern and absence-type rules +- "Verify-tool-installed" nudge for tool-use skills (agents fall back to `curl`/`npm i`) + +**Grader-reliability patterns** (bake into `_grader-utils.mjs`): + +- Default `±5–8` line tolerance +- Hyphen-tolerant regex (`/empty[-\s]+state/`) +- Per-finding-line keyword matching +- Multiple keyword variants (`/cover/i` for both "covering" and "does not cover") + +**Default seeded violation types** (bake into Phase-2 instructions): + +- For code-reviewer: ≥1 visible-token, ≥1 missing-attribute, ≥1 missing-branch, ≥1 anti-pattern, ≥1 state-machine +- For tool-use: ≥1 reaches-for-fallback, ≥1 wrong-flag, ≥1 missing-step +- For document-producer: ≥1 missing-field, ≥1 wrong-format, ≥1 edge-case-input + +## Decision points for the team + +1. **Continue scaling.** With these results, "optimize 10 skills" is a sequential loop the + orchestrator already supports (just call the wrapper N times). With worktrees, N=3 in + parallel is also straightforward. Cost per skill ~$2-3 OpenRouter + plan-tokens. + +2. **Tighten the prompt before scaling.** The "Always: commit" issue and the budget-too-tight + issue are real and would cost a fraction of one pilot to fix. ~30 min of work for v2. + +3. **Build the lessons doc.** A `tools/auto-improve-skill-lessons.md` referenced by the + prompt as Phase-4 prior, updated after every pilot. Compounds: pilot N benefits from + patterns 1..N-1. Not started; sub-project for after the next batch. + +4. **Skill-batch parallelism.** Worktree-per-pilot worked. For 10 skills, 3-way parallel + would land in ~3-4 batches (~3 hours). 
5-way is also feasible if the dev machine has + the resources. + +## Reproducing the pilots + +```bash +cd /home/yuqing/Documents/Code/skill-optimizer +git checkout feat/auto-improve-skill +node tools/auto-improve-skill.mjs // [--budget 15] + +# Output: examples/workbench//{analysis.md, suite.yml, ...} +# Branch: eval/auto-pilot/ +``` + +For parallel runs, use git worktrees: + +```bash +git worktree add ../wt-pilot-2 -b auto-pilot/wt-2 feat/auto-improve-skill +cd ../wt-pilot-2 && node tools/auto-improve-skill.mjs --budget 15 +``` diff --git a/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md b/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md new file mode 100644 index 0000000..de6ee50 --- /dev/null +++ b/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md @@ -0,0 +1,100 @@ +# Auto-improve-skill batch 2 summary — 10 pilots, 8 success, 0 failures + +## Setup + +- **Wrapper version:** v1.1 + #3 (atomic write-and-commit, $10 default budget, lessons.md, pre-baked grader helpers) +- **Skills:** ranks 5–14 from the prioritized top-N list (skips the 4 already covered in batch 1: web-design-guidelines, agent-browser, supabase, pdf) +- **Parallelism:** 10 git worktrees, hardlinked `node_modules`, fired simultaneously +- **Wall clock:** ~50 min (bounded by the slowest pilot), down from estimated ~150 min sequential + +## Headline results + +| # | Skill | Classification | Status | Coverage | Mods | Notes | +|---|---|---|---|---|---|---| +| 1 | `anthropics/skills/pptx` | document-producer | ✅ success | 0.85 → 0.85 | 0 | grader cal raised raw 0.74 → 0.85; gpt-4o-mini fails entirely (model gap) | +| 2 | `vercel-labs/next-skills/next-best-practices` | code-reviewer | ✅ success | 0.80 → 0.975 | 0 | grader cal only — skill already strong | +| 3 | `firebase/agent-skills/firebase-auth-basics` | code-reviewer | ✅ success | 1.00 → 1.00 | 0 | reclassified from prior `tool-use` | +| 4 | `firebase/agent-skills/firebase-hosting-basics` | code-patterns | ✅ success | 
0.89 → 1.00 | 1 | Recipe A + E added a Configuration Review section | +| 5 | `expo/skills/building-native-ui` | code-patterns | ✅ success | 0.99 → 0.99 | 0 | 17/18 trials — single gpt-5-mini miss accepted as noise | +| 6 | `google-labs-code/stitch-skills/shadcn-ui` | code-patterns | ✅ success | 0.82 → 0.89 | 1 | Recipe A + D — Gemini's wrong-location miss rate dropped 100% → 0% | +| 7 | `expo/skills/native-data-fetching` | code-reviewer | ✅ success | 1.00 → 1.00 | 0 | already-good | +| 8 | `firecrawl/skills/firecrawl-build-scrape` | code-patterns | ⚠️ uplift-too-small | 0.84 → 0.89 | 2 | +0.05, exactly on threshold; gpt-4o-mini verbosity floor caps it | +| 9 | `vercel-labs/next-skills/next-upgrade` | code-reviewer | ⚠️ uplift-too-small | **0.83 → 0.76** | 2 | **regression** — modifications hurt; new failure mode surfaced | +| 10 | `github/awesome-copilot/prd` | document-producer | ✅ success | 1.00 → 1.00 | 0 | sonnet API errors, judged on 12 valid trials from gpt-5-mini + gemini | + +**8/10 success • 2/10 uplift-too-small • 0/10 blocked or budget-exceeded** + +## Cost + +- OpenRouter spend during batch: **~$21.30** ($40.65 used – $19.35 prior to batch start) +- Per-pilot avg: **$2.13** (well under the $3.50 budgeted) +- Plan-token spend (inner `claude -p`): each pilot reported between $0.00 and $1.00 — no pilot hit the $10 wrapper cap + +## What v1.1 + #3 actually delivered + +The pilots demonstrate the prompt improvements working as intended: + +1. **"Atomic write-analysis-and-commit" worked.** **All 10 inner agents committed cleanly.** No manual recovery needed (vs batch 1 where 2 of 3 needed manual commits). +2. **Recipe citations by letter.** Pilots 4, 6, 8 explicitly cited Recipe A / D / E from `lessons.md` in their analysis bullets. They didn't rediscover the patterns from scratch. +3. **"Grader-vs-skill check first" worked.** Pilots 1, 2, 4, 6, 8, 9 all did iteration 0 grader calibration before counting against their iteration budget. 
Saved meaningful budget on pilots 2, 4, 6. +4. **`looseRange` / `tolerantKeyword` pre-baked helpers** — used in graders the auto-pilot wrote without rediscovering the patterns. Several pilots had to widen specifically for gpt-4o-mini drift (range 8 → 12 or 16) which is new signal worth adding to lessons.md. +5. **"Don't manufacture problems"** worked in all 5 already-good cases (3, 5, 7, 10, plus pilot 1 after grader cal). None proposed unnecessary changes. + +## New patterns surfaced — worth adding to `lessons.md` + +### Optimization patterns + +- **(NEW) Recipe F? — Don't add bash commands for small models.** Pilot 9 added bash grep commands to `next-upgrade`'s SKILL.md. gpt-4o-mini tried to *execute* them rather than reading files, dropping coverage from 0.83 to 0.69. **Anti-pattern.** When skill is aimed at small/cheap models, prefer pure declarative wording over executable commands. + +### Failure modes + +- **CLI fabrication on "upgrade-style" skills.** gpt-4o-mini will hallucinate a `npx -upgrade` CLI for any skill whose name suggests transformation/upgrade work, then write the error message as findings. Distinct from the agent-browser `curl` fallback (where the CLI exists but the model picks the wrong tool). Worth its own anti-pattern entry. +- **Verbosity floor on gpt-4o-mini.** Confirmed across pilots 8, 9 — emits 3-4 line responses, sometimes drops trailing rules entirely. Rules requiring multi-finding output above this floor are systematically under-detected. + +### Grader patterns + +- **(NEW) Per-model line tolerance.** sonnet/gemini drift 0–3 lines; gpt-4o-mini drifts 6–15 lines. The `looseRange` default of ±8 is calibrated for the first two but undertuned for the third. Future graders should default to `looseRange(N, 12)` or use per-model tolerance maps. + +### Skill-shape edge cases + +- **Repo path conventions vary.** `expo/skills` uses `plugins/expo/skills//SKILL.md` (not the canonical `skills//SKILL.md`). 
Pilots 5 and 7 both surfaced this and adapted. Worth noting in Phase-1 instructions. + +## Branches pushed + +- `eval/auto-pilot/batch-2-2026-05-09` (consolidated, all 10 cherry-picked) +- 10 individual `eval/auto-pilot/` branches (for per-skill review) + +## What to PR upstream + +Three pilots produced real, additive proposals: + +| Skill | Uplift | Where the change goes | +|---|---|---| +| firebase-hosting-basics | 0.89 → 1.00 | `firebase/agent-skills` | +| shadcn-ui | 0.82 → 0.89 | `google-labs-code/stitch-skills` | +| firecrawl-build-scrape | 0.84 → 0.89 | `firecrawl/skills` | + +**Skip from PR queue:** + +- All 5 baseline-already-good skills (no changes warranted) +- pilot 9 (next-upgrade) — modifications regressed; needs human review or a different approach (probably "drop bash commands, use BAD/GOOD only") +- pilot 8 (firecrawl-build-scrape) is on the bubble at +0.05 — judgment call + +## Decision points for the team + +1. **Scale further.** With v1.1+#3 working, batch 3 of 10 skills should land in another ~50 min for ~$25 OpenRouter. Plenty of remaining slugs in the top-N (15–47). +2. **Lessons.md v1.2 update.** Add the patterns from this batch (CLI fabrication, gpt-4o-mini line drift, repo-path variants, "don't add bash for small models"). 30 min of doc work that compounds for batch 3. +3. **Drop gpt-4o-mini from default matrix.** Repeated capability gap (verbosity floor + CLI fabrication + line drift) is dragging multiple pilots' scores. Switching the matrix to sonnet/gemini/another-mid-tier would likely lift batch coverage by 5-10pp without any skill changes. Worth piloting. + +## Reproducing + +```bash +# This batch can be reproduced from a fresh checkout of feat/auto-improve-skill: +cd /home/yuqing/Documents/Code/skill-optimizer +git checkout feat/auto-improve-skill +node tools/auto-improve-skill.mjs + +# For parallel batches, use git worktrees (see batch script in this commit's Setup section) +``` + +Cumulative spend: $40.65 of $60 OpenRouter credits. 
diff --git a/docs/pilot-runs/README.md b/docs/pilot-runs/README.md new file mode 100644 index 0000000..de86482 --- /dev/null +++ b/docs/pilot-runs/README.md @@ -0,0 +1,39 @@ +# Auto-improve-skill pilot runs + +Summaries of batched runs of the `tools/auto-improve-skill.mjs` auto-pilot +against public agent skills from our prioritized top-N list. Each summary +documents what skills ran, what the auto-pilot proposed, what worked, what +didn't, and what changes we should make to the prompt before the next batch. + +The per-skill eval artifacts (suite, graders, vendored upstream, proposed-upstream-changes/) +live on `eval/auto-pilot/` branches and the consolidated +`eval/auto-pilot/batch--` branches. + +## Index + +- [`2026-05-08-auto-improve-pilot-summary.md`](./2026-05-08-auto-improve-pilot-summary.md) + — Batch 1, 3 skills (agent-browser, supabase-postgres-best-practices, pdf). + Validated end-to-end. 3/3 success. +- [`2026-05-09-auto-improve-batch-2-summary.md`](./2026-05-09-auto-improve-batch-2-summary.md) + — Batch 2, 10 skills (pptx, next-best-practices, firebase-auth-basics, + firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, + firecrawl-build-scrape, next-upgrade, prd). 8/10 success, 2/10 uplift-too-small. + +## How to run a new batch + +```bash +# Single skill from the main repo: +node tools/auto-improve-skill.mjs // [--budget 10] + +# Parallel batch via git worktrees: +for i in {1..N}; do + git worktree add ../wt-pilot-$i -b auto-pilot/wt-batch-$i feat/auto-improve-skill + cp -al node_modules dist ../wt-pilot-$i/ + cp .env ../wt-pilot-$i/ +done + +# Then fire one wrapper invocation per worktree in parallel. +``` + +After all pilots complete, cherry-pick each `eval/auto-pilot/` onto +a consolidated batch branch and open a PR. 
diff --git a/docs/pilot-runs/upstream-pr-conventions.md b/docs/pilot-runs/upstream-pr-conventions.md new file mode 100644 index 0000000..4d62611 --- /dev/null +++ b/docs/pilot-runs/upstream-pr-conventions.md @@ -0,0 +1,128 @@ +# Upstream PR conventions for skill repositories + +Operational guide for submitting skill-improvement PRs to upstream +maintainers. Each row was verified by reading the repo's +`AGENTS.md` / `CONTRIBUTING.md` / `.github/workflows/` + scanning the +last 5–10 merged PRs. Update this doc when we observe new patterns. + +## Quick reference + +| Repo | License | Style | Title format | Body | CI gates | CLA | +|---|---|---|---|---|---|---| +| `vercel-labs/agent-skills` | (no LICENSE) | casual | `{skill}: ` | `## Summary` + `## Test plan` | path-filtered (only fires for react-best-practices changes) | no | +| `vercel-labs/web-interface-guidelines` | MIT | terse | sentence-case freeform, optional `feat:`/`fix:` | 1–2 sentences | none (no workflows) | no | +| `vercel-labs/agent-browser` | Apache-2.0 | formal | `feat/fix/docs(scope): description` (conventional commits) | `## Summary` + `## Test plan` | Rust fmt/clippy/test + dashboard build + version-sync | no CLA bot observed | +| `supabase/agent-skills` | MIT | formal | `feat/fix/docs: description` (conventional commits, used by Release Please) | terse `## Summary` bullets | `pnpm test:sanity` only | no (CONTRIBUTING.md states MIT auto-license) | + +## Per-repo notes + +### `vercel-labs/agent-skills` + +- **Title**: `{skill-name}: ` — skill name as the scope, no + conventional-commit prefix needed. +- **Body**: Multi-section. Use `## Summary` bullets + `## Test plan` + checkboxes. 600–2500 chars is the observed norm. Claude Code footer + (`🤖 Generated with Claude Code`) is fully normalized — appears in + multiple merged PRs. +- **CI**: One workflow (`react-best-practices-ci.yml`) is path-filtered; + unless our change touches `skills/react-best-practices/**`, it won't + fire. 
Vercel deploy preview is cosmetic, not blocking. +- **Merge style**: Squash. Maintainer (`bhrigu123`) approves silently and + same-day for clean PRs. +- **PR scope**: Tight per-skill (one skill per PR). Improvements to + existing skills merge faster than new-skill additions (PR #238 + proposing a brand-new skill has sat for weeks). +- **Gotcha**: Some skills have a `.zip` alongside the directory. Not + blocking but a known convention. + +### `vercel-labs/web-interface-guidelines` + +- **Title**: Freeform sentence (e.g., `Add translate="no" guideline for + verbatim content`) or `feat:`/`fix:` prefix — both merged. +- **Body**: Minimal. PR #20 is exemplary: two sentences of rationale, no + headers. 0–400 chars is the observed norm. +- **CI**: No workflows. Zero automated checks. +- **Merge style**: Silent approve from `JohnPhamous` (Vercel staff). +- **Sync constraint**: `README.md` and `AGENTS.md` are dual copies of + the same content (one human-readable, one agent-readable). If we add + or change a guideline, **touch both files** in the same PR. PR #20 + did this; ours should too. +- **Pace**: Repo is low-traffic (48 forks, last merge ~5 weeks ago). + Expect slow response. Don't optimize for immediate merge. + +### `vercel-labs/agent-browser` + +- **Title**: Strict conventional commits — `feat(scope): description`, + `fix(scope): description`, `docs: description`. Scope is the + subsystem (`docs`, `doctor`, `native`, etc.). +- **Body**: `## Summary` (2 bullets) + `## Test plan` (2 checkboxes). + PR #1305 is a reference template. +- **CI**: Strict. Three blocking jobs (Rust fmt+clippy+test, dashboard + pnpm build, version-sync). **Docs-only and skill-data-only changes + should pass automatically**; anything touching Rust will trigger + expensive checks. +- **Merge style**: `ctate` is sole maintainer; very active, merges + same-day silently for clean PRs. 
+- **Critical gotcha**: Skill content lives at + `skill-data/core/SKILL.md`, **not** at `skills/agent-browser/SKILL.md` + (which is intentionally a thin stub per AGENTS.md). Any meaningful + skill change touches: + + 1. `skill-data/core/SKILL.md` + 2. `skill-data/core/references/*.md` (the per-rule reference docs) + 3. `README.md` + 4. The docs MDX pages + + Per AGENTS.md, omitting any of these is grounds for rejection. Use + HTML `` syntax in MDX (not markdown pipe tables). +- **PR scope**: Tight per subsystem. Docs-only changes are the + lowest-friction path — they bypass the Rust CI gates. + +### `supabase/agent-skills` + +- **Title**: Strict conventional commits — `feat: `, + `fix: `, `docs: `. Release Please uses these + to determine semver bumps. **Do not** bump `metadata.version` + manually in SKILL.md — Release Please handles it post-merge. +- **Body**: Short `## Summary` with 1–4 bullets. Link issues with + `Resolves AI-NNN` if applicable. No template. +- **CI**: One job — `Skills CI` runs `pnpm test:sanity`. Sanity tests + check that new reference files follow the `{prefix}-{name}.md` + naming convention with valid frontmatter (`title`, `impact`, `tags`). + Run `pnpm test:sanity` locally before submitting. +- **Merge style**: Squash. `gregnr` (Supabase staff) and `Rodriguespn` + (sole active community maintainer) merge in under 30 min for clean + PRs by core team members; external PRs may need a single LGTM. +- **PR scope**: Additive file change only. Add a new reference file + under `skills//references/{prefix}-{name}.md` with proper + frontmatter + Incorrect/Correct examples. CONTRIBUTING.md says + significant new skills need a prior GitHub Discussion; reference + additions don't. + +## Process for our own PRs + +For each PR we submit: + +1. **Branch** off a fresh local clone of the upstream repo, NOT off our + `examples/workbench//proposed-upstream-changes/`. Copy the + `after-*.md` content into the actual upstream file paths. +2. 
**Run any local checks** the repo requires (e.g., `pnpm test:sanity` + for supabase). +3. **Title and body** per the table above. +4. **Add the Claude Code footer** unless the repo's style sheet objects + (vercel-labs repos accept it; supabase hasn't shown a precedent + either way). +5. **Cap each PR to one skill**. If a skill has both a SKILL.md change + and a rules-doc change (as web-design-guidelines does, spanning two + repos), open two PRs and reference each from the other. + +## Reference: which repo each skill lives in + +| Our top-N skill | SKILL.md repo | Rules doc repo (if separate) | +|---|---|---| +| `vercel-labs/agent-skills/web-design-guidelines` | `vercel-labs/agent-skills` | `vercel-labs/web-interface-guidelines` | +| `vercel-labs/agent-browser/agent-browser` | `vercel-labs/agent-browser` (`skill-data/core/SKILL.md`) | n/a (inline) | +| `supabase/agent-skills/supabase-postgres-best-practices` | `supabase/agent-skills` | n/a (inline via `references/`) | + +Future skills we run on will surface their own conventions. Append +them here. diff --git a/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md new file mode 100644 index 0000000..d1bdc23 --- /dev/null +++ b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md @@ -0,0 +1,133 @@ +# PR #1 — vercel-labs/agent-skills: web-design-guidelines + +**Target:** `vercel-labs/agent-skills` +**File:** `skills/web-design-guidelines/SKILL.md` +**Base branch:** `main` +**Title:** `web-design-guidelines: add explicit two-pass workflow` + +## Body + +```markdown +## Summary + +- Adds an explicit "Pass 1 — visible anti-patterns / Pass 2 — absences" workflow to the SKILL.md, so reviewing agents do a structured per-element absence check after scanning for visible bad patterns. 
+- The skill's rules are mostly about *what's missing* (a missing `alt`, a missing `aria-label`, a missing focus replacement). Models reliably catch the visible patterns but skip the absence checks unless explicitly told to look for them. +- Diff vs upstream is purely additive: no rule deletions, no wording changes to existing rules. Adds ~15 lines under "How It Works" plus a tightened "Usage" block. The WebFetch behavior and the rules URL are unchanged. + +## Evidence + +Built a workbench of 4 sample React/TSX components seeded with 20 known violations across a11y / focus / forms / typography / animation rule families, then ran a 3-model matrix (`claude-sonnet-4.6`, `openai/gpt-5-mini`, `google/gemini-2.5-pro`) × 3 trials. + +| Model | Before | After | +|---|---|---| +| `claude-sonnet-4.6` | 10/12 (83%) | 12/12 (100%) | +| `openai/gpt-5-mini` | 9/12 (75%) | 10/12 (83%) | +| `google/gemini-2.5-pro` | 7/12 (58%) | 9/12 (75%) | +| **Total** | **26/36 (72%)** | **31/36 (86%)** | + +`gpt-5-mini`'s gains come almost entirely from the new per-element checklist surfacing absence rules. Misses on two rules (`no-empty-state-handling`, `input-missing-autocomplete`) were eliminated entirely. + +A companion PR to `vercel-labs/web-interface-guidelines` adds matching per-element checklists + 5 BAD/GOOD code blocks to `command.md`. Both PRs land independently but are most useful merged together. + +## Test plan + +- [ ] Read the diff — confirm additive only, no existing rules touched +- [ ] Verify the SKILL.md still parses correctly as a Claude Code skill +- [ ] Optional: re-run with your preferred review test files +``` + +## File diff + +**Before** (`skills/web-design-guidelines/SKILL.md`, 39 lines): + +The current upstream version. No changes needed before applying the diff below. + +**After** (54 lines, +15 net): adds explicit Pass 1 / Pass 2 sections to "How It Works" and tightens the "Usage" numbered list to reflect the two-pass workflow. 
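The line counts and the additive-only claim above can be spot-checked locally before submitting. This is a minimal sketch using Python's standard `difflib`; the function names and the file paths passed to `compare_files` are hypothetical, not part of the workbench tooling.

```python
import difflib
from pathlib import Path


def diff_stats(before: str, after: str):
    """Return (added_lines, removed_lines) between two file contents."""
    added, removed = [], []
    # ndiff prefixes each output line with "+ ", "- ", "  " or "? ".
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


def compare_files(before_path: str, after_path: str) -> None:
    """Print the net line change and any removed lines (paths are assumptions)."""
    added, removed = diff_stats(
        Path(before_path).read_text(encoding="utf-8"),
        Path(after_path).read_text(encoding="utf-8"),
    )
    print(f"added={len(added)} removed={len(removed)} net={len(added) - len(removed)}")
    for line in removed:
        # Every removed line is a place where the change is not strictly additive.
        print(f"removed: {line}")
```

Calling `compare_files` on a local copy of the upstream `SKILL.md` and on `after-SKILL.md` should report a net that matches the stated +15; any `removed:` lines mark edits a reviewer would want called out in the PR body.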
+ +The full proposed file is checked into our repo at: + +- [`examples/workbench/web-design-guidelines/proposed-upstream-changes/agent-skills--web-design-guidelines/after-SKILL.md`](../../../examples/workbench/web-design-guidelines/proposed-upstream-changes/agent-skills--web-design-guidelines/after-SKILL.md) + +A unified diff against the upstream: + +```diff +--- skills/web-design-guidelines/SKILL.md (current upstream) ++++ skills/web-design-guidelines/SKILL.md (proposed) +@@ metadata block @@ + author: vercel +- version: "1.0.0" ++ version: "1.1.0" + argument-hint: + +@@ "How It Works" section @@ + ## How It Works + + 1. Fetch the latest guidelines from the source URL below. + 2. Read the specified files (or prompt user for files/pattern). +-3. Check against all rules in the fetched guidelines +-4. Output findings in the terse `file:line` format ++3. Review each file in **TWO passes** — both passes are required. ++4. Output findings in the terse `file:line ` format. ++ ++### Pass 1 — Visible anti-patterns ++ ++Scan each file for literal patterns that appear in the code: ++`
` for actions, `transition: all`, `outline-none` className, ++`onPaste={(e) => e.preventDefault()}`, `"..."` (three dots), straight ++`"..."` quotes, etc. The full list is in the fetched guidelines. One ++finding per match. ++ ++### Pass 2 — Absences (per-element checklist) ++ ++The most-missed rules are about *what's missing*. After Pass 1, walk ++each ``, ``, `