diff --git a/docs/auto-improve-skill-v1.3-design.md b/docs/auto-improve-skill-v1.3-design.md
new file mode 100644
index 0000000..33f7aa4
--- /dev/null
+++ b/docs/auto-improve-skill-v1.3-design.md
@@ -0,0 +1,237 @@
+# auto-improve-skill v1.3 — design proposal
+
+**Status:** draft, written 2026-05-12 during the v1.2.1 PR-prep session.
+**Audience:** team review before implementation.
+**Tracking:** the in-flight v1.2.1 pilot work (web-design-guidelines /
+agent-browser / supabase) is the empirical basis for this proposal.
+
+## Executive summary
+
+v1.3 adds two structural phases to the auto-improve-skill pipeline,
+both motivated by failure modes observed across 4 v1.2.1 pilots:
+
+1. **Phase 0 — Research-first context.** A research subagent reads the
+   target upstream repo's contribution conventions, frontmatter spec,
+   prefix taxonomy, and merged-PR shape patterns, and writes a context
+   file that v1.2.1's `--context` flag consumes. Without this, the
+   auto-pilot produces output that requires manual reformulation
+   before submission.
+2. **Phase 3.5 — Eval-readiness loop.** The pipeline iterates on the
+   eval (seed harder/simpler cases) until baseline lands in the
+   "interesting zone" `[0.50, 0.95)`. Without this, baselines saturate
+   at 1.00 (no headroom to demonstrate uplift) or floor at <0.50 (skill
+   shape blocks measurement).
+
+The skill-iteration loop (current Phase 4) is unchanged.
+
+## Lesson 1 — Research-first context is mandatory
+
+### Evidence (4 pilots this session)
+
+| Skill | Without context | With researched context |
+|---|---|---|
+| web-design-guidelines | Manual proposal needed retargeting (SKILL.md→command.md), reformulation across 3 stylistic siblings, frontmatter mismatch. Manual labor: ~2 hours per PR. | Auto-pilot produced a clean, mergeable diff to the right file in the right voice. Manual labor: ~10 min mirror to AGENTS.md/README.md. |
+| agent-browser | Auto-pilot proposed editing `skills/agent-browser/SKILL.md`. Per upstream `AGENTS.md`, that file is intentionally a discovery stub; real content lives at `skill-data/core/SKILL.md`. Manual retarget required. | (Pending — pilot in flight; context file says edit `agent-browser-core.md` and produced output names `before-skill-data-core-SKILL.md`.) |
+| supabase (batch-1) | Produced shape-novel `references/review-...md` with non-existent prefix (`review-`), missing `impactDescription` frontmatter field, philosophical-style content (MEDIUM-HIGH rejection risk per CONTRIBUTING patterns). | Auto-pilot reshaped into convention-perfect SQL anti-pattern under correct prefix (`monitor-`), full 4-field frontmatter, `**Incorrect**`/`**Correct**` SQL blocks per `_template.md`, trailing `Reference:` link. Zero manual reformulation needed. |
+
+### Generalization
+
+The auto-pilot is good at *finding what to change* (which rules, which
+files, which absence-type gaps). It is bad at *fitting upstream
+conventions*: frontmatter schemas, file-location norms, prefix
+taxonomies, additive-only rules, "Discussion-first" gates, voice
+consistency. Conventions are repo-specific tribal knowledge that
+cannot be inferred from reading the SKILL.md alone.
+
+### Phase 0 design
+
+```text
+Phase 0 — Research upstream (NEW, runs before Phase 1)
+
+Inputs:
+  - target slug <owner>/<repo>/<skill-id>
+
+Subtasks (executed by a research subagent):
+  1. Repo metadata: license, CLA, default branch, recent activity
+  2. Read CONTRIBUTING.md, AGENTS.md, .github/PULL_REQUEST_TEMPLATE.md,
+     CODEOWNERS, .github/workflows/*.yml
+  3. Read skill-specific convention files: _contributing.md,
+     _template.md, _sections.md (or equivalents)
+  4. Read sanity-test source if present (don't trust prior assumptions
+     about what CI validates)
+  5. Sample last 10 merged PRs to the target skill (or repo) for shape:
+     file count, body shape, conventional-commit usage, scope sizing
+  6. Sample last 5 closed-without-merge PRs for rejection signals:
+     "Discussion-first gate violated", "shape-novel content rejected",
+     etc.
+  7. Identify other consumers (gh search for raw URL references; check
+     for install scripts; check repo's own README for distribution
+     channels)
+
+Output:
+  tools/auto-improve-contexts/<owner>-<skill>.md
+  - Repository facts (license, CI, maintainers, merge style)
+  - Hard constraints (additive-only, file-location, prefix taxonomy,
+    forbidden modifications)
+  - Frontmatter spec (exact required fields + allowed values)
+  - Content shape template (copy-and-fill)
+  - Optimization target file (where the skill change should land)
+  - Risk profile (LOW/MEDIUM/HIGH + reasons)
+  - Pre-submit checklist (what auto-pilot must verify)
+  - Useful URLs
+
+Cost: ~$0.50–$1.00 per skill (single subagent invocation).
+
+Caching: context files are committed to the repo. Re-running on the
+same skill within 30 days: skip Phase 0, reuse cached context (with
+explicit `--refresh-context` flag to force re-research).
+
+Operator override: `--context <path>` flag continues to work; if
+provided, Phase 0 is skipped.
+```
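+
+The caching and override rules above reduce to a small decision. A
+minimal JavaScript sketch, assuming a hypothetical `phase0Decision`
+helper whose option names are illustrative and not part of the v1.2.1
+wrapper; letting `--context` win over `--refresh-context` is also an
+assumption:
+
+```javascript
+// Hypothetical helper: decide whether Phase 0 research should run.
+const MAX_CONTEXT_AGE_DAYS = 30; // reuse window stated in the design
+const MS_PER_DAY = 24 * 60 * 60 * 1000;
+
+function phase0Decision({ contextFlag, refreshContext, cachedAtMs, nowMs }) {
+  // An explicit --context <path> skips Phase 0 (assumed to take precedence).
+  if (contextFlag) return "skip: operator context";
+  // --refresh-context forces re-research even when a cache exists.
+  if (refreshContext) return "run: refresh forced";
+  // No committed context file for this skill yet.
+  if (cachedAtMs == null) return "run: no cache";
+  const ageDays = (nowMs - cachedAtMs) / MS_PER_DAY;
+  return ageDays <= MAX_CONTEXT_AGE_DAYS ? "skip: cache fresh" : "run: cache stale";
+}
+```
+
+Returning a reason string rather than a boolean keeps the decision
+easy to log in the wrapper's telemetry.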
+
+## Lesson 2 — Two-loop iteration: eval AND skill
+
+### Evidence
+
+| Skill | Initial baseline | Failure mode | Manual fix |
+|---|---|---|---|
+| agent-browser (Tier-0 only) | 0.97 | Shallow eval — only graded command-presence, not the skill's actual value prop (ref-based interaction, snapshot interpretation, multi-step state) | Built Tier-1 cases via subagent (~half-day): pre-recorded fixtures, stateful fake CLI, 4 new cases targeting the differentiator |
+| supabase (calibrated graders, frontier models) | 1.00 | Eval saturated; calibrated graders + capable models perfect-detect the 9 seeded violations | Built deeper eval via subagent (~30 min): 3 new cases with absence-type violations requiring enumeration across multi-statement files |
+
+In both cases, the **eval was the bug, not the skill**. The skill-
+iteration loop in Phase 4 cannot escape a saturated baseline: it exits
+with "baseline >= 0.95, success" without measuring any uplift.
+
+### Phase 3.5 design
+
+```text
+Phase 3.5 — Eval-readiness loop (NEW, between Phase 3 and Phase 4)
+
+while baseline NOT IN [0.50, 0.95):
+  if baseline >= 0.95:
+    dispatch eval-iteration subagent with prompt:
+      "Add 2-3 cases targeting absence-type rules / failure modes
+       not yet exercised. Realistic seedings, force enumeration.
+       Don't touch existing cases."
+  elif baseline < 0.50:
+    options (operator-decided or auto-judged):
+      a) Grader miscalibrated → run grader-vs-skill check (existing
+         in Phase 4); if grader bug, fix and re-baseline
+      b) Cases too contrived → simplify (remove ambiguous violations,
+         tighten task descriptions)
+      c) Skill genuinely doesn't address this shape → exit
+         "blocked-by-skill-shape" honestly
+  re-measure baseline
+  abort if iteration count > 3 (eval is harder to converge than skill)
+
+Then proceed to Phase 4 unchanged.
+
+Cost: ~$1.00 per eval iteration (subagent + smoke check). Bounded at
+3 iterations.
+
+Convergence criterion: baseline in [0.50, 0.95), the interesting zone.
+```
+
+### Why these bounds?
+
+- **>= 0.95**: ceiling effect; can't measure uplift because there's no
+  headroom. Even +0.04 wouldn't clear our existing 0.05 success
+  threshold.
+- **< 0.50**: floor effect; either the eval is broken (grader bugs,
+  ambiguous tasks) or the skill genuinely doesn't address the seeded
+  rules. In either case, the optimizer can't reliably improve.
+- **[0.50, 0.95)**: the optimizer has clear signal. Both successful
+  iteration and lack-of-improvement are interpretable.
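+
+The bounds and the iteration cap can be sketched in JavaScript. This
+is a simplified, synchronous illustration: `classifyBaseline`,
+`evalReadinessLoop`, and the `measure` / `adjustEval` callbacks are
+hypothetical names standing in for the baseline run and the
+eval-iteration subagent. It treats exactly 0.50 as inside the zone,
+following the `< 0.50` floor rule.
+
+```javascript
+const FLOOR = 0.5; // below this: floor effect
+const CEILING = 0.95; // at or above this: ceiling effect
+const MAX_EVAL_ITERATIONS = 3; // bound stated in the design
+
+function classifyBaseline(baseline) {
+  if (baseline >= CEILING) return "ceiling";
+  if (baseline < FLOOR) return "floor";
+  return "interesting";
+}
+
+// measure() returns the current baseline; adjustEval(zone) stands in for
+// dispatching the subagent (harder cases on "ceiling", triage on "floor").
+function evalReadinessLoop(measure, adjustEval) {
+  let baseline = measure();
+  let iterations = 0;
+  while (classifyBaseline(baseline) !== "interesting") {
+    if (iterations >= MAX_EVAL_ITERATIONS) {
+      return { status: "aborted", baseline, iterations };
+    }
+    adjustEval(classifyBaseline(baseline));
+    iterations += 1;
+    baseline = measure(); // re-measure after every eval change
+  }
+  return { status: "ready", baseline, iterations };
+}
+```
+
+A baseline that starts inside the zone returns `ready` with zero
+iterations, which matches the expectation that most skills converge
+in 0–1 eval iterations.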
+
+## Combined v1.3 architecture
+
+```text
+0. Research upstream → context file (NEW)
+1. Discover skill, classify
+2. Build initial suite
+3. Measure baseline
+3.5 Eval-readiness loop (NEW):
+    while baseline NOT IN [0.50, 0.95): iterate eval
+4. Skill-iteration loop (existing):
+    while uplift < 0.05 AND iterations < 2: iterate skill
+5. Re-check baseline (did eval drift after skill change?)
+6. Package
+```
+
+## Implementation cost
+
+| Component | Effort | Cost per pilot run |
+|---|---|---|
+| Phase 0 research subagent | ~1 day to write the prompt template + repo-detection logic | +$0.50–$1.00 |
+| Phase 3.5 eval-iteration subagent | ~2 days to write the subagent prompt + integration into the wrapper loop | +$1.00 per eval iteration (bounded at 3) |
+| Wrapper integration | ~1 day for new flags (`--refresh-context`, `--max-eval-iterations`), result aggregation, telemetry | n/a |
+| Testing on 5 representative skills | ~1 day | ~$10 total |
+
+**Total v1.3 build cost:** ~5 days of work + ~$15 of pilot runs to
+validate.
+
+**Per-pilot incremental cost:** ~$1.50–$5.00 over v1.2.1, depending on
+how many eval iterations are needed (most skills will converge in 0–1).
+
+## Migration / backwards compatibility
+
+- v1.2.1 wrapper continues to work standalone (`--context` flag is
+  preserved).
+- v1.3 is opt-in via new flags, e.g. `--research` to enable Phase 0
+  and `--auto-eval` to enable Phase 3.5. Default off until validated.
+- Once validated, defaults flip to on; operator can opt out via
+  `--no-research` / `--no-auto-eval`.
+
+## Open questions
+
+1. **Research-subagent prompt template** — should the Phase 0 subagent
+   prompt be skill-classification-aware? E.g. ask different questions
+   for code-reviewer vs tool-use vs document-producer skills. Probably
+   yes, but adds template branching complexity.
+2. **Eval-iteration subagent prompt template** — same question. The
+   "what makes a harder case" guidance differs sharply by skill type.
+3. **When to refuse eval iteration** — if baseline is at 1.00 because
+   the skill genuinely is excellent at its job, we shouldn't fabricate
+   harder cases. How does Phase 3.5 distinguish "ceiling because skill
+   is good" from "ceiling because eval is shallow"?
+   - One heuristic: if the existing eval already exercises the skill's
+     stated value prop (per the SKILL.md description), assume good. If
+     it tests only mechanical command presence, assume shallow.
+   - This needs a "value-prop coverage" check in Phase 3.5, ideally
+     read from the skill's frontmatter description.
+4. **Cost ceiling** — Phase 0 + Phase 3.5 each cost ~$1; Phase 4 costs
+   $1–3. v1.3 raises typical pilot cost from ~$2 (v1.2.1) to ~$3–6.
+   Still within the $10 wrapper budget but worth keeping under
+   observation.
+5. **When to accept lossy reshape** — supabase v1.2.1 forced reshape
+   from "two-pass meta-workflow" into "concrete SQL anti-pattern with
+   `**Incorrect**`/`**Correct**` blocks". Worked beautifully. Will
+   this transfer to other skills, or did we get lucky with supabase's
+   tight `_template.md`? Probably needs more pilots before generalizing.
+
+## Open architectural questions (longer-term)
+
+- **Should the auto-pilot also produce the AGENTS.md/README.md mirrors
+  for repos with multi-file convention (PR #23 shape)?** Currently
+  manual at PR-draft time. Could be a separate "packaging" subagent.
+- **Should we treat upstream PR-submission as a phase too (Phase 6)?**
+  i.e. fork-clone-push-create-PR automation. Operator-gated for high-
+  visibility actions, but otherwise plausible.
+- **Can the research subagent be made repo-agnostic?** Right now we
+  assume a "skill repo" structure. For repos with a non-standard layout
+  (vendored skills, monorepos, etc.) the research needs different
+  patterns.
+
+## Provenance
+
+This design is grounded in the v1.2.1 pilot session captured in:
+
+- `docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md`
+- (pending) `docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-*.md`
+- (pending) `docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-*.md`
+- `tools/auto-improve-contexts/{vercel-web-interface-guidelines,
+  vercel-agent-browser, supabase-postgres-best-practices}.md`
+- Eval branches: `eval/auto-pilot/web-design-guidelines`,
+  `eval/auto-pilot/agent-browser` (in flight),
+  `eval/auto-pilot/supabase-postgres-best-practices-v2` (in flight).
diff --git a/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md b/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md
new file mode 100644
index 0000000..e639c1f
--- /dev/null
+++ b/docs/pilot-runs/2026-05-08-auto-improve-pilot-summary.md
@@ -0,0 +1,109 @@
+# Auto-improve-skill pilot summary — 2026-05-08
+
+## Setup
+
+Built a `tools/auto-improve-skill.mjs` wrapper + `tools/auto-improve-skill-prompt.md` template.
+Operator says "optimize `<slug>`"; orchestrator runs the wrapper via `Bash run_in_background`,
+the inner `claude -p` agent does the entire find → eval → diagnose → improve → package loop,
+writes `examples/workbench/<skill-id>/analysis.md`, exits.
+
+Branch: `feat/auto-improve-skill` (wrapper + prompt). Per-pilot output on `eval/auto-pilot/<skill-id>`.
+
+## Three pilot runs
+
+Run sequentially-ish: pilot #1 in main worktree, pilots #2 and #3 in parallel via `git worktree`
+in separate working folders. Three providers × three trials × N cases per pilot.
+
+| Skill | Classification | Status | Baseline | Final | Uplift | Iter | Plan-cost | OpenRouter |
+|---|---|---|---|---|---|---|---|---|
+| `vercel-labs/agent-browser/agent-browser` | tool-use | success | 0.56 | 1.00 | +0.44 | 1 | $3.15 | ~$2.80 |
+| `supabase/agent-skills/supabase-postgres-best-practices` | code-reviewer | success | 0.54 | 0.86 | +0.32 | 1 | $0 | ~$2.40 |
+| `anthropics/skills/pdf` | document-producer | success | 1.00 | 1.00 | +0 | 0 | $0 | ~$1.40 |
+
+3/3 succeeded. Each surfaced a distinct success path:
+
+- **agent-browser**: auto-pilot diagnosed that its own grader was over-specified (required `snapshot` for non-interactive ops, but the skill says CSS selectors are valid). Demoted the grader, +0.44 uplift mostly from grader correction. Also proposed a small additive "Quick task reference" section to upstream SKILL.md.
+- **supabase**: 9 SQL violations seeded (FK indexes, RLS, covering indexes, etc.). Auto-pilot first self-corrected its grader (line tolerance ±3 → ±8, added keyword variants), then independently rediscovered the same **two-pass workflow** pattern we found manually for web-design-guidelines (pass 1 = visible token misuse, pass 2 = absence checks). Real upstream proposal generated.
+- **pdf**: baseline already 1.00, auto-pilot triggered the "≥0.95 → exit clean, no proposal" path correctly. Did NOT manufacture problems. Noticed and noted that upstream's REFERENCE.md / FORMS.md links are 404.
+
+## Costs
+
+- OpenRouter (matrix runs): ~$6.60 total across 3 pilots.
+- Plan budget (the inner `claude -p` self-reported `total_cost_usd`): only pilot #1 hit the cap.
+  Its first attempt was blocked at $3.42 by the docker-permissions issue; the rerun (#1c) with
+  `--budget 15` settled at $3.15. Pilots #2 and #3 reported $0 (likely under the tracking floor,
+  or they didn't iterate enough to register).
+- Wall clock: ~50 min for 3 parallel pilots (vs ~150 min sequential).
+
+## Auto-pilot capabilities validated
+
+1. **Correct skill-shape classification** in all 3 cases (`tool-use`, `code-reviewer`, `document-producer`).
+2. **Self-correction of own grader bugs** before diagnosing the underlying skill — happened in 2 of 3 pilots without operator nudging. Same patterns we manually applied (line-tolerance widening, hyphenated regex variants, keyword alternations).
+3. **Pattern transfer**: the auto-pilot rediscovered the "two-pass workflow for absence-type rules" insight on supabase — a different skill in a different rule space — confirming the pattern generalizes.
+4. **Clean exit on already-good skills**: pdf ran 36/36 trials passing at baseline; auto-pilot did not manufacture changes.
+5. **Distinguishing skill problem from grader problem**: agent-browser caught grader-over-specification, separated it from skill quality.
+
+## Issues found in v1 of the auto-pilot
+
+1. **"Always: commit" step unreliable.** Pilots #1b and #2 didn't reach it — case files were left untracked in the worktree. Fix: hoist the commit step earlier (right after analysis.md is written), or split the prompt into two `claude -p` invocations (build + analyze).
+2. **`--max-budget-usd 3.50` is too tight** for runs that need any real iteration. Pilot #1's first real-data attempt hit the cap mid-modification. Bumping to $15 worked. Sensible default for v2: $7-10.
+3. **Phase 4 grader-fix iteration eats one of the two iteration slots.** The agent often spends iteration 1 fixing graders and only has one shot at modifying the skill. Fix: pre-bake known grader-tuning patterns into `_grader-utils.mjs` so the agent doesn't have to discover them, or count grader-only fixes separately from skill-modification iterations.
+
+## Patterns we should bake into v2
+
+From pilots and prior manual runs, these recurring techniques are stable enough to embed as defaults:
+
+**Optimizing patterns** (bake into prompt as Phase-4 priors):
+
+- Two-pass workflow (pass 1 visible / pass 2 absence) for code-reviewer skills
+- Per-element checklists for skills with rule-by-element structure
+- BAD/GOOD examples for anti-pattern and absence-type rules
+- "Verify-tool-installed" nudge for tool-use skills (agents fall back to `curl`/`npm i`)
+
+**Grader-reliability patterns** (bake into `_grader-utils.mjs`):
+
+- Default `±5–8` line tolerance
+- Hyphen-tolerant regex (`/empty[-\s]+state/`)
+- Per-finding-line keyword matching
+- Multiple keyword variants (`/cover/i` for both "covering" and "does not cover")
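+
+A minimal JavaScript sketch of how these four patterns could combine. The helper names (`withinLineTolerance`, `hyphenTolerant`, `findingMatches`) are hypothetical and are not the contents of `_grader-utils.mjs`; treating every integer on a finding line as a candidate line number is also an assumption.
+
+```javascript
+// Line tolerance: accept a reported line within ±tolerance of the seeded line.
+function withinLineTolerance(reportedLine, seededLine, tolerance = 8) {
+  return Math.abs(reportedLine - seededLine) <= tolerance;
+}
+
+// Hyphen-tolerant phrase regex: "empty state" also matches "empty-state".
+function hyphenTolerant(phrase) {
+  const words = phrase
+    .trim()
+    .split(/\s+/)
+    .map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")); // escape regex metacharacters
+  return new RegExp(words.join("[-\\s]+"), "i");
+}
+
+// Per-finding-line matching: the keyword and a nearby line number must
+// appear on the same output line. keywordRe must not use the `g` flag,
+// because a global regex keeps state between test() calls.
+function findingMatches(output, seededLine, keywordRe, tolerance = 8) {
+  return output.split("\n").some((line) => {
+    if (!keywordRe.test(line)) return false;
+    const numbers = line.match(/\d+/g) || [];
+    return numbers.some((n) => withinLineTolerance(Number(n), seededLine, tolerance));
+  });
+}
+```
+
+For example, `findingMatches(report, 42, /cover/i)` accepts a finding on line 45 that says "does not cover", while a keyword hit next to an unrelated line number is rejected.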
+
+**Default seeded violation types** (bake into Phase-2 instructions):
+
+- For code-reviewer: ≥1 visible-token, ≥1 missing-attribute, ≥1 missing-branch, ≥1 anti-pattern, ≥1 state-machine
+- For tool-use: ≥1 reaches-for-fallback, ≥1 wrong-flag, ≥1 missing-step
+- For document-producer: ≥1 missing-field, ≥1 wrong-format, ≥1 edge-case-input
+
+## Decision points for the team
+
+1. **Continue scaling.** With these results, "optimize 10 skills" is a sequential loop the
+   orchestrator already supports (just call the wrapper N times). With worktrees, N=3 in
+   parallel is also straightforward. Cost per skill ~$2-3 OpenRouter + plan-tokens.
+
+2. **Tighten the prompt before scaling.** The "Always: commit" issue and the budget-too-tight
+   issue are real and would cost a fraction of one pilot to fix. ~30 min of work for v2.
+
+3. **Build the lessons doc.** A `tools/auto-improve-skill-lessons.md` referenced by the
+   prompt as Phase-4 prior, updated after every pilot. Compounds: pilot N benefits from
+   patterns 1..N-1. Not started; sub-project for after the next batch.
+
+4. **Skill-batch parallelism.** Worktree-per-pilot worked. For 10 skills, 3-way parallel
+   would land in ~3-4 batches (~3 hours). 5-way is also feasible if the dev machine has
+   the resources.
+
+## Reproducing the pilots
+
+```bash
+cd /home/yuqing/Documents/Code/skill-optimizer  # adjust to your local checkout
+git checkout feat/auto-improve-skill
+node tools/auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--budget 15]
+
+# Output: examples/workbench/<skill-id>/{analysis.md, suite.yml, ...}
+# Branch: eval/auto-pilot/<skill-id>
+```
+
+For parallel runs, use git worktrees:
+
+```bash
+git worktree add ../wt-pilot-2 -b auto-pilot/wt-2 feat/auto-improve-skill
+cd ../wt-pilot-2 && node tools/auto-improve-skill.mjs <slug-2> --budget 15
+```
diff --git a/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md b/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md
new file mode 100644
index 0000000..de6ee50
--- /dev/null
+++ b/docs/pilot-runs/2026-05-09-auto-improve-batch-2-summary.md
@@ -0,0 +1,100 @@
+# Auto-improve-skill batch 2 summary — 10 pilots, 8 success, 0 failures
+
+## Setup
+
+- **Wrapper version:** v1.1 + #3 (atomic write-and-commit, $10 default budget, lessons.md, pre-baked grader helpers)
+- **Skills:** ranks 5–14 from the prioritized top-N list (skips the 4 already covered: web-design-guidelines from the earlier manual run, plus batch 1's agent-browser, supabase, pdf)
+- **Parallelism:** 10 git worktrees, hardlinked `node_modules`, fired simultaneously
+- **Wall clock:** ~50 min (set by the slowest pilot), down from an estimated ~150 min sequential
+
+## Headline results
+
+| # | Skill | Classification | Status | Coverage | Mods | Notes |
+|---|---|---|---|---|---|---|
+| 1 | `anthropics/skills/pptx` | document-producer | ✅ success | 0.85 → 0.85 | 0 | grader cal raised raw 0.74 → 0.85; gpt-4o-mini fails entirely (model gap) |
+| 2 | `vercel-labs/next-skills/next-best-practices` | code-reviewer | ✅ success | 0.80 → 0.975 | 0 | grader cal only — skill already strong |
+| 3 | `firebase/agent-skills/firebase-auth-basics` | code-reviewer | ✅ success | 1.00 → 1.00 | 0 | reclassified from prior `tool-use` |
+| 4 | `firebase/agent-skills/firebase-hosting-basics` | code-patterns | ✅ success | 0.89 → 1.00 | 1 | Recipe A + E added a Configuration Review section |
+| 5 | `expo/skills/building-native-ui` | code-patterns | ✅ success | 0.99 → 0.99 | 0 | 17/18 trials — single gpt-5-mini miss accepted as noise |
+| 6 | `google-labs-code/stitch-skills/shadcn-ui` | code-patterns | ✅ success | 0.82 → 0.89 | 1 | Recipe A + D — Gemini's wrong-location miss rate dropped 100% → 0% |
+| 7 | `expo/skills/native-data-fetching` | code-reviewer | ✅ success | 1.00 → 1.00 | 0 | already-good |
+| 8 | `firecrawl/skills/firecrawl-build-scrape` | code-patterns | ⚠️ uplift-too-small | 0.84 → 0.89 | 2 | +0.05, exactly on threshold; gpt-4o-mini verbosity floor caps it |
+| 9 | `vercel-labs/next-skills/next-upgrade` | code-reviewer | ⚠️ uplift-too-small | **0.83 → 0.76** | 2 | **regression** — modifications hurt; new failure mode surfaced |
+| 10 | `github/awesome-copilot/prd` | document-producer | ✅ success | 1.00 → 1.00 | 0 | sonnet API errors, judged on 12 valid trials from gpt-5-mini + gemini |
+
+**8/10 success • 2/10 uplift-too-small • 0/10 blocked or budget-exceeded**
+
+## Cost
+
+- OpenRouter spend during batch: **~$21.30** ($40.65 cumulative after the batch minus $19.35 spent before it started)
+- Per-pilot avg: **$2.13** (well under the $3.50 budgeted)
+- Plan-token spend (inner `claude -p`): each pilot reported between $0.00 and $1.00 — no pilot hit the $10 wrapper cap
+
+## What v1.1 + #3 actually delivered
+
+The pilots demonstrate the prompt improvements working as intended:
+
+1. **"Atomic write-analysis-and-commit" worked.** **All 10 inner agents committed cleanly.** No manual recovery needed (vs batch 1 where 2 of 3 needed manual commits).
+2. **Recipe citations by letter.** Pilots 4, 6, 8 explicitly cited Recipe A / D / E from `lessons.md` in their analysis bullets. They didn't rediscover the patterns from scratch.
+3. **"Grader-vs-skill check first" worked.** Pilots 1, 2, 4, 6, 8, 9 all did iteration 0 grader calibration before counting against their iteration budget. Saved meaningful budget on pilots 2, 4, 6.
+4. **`looseRange` / `tolerantKeyword` pre-baked helpers** were used in the graders the auto-pilot wrote, without rediscovering the patterns. Several pilots had to widen the range specifically for gpt-4o-mini drift (8 → 12 or 16), which is a new signal worth adding to lessons.md.
+5. **"Don't manufacture problems"** worked in all 5 already-good cases (3, 5, 7, 10, plus pilot 1 after grader cal). None proposed unnecessary changes.
+
+## New patterns surfaced — worth adding to `lessons.md`
+
+### Optimization patterns
+
+- **(NEW) Recipe F? — Don't add bash commands for small models.** Pilot 9 added bash grep commands to `next-upgrade`'s SKILL.md. gpt-4o-mini tried to *execute* them rather than reading files, dropping coverage from 0.83 to 0.69. **Anti-pattern.** When skill is aimed at small/cheap models, prefer pure declarative wording over executable commands.
+
+### Failure modes
+
+- **CLI fabrication on "upgrade-style" skills.** gpt-4o-mini will hallucinate a `npx <something>-upgrade` CLI for any skill whose name suggests transformation/upgrade work, then write the error message as findings. Distinct from the agent-browser `curl` fallback (where the CLI exists but the model picks the wrong tool). Worth its own anti-pattern entry.
+- **Verbosity floor on gpt-4o-mini.** Confirmed across pilots 8 and 9: the model emits 3–4 line responses and sometimes drops trailing rules entirely. Rules that require more findings than this floor allows are systematically under-detected.
+
+### Grader patterns
+
+- **(NEW) Per-model line tolerance.** sonnet/gemini drift 0–3 lines; gpt-4o-mini drifts 6–15 lines. The `looseRange` default of ±8 is calibrated for the first two but too tight for the third. Future graders should default to `looseRange(N, 12)` or use per-model tolerance maps.
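+
+A per-model tolerance map could look like the JavaScript sketch below. The model keys, the `toleranceFor` / `lineMatches` names, and the specific values are assumptions chosen to cover the drift ranges reported above (8 for sonnet and gemini, 16 for gpt-4o-mini, 12 as the fallback); none of this is the shipped `looseRange` implementation.
+
+```javascript
+// Hypothetical tolerance map, in lines of allowed drift per model.
+const DEFAULT_TOLERANCE = 12; // fallback suggested for future graders
+const TOLERANCE_BY_MODEL = new Map([
+  ["sonnet", 8], // observed drift 0-3 lines
+  ["gemini", 8], // observed drift 0-3 lines
+  ["gpt-4o-mini", 16], // observed drift 6-15 lines
+]);
+
+function toleranceFor(model) {
+  return TOLERANCE_BY_MODEL.get(model) ?? DEFAULT_TOLERANCE;
+}
+
+// True when the reported line is close enough to the seeded line for this model.
+function lineMatches(model, reportedLine, seededLine) {
+  return Math.abs(reportedLine - seededLine) <= toleranceFor(model);
+}
+```
+
+Keeping the map separate from the matching function means a new model only needs one new entry, and unknown models fall back to the wider default instead of failing.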
+
+### Skill-shape edge cases
+
+- **Repo path conventions vary.** `expo/skills` uses `plugins/expo/skills/<id>/SKILL.md` (not the canonical `skills/<id>/SKILL.md`). Pilots 5 and 7 both surfaced this and adapted. Worth noting in Phase-1 instructions.
+
+## Branches pushed
+
+- `eval/auto-pilot/batch-2-2026-05-09` (consolidated, all 10 cherry-picked)
+- 10 individual `eval/auto-pilot/<skill-id>` branches (for per-skill review)
+
+## What to PR upstream
+
+Three pilots produced real, additive proposals:
+
+| Skill | Uplift | Where the change goes |
+|---|---|---|
+| firebase-hosting-basics | 0.89 → 1.00 | `firebase/agent-skills` |
+| shadcn-ui | 0.82 → 0.89 | `google-labs-code/stitch-skills` |
+| firecrawl-build-scrape | 0.84 → 0.89 | `firecrawl/skills` |
+
+**Skip from PR queue:**
+
+- All 5 baseline-already-good skills (no changes warranted)
+- pilot 9 (next-upgrade) — modifications regressed; needs human review or a different approach (probably "drop bash commands, use BAD/GOOD only")
+- pilot 8 (firecrawl-build-scrape) is on the bubble at +0.05 — judgment call
+
+## Decision points for the team
+
+1. **Scale further.** With v1.1+#3 working, batch 3 of 10 skills should land in another ~50 min for ~$25 OpenRouter. Plenty of remaining slugs in the top-N (15–47).
+2. **Lessons.md v1.2 update.** Add the patterns from this batch (CLI fabrication, gpt-4o-mini line drift, repo-path variants, "don't add bash for small models"). 30 min of doc work that compounds for batch 3.
+3. **Drop gpt-4o-mini from default matrix.** Repeated capability gap (verbosity floor + CLI fabrication + line drift) is dragging multiple pilots' scores. Switching the matrix to sonnet/gemini/another-mid-tier would likely lift batch coverage by 5-10pp without any skill changes. Worth piloting.
+
+## Reproducing
+
+```bash
+# This batch can be reproduced from a fresh checkout of feat/auto-improve-skill:
+cd /home/yuqing/Documents/Code/skill-optimizer  # adjust to your local checkout
+git checkout feat/auto-improve-skill
+node tools/auto-improve-skill.mjs <slug>
+
+# For parallel batches, use git worktrees (see "How to run a new batch" in docs/pilot-runs/README.md)
+```
+
+Cumulative spend: $40.65 of $60 OpenRouter credits.
diff --git a/docs/pilot-runs/README.md b/docs/pilot-runs/README.md
new file mode 100644
index 0000000..de86482
--- /dev/null
+++ b/docs/pilot-runs/README.md
@@ -0,0 +1,39 @@
+# Auto-improve-skill pilot runs
+
+Summaries of batched runs of the `tools/auto-improve-skill.mjs` auto-pilot
+against public agent skills from our prioritized top-N list. Each summary
+documents what skills ran, what the auto-pilot proposed, what worked, what
+didn't, and what changes we should make to the prompt before the next batch.
+
+The per-skill eval artifacts (suite, graders, vendored upstream, proposed-upstream-changes/)
+live on `eval/auto-pilot/<skill-id>` branches and the consolidated
+`eval/auto-pilot/batch-<n>-<date>` branches.
+
+## Index
+
+- [`2026-05-08-auto-improve-pilot-summary.md`](./2026-05-08-auto-improve-pilot-summary.md)
+  — Batch 1, 3 skills (agent-browser, supabase-postgres-best-practices, pdf).
+  Validated end-to-end. 3/3 success.
+- [`2026-05-09-auto-improve-batch-2-summary.md`](./2026-05-09-auto-improve-batch-2-summary.md)
+  — Batch 2, 10 skills (pptx, next-best-practices, firebase-auth-basics,
+  firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching,
+  firecrawl-build-scrape, next-upgrade, prd). 8/10 success, 2/10 uplift-too-small.
+
+## How to run a new batch
+
+```bash
+# Single skill from the main repo:
+node tools/auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--budget 10]
+
+# Parallel batch via git worktrees:
+for i in $(seq 1 "$N"); do  # set N (number of pilots) beforehand
+  git worktree add ../wt-pilot-$i -b auto-pilot/wt-batch-$i feat/auto-improve-skill
+  cp -al node_modules dist ../wt-pilot-$i/
+  cp .env ../wt-pilot-$i/
+done
+
+# Then fire one wrapper invocation per worktree in parallel.
+```
+
+After all pilots complete, cherry-pick each `eval/auto-pilot/<skill-id>` onto
+a consolidated batch branch and open a PR.
diff --git a/docs/pilot-runs/upstream-pr-conventions.md b/docs/pilot-runs/upstream-pr-conventions.md
new file mode 100644
index 0000000..4d62611
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-conventions.md
@@ -0,0 +1,128 @@
+# Upstream PR conventions for skill repositories
+
+Operational guide for submitting skill-improvement PRs to upstream
+maintainers. Each row was verified by reading the repo's
+`AGENTS.md` / `CONTRIBUTING.md` / `.github/workflows/` + scanning the
+last 5–10 merged PRs. Update this doc when we observe new patterns.
+
+## Quick reference
+
+| Repo | License | Style | Title format | Body | CI gates | CLA |
+|---|---|---|---|---|---|---|
+| `vercel-labs/agent-skills` | (no LICENSE) | casual | `{skill}: <change>` | `## Summary` + `## Test plan` | path-filtered (only fires for react-best-practices changes) | no |
+| `vercel-labs/web-interface-guidelines` | MIT | terse | sentence-case freeform, optional `feat:`/`fix:` | 1–2 sentences | none (no workflows) | no |
+| `vercel-labs/agent-browser` | Apache-2.0 | formal | `feat/fix/docs(scope): description` (conventional commits) | `## Summary` + `## Test plan` | Rust fmt/clippy/test + dashboard build + version-sync | no CLA bot observed |
+| `supabase/agent-skills` | MIT | formal | `feat/fix/docs: description` (conventional commits, used by Release Please) | terse `## Summary` bullets | `pnpm test:sanity` only | no (CONTRIBUTING.md states MIT auto-license) |
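+
+The title formats in the table can be checked before opening a PR. The
+JavaScript sketch below is illustrative only: `titleLooksValid` is a
+hypothetical helper, the patterns accept only the `feat`/`fix`/`docs`
+types listed above, and treating the agent-browser scope as optional
+is an assumption based on the `docs: description` form.
+
+```javascript
+// Title patterns transcribed from the quick-reference table.
+const TITLE_PATTERNS = new Map([
+  ["vercel-labs/agent-skills", /^[a-z0-9-]+: \S.*$/],
+  ["vercel-labs/web-interface-guidelines", /^\S.*$/], // freeform sentence
+  ["vercel-labs/agent-browser", /^(feat|fix|docs)(\([a-z0-9-]+\))?: \S.*$/],
+  ["supabase/agent-skills", /^(feat|fix|docs): \S.*$/],
+]);
+
+function titleLooksValid(repo, title) {
+  const pattern = TITLE_PATTERNS.get(repo);
+  if (!pattern) throw new Error(`no title convention recorded for ${repo}`);
+  return pattern.test(title.trim());
+}
+```
+
+A passing check only means the title has the expected shape; it says
+nothing about body format or CI gates.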
+
+## Per-repo notes
+
+### `vercel-labs/agent-skills`
+
+- **Title**: `{skill-name}: <what changed>` — skill name as the scope, no
+  conventional-commit prefix needed.
+- **Body**: Multi-section. Use `## Summary` bullets + `## Test plan`
+  checkboxes. 600–2500 chars is the observed norm. Claude Code footer
+  (`🤖 Generated with Claude Code`) is fully normalized — appears in
+  multiple merged PRs.
+- **CI**: One workflow (`react-best-practices-ci.yml`) is path-filtered;
+  unless our change touches `skills/react-best-practices/**`, it won't
+  fire. Vercel deploy preview is cosmetic, not blocking.
+- **Merge style**: Squash. Maintainer (`bhrigu123`) approves silently and
+  same-day for clean PRs.
+- **PR scope**: Tight per-skill (one skill per PR). Improvements to
+  existing skills merge faster than new-skill additions (PR #238
+  proposing a brand-new skill has sat for weeks).
+- **Gotcha**: Some skills have a `.zip` alongside the directory. Not
+  blocking but a known convention.
+
+### `vercel-labs/web-interface-guidelines`
+
+- **Title**: Freeform sentence (e.g., `Add translate="no" guideline for
+  verbatim content`) or `feat:`/`fix:` prefix — both merged.
+- **Body**: Minimal. PR #20 is exemplary: two sentences of rationale, no
+  headers. 0–400 chars is the observed norm.
+- **CI**: No workflows. Zero automated checks.
+- **Merge style**: Silent approve from `JohnPhamous` (Vercel staff).
+- **Sync constraint**: `README.md` and `AGENTS.md` are dual copies of
+  the same content (one human-readable, one agent-readable). If we add
+  or change a guideline, **touch both files** in the same PR. PR #20
+  did this; ours should too.
+- **Pace**: Repo is low-traffic (48 forks, last merge ~5 weeks ago).
+  Expect slow response. Don't optimize for immediate merge.
+
+### `vercel-labs/agent-browser`
+
+- **Title**: Strict conventional commits — `feat(scope): description`,
+  `fix(scope): description`, `docs: description`. Scope is the
+  subsystem (`docs`, `doctor`, `native`, etc.).
+- **Body**: `## Summary` (2 bullets) + `## Test plan` (2 checkboxes).
+  PR #1305 is a reference template.
+- **CI**: Strict. Three blocking jobs (Rust fmt+clippy+test, dashboard
+  pnpm build, version-sync). **Docs-only and skill-data-only changes
+  should pass automatically**; anything touching Rust will trigger
+  expensive checks.
+- **Merge style**: `ctate` is sole maintainer; very active, merges
+  same-day silently for clean PRs.
+- **Critical gotcha**: Skill content lives at
+  `skill-data/core/SKILL.md`, **not** at `skills/agent-browser/SKILL.md`
+  (which is intentionally a thin stub per AGENTS.md). Any meaningful
+  skill change touches:
+
+  1. `skill-data/core/SKILL.md`
+  2. `skill-data/core/references/*.md` (the per-rule reference docs)
+  3. `README.md`
+  4. The docs MDX pages
+
+  Per AGENTS.md, omitting any of these is grounds for rejection. Use
+  HTML `<table>` syntax in MDX (not markdown pipe tables).
+- **PR scope**: Tight per subsystem. Docs-only changes are the
+  lowest-friction path — they bypass the Rust CI gates.
+
+### `supabase/agent-skills`
+
+- **Title**: Strict conventional commits — `feat: <description>`,
+  `fix: <description>`, `docs: <description>`. Release Please uses these
+  to determine semver bumps. **Do not** bump `metadata.version`
+  manually in SKILL.md — Release Please handles it post-merge.
+- **Body**: Short `## Summary` with 1–4 bullets. Link issues with
+  `Resolves AI-NNN` if applicable. No template.
+- **CI**: One job — `Skills CI` runs `pnpm test:sanity`. Sanity tests
+  check that new reference files follow the `{prefix}-{name}.md`
+  naming convention with valid frontmatter (`title`, `impact`, `tags`).
+  Run `pnpm test:sanity` locally before submitting.
+- **Merge style**: Squash. `gregnr` (Supabase staff) and `Rodriguespn`
+  (sole active community maintainer) merge in under 30 min for clean
+  PRs by core team members; external PRs may need a single LGTM.
+- **PR scope**: Additive file change only. Add a new reference file
+  under `skills/<skill-id>/references/{prefix}-{name}.md` with proper
+  frontmatter + Incorrect/Correct examples. CONTRIBUTING.md says
+  significant new skills need a prior GitHub Discussion; reference
+  additions don't.
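+
+The two conventions named in the CI note can be pre-checked locally
+before running `pnpm test:sanity`. The sketch below is a hypothetical
+helper, not the repo's own sanity test: the function name
+`check_reference` and the exact filename pattern are assumptions, and it
+only looks for the three frontmatter keys listed above.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical pre-check; NOT a substitute for `pnpm test:sanity`.
+# check_reference FILE succeeds when the file name looks like
+# {prefix}-{name}.md and the leading frontmatter block contains
+# title, impact, and tags keys.
+check_reference() {
+  local file="$1" base key
+  base="$(basename "$file")"
+  # Assumed pattern: lowercase words joined by hyphens, at least two parts.
+  if ! printf '%s\n' "$base" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)+\.md$'; then
+    echo "bad name: $base" >&2
+    return 1
+  fi
+  # Frontmatter must open on line 1 with '---'.
+  if [ "$(head -n 1 "$file")" != "---" ]; then
+    echo "no frontmatter: $base" >&2
+    return 1
+  fi
+  for key in title impact tags; do
+    # Only inspect lines between the first and second '---'.
+    if ! awk 'NR>1 && /^---[[:space:]]*$/ {exit} NR>1 {print}' "$file" \
+        | grep -q "^${key}:"; then
+      echo "missing frontmatter key '$key': $base" >&2
+      return 1
+    fi
+  done
+}
+```
+
+Usage, assuming the reference file from PR #4 exists locally:
+`check_reference skills/supabase-postgres-best-practices/references/monitor-two-pass-review.md`.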
+
+## Process for our own PRs
+
+For each PR we submit:
+
+1. **Branch** off a fresh local clone of the upstream repo, NOT off our
+   `examples/workbench/<skill-id>/proposed-upstream-changes/`. Copy the
+   `after-*.md` content into the actual upstream file paths.
+2. **Run any local checks** the repo requires (e.g., `pnpm test:sanity`
+   for supabase).
+3. **Title and body** per the table above.
+4. **Add the Claude Code footer** unless the repo's conventions rule it
+   out (vercel-labs repos accept it; supabase has no precedent either
+   way).
+5. **Cap each PR to one skill**. If a skill has both a SKILL.md change
+   and a rules-doc change (as web-design-guidelines does, spanning two
+   repos), open two PRs and reference each from the other.
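+
+For the two repos that require conventional-commit titles
+(`vercel-labs/agent-browser` and `supabase/agent-skills`), step 3 can be
+sanity-checked before opening the PR. This is a minimal sketch: the
+helper name `is_conventional_title` is hypothetical, and the regex only
+covers the `feat`/`fix`/`docs` prefixes named in the per-repo notes. It
+does not apply to the other two repos, whose titles are intentionally
+not conventional commits.
+
+```bash
+# Hypothetical helper: is_conventional_title TITLE
+# Succeeds when TITLE looks like `type: description` or
+# `type(scope): description`, with type limited to feat, fix, or docs.
+is_conventional_title() {
+  printf '%s\n' "$1" | grep -Eq '^(feat|fix|docs)(\([a-z0-9-]+\))?: .+'
+}
+```
+
+For example, `is_conventional_title "docs(skill): add pre-flight section"`
+succeeds, while a freeform title such as `"Add per-element checklist"`
+fails.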
+
+## Reference: which repo each skill lives in
+
+| Our top-N skill | SKILL.md repo | Rules doc repo (if separate) |
+|---|---|---|
+| `vercel-labs/agent-skills/web-design-guidelines` | `vercel-labs/agent-skills` | `vercel-labs/web-interface-guidelines` |
+| `vercel-labs/agent-browser/agent-browser` | `vercel-labs/agent-browser` (`skill-data/core/SKILL.md`) | n/a (inline) |
+| `supabase/agent-skills/supabase-postgres-best-practices` | `supabase/agent-skills` | n/a (inline via `references/`) |
+
+Future skills we run on will surface their own conventions. Append
+them here.
diff --git a/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md
new file mode 100644
index 0000000..d1bdc23
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-agent-skills-web-design-guidelines.md
@@ -0,0 +1,133 @@
+# PR #1 — vercel-labs/agent-skills: web-design-guidelines
+
+**Target:** `vercel-labs/agent-skills`
+**File:** `skills/web-design-guidelines/SKILL.md`
+**Base branch:** `main`
+**Title:** `web-design-guidelines: add explicit two-pass workflow`
+
+## Body
+
+```markdown
+## Summary
+
+- Adds an explicit "Pass 1 — visible anti-patterns / Pass 2 — absences" workflow to the SKILL.md, so reviewing agents do a structured per-element absence check after scanning for visible bad patterns.
+- The skill's rules are mostly about *what's missing* (a missing `alt`, a missing `aria-label`, a missing focus replacement). Models reliably catch the visible patterns but skip the absence checks unless explicitly told to look for them.
+- Diff vs upstream is purely additive: no rule deletions, no wording changes to existing rules. Adds ~15 lines under "How It Works" plus a tightened "Usage" block. The WebFetch behavior and the rules URL are unchanged.
+
+## Evidence
+
+Built a workbench of 4 sample React/TSX components seeded with 20 known violations across a11y / focus / forms / typography / animation rule families, then ran a 3-model matrix (`claude-sonnet-4.6`, `openai/gpt-5-mini`, `google/gemini-2.5-pro`) × 3 trials.
+
+| Model | Before | After |
+|---|---|---|
+| `claude-sonnet-4.6` | 10/12 (83%) | 12/12 (100%) |
+| `openai/gpt-5-mini` | 9/12 (75%) | 10/12 (83%) |
+| `google/gemini-2.5-pro` | 7/12 (58%) | 9/12 (75%) |
+| **Total** | **26/36 (72%)** | **31/36 (86%)** |
+
+`gpt-5-mini`'s gains come almost entirely from the new per-element checklist surfacing absence rules. Misses on two rules (`no-empty-state-handling`, `input-missing-autocomplete`) were eliminated entirely.
+
+A companion PR to `vercel-labs/web-interface-guidelines` adds matching per-element checklists + 5 BAD/GOOD code blocks to `command.md`. The two PRs can land independently but are most useful merged together.
+
+## Test plan
+
+- [ ] Read the diff — confirm additive only, no existing rules touched
+- [ ] Verify the SKILL.md still parses correctly as a Claude Code skill
+- [ ] Optional: re-run with your preferred review test files
+```
+
+## File diff
+
+**Before** (`skills/web-design-guidelines/SKILL.md`, 39 lines):
+
+The current upstream version. No changes needed before applying the diff below.
+
+**After** (54 lines, +15 net): adds explicit Pass 1 / Pass 2 sections to "How It Works" and tightens the "Usage" numbered list to reflect the two-pass workflow.
+
+The full proposed file is checked into our repo at:
+
+- [`examples/workbench/web-design-guidelines/proposed-upstream-changes/agent-skills--web-design-guidelines/after-SKILL.md`](../../../examples/workbench/web-design-guidelines/proposed-upstream-changes/agent-skills--web-design-guidelines/after-SKILL.md)
+
+A unified diff against the upstream:
+
+```diff
+--- skills/web-design-guidelines/SKILL.md  (current upstream)
++++ skills/web-design-guidelines/SKILL.md  (proposed)
+@@ metadata block @@
+   author: vercel
+-  version: "1.0.0"
++  version: "1.1.0"
+   argument-hint: <file-or-pattern>
+
+@@ "How It Works" section @@
+ ## How It Works
+
+ 1. Fetch the latest guidelines from the source URL below.
+ 2. Read the specified files (or prompt user for files/pattern).
+-3. Check against all rules in the fetched guidelines
+-4. Output findings in the terse `file:line` format
++3. Review each file in **TWO passes** — both passes are required.
++4. Output findings in the terse `file:line <issue>` format.
++
++### Pass 1 — Visible anti-patterns
++
++Scan each file for literal patterns that appear in the code:
++`<div onClick>` for actions, `transition: all`, `outline-none` className,
++`onPaste={(e) => e.preventDefault()}`, `"..."` (three dots), straight
++`"..."` quotes, etc. The full list is in the fetched guidelines. One
++finding per match.
++
++### Pass 2 — Absences (per-element checklist)
++
++The most-missed rules are about *what's missing*. After Pass 1, walk
++each `<img>`, `<input>`, `<button>`, and `<form>` once and run the
++checklist in the **"Per-element review"** section of the fetched
++guidelines. Report every attribute or behavior that should be present
++but isn't.
++
++Pass 2 is the difference between a 70% review and a 95% review. Do not skip it.
+
+@@ "Usage" section @@
+ ## Usage
+
+ When a user provides a file or pattern argument:
++
+ 1. Fetch guidelines from the source URL above.
+ 2. Read the specified files.
+-3. Apply all rules from the fetched guidelines
+-4. Output findings using the format specified in the guidelines
++3. Run Pass 1 (visible anti-patterns).
++4. Run Pass 2 (per-element absence checklist).
++5. Output findings using the format specified in the guidelines.
+```
+
+## Caveats
+
+1. **Companion PR dependency.** Pass 2 references a "Per-element review" section in the fetched rules doc (`command.md`). That section doesn't exist upstream yet — PR #2 in this batch adds it. Without PR #2 merged, the SKILL.md change is still useful (the two-pass workflow is well-defined) but Pass 2's per-element instruction has nothing to reference.
+2. **Version bump.** Set to `1.1.0` since this is a content addition. The repo doesn't appear to use Release Please-style automation, so the manual bump is fine.
+
+## Operator steps to submit
+
+```bash
+# 1. Clone the fork (assumes the fastxyz fork exists)
+git clone git@github.com:fastxyz/agent-skills.git /tmp/upstream-agent-skills
+cd /tmp/upstream-agent-skills
+git remote add upstream https://github.com/vercel-labs/agent-skills.git
+git fetch upstream
+git checkout -b skill/web-design-guidelines-two-pass upstream/main
+
+# 2. Apply the change: copy the proposed file from our repo.
+# SKILL_OPTIMIZER is a placeholder; point it at your skill-optimizer checkout.
+cp "${SKILL_OPTIMIZER:?set to your skill-optimizer checkout}/examples/workbench/web-design-guidelines/proposed-upstream-changes/agent-skills--web-design-guidelines/after-SKILL.md" \
+   skills/web-design-guidelines/SKILL.md
+
+# 3. Commit + push
+git add skills/web-design-guidelines/SKILL.md
+git commit -m "web-design-guidelines: add explicit two-pass workflow"
+git push -u origin skill/web-design-guidelines-two-pass
+
+# 4. Open PR
+gh pr create --repo vercel-labs/agent-skills --base main \
+  --title "web-design-guidelines: add explicit two-pass workflow" \
+  --body-file path/to/this-draft-body.md
+```
diff --git a/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md
new file mode 100644
index 0000000..30b1482
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/1-vercel-labs-web-interface-guidelines.md
@@ -0,0 +1,220 @@
+# PR #1 — vercel-labs/web-interface-guidelines: per-element checklist
+
+**Target:** `vercel-labs/web-interface-guidelines`
+**Files:** `command.md`, `AGENTS.md`, `README.md` (3-file mirror per
+PR #23 precedent)
+**Base branch:** `main`
+**Title:** `Add per-element checklist for absence-type rules`
+
+## Summary
+
+This is a single consolidated PR — one logical change mirrored across
+the repo's three distribution channels (`command.md` for slash-command
+agents, `AGENTS.md` for project-level ambient context, `README.md` for
+human readers). PR #23 (`Add translate="no" guideline`) is the
+precedent for the 3-file shape.
+
+The auto-pilot's `command.md` change was measured against an eval of 3
+frontier models (claude-sonnet-4.6, openai/gpt-5, google/gemini-2.5-pro)
+× 3 trials × 2 React components with 8 seeded violations each. The
+`AGENTS.md` and `README.md` mirrors are style-faithful reformulations
+of the same rule additions and are NOT independently measured; the PR
+body states this explicitly.
+
+## PR body (qualitative pitch + supporting evidence)
+
+```markdown
+Adds a per-element checklist (`<img>`, `<input>`, `<button>`) that
+surfaces absence-type rules — the kind of rules that are easy to miss
+because they require enumerating elements and checking each, rather
+than recognizing a visible bad pattern. Useful for both human and AI
+reviewers. Slots into existing structure between the form/content
+rules and `## Performance`. Purely additive — no existing rules
+touched.
+
+Same logical addition mirrored across `command.md`, `AGENTS.md`, and
+`README.md`, matching the PR #23 precedent for content additions.
+
+## Evidence (supporting, not headline)
+
+Ran an eval of 18 trials (3 frontier models × 3 trials × 2 seeded
+React components with 4 absence-type and 4 presence-type violations
+each).
+
+| Variant | Catch rate |
+|---|---|
+| Existing rules | 92% (66/72) |
+| With per-element checklist added to `command.md` | 100% (72/72) |
+
+The 6 missed violations were all absence-type, mostly missed by
+smaller models that don't proactively enumerate elements when given
+declarative rules. The checklist converts declarative rules into a
+procedural enumeration that frontier and smaller models both follow
+reliably.
+
+Note: the `command.md` variant is measured. The `AGENTS.md` and
+`README.md` mirrors are style-faithful reformulations of the same
+rule content (MUST/SHOULD/NEVER and prose styles per the existing
+voice of each file) and are not independently measured. They follow
+PR #23's pattern of mirroring content additions across all three
+files in one PR.
+```
+
+## File 1 — `command.md` (the measured change)
+
+**Insertion point:** between the existing `### Images` section and
+the `### Performance` section (around line 79 of upstream `main`).
+
+```diff
+@@ -76,6 +76,28 @@
+ - Below-fold images: `loading="lazy"`
+ - Above-fold critical images: `priority` or `fetchpriority="high"`
+ 
++### Per-element checklist (absence rules)
++
++Walk **every** instance of these elements — absence violations are the most-missed. Check each attribute is present, not just the element.
++
++**Every `<img>`:**
++- explicit `width` AND `height` (prevents CLS) — flag if either attribute is missing
++- below-fold → `loading="lazy"`
++- above-fold critical → `priority` or `fetchpriority="high"`
++
++**Every `<input>`:**
++- `autoComplete` set (specific value: `"email"`, `"current-password"`, `"username"`, etc.)
++- correct `type` + `inputmode`
++- associated `<label htmlFor>` or wrapping `<label>`
++- emails/codes/usernames → `spellCheck={false}`
++
++**Every icon-only `<button>` (no visible text):**
++- `aria-label` present
++
++**Every submit `<button>`:**
++- `disabled` only while request is in-flight (`isSubmitting`)—not gated on form validity
++- spinner or loading indicator during request
++
+ ### Performance
+```
+
+**Frontmatter note:** the auto-pilot's vendored `before-command.md`
+has slightly different `description:` and `argument-hint:` strings
+than upstream `main`. The actual PR diff against upstream should NOT
+touch the frontmatter — only insert the body content above.
+
+## File 2 — `AGENTS.md` (MUST/SHOULD/NEVER mirror)
+
+**Insertion point:** as a new top-level section between
+`## Content Handling` (ends ~line 113) and `## Performance` (~line
+114).
+
+```diff
+@@ around line 113, after the last bullet of "## Content Handling" @@
+ 
++## Per-element checklist (absence rules)
++
++Walk every instance—absence rules are the most-missed.
++
++**Every `<img>`:**
++- MUST: explicit `width` AND `height` (prevents CLS)
++- MUST: below-fold → `loading="lazy"`
++- SHOULD: above-fold critical → `priority` or `fetchpriority="high"`
++
++**Every `<input>`:**
++- MUST: `autoComplete` set to specific value (`email`, `current-password`, `username`, etc.)
++- MUST: correct `type` + `inputmode`
++- MUST: associated `<label htmlFor>` or wrapping `<label>`
++- SHOULD: `spellCheck={false}` for emails, codes, usernames
++
++**Every icon-only `<button>` (no visible text):**
++- MUST: descriptive `aria-label`
++
++**Every `<button type="submit">`:**
++- NEVER: `disabled={!form.valid}` style gating
++- MUST: `disabled` only while request in-flight; spinner during request
++
+ ## Performance
+```
+
+## File 3 — `README.md` (prose mirror)
+
+**Insertion point:** as a new top-level section between `## Forms`
+(ends ~line 107) and `## Performance` (~line 108).
+
+```diff
+@@ around line 107, after the last bullet of "## Forms" @@
+ 
++## Per-element checklist
++
++When reviewing a file, walk each element type and check every instance against the relevant attributes. Absence violations (a missing `aria-label`, a missing `autoComplete`, a missing `width`/`height`) are the most-missed because they require enumerating elements rather than recognizing a visible bad pattern.
++
++- **Every `<img>`.** Explicit `width` AND `height` (prevents CLS). Below-fold images: `loading="lazy"`. Above-fold critical images: `priority` or `fetchpriority="high"`.
++- **Every `<input>`.** Specific `autoComplete` value (`"email"`, `"current-password"`, `"username"`, etc.). Correct `type` + `inputmode`. Associated `<label htmlFor>` or wrapping `<label>`. Use `spellCheck={false}` for emails, codes, and usernames.
++- **Every icon-only `<button>` (no visible text).** Descriptive `aria-label` present.
++- **Every `<button type="submit">`.** `disabled` only while the request is in-flight (`isSubmitting`) — never gated on form validity. Show a loading indicator during the request.
++
+ ## Performance
+```
+
+## Caveats
+
+1. **3-file sync is intentional and matches repo convention.** PR #23
+   (Add `translate="no"` guideline) is the canonical precedent for an
+   additive change touching `AGENTS.md` + `README.md` +
+   `command.md` in one PR. Some merged PRs touched only 1–2 files,
+   but the maintainer-preferred shape is the 3-file mirror.
+2. **Slight `<img>` overlap with existing `### Images` section in
+   `command.md`.** The existing Images section already has bullets
+   for `width`/`height`, `loading="lazy"`, and `priority`. The new
+   per-element checklist restates those in a per-element-context
+   framing. This is intentional — the checklist's value is the
+   procedural framing ("walk every img"), not new rules. If the
+   maintainer flags it as redundant, we can drop the duplicate
+   `<img>` lines from the checklist (keeping just the procedural
+   `<input>`/`<button>` content) without affecting the eval result.
+3. **Style match.** Each file's mirror matches the surrounding voice
+   in that file (terse imperative bullets in command.md;
+   MUST/SHOULD/NEVER directives in AGENTS.md; prose with bold-lead
+   bullets in README.md). The actual rule content is identical
+   across all three.
+4. **Low-traffic repo.** 48 forks, last merge ~5 weeks ago. Don't
+   expect immediate response. PR #23 had the same shape (terse body,
+   additive guideline, 3-file mirror) and merged silently with one
+   maintainer approve.
+5. **The wrapper-skill PR (`vercel-labs/agent-skills/skills/web-design-guidelines/SKILL.md`)
+   is dropped.** Per upstream research, the SKILL.md is a thin
+   Claude-Code-specific adapter that WebFetches `command.md`. The
+   value lives in `command.md` (consumed by 7 agent tools via
+   `install.sh` plus 10+ downstream repos). Editing the wrapper
+   SKILL.md is low-leverage and high-risk-of-bitrot; we ship a
+   single PR to `web-interface-guidelines` instead.
+
+## Operator steps to submit
+
+```bash
+# 1. Clone fork
+git clone git@github.com:fastxyz/web-interface-guidelines.git \
+  /tmp/upstream-web-interface-guidelines
+cd /tmp/upstream-web-interface-guidelines
+git remote add upstream https://github.com/vercel-labs/web-interface-guidelines.git
+git fetch upstream
+git checkout -b feat/per-element-checklist upstream/main
+
+# 2. Apply the three diffs (manual edits to command.md, AGENTS.md, README.md)
+# Use the diff blocks above as guidance.
+
+# 3. Commit + push
+git add command.md AGENTS.md README.md
+git commit -m "Add per-element checklist for absence-type rules"
+git push -u origin feat/per-element-checklist
+
+# 4. Open PR
+gh pr create --repo vercel-labs/web-interface-guidelines --base main \
+  --title "Add per-element checklist for absence-type rules" \
+  --body-file path/to/this-draft-pr-body.md
+```
+
+## Provenance
+
+- v1.2.1 auto-pilot run: branch `eval/auto-pilot/web-design-guidelines`,
+  commit `df7149e`, status `success`, baseline 0.92, final 1.00
+- Context file: `tools/auto-improve-contexts/vercel-web-interface-guidelines.md`
+- Pilot cost: $2.29
diff --git a/docs/pilot-runs/upstream-pr-drafts/2-vercel-labs-web-interface-guidelines.md b/docs/pilot-runs/upstream-pr-drafts/2-vercel-labs-web-interface-guidelines.md
new file mode 100644
index 0000000..bcbfe2f
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/2-vercel-labs-web-interface-guidelines.md
@@ -0,0 +1,125 @@
+# PR #2 — vercel-labs/web-interface-guidelines: per-element checklist + examples
+
+**Target:** `vercel-labs/web-interface-guidelines`
+**Files:** `command.md` AND `AGENTS.md` (per the repo's dual-copy
+convention)
+**Base branch:** `main`
+**Title:** `Add per-element checklist and BAD/GOOD examples for absence-type rules`
+
+## Body (kept terse per this repo's style)
+
+```markdown
+Adds a "Per-element review (Pass 2)" section organized by element (`<img>`, `<input>`, `<button>`, etc.) plus 5 BAD/GOOD code examples for the rules our eval shows are most often overlooked: submit-button-disabled, paste-blocking, missing `autoComplete`, above-fold image priority hint, missing empty-state branch.
+
+Additive only — no existing rules deleted or reworded. Same content mirrored to `README.md` and `AGENTS.md` per repo convention.
+
+Eval evidence: the same 4-case workbench × 3-model matrix × 3 trials lifted total rule coverage from 72% to 86% after adding these (companion to the vercel-labs/agent-skills SKILL.md PR, which adds the two-pass workflow that references this section).
+```
+
+## File diff summary
+
+Upstream `command.md` is 180 lines. Proposed: 304 lines (+124 net).
+
+The full proposed file is checked into our repo at:
+
+- [`examples/workbench/web-design-guidelines/proposed-upstream-changes/web-interface-guidelines/after-command.md`](../../../examples/workbench/web-design-guidelines/proposed-upstream-changes/web-interface-guidelines/after-command.md)
+
+**Two structural additions**, after the existing "Rules" section and before "Output Format":
+
+### Section A — "Per-element review (Pass 2 checklist)"
+
+A reference checklist organized by element type that Pass 2 walks through:
+
+```markdown
+## Per-element review (Pass 2 checklist)
+
+For each element in the file, walk the relevant checklist and flag every
+attribute or behavior that should be present but isn't.
+
+**Every `<img>`:**
+- explicit `width` AND `height` (prevents CLS)
+- above-fold critical → `priority` or `fetchpriority="high"` (LCP)
+- below-fold → `loading="lazy"`
+- decorative → `alt=""`, meaningful → descriptive `alt`
+
+**Every `<input>`:**
+- `autoComplete` set
+- meaningful `name`
+- correct `type` (`email`, `tel`, `url`, `number`)
+- `inputMode` for mobile keyboards
+- `<label htmlFor>` or wrapping `<label>`
+- NO `onPaste={(e) => e.preventDefault()}`
+- emails / codes / usernames → `spellCheck={false}`
+
+**Every `<button>` (any type):**
+- visible focus style (`focus-visible:ring-*`)
+- `hover:` state for visual feedback
+- `type="button"` if not a form submit
+
+**Every `<button type="submit">`** (in addition to the above):
+- stays enabled until the request starts; spinner during the request
+- NEVER `disabled={!form.valid}` style
+
+[... continues for form, list/array render, interactive element,
+animation/transition, modal/dialog, native `<select>`, headings,
+brand names ...]
+```
+
+### Section B — "Common-miss examples"
+
+Five BAD/GOOD code blocks for the absence-type and anti-pattern rules
+the eval surfaced as systematically missed:
+
+1. **Submit button stays enabled until request starts** — BAD: `disabled={!email}`; GOOD: `disabled={submitting}` + spinner
+2. **Never block paste** — BAD: `onPaste={(e) => e.preventDefault()}`; GOOD: allow paste, validate after
+3. **Inputs need `autoComplete`** — BAD: no `autoComplete`; GOOD: `autoComplete="email"` (or `"off"` only when intended)
+4. **Above-fold critical images need a priority hint** — BAD: bare `<img>`; GOOD: `priority` or `fetchpriority="high"`
+5. **Handle empty states** — BAD: `<ul>{items.map(...)}</ul>`; GOOD: explicit `items.length === 0` branch
+
+## Dual-copy constraint
+
+Per AGENTS.md, this repo keeps `README.md` and `AGENTS.md` as parallel
+copies of the same content (one human-readable, one agent-readable).
+The proposed changes are content additions — they need to land in
+**both** files in the same PR. PR #20 ("Add `translate='no'` guideline")
+is the reference precedent.
+
+In the canonical workbench, our `command.md` is the master copy.
+`AGENTS.md` is the same content reformatted for the AGENTS standard;
+the manual diff is mechanical.
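+
+Because the mirror is applied by hand, a quick check that the new
+section heading reached every copy can catch a forgotten file before
+the PR is opened. The sketch below is illustrative only: the helper
+name `missing_mirrors` is hypothetical, and it matches the heading as a
+fixed string rather than comparing the full section content.
+
+```bash
+# Hypothetical helper: missing_mirrors HEADING FILE...
+# Prints each FILE that does not contain HEADING (fixed-string match)
+# and returns non-zero if any file is missing it.
+missing_mirrors() {
+  local heading="$1" file status=0
+  shift
+  for file in "$@"; do
+    if ! grep -qF -- "$heading" "$file"; then
+      echo "$file"
+      status=1
+    fi
+  done
+  return "$status"
+}
+```
+
+Run from the root of the upstream clone, for example
+`missing_mirrors "Per-element review" command.md AGENTS.md README.md`;
+any file name it prints still needs the mirrored section.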
+
+## Operator steps to submit
+
+```bash
+# 1. Clone the fork
+git clone git@github.com:fastxyz/web-interface-guidelines.git \
+  /tmp/upstream-web-interface-guidelines
+cd /tmp/upstream-web-interface-guidelines
+git remote add upstream https://github.com/vercel-labs/web-interface-guidelines.git
+git fetch upstream
+git checkout -b feat/per-element-checklist-and-examples upstream/main
+
+# 2. Replace command.md with the proposed version (SKILL_OPTIMIZER is a placeholder for your skill-optimizer checkout)
+cp "${SKILL_OPTIMIZER:?set to your skill-optimizer checkout}/examples/workbench/web-design-guidelines/proposed-upstream-changes/web-interface-guidelines/after-command.md" \
+   command.md
+
+# 3. Mirror the content into AGENTS.md (manual reformat — same sections,
+# AGENTS-standard frontmatter)
+# Reference: how PR #20 mirrored README.md → AGENTS.md.
+
+# 4. Commit + push
+git add command.md AGENTS.md
+git commit -m "Add per-element checklist and BAD/GOOD examples"
+git push -u origin feat/per-element-checklist-and-examples
+
+# 5. Open PR (terse body per this repo's style)
+gh pr create --repo vercel-labs/web-interface-guidelines --base main \
+  --title "Add per-element checklist and BAD/GOOD examples for absence-type rules" \
+  --body-file path/to/this-draft-body.md
+```
+
+## Caveats
+
+1. **README.md/AGENTS.md sync.** The diff above is for `command.md`. It still needs to be mirrored into `README.md` and `AGENTS.md` to match the repo's convention.
+2. **Low-traffic repo.** Last merge ~5 weeks ago. Don't expect immediate response. PR #20 had the same shape (terse body, additive guideline) and merged silently with a single approve.
+3. **Companion PR.** Most useful if the SKILL.md PR (#1 in this batch) is merged in parallel.
diff --git a/docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-pre-flight.md b/docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-pre-flight.md
new file mode 100644
index 0000000..aca5cbd
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/3-vercel-labs-agent-browser-pre-flight.md
@@ -0,0 +1,114 @@
+# PR #3 — vercel-labs/agent-browser: add Pre-flight section
+
+**Target:** `vercel-labs/agent-browser`
+**File:** `skill-data/core/SKILL.md` (NOT `skills/agent-browser/SKILL.md` — see Caveats)
+**Base branch:** `main`
+**Title:** `docs(skill): add pre-flight section discouraging curl/wget fallback`
+
+## Body
+
+```markdown
+## Summary
+
+- Adds a small additive `## Pre-flight` section to the core skill telling agents to verify the CLI is installed (`which agent-browser`) and NOT to fall back to `curl`, `wget`, `requests`, or `npm install`/`npx`.
+- Closes a real failure mode: across a 3-model eval matrix (claude-sonnet-4.6, openai/gpt-5, google/gemini-2.5-pro × 3 trials × 2 cases), Gemini fell back to `curl` for HTTP fetches once in 9 trials despite the skill prescribing `agent-browser navigate`. Smaller/older models in earlier runs (gpt-5-mini) did this more frequently.
+- Purely additive — 11 lines inserted, no existing content changed.
+
+## Test plan
+
+- [ ] Read the diff; confirm additive only
+- [ ] Verify no formatting regressions in the surrounding sections
+- [ ] (Optional) Run the agent-browser self-tests if any
+```
+
+## File diff
+
+Target: `skill-data/core/SKILL.md` (the real content file per `AGENTS.md`)
+
+```diff
+@@ near the top of the file, after the initial install/intro block @@
+
+ Install: `npm i -g agent-browser && agent-browser install`
+
++## Pre-flight
++
++Verify the CLI is ready before starting any task:
++
++```bash
++which agent-browser        # confirm it's installed and in PATH
++```
++
++**Do not** fall back to `curl`, `wget`, or `requests` for page fetches.
++**Do not** `npm install` or `npx` the CLI — use the pre-installed version.
++
+ ## Start here
+
+ This file is a discovery stub, not the usage guide. Before running any
+```
+
+The full proposed `after-SKILL.md` is checked into our repo at:
+
+- [`examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md`](../../../examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md)
+  (note: the auto-pilot's proposal points at `skills/agent-browser/SKILL.md`; for the
+  upstream PR we re-target to `skill-data/core/SKILL.md` per AGENTS.md)
+
+## Caveats
+
+1. **Location adjustment.** The auto-pilot proposed adding Pre-flight
+   to `skills/agent-browser/SKILL.md`. Per the upstream `AGENTS.md`,
+   that file is intentionally a thin discovery stub and feature content
+   lives in `skill-data/core/SKILL.md`. We retarget the change to the
+   correct file when submitting.
+
+2. **CI strictness.** This repo runs Rust fmt/clippy/test + dashboard
+   `pnpm build` + version-sync on every PR. Docs-only changes should
+   pass automatically. If anything trips, the diff is so small that
+   the fix is trivial.
+
+3. **No dashboard/MDX page update needed?** Per AGENTS.md, "Any skill
+   improvement PR must touch `skill-data/core/SKILL.md` and its
+   `references/` files, plus README and docs MDX pages." This change
+   is so minor (a single `##` section) that it likely doesn't need the
+   README or MDX updates — but worth checking with the maintainer
+   (`ctate`) in the PR description if you want zero-friction merge.
+   Alternative: also add a one-line bullet to README's "Tips" or
+   equivalent that says "verify install with `which agent-browser`".
+
+4. **Deeper-eval pilot timed out.** A v1.2.1 re-run with 4 new Tier-1
+   cases (ref-based-search, ref-disambiguation, output-correctness,
+   multi-step-state — pre-recorded fixtures, stateful fake CLI, all
+   smoke-tested) was attempted in this session to surface harder
+   failure modes than the original 2-case Tier-0 eval. The pilot was
+   killed by the wrapper's 90-min hard timeout mid-baseline (50/54
+   trials complete, no Phase 5 commit). The deeper eval itself is
+   committed at branch `eval/agent-browser-deeper-v1` (commit
+   `f0883ad`); the partial baseline trial data is preserved at
+   `examples/workbench/agent-browser/.results/20260512-101220/` and
+   could be analyzed in a future session. For this PR we ship the
+   original Pre-flight diff (eval baseline 0.97, 1 of 9 Gemini trials
+   used `curl` instead of `agent-browser navigate`) since the deeper
+   eval's measurement was incomplete.
+
+## Operator steps to submit
+
+```bash
+# 1. Clone fork
+git clone git@github.com:fastxyz/agent-browser.git /tmp/upstream-agent-browser
+cd /tmp/upstream-agent-browser
+git remote add upstream https://github.com/vercel-labs/agent-browser.git
+git fetch upstream
+git checkout -b docs/skill-pre-flight upstream/main
+
+# 2. Apply the change (manual edit to skill-data/core/SKILL.md)
+# The diff is small — paste the +11 lines after the install line.
+
+# 3. Commit + push (conventional commits per the repo's style)
+git add skill-data/core/SKILL.md
+git commit -m "docs(skill): add pre-flight section discouraging curl/wget fallback"
+git push -u origin docs/skill-pre-flight
+
+# 4. Open PR
+gh pr create --repo vercel-labs/agent-browser --base main \
+  --title "docs(skill): add pre-flight section discouraging curl/wget fallback" \
+  --body-file path/to/this-draft-body.md
+```
diff --git a/docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-two-pass.md b/docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-two-pass.md
new file mode 100644
index 0000000..3914f76
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/4-supabase-agent-skills-two-pass.md
@@ -0,0 +1,206 @@
+# PR #4 — supabase/agent-skills: monitor-two-pass-review reference
+
+**Target:** `supabase/agent-skills`
+**File:** `skills/supabase-postgres-best-practices/references/monitor-two-pass-review.md` (NEW file, additive)
+**Base branch:** `main`
+**Title:** `feat: add monitor-two-pass-review reference for absence-class SQL bugs`
+
+## Summary
+
+Single additive reference file under the existing `monitor-` prefix
+(diagnostic workflow). Frames a two-pass SQL-review pattern around a
+concrete anti-pattern (`UPDATE` missing `WHERE`) using the repo's
+required `**Incorrect**` / `**Correct**` SQL-block convention.
+
+The reference is the v1.2.1 auto-pilot's reshaping of a more abstract
+"two-pass review" concept. The auto-pilot read the upstream context
+file (`tools/auto-improve-contexts/supabase-postgres-best-practices.md`,
+encoded from gh-CLI research of CONTRIBUTING.md, `_template.md`,
+`_contributing.md`, `_sections.md`, plus the last 10 merged PRs) and
+produced a file that conforms exactly to the existing 28-reference
+convention: 4-field frontmatter, `monitor-` prefix, single rule,
+`**Incorrect**`/`**Correct**` SQL blocks, trailing `Reference:` link,
+~50 lines.
+
+## PR body (terse, per supabase convention)
+
+```markdown
+## Summary
+
+- Adds a new reference under the `monitor-` prefix that teaches a two-pass SQL review pattern catching absence-class bugs (missing `WHERE`, missing RLS, missing FK index) that single-pass review systematically misses.
+- Slots into the existing 28-reference convention: same frontmatter (`title`, `impact`, `impactDescription`, `tags`), same `**Incorrect**` / `**Correct**` SQL-block shape, same trailing `Reference:` link.
+- Purely additive — no existing files modified. `metadata.version` left to Release Please.
+```
+
+## File to add
+
+**Path:** `skills/supabase-postgres-best-practices/references/monitor-two-pass-review.md` (NEW file)
+
+````markdown
+---
+title: Run Two Passes on Generated SQL Reviews
+impact: MEDIUM
+impactDescription: Catch absence-class bugs (missing WHERE, missing index) that single-pass review skips
+tags: review, diagnostics, code-review, sql-review
+---
+
+## Run Two Passes on Generated SQL Reviews
+
+Single-pass SQL review catches tokens that should not be there (presence violations) but
+systematically misses required elements that are absent (absence violations). The most
+dangerous SQL bugs — mutations without `WHERE`, tables without RLS, foreign keys without
+indexes — all fall into the absence class and survive single-pass review undetected.
+
+**Incorrect (single-pass review approves unsafe mutation):**
+
+```sql
+-- Single pass: scanned for SELECT *, OFFSET, subqueries — none found
+-- Reviewer approves the following as safe:
+
+update orders set status = 'archived';
+-- Absence violation missed: no WHERE clause — this archives ALL rows, not just old ones
+```
+
+**Correct (two-pass review catches the absence violation):**
+
+```sql
+-- Pass 1 (presence): scan for known-bad tokens
+--   SELECT *? No.  OFFSET? No.  auth.uid() direct? No.  IF NOT EXISTS on ALTER? No.
+--   Passed.
+
+-- Pass 2 (absence): verify required patterns exist on every mutation and user table
+--   UPDATE/DELETE without WHERE? YES — absence violation caught
+
+-- Fix: add WHERE clause before approving
+update orders set status = 'archived'
+  where created_at < now() - interval '1 year';
+-- Now only rows older than one year are archived — safe and intentional
+```
+
+Pass 2 absence checklist — verify these exist:
+
+```sql
+-- UPDATE/DELETE must have a WHERE clause
+update users set is_active = false where last_login < now() - interval '1 year';
+
+-- User-data tables must have RLS enabled
+alter table messages enable row level security;
+
+-- FK columns must have a supporting index
+create index posts_author_id_idx on posts (author_id);
+```
+
+Reference: [Row Level Security](https://supabase.com/docs/guides/database/postgres/row-level-security)
+````
+
+## Evidence (honest framing)
+
+**This pilot did not produce measured uplift.** Two reasons up-front:
+
+1. The v2 auto-pilot baseline on a 5-case eval (45 trials across 3
+   frontier models) hit **0.97 overall**, above the 0.95 "skill needs
+   no changes" threshold. No iteration loop fired.
+2. The `monitor-two-pass-review.md` reference itself was therefore
+   produced as a packaging-only output (per upstream context constraint
+   "add EXACTLY ONE additive file"), not as a response to measured
+   failures.
+
+**Per-case breakdown reveals one weak case the reference targets:**
+
+| Case | Coverage | Notes |
+|---|---|---|
+| `review-schema` (5 violations) | 100% | Calibrated baseline from prior pilot |
+| `review-rls` (4 violations) | 97.2% | Calibrated baseline |
+| `review-multi-table-rls` (3 violations) — NEW | 100% | Frontier models handled enumeration cleanly |
+| `review-fk-index-audit` (3 violations) — NEW | 96.3% | Gemini missed 1 trial |
+| **`review-update-without-where` (1 violation) — NEW** | **77.8%** | **2/9 trials missed by gpt-5-mini + gemini** |
+
+The `update-without-where` case at 77.8% is the failure mode the
+reference directly addresses. The 0.97 overall average masks it
+because the auto-pilot's exit-on-≥0.95 logic uses overall average
+rather than per-case minimum (a known v1.2.1 limitation; addressed in
+the v1.3 design proposal).
+
+**Earlier evidence (batch-1 pilot, 2026-05-08):** the same two-pass
+concept (then in less polished form) moved from an uncalibrated
+baseline of 0.54 to 0.86, but grader fixes and the skill change were
+bundled in that run. We never cleanly separated the grader-calibration
+uplift from the skill-change uplift, so this number is **not** clean evidence either.
+
+**Net pitch:** the reference is structurally sound and convention-perfect.
+It addresses an observed failure pattern (update-without-where at
+77.8%) that single-pass review systematically misses. We don't have
+clean v1.2.1 measurement that quantifies its effect because frontier
+models on the rest of the suite are at ceiling. Maintainer decides if
+that's worth merging.
+
+## Caveats
+
+1. **Convention compliance.** Filename uses the existing `monitor-` prefix
+   (a new prefix would have required modifying `_sections.md`, which is
+   not an additive change). Frontmatter has all 4 required fields per
+   `_template.md`. Body has `**Incorrect (...)**` + `**Correct (...)**`
+   blocks per `_contributing.md` Key Principle #1 ("Show exact SQL
+   rewrites. Avoid philosophical advice."). Code blocks tagged `sql`
+   with lowercase keywords. Trailing `Reference:` link.
+2. **No SKILL.md changes.** Per `release-please-config.json`, the
+   `metadata.version: "1.1.1"` field is auto-managed by Release
+   Please's `extra-files` regex. Manual edits would conflict with the
+   bot's release PR.
+3. **No `_sections.md`, `_template.md`, or `_contributing.md` changes.**
+   Those are infrastructure files; CONTRIBUTING.md treats touching them
+   as a "major change requiring prior Discussion".
+4. **Discussion-first gate.** PR #48 (qvad's "Add YugabyteDB write
+   throughput optimization skill", 13 reference files, no prior
+   Discussion) was closed without merge. Single additive reference
+   files under existing prefixes do NOT trigger this gate per recent
+   merged PRs (PR #71 from gregnr, PR #55 from external `staaldraad`
+   both merged within hours).
+5. **`pnpm test:sanity` does NOT validate frontmatter.** Confirmed by
+   reading `test/sanity.test.ts` directly — it only runs
+   `npx skills add` to verify install. Convention is enforced by
+   maintainer review only.
+
+## Operator steps to submit
+
+```bash
+# 1. Clone fork
+git clone git@github.com:fastxyz/agent-skills-supabase.git \
+  /tmp/upstream-supabase-agent-skills
+cd /tmp/upstream-supabase-agent-skills
+git remote add upstream https://github.com/supabase/agent-skills.git
+git fetch upstream
+git checkout -b feat/monitor-two-pass-review upstream/main
+
+# 2. Add the reference file (paste the content above)
+mkdir -p skills/supabase-postgres-best-practices/references
+# Paste content into:
+# skills/supabase-postgres-best-practices/references/monitor-two-pass-review.md
+
+# 3. Run sanity tests
+pnpm install
+pnpm test:sanity
+
+# 4. Commit + push
+git add skills/supabase-postgres-best-practices/references/monitor-two-pass-review.md
+git commit -m "feat: add monitor-two-pass-review reference for absence-class SQL bugs"
+git push -u origin feat/monitor-two-pass-review
+
+# 5. Open PR (terse body per repo convention)
+gh pr create --repo supabase/agent-skills --base main \
+  --title "feat: add monitor-two-pass-review reference for absence-class SQL bugs" \
+  --body-file path/to/this-draft-body.md
+```
+
+## Provenance
+
+- v2 auto-pilot run: branch `eval/auto-pilot/supabase-postgres-best-practices-v2`,
+  commit `59c3e85`, status `success`, baseline 0.97, final 0.97 (no iteration; per-case
+  breakdown shows update-without-where at 77.8%).
+- v1 auto-pilot run: branch `eval/auto-pilot/supabase-postgres-best-practices--v1-shallow`,
+  commit `7721534`, status `success`, baseline 1.00, same proposed file.
+- Batch-1 (older models, uncalibrated graders): branch
+  `eval/auto-pilot/supabase-postgres-best-practices--v1`, commit `94659af`,
+  status `success`, baseline 0.54, final 0.86 (uplift conflated with grader-fix).
+- Context file: `tools/auto-improve-contexts/supabase-postgres-best-practices.md`
+- Total v2 pilot cost: $3.15
diff --git a/docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md b/docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md
new file mode 100644
index 0000000..dbb090a
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md
@@ -0,0 +1,206 @@
+# PR #5 — google-labs-code/stitch-skills: shadcn-ui code review checklist
+
+**Target:** `google-labs-code/stitch-skills`
+**File:** `skills/shadcn-ui/SKILL.md`
+**Base branch:** `main`
+**Title:** `feat: add code review checklist + custom-component placement guidance to shadcn-ui`
+
+## Summary
+
+Adds two additive sections to `skills/shadcn-ui/SKILL.md`:
+
+1. A "**CRITICAL: Never place custom/composed components in `components/ui/`**" callout
+   inside the existing `### 3. Extending Components` section, with a side-by-side BAD/GOOD
+   TSX example showing the path-comment cue (`// src/components/ui/UserCard.tsx` ← WRONG
+   vs `// src/components/UserCard.tsx` ← CORRECT).
+2. A new `## Code Review Checklist` section before `## Validation and Quality` that walks
+   reviewers through a two-pass scan: Pass 1 catches visible anti-patterns (file location,
+   class merging with `cn()`, variant logic with `cva`, ARIA preservation), Pass 2 catches
+   absence violations (interactive elements without keyboard handlers; theme colors
+   hard-coded instead of CSS variables).
+
+Purely additive: no existing rules deleted or reworded. ~60 net lines added (~387 lines
+total vs upstream's 326).
+
+## PR body
+
+```markdown
+## Summary
+
+- Adds an explicit BAD/GOOD example for the `components/ui/` placement rule so reviewers can spot wrong-location violations from the first-line path comment.
+- Adds a `## Code Review Checklist` section that frames shadcn/ui review as a two-pass workflow (visible anti-patterns then absence checks). Useful for both human and AI reviewers.
+- Purely additive — no existing rules touched.
+
+## Evidence
+
+Evaluated with the v1.3 auto-pilot orchestrator on a 2-case, 3-frontier-model matrix
+(claude-sonnet-4.6, openai/gpt-5, google/gemini-2.5-pro; 2 cases × 3 models × 3 trials = 18 trials):
+
+| Metric | Baseline | After this change |
+|---|---|---|
+| Per-case-min rule coverage | **0.667** | **0.889** (+0.222 uplift) |
+| review-usercard mean | 0.889 | 1.000 |
+| review-statusbadge mean | 0.667 | 0.889 |
+
+Targeted miss: gemini-2.5-pro missed the `wrong-file-location` violation on
+StatusBadge.tsx in 3/3 trials at baseline. The skill change moved gemini to 2/3 on
+that case (the path-comment cue made the absence-type rule recognizable).
+
+A prior batch with the older `gpt-4o-mini` matrix showed +0.111 uplift; switching to
+gpt-5 raised the baseline AND showed a larger absolute uplift (+0.222), confirming the
+addition isn't a small-model artifact.
+```
+
+## File diff
+
+Target: `skills/shadcn-ui/SKILL.md` (the canonical skill file; path relative to the repo root).
+
+The full proposed file is committed in our repo at:
+
+- [`examples/workbench/shadcn-ui/proposed-upstream-changes/google-labs-code-stitch-skills/after-SKILL.md`](../../../examples/workbench/shadcn-ui/proposed-upstream-changes/google-labs-code-stitch-skills/after-SKILL.md)
+
+Two insertion points (unified diff against upstream `main`):
+
+```diff
+@@ around line 184, inside "### 3. Extending Components" @@
+
+ ### 3. Extending Components
+
++**CRITICAL: Never place custom/composed components in `components/ui/`.**
++
++`components/ui/` is reserved exclusively for the raw shadcn/ui primitive components (installed
++via `npx shadcn@latest add`). Any wrapper, composed, or business-logic component must live in
++`components/` (or a subdirectory like `components/cards/`, `components/forms/`).
++
++```tsx
++// BAD: custom composed component placed in components/ui/
++// src/components/ui/UserCard.tsx  ← WRONG
++export function UserCard({ name, role }: UserCardProps) {
++  return <Card>...</Card>;
++}
++
++// GOOD: custom composed component in components/
++// src/components/UserCard.tsx     ← CORRECT
++export function UserCard({ name, role }: UserCardProps) {
++  return <Card>...</Card>;
++}
++```
++
+ Create wrapper components in `components/` (not `components/ui/`):
+```
+
+```diff
+@@ around line 322, between "### Component-Specific Notes" and "## Validation and Quality" @@
+
++## Code Review Checklist
++
++When reviewing existing code for shadcn/ui best-practice compliance, scan each file in two passes:
++
++### Pass 1 — File placement and visible anti-patterns
++
++- [ ] **File location**: Custom/composed components must NOT be in `components/ui/`. **Always
++      read the first line of each file** — source files begin with a path comment (e.g.
++      `// src/components/ui/StatusBadge.tsx`). If that path contains `components/ui/` AND the
++      component is NOT a raw shadcn primitive (installed via CLI), that is a wrong-location
++      violation. Flag it: `StatusBadge.tsx:1 — custom component placed in components/ui/; move
++      to components/`.
++
++  ```tsx
++  // BAD: path comment reveals wrong location
++  // src/components/ui/StatusBadge.tsx   ← WRONG (custom composed component in ui/)
++  export function StatusBadge(...) { ... }
++
++  // GOOD: custom component in components/
++  // src/components/StatusBadge.tsx      ← CORRECT
++  export function StatusBadge(...) { ... }
++  ```
++- [ ] **Class merging**: Every dynamic `className` must use `cn()` (clsx + tailwind-merge).
++      Reject bare string concatenation: `"base " + extra` or template literals without `cn()`.
++- [ ] **Variant logic**: Multiple style variants must use `cva` from `class-variance-authority`.
++      Reject `if/else` or ternary chains that select class strings manually.
++- [ ] **ARIA preservation**: Custom components that wrap Radix UI / shadcn primitives must not
++      set `aria-*` props to `undefined` — that strips the accessibility attribute entirely.
++
++### Pass 2 — Absence checks (per element)
++
++**Every interactive element** (`<div onClick>`, `<span onClick>`, non-`<button>` click targets):
++- Has `role="button"` (or appropriate role)
++- Has `onKeyDown` or `onKeyUp` keyboard handler
++- Has `tabIndex={0}` so it is keyboard-reachable
++
++**Every theme color** in custom components:
++- Uses CSS variables (`bg-primary`, `text-foreground`, etc.) for brand colors
++- Hard-coded Tailwind color utilities (`bg-blue-600`) are acceptable for semantic status
++  colors (success/error/warning) but not for primary/secondary/background theme colors
++
+ ## Validation and Quality
+```
+
+## Caveats
+
+1. **Google CLA required.** Per `CONTRIBUTING.md`, contributors must sign the
+   [Google Contributor License Agreement](https://cla.developers.google.com/about) before
+   the PR can be merged. One-time step per Google account; covers all Google-Open-Source
+   projects. The bot blocks merges until the CLA shows green.
+2. **Apache 2.0 license** on the repo (verified via `LICENSE` file at repo root).
+3. **No Release Please / no semver-bump bot.** Maintainers manage versions manually.
+   Do NOT bump `metadata` versions in any frontmatter.
+4. **CI gating.** The repo's CI validates the `react-components/` subtree only;
+   shadcn-ui skill changes bypass CI. Docs-only / SKILL.md changes pass automatically.
+5. **Recent merged-PR shape.** Last 5 merged PRs (#23, #31, #33, #36, #38) are all
+   single-skill additive changes with `feat:` titles. Convention is loose — `feat:`,
+   `chore:`, no-prefix all merged. Conventional commits encouraged but not strict.
+6. **Cosmetic whitespace changes in the diff.** When `markdownlint --fix` ran on the
+   workbench copy, it removed a few trailing whitespace characters (lines 147, 186-189
+   in upstream) and reformatted one function signature. These are unrelated to the
+   substantive additions. Either include them as a "while you're here" cleanup or
+   manually revert before submitting (cleaner: keep additive-only).
+
+## Operator steps to submit
+
+```bash
+# 1. Sign the Google CLA at https://cla.developers.google.com/ if you haven't already.
+
+# 2. Clone your fork of the upstream repo
+git clone git@github.com:fastxyz/stitch-skills.git \
+  /tmp/upstream-stitch-skills
+cd /tmp/upstream-stitch-skills
+git remote add upstream https://github.com/google-labs-code/stitch-skills.git
+git fetch upstream
+git checkout -b feat/shadcn-ui-code-review-checklist upstream/main
+
+# 3. Apply the change
+# Easiest: copy after-SKILL.md from your local checkout of this repo (set the placeholder
+# SKILL_OPTIMIZER_REPO first), then strip the cosmetic whitespace fixes for strict additive-only.
+cp "${SKILL_OPTIMIZER_REPO:?set to your local checkout of this repo}/examples/workbench/shadcn-ui/proposed-upstream-changes/google-labs-code-stitch-skills/after-SKILL.md" \
+   skills/shadcn-ui/SKILL.md
+
+# (Optional) revert the cosmetic whitespace changes:
+# git diff upstream/main -- skills/shadcn-ui/SKILL.md
+# Then manually revert the trailing-whitespace and function-signature reformatting hunks.
+
+# 4. Commit + push
+git add skills/shadcn-ui/SKILL.md
+git commit -m "feat: add code review checklist + custom-component placement guidance to shadcn-ui"
+git push -u origin feat/shadcn-ui-code-review-checklist
+
+# 5. Open the PR
+gh pr create --repo google-labs-code/stitch-skills --base main \
+  --title "feat: add code review checklist + custom-component placement guidance to shadcn-ui" \
+  --body-file path/to/this-draft-body.md
+```
+
+## Provenance
+
+- v1.3 orchestrator dispatch (gpt-5 frontier matrix):
+  - Branch: `eval/auto-pilot/shadcn-ui-gpt5-refire`
+  - Commit: `4c7d112`
+  - Status: `success`
+  - Baseline per-case-min: 0.667 → final: 0.889 (+0.222 uplift)
+  - Total cost: $2.50 ($0.91 baseline + $1.59 iteration 1)
+- Earlier batch-2 dispatch (gpt-4o-mini matrix) showed +0.111 uplift —
+  branch `eval/auto-pilot/shadcn-ui` commit `1744daf`
+- Context file (research subagent output):
+  `skills/auto-improve-orchestrator/references/contexts/google-labs-code-shadcn-ui.md`
+- Eval workbench: `examples/workbench/shadcn-ui/` (2 cases:
+  `review-usercard`, `review-statusbadge`)
diff --git a/docs/pilot-runs/upstream-pr-drafts/README.md b/docs/pilot-runs/upstream-pr-drafts/README.md
new file mode 100644
index 0000000..f1b6295
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/README.md
@@ -0,0 +1,44 @@
+# Upstream PR drafts
+
+Polished PR drafts for the first round of upstream contributions. Each
+draft is ready to copy-paste into the actual upstream repo after a
+final review. The actual `git push` to a fork + `gh pr create` is left
+to the operator (the orchestrator only drafts).
+
+## Drafts (current canonical set)
+
+| # | Skill | Target repo | Evidence strength | Draft |
+|---|---|---|---|---|
+| 1 | web-design-guidelines (rules doc) | `vercel-labs/web-interface-guidelines` | **Strong.** v1.2.1 measured 0.92→1.00 across 18 trials × 3 frontier models. 22-line additive change. | [draft](./1-vercel-labs-web-interface-guidelines.md) |
+| 3 | agent-browser (Pre-flight) | `vercel-labs/agent-browser` | **Soft.** v1.0 baseline 0.97; 1 of 9 Gemini trials fell back to `curl`. Deeper-eval v1.2.1 pilot was attempted but timed out at the 90-min wrapper cap mid-baseline (50/54 trials done, no Phase 5 commit). 11-line additive Pre-flight section. | [draft](./3-vercel-labs-agent-browser-pre-flight.md) |
+| 4 | supabase-postgres-best-practices | `supabase/agent-skills` | **Soft.** v2 baseline 0.97 overall; per-case shows update-without-where at 77.8% (the failure pattern the reference targets). Auto-pilot's exit-on-≥0.95-overall logic missed the per-case signal (v1.3 design addresses this). Single additive reference file under existing `monitor-` prefix. | [draft](./4-supabase-agent-skills-two-pass.md) |
+
+The wrapper-skill PR target (`vercel-labs/agent-skills/skills/web-design-guidelines/SKILL.md`)
+was dropped — see `superseded/README.md`. The SKILL.md is a thin
+discovery-stub adapter; all value lives in `command.md` (PR #1).
+
+## Process to submit each PR
+
+1. **Fork** the upstream repo on GitHub (or use an existing fork).
+2. **Clone the fork** locally outside this repo (e.g.
+   `git clone git@github.com:fastxyz/<upstream-repo>.git /tmp/upstream-<repo>`).
+3. **Make the changes** described in the draft on a new branch.
+4. **Run any local checks** the convention doc calls for (e.g.
+   `pnpm test:sanity` for supabase — but note: sanity test does NOT
+   validate per-reference frontmatter; convention is enforced by
+   maintainer review).
+5. **Commit + push** to the fork.
+6. **Open the PR** with the title/body from the draft. Use
+   `gh pr create --base main --repo <upstream> --title "..." --body "..."`.
+7. **Link** the resulting URL back to this draft for traceability.
+
+## Conventions reference
+
+See [`../upstream-pr-conventions.md`](../upstream-pr-conventions.md) for
+the per-repo title format, body convention, CI gates, and gotchas
+discovered while researching each upstream.
+
+## Superseded drafts
+
+Earlier drafts (pre-v1.2.1, pre-research) are archived under
+[`superseded/`](./superseded/) for historical reference.
diff --git a/docs/pilot-runs/upstream-pr-drafts/superseded/README.md b/docs/pilot-runs/upstream-pr-drafts/superseded/README.md
new file mode 100644
index 0000000..2087dc2
--- /dev/null
+++ b/docs/pilot-runs/upstream-pr-drafts/superseded/README.md
@@ -0,0 +1,11 @@
+# Superseded drafts
+
+These drafts were written before the v1.2.1 auto-pilot re-runs. They
+remain here for historical reference only.
+
+| File | Status | Why superseded |
+|---|---|---|
+| `1-vercel-labs-agent-skills-web-design-guidelines.md` | DROPPED | The vercel-labs/agent-skills/web-design-guidelines SKILL.md is a thin discovery-stub adapter. Real value lives in the rules doc at vercel-labs/web-interface-guidelines/command.md, which is the canonical Vercel design artifact distributed to 7 agent tools via install.sh. The wrapper SKILL.md isn't a meaningful improvement target — it's been essentially untouched since creation. We dropped this PR target entirely. |
+| `2-vercel-labs-web-interface-guidelines.md` | REPLACED | Replaced by the v1.2.1 auto-pilot's evidence-grounded draft at `../1-vercel-labs-web-interface-guidelines.md`. Old draft sourced content from the manual proposal; new draft sources from a measured eval (0.92→1.00, 18 trials × 3 frontier models). |
+
+The current canonical drafts are in the parent directory.
diff --git a/examples/workbench/agent-browser/README.md b/examples/workbench/agent-browser/README.md
new file mode 100644
index 0000000..0f5393d
--- /dev/null
+++ b/examples/workbench/agent-browser/README.md
@@ -0,0 +1,191 @@
+# agent-browser eval
+
+Eval suite for
+[`vercel-labs/agent-browser/agent-browser`](https://github.com/vercel-labs/agent-browser),
+a browser automation CLI for AI agents. It drives Chrome/Chromium via CDP with
+accessibility-tree snapshots and compact `@eN` element refs.
+
+## Cases
+
+The suite has two tiers. Tier-0 inherits from the v1 baseline and grades only
+that the right CLI subcommands were invoked. Tier-1 grades the actual value of
+the skill: snapshot-driven workflows, `@eN` ref discipline, and content
+correctness derived from pre-recorded accessibility trees.
+
+### Tier-0 — command presence
+
+#### `navigate-and-report` — tool invocation + skill load + navigate + snapshot
+
+| Check | Behavior tested | Rule |
+|---|---|---|
+| V1 | `agent-browser` was invoked at all | Use agent-browser over built-in tools |
+| V2 | `agent-browser skills get core` called before other commands | "Before running any command, load the actual workflow content" |
+| V3 | `agent-browser navigate` used (not `curl`/`wget`) | Prefer agent-browser over built-in browser automation or web tools |
+| V4 | `agent-browser snapshot` called to inspect the page | Take snapshot after navigating to understand page structure |
+| V5 | `heading.txt` written with non-empty content | Task output produced |
+
+#### `screenshot-capture` — tool invocation + skill load + screenshot + output files
+
+| Check | Behavior tested | Rule |
+|---|---|---|
+| V1 | `agent-browser` was invoked at all | Use agent-browser over built-in tools |
+| V2 | `agent-browser skills get core` called before other commands | "Before running any command, load the actual workflow content" |
+| V3 | `agent-browser navigate` used (not `curl`/`wget`) | Prefer agent-browser over built-in browser automation or web tools |
+| V4 | `agent-browser screenshot` called | Use screenshot command for visual capture |
+| V5a | `screenshot.png` created (non-empty) | Screenshot output file produced |
+| V5b | `title.txt` written with non-empty content | Task text output produced |
+
+### Tier-1 — snapshot-driven `@eN` refs and content correctness
+
+These cases play back **pre-recorded accessibility-tree snapshots** so the
+grader can verify the agent reached the right element, took the right
+state-machine path, and extracted the right text. See
+[Recording playback](#recording-playback) below.
+
+#### `ref-based-search` — Wikipedia-style search via `@eN` refs
+
+Recordings: `references/agent-browser/recordings/wikipedia/`
+(`snapshot.out`, `snapshot-after-search.out`, `transitions.txt`)
+
+| Check | Behavior tested |
+|---|---|
+| V1 | agent-browser was invoked |
+| V2 | snapshot called BEFORE the first click/type (snapshot-first discipline) |
+| V3 | `type @e7 …` — typed into the searchbox by its accessibility ref, not a CSS selector |
+| V4 | `click @e8` — clicked the submit button by its ref, not the searchbox or another link |
+| V5 | snapshot re-taken AFTER `click @e8` (must observe the new page) |
+| V6 | `top-result.txt` contains the actual top result ("Hypertext Transfer Protocol") from the recorded results page |
+| V7 | no CSS-selector-style refs anywhere in click/type calls |
+
+#### `ref-disambiguation` — pick the right of two visually-similar buttons
+
+Recordings: `references/agent-browser/recordings/signin-signup/`
+(buttons `@e5 "Sign In"` vs `@e6 "Sign Up"`, with separate post-click pages)
+
+| Check | Behavior tested |
+|---|---|
+| V1 | agent-browser was invoked |
+| V2 | snapshot-first discipline |
+| V3 | clicked `@e5` (Sign In), NOT `@e6` (Sign Up) |
+| V4 | exactly one click on `@e5` (no retry loop) |
+| V5 | `next-heading.txt` matches the Sign In flow heading (NOT the Sign Up heading) |
+| V6 | no CSS-selector-style refs |
+
+#### `output-correctness` — extract the right text from three plausible candidates
+
+Recordings: `references/agent-browser/recordings/blog-article/`. The page has
+a kicker tagline, a level-1 article heading, and a byline — only the heading
+is the article title.
+
+| Check | Behavior tested |
+|---|---|
+| V1 | agent-browser was invoked |
+| V2 | snapshot was called |
+| V3 | `title.txt` matches the article title exactly ("Why We Migrated Our Build System to Bazel") |
+| V4 | `title.txt` does NOT include the kicker "FROM THE PLATFORM TEAM" |
+| V5 | `title.txt` does NOT include the byline ("By Jordan Lee") |
+| V6 | no CSS-selector-style refs |
+
+#### `multi-step-state` — full state-machine traversal across a 2-field form
+
+Recordings: `references/agent-browser/recordings/multistep-form/` —
+`initial -> name-entered -> email-entered -> submitted`. Each post-action
+snapshot reveals new state (filled values, button-disabled flag, then a
+confirmation page with code `NL-7QF3-2026`).
+
+| Check | Behavior tested |
+|---|---|
+| V1 | agent-browser was invoked |
+| V2 | snapshot-first discipline |
+| V3 | full path traversed in order: `type @e5` -> `type @e6` -> `click @e7` |
+| V4 | the value typed into `@e6` is an email-shaped string (matches `ada@example.com`) |
+| V5 | snapshot re-taken AFTER the final click (must observe confirmation page) |
+| V6 | `confirm.txt` contains "NL-7QF3-2026" (extracted from the post-submit recording) |
+| V7 | no CSS-selector-style refs |
+
+## Recording playback
+
+The Tier-1 cases use **fabricated but realistic** accessibility-tree
+recordings. Real `agent-browser` would need a Rust binary plus a headless
+Chrome inside the Docker workbench; the eval avoids that by replaying static
+fixtures that look exactly like real `snapshot` output.
+
+Layout per page:
+
+```text
+references/agent-browser/recordings/<page>/
+  transitions.txt        # URL match, initial state, click/type -> next-state rules
+  snapshot.out           # initial snapshot
+  snapshot-<state>.out   # one file per reachable post-action state
+```
+
+`transitions.txt` example:
+
+```text
+url=https://en.wikipedia.org/wiki/Main_Page
+url-prefix=https://en.wikipedia.org
+state=initial
+
+type  @e7 -> initial
+click @e8 -> after-search
+```
+
+The fake CLI at `bin/agent-browser`:
+
+- Logs every invocation to `/work/ab-calls.log` (graders depend on this).
+- Maintains a 2-line state cookie at `/work/.ab-state` (`page=…`, `state=…`).
+- On `navigate <url>`: matches the URL against each recording's `url=` /
+  `url-prefix=` and resets to that page's `state=`.
+- On `snapshot`: emits the recorded `snapshot[-<state>].out` for the current
+  page+state. Falls back to the legacy generic `Example Domain` snapshot when
+  no page is set (this preserves Tier-0 behaviour).
+- On `click @eN` / `type @eN …`: looks up matching transitions and advances
+  state if a rule fires. Always echoes a realistic `Clicked @eN` /
+  `Typed "…" into @eN` line.
+- For `screenshot`, `evaluate`, `skills get …`, `version`, `which`: returns
+  canned but shape-correct output.
+
+The CLI accepts an `AB_WORK` environment variable so it can be smoke-tested
+outside Docker against a local sandbox directory; in production the workbench
+mounts the agent at `/work` so the default applies.
+
+## Vendored snapshot
+
+The skill normally loads the core workflow by running
+`agent-browser skills get core`, which fetches version-matched content from
+the installed CLI. For deterministic eval we vendor a snapshot at
+`references/agent-browser/agent-browser-core.md` (updated for `@eN` ref
+syntax) and tweak `SKILL.md` to read it locally via
+`cat /work/references/agent-browser/agent-browser-core.md`. Diff vs upstream
+is one line.
+
+## Smoke-check the graders
+
+Before spending real model dollars, verify each grader's checks fire as
+designed. The smoke script crafts good and bad `ab-calls.log` + output-file
+fixtures for every Tier-1 case and asserts the JSON envelope:
+
+```bash
+node examples/workbench/agent-browser/checks/smoke-graders.mjs
+```
+
+It runs each grader at least twice: one GOOD scenario that is expected to
+pass, plus one or more BAD scenarios that must fail with specific evidence
+substrings. The script exits non-zero if any assertion is violated. There are
+14 assertions across the 4 new graders; failures preserve the temp workspace
+for triage.
+
+## Run
+
+```bash
+export OPENROUTER_API_KEY=sk-or-...
+npx tsx ../../../src/cli.ts run-suite ./suite.yml --trials 3
+```
+
+## Models
+
+The suite runs a 3-provider mid-tier matrix:
+
+- `openrouter/anthropic/claude-sonnet-4.6`
+- `openrouter/openai/gpt-5`
+- `openrouter/google/gemini-2.5-pro`
diff --git a/examples/workbench/agent-browser/analysis.md b/examples/workbench/agent-browser/analysis.md
new file mode 100644
index 0000000..13e4d84
--- /dev/null
+++ b/examples/workbench/agent-browser/analysis.md
@@ -0,0 +1,18 @@
+---
+skill: vercel-labs/agent-browser/agent-browser
+status: success
+classification: tool-use
+baseline_rule_coverage: 0.97
+final_rule_coverage: 0.97
+modifications_tried: 0
+total_cost_usd: 0.73
+---
+
+# Auto-pilot run for `vercel-labs/agent-browser/agent-browser`
+
+- Classified as **tool-use / mcp-driver**: SKILL.md is a discovery stub that directs the agent to run `agent-browser skills get core` to load the actual workflow content; the skill's value is steering agents toward the `agent-browser` CLI instead of curl/playwright fallbacks.
+- Eval shape: two cases (`navigate-and-report`, `screenshot-capture`) with a fake `bin/agent-browser` CLI that logs all invocations to `/work/ab-calls.log`; vendored core skill at `references/agent-browser/agent-browser-core.md` (SKILL.md modified to `cat` the local file instead of calling the CLI).
+- Grader calibration: initial grader only checked `ab-calls.log` for `skills get core`; fixed to also accept `cat agent-browser-core.md` in trace.jsonl (Gemini was correctly reading the vendored file via `cat` but was being marked as V2-failing).
+- Baseline 0.97 (97/100 behavioral checks passed across 3 models × 3 trials × 2 cases) — above the 0.95 threshold; no skill modifications needed.
+- The 3 missed checks were all in a single Gemini trial that loaded the core skill but then used `curl` for HTTP fetching instead of `agent-browser navigate` (reaches-for-fallback pattern, Recipe B). A minor additive Pre-flight section is proposed upstream.
+- Proposed upstream change: add 5-line `## Pre-flight` section discouraging curl/wget fallback; see `proposed-upstream-changes/`.
diff --git a/examples/workbench/agent-browser/bin/agent-browser b/examples/workbench/agent-browser/bin/agent-browser
new file mode 100755
index 0000000..c3f6cb7
--- /dev/null
+++ b/examples/workbench/agent-browser/bin/agent-browser
@@ -0,0 +1,286 @@
+#!/usr/bin/env bash
+# Fake agent-browser CLI for eval testing.
+#
+# Logs every invocation to /work/ab-calls.log (graders depend on this).
+# For Tier-1 cases, replays pre-recorded accessibility-tree snapshots from
+#   /work/references/agent-browser/recordings/<page>/snapshot[-<state>].out
+# and advances state on click/type via a transitions.txt manifest:
+#
+#   /work/references/agent-browser/recordings/<page>/transitions.txt
+#
+# transitions.txt format (lines, comments allowed with leading '#'):
+#   page-title=<text used in title.txt grading>
+#   url=<canonical url>           (optional, multiple allowed)
+#   url-prefix=<url prefix>       (optional, multiple allowed)
+#   state=<initial-state-name>    (defaults to "initial")
+#   <action> <ref> -> <next-state>
+#     action: click | type | press
+#     ref:    @eN (with leading @e), or "*" to match any ref
+#     next:   the new state name; recording snapshot-<next-state>.out must exist
+#             unless next-state == initial-state name (then snapshot.out is used)
+#
+# State persists across CLI invocations in /work/.ab-state with two lines:
+#   page=<page-id>
+#   state=<state>
+#
+# If no recording matches, the CLI falls back to the legacy generic snapshot,
+# preserving Tier-0 behaviour.
+
+set -u
+# Default working root is /work (the agent's view inside the Docker
+# workspace). For local smoke tests outside Docker, set AB_WORK to a
+# different directory containing the same `bin/`, `references/agent-browser/`
+# layout and the fake CLI honours it transparently.
+WORK_ROOT="${AB_WORK:-/work}"
+LOG="$WORK_ROOT/ab-calls.log"
+STATE_FILE="$WORK_ROOT/.ab-state"
+RECORDINGS="$WORK_ROOT/references/agent-browser/recordings"
+
+mkdir -p "$(dirname "$LOG")" 2>/dev/null || true
+echo "$*" >> "$LOG"
+
+read_state() {
+  CURRENT_PAGE=""
+  CURRENT_STATE=""
+  if [[ -f "$STATE_FILE" ]]; then
+    while IFS='=' read -r k v; do
+      case "$k" in
+        page)  CURRENT_PAGE="$v" ;;
+        state) CURRENT_STATE="$v" ;;
+      esac
+    done < "$STATE_FILE"
+  fi
+}
+
+write_state() {
+  printf 'page=%s\nstate=%s\n' "$1" "$2" > "$STATE_FILE"
+}
+
+# Find a recording dir whose transitions.txt matches a URL.
+# Echoes the page-id (dir name) on success; empty otherwise.
+match_url_to_page() {
+  local url="$1"
+  [[ -d "$RECORDINGS" ]] || return 0
+  local dir page t k v line
+  for dir in "$RECORDINGS"/*/; do
+    [[ -d "$dir" ]] || continue
+    page="$(basename "$dir")"
+    t="$dir/transitions.txt"
+    [[ -f "$t" ]] || continue
+    while IFS= read -r line || [[ -n "$line" ]]; do
+      [[ "$line" =~ ^[[:space:]]*# ]] && continue
+      [[ -z "${line//[[:space:]]/}" ]] && continue
+      k="${line%%=*}"
+      v="${line#*=}"
+      case "$k" in
+        url)
+          if [[ "$url" == "$v" ]]; then echo "$page"; return 0; fi ;;
+        url-prefix)
+          if [[ "$url" == "$v"* ]]; then echo "$page"; return 0; fi ;;
+      esac
+    done < "$t"
+  done
+}
+
+# Read the initial-state name for a page (defaults to "initial").
+page_initial_state() {
+  local page="$1"
+  local t="$RECORDINGS/$page/transitions.txt"
+  [[ -f "$t" ]] || { echo "initial"; return 0; }
+  local k v line
+  while IFS= read -r line || [[ -n "$line" ]]; do
+    [[ "$line" =~ ^[[:space:]]*# ]] && continue
+    k="${line%%=*}"
+    v="${line#*=}"
+    if [[ "$k" == "state" ]]; then echo "$v"; return 0; fi
+  done < "$t"
+  echo "initial"
+}
+
+# Read the page-title field (used by some graders / canned outputs).
+page_title() {
+  local page="$1"
+  local t="$RECORDINGS/$page/transitions.txt"
+  [[ -f "$t" ]] || { echo ""; return 0; }
+  local k v line
+  while IFS= read -r line || [[ -n "$line" ]]; do
+    [[ "$line" =~ ^[[:space:]]*# ]] && continue
+    k="${line%%=*}"
+    v="${line#*=}"
+    if [[ "$k" == "page-title" ]]; then echo "$v"; return 0; fi
+  done < "$t"
+  echo ""
+}
+
+# Resolve the snapshot file for a (page, state) pair.
+# If state == initial-state, prefer snapshot.out; else snapshot-<state>.out.
+snapshot_file_for() {
+  local page="$1" state="$2"
+  local init
+  init="$(page_initial_state "$page")"
+  if [[ "$state" == "$init" || -z "$state" ]]; then
+    echo "$RECORDINGS/$page/snapshot.out"
+  else
+    echo "$RECORDINGS/$page/snapshot-$state.out"
+  fi
+}
+
+# Apply a transition (action ref) on the current page+state.
+# If a matching rule exists, update CURRENT_STATE and persist.
+apply_transition() {
+  local action="$1" ref="$2"
+  [[ -n "$CURRENT_PAGE" ]] || return 0
+  local t="$RECORDINGS/$CURRENT_PAGE/transitions.txt"
+  [[ -f "$t" ]] || return 0
+  local line a r arrow next
+  while IFS= read -r line || [[ -n "$line" ]]; do
+    [[ "$line" =~ ^[[:space:]]*# ]] && continue
+    [[ -z "${line//[[:space:]]/}" ]] && continue
+    # Skip key=value rows
+    if [[ "$line" == *"="* && "$line" != *"->"* ]]; then continue; fi
+    # Parse: <action> <ref> -> <next-state>
+    # `read` splits on whitespace without glob-expanding a "*" ref, which
+    # an unquoted `set -- $line` would do.
+    read -r a r arrow next _ <<< "$line"
+    [[ "$arrow" == "->" ]] || continue
+    [[ "$a" == "$action" ]] || continue
+    if [[ "$r" == "$ref" || "$r" == "*" ]]; then
+      CURRENT_STATE="$next"
+      write_state "$CURRENT_PAGE" "$CURRENT_STATE"
+      return 0
+    fi
+  done < "$t"
+}
+
+read_state
+
+case "${1:-}" in
+  skills)
+    case "${2:-}" in
+      get)
+        SKILL_NAME="${3:-core}"
+        if [[ "$SKILL_NAME" == "core" ]]; then
+          CORE_PATH="$WORK_ROOT/references/agent-browser/agent-browser-core.md"
+          if [[ -f "$CORE_PATH" ]]; then
+            cat "$CORE_PATH"
+          else
+            echo "# agent-browser Core Workflow"
+            echo "Navigate: agent-browser navigate <url>"
+            echo "Snapshot: agent-browser snapshot"
+            echo "Screenshot: agent-browser screenshot [path]"
+            echo "Click:     agent-browser click @eN"
+            echo "Type:      agent-browser type @eN \"text\""
+          fi
+        else
+          echo "# agent-browser $SKILL_NAME skill (mock)"
+          echo "Use agent-browser commands for $SKILL_NAME automation."
+        fi
+        ;;
+      list)
+        echo "Available skills: core, electron, slack, dogfood, vercel-sandbox, agentcore"
+        ;;
+      *)
+        echo "Usage: agent-browser skills get <name>"
+        ;;
+    esac
+    ;;
+
+  navigate|open)
+    URL="${2:-about:blank}"
+    PAGE="$(match_url_to_page "$URL")"
+    if [[ -n "$PAGE" ]]; then
+      INIT_STATE="$(page_initial_state "$PAGE")"
+      write_state "$PAGE" "$INIT_STATE"
+      echo "Navigated to $URL"
+      echo "Session: default (ready)"
+    else
+      # Unknown URL — clear page state and use legacy generic behaviour
+      : > "$STATE_FILE"
+      echo "Navigated to $URL"
+      echo "Session: default (ready)"
+    fi
+    ;;
+
+  snapshot)
+    if [[ -n "$CURRENT_PAGE" ]]; then
+      SNAP="$(snapshot_file_for "$CURRENT_PAGE" "$CURRENT_STATE")"
+      if [[ -f "$SNAP" ]]; then
+        cat "$SNAP"
+      else
+        # Recording missing for this state — emit a clear marker so graders
+        # can detect "agent reached an undefined state".
+        echo "RootWebArea \"(unknown state $CURRENT_STATE on $CURRENT_PAGE)\""
+        echo "  - paragraph @e1 \"No recording for this state.\""
+      fi
+    else
+      # Legacy fallback for Tier-0 cases that don't seed a recording.
+      echo 'RootWebArea "Example Domain"'
+      echo '  - heading @e1 "Example Domain" level=1'
+      echo '  - paragraph @e2 "This domain is for use in illustrative examples in documents."'
+      echo '  - link @e3 "More information..." href=https://www.iana.org/domains/reserved'
+    fi
+    ;;
+
+  screenshot)
+    OUTFILE="${2:-$WORK_ROOT/screenshot.png}"
+    # Minimal valid 1x1 PNG
+    printf '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00\x00\x0cIDATx\x9cc\xf8\x0f\x00\x00\x01\x01\x00\x05\x18\xd8N\x00\x00\x00\x00IEND\xaeB`\x82' > "$OUTFILE"
+    echo "Screenshot saved: $OUTFILE"
+    ;;
+
+  click)
+    REF="${2:-}"
+    if [[ -z "$REF" ]]; then
+      echo "agent-browser: click requires a ref (e.g., click @e3)" >&2
+      exit 2
+    fi
+    apply_transition click "$REF"
+    echo "Clicked $REF"
+    ;;
+
+  type)
+    REF="${2:-}"
+    shift 2 2>/dev/null || true
+    if [[ -z "$REF" ]]; then
+      echo "agent-browser: type requires a ref (e.g., type @e3 \"hello\")" >&2
+      exit 2
+    fi
+    apply_transition type "$REF"
+    echo "Typed \"$*\" into $REF"
+    ;;
+
+  press)
+    KEY="${2:-Enter}"
+    # press isn't ref-bound: pass a placeholder ref so rules keyed on '*' match
+    apply_transition press "@*" 2>/dev/null || true
+    echo "Pressed $KEY"
+    ;;
+
+  evaluate|eval)
+    # Canned: return empty result (graders don't depend on it)
+    echo "{\"result\": null}"
+    ;;
+
+  install)
+    echo "agent-browser: already installed"
+    ;;
+
+  --version|-v|version)
+    echo "agent-browser 0.9.0 (mock for eval)"
+    ;;
+
+  which|check)
+    echo "$WORK_ROOT/bin/agent-browser"
+    ;;
+
+  ""|help|--help|-h)
+    echo "Usage: agent-browser <command> [args]"
+    echo "Commands: skills, navigate, snapshot, screenshot, click, type, press, evaluate"
+    ;;
+
+  *)
+    echo "agent-browser: unknown command '${1:-}'" >&2
+    echo "Run: agent-browser skills get core" >&2
+    exit 1
+    ;;
+esac
diff --git a/examples/workbench/agent-browser/checks/_ab-utils.mjs b/examples/workbench/agent-browser/checks/_ab-utils.mjs
new file mode 100644
index 0000000..06d4448
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/_ab-utils.mjs
@@ -0,0 +1,164 @@
+// Shared utilities for agent-browser graders.
+//
+// Parses /work/ab-calls.log into structured AbCall records and provides
+// helper queries that graders share: snapshot-first discipline,
+// CSS-selector misuse detection, ref usage, state-machine path matching,
+// and emit() for the JSON result envelope the workbench runner expects.
+//
+// Each line of ab-calls.log is the literal `$*` from the fake CLI,
+// e.g.:
+//   navigate https://example.com
+//   snapshot
+//   type @e7 Hypertext Transfer Protocol
+//   click @e8
+//   screenshot /work/result.png
+//
+// `type` args appear unquoted (the shell collapses quotes); the parser
+// treats everything after the ref as the typed text.
+
+import { existsSync, readFileSync } from 'node:fs';
+
+/**
+ * Parse ab-calls.log into a list of structured calls.
+ * @returns {Array<{raw:string, action:string, ref:string|null, arg:string|null, args:string[]}>}
+ */
+export function parseAbLog(path) {
+  if (!existsSync(path)) return [];
+  const text = readFileSync(path, 'utf-8');
+  const out = [];
+  for (const raw of text.split(/\r?\n/)) {
+    const line = raw.trim();
+    if (!line) continue;
+    const parts = line.split(/\s+/);
+    const action = parts[0] ?? '';
+    let ref = null;
+    let arg = null;
+    let args = parts.slice(1);
+
+    if (action === 'click' || action === 'type') {
+      ref = parts[1] ?? null;
+      if (action === 'type') {
+        // Anything after the ref is the typed text (quotes lost by the shell).
+        arg = parts.slice(2).join(' ') || null;
+      }
+      args = parts.slice(1);
+    } else if (action === 'navigate' || action === 'open' || action === 'screenshot') {
+      arg = parts[1] ?? null;
+    } else if (action === 'skills') {
+      // skills get core / skills list / etc.
+      arg = parts.slice(1).join(' ') || null;
+    }
+
+    out.push({ raw: line, action, ref, arg, args });
+  }
+  return out;
+}
+
+/** Extract bash commands from trace.jsonl (used to detect curl fallback / CSS use). */
+export function bashCommandsFromTrace(resultsDir) {
+  const tracePath = `${resultsDir}/trace.jsonl`;
+  if (!existsSync(tracePath)) return [];
+  const cmds = [];
+  for (const ln of readFileSync(tracePath, 'utf-8').split(/\r?\n/)) {
+    if (!ln) continue;
+    try {
+      const entry = JSON.parse(ln);
+      if (entry.type === 'tool_call' && entry.name === 'bash') {
+        cmds.push((entry.arguments ?? {}).command ?? '');
+      }
+    } catch {
+      /* skip */
+    }
+  }
+  return cmds;
+}
+
+/** True if at least one snapshot call appears before the first click/type call. */
+export function snapshotFirst(calls) {
+  const firstInteractIdx = calls.findIndex(
+    (c) => c.action === 'click' || c.action === 'type'
+  );
+  if (firstInteractIdx === -1) return true; // no interaction => trivially OK
+  return calls.slice(0, firstInteractIdx).some((c) => c.action === 'snapshot');
+}
+
+/** Detect refs that look like CSS selectors / XPath / jQuery rather than @eN. */
+const CSS_HINT = /^[#.]|^\/\/|^\[|^[a-z][a-z0-9-]*[#.[]/i;
+export function findCssLikeRefs(calls) {
+  const bad = [];
+  for (const c of calls) {
+    if (c.action !== 'click' && c.action !== 'type') continue;
+    const ref = c.ref ?? '';
+    if (!ref) continue;
+    if (/^@e?\d+$/i.test(ref)) continue; // legitimate ref (@e3 or @3)
+    if (CSS_HINT.test(ref) || ref.includes('"') || ref.includes("'")) {
+      bad.push(c);
+    }
+  }
+  return bad;
+}
+
+/**
+ * Did the agent call `action` on `ref` at any point?
+ * @param {Array} calls
+ * @param {'click'|'type'} action
+ * @param {string} ref e.g. '@e7'
+ */
+export function calledOn(calls, action, ref) {
+  return calls.some((c) => c.action === action && c.ref === ref);
+}
+
+/**
+ * Did the agent perform the ordered sequence of (action, ref) steps,
+ * with optional snapshots interleaved?
+ * @returns {{ok:boolean, missingAtStep:number|null}}
+ */
+export function matchesPath(calls, expectedSteps) {
+  let i = 0;
+  for (const c of calls) {
+    if (i >= expectedSteps.length) break;
+    const exp = expectedSteps[i];
+    if (c.action === exp.action && (!exp.ref || c.ref === exp.ref)) {
+      i += 1;
+    }
+  }
+  return { ok: i === expectedSteps.length, missingAtStep: i === expectedSteps.length ? null : i };
+}
+
+/** Was a snapshot taken AFTER a given index in `calls`? */
+export function snapshotAfter(calls, idx) {
+  return calls.slice(idx + 1).some((c) => c.action === 'snapshot');
+}
+
+/** Find typed text the agent sent into a particular ref, if any. */
+export function typedInto(calls, ref) {
+  const c = calls.find((x) => x.action === 'type' && x.ref === ref);
+  return c ? c.arg : null;
+}
+
+/** Was a curl/wget fallback used (via bash trace)? */
+export function usedHttpFallback(bashCmds) {
+  return bashCmds.some((cmd) => /\b(curl|wget)\b[^|;&\n]*\bhttps?:\/\//.test(cmd));
+}
+
+/**
+ * Standard pass/fail emit. `passed` and `failed` are arrays of evidence strings.
+ * Always exits the process with the right code.
+ */
+export function emit({ passed, failed }) {
+  const total = passed.length + failed.length;
+  const score = total === 0 ? 0 : passed.length / total;
+  const pass = failed.length === 0;
+  console.log(
+    JSON.stringify({
+      pass,
+      score,
+      evidence: [
+        `${passed.length}/${total} behavioral checks passed`,
+        ...passed.map((p) => `+ ${p}`),
+        ...failed.map((f) => `- ${f}`),
+      ],
+    })
+  );
+  process.exit(pass ? 0 : 1);
+}
diff --git a/examples/workbench/agent-browser/checks/_grader-utils.mjs b/examples/workbench/agent-browser/checks/_grader-utils.mjs
new file mode 100644
index 0000000..a9d0c24
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/_grader-utils.mjs
@@ -0,0 +1,94 @@
+// Shared grader logic for web-design-guidelines eval cases.
+//
+// Each finding is assumed to be one line in findings.txt that references
+// "<File>.tsx:<line>" (line numbers come from the agent — they're often
+// off by ±1-2 due to LLM line-counting). A violation is considered "found"
+// when at least one finding line:
+//   (a) references a line number within the violation's accepted range, AND
+//   (b) contains at least one of the violation's distinguishing keywords.
+//
+// This per-finding-line check prevents spurious cross-matches (e.g. the
+// keyword "label" from a different finding being credited to a paste rule).
+
+import { existsSync, readFileSync } from 'node:fs';
+
+export function gradeFindings({ findingsPath, file, expected }) {
+  const failures = [];
+  const found = new Set();
+
+  if (!existsSync(findingsPath)) {
+    failures.push('findings.txt was not created');
+    return emitResult({ found, expected, failures });
+  }
+
+  const text = readFileSync(findingsPath, 'utf-8');
+  const refRe = new RegExp(`${escapeRe(file)}\\s*[:#]\\s*(\\d+)`, 'i');
+  const findingLines = text.split(/\r?\n/).filter((ln) => refRe.test(ln));
+
+  for (const v of expected) {
+    for (const line of findingLines) {
+      const m = line.match(refRe);
+      if (!m) continue;
+      const lineNum = Number(m[1]);
+      if (!v.lines.includes(lineNum)) continue;
+      if (!v.keywords.some((re) => re.test(line))) continue;
+      found.add(v.id);
+      break;
+    }
+  }
+
+  return emitResult({ found, expected, failures });
+}
+
+function emitResult({ found, expected, failures }) {
+  const missing = expected.filter((v) => !found.has(v.id)).map((v) => v.id);
+  const score = found.size / expected.length;
+  const pass = found.size === expected.length;
+
+  console.log(JSON.stringify({
+    pass,
+    score,
+    evidence: [
+      `${found.size}/${expected.length} expected violations identified`,
+      ...[...found].map((id) => `+ ${id}`),
+      ...missing.map((id) => `- missing: ${id}`),
+      ...failures,
+    ],
+  }));
+  return pass;
+}
+
+function escapeRe(s) {
+  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
+}
+
+// Helper: build an inclusive line range [start, start+1, ..., end].
+export function range(start, end) {
+  const out = [];
+  for (let i = start; i <= end; i++) out.push(i);
+  return out;
+}
+
+// Helper: centered loose range — accepts the violation line ± tolerance.
+// Default tolerance ±8 handles LLM line-counting drift on multi-line elements.
+// PREFER this over `range(N-3, N+3)` — see lessons.md § G1.
+export function looseRange(centerLine, tolerance = 8) {
+  return range(centerLine - tolerance, centerLine + tolerance);
+}
+
+// Helper: hyphen-tolerant keyword regex — `fuzzyKeyword('empty state')`
+// matches "empty state", "empty-state", and "emptystate".
+// PREFER this over hand-writing `/empty[-\s]+state/` — see lessons.md § G2.
+export function fuzzyKeyword(phrase) {
+  const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
+  const flexible = escaped.replace(/\s+/g, '[-\\s]*');
+  return new RegExp(flexible, 'i');
+}
+
+// Helper: prefix-tolerant keyword — `tolerantKeyword('cover')` matches
+// "cover", "covering", "covered", "does not cover".
+// PREFER this over `/covering/i` — see lessons.md § G4.
+export function tolerantKeyword(stem) {
+  const escaped = stem.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
+  return new RegExp(`\\b${escaped}\\w*`, 'i');
+}
diff --git a/examples/workbench/agent-browser/checks/grade-multi-step-state-findings.mjs b/examples/workbench/agent-browser/checks/grade-multi-step-state-findings.mjs
new file mode 100644
index 0000000..1b62f1c
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-multi-step-state-findings.mjs
@@ -0,0 +1,111 @@
+// Grader for the multi-step-state case.
+//
+// Recording: news.acme.example.com/subscribe — two text fields and a
+// Continue button. The state machine:
+//   initial            -> type @e5 (name)  -> name-entered
+//   name-entered       -> type @e6 (email) -> email-entered
+//   email-entered      -> click @e7        -> submitted
+// The post-submit page reveals confirmation code "NL-7QF3-2026" and
+// the email "ada@example.com" in a paragraph @e51 / @e52. The task asks
+// the agent to subscribe with name "Ada Lovelace" and email
+// "ada@example.com" and write the confirmation code to /work/confirm.txt.
+//
+// Checks:
+//   V1 — agent-browser invoked at all
+//   V2 — snapshot-first discipline
+//   V3 — full state-machine path traversed in correct order:
+//        type @e5 -> type @e6 -> click @e7
+//   V4 — typed value into @e6 contains a plausible email (matches the
+//        recorded "ada@example.com" or at least an email-shaped token)
+//   V5 — re-snapshot after the final click @e7 (so the agent could read
+//        the confirmation page)
+//   V6 — confirm.txt exists and contains "NL-7QF3-2026"
+//   V7 — no CSS-selector-style refs
+
+import { existsSync, readFileSync } from 'node:fs';
+import {
+  parseAbLog,
+  snapshotFirst,
+  findCssLikeRefs,
+  matchesPath,
+  snapshotAfter,
+  typedInto,
+  emit,
+} from './_ab-utils.mjs';
+
+const WORK = process.env.WORK;
+
+const passed = [];
+const failed = [];
+
+const calls = parseAbLog(`${WORK}/ab-calls.log`);
+
+if (calls.length === 0) {
+  failed.push('V1: agent-browser was never called');
+  emit({ passed, failed });
+}
+passed.push(`V1: agent-browser was invoked (${calls.length} calls)`);
+
+if (snapshotFirst(calls)) {
+  passed.push('V2: snapshot was called before the first click/type');
+} else {
+  failed.push('V2: agent issued click/type WITHOUT a prior snapshot');
+}
+
+const expectedPath = [
+  { action: 'type', ref: '@e5' },
+  { action: 'type', ref: '@e6' },
+  { action: 'click', ref: '@e7' },
+];
+const path = matchesPath(calls, expectedPath);
+if (path.ok) {
+  passed.push('V3: state-machine path traversed: type @e5 -> type @e6 -> click @e7');
+} else {
+  const stepNames = ['type @e5', 'type @e6', 'click @e7'];
+  failed.push(
+    `V3: state-machine path broken — first missing step: ${stepNames[path.missingAtStep]}`
+  );
+}
+
+const emailValue = typedInto(calls, '@e6');
+if (emailValue && /[\w.+-]+@[\w-]+\.[\w.-]+/.test(emailValue)) {
+  if (/ada@example\.com/i.test(emailValue)) {
+    passed.push(`V4: typed expected email "${emailValue}" into @e6`);
+  } else {
+    passed.push(`V4: typed an email-shaped value into @e6 ("${emailValue}")`);
+  }
+} else if (emailValue) {
+  failed.push(`V4: value typed into @e6 ("${emailValue}") is not an email`);
+} else {
+  failed.push('V4: no value typed into the email field @e6');
+}
+
+const submitIdx = calls.findIndex((c) => c.action === 'click' && c.ref === '@e7');
+if (submitIdx >= 0 && snapshotAfter(calls, submitIdx)) {
+  passed.push('V5: snapshot was re-taken after click @e7 (confirmation page read)');
+} else if (submitIdx >= 0) {
+  failed.push('V5: agent submitted but did NOT re-snapshot the confirmation page');
+}
+
+const confirmPath = `${WORK}/confirm.txt`;
+if (!existsSync(confirmPath)) {
+  failed.push('V6: confirm.txt was not created');
+} else {
+  const got = readFileSync(confirmPath, 'utf-8').trim();
+  if (/NL-7QF3-2026/.test(got)) {
+    passed.push(`V6: confirm.txt contains the recorded confirmation code`);
+  } else if (got.length === 0) {
+    failed.push('V6: confirm.txt exists but is empty');
+  } else {
+    failed.push(`V6: confirm.txt does not contain "NL-7QF3-2026". Got: "${got.slice(0, 80)}"`);
+  }
+}
+
+const cssLike = findCssLikeRefs(calls);
+if (cssLike.length === 0) {
+  passed.push('V7: no CSS-selector-style refs in click/type');
+} else {
+  failed.push(`V7: agent used non-@eN refs: ${cssLike.map((c) => c.raw).join(' ; ')}`);
+}
+
+emit({ passed, failed });
diff --git a/examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs b/examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs
new file mode 100644
index 0000000..37dd209
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs
@@ -0,0 +1,99 @@
+// Grader for the navigate-and-report case.
+//
+// Expected agent behavior (tool-use violations seeded):
+//   V1: agent-browser was invoked at all (ab-calls.log exists)
+//   V2: agent-browser skills get core was called before navigating
+//   V3: agent-browser navigate was used (not curl/wget)
+//   V4: agent-browser snapshot was called to inspect the page
+//   V5: heading.txt was created with non-empty content
+
+import { existsSync, readFileSync } from 'node:fs';
+
+const WORK = process.env.WORK;
+const RESULTS = process.env.RESULTS;
+
+const passed = [];
+const failed = [];
+
+// V1 — agent-browser was invoked
+const abLogPath = `${WORK}/ab-calls.log`;
+const abLog = existsSync(abLogPath) ? readFileSync(abLogPath, 'utf-8') : null;
+
+if (abLog !== null) {
+  passed.push('V1: agent-browser was invoked (ab-calls.log exists)');
+} else {
+  failed.push('V1: agent-browser was never called — ab-calls.log not found');
+}
+
+// V2 — core skill was loaded (via CLI or via cat of the vendored file)
+const tracePath = `${RESULTS}/trace.jsonl`;
+const traceLines = existsSync(tracePath)
+  ? readFileSync(tracePath, 'utf-8').split(/\r?\n/).filter(Boolean)
+  : [];
+const bashCmds = traceLines.flatMap((ln) => {
+  try {
+    const entry = JSON.parse(ln);
+    if (entry.type === 'tool_call' && entry.name === 'bash') {
+      return [(entry.arguments ?? {}).command ?? ''];
+    }
+  } catch { /* skip */ }
+  return [];
+});
+const cliSkillLoad = abLog && /skills\s+(get\s+)?core|skills\s+get/.test(abLog);
+const catSkillLoad = bashCmds.some((cmd) => /cat\b.*agent-browser-core|agent-browser-core\.md/.test(cmd));
+if (cliSkillLoad || catSkillLoad) {
+  passed.push('V2: core skill was loaded before navigating');
+} else {
+  failed.push('V2: core skill was NOT loaded (run `agent-browser skills get core` or cat agent-browser-core.md first)');
+}
+
+// V3 — navigate was called (not curl)
+if (abLog && /^navigate\b/m.test(abLog)) {
+  passed.push('V3: agent-browser navigate was used for page navigation');
+} else {
+  failed.push('V3: agent-browser navigate was NOT called');
+}
+
+// Also check for curl fallback
+const hasCurlFallback = bashCmds.some(
+  (cmd) => /\b(curl|wget)\b[^|;&\n]*\bhttps?:\/\//.test(cmd)
+);
+if (hasCurlFallback) {
+  failed.push('V3-extra: Agent used curl/wget for HTTP instead of agent-browser');
+}
+
+// V4 — snapshot was called
+if (abLog && /^snapshot\b/m.test(abLog)) {
+  passed.push('V4: agent-browser snapshot was called to inspect the page');
+} else {
+  failed.push('V4: agent-browser snapshot was NOT called (should snapshot before reading page content)');
+}
+
+// V5 — heading.txt exists with content
+const headingPath = `${WORK}/heading.txt`;
+if (existsSync(headingPath)) {
+  const content = readFileSync(headingPath, 'utf-8').trim();
+  if (content.length > 0) {
+    passed.push(`V5: heading.txt created with content: "${content.slice(0, 80)}"`);
+  } else {
+    failed.push('V5: heading.txt exists but is empty');
+  }
+} else {
+  failed.push('V5: heading.txt was not created');
+}
+
+const total = passed.length + failed.length;
+const score = passed.length / total;
+const pass = failed.length === 0;
+
+console.log(JSON.stringify({
+  pass,
+  score,
+  evidence: [
+    `${passed.length}/${total} behavioral checks passed`,
+    ...passed.map((p) => `+ ${p}`),
+    ...failed.map((f) => `- ${f}`),
+  ],
+}));
+
+process.exit(pass ? 0 : 1);
diff --git a/examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs b/examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs
new file mode 100644
index 0000000..a8eaa77
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs
@@ -0,0 +1,82 @@
+// Grader for the output-correctness case.
+//
+// The recorded blog page contains three plausible "title-ish" strings:
+//   - "FROM THE PLATFORM TEAM"               — kicker / tagline
+//   - "Why We Migrated Our Build System to Bazel"  — the actual <h1> level=1
+//   - "By Jordan Lee — April 18, 2026 — 12 min read"  — byline
+// The task asks for the article title, which is unambiguously the
+// level-1 heading. Models that grab the kicker, the byline, or the page
+// <title> tag (similar to the heading but not identical) lose points.
+//
+// Checks:
+//   V1 — agent-browser invoked at all
+//   V2 — snapshot was called (cannot extract a title without one)
+//   V3 — title.txt exists with the EXACT expected article title (whitespace-tolerant)
+//   V4 — title.txt does NOT include the kicker "FROM THE PLATFORM TEAM"
+//   V5 — title.txt does NOT include the byline ("By Jordan Lee")
+//   V6 — no CSS-selector-style refs
+
+import { existsSync, readFileSync } from 'node:fs';
+import {
+  parseAbLog,
+  findCssLikeRefs,
+  emit,
+} from './_ab-utils.mjs';
+
+const WORK = process.env.WORK;
+
+const passed = [];
+const failed = [];
+
+const calls = parseAbLog(`${WORK}/ab-calls.log`);
+
+if (calls.length === 0) {
+  failed.push('V1: agent-browser was never called');
+  emit({ passed, failed });
+}
+passed.push(`V1: agent-browser was invoked (${calls.length} calls)`);
+
+if (calls.some((c) => c.action === 'snapshot')) {
+  passed.push('V2: snapshot was called to read the page content');
+} else {
+  failed.push('V2: snapshot was never called — agent could not have read the article title');
+}
+
+const expected = 'Why We Migrated Our Build System to Bazel';
+const outPath = `${WORK}/title.txt`;
+
+if (!existsSync(outPath)) {
+  failed.push('V3: title.txt was not created');
+} else {
+  const got = readFileSync(outPath, 'utf-8').trim();
+  // Whitespace- and case-insensitive comparison
+  const norm = (s) => s.replace(/\s+/g, ' ').trim().toLowerCase();
+  if (norm(got) === norm(expected)) {
+    passed.push(`V3: title.txt matches the article title exactly`);
+  } else if (norm(got).includes(norm(expected))) {
+    passed.push(`V3: title.txt contains the article title (with extra surrounding text)`);
+  } else {
+    failed.push(`V3: title.txt does NOT match expected title. Expected: "${expected}". Got: "${got.slice(0, 120)}"`);
+  }
+
+  if (/from the platform team/i.test(got)) {
+    failed.push('V4: title.txt includes the kicker "FROM THE PLATFORM TEAM" (not the title)');
+  } else {
+    passed.push('V4: title.txt does not include the kicker tagline');
+  }
+
+  if (/\bby jordan lee\b/i.test(got)) {
+    failed.push('V5: title.txt includes the byline "By Jordan Lee" (not the title)');
+  } else {
+    passed.push('V5: title.txt does not include the byline');
+  }
+}
+
+const cssLike = findCssLikeRefs(calls);
+if (cssLike.length === 0) {
+  passed.push('V6: no CSS-selector-style refs in click/type');
+} else {
+  failed.push(`V6: agent used non-@eN refs: ${cssLike.map((c) => c.raw).join(' ; ')}`);
+}
+
+emit({ passed, failed });
diff --git a/examples/workbench/agent-browser/checks/grade-ref-based-search-findings.mjs b/examples/workbench/agent-browser/checks/grade-ref-based-search-findings.mjs
new file mode 100644
index 0000000..596d7c5
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-ref-based-search-findings.mjs
@@ -0,0 +1,122 @@
+// Grader for the ref-based-search case.
+//
+// Task asked the agent to search Wikipedia for a query and write the title
+// of the top result to /work/top-result.txt. The recordings define:
+//   - searchbox is @e7 ("Search Wikipedia")
+//   - submit button is @e8 ("Search")
+//   - after click @e8, the results page exposes a heading @e30 with
+//     "Hypertext Transfer Protocol" as the top result
+//
+// Expected agent behavior:
+//   V1 — agent-browser was invoked at all
+//   V2 — snapshot was called BEFORE any click/type (snapshot-first discipline)
+//   V3 — type was issued against ref @e7 (the searchbox), not a CSS selector
+//   V4 — click was issued against ref @e8 (the submit button), not @e7 or anything else
+//   V5 — a second snapshot was taken AFTER click @e8 (re-snapshot after navigation)
+//   V6 — top-result.txt exists and contains the actual top-result title
+//        ("Hypertext Transfer Protocol", case-insensitive substring)
+//   V7 — no CSS-selector-style refs anywhere in click/type calls
+
+import { existsSync, readFileSync } from 'node:fs';
+import {
+  parseAbLog,
+  bashCommandsFromTrace,
+  snapshotFirst,
+  findCssLikeRefs,
+  calledOn,
+  snapshotAfter,
+  usedHttpFallback,
+  emit,
+} from './_ab-utils.mjs';
+
+const WORK = process.env.WORK;
+const RESULTS = process.env.RESULTS;
+
+const passed = [];
+const failed = [];
+
+const calls = parseAbLog(`${WORK}/ab-calls.log`);
+
+// V1 — invoked at all
+if (calls.length > 0) {
+  passed.push(`V1: agent-browser was invoked (${calls.length} calls)`);
+} else {
+  failed.push('V1: agent-browser was never called — ab-calls.log empty or missing');
+  emit({ passed, failed });
+}
+
+// V2 — snapshot-first discipline
+if (snapshotFirst(calls)) {
+  passed.push('V2: snapshot was called before the first click/type');
+} else {
+  failed.push('V2: agent issued click/type WITHOUT a prior snapshot — refs are guesses');
+}
+
+// V3 — typed into the searchbox ref @e7
+if (calledOn(calls, 'type', '@e7')) {
+  passed.push('V3: type @e7 was used for the searchbox (correct ref)');
+} else {
+  const wrongType = calls.find((c) => c.action === 'type');
+  if (wrongType) {
+    failed.push(`V3: type used wrong ref "${wrongType.ref}" (expected @e7 for searchbox)`);
+  } else {
+    failed.push('V3: type was never called — agent did not enter a search query');
+  }
+}
+
+// V4 — clicked the submit button ref @e8
+if (calledOn(calls, 'click', '@e8')) {
+  passed.push('V4: click @e8 was used to submit the search (correct ref)');
+} else {
+  const wrongClick = calls.find((c) => c.action === 'click');
+  if (wrongClick) {
+    failed.push(`V4: click used wrong ref "${wrongClick.ref}" (expected @e8 for submit)`);
+  } else {
+    failed.push('V4: click was never called — agent did not submit the search');
+  }
+}
+
+// V5 — re-snapshot after submit
+const submitIdx = calls.findIndex((c) => c.action === 'click' && c.ref === '@e8');
+if (submitIdx >= 0 && snapshotAfter(calls, submitIdx)) {
+  passed.push('V5: snapshot was re-taken after click @e8 (results page inspected)');
+} else if (submitIdx >= 0) {
+  failed.push('V5: agent clicked submit but did NOT re-snapshot the results page');
+} else {
+  // V4 already failed; don't double-count
+}
+
+// V6 — output file contains real top-result title
+const outPath = `${WORK}/top-result.txt`;
+if (!existsSync(outPath)) {
+  failed.push('V6: top-result.txt was not created');
+} else {
+  const got = readFileSync(outPath, 'utf-8').trim();
+  if (got.length === 0) {
+    failed.push('V6: top-result.txt exists but is empty');
+  } else if (/hypertext transfer protocol/i.test(got)) {
+    passed.push(`V6: top-result.txt matches the recorded top result ("${got.slice(0, 60)}")`);
+  } else {
+    failed.push(
+      `V6: top-result.txt does not contain the actual top result. Got: "${got.slice(0, 80)}"`
+    );
+  }
+}
+
+// V7 — no CSS-style refs
+const cssLike = findCssLikeRefs(calls);
+if (cssLike.length === 0) {
+  passed.push('V7: no CSS-selector-style refs in click/type');
+} else {
+  failed.push(
+    `V7: agent used non-@eN refs (looks like CSS): ${cssLike.map((c) => c.raw).join(' ; ')}`
+  );
+}
+
+// V8 — bonus negative check (fail-only, not listed above): catch curl/wget fallback
+const bashCmds = bashCommandsFromTrace(RESULTS);
+if (usedHttpFallback(bashCmds)) {
+  failed.push('V8: agent used curl/wget for HTTP instead of agent-browser');
+}
+
+emit({ passed, failed });
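The graders above lean on helpers imported from `_ab-utils.mjs`, which is not shown in this part of the diff. As a reading aid, here is a minimal sketch of how those helpers might behave, inferred only from their call sites and from the log lines used in the smoke scenarios (`type @e7 ...`, `click @e8`, `snapshot`). Everything here is an assumption: `parseAbLogText` is a hypothetical name that takes log text rather than a file path, and the real implementations may differ.

```javascript
// Hypothetical stand-ins for the _ab-utils.mjs helpers, inferred from call sites.
// Assumed log format: one call per line, "<action> [<ref>] [<rest>]".
function parseAbLogText(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean)
    .map((raw) => {
      const [action, ref = null] = raw.split(/\s+/);
      return { action, ref, raw };
    });
}

const isInteraction = (c) => c.action === 'click' || c.action === 'type';

// True when a snapshot precedes the first click/type.
// Assumption: a log with no click/type at all counts as compliant.
function snapshotFirst(calls) {
  const firstInteraction = calls.findIndex(isInteraction);
  if (firstInteraction === -1) return true;
  return calls.slice(0, firstInteraction).some((c) => c.action === 'snapshot');
}

// True when `action` was issued against exactly this ref at least once.
function calledOn(calls, action, ref) {
  return calls.some((c) => c.action === action && c.ref === ref);
}

// True when any snapshot appears after the call at index `idx`.
function snapshotAfter(calls, idx) {
  return calls.slice(idx + 1).some((c) => c.action === 'snapshot');
}

// click/type calls whose target is not an @eN ref (for example a CSS selector).
function findCssLikeRefs(calls) {
  return calls.filter((c) => isInteraction(c) && !/^@e\d+$/.test(c.ref ?? ''));
}
```

Under these assumptions, the GOOD ref-based-search log parses into five calls with no CSS-like refs, while the CSS-selector scenario yields two offending calls whose `raw` text is what V7 prints.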
diff --git a/examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs b/examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs
new file mode 100644
index 0000000..37cba0b
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs
@@ -0,0 +1,99 @@
+// Grader for the ref-disambiguation case.
+//
+// The Acme welcome page exposes two visually-similar primary buttons:
+//   @e5 "Sign In"   — the correct target for an existing-user log-in flow
+//   @e6 "Sign Up"   — the wrong button (registration)
+// The task asks the agent to LOG IN, then write the heading of the
+// resulting page to /work/next-heading.txt. The recordings advance to
+// either "Sign in to your account" or "Create your account" depending on
+// which button was clicked.
+//
+// Expected agent behavior:
+//   V1 — agent-browser invoked
+//   V2 — snapshot called before any click
+//   V3 — clicked @e5 (Sign In), NOT @e6 (Sign Up)
+//   V4 — clicked @e5 exactly once (no retry loop); clicking both buttons fails V3
+//   V5 — next-heading.txt contains "Sign in to your account"
+//        and does NOT contain "Create your account"
+//   V6 — no CSS-selector-style refs
+
+import { existsSync, readFileSync } from 'node:fs';
+import {
+  parseAbLog,
+  snapshotFirst,
+  findCssLikeRefs,
+  calledOn,
+  emit,
+} from './_ab-utils.mjs';
+
+const WORK = process.env.WORK;
+
+const passed = [];
+const failed = [];
+
+const calls = parseAbLog(`${WORK}/ab-calls.log`);
+
+if (calls.length === 0) {
+  failed.push('V1: agent-browser was never called');
+  emit({ passed, failed });
+}
+passed.push(`V1: agent-browser was invoked (${calls.length} calls)`);
+
+if (snapshotFirst(calls)) {
+  passed.push('V2: snapshot was called before the first click');
+} else {
+  failed.push('V2: agent clicked WITHOUT a prior snapshot');
+}
+
+const clickedSignIn = calledOn(calls, 'click', '@e5');
+const clickedSignUp = calledOn(calls, 'click', '@e6');
+
+if (clickedSignIn && !clickedSignUp) {
+  passed.push('V3: clicked @e5 ("Sign In") — correct disambiguation');
+} else if (clickedSignUp && !clickedSignIn) {
+  failed.push('V3: clicked @e6 ("Sign Up") instead of @e5 ("Sign In")');
+} else if (clickedSignIn && clickedSignUp) {
+  failed.push('V3: clicked BOTH @e5 and @e6 — should disambiguate from snapshot, not retry');
+} else {
+  const anyClick = calls.find((c) => c.action === 'click');
+  failed.push(
+    anyClick
+      ? `V3: clicked unrelated ref "${anyClick.ref}" (expected @e5 for "Sign In")`
+      : 'V3: never issued a click'
+  );
+}
+
+// V4 — single decisive click on the right button
+const signInClicks = calls.filter((c) => c.action === 'click' && c.ref === '@e5').length;
+if (signInClicks === 1) {
+  passed.push('V4: exactly one click on the Sign In button (decisive)');
+} else if (signInClicks > 1) {
+  failed.push(`V4: clicked Sign In ${signInClicks} times — likely retry loop`);
+}
+// signInClicks === 0 already covered by V3; do not double-count
+
+// V5 — output file
+const outPath = `${WORK}/next-heading.txt`;
+if (!existsSync(outPath)) {
+  failed.push('V5: next-heading.txt was not created');
+} else {
+  const got = readFileSync(outPath, 'utf-8').trim();
+  if (/sign in to your account/i.test(got) && !/create your account/i.test(got)) {
+    passed.push(`V5: next-heading.txt matches Sign In flow ("${got.slice(0, 60)}")`);
+  } else if (/create your account/i.test(got)) {
+    failed.push(`V5: next-heading.txt is the Sign Up heading: "${got.slice(0, 80)}"`);
+  } else if (got.length === 0) {
+    failed.push('V5: next-heading.txt exists but is empty');
+  } else {
+    failed.push(`V5: next-heading.txt does not match expected heading. Got: "${got.slice(0, 80)}"`);
+  }
+}
+
+const cssLike = findCssLikeRefs(calls);
+if (cssLike.length === 0) {
+  passed.push('V6: no CSS-selector-style refs in click/type');
+} else {
+  failed.push(`V6: agent used non-@eN refs: ${cssLike.map((c) => c.raw).join(' ; ')}`);
+}
+
+emit({ passed, failed });
diff --git a/examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs b/examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs
new file mode 100644
index 0000000..f0004ac
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs
@@ -0,0 +1,112 @@
+// Grader for the screenshot-capture case.
+//
+// Expected agent behavior (tool-use violations seeded):
+//   V1: agent-browser was invoked at all (ab-calls.log exists)
+//   V2: core skill was loaded (agent-browser skills get core, or cat of the vendored file)
+//   V3: agent-browser navigate was used (not curl/wget)
+//   V4: agent-browser screenshot was called
+//   V5: screenshot.png and title.txt were both created
+
+import { existsSync, readFileSync, statSync } from 'node:fs';
+
+const WORK = process.env.WORK;
+const RESULTS = process.env.RESULTS;
+
+const passed = [];
+const failed = [];
+
+// V1 — agent-browser was invoked
+const abLogPath = `${WORK}/ab-calls.log`;
+const abLog = existsSync(abLogPath) ? readFileSync(abLogPath, 'utf-8') : null;
+
+if (abLog !== null) {
+  passed.push('V1: agent-browser was invoked (ab-calls.log exists)');
+} else {
+  failed.push('V1: agent-browser was never called — ab-calls.log not found');
+}
+
+// V2 — core skill was loaded (via CLI or via cat of the vendored file)
+const tracePath = `${RESULTS}/trace.jsonl`;
+const traceLines = existsSync(tracePath)
+  ? readFileSync(tracePath, 'utf-8').split(/\r?\n/).filter(Boolean)
+  : [];
+const bashCmds = traceLines.flatMap((ln) => {
+  try {
+    const entry = JSON.parse(ln);
+    if (entry.type === 'tool_call' && entry.name === 'bash') {
+      return [(entry.arguments ?? {}).command ?? ''];
+    }
+  } catch { /* skip */ }
+  return [];
+});
+const cliSkillLoad = abLog && /skills\s+(get\s+)?core\b/.test(abLog);
+const catSkillLoad = bashCmds.some((cmd) => /cat\b.*agent-browser-core|agent-browser-core\.md/.test(cmd));
+if (cliSkillLoad || catSkillLoad) {
+  passed.push('V2: core skill was loaded before starting task');
+} else {
+  failed.push('V2: core skill was NOT loaded (run `agent-browser skills get core` or cat agent-browser-core.md first)');
+}
+
+// V3 — navigate was called (not curl)
+if (abLog && /^navigate\b/m.test(abLog)) {
+  passed.push('V3: agent-browser navigate was used for page navigation');
+} else {
+  failed.push('V3: agent-browser navigate was NOT called');
+}
+
+// Also check for curl fallback
+const hasCurlFallback = bashCmds.some(
+  (cmd) => /curl\s+https?:\/\/|wget\s+https?:\/\//.test(cmd)
+);
+if (hasCurlFallback) {
+  failed.push('V3-extra: Agent used curl/wget for HTTP instead of agent-browser');
+}
+
+// V4 — screenshot was called
+if (abLog && /^screenshot\b/m.test(abLog)) {
+  passed.push('V4: agent-browser screenshot was called');
+} else {
+  failed.push('V4: agent-browser screenshot was NOT called');
+}
+
+// V5 — screenshot.png exists (non-empty file)
+const screenshotPath = `${WORK}/screenshot.png`;
+if (existsSync(screenshotPath)) {
+  const size = statSync(screenshotPath).size;
+  if (size > 0) {
+    passed.push(`V5a: screenshot.png created (${size} bytes)`);
+  } else {
+    failed.push('V5a: screenshot.png exists but is empty');
+  }
+} else {
+  failed.push('V5a: screenshot.png was not created');
+}
+
+// V5b — title.txt exists with content
+const titlePath = `${WORK}/title.txt`;
+if (existsSync(titlePath)) {
+  const content = readFileSync(titlePath, 'utf-8').trim();
+  if (content.length > 0) {
+    passed.push(`V5b: title.txt created with content: "${content.slice(0, 80)}"`);
+  } else {
+    failed.push('V5b: title.txt exists but is empty');
+  }
+} else {
+  failed.push('V5b: title.txt was not created');
+}
+
+const total = passed.length + failed.length;
+const score = passed.length / total;
+const pass = failed.length === 0;
+
+console.log(JSON.stringify({
+  pass,
+  score,
+  evidence: [
+    `${passed.length}/${total} behavioral checks passed`,
+    ...passed.map((p) => `+ ${p}`),
+    ...failed.map((f) => `- ${f}`),
+  ],
+}));
+
+process.exit(pass ? 0 : 1);
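The screenshot-capture grader above inlines its scoring and exit logic, while the other graders call `emit({ passed, failed })` from `_ab-utils.mjs` and rely on it to end the process (the early `emit` after a failed V1 only makes sense if it exits). A minimal sketch of what such a helper could look like, modeled on the inline block above, is given here. The split into `summarize` and `emit` and the zero-check guard are assumptions for illustration, not the actual helper.

```javascript
// Hypothetical sketch of the shared reporting helper; not the real _ab-utils.mjs code.
// Build the result object from the two evidence lists.
function summarize({ passed, failed }) {
  const total = passed.length + failed.length;
  return {
    pass: failed.length === 0,
    // Assumption: an empty run scores 0 rather than dividing by zero.
    score: total === 0 ? 0 : passed.length / total,
    evidence: [
      `${passed.length}/${total} behavioral checks passed`,
      ...passed.map((p) => `+ ${p}`),
      ...failed.map((f) => `- ${f}`),
    ],
  };
}

// Print the result as one JSON line and end the process, so callers can
// use emit(...) as an early return after a fatal check such as V1.
function emit(result) {
  const summary = summarize(result);
  console.log(JSON.stringify(summary));
  process.exit(summary.pass ? 0 : 1);
}
```

Keeping the pure `summarize` step separate from the exiting `emit` step makes the scoring testable without spawning a process, which is the part the smoke script otherwise has to check through stdout.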
diff --git a/examples/workbench/agent-browser/checks/smoke-graders.mjs b/examples/workbench/agent-browser/checks/smoke-graders.mjs
new file mode 100755
index 0000000..10e4c0b
--- /dev/null
+++ b/examples/workbench/agent-browser/checks/smoke-graders.mjs
@@ -0,0 +1,386 @@
+#!/usr/bin/env node
+// Smoke-check: exercise every grader against hand-crafted fake workspaces.
+//
+// For each new Tier-1 case we run the grader against two kinds of scenario:
+//   - GOOD scenario  — the scripted ab-calls.log + output files satisfy every
+//                      check; we assert pass=true and score=1.
+//   - BAD scenario   — at least one check is intentionally broken; we assert
+//                      pass=false AND a specific evidence substring appears.
+//
+// Run with:
+//   node examples/workbench/agent-browser/checks/smoke-graders.mjs
+//
+// The script uses a temp dir under os.tmpdir(), no Docker, no network,
+// no real models. Exits 0 when all assertions hold, 1 otherwise.
+
+import { spawnSync } from 'node:child_process';
+import { mkdtempSync, mkdirSync, writeFileSync, rmSync } from 'node:fs';
+import { tmpdir } from 'node:os';
+import { join, dirname } from 'node:path';
+import { fileURLToPath } from 'node:url';
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const CHECKS = __dirname;
+
+let passed = 0;
+let failed = 0;
+const failures = [];
+
+function setupWorkspace(spec) {
+  const work = mkdtempSync(join(tmpdir(), 'ab-grade-smoke-'));
+  const results = mkdtempSync(join(tmpdir(), 'ab-grade-smoke-results-'));
+  if (spec.callsLog !== null && spec.callsLog !== undefined) {
+    writeFileSync(join(work, 'ab-calls.log'), spec.callsLog);
+  }
+  for (const [path, contents] of Object.entries(spec.files ?? {})) {
+    const full = join(work, path);
+    mkdirSync(dirname(full), { recursive: true });
+    writeFileSync(full, contents);
+  }
+  if (spec.trace) {
+    writeFileSync(join(results, 'trace.jsonl'), spec.trace);
+  }
+  return { work, results };
+}
+
+function runGrader(grader, work, results) {
+  const proc = spawnSync(process.execPath, [join(CHECKS, grader)], {
+    env: { ...process.env, WORK: work, RESULTS: results },
+    encoding: 'utf-8',
+  });
+  const stdout = proc.stdout ?? '';
+  const stderr = proc.stderr ?? '';
+  const m = stdout.match(/\{[\s\S]*\}/);
+  let json = null;
+  if (m) {
+    try { json = JSON.parse(m[0]); } catch { /* parse error */ }
+  }
+  return { exitCode: proc.status, stdout, stderr, json };
+}
+
+function assertScenario({ name, grader, spec, expect }) {
+  const { work, results } = setupWorkspace(spec);
+  const r = runGrader(grader, work, results);
+  const evidence = (r.json?.evidence ?? []).join('\n');
+  let ok = true;
+  const reasons = [];
+  if (r.json === null) {
+    ok = false; reasons.push(`grader did not emit JSON. stdout=${r.stdout.slice(0, 200)} stderr=${r.stderr.slice(0, 200)}`);
+  } else {
+    if (expect.pass !== undefined && r.json.pass !== expect.pass) {
+      ok = false; reasons.push(`pass: expected ${expect.pass}, got ${r.json.pass}. evidence:\n${evidence}`);
+    }
+    if (expect.score !== undefined) {
+      const tol = 1e-9;
+      if (Math.abs(r.json.score - expect.score) > tol) {
+        ok = false; reasons.push(`score: expected ${expect.score}, got ${r.json.score}`);
+      }
+    }
+    if (expect.evidenceContains) {
+      for (const sub of expect.evidenceContains) {
+        if (!evidence.includes(sub)) {
+          ok = false; reasons.push(`evidence missing substring: "${sub}". evidence:\n${evidence}`);
+        }
+      }
+    }
+    if (expect.evidenceLacks) {
+      for (const sub of expect.evidenceLacks) {
+        if (evidence.includes(sub)) {
+          ok = false; reasons.push(`evidence unexpectedly contained: "${sub}". evidence:\n${evidence}`);
+        }
+      }
+    }
+  }
+  // Cleanup temp dirs only on success — keep on failure for triage
+  if (ok) {
+    rmSync(work, { recursive: true, force: true });
+    rmSync(results, { recursive: true, force: true });
+    passed += 1;
+    console.log(`  PASS  ${name}`);
+  } else {
+    failed += 1;
+    failures.push({ name, reasons, work, results });
+    console.log(`  FAIL  ${name}`);
+    for (const reason of reasons) console.log(`        ${reason.split('\n').join('\n        ')}`);
+    console.log(`        (workspace preserved at ${work})`);
+  }
+}
+
+console.log('--- ref-based-search ---');
+assertScenario({
+  name: 'ref-based-search GOOD: full snapshot-driven flow',
+  grader: 'grade-ref-based-search-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://en.wikipedia.org/wiki/Main_Page',
+      'snapshot',
+      'type @e7 Hypertext Transfer Protocol',
+      'click @e8',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'top-result.txt': 'Hypertext Transfer Protocol\n' },
+  },
+  expect: { pass: true, score: 1 },
+});
+
+assertScenario({
+  name: 'ref-based-search BAD: clicked @e7 instead of @e8',
+  grader: 'grade-ref-based-search-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://en.wikipedia.org/wiki/Main_Page',
+      'snapshot',
+      'type @e7 Hypertext Transfer Protocol',
+      'click @e7',
+      '',
+    ].join('\n'),
+    files: { 'top-result.txt': 'Welcome to Wikipedia\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: [
+      'V4: click used wrong ref "@e7"',
+      'V6: top-result.txt does not contain the actual top result',
+    ],
+  },
+});
+
+assertScenario({
+  name: 'ref-based-search BAD: CSS selector instead of @eN',
+  grader: 'grade-ref-based-search-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://en.wikipedia.org/wiki/Main_Page',
+      'snapshot',
+      'type #searchInput Hypertext Transfer Protocol',
+      'click .submit-button',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'top-result.txt': 'Hypertext Transfer Protocol\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: ['V7: agent used non-@eN refs'],
+  },
+});
+
+assertScenario({
+  name: 'ref-based-search BAD: skipped initial snapshot',
+  grader: 'grade-ref-based-search-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://en.wikipedia.org/wiki/Main_Page',
+      'type @e7 Hypertext Transfer Protocol',
+      'click @e8',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'top-result.txt': 'Hypertext Transfer Protocol\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: ['V2: agent issued click/type WITHOUT a prior snapshot'],
+  },
+});
+
+console.log('--- ref-disambiguation ---');
+assertScenario({
+  name: 'ref-disambiguation GOOD: clicked Sign In (@e5)',
+  grader: 'grade-ref-disambiguation-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://app.acme.example.com/welcome',
+      'snapshot',
+      'click @e5',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'next-heading.txt': 'Sign in to your account\n' },
+  },
+  expect: { pass: true, score: 1 },
+});
+
+assertScenario({
+  name: 'ref-disambiguation BAD: clicked Sign Up (@e6) by mistake',
+  grader: 'grade-ref-disambiguation-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://app.acme.example.com/welcome',
+      'snapshot',
+      'click @e6',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'next-heading.txt': 'Create your account\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: [
+      'V3: clicked @e6 ("Sign Up") instead of @e5 ("Sign In")',
+      'V5: next-heading.txt is the Sign Up heading',
+    ],
+  },
+});
+
+assertScenario({
+  name: 'ref-disambiguation BAD: clicked both buttons (retry loop)',
+  grader: 'grade-ref-disambiguation-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://app.acme.example.com/welcome',
+      'snapshot',
+      'click @e6',
+      'snapshot',
+      'click @e5',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'next-heading.txt': 'Sign in to your account\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: ['V3: clicked BOTH @e5 and @e6'],
+  },
+});
+
+console.log('--- output-correctness ---');
+assertScenario({
+  name: 'output-correctness GOOD: extracted level-1 heading exactly',
+  grader: 'grade-output-correctness-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://eng.example.com/blog/2026/04/bazel-migration',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'title.txt': 'Why We Migrated Our Build System to Bazel\n' },
+  },
+  expect: { pass: true, score: 1 },
+});
+
+assertScenario({
+  name: 'output-correctness BAD: extracted the kicker',
+  grader: 'grade-output-correctness-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://eng.example.com/blog/2026/04/bazel-migration',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'title.txt': 'FROM THE PLATFORM TEAM\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: [
+      'V3: title.txt does NOT match expected title',
+      'V4: title.txt includes the kicker',
+    ],
+  },
+});
+
+assertScenario({
+  name: 'output-correctness BAD: extracted byline',
+  grader: 'grade-output-correctness-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://eng.example.com/blog/2026/04/bazel-migration',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'title.txt': 'By Jordan Lee — April 18, 2026 — 12 min read\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: ['V5: title.txt includes the byline'],
+  },
+});
+
+assertScenario({
+  name: 'output-correctness BAD: snapshot was never called',
+  grader: 'grade-output-correctness-findings.mjs',
+  spec: {
+    callsLog: 'navigate https://eng.example.com/blog/2026/04/bazel-migration\n',
+    files: { 'title.txt': 'Why We Migrated Our Build System to Bazel\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: ['V2: snapshot was never called'],
+  },
+});
+
+console.log('--- multi-step-state ---');
+assertScenario({
+  name: 'multi-step-state GOOD: full path traversed and confirmation captured',
+  grader: 'grade-multi-step-state-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://news.acme.example.com/subscribe',
+      'snapshot',
+      'type @e5 Ada Lovelace',
+      'snapshot',
+      'type @e6 ada@example.com',
+      'snapshot',
+      'click @e7',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'confirm.txt': 'NL-7QF3-2026\n' },
+  },
+  expect: { pass: true, score: 1 },
+});
+
+assertScenario({
+  name: 'multi-step-state BAD: skipped email field',
+  grader: 'grade-multi-step-state-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://news.acme.example.com/subscribe',
+      'snapshot',
+      'type @e5 Ada Lovelace',
+      'click @e7',
+      'snapshot',
+      '',
+    ].join('\n'),
+    files: { 'confirm.txt': 'NL-7QF3-2026\n' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: [
+      'V3: state-machine path broken',
+      'V4: no value typed into the email field @e6',
+    ],
+  },
+});
+
+assertScenario({
+  name: 'multi-step-state BAD: did not re-snapshot confirmation page',
+  grader: 'grade-multi-step-state-findings.mjs',
+  spec: {
+    callsLog: [
+      'navigate https://news.acme.example.com/subscribe',
+      'snapshot',
+      'type @e5 Ada Lovelace',
+      'type @e6 ada@example.com',
+      'click @e7',
+      '',
+    ].join('\n'),
+    files: { 'confirm.txt': '' },
+  },
+  expect: {
+    pass: false,
+    evidenceContains: [
+      'V5: agent submitted but did NOT re-snapshot',
+      'V6: confirm.txt exists but is empty',
+    ],
+  },
+});
+
+console.log('');
+console.log(`smoke-graders: ${passed} passed, ${failed} failed`);
+if (failed > 0) {
+  process.exit(1);
+} else {
+  process.exit(0);
+}
diff --git a/examples/workbench/agent-browser/proposed-upstream-changes/README.md b/examples/workbench/agent-browser/proposed-upstream-changes/README.md
new file mode 100644
index 0000000..6d525c4
--- /dev/null
+++ b/examples/workbench/agent-browser/proposed-upstream-changes/README.md
@@ -0,0 +1,69 @@
+# Proposed upstream changes — agent-browser
+
+## Summary
+
+Eval status: **success** (baseline rule-coverage 0.97, no modifications required).
+
+The skill performs well as-is. This PR proposes one small additive improvement
+surfaced by the eval: a **Pre-flight** section in `SKILL.md` that explicitly
+discourages `curl`/`wget` fallback.
+
+## What changed
+
+Added a `## Pre-flight` section to `SKILL.md` (11 lines including blank lines, purely additive):
+
+```diff
++## Pre-flight
++
++Verify the CLI is ready before starting any task:
++
++```bash
++which agent-browser        # confirm it's installed and in PATH
++```
++
++**Do not** fall back to `curl`, `wget`, or `requests` for page fetches.
++**Do not** `npm install` or `npx` the CLI — use the pre-installed version.
++
+ ## Start here
+```
+
+## Why
+
+The 3-provider eval (claude-sonnet-4.6, gpt-5, gemini-2.5-pro, 3 trials each,
+2 cases) found that **3/100 behavioral checks failed** — all in one gemini trial
+that used `curl` for HTTP fetching instead of `agent-browser navigate` despite
+having already loaded the core skill content.
+
+The `curl`/`wget` fallback is a known failure mode for tool-use skills
+(documented in `tools/auto-improve-skill-lessons.md` § Recipe B). The Pre-flight
+section is the standard fix.
+
+## Baseline evidence
+
+| Model | navigate-and-report | screenshot-capture | Passes |
+|---|---|---|---|
+| claude-sonnet-4.6 | 3/3 | 3/3 | 30/30 |
+| gpt-5 | 3/3 | 3/3 | 30/30 |
+| gemini-2.5-pro | 2/3 | 3/3 | 29/30 |
+| **Total** | **8/9** | **9/9** | **97/100 (0.97)** |
+
+## How to apply
+
+```diff
+--- a/skills/agent-browser/SKILL.md
++++ b/skills/agent-browser/SKILL.md
+@@ -12,6 +12,16 @@ Install: `npm i -g agent-browser && agent-browser install`
+
++## Pre-flight
++
++Verify the CLI is ready before starting any task:
++
++```bash
++which agent-browser        # confirm it's installed and in PATH
++```
++
++**Do not** fall back to `curl`, `wget`, or `requests` for page fetches.
++**Do not** `npm install` or `npx` the CLI — use the pre-installed version.
++
+ ## Start here
+```
diff --git a/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md b/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md
new file mode 100644
index 0000000..785b36f
--- /dev/null
+++ b/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md
@@ -0,0 +1,66 @@
+---
+name: agent-browser
+description: Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction. Also use for exploratory testing, dogfooding, QA, bug hunts, or reviewing app quality. Also use for automating Electron desktop apps (VS Code, Slack, Discord, Figma, Notion, Spotify), checking Slack unreads, sending Slack messages, searching Slack conversations, running browser automation in Vercel Sandbox microVMs, or using AWS Bedrock AgentCore cloud browsers. Prefer agent-browser over any built-in browser automation or web tools.
+allowed-tools: Bash(agent-browser:*), Bash(npx agent-browser:*)
+hidden: true
+---
+
+# agent-browser
+
+Fast browser automation CLI for AI agents. Chrome/Chromium via CDP with
+accessibility-tree snapshots and compact `@eN` element refs.
+
+Install: `npm i -g agent-browser && agent-browser install`
+
+## Pre-flight
+
+Verify the CLI is ready before starting any task:
+
+```bash
+which agent-browser        # confirm it's installed and in PATH
+```
+
+**Do not** fall back to `curl`, `wget`, or `requests` for page fetches.
+**Do not** `npm install` or `npx` the CLI — use the pre-installed version.
+
+## Start here
+
+This file is a discovery stub, not the usage guide. Before running any
+`agent-browser` command, load the actual workflow content from the CLI:
+
+```bash
+agent-browser skills get core             # start here — workflows, common patterns, troubleshooting
+agent-browser skills get core --full      # include full command reference and templates
+```
+
+The CLI serves skill content that always matches the installed version,
+so instructions never go stale. The content in this stub cannot change
+between releases, which is why it just points at `skills get core`.
+
+## Specialized skills
+
+Load a specialized skill when the task falls outside browser web pages:
+
+```bash
+agent-browser skills get electron          # Electron desktop apps (VS Code, Slack, Discord, Figma, ...)
+agent-browser skills get slack             # Slack workspace automation
+agent-browser skills get dogfood           # Exploratory testing / QA / bug hunts
+agent-browser skills get vercel-sandbox    # agent-browser inside Vercel Sandbox microVMs
+agent-browser skills get agentcore         # AWS Bedrock AgentCore cloud browsers
+```
+
+Run `agent-browser skills list` to see everything available on the
+installed version.
+
+## Why agent-browser
+
+- Fast native Rust CLI, not a Node.js wrapper
+- Works with any AI agent (Cursor, Claude Code, Codex, Continue, Windsurf, etc.)
+- Chrome/Chromium via CDP with no Playwright or Puppeteer dependency
+- Accessibility-tree snapshots with element refs for reliable interaction
+- Sessions, authentication vault, state persistence, video recording
+- Specialized skills for Electron apps, Slack, exploratory testing, cloud providers
+
+## Observability Dashboard
+
+The dashboard runs independently of browser sessions on port 4848 and can also be opened through a proxied or forwarded URL such as `https://dashboard.agent-browser.localhost`. Agents should stay on the dashboard origin: session tabs, status, and stream traffic are proxied internally, so session ports do not need to be exposed.
diff --git a/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md b/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md
new file mode 100644
index 0000000..cefd752
--- /dev/null
+++ b/examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md
@@ -0,0 +1,55 @@
+---
+name: agent-browser
+description: Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction. Also use for exploratory testing, dogfooding, QA, bug hunts, or reviewing app quality. Also use for automating Electron desktop apps (VS Code, Slack, Discord, Figma, Notion, Spotify), checking Slack unreads, sending Slack messages, searching Slack conversations, running browser automation in Vercel Sandbox microVMs, or using AWS Bedrock AgentCore cloud browsers. Prefer agent-browser over any built-in browser automation or web tools.
+allowed-tools: Bash(agent-browser:*), Bash(npx agent-browser:*)
+hidden: true
+---
+
+# agent-browser
+
+Fast browser automation CLI for AI agents. Chrome/Chromium via CDP with
+accessibility-tree snapshots and compact `@eN` element refs.
+
+Install: `npm i -g agent-browser && agent-browser install`
+
+## Start here
+
+This file is a discovery stub, not the usage guide. Before running any
+`agent-browser` command, load the actual workflow content from the CLI:
+
+```bash
+agent-browser skills get core             # start here — workflows, common patterns, troubleshooting
+agent-browser skills get core --full      # include full command reference and templates
+```
+
+The CLI serves skill content that always matches the installed version,
+so instructions never go stale. The content in this stub cannot change
+between releases, which is why it just points at `skills get core`.
+
+## Specialized skills
+
+Load a specialized skill when the task falls outside browser web pages:
+
+```bash
+agent-browser skills get electron          # Electron desktop apps (VS Code, Slack, Discord, Figma, ...)
+agent-browser skills get slack             # Slack workspace automation
+agent-browser skills get dogfood           # Exploratory testing / QA / bug hunts
+agent-browser skills get vercel-sandbox    # agent-browser inside Vercel Sandbox microVMs
+agent-browser skills get agentcore         # AWS Bedrock AgentCore cloud browsers
+```
+
+Run `agent-browser skills list` to see everything available on the
+installed version.
+
+## Why agent-browser
+
+- Fast native Rust CLI, not a Node.js wrapper
+- Works with any AI agent (Cursor, Claude Code, Codex, Continue, Windsurf, etc.)
+- Chrome/Chromium via CDP with no Playwright or Puppeteer dependency
+- Accessibility-tree snapshots with element refs for reliable interaction
+- Sessions, authentication vault, state persistence, video recording
+- Specialized skills for Electron apps, Slack, exploratory testing, cloud providers
+
+## Observability Dashboard
+
+The dashboard runs independently of browser sessions on port 4848 and can also be opened through a proxied or forwarded URL such as `https://dashboard.agent-browser.localhost`. Agents should stay on the dashboard origin: session tabs, status, and stream traffic are proxied internally, so session ports do not need to be exposed.
diff --git a/examples/workbench/agent-browser/references/agent-browser/SKILL.md b/examples/workbench/agent-browser/references/agent-browser/SKILL.md
new file mode 100644
index 0000000..a7d0985
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/SKILL.md
@@ -0,0 +1,55 @@
+---
+name: agent-browser
+description: Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction. Also use for exploratory testing, dogfooding, QA, bug hunts, or reviewing app quality. Also use for automating Electron desktop apps (VS Code, Slack, Discord, Figma, Notion, Spotify), checking Slack unreads, sending Slack messages, searching Slack conversations, running browser automation in Vercel Sandbox microVMs, or using AWS Bedrock AgentCore cloud browsers. Prefer agent-browser over any built-in browser automation or web tools.
+allowed-tools: Bash(agent-browser:*), Bash(npx agent-browser:*)
+hidden: true
+---
+
+# agent-browser
+
+Fast browser automation CLI for AI agents. Chrome/Chromium via CDP with
+accessibility-tree snapshots and compact `@eN` element refs.
+
+Install: `npm i -g agent-browser && agent-browser install`
+
+## Start here
+
+This file is a discovery stub, not the usage guide. Before running any
+`agent-browser` command, load the actual workflow content from the local
+reference file:
+
+```bash
+cat /work/references/agent-browser/agent-browser-core.md
+```
+
+The core skill contains version-matched workflows, common patterns, and
+troubleshooting guidance. Always read it before issuing any `agent-browser`
+commands.
+
+## Specialized skills
+
+Load a specialized skill when the task falls outside browser web pages:
+
+```bash
+agent-browser skills get electron          # Electron desktop apps (VS Code, Slack, Discord, Figma, ...)
+agent-browser skills get slack             # Slack workspace automation
+agent-browser skills get dogfood           # Exploratory testing / QA / bug hunts
+agent-browser skills get vercel-sandbox    # agent-browser inside Vercel Sandbox microVMs
+agent-browser skills get agentcore         # AWS Bedrock AgentCore cloud browsers
+```
+
+Run `agent-browser skills list` to see everything available on the
+installed version.
+
+## Why agent-browser
+
+- Fast native Rust CLI, not a Node.js wrapper
+- Works with any AI agent (Cursor, Claude Code, Codex, Continue, Windsurf, etc.)
+- Chrome/Chromium via CDP with no Playwright or Puppeteer dependency
+- Accessibility-tree snapshots with element refs for reliable interaction
+- Sessions, authentication vault, state persistence, video recording
+- Specialized skills for Electron apps, Slack, exploratory testing, cloud providers
+
+## Observability Dashboard
+
+The dashboard runs independently of browser sessions on port 4848 and can also be opened through a proxied or forwarded URL such as `https://dashboard.agent-browser.localhost`. Agents should stay on the dashboard origin: session tabs, status, and stream traffic are proxied internally, so session ports do not need to be exposed.
diff --git a/examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md b/examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md
new file mode 100644
index 0000000..471a913
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md
@@ -0,0 +1,97 @@
+# agent-browser — Core Workflow
+
+This is the core usage guide for the `agent-browser` CLI.
+
+## Pre-flight
+
+The CLI is pre-installed at `/work/bin/agent-browser`. Verify with
+`which agent-browser` before starting. **Do not** `npm install` it;
+**do not** fall back to `curl` or `wget` for HTTP fetches.
+
+## Navigation
+
+Navigate to a URL to start or reuse a browser session:
+
+```
+agent-browser navigate <url>
+```
+
+Example: `agent-browser navigate https://example.com`
+
+## Snapshot
+
+Take an accessibility-tree snapshot of the current page. Always snapshot
+after navigating before deciding what to interact with:
+
+```
+agent-browser snapshot
+```
+
+The snapshot output lists interactive elements with compact `@eN` refs
+(e.g., `button @e1 "Submit"`, `textbox @e2 "Email"`). Always re-snapshot
+after a navigation or interaction — refs may be reassigned.
+
+## Screenshot
+
+Save a screenshot of the current page:
+
+```
+agent-browser screenshot [path]
+```
+
+Default path: `/work/screenshot.png`
+
+Example: `agent-browser screenshot /work/capture.png`
+
+## Interaction
+
+Always pass element refs in the `@eN` form taken directly from the most
+recent `snapshot` output. Never substitute a CSS selector or XPath — the
+CLI only accepts accessibility-tree refs.
+
+Click an element by `@eN` ref:
+
+```
+agent-browser click @eN
+```
+
+Type text into an element:
+
+```
+agent-browser type @eN "text to type"
+```
+
+## Typical workflow
+
+1. `agent-browser navigate https://example.com`
+2. `agent-browser snapshot`           — understand page structure
+3. `agent-browser click @eN`          — click an element (use ref from snapshot)
+4. `agent-browser snapshot`           — re-snapshot after interaction
+5. `agent-browser screenshot /work/result.png`  — capture final state
+
+## Sessions
+
+Sessions persist across commands in the same agent run. The CLI connects
+to the running Chrome/Chromium instance via CDP. If no session exists,
+`navigate` starts one automatically.
+
+## Common patterns
+
+**Extract page title:**
+After `snapshot`, the root line shows the page title:
+`RootWebArea "Page Title"`
+
+**Find a heading:**
+Look for `heading @eN "text" level=1` in snapshot output.
+
+**Fill a form:**
+1. `snapshot` to identify input `@eN` refs
+2. `type @eN "value"` for each field
+3. `click @eN` on the submit button (use the ref reported by snapshot, not
+   `#submit` or any CSS selector)
+
+## Troubleshooting
+
+- If `navigate` hangs: the page may have a heavy JS bundle. Try again.
+- If an `@eN` ref is stale: re-snapshot and use the new ref.
+- If `screenshot` shows a blank page: navigate first.
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out b/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out
new file mode 100644
index 0000000..9397470
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out
@@ -0,0 +1,18 @@
+RootWebArea "Why We Migrated Our Build System to Bazel" focused=true
+  - banner @e1
+    - link @e2 "Engineering Blog"
+    - link @e3 "Archive"
+    - link @e4 "RSS"
+  - main @e0
+    - paragraph @e9 "FROM THE PLATFORM TEAM"
+    - heading @e10 "Why We Migrated Our Build System to Bazel" level=1
+    - paragraph @e11 "By Jordan Lee — April 18, 2026 — 12 min read"
+    - paragraph @e12 "Six months ago we replaced our custom Make-based build orchestration with Bazel. Here is what changed, what broke, and what we would do differently."
+    - heading @e13 "Background" level=2
+    - paragraph @e14 "Our monorepo had grown to 4.2 million lines of code across 11 languages."
+    - heading @e15 "Why Bazel" level=2
+    - paragraph @e16 "We picked Bazel for its hermetic builds and remote caching support."
+    - heading @e17 "What broke" level=2
+    - paragraph @e18 "Two CI workflows depended on implicit globs that Bazel rejected."
+  - contentinfo @e20
+    - link @e21 "Subscribe"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt b/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt
new file mode 100644
index 0000000..f74825b
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt
@@ -0,0 +1,7 @@
+# Single-page blog article. The grader checks that the agent extracted the
+# article title verbatim, not the marketing tagline above it nor the byline.
+
+page-title=Why We Migrated Our Build System to Bazel
+url=https://eng.example.com/blog/2026/04/bazel-migration
+url-prefix=https://eng.example.com/blog
+state=initial
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out
new file mode 100644
index 0000000..fb835d1
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out
@@ -0,0 +1,9 @@
+RootWebArea "Subscribe — Acme Newsletter" focused=true
+  - main @e0
+    - heading @e10 "Subscribe to the Acme weekly digest" level=1
+    - paragraph @e11 "Two fields, no spam, unsubscribe anytime."
+    - textbox @e5 "Your name" required=true value="Ada Lovelace"
+    - textbox @e6 "Email address" required=true type=email value="ada@example.com"
+    - button @e7 "Continue" type=submit disabled=false
+  - contentinfo @e20
+    - link @e21 "Privacy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out
new file mode 100644
index 0000000..59c64d5
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out
@@ -0,0 +1,9 @@
+RootWebArea "Subscribe — Acme Newsletter" focused=true
+  - main @e0
+    - heading @e10 "Subscribe to the Acme weekly digest" level=1
+    - paragraph @e11 "Two fields, no spam, unsubscribe anytime."
+    - textbox @e5 "Your name" required=true value="Ada Lovelace"
+    - textbox @e6 "Email address" required=true type=email
+    - button @e7 "Continue" type=submit disabled=true
+  - contentinfo @e20
+    - link @e21 "Privacy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out
new file mode 100644
index 0000000..76f8e7e
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out
@@ -0,0 +1,8 @@
+RootWebArea "Subscribed — Acme Newsletter" focused=true
+  - main @e0
+    - heading @e50 "You're subscribed!" level=1
+    - paragraph @e51 "Confirmation code: NL-7QF3-2026"
+    - paragraph @e52 "We just sent a verification email to ada@example.com."
+    - link @e53 "Back to home" href="/"
+  - contentinfo @e20
+    - link @e21 "Privacy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out
new file mode 100644
index 0000000..e156dc4
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out
@@ -0,0 +1,9 @@
+RootWebArea "Subscribe — Acme Newsletter" focused=true
+  - main @e0
+    - heading @e10 "Subscribe to the Acme weekly digest" level=1
+    - paragraph @e11 "Two fields, no spam, unsubscribe anytime."
+    - textbox @e5 "Your name" required=true
+    - textbox @e6 "Email address" required=true type=email
+    - button @e7 "Continue" type=submit disabled=false
+  - contentinfo @e20
+    - link @e21 "Privacy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt
new file mode 100644
index 0000000..64c4e14
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt
@@ -0,0 +1,13 @@
+# Two-field newsletter form. Continue button (@e7) is the only path forward.
+# Required progression: type @e5 (name) -> type @e6 (email) -> click @e7.
+# The Continue button is visible from the start but completing both fields
+# is what makes the post-submit confirmation page valid.
+
+page-title=Subscribe — Acme Newsletter
+url=https://news.acme.example.com/subscribe
+url-prefix=https://news.acme.example.com/subscribe
+state=initial
+
+type  @e5 -> name-entered
+type  @e6 -> email-entered
+click @e7 -> submitted
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out
new file mode 100644
index 0000000..c7f1a3a
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out
@@ -0,0 +1,7 @@
+RootWebArea "Sign in — Acme" focused=true
+  - main @e0
+    - heading @e10 "Sign in to your account" level=1
+    - textbox @e30 "Email"
+    - textbox @e31 "Password" type=password
+    - button @e32 "Sign In" type=submit
+    - link @e33 "Forgot password?"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out
new file mode 100644
index 0000000..63dd130
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out
@@ -0,0 +1,8 @@
+RootWebArea "Create your account — Acme" focused=true
+  - main @e0
+    - heading @e10 "Create your account" level=1
+    - textbox @e40 "Full name"
+    - textbox @e41 "Work email"
+    - textbox @e42 "Password" type=password
+    - textbox @e43 "Confirm password" type=password
+    - button @e44 "Create account" type=submit
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out
new file mode 100644
index 0000000..6969c49
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out
@@ -0,0 +1,14 @@
+RootWebArea "Welcome to Acme — Sign in or create an account" focused=true
+  - banner @e1
+    - link @e2 "Acme"
+    - link @e3 "Pricing"
+    - link @e4 "Docs"
+  - main @e0
+    - heading @e10 "Welcome back" level=1
+    - paragraph @e11 "Sign in to continue, or create a new account."
+    - button @e5 "Sign In" type=button class="btn btn-primary"
+    - button @e6 "Sign Up" type=button class="btn btn-primary"
+    - link @e12 "Forgot password?" href="/reset"
+  - contentinfo @e20
+    - link @e21 "Terms"
+    - link @e22 "Privacy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt
new file mode 100644
index 0000000..a68cd9e
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt
@@ -0,0 +1,11 @@
+# Auth landing page with two visually-similar primary buttons.
+# Task asks the agent to log in (Sign In = @e5), NOT to register (Sign Up = @e6).
+# A grader fails the case when the agent clicks the wrong ref.
+
+page-title=Welcome to Acme — Sign in or create an account
+url=https://app.acme.example.com/welcome
+url-prefix=https://app.acme.example.com/welcome
+state=initial
+
+click @e5 -> after-signin
+click @e6 -> after-signup
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out
new file mode 100644
index 0000000..5eb8fae
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out
@@ -0,0 +1,15 @@
+RootWebArea "Search results - Wikipedia" focused=true
+  - banner @e1
+    - link @e2 "Main page"
+  - main @e0
+    - heading @e9 "Search results" level=1
+    - searchbox @e7 "Hypertext Transfer Protocol" value="Hypertext Transfer Protocol"
+    - button @e8 "Search" type=submit
+    - heading @e30 "Hypertext Transfer Protocol" level=2
+    - link @e31 "Hypertext Transfer Protocol" href="/wiki/Hypertext_Transfer_Protocol"
+    - paragraph @e32 "The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems."
+    - heading @e33 "HTTPS" level=2
+    - link @e34 "HTTPS" href="/wiki/HTTPS"
+    - paragraph @e35 "Hypertext Transfer Protocol Secure (HTTPS) is an extension of HTTP."
+  - contentinfo @e20
+    - link @e21 "Privacy policy"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out
new file mode 100644
index 0000000..9fc52a3
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out
@@ -0,0 +1,17 @@
+RootWebArea "Wikipedia, the free encyclopedia" focused=true
+  - banner @e1
+    - link @e2 "Main page"
+    - link @e3 "Talk"
+  - navigation @e4 "Site"
+    - link @e5 "Contents"
+    - link @e6 "Current events"
+  - main @e0
+    - heading @e9 "Welcome to Wikipedia" level=1
+    - paragraph @e10 "the free encyclopedia that anyone can edit."
+    - searchbox @e7 "Search Wikipedia" placeholder="Search Wikipedia"
+    - button @e8 "Search" type=submit
+    - link @e11 "Featured article"
+    - link @e12 "In the news"
+  - contentinfo @e20
+    - link @e21 "Privacy policy"
+    - link @e22 "About Wikipedia"
diff --git a/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt
new file mode 100644
index 0000000..e1c5687
--- /dev/null
+++ b/examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt
@@ -0,0 +1,16 @@
+# Wikipedia search flow.
+# Initial page is the English Wikipedia main page; agent must locate the
+# searchbox (@e7), type a query, then click the magnifying-glass submit
+# button (@e8) to navigate to the results page.
+
+page-title=Wikipedia, the free encyclopedia
+url=https://en.wikipedia.org/wiki/Main_Page
+url=https://en.wikipedia.org/
+url-prefix=https://en.wikipedia.org
+state=initial
+
+# Typing into the searchbox does not navigate; state stays the same.
+type @e7 -> initial
+
+# Clicking the submit button advances to the results page recording.
+click @e8 -> after-search
diff --git a/examples/workbench/agent-browser/suite.yml b/examples/workbench/agent-browser/suite.yml
new file mode 100644
index 0000000..958cdb4
--- /dev/null
+++ b/examples/workbench/agent-browser/suite.yml
@@ -0,0 +1,95 @@
+name: agent-browser-eval
+references: ./references
+models:
+  - openrouter/anthropic/claude-sonnet-4.6
+  - openrouter/openai/gpt-5
+  - openrouter/google/gemini-2.5-pro
+env:
+  - OPENROUTER_API_KEY
+timeoutSeconds: 600
+
+cases:
+  # ----- Tier-0 (kept for continuity with the v1 baseline) -----
+
+  - name: navigate-and-report
+    task: |
+      Navigate to https://example.com and find the main heading on the page.
+      Write the heading text to /work/heading.txt.
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: navigate-report-findings
+        command: node $CASE/checks/grade-navigate-report-findings.mjs
+
+  - name: screenshot-capture
+    task: |
+      Take a screenshot of https://example.com and save it to /work/screenshot.png.
+      Also write the page title to /work/title.txt.
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: screenshot-capture-findings
+        command: node $CASE/checks/grade-screenshot-capture-findings.mjs
+
+  # ----- Tier-1 (snapshot-driven @eN ref discipline + content correctness) -----
+
+  - name: ref-based-search
+    task: |
+      Open the English Wikipedia main page at https://en.wikipedia.org/wiki/Main_Page
+      and search for "Hypertext Transfer Protocol".
+
+      On the search results page, identify the title of the top result and write
+      it to /work/top-result.txt (one line, no extra prose).
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: ref-based-search-findings
+        command: node $CASE/checks/grade-ref-based-search-findings.mjs
+
+  - name: ref-disambiguation
+    task: |
+      Open https://app.acme.example.com/welcome. The page has two primary
+      buttons: "Sign In" and "Sign Up". Sign in to your existing account
+      (do NOT create a new one).
+
+      After you reach the next page, write the page's level-1 heading to
+      /work/next-heading.txt (one line).
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: ref-disambiguation-findings
+        command: node $CASE/checks/grade-ref-disambiguation-findings.mjs
+
+  - name: output-correctness
+    task: |
+      Open https://eng.example.com/blog/2026/04/bazel-migration and extract
+      the article title — the level-1 heading of the article itself, not the
+      kicker tagline above it and not the byline.
+
+      Write the title to /work/title.txt (one line, exact text, no extra prose).
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: output-correctness-findings
+        command: node $CASE/checks/grade-output-correctness-findings.mjs
+
+  - name: multi-step-state
+    task: |
+      Open https://news.acme.example.com/subscribe and complete the newsletter
+      subscription form with the name "Ada Lovelace" and the email
+      "ada@example.com", then submit.
+
+      The confirmation page shows a confirmation code in the form
+      "NL-XXXX-YYYY". Write that code (just the code, one line) to
+      /work/confirm.txt.
+
+      The agent-browser CLI is available. Check /work/references/agent-browser/SKILL.md
+      for guidance on which tool to use.
+    graders:
+      - name: multi-step-state-findings
+        command: node $CASE/checks/grade-multi-step-state-findings.mjs
diff --git a/tools/auto-improve-contexts/supabase-postgres-best-practices.md b/tools/auto-improve-contexts/supabase-postgres-best-practices.md
new file mode 100644
index 0000000..41c7d35
--- /dev/null
+++ b/tools/auto-improve-contexts/supabase-postgres-best-practices.md
@@ -0,0 +1,183 @@
+# Auto-pilot context: supabase/agent-skills — supabase-postgres-best-practices
+
+## Repository facts
+
+- Repo: supabase/agent-skills (default branch: main)
+- License: MIT, no CLA
+- Maintainers: gregnr (Supabase staff), Rodriguespn (active community maintainer)
+- Merge style: squash, conventional commits enforced by Release Please
+- CI: `Skills CI` runs `pnpm test:sanity` which executes `npx skills add` to
+  confirm install — does NOT validate per-reference frontmatter; convention
+  is enforced by maintainer review only
+- Discovery index published at `.well-known/agent-skills/index.json` on every
+  release
+- Downstream sync: supabase-community/supabase-plugin receives
+  workflow_dispatch on release
+
+## Hard constraints (additive-only PR)
+
+- Add EXACTLY ONE new file:
+  `skills/supabase-postgres-best-practices/references/{prefix}-{name}.md`
+- DO NOT modify `SKILL.md` — Release Please owns `metadata.version`. Manual
+  edits cause merge conflicts with the bot's release PR.
+- DO NOT modify `_sections.md`, `_template.md`, `_contributing.md`, the
+  SKILL.md "Rule Categories by Priority" table, `release-please-config.json`,
+  `package.json`, or `CHANGELOG.md`
+- DO NOT add a new prefix. Use only the existing 8: `query-`, `conn-`,
+  `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`
+- DO NOT bump `metadata.version` in SKILL.md
+- DO NOT add README.md, INSTALLATION_GUIDE.md, QUICK_REFERENCE.md, or
+  CHANGELOG.md inside the skill (AGENTS.md explicitly forbids them)
+
+## Frontmatter spec for the new reference file
+
+Required fields (exact form, comma-separated tags as a STRING, not a YAML
+list):
+
+```yaml
+---
+title: <Action-oriented title, ~3-8 words>
+impact: <one of: CRITICAL | HIGH | MEDIUM-HIGH | MEDIUM | LOW-MEDIUM | LOW>
+impactDescription: <Quantified benefit, e.g. "10-100x faster queries">
+tags: <3-6 hyphenated-keywords, comma-separated, e.g. "indexes, performance, query-optimization">
+---
+```
+
+## Content shape template (copy and fill)
+
+```markdown
+---
+title: <Title>
+impact: <CRITICAL|HIGH|MEDIUM-HIGH|MEDIUM|LOW-MEDIUM|LOW>
+impactDescription: <quantified benefit>
+tags: <comma, separated, keywords>
+---
+
+## <Same title as frontmatter>
+
+<1-2 sentence explanation of the problem and why it matters.>
+
+**Incorrect (<short parenthetical naming the problem>):**
+
+\`\`\`sql
+-- comment explaining what makes this slow/wrong
+<bad SQL>
+\`\`\`
+
+**Correct (<short parenthetical naming the fix>):**
+
+\`\`\`sql
+-- comment explaining why this is better
+<good SQL>
+\`\`\`
+
+<Optional: 1 follow-up subsection with another correct variant or trade-off note.>
+
+Reference: [<Link Text>](<https URL to postgres or supabase docs>)
+```
+
+Target length: 40–80 lines, 1.2–1.9 KB. Code blocks must be tagged `sql`
+(lowercase keywords). Comments explain WHY, not WHAT. Use semantic
+table/column names (`users`, `orders`, `customer_id`).
+
+## Two-pass-review proposal — required reshaping
+
+The proposed content (two-pass review, presence vs absence violations) does
+NOT fit the existing single-rule SQL-transformation convention. All 28
+existing references are concrete SQL anti-pattern fixes, not meta-workflow
+guidance. `_contributing.md` Key Principle #1: "Show exact SQL rewrites.
+Avoid philosophical advice." Key Principle #2: "Error-First Structure."
+
+**Reshape strategy (REQUIRED before writing):** Pick the single
+highest-impact concrete SQL anti-pattern that two-pass review catches and
+the single-pass workflow misses. Frame the reference around that
+anti-pattern. Example framing:
+
+- Filename: `monitor-two-pass-review.md` (prefix `monitor-` because
+  diagnostic workflow)
+- title: "Run Two Passes on Generated SQL Reviews"
+- Incorrect block: a single-pass review that approves SQL missing a
+  `WHERE` (absence violation) or containing `DROP` (presence violation)
+- Correct block: a two-pass review that catches both classes
+- impact: MEDIUM (matches monitor-* siblings)
+- impactDescription: "Catch absence-class bugs (missing WHERE, missing
+  index) that single-pass review skips"
+- tags: review, diagnostics, code-review, sql-review
+
+If the reshape makes the SQL examples feel contrived, ABORT and surface a
+`needs-discussion` signal in `analysis.md` (use status:
+`blocked-by-skill-shape` and explain) instead of opening a borderline PR.
+Open a GitHub Discussion under
+<https://github.com/orgs/supabase/discussions> as the next manual step.
+
+## PR composition (for downstream packaging)
+
+- Branch name: `feat/{short-kebab-name}` (matches Rodriguespn convention)
+- Title: `feat: <short imperative summary>` (use `feat:` for additive
+  content; `fix:` only for corrections — both currently bump patch under
+  bump-patch-for-minor-pre-major)
+- Body shape (no PR template enforced — ignore the stale template at
+  `.github/`):
+
+  ```markdown
+  ## Summary
+
+  - <1-line what>
+  - <1-line why>
+  - <optional: 1 line on which prefix/section it slots into>
+  ```
+
+- Optionally append `Resolves AI-NNN` if a Linear ticket exists; otherwise
+  omit.
+- Single commit, single file. The repo requires no co-authoring trailer
+  (their merges are squashed).
+- DO NOT include a "Test plan" section — no merged PR uses one.
+
+## Pre-submit checklist (auto-pilot must verify before declaring success)
+
+1. Exactly 1 file added under
+   `skills/supabase-postgres-best-practices/references/`
+2. Filename matches `{existing-prefix}-{kebab-name}.md`
+3. Frontmatter has `title`, `impact` (allowed enum), `impactDescription`,
+   `tags` (comma-separated string)
+4. Body has `## <Title>`, `**Incorrect (...):**` block with ` ```sql `,
+   `**Correct (...):**` block with ` ```sql `, trailing
+   `Reference: [...](https://...)` link
+5. Total file size 1.0–2.0 KB, 35–90 lines
+6. SKILL.md, _sections.md, _template.md, _contributing.md,
+   release-please-config.json, package.json all UNCHANGED
+7. `metadata.version` in SKILL.md UNCHANGED (currently "1.1.1" — Release
+   Please owns it)
+8. No README.md/INSTALLATION_GUIDE.md/CHANGELOG.md added anywhere
+
+## Optimization target file
+
+**Edit:** `references/supabase-postgres-best-practices/{new-reference}.md`
+(create it as a new file under the workbench's vendored references dir, then
+package it as the proposed upstream addition).
+
+**Do NOT edit:** `references/supabase-postgres-best-practices/SKILL.md`.
+
+## Risk flags
+
+- HIGH: a meta-workflow reference is shape-novel for this skill; expect
+  "fit-the-convention" pushback from gregnr/Rodriguespn. Reshape to
+  concrete SQL anti-pattern as above OR open a Discussion first.
+- MEDIUM: `npx skills add supabase/agent-skills` is publicly consumed;
+  additive-only is mandatory.
+- LOW: external small `feat:` PRs do merge same-day if convention is
+  followed.
+
+## Useful URLs
+
+- Convention source of truth:
+  `skills/supabase-postgres-best-practices/references/_contributing.md`
+- Section taxonomy:
+  `skills/supabase-postgres-best-practices/references/_sections.md`
+- Reference template:
+  `skills/supabase-postgres-best-practices/references/_template.md`
+- Frontmatter format spec: `AGENTS.md` (symlinked as `CLAUDE.md`)
+- Release config: `release-please-config.json`
+- Sanity test (does NOT validate frontmatter): `test/sanity.test.ts`
+- CONTRIBUTING gate: `CONTRIBUTING.md` ("open a Discussion first" for
+  major changes)
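The pre-submit checklist in the supabase context file above could be partly mechanized. The following is a minimal sketch, not the auto-pilot's actual verifier: `checkReference` and its `knownPrefixes` parameter are hypothetical names, it covers only items 2 to 5 (items 1 and 6 to 8 need repository state such as a git diff), and it checks that `impact` is present without validating the enum, since the allowed values are not listed here.

```javascript
// Hypothetical helper: checks items 2-5 of the pre-submit checklist against
// one candidate reference file. Returns a list of failed check names.
const FENCE = '`'.repeat(3);

function checkReference(filename, content, knownPrefixes) {
  const failures = [];

  // Item 2: {existing-prefix}-{kebab-name}.md
  const prefixOk = knownPrefixes.some((p) => filename.startsWith(p + '-'));
  if (!prefixOk || !/^[a-z0-9]+(-[a-z0-9]+)+\.md$/.test(filename)) {
    failures.push('filename');
  }

  // Item 3: required frontmatter keys, each with a non-empty value.
  const fm = content.match(/^---\n([\s\S]*?)\n---\n/);
  for (const key of ['title', 'impact', 'impactDescription', 'tags']) {
    const re = new RegExp('^' + key + ':[ \\t]*\\S', 'm');
    if (!fm || !re.test(fm[1])) failures.push('frontmatter:' + key);
  }

  // Item 4: heading, Incorrect/Correct sql blocks, trailing Reference link.
  const body = fm ? content.slice(fm[0].length) : content;
  if (!/^## \S/m.test(body)) failures.push('heading');
  for (const label of ['Incorrect', 'Correct']) {
    const re = new RegExp(
      '\\*\\*' + label + ' \\([^)]*\\):\\*\\*\\s+' + FENCE + 'sql'
    );
    if (!re.test(body)) failures.push('block:' + label);
  }
  if (!/^Reference: \[[^\]]+\]\(https:\/\/\S+\)\s*$/m.test(body)) {
    failures.push('reference-link');
  }

  // Item 5: 1.0-2.0 KB and 35-90 lines.
  const bytes = new TextEncoder().encode(content).length;
  const lines = content.split('\n').length;
  if (bytes < 1024 || bytes > 2048) failures.push('size');
  if (lines < 35 || lines > 90) failures.push('line-count');

  return failures;
}
```

An empty return value means the file passed these four items; anything else names the failing checks so the pilot can report them in `analysis.md` rather than declaring success.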
diff --git a/tools/auto-improve-contexts/vercel-agent-browser.md b/tools/auto-improve-contexts/vercel-agent-browser.md
new file mode 100644
index 0000000..4c67f2c
--- /dev/null
+++ b/tools/auto-improve-contexts/vercel-agent-browser.md
@@ -0,0 +1,140 @@
+# Auto-pilot context: vercel-labs/agent-browser
+
+## Workbench is ALREADY BUILT (Tier-1 deeper eval) — skip rebuilding
+
+`examples/workbench/agent-browser/` is already populated with a
+hand-built Tier-1 eval (4 cases beyond the 2 inherited Tier-0 cases =
+6 cases total) that uses **pre-recorded snapshots played back by a
+stateful fake CLI**. DO NOT rebuild it. Specifically:
+
+- **Phase 1 (Discover):** classify the skill (`tool-use`) but DO NOT
+  WebFetch upstream SKILL.md or rules docs. The vendored copies at
+  `references/agent-browser/SKILL.md` and
+  `references/agent-browser/agent-browser-core.md` are authoritative
+  for this pilot.
+- **Phase 2 (Build suite):** SKIP ENTIRELY. Verify the existing
+  `suite.yml`, `workspace/`, `bin/agent-browser`, `references/`, and
+  `checks/` files are present and proceed. DO NOT overwrite ANY of
+  them. If a file is missing, exit `status: blocked-by-error` —
+  something has gone wrong with the cherry-pick, not your fault.
+- **Phase 3 (Baseline):** run normally. Use the existing 6-case suite
+  with the standard model matrix.
+
+## Optimization target file
+
+**Edit:** `references/agent-browser/agent-browser-core.md`
+
+This is the vendored copy of upstream `skill-data/core/SKILL.md` — the
+**actual workflow content** that teaches the agent how to use
+agent-browser (navigate, snapshot, click @eN, type @eN, etc.). When the
+agent runs `agent-browser skills get core`, the fake CLI emits this
+file's contents.
+
+**Do NOT edit:**
+
+- `references/agent-browser/SKILL.md` — that's the discovery stub. Per
+  upstream `AGENTS.md`, it's intentionally thin and should not contain
+  workflow content.
+- `bin/agent-browser` (the fake CLI), `suite.yml`, `workspace/`, or any
+  file under `checks/` — those are the eval harness and must stay
+  fixed; modifying them invalidates the measurement.
+
+## Architecture intent (from prior research)
+
+- Upstream repo is `vercel-labs/agent-browser` (Rust CLI for
+  Chrome/Chromium automation via CDP, designed for AI agents).
+- The split is intentional: `skills/agent-browser/SKILL.md` is a thin
+  discovery stub; the real workflow content lives at
+  `skill-data/core/SKILL.md` and is loaded by the agent at runtime via
+  `agent-browser skills get core`. This keeps the SKILL.md token-cheap
+  and lets the workflow doc evolve with the CLI version.
+- License: Apache-2.0, no CLA observed.
+- Maintainer: `ctate` (sole, very active; same-day merges for clean
+  PRs).
+- Strict CI: Rust fmt + clippy + test + dashboard `pnpm build` +
+  version-sync. **Docs-only changes (changes confined to
+  `skill-data/core/SKILL.md` or its references) pass automatically** —
+  do not touch any Rust file or dashboard code.
+- Conventional commits required: `feat(scope):`, `fix(scope):`,
+  `docs(scope): description`. Scope is the subsystem (`docs`,
+  `doctor`, `native`, etc.).
+- Per upstream AGENTS.md: "Any skill improvement PR must touch
+  `skill-data/core/SKILL.md` and its `references/` files, plus
+  `README` and the docs MDX pages." This 4-file mirror is a packaging
+  concern at PR-draft time, not auto-pilot scope. Auto-pilot should
+  produce just the proposed change to `skill-data/core/SKILL.md`; the
+  PR-draft step manually mirrors the relevant additions to README and
+  MDX.
+
+## What the deeper eval tests (informs which additions are likely valuable)
+
+| Tier-0 (existing) | Tier-1 (new) |
+|---|---|
+| Tool-was-invoked-at-all | **Ref correctness** — agent must `click @eN` where `@eN` is the right element from the recorded snapshot |
+| `skills get core` was called first | **Snapshot-first discipline** — must `snapshot` before any `click`/`type` |
+| `navigate` (not `curl`/`wget`) | **No CSS selectors** — `click "#button"` fails; `click @e3` passes |
+| Snapshot/screenshot was called | **Content correctness** — `title.txt` must equal the actual title from the recording, not just non-empty |
+| Output file is non-empty | **State-machine path completeness** — multi-step flows: `type @e5 → type @e6 → click @e7 → re-snapshot → extract` |
+
+Likely failure modes (and where additive guidance helps):
+
+- Agents fall back to CSS selectors when an element name "looks
+  obvious" → recipe: explicit "NEVER use CSS selectors. Always use
+  `@eN` refs from the most recent snapshot." (Recipe D — BAD/GOOD
+  example showing wrong vs right.)
+- Agents skip `snapshot` when they "know" what's on the page → recipe:
+  "Always `snapshot` immediately after `navigate`, and again after any
+  `click`/`type` that changes state. The snapshot is your only source
+  of valid `@eN` refs."
+- Agents pick the wrong `@eN` when multiple visually-similar elements
+  exist → recipe: per-action checklist "Read the snapshot's role +
+  label fields before choosing a ref."
+- Agents extract content from the wrong recording field (kicker vs h1
+  vs byline) → recipe: explicit "When asked for the article title,
+  use the `<h1>` text, not the kicker or byline."
+
+## Hard constraints
+
+1. **Additive only.** No deletions, no rewording of existing core.md
+   content.
+2. **Style:** match the existing `agent-browser-core.md` voice — terse,
+   command-oriented bullet lists. Examples are encouraged (BAD/GOOD
+   blocks). No prose paragraphs.
+3. **Length budget:** the existing `agent-browser-core.md` is ~90 lines.
+   Additions of 20–40 lines are reasonable; >60 lines is suspect (means
+   you're rewriting, not augmenting).
+4. **Do not modify the fake CLI or the eval harness.** The one
+   exception is a genuine grader bug (e.g. a grader mismarks a correct
+   trace): fix the GRADER (per the prompt's grader-vs-skill check; the
+   retry is free and not counted against the iteration budget). Do NOT
+   change the skill content to satisfy a buggy grader.
+5. **Fake-CLI awareness.** The fake CLI is stateful — it tracks which
+   page the agent is on (`/work/.ab-state`) and which post-action
+   snapshot to serve next. Recordings define which `@eN` refs exist on
+   each page+state. Your skill changes should encourage agents to
+   actually USE the snapshot's refs, not invent them.
+
+## Packaging
+
+When Phase 5 packages the proposed change:
+
+- Name files `before-skill-data-core-SKILL.md` /
+  `after-skill-data-core-SKILL.md` (the upstream target file is at
+  `skill-data/core/SKILL.md`)
+- Put them under
+  `proposed-upstream-changes/vercel-labs-agent-browser/` (matches
+  prior pilot's directory layout)
+- Per upstream AGENTS.md, the human PR-draft step (separate from this
+  pilot) will also mirror the relevant additions into upstream
+  `README.md` and the docs MDX pages. Auto-pilot is NOT responsible
+  for those mirrors.
+
+## Risk profile
+
+- LOW for additive changes to `skill-data/core/SKILL.md` if the diff
+  is small and matches existing voice. ctate ships docs-only PRs
+  same-day.
+- MEDIUM if the diff is large or rewords existing content (slight
+  drift from "additive only" trips clippy-style review).
+- HIGH if any non-docs file is touched (Rust changes trigger expensive
+  CI; not in scope here).
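Two of the Tier-1 checks described in the agent-browser context file above (snapshot-first discipline and no CSS selectors) can be illustrated with a small trace checker. This is a sketch under assumptions, not the graders that live under `checks/`: `checkTrace` is a hypothetical name, the trace is assumed to be a list of plain command strings such as `click @e3`, and a snapshot is treated as stale only after a `navigate`.

```javascript
// Hypothetical trace checker; the real graders under checks/ may differ.
// Each command is assumed to look like "navigate <url>", "snapshot",
// "click <target>", or "type <target> <text>".
function checkTrace(commands) {
  const violations = [];
  let snapshotFresh = false; // false until a snapshot follows the latest navigate

  commands.forEach((line, step) => {
    const [verb, target] = line.trim().split(/\s+/);
    if (verb === 'navigate') {
      snapshotFresh = false;
    } else if (verb === 'snapshot') {
      snapshotFresh = true;
    } else if (verb === 'click' || verb === 'type') {
      // Snapshot-first discipline: refs are only valid after a snapshot.
      if (!snapshotFresh) violations.push({ step, rule: 'snapshot-first' });
      // No CSS selectors: the target must be an @eN ref.
      if (!/^@e\d+$/.test(target ?? '')) {
        violations.push({ step, rule: 'no-css-selectors' });
      }
    }
  });

  return violations;
}
```

A trace of `navigate`, `snapshot`, `click @e3` yields no violations, while `navigate` followed directly by `click "#button"` trips both rules at the same step.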
diff --git a/tools/auto-improve-contexts/vercel-web-interface-guidelines.md b/tools/auto-improve-contexts/vercel-web-interface-guidelines.md
new file mode 100644
index 0000000..6a264db
--- /dev/null
+++ b/tools/auto-improve-contexts/vercel-web-interface-guidelines.md
@@ -0,0 +1,93 @@
+# Upstream context: vercel-labs/web-interface-guidelines
+
+This pilot targets the **rules doc** consumed by the
+`vercel-labs/agent-skills/web-design-guidelines` skill, not the skill
+itself. The SKILL.md is a thin Claude-Code-specific adapter and is NOT
+the right optimization target for this pilot.
+
+## Optimization target file
+
+**Edit:** `references/web-design-guidelines/command.md`
+**Do NOT edit:** `references/web-design-guidelines/SKILL.md`
+
+The SKILL.md is essentially untouched in upstream history (the last
+substantive change was its initial commit). All meaningful improvements
+should land in `command.md`, which is the canonical Vercel design
+artifact distributed natively to 7 agent tools (Amp Code, Claude Code,
+Cursor, OpenCode, Windsurf, Antigravity, Gemini CLI) via `install.sh`,
+plus consumed by 10+ downstream repos via raw GitHub URL fetch.
+
+## Architecture intent (from upstream research)
+
+- `command.md` is the **canonical source of truth**. The skill is one
+  of many thin downstream adapters (others: 7 native tool installs +
+  the `vercel-labs/agent-skills` wrapper).
+- `command.md`, `AGENTS.md`, and `README.md` are three stylistic
+  reformulations of the same rule set, each distributed through a
+  different channel. The auto-pilot only needs to optimize `command.md`
+  here; the AGENTS.md / README.md mirrors are produced manually at
+  PR-draft time.
+- The skill always WebFetches `main` (no commit pinning), confirming
+  the rules doc is expected to evolve independently and downstream
+  consumers ride latest.
+
+## Hard constraints
+
+1. **Additive only.** Every merged PR in the last year is additive
+   (add rules, reword rules, fix links, add tool installers). Zero
+   restructure / reorganization PRs have been merged. **Do not delete,
+   reorder, or substantively reword existing rules.**
+2. **Do not consolidate or restructure.** Merging `command.md` content
+   into the SKILL.md, or splitting `command.md` into multiple files,
+   would break `install.sh` and sever the canonical URL that 10+
+   external repos and vercel.com/design link to. Risk of rejection:
+   HIGH.
+3. **Two-pass workflow goes in `command.md` only.** Meta-instructions
+   about "how to apply the rules" (e.g. Pass 1 = visible / Pass 2 =
+   absences) fit `command.md` because it's the file consumed at audit
+   time. They do NOT fit `AGENTS.md` (ambient project context, read at
+   every coding action, when the agent isn't "doing a review").
+   `AGENTS.md` is out of scope here, but the limit is worth knowing.
+4. **Rule additions / clarifications are in scope.** Per-element
+   checklists, BAD/GOOD examples, and explicit "missing X" rules are
+   the kinds of changes the merged-PR pattern welcomes (PR #23 is the
+   canonical precedent — adds `translate="no"` guideline as an additive
+   rule).
+5. **Maintain frontmatter.** `command.md` starts with YAML frontmatter
+   (`description:`, `argument-hint:`). Preserve it.
+6. **Style:** terse imperative bullets (e.g. `- Icon-only buttons need
+   \`aria-label\``). Match the existing voice. No prose, no rationale
+   in-line, no "MUST/SHOULD/NEVER" (that's the AGENTS.md voice).
+
+## Where headroom likely lies (prior from manual eval)
+
+The load-bearing prior from `auto-improve-skill-lessons.md` applies:
+absence-type rules ("a missing attribute", "a missing branch") are
+5-10x harder than presence-type rules ("a wrong token in code"). Prior
+manual eval on this skill showed the biggest uplift came from:
+
+- Per-element absence checklists (`<img>`, `<input>`, `<button>` —
+  walk each one, flag missing attributes)
+- BAD/GOOD code examples for anti-patterns where the bad pattern looks
+  idiomatic (`disabled={!form.valid}`, `onPaste={(e) => e.preventDefault()}`)
+- Explicit "missing X" rules where the rule is currently phrased only
+  as a presence check
+
+When seeding violations, lean toward absence-type — that's where the
+existing `command.md` likely has gaps and where additive rules will
+create measurable uplift.
+
+## Packaging
+
+When Phase 5 packages the proposed change:
+
+- Name files `before-command.md` / `after-command.md` (not
+  `before-SKILL.md`)
+- Put them under `proposed-upstream-changes/vercel-labs-web-interface-guidelines/`
+  (the rules-doc repo)
+- Do not produce a `before-SKILL.md` / `after-SKILL.md` for the
+  `vercel-labs-agent-skills` repo — we're not changing it
+- The PR-draft step (manual, separate from this pilot) will produce
+  `AGENTS.md` (MUST/SHOULD/NEVER style) and `README.md` (prose style)
+  reformulations of the same change set, following the PR #23
+  precedent of touching all 3 files in one PR
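The "additive only" constraint in the web-interface-guidelines context file above (no deleted, reordered, or reworded rules) can be approximated by checking that every line of `before-command.md` still appears, in order, in `after-command.md`. The sketch below assumes that line-level reading of the constraint; `isAdditiveOnly` is a hypothetical name and is not part of the pipeline.

```javascript
// Hypothetical check: returns true when every line of the "before" text
// survives unchanged and in the same order in the "after" text, i.e. the
// before lines form a subsequence of the after lines.
function isAdditiveOnly(beforeText, afterText) {
  const before = beforeText.split('\n');
  const after = afterText.split('\n');
  let j = 0;
  for (const line of before) {
    // Advance through "after" until this original line is found.
    while (j < after.length && after[j] !== line) j++;
    if (j === after.length) return false; // deleted, reworded, or reordered
    j++;
  }
  return true;
}
```

Because the YAML frontmatter lines are ordinary lines of the file, the same check also covers the "maintain frontmatter" constraint. It is stricter than a human reviewer would be: even a whitespace-only change to an existing rule counts as a rewording.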
diff --git a/tools/auto-improve-skill-lessons.md b/tools/auto-improve-skill-lessons.md
index 119995a..ae3a27f 100644
--- a/tools/auto-improve-skill-lessons.md
+++ b/tools/auto-improve-skill-lessons.md
@@ -357,4 +357,6 @@ something new. Format:
 + **auto-pilot supabase (2026-05-08):** "covering" / "does not cover" alternation
   pattern. Confirmed ±3 → ±8 line widening is needed by default.
 
++ **auto-pilot supabase v2 (2026-05-12):** Upstream constraints required adding a new reference file (`monitor-two-pass-review.md`) instead of editing SKILL.md. Baseline was already 1.00 (calibrated graders from prior run). Pattern: when a re-run starts from calibrated graders, the Phase 3 exit condition fires before Phase 4 — the "modification" step then serves purely as upstream PR packaging rather than eval improvement.
+
 (Future pilots: append your additions here.)
diff --git a/tools/auto-improve-skill-prompt.md b/tools/auto-improve-skill-prompt.md
index abc745d..d49e356 100644
--- a/tools/auto-improve-skill-prompt.md
+++ b/tools/auto-improve-skill-prompt.md
@@ -25,6 +25,21 @@ branch — proceed without it; the prompt is self-sufficient.
 
 ---
 
+## Constraints / Upstream context
+
+The operator has provided the following upstream context. **Treat each
+directive here as a hard constraint** — your Phase 4 modifications and
+Phase 5 packaging must respect every line. If a constraint conflicts
+with this prompt's defaults (e.g. names a different target file, or
+forbids a recipe), the constraint wins. If the block below is the
+"no upstream context" placeholder, proceed with this prompt's defaults.
+
+```text
+${CONTEXT_BLOCK}
+```
+
+---
+
 ## Phase 1 — Discover
 
 1. Fetch the upstream `SKILL.md` via WebFetch from
@@ -292,7 +307,15 @@ For each iteration `I` (1 then 2):
    modification once the grader is calibrated.
 
 2. **Modify** — write a *minimal additive* edit using the recipes
-   from `auto-improve-skill-lessons.md` § "Optimization patterns":
+   from `auto-improve-skill-lessons.md` § "Optimization patterns".
+
+   **Target file selection.** By default, edit
+   `references/${SKILL_ID}/SKILL.md`. If the Constraints section above
+   names a different target file (e.g. a fetched rules doc), edit
+   THAT file instead. Never modify both unless the constraints
+   explicitly authorize a multi-file change.
+
+   Recipes:
     - **Recipe A** (two-pass workflow) for code-reviewer skills with
       mixed presence/absence rules
     - **Recipe B** (verify-tool-installed nudge) for tool-use skills
@@ -346,6 +369,11 @@ If final status is `success`:
    NOT the local-path tweak from Phase 2 (revert that line). Diff vs
    upstream should be purely additive.
 
+   **If Phase 4 targeted a non-SKILL.md file** (per the Constraints
+   section), name the packaged files after the actual target instead
+   (e.g. `before-command.md` / `after-command.md`) and put them under
+   the rules-doc repo's directory rather than the skill repo's.
+
 3. Write `proposed-upstream-changes/README.md`. If
    `examples/workbench/web-design-guidelines/proposed-upstream-changes/README.md`
    is available, use it as a style reference; otherwise write a short
diff --git a/tools/auto-improve-skill.mjs b/tools/auto-improve-skill.mjs
index 93216a6..07dff5b 100644
--- a/tools/auto-improve-skill.mjs
+++ b/tools/auto-improve-skill.mjs
@@ -2,13 +2,17 @@
 // Auto-improve-skill wrapper.
 //
 // Operator usage (from this Claude Code session via Bash):
-//   node tools/auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--force] [--budget <usd>]
+//   node tools/auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--force] [--budget <usd>] [--context <path>]
 //
 // Flags:
 //   --force          overwrite an existing examples/workbench/<skill-id>/
 //   --budget <usd>   per-run claude -p budget cap (default: 10.00)
 //                    The prompt's Phase-4 cost guard stops the agent at $7
 //                    so the last $3 covers analysis.md + commit cleanup.
+//   --context <path> markdown file with upstream constraints (architecture
+//                    intent, target-file override, additive-only directives,
+//                    etc.) injected verbatim into the prompt's "Constraints"
+//                    section. Phase 4 must respect every constraint stated.
 //
 // Spawns `claude -p` with the templated prompt; the inner agent does the
 // 5-phase work (vendor → build suite → baseline → iterate → package) and
@@ -41,9 +45,27 @@ const BUDGET = parseBudgetFlag();
 const BUDGET_FLAG_IDX = args.indexOf('--budget');
 const BUDGET_VALUE_IDX = BUDGET_FLAG_IDX < 0 ? null : BUDGET_FLAG_IDX + 1;
 
-const slug = args.find((a, i) => !a.startsWith('--') && i !== BUDGET_VALUE_IDX);
+const CONTEXT_FLAG_IDX = args.indexOf('--context');
+const CONTEXT_VALUE_IDX = CONTEXT_FLAG_IDX < 0 ? null : CONTEXT_FLAG_IDX + 1;
+let CONTEXT_BLOCK = '';
+if (CONTEXT_FLAG_IDX >= 0) {
+  const ctxPath = args[CONTEXT_VALUE_IDX];
+  if (!ctxPath || ctxPath.startsWith('--')) {
+    console.error('--context requires a path argument');
+    process.exit(2);
+  }
+  const absCtx = resolve(REPO_ROOT, ctxPath);
+  if (!existsSync(absCtx)) {
+    console.error(`--context file not found: ${absCtx}`);
+    process.exit(2);
+  }
+  CONTEXT_BLOCK = readFileSync(absCtx, 'utf-8').trim();
+}
+
+const RESERVED_VALUE_IDXS = new Set([BUDGET_VALUE_IDX, CONTEXT_VALUE_IDX].filter((x) => x !== null));
+const slug = args.find((a, i) => !a.startsWith('--') && !RESERVED_VALUE_IDXS.has(i));
 if (!slug) {
-  console.error('usage: auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--force] [--budget <usd>]');
+  console.error('usage: auto-improve-skill.mjs <owner>/<repo>/<skill-id> [--force] [--budget <usd>] [--context <path>]');
   process.exit(2);
 }
 const parts = slug.split('/');
@@ -61,7 +83,13 @@ if (existsSync(caseDir) && !FORCE) {
 mkdirSync(caseDir, { recursive: true });
 
 const promptTemplate = readFileSync(PROMPT_PATH, 'utf-8');
-const prompt = promptTemplate.replace(/\$\{SLUG\}/g, slug).replace(/\$\{SKILL_ID\}/g, skillId);
+const contextRendered = CONTEXT_BLOCK
+  ? CONTEXT_BLOCK
+  : '_(no upstream context provided — proceed with default behavior)_';
+const prompt = promptTemplate
+  .replace(/\$\{SLUG\}/g, slug)
+  .replace(/\$\{SKILL_ID\}/g, skillId)
+  .replace(/\$\{CONTEXT_BLOCK\}/g, () => contextRendered); // fn replacer keeps `$&`, `$$` etc. in the context literal
 
 const logPath = join(caseDir, '.run.log');
 const logStream = createWriteStream(logPath, { flags: 'a' });