feat(auto-improve-orchestrator): v1.3 — operator-dispatched orchestrator skill#50
Open
Zhaiyuqing2003 wants to merge 43 commits into
The agent container runs non-root and writes results into a host-mounted results directory. On Linux the bind-mount inherits the host owner, so the container couldn't write `result.json`.
Fix:
- chmod 0777 on workDir + resultsDir before starting the container
- chmod -R a+rw on resultsDir after `docker cp` so cleanup can read it
Also gitignore .superpowers/: runtime state from the categorization pipeline (per-skill JSON cache, progress logs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
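The permission fix described in this commit could look roughly like the sketch below. It is not the code in `src/workbench/docker-runner.ts`: the function names `openForContainer` and `addReadWriteRecursive` are hypothetical, and the demo runs on a throwaway temp directory instead of a real bind mount.

```javascript
import { chmodSync, lstatSync, mkdirSync, mkdtempSync, readdirSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: before the container starts, let a non-root
// container user write into the directories that will be bind-mounted.
function openForContainer(dirs) {
  for (const dir of dirs) {
    mkdirSync(dir, { recursive: true });
    chmodSync(dir, 0o777);
  }
}

// Hypothetical helper: after results are copied out, add read+write for
// everyone on every entry, similar in spirit to `chmod -R a+rw`.
function addReadWriteRecursive(path) {
  const stats = lstatSync(path);
  if (stats.isSymbolicLink()) return; // do not follow links out of the tree
  chmodSync(path, (stats.mode & 0o7777) | 0o666);
  if (stats.isDirectory()) {
    for (const entry of readdirSync(path)) {
      addReadWriteRecursive(join(path, entry));
    }
  }
}

// Demo on a temp directory (assumed layout, not the real workbench paths).
const root = mkdtempSync(join(tmpdir(), 'workbench-'));
const workDir = join(root, 'work');
const resultsDir = join(root, 'results');
openForContainer([workDir, resultsDir]);
writeFileSync(join(resultsDir, 'result.json'), '{}', { mode: 0o600 });
addReadWriteRecursive(resultsDir);
```

In a real run, `chmodSync` can fail with `EPERM` on files owned by the container UID, so production cleanup would likely need to tolerate that error rather than abort.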
Reviewer flagged that the prompt referenced case-source files that don't exist on this branch (web-design-guidelines/checks/, find-skills/). Make the prompt self-sufficient:
- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default 3.50 (unchanged). Pilot #1 (agent-browser) hit the original 3.50 cap mid-iteration before reaching the "Always: commit" step, losing the run record. With --budget 15 the same pilot completed cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.
Operator usage:
node tools/auto-improve-skill.mjs <slug> --budget 15
Three changes informed by the 3-skill pilot batch (PR #47):
1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: a living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.
Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).
Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
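One way to parse the slug and `--budget` without the argv[0] regression is to walk the arguments once and consume the flag's value explicitly. This is a sketch only: `parseArgs` and `DEFAULT_BUDGET_USD` are hypothetical names, and the wrapper's real slug validation (the "bad-slug" check) is not reproduced here.

```javascript
// Assumed default, taken from the 10.00 figure in the commit message.
const DEFAULT_BUDGET_USD = 10.0;

// Hypothetical parser: positional args are collected in order, and the
// value after --budget is consumed together with the flag, so a missing
// flag can never cause the first positional (the slug) to be skipped.
function parseArgs(argv) {
  const positional = [];
  let budget = DEFAULT_BUDGET_USD;
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--budget') {
      const raw = argv[++i];
      const value = Number(raw);
      if (raw === undefined || !Number.isFinite(value) || value <= 0) {
        throw new Error(`--budget expects a positive number, got: ${raw}`);
      }
      budget = value;
    } else {
      positional.push(argv[i]);
    }
  }
  return { slug: positional[0], budget };
}

// Usage: in a real script this would be parseArgs(process.argv.slice(2)).
const withoutFlag = parseArgs(['agent-browser']);
const withFlag = parseArgs(['agent-browser', '--budget', '15']);
```

Consuming the flag value inside the loop, instead of filtering by index afterwards, keeps the slug lookup independent of whether the flag is present.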
Adds three grader-helper utilities to the inlined `_grader-utils.mjs`
content the auto-pilot writes to each new case in Phase 2:
- looseRange(N, tolerance=8) — centered range with default ±8 line
tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs
the LLM line-counting drift seen across all 4 prior pilots.
- fuzzyKeyword(phrase) — hyphen-and-space-tolerant regex builder.
fuzzyKeyword('empty state') matches "empty state", "empty-state",
"emptystate". Replaces hand-rolling `/empty[-\s]+state/`.
- tolerantKeyword(stem) — word-stem prefix matcher. tolerantKeyword('cover')
matches "covering", "covered", "does not cover" but NOT "discovery"
(word boundary). Replaces alternation regexes for common phrasing
variants.
Also updates lessons.md G1 / G2 / G4 to reference the helpers in their
recipes, so the auto-pilot's Phase-4 reading naturally guides it to use
them rather than rediscovering by hand.
Verified end-to-end: extracted the inlined block from the prompt, ran
each helper, confirmed expected behavior on the canonical patterns from
prior pilots.
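A minimal sketch of the three helpers, written from the behavior described above and not copied from the inlined `_grader-utils.mjs`. The return type of `looseRange` (an array of line numbers, clamped at line 1) and the `escapeRegex` helper are assumptions.

```javascript
// Line numbers within `tolerance` of the expected line n.
// Assumption: returns an array and never goes below line 1.
function looseRange(n, tolerance = 8) {
  const start = Math.max(1, n - tolerance);
  const lines = [];
  for (let line = start; line <= n + tolerance; line++) lines.push(line);
  return lines;
}

// Escape regex metacharacters so phrases are matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Words of the phrase may be joined by spaces, hyphens, or nothing.
function fuzzyKeyword(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegex);
  return new RegExp(words.join('[-\\s]*'), 'i');
}

// The stem must start at a word boundary; any word characters may follow.
function tolerantKeyword(stem) {
  return new RegExp(`\\b${escapeRegex(stem)}\\w*`, 'i');
}

// Usage inside a grader: accept a finding reported near line 42 that
// mentions an empty state in any spelling.
const finding = { line: 45, text: 'Missing empty-state handling' };
const accepted =
  looseRange(42).includes(finding.line) &&
  fuzzyKeyword('empty state').test(finding.text);
```

The `[-\s]*` joiner (zero or more separators) is what lets "emptystate" match, which the hand-rolled `/empty[-\s]+state/` would have rejected.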
Moves auto-improve-skill pilot summaries from gitignored docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team can review them in-tree. Includes:
- docs/pilot-runs/README.md: directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md: batch 1 (3 skills, 3/3 success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md: batch 2 (10 skills, 8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd)
Per-skill eval artifacts and proposed-upstream-changes live on eval/auto-pilot/<skill-id> branches and the consolidated batch branches (eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).
Operational guide for submitting skill-improvement PRs to the four repos we're currently working with (vercel-labs/agent-skills, vercel-labs/web-interface-guidelines, vercel-labs/agent-browser, supabase/agent-skills). Per repo: title format, body convention, CI gates, CLA status, merge style, scope guidance, and any gotchas discovered by reading AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged PRs. Future batches: append new repos as their conventions become known.
Polished PR drafts ready for operator review + submission to upstream. Each draft contains:
- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create
The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):
1. vercel-labs/agent-skills: web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines: per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser: Pre-flight section (retargeted to skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills: two-pass review reference (reformulated as a new references/ file per CONTRIBUTING.md, not a SKILL.md edit)
Sources:
- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to fit repo convention)
Adds a `--context <path>` flag to the auto-pilot wrapper that reads a markdown file and injects it into the prompt as a "Constraints" section Phase 4 must respect. Enables steering pilots toward upstream-specific targets (e.g. fetched rules docs instead of skill SKILL.md) and encoding architecture intent (additive-only, no restructure, etc.) as hard constraints.
Phase 4 + Phase 5 updated to honor target-file overrides from the constraints (e.g. edit `command.md` instead of `SKILL.md` when the context says so; package files as `before-/after-command.md` under the correct upstream-repo directory).
Includes the first context file: `tools/auto-improve-contexts/vercel-web-interface-guidelines.md`, encoding the vercel research findings: `command.md` is the canonical source distributed to 7 tools + 10 downstream consumers, restructure risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent), and the AGENTS.md / README.md mirrors happen at PR-draft time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
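The injection step could be as simple as appending a section to the prompt text. The sketch below is an assumption about shape only: `injectConstraints`, the `## Constraints` heading level, and the one-line preamble are hypothetical, and the file read that `--context <path>` implies is left to the caller.

```javascript
// Hypothetical helper: append the context file's markdown to the prompt
// as a "Constraints" section. With no context, the prompt is unchanged,
// which keeps runs without --context behaving as before.
function injectConstraints(prompt, contextMarkdown) {
  const constraints = (contextMarkdown ?? '').trim();
  if (constraints === '') return prompt;
  return [
    prompt.trimEnd(),
    '',
    '## Constraints',
    '',
    'Phase 4 must respect every constraint below.',
    '',
    constraints,
    '',
  ].join('\n');
}

// Usage with an illustrative constraint, not the real context file content.
const basePrompt = 'Phase 4: propose a skill modification.\n';
const steered = injectConstraints(basePrompt, '- Additive-only: do not restructure command.md\n');
```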
Encodes upstream conventions discovered via gh-CLI research:
- All 28 existing references in this skill are single-rule SQL anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention" pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`, `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a `review-` prefix would require modifying `_sections.md` which is not additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that two-pass review catches and frame around it (Incorrect = single-pass miss, Correct = two-pass catch). If reshape feels contrived, surface needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`, `impactDescription`, `tags`); previous research missed `impactDescription`. `tags` is comma-separated string, not YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note); convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes merge conflicts with bot's release PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-browser
Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture) as the starting point for deeper Tier-1 work.
- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness, multi-step-state) that grade snapshot-driven @en ref discipline, ambiguous-element resolution, content correctness, and full state-machine traversal, none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page, per-page transitions.txt drives state changes, snapshot emits the recorded accessibility-tree fixture for current (page, state). Falls back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup, blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against hand-crafted ab-calls.log + output-file fixtures; all pass without Docker or models.
…er-1 pilot
Encodes constraints for the auto-pilot to run against the hand-built Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation, output-correctness, multi-step-state) without rebuilding the workbench.
Key directives:
- Workbench is already built: skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md (the workflow content), NOT references/agent-browser/SKILL.md (the discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean docs-only PRs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`) is dropped: it's a thin Claude-Code-specific adapter that WebFetches the rules doc, and editing it is low-leverage. All value lives in `vercel-labs/web-interface-guidelines/command.md` and its two stylistic siblings (`AGENTS.md`, `README.md`).
The consolidated draft at #1 carries:
- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00, 18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not independently measured)
- A prose mirror for `README.md` (style-faithful, not independently measured)
- A qualitative pitch as the headline + eval data as supporting evidence (matches PR #23 precedent in this repo, which has zero quantitative evidence in any merged PR)
Old drafts moved to `superseded/` with a README explaining why each was retired. Repo PR-drafts README updated to reflect the new canonical numbering.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the two structural lessons from the v1.2.1 pilot session:
1. Research-first context is mandatory (Phase 0): the auto-pilot is good at finding what to change, bad at fitting upstream conventions. Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval baselines because it only iterates the skill, treating the eval as fixed.
Backwards compatible: v1.2.1's --context flag continues to work; v1.3 phases are opt-in via --research and --auto-eval flags until validated.
Note: this commit lands on the supabase--v1-shallow branch because the agent-browser pilot is concurrently active on the main worktree; branch hygiene (move to docs/auto-pilot-runs) deferred until pilots finish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
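The ceiling and floor thresholds named in lesson 2 suggest a three-way routing decision on the baseline score. The sketch below only illustrates those two cut-offs; `classifyBaseline` and the label strings are hypothetical, and the real routing logic in the orchestrator may differ.

```javascript
// Thresholds taken from the commit message above.
const CEILING = 0.95; // at or above: the eval is too easy to show uplift
const FLOOR = 0.5;    // below: the eval is too hard or the graders are off

// Hypothetical classifier: decide whether to iterate on the eval
// (ceiling or floor) or on the skill (anything in between).
function classifyBaseline(score) {
  if (score >= CEILING) return 'ceiling';
  if (score < FLOOR) return 'floor';
  return 'in-range';
}

// Usage with baselines quoted elsewhere in this PR.
const supabaseV2 = classifyBaseline(0.97);
const shadcnUi = classifyBaseline(0.82);
```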
The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap mid-baseline (50/54 trials complete; no Phase 5 commit). However, the supabase v2 pilot's Phase 4 instruction to append a run-record entry to lessons.md DID complete and wrote a useful observation about the 'calibrated graders cause baseline ceiling' pattern. Salvaging that entry here even though the parent agent-browser pilot didn't finalize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval pilot was attempted but timed out at the wrapper's 90-min hard cap mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used curl). Partial baseline data preserved at .results/20260512-101220/ for future analysis.
#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result. The auto-pilot reshaped the proposal exactly per the upstream context file (filename monitor-two-pass-review.md, monitor- prefix, 4-field frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the file is convention-perfect. Honest framing: per-case breakdown shows update-without-where at 77.8% (the targeted failure pattern) but overall 0.97 baseline meant no iteration; auto-pilot's exit logic uses overall average rather than per-case minimum (v1.3 will fix).
README index updated with evidence-strength column.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Approved design (brainstormed 2026-05-12) for converting the auto-improve-skill workflow from a wrapper-spawned claude -p pilot into a Claude Code skill (auto-improve-orchestrator) that an operator's CC session invokes via the Agent tool.
Key architectural shift:
- skill-optimizer stays lean (run-suite, run-case, graders, Docker)
- Orchestration moves OUT into a new skill at skills/auto-improve-orchestrator/ that ships subagent prompt templates the operator's CC session dispatches via the Agent tool
- Each orchestrator subagent owns one skill end-to-end, runs in its own worktree, parallelizable across N skills
Addresses 4 motivators from v1.2.1 work:
1. Research-first context is mandatory (becomes Phase 0 sub-subagent)
2. Two-loop iteration on eval AND skill (Phase 3.5 sub-subagent)
3. Per-case-minimum threshold (computed by orchestrator from suite-result.json)
4. Resume-on-timeout (every phase resume-aware via on-disk artifacts)
Predecessor design draft: docs/auto-improve-skill-v1.3-design.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
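Motivator 3 is easiest to see with numbers: one weak case can hide behind a high average. The sketch below assumes a `suite-result.json` shape of `{ cases: [{ id, passRate }] }`, which is hypothetical, and the pass rates are illustrative values, not measured results.

```javascript
// Hypothetical summary of a suite result: overall average, the lowest
// per-case pass rate, and which case produced it.
function summarizeSuite(suiteResult) {
  const cases = suiteResult.cases;
  if (!Array.isArray(cases) || cases.length === 0) {
    throw new Error('suite result has no cases');
  }
  let weakest = cases[0];
  let total = 0;
  for (const c of cases) {
    total += c.passRate;
    if (c.passRate < weakest.passRate) weakest = c;
  }
  return {
    overallAverage: total / cases.length,
    perCaseMin: weakest.passRate,
    weakestCase: weakest.id,
  };
}

// Illustrative data: four saturated cases and one weak one.
const example = {
  cases: [
    { id: 'schema', passRate: 1.0 },
    { id: 'rls', passRate: 1.0 },
    { id: 'multi-table-rls', passRate: 1.0 },
    { id: 'fk-index-audit', passRate: 1.0 },
    { id: 'update-without-where', passRate: 0.778 },
  ],
};
const summary = summarizeSuite(example);

// An average-based exit would stop here; a per-case-minimum exit would not.
const stopsOnAverage = summary.overallAverage >= 0.95;
const stopsOnPerCaseMin = summary.perCaseMin >= 0.95;
```

With these inputs the average clears the 0.95 ceiling while the weakest case does not, which is the situation the per-case-minimum threshold is meant to catch.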
12 bite-sized tasks: file moves/deletes (1-4), new skill structure (5-10), smoke validation (11), end-to-end test on supabase (12). Each task is one focused commit. All prompt-template content is inlined in the plan (no placeholders). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper-spawned claude -p autonomous pilot is replaced by the new auto-improve-orchestrator Claude Code skill (built in subsequent commits). Operator now dispatches the orchestrator subagent via the Agent tool instead of running a Node wrapper.
The smoke check (36/36 structural validations) confirms the v1.3 implementation is complete and well-formed. End-to-end orchestrator dispatch is documented as a focused next-session task with the exact Agent dispatch payload, expected behavior per phase, and acceptance criteria. Workbench imported from eval/auto-pilot/supabase-postgres-best-practices-v2 in the prior commit so the e2e dispatch can resume cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename context files to match the orchestrator's ${OWNER}-${SKILL_ID}.md
formula so cached contexts are found on existing-skill runs (was
breaking acceptance criterion #3 — orchestrator was always dispatching
Phase 0 research even when cached context existed).
- Update smoke check to expect renamed supabase context.
- workflow.md: complete the sub-subagent input lists for Phase 3.5 and
Phase 4 dispatch lines (previously missing WORKBENCH_DIR / LESSONS_PATH /
CONTEXT_FILE).
- orchestrator.md hard rules: clarify that context files are written by
the research sub-subagent, not the orchestrator (was ambiguous).
- SKILL.md: document the optional REFRESH_CONTEXT template var.
- docs/pilot-runs/README.md: add deprecation note pointing at v1.3.
NOT FIXED (deliberate): Phase 4 cost tracking is a comment stub vs Phase 3's
full code block. The asymmetry is intentional — Phase 4 runs at most twice,
and a competent orchestrator can infer the pattern from Phase 3 without
duplicating the 10-line block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
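The naming formula and the cache lookup it enables can be sketched as follows. `contextFileName` and `needsResearch` are hypothetical names, the list of cached files is passed in instead of read from disk, and treating `REFRESH_CONTEXT` as "force Phase 0 even when a cached file exists" is an assumption about that template variable.

```javascript
// The formula from the commit message: ${OWNER}-${SKILL_ID}.md
function contextFileName(owner, skillId) {
  return `${owner}-${skillId}.md`;
}

// Hypothetical decision: dispatch Phase 0 research only when no cached
// context matches the formula, or when a refresh is explicitly requested.
function needsResearch(owner, skillId, cachedFiles, refreshContext = false) {
  if (refreshContext) return true;
  return !cachedFiles.includes(contextFileName(owner, skillId));
}

// Usage with file names that appear under references/contexts/ in this PR.
const cached = [
  'vercel-labs-web-design-guidelines.md',
  'google-labs-code-shadcn-ui.md',
  'firebase-firebase-hosting-basics.md',
];
const shadcnNeedsResearch = needsResearch('google-labs-code', 'shadcn-ui', cached);
```

A file named by any other scheme is invisible to this lookup, which matches the symptom described above: research was dispatched on every run until the files were renamed.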
Brings in the batch-2 shadcn-ui eval workbench (baseline 0.82, batch-2 showed +0.07 uplift) so the v1.3 orchestrator can be dispatched on a fresh skill (no cached context, will trigger Phase 0 research). Pairs with the agent-browser dispatch (cached context, will skip Phase 0). The two together exercise both paths through Phase 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shadcn-ui workbench was cherry-picked from batch-2 which used the older gpt-4o-mini. Updating to gpt-5 to match the canonical v1.2.1+ frontier-model matrix (sonnet-4.6, gpt-5, gemini-2.5-pro) used by web-design-guidelines and agent-browser. The just-completed v1.3 orchestrator dispatch saw gpt-4o-mini dominate the per-case-min floor; re-firing with gpt-5 will produce a result that's apples-to-apples with the other PR candidates. Follow-up issue: v1.3 orchestrator should validate the model matrix against a canonical set before running, OR the operator's pre-flight should. Current behavior runs whatever is on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
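The follow-up validation could be a small pre-flight check like the sketch below. The exact model identifier strings and the name `checkModelMatrix` are assumptions; the real suite.yml may spell the models differently.

```javascript
// Assumed identifiers for the canonical frontier-model matrix.
const CANONICAL_MODELS = ['claude-sonnet-4.6', 'gpt-5', 'gemini-2.5-pro'];

// Hypothetical pre-flight check: report models that are on disk but not
// canonical, and canonical models that the suite is missing.
function checkModelMatrix(models) {
  const unexpected = models.filter((m) => !CANONICAL_MODELS.includes(m));
  const missing = CANONICAL_MODELS.filter((m) => !models.includes(m));
  return { ok: unexpected.length === 0 && missing.length === 0, unexpected, missing };
}

// Usage: the cherry-picked batch-2 matrix would be flagged before any run.
const batch2Check = checkModelMatrix(['claude-sonnet-4.6', 'gpt-4o-mini', 'gemini-2.5-pro']);
```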
…shadcn-ui (re-fire)
…k with first-line path comment instruction and StatusBadge BAD/GOOD example
…patch
Brings firebase-hosting-basics and firecrawl-build-scrape workbenches from their batch-2 eval branches, with the model matrix updated to the v1.2.1+ canonical frontier set: claude-sonnet-4.6, gpt-5, gemini-2.5-pro.
Both eval the code-reviewer pattern (read a file, find seeded violations, write findings.txt). Lightweight: no Python venv or real API calls.
Both will exercise the v1.3 Phase 3.5 eval-iteration loop if frontier models hit ceiling (likely; both batch-2 baselines were 0.84-0.89 with old gpt-4o-mini matrix). This is the v1.3 architectural feature we haven't validated yet: neither shadcn-ui nor agent-browser dispatched eval-iterate (both landed in (0.50, 0.95) directly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b469dae to 96dbe31
Pull request overview
This PR implements the v1.3 “auto-improve” workflow as an operator-dispatched Claude Code skill (skills/auto-improve-orchestrator/), replacing the prior wrapper approach, and adds multiple new/updated workbenches + graders used to validate (and package) upstream skill improvements. It also hardens Docker workbench cleanup/permissions behavior for runs that write as a non-host UID.
Changes:
- Add the `auto-improve-orchestrator` skill (SKILL.md + prompts + references/contexts) plus a node-based smoke-check script.
- Add/extend several eval workbenches (Supabase Postgres, shadcn-ui, Firebase Hosting, Firecrawl scrape, agent-browser) including suite.yml, graders, fixtures/workspaces, analyses, and proposed-upstream-changes artifacts.
- Adjust Docker runner permissions/cleanup to reduce failures when containers write files the host user can’t delete; add `gray-matter` for frontmatter parsing.
Reviewed changes
Copilot reviewed 149 out of 152 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/workbench/docker-runner.ts | Adjust permissions and make cleanup resilient to host/container UID mismatches |
| skills/auto-improve-orchestrator/SKILL.md | New orchestrator skill entrypoint and operator invocation instructions |
| skills/auto-improve-orchestrator/.smoke-check.mjs | Smoke-check script validating skill structure and template vars |
| skills/auto-improve-orchestrator/references/contexts/vercel-labs-web-design-guidelines.md | Cached upstream research context for a target |
| skills/auto-improve-orchestrator/references/contexts/google-labs-code-shadcn-ui.md | Cached upstream research context for shadcn-ui |
| skills/auto-improve-orchestrator/references/contexts/firebase-firebase-hosting-basics.md | Cached upstream research context for firebase-hosting-basics |
| package.json | Add gray-matter (used by smoke-check) |
| examples/workbench/supabase-postgres-best-practices/workspace/schema.sql | New seeded SQL workspace for deterministic grading |
| examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql | New seeded SQL workspace for RLS-focused grading |
| examples/workbench/supabase-postgres-best-practices/workspace/multi_table_schema.sql | New multi-table RLS enumeration workspace |
| examples/workbench/supabase-postgres-best-practices/workspace/migrations.sql | New FK/index enumeration workspace |
| examples/workbench/supabase-postgres-best-practices/workspace/data_migration.sql | New UPDATE-without-WHERE workspace |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md | Vendored skill snapshot for deterministic eval |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-two-pass-review.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md | New vendored reference rule file |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md | New vendored reference template |
| examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md | New vendored section definitions |
| examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs | Shared grader utilities for this workbench |
| examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs | Grader for schema.sql case |
| examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs | Grader for rls_policies.sql case |
| examples/workbench/supabase-postgres-best-practices/checks/grade-multi-table-rls-findings.mjs | Grader for multi-table RLS case |
| examples/workbench/supabase-postgres-best-practices/checks/grade-fk-index-audit-findings.mjs | Grader for FK/index enumeration case |
| examples/workbench/supabase-postgres-best-practices/checks/grade-update-without-where-findings.mjs | Grader for UPDATE-without-WHERE case |
| examples/workbench/supabase-postgres-best-practices/README.md | Workbench documentation and run instructions |
| examples/workbench/supabase-postgres-best-practices/analysis.md | Run analysis summary |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md | Packaged upstream-change notes |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md | Packaged before snapshot |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md | Packaged after snapshot |
| examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/monitor-two-pass-review.md | Packaged new upstream reference file |
| examples/workbench/shadcn-ui/workspace/UserCard.tsx | Seeded code-review fixture |
| examples/workbench/shadcn-ui/workspace/StatusBadge.tsx | Seeded code-review fixture |
| examples/workbench/shadcn-ui/checks/_grader-utils.mjs | Shared grader utilities |
| examples/workbench/shadcn-ui/checks/grade-usercard-findings.mjs | UserCard grader |
| examples/workbench/shadcn-ui/checks/grade-statusbadge-findings.mjs | StatusBadge grader |
| examples/workbench/shadcn-ui/suite.yml | New shadcn-ui eval suite |
| examples/workbench/shadcn-ui/README.md | Workbench documentation and run instructions |
| examples/workbench/shadcn-ui/analysis.md | Run analysis summary |
| examples/workbench/shadcn-ui/proposed-upstream-changes/README.md | Packaged upstream-change notes |
| examples/workbench/firecrawl-build-scrape/workspace/ScrapeService.ts | Seeded code-review fixture |
| examples/workbench/firecrawl-build-scrape/checks/_grader-utils.mjs | Shared grader utilities |
| examples/workbench/firecrawl-build-scrape/checks/grade-scrape-service-findings.mjs | ScrapeService grader |
| examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/node-docs.md | Vendored Firecrawl Node docs snapshot |
| examples/workbench/firecrawl-build-scrape/suite.yml | New Firecrawl eval suite |
| examples/workbench/firecrawl-build-scrape/README.md | Workbench documentation and run instructions |
| examples/workbench/firecrawl-build-scrape/analysis.md | Run analysis summary |
| examples/workbench/firebase-hosting-basics/workspace/firebase-app/firebase.json | Seeded config fixture |
| examples/workbench/firebase-hosting-basics/checks/_grader-utils.mjs | Shared grader utilities |
| examples/workbench/firebase-hosting-basics/checks/grade-firebase-config-findings.mjs | firebase.json grader |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/SKILL.md | Vendored skill snapshot |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/configuration.md | Vendored reference doc |
| examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/deploying.md | Vendored reference doc |
| examples/workbench/firebase-hosting-basics/suite.yml | New Firebase Hosting eval suite |
| examples/workbench/firebase-hosting-basics/README.md | Workbench documentation and run instructions |
| examples/workbench/firebase-hosting-basics/analysis.md | Run analysis summary |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/README.md | Packaged upstream-change notes |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/before-SKILL.md | Packaged before snapshot |
| examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/after-SKILL.md | Packaged after snapshot |
| examples/workbench/agent-browser/suite.yml | Expanded agent-browser suite |
| examples/workbench/agent-browser/references/agent-browser/SKILL.md | Vendored skill stub used by workbench |
| examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md | Vendored core workflow reference |
| examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt | Recorded snapshot transition data |
| examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt | Recorded snapshot transition data |
| examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt | Recorded snapshot transition data |
| examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out | Recorded snapshot |
| examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt | Recorded snapshot transition data |
| examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out | Recorded snapshot |
| examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs | Behavioral grader updates/additions |
| examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs | Behavioral grader updates/additions |
| examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs | Behavioral grader updates/additions |
| examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs | Behavioral grader updates/additions |
| examples/workbench/agent-browser/checks/_grader-utils.mjs | Shared grader utilities |
| examples/workbench/agent-browser/proposed-upstream-changes/README.md | Packaged upstream-change notes |
| examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md | Packaged before snapshot |
| examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md | Packaged after snapshot |
| examples/workbench/agent-browser/analysis.md | Run analysis summary |
| docs/pilot-runs/README.md | Note about v1.3 orchestrator vs removed wrapper |
| docs/pilot-runs/upstream-pr-drafts/README.md | Update upstream draft index/process |
| docs/pilot-runs/upstream-pr-drafts/superseded/README.md | Archive/superseded draft notes |
| docs/auto-improve-skill-v1.3-validation.md | v1.3 validation notes (needs update for current PR claims) |
| CLAUDE.md | Add pointer to orchestrator skill |
| .gitignore | Ignore new generated artifacts/logs |
Comment on lines +44 to +50

    ## Models

    The suite runs a 3-provider mid-tier matrix:

    - `openrouter/anthropic/claude-sonnet-4-6`
    - `openrouter/openai/gpt-4o-mini`
    - `openrouter/google/gemini-2.5-pro`
Comment on lines +35 to +41

    ## Models

    The suite runs a 3-provider mid-tier matrix:

    - `openrouter/anthropic/claude-sonnet-4-6`
    - `openrouter/openai/gpt-4o-mini`
    - `openrouter/google/gemini-2.5-pro`
Comment on lines +35 to +41

    ## Models

    The suite runs a 3-provider mid-tier matrix:

    - `openrouter/anthropic/claude-sonnet-4-5`
    - `openrouter/openai/gpt-4o-mini`
    - `openrouter/google/gemini-2.5-flash`
Comment on lines +1 to +6

    # v1.3 validation — deferred to operator

    **Date:** 2026-05-12
    **Status:** implementation complete, end-to-end orchestrator dispatch
    deferred to operator's next session.
Comment on lines +46 to +51

    for (const [file, vars] of Object.entries(expectedVars)) {
      const content = readFileSync(`${skillRoot}/prompts/${file}`, 'utf-8');
      for (const v of vars) {
        check(content.includes(`\${${v}}`), `prompts/${file} contains \${${v}}`);
      }
    }
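The loop above checks that each prompt template still contains the literal `${NAME}` placeholders it is expected to carry. A minimal standalone sketch of the same idea follows, assuming in-memory templates instead of files on disk. The template names, the placeholder names, and the `findMissingPlaceholders` helper are all hypothetical and are not taken from the PR.

```javascript
// Hypothetical stand-ins: the real script reads templates from disk and uses
// its own `check` helper, neither of which is shown in the excerpt above.
const templates = {
  'orchestrator.md': 'Improve ${SKILL_SLUG} inside ${WORKTREE}.',
  'research.md': 'Research ${SKILL_SLUG}.',
};
const expectedVars = {
  'orchestrator.md': ['SKILL_SLUG', 'WORKTREE'],
  'research.md': ['SKILL_SLUG', 'BUDGET'],
};

// Returns one "file: ${NAME}" entry per expected placeholder that is absent.
function findMissingPlaceholders(templates, expectedVars) {
  const missing = [];
  for (const [file, vars] of Object.entries(expectedVars)) {
    const content = templates[file] ?? '';
    for (const v of vars) {
      // The backslash keeps "${" literal, so this searches for the text "${NAME}".
      if (!content.includes(`\${${v}}`)) missing.push(`${file}: \${${v}}`);
    }
  }
  return missing;
}

const missing = findMissingPlaceholders(templates, expectedVars);
console.log(missing);
```

Collecting the missing entries in an array, rather than asserting one at a time, is just one possible design. It lets a caller report every gap in a single pass.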
Comment on lines 410 to +414

    mkdirSync(referencesDir, { recursive: true });
    mkdirSync(workDir, { recursive: true });
    mkdirSync(resultsDir, { recursive: true });
    chmodSync(workDir, 0o777);
    chmodSync(resultsDir, 0o777);
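The commit message explains the `chmodSync` calls: the agent container runs non-root and writes into a host-mounted results directory, so the directory must be writable by any user before the container starts. The sketch below illustrates that step in isolation, assuming a throwaway directory under the system temp dir. The `ensureWorldWritable` helper and the `results-demo-` prefix are hypothetical. It uses CommonJS `require` so it runs as a standalone script; in an `.mjs` file the same functions would be imported instead.

```javascript
const { mkdirSync, chmodSync, statSync, mkdtempSync, rmSync } = require('node:fs');
const { tmpdir } = require('node:os');
const { join } = require('node:path');

// Hypothetical helper: create a directory and make it world-writable.
// Returns the resulting permission bits so the caller can verify them.
function ensureWorldWritable(dir) {
  mkdirSync(dir, { recursive: true });
  // The mode passed to mkdirSync is filtered by the process umask,
  // so an explicit chmod is needed to guarantee 0777.
  chmodSync(dir, 0o777);
  return statSync(dir).mode & 0o777;
}

// Demo against a temporary directory, cleaned up afterwards.
const root = mkdtempSync(join(tmpdir(), 'results-demo-'));
const resultsMode = ensureWorldWritable(join(root, 'results'));
console.log(resultsMode.toString(8));
rmSync(root, { recursive: true, force: true });
```

On POSIX systems the reported mode should be `777`. On Windows `chmod` only toggles the read-only bit, so the value may differ there.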
Summary
Implements v1.3 of the auto-improve-skill pipeline as documented in `docs/auto-improve-skill-v1.3-spec.md` (the brainstormed + approved spec) and `docs/auto-improve-skill-v1.3-plan.md` (the implementation plan).

Architectural shift: the v1.2.1 wrapper-spawned `claude -p` autonomous pilot is REPLACED by a Claude Code skill (`skills/auto-improve-orchestrator/`) that an operator's CC session invokes via the Agent tool. Each orchestrator subagent owns one skill end-to-end, runs in its own worktree, and dispatches sub-subagents for research, eval-iteration, and skill-iteration tasks. Multiple orchestrators can run in parallel for batch operation.

Validated end-to-end on real skills
In this PR, three v1.3 orchestrator dispatches produced measured uplift:
- `vercel-labs/agent-browser/agent-browser`
- `google-labs-code/stitch-skills/shadcn-ui` (gpt-5 re-fire)
- `firebase-hosting-basics`, `firecrawl-build-scrape`

The shadcn-ui result is drafted as PR #5 in `docs/pilot-runs/upstream-pr-drafts/` for upstream submission.

What's in this PR
New skill: `skills/auto-improve-orchestrator/`, containing SKILL.md + 4 prompt templates (orchestrator + 3 sub-subagents) + `references/` (workflow.md, lessons.md moved from tools/, contexts/ moved from tools/).
Removed: `tools/auto-improve-skill.mjs` and `tools/auto-improve-skill-prompt.md` (the v1.2.1 wrapper).
Docs:
- `docs/auto-improve-skill-v1.3-design.md`: predecessor design draft
- `docs/auto-improve-skill-v1.3-spec.md`: approved spec (after brainstorming)
- `docs/auto-improve-skill-v1.3-plan.md`: implementation plan (12 tasks, all completed)
- `docs/auto-improve-skill-v1.3-validation.md`: validation note (e2e was operator-deferred, then exercised via the dispatches above)
- `docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md`: first v1.3-evidenced PR draft

Test plan
- `skills/auto-improve-orchestrator/.smoke-check.mjs` passes (36/36 structural validations)
- `uplift-too-small` (honest)
- `777b245`)

🤖 Generated with Claude Code