feat(auto-improve-orchestrator): v1.3 — operator-dispatched orchestrator skill by Zhaiyuqing2003 · Pull Request #50 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-12T16:21:03Z

Summary

Implements v1.3 of the auto-improve-skill pipeline as documented in docs/auto-improve-skill-v1.3-spec.md (the brainstormed + approved spec) and docs/auto-improve-skill-v1.3-plan.md (the implementation plan).

Architectural shift: the v1.2.1 wrapper-spawned claude -p autonomous pilot is REPLACED by a Claude Code skill (skills/auto-improve-orchestrator/) that an operator's CC session invokes via the Agent tool. Each orchestrator subagent owns one skill end-to-end, runs in its own worktree, dispatches sub-subagents for research / eval-iteration / skill-iteration tasks. Multiple orchestrators can run in parallel for batch operation.

Validated end-to-end on real skills

In this PR, v1.3 orchestrator dispatches were run against real skills; so far only shadcn-ui shows measured uplift:

Skill	Status	Baseline → Final per-case-min	Notes
`vercel-labs/agent-browser/agent-browser`	uplift-too-small	0.667 → 0.667	Frontier matrix; gpt-5 floor on Tier-0; orchestrator caught an eval harness bug + fixed it additively
`google-labs-code/stitch-skills/shadcn-ui` (gpt-5 re-fire)	success	0.667 → 0.889 (+0.222)	Phase 0 research subagent worked, Recipe D applied, gemini V8 fixed
`firebase-hosting-basics`, `firecrawl-build-scrape`	(in flight at PR open time)	—	Exercising Phase 3.5 (eval-readiness loop)

The shadcn-ui result is drafted as PR #5 in docs/pilot-runs/upstream-pr-drafts/ for upstream submission.

What's in this PR

New skill: skills/auto-improve-orchestrator/ — SKILL.md + 4 prompt templates (orchestrator + 3 sub-subagents) + references/ (workflow.md, lessons.md moved from tools/, contexts/ moved from tools/).

Removed: tools/auto-improve-skill.mjs and tools/auto-improve-skill-prompt.md (the v1.2.1 wrapper).

Docs:

docs/auto-improve-skill-v1.3-design.md — predecessor design draft
docs/auto-improve-skill-v1.3-spec.md — approved spec (after brainstorming)
docs/auto-improve-skill-v1.3-plan.md — implementation plan (12 tasks, all completed)
docs/auto-improve-skill-v1.3-validation.md — validation note (e2e was operator-deferred, then exercised via the dispatches above)
docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md — first v1.3-evidenced PR draft

Test plan

Smoke check at skills/auto-improve-orchestrator/.smoke-check.mjs passes (36/36 structural validations)
End-to-end on shadcn-ui (gpt-5 frontier matrix): 0.667→0.889
End-to-end on agent-browser: cleanly exits uplift-too-small (honest)
Spec self-review + final code review (issues addressed in commit 777b245)
(Operator) Decide on PR portfolio for upstream submission

🤖 Generated with Claude Code

The agent container runs non-root and writes results into a host-mounted results directory. On Linux the bind-mount inherits the host owner, so the container couldn't write `result.json`.

Fix:

- chmod 0777 on workDir + resultsDir before starting the container
- chmod -R a+rw on resultsDir after `docker cp` so cleanup can read it

Also gitignore .superpowers/, which holds runtime state from the categorization pipeline (per-skill JSON cache, progress logs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reviewer flagged that the prompt referenced case-source files that don't exist on this branch (web-design-guidelines/checks/, find-skills/). Make the prompt self-sufficient:

- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The default budget cap stays at $3.50 (unchanged). Pilot #1 (agent-browser) hit the original $3.50 cap mid-iteration before reaching the "Always: commit" step, losing the run record. With --budget 15 the same pilot completed cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.

Operator usage: node tools/auto-improve-skill.mjs <slug> --budget 15

Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic step.** Pilots #1b and #2 wrote analysis.md but ran out of budget before reaching the separate commit step, leaving case files uncommitted. The merged section explicitly tells the agent to skip everything else if budget is low and finish this section first.
2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first real-data attempt died at the cap mid-modification. Pilot #1c at --budget 15 settled at $3.15 with full success. The prompt's Phase-4 self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for the analysis.md + commit cleanup below the wrapper hard cap.
3. **New tools/auto-improve-skill-lessons.md**: living doc the prompt reads as Phase-4 prior. Captures recipes A-E (two-pass workflow, verify-tool-installed, per-element checklists, BAD/GOOD examples, rationale + bug-story) and grader-reliability patterns G1-G6 (line tolerance, hyphen regex, per-finding-line matching, keyword variants, set-semantics, verbosity floor) with empirical evidence from the manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the prompt now references the recipes by letter so the auto-pilot doesn't rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag (when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug, existing dir gets refused, --budget validates input.
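The wrapper's parsing code is not shown in this PR, so the following is only a plausible reconstruction of how a "skips argv[0] when --budget is absent" bug can arise and one way to avoid it. The function names, the `argv` shape (arguments after the script name), and the validation rule are assumptions; the 10.00 default is the value stated in the commit message.

```javascript
// Hypothetical reconstruction; not the actual tools/auto-improve-skill.mjs code.
const DEFAULT_BUDGET_USD = 10; // default quoted in the commit message

// Buggy variant: when --budget is absent, indexOf returns -1, so
// `flagIdx + 1` is 0 and the filter drops argv[0] (the slug).
function parseArgsBuggy(argv) {
  const flagIdx = argv.indexOf('--budget');
  const positional = argv.filter((_, i) => i !== flagIdx && i !== flagIdx + 1);
  return { slug: positional[0] };
}

// Fixed variant: only strip the flag and its value when the flag is present.
function parseArgs(argv) {
  const flagIdx = argv.indexOf('--budget');
  let budget = DEFAULT_BUDGET_USD;
  let positional = argv;
  if (flagIdx !== -1) {
    budget = Number(argv[flagIdx + 1]);
    if (!Number.isFinite(budget) || budget <= 0) {
      throw new Error('--budget expects a positive number');
    }
    positional = argv.filter((_, i) => i !== flagIdx && i !== flagIdx + 1);
  }
  return { slug: positional[0], budget };
}
```

With this sketch, `parseArgs(['my-skill'])` keeps the slug and falls back to the default budget, while `parseArgsBuggy(['my-skill'])` loses the slug entirely.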

Adds three grader-helper utilities to the inlined `_grader-utils.mjs` content the auto-pilot writes to each new case in Phase 2:

- looseRange(N, tolerance=8): centered range with default ±8 line tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs the LLM line-counting drift seen across all 4 prior pilots.
- fuzzyKeyword(phrase): hyphen-and-space-tolerant regex builder. fuzzyKeyword('empty state') matches "empty state", "empty-state", "emptystate". Replaces hand-rolling `/empty[-\s]+state/`.
- tolerantKeyword(stem): word-stem prefix matcher. tolerantKeyword('cover') matches "covering", "covered", "does not cover" but NOT "discovery" (word boundary). Replaces alternation regexes for common phrasing variants.

Also updates lessons.md G1 / G2 / G4 to reference the helpers in their recipes, so the auto-pilot's Phase-4 reading naturally guides it to use them rather than rediscovering by hand.

Verified end-to-end: extracted the inlined block from the prompt, ran each helper, confirmed expected behavior on the canonical patterns from prior pilots.
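The helper bodies are not reproduced in this description, so here is a minimal sketch of what the three helpers could look like, based only on the behavior described above. The return type of `looseRange` (an array of line numbers clamped at 1) and the internal `escapeRegExp` helper are assumptions. Note that the sketch joins words with `[-\s]*` rather than `[-\s]+`, since the stated behavior includes matching "emptystate" with no separator.

```javascript
// Hypothetical re-implementations; the real _grader-utils.mjs may differ.

// Escape regex metacharacters so keywords are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Line numbers within +/- tolerance of n (assumed 1-based, so clamped at 1).
function looseRange(n, tolerance = 8) {
  const lines = [];
  for (let i = Math.max(1, n - tolerance); i <= n + tolerance; i++) {
    lines.push(i);
  }
  return lines;
}

// Match a phrase regardless of spaces, hyphens, or no separator between words.
function fuzzyKeyword(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegExp);
  return new RegExp(words.join('[-\\s]*'), 'i');
}

// Match any word that starts with the stem, anchored at a word boundary.
function tolerantKeyword(stem) {
  return new RegExp('\\b' + escapeRegExp(stem) + '\\w*', 'i');
}
```

A grader might then write `looseRange(42).includes(reportedLine)` or `tolerantKeyword('cover').test(findingText)` instead of hand-rolling the range or regex each time.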

Moves auto-improve-skill pilot summaries from gitignored docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team can review them in-tree. Includes:

- docs/pilot-runs/README.md: directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md: batch 1 (3 skills, 3/3 success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md: batch 2 (10 skills, 8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics, firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching, firecrawl-build-scrape, next-upgrade, prd)

Per-skill eval artifacts and proposed-upstream-changes live on eval/auto-pilot/<skill-id> branches and the consolidated batch branches (eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).

Operational guide for submitting skill-improvement PRs to the four repos we're currently working with (vercel-labs/agent-skills, vercel-labs/web-interface-guidelines, vercel-labs/agent-browser, supabase/agent-skills). Per repo: title format, body convention, CI gates, CLA status, merge style, scope guidance, and any gotchas discovered by reading AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged PRs. Future batches: append new repos as their conventions become known.

Polished PR drafts ready for operator review + submission to upstream. Each draft contains:

- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create

The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):

1. vercel-labs/agent-skills: web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines: per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser: Pre-flight section (retargeted to skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills: two-pass review reference (reformulated as a new references/ file per CONTRIBUTING.md, not a SKILL.md edit)

Sources:

- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to fit repo convention)

Adds a `--context <path>` flag to the auto-pilot wrapper that reads a markdown file and injects it into the prompt as a "Constraints" section Phase 4 must respect. Enables steering pilots toward upstream-specific targets (e.g. fetched rules docs instead of skill SKILL.md) and encoding architecture intent (additive-only, no restructure, etc.) as hard constraints.

Phase 4 + Phase 5 updated to honor target-file overrides from the constraints (e.g. edit `command.md` instead of `SKILL.md` when the context says so; package files as `before-/after-command.md` under the correct upstream-repo directory).

Includes the first context file: `tools/auto-improve-contexts/vercel-web-interface-guidelines.md`, encoding the vercel research findings: `command.md` is the canonical source distributed to 7 tools + 10 downstream consumers, restructure risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent), and the AGENTS.md / README.md mirrors happen at PR-draft time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Encodes upstream conventions discovered via gh-CLI research:

- All 28 existing references in this skill are single-rule SQL anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention" pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`, `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a `review-` prefix would require modifying `_sections.md` which is not additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that two-pass review catches and frame around it (Incorrect = single-pass miss, Correct = two-pass catch). If reshape feels contrived, surface needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`, `impactDescription`, `tags`); previous research missed `impactDescription`. `tags` is comma-separated string, not YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note); convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes merge conflicts with bot's release PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-browser Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture) as the starting point for deeper Tier-1 work.

- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness, multi-step-state) that grade snapshot-driven @en ref discipline, ambiguous-element resolution, content correctness, and full state-machine traversal, none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page, per-page transitions.txt drives state changes, snapshot emits the recorded accessibility-tree fixture for current (page, state). Falls back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup, blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against hand-crafted ab-calls.log + output-file fixtures; all pass without Docker or models.
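To make the playback idea concrete, here is a rough in-memory sketch of "URL match -> page, transitions drive state, snapshot returns the fixture for the current (page, state)". Everything in it is an assumption for illustration: the transitions line format (`<state> <action> -> <nextState>`), the `initial` starting state, the `GENERIC_SNAPSHOT` fallback text, and all function names. The real bin/agent-browser and transitions.txt layout are not shown in this PR description.

```javascript
// Hypothetical sketch of a stateful playback core; not the real bin/agent-browser.
const GENERIC_SNAPSHOT = '(generic snapshot)'; // stands in for the legacy fallback

// Assumed line format: "<state> <action> -> <nextState>"; '#' starts a comment.
function parseTransitions(text) {
  const table = new Map();
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (!line || line.startsWith('#')) continue;
    const match = line.match(/^(\S+)\s+(.+?)\s*->\s*(\S+)$/);
    if (!match) throw new Error(`bad transition line: ${line}`);
    table.set(`${match[1]}|${match[2]}`, match[3]);
  }
  return table;
}

// pages: [{ urlPattern: RegExp, transitions: Map, snapshots: { [state]: string } }]
function createPlayback(pages) {
  let page = null;
  let state = 'initial';
  return {
    // Select the recorded page whose pattern matches the URL.
    open(url) {
      page = pages.find((p) => p.urlPattern.test(url)) ?? null;
      state = 'initial';
      return page !== null;
    },
    // Apply an action; unknown actions leave the state unchanged.
    act(action) {
      if (!page) return false;
      const next = page.transitions.get(`${state}|${action}`);
      if (next === undefined) return false;
      state = next;
      return true;
    },
    // Return the fixture for the current (page, state), else the generic fallback.
    snapshot() {
      if (!page) return GENERIC_SNAPSHOT;
      return page.snapshots[state] ?? GENERIC_SNAPSHOT;
    },
  };
}
```

Keeping the state machine separate from file and CLI handling is one way such a tool could be smoke-tested without Docker, which is the property the commit highlights.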

…er-1 pilot

Encodes constraints for the auto-pilot to run against the hand-built Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation, output-correctness, multi-step-state) without rebuilding the workbench.

Key directives:

- Workbench is already built, so skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md (the workflow content), NOT references/agent-browser/SKILL.md (the discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean docs-only PRs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`) is dropped: it's a thin Claude-Code-specific adapter that WebFetches the rules doc, and editing it is low-leverage. All value lives in `vercel-labs/web-interface-guidelines/command.md` and its two stylistic siblings (`AGENTS.md`, `README.md`).

The consolidated draft at #1 carries:

- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00, 18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not independently measured)
- A prose mirror for `README.md` (style-faithful, not independently measured)
- A qualitative pitch as the headline + eval data as supporting evidence (matches PR #23 precedent in this repo, which has zero quantitative evidence in any merged PR)

Old drafts moved to `superseded/` with a README explaining why each was retired. Repo PR-drafts README updated to reflect the new canonical numbering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the two structural lessons from the v1.2.1 pilot session:

1. Research-first context is mandatory (Phase 0): the auto-pilot is good at finding what to change, bad at fitting upstream conventions. Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval baselines because it only iterates the skill, treating the eval as fixed.

Backwards compatible: v1.2.1's --context flag continues to work; v1.3 phases are opt-in via --research and --auto-eval flags until validated.

Note: this commit lands on the supabase--v1-shallow branch because the agent-browser pilot is concurrently active on the main worktree; branch hygiene (move to docs/auto-pilot-runs) deferred until pilots finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
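The ceiling/floor rule in lesson 2 can be sketched as a small classifier. The two thresholds come from the text above; the function name, the returned labels, and the idea that the result decides between eval iteration and skill iteration are illustrative assumptions, not the orchestrator's actual code.

```javascript
// Thresholds quoted in the commit message; names and labels are hypothetical.
const FLOOR = 0.5;
const CEILING = 0.95;

// Classify a baseline score to decide which loop should run next.
function classifyBaseline(score) {
  if (typeof score !== 'number' || Number.isNaN(score) || score < 0 || score > 1) {
    throw new Error(`score must be a number in [0, 1], got ${score}`);
  }
  if (score >= CEILING) return 'ceiling'; // eval leaves no headroom: iterate the eval
  if (score < FLOOR) return 'floor'; // eval too hard or broken: iterate the eval
  return 'iterable'; // headroom exists: iterate the skill
}
```

Under these assumptions a baseline of exactly 0.50 counts as iterable, because the floor is defined with a strict `< 0.50`.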

The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap mid-baseline (50/54 trials complete; no Phase 5 commit). However, the supabase v2 pilot's Phase 4 instruction to append a run-record entry to lessons.md DID complete and wrote a useful observation about the 'calibrated graders cause baseline ceiling' pattern. Salvaging that entry here even though the parent agent-browser pilot didn't finalize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval pilot was attempted but timed out at the wrapper's 90-min hard cap mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used curl). Partial baseline data preserved at .results/20260512-101220/ for future analysis.

#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result. The auto-pilot reshaped the proposal exactly per the upstream context file (filename monitor-two-pass-review.md, monitor- prefix, 4-field frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the file is convention-perfect. Honest framing: per-case breakdown shows update-without-where at 77.8% (the targeted failure pattern) but overall 0.97 baseline meant no iteration; auto-pilot's exit logic uses overall average rather than per-case minimum (v1.3 will fix).

README index updated with evidence-strength column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
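The difference between an overall average and a per-case minimum can be shown with a short sketch. The input shape (`caseId`, `passRate`) and the sample values are made up for illustration; the real suite-result.json layout is not shown in this PR description.

```javascript
// Hypothetical input shape: [{ caseId, passRate }] with passRate in [0, 1].
function summarizeCases(cases) {
  if (cases.length === 0) throw new Error('no cases to summarize');
  const average = cases.reduce((sum, c) => sum + c.passRate, 0) / cases.length;
  const weakest = cases.reduce((min, c) => (c.passRate < min.passRate ? c : min));
  return { average, perCaseMin: weakest.passRate, weakestCase: weakest.caseId };
}

// Illustrative values only: two saturated cases hide one weak case in the average.
const summary = summarizeCases([
  { caseId: 'case-a', passRate: 1.0 },
  { caseId: 'case-b', passRate: 1.0 },
  { caseId: 'case-c', passRate: 0.7 },
]);
console.log(summary.weakestCase, summary.perCaseMin);
```

Here the average is roughly 0.9 while the per-case minimum is 0.7, so exit logic keyed on the minimum would still see room to iterate on `case-c`.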

Approved design (brainstormed 2026-05-12) for converting the auto-improve-skill workflow from a wrapper-spawned claude -p pilot into a Claude Code skill (auto-improve-orchestrator) that an operator's CC session invokes via the Agent tool.

Key architectural shift:

- skill-optimizer stays lean (run-suite, run-case, graders, Docker)
- Orchestration moves OUT into a new skill at skills/auto-improve-orchestrator/ that ships subagent prompt templates the operator's CC session dispatches via the Agent tool
- Each orchestrator subagent owns one skill end-to-end, runs in its own worktree, parallelizable across N skills

Addresses 4 motivators from v1.2.1 work:

1. Research-first context is mandatory (becomes Phase 0 sub-subagent)
2. Two-loop iteration on eval AND skill (Phase 3.5 sub-subagent)
3. Per-case-minimum threshold (computed by orchestrator from suite-result.json)
4. Resume-on-timeout (every phase resume-aware via on-disk artifacts)

Predecessor design draft: docs/auto-improve-skill-v1.3-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

12 bite-sized tasks: file moves/deletes (1-4), new skill structure (5-10), smoke validation (11), end-to-end test on supabase (12). Each task is one focused commit. All prompt-template content is inlined in the plan (no placeholders).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wrapper-spawned claude -p autonomous pilot is replaced by the new auto-improve-orchestrator Claude Code skill (built in subsequent commits). Operator now dispatches the orchestrator subagent via the Agent tool instead of running a Node wrapper.

The smoke check (36/36 structural validations) confirms the v1.3 implementation is complete and well-formed. End-to-end orchestrator dispatch is documented as a focused next-session task with the exact Agent dispatch payload, expected behavior per phase, and acceptance criteria. Workbench imported from eval/auto-pilot/supabase-postgres-best-practices-v2 in the prior commit so the e2e dispatch can resume cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Rename context files to match the orchestrator's ${OWNER}-${SKILL_ID}.md formula so cached contexts are found on existing-skill runs (was breaking acceptance criterion #3: orchestrator was always dispatching Phase 0 research even when cached context existed).
- Update smoke check to expect renamed supabase context.
- workflow.md: complete the sub-subagent input lists for Phase 3.5 and Phase 4 dispatch lines (previously missing WORKBENCH_DIR / LESSONS_PATH / CONTEXT_FILE).
- orchestrator.md hard rules: clarify that context files are written by the research sub-subagent, not the orchestrator (was ambiguous).
- SKILL.md: document the optional REFRESH_CONTEXT template var.
- docs/pilot-runs/README.md: add deprecation note pointing at v1.3.

NOT FIXED (deliberate): Phase 4 cost tracking is a comment stub vs Phase 3's full code block. The asymmetry is intentional: Phase 4 runs at most twice, and a competent orchestrator can infer the pattern from Phase 3 without duplicating the 10-line block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
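As a small illustration of the `${OWNER}-${SKILL_ID}.md` naming formula mentioned in the first bullet, the helper below builds the expected context file name. The function name and the character validation are assumptions; only the formula itself comes from the commit message.

```javascript
// Hypothetical helper mirroring the ${OWNER}-${SKILL_ID}.md formula.
function contextFileName(owner, skillId) {
  for (const [label, value] of [['owner', owner], ['skillId', skillId]]) {
    // Reject empty values and anything that could escape the contexts directory.
    if (typeof value !== 'string' || !/^[A-Za-z0-9._-]+$/.test(value)) {
      throw new Error(`invalid ${label}: ${value}`);
    }
  }
  return `${owner}-${skillId}.md`;
}

console.log(contextFileName('google-labs-code', 'shadcn-ui'));
// prints "google-labs-code-shadcn-ui.md"
```

The printed name matches one of the cached context files listed in the review's file summary, which is the lookup the rename is meant to make succeed.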

Brings in the batch-2 shadcn-ui eval workbench (baseline 0.82, batch-2 showed +0.07 uplift) so the v1.3 orchestrator can be dispatched on a fresh skill (no cached context, will trigger Phase 0 research). Pairs with the agent-browser dispatch (cached context, will skip Phase 0). The two together exercise both paths through Phase 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The shadcn-ui workbench was cherry-picked from batch-2 which used the older gpt-4o-mini. Updating to gpt-5 to match the canonical v1.2.1+ frontier-model matrix (sonnet-4.6, gpt-5, gemini-2.5-pro) used by web-design-guidelines and agent-browser. The just-completed v1.3 orchestrator dispatch saw gpt-4o-mini dominate the per-case-min floor; re-firing with gpt-5 will produce a result that's apples-to-apples with the other PR candidates.

Follow-up issue: v1.3 orchestrator should validate the model matrix against a canonical set before running, OR the operator's pre-flight should. Current behavior runs whatever is on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
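The follow-up validation suggested above could look something like the sketch below. It is not part of this PR: the function name, the exact model identifier strings, and the choice to report both unexpected and missing models are assumptions. Real suite files may use provider-prefixed ids, which this sketch does not handle.

```javascript
// Hypothetical pre-flight check; identifiers are assumed, not read from a real suite.yml.
const CANONICAL_MODELS = ['claude-sonnet-4.6', 'gpt-5', 'gemini-2.5-pro'];

// Compare the models found on disk against the canonical frontier set.
function checkModelMatrix(models, canonical = CANONICAL_MODELS) {
  const unexpected = models.filter((m) => !canonical.includes(m));
  const missing = canonical.filter((m) => !models.includes(m));
  return { ok: unexpected.length === 0 && missing.length === 0, unexpected, missing };
}

const report = checkModelMatrix(['claude-sonnet-4.6', 'gpt-4o-mini', 'gemini-2.5-pro']);
console.log(report.ok, report.unexpected, report.missing);
```

For the stale matrix in this example, the report flags `gpt-4o-mini` as unexpected and `gpt-5` as missing, which is the mismatch the commit had to fix by hand.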

…shadcn-ui (re-fire)

…k with first-line path comment instruction and StatusBadge BAD/GOOD example

…patch

Brings firebase-hosting-basics and firecrawl-build-scrape workbenches from their batch-2 eval branches, with the model matrix updated to the v1.2.1+ canonical frontier set: claude-sonnet-4.6, gpt-5, gemini-2.5-pro.

Both eval the code-reviewer pattern (read a file, find seeded violations, write findings.txt). Lightweight: no Python venv or real API calls.

Both will exercise the v1.3 Phase 3.5 eval-iteration loop if frontier models hit ceiling (likely; both batch-2 baselines were 0.84-0.89 with old gpt-4o-mini matrix). This is the v1.3 architectural feature we haven't validated yet: neither shadcn-ui nor agent-browser dispatched eval-iterate (both landed in (0.50, 0.95) directly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hosting-basics

+import { spawnSync } from 'node:child_process';
+import { mkdtempSync, mkdirSync, writeFileSync, rmSync } from 'node:fs';
+import { tmpdir } from 'node:os';
+import { join, dirname, resolve } from 'node:path';


+import React from "react";
+import { Card, CardContent, CardHeader } from "@/components/ui/card";
+import { Button } from "@/components/ui/button";
+import { Badge } from "@/components/ui/badge";


Copilot

Pull request overview

This PR implements the v1.3 “auto-improve” workflow as an operator-dispatched Claude Code skill (skills/auto-improve-orchestrator/), replacing the prior wrapper approach, and adds multiple new/updated workbenches + graders used to validate (and package) upstream skill improvements. It also hardens Docker workbench cleanup/permissions behavior for runs that write as a non-host UID.

Changes:

Add the auto-improve-orchestrator skill (SKILL.md + prompts + references/contexts) plus a node-based smoke-check script.
Add/extend several eval workbenches (Supabase Postgres, shadcn-ui, Firebase Hosting, Firecrawl scrape, agent-browser) including suite.yml, graders, fixtures/workspaces, analyses, and proposed-upstream-changes artifacts.
Adjust Docker runner permissions/cleanup to reduce failures when containers write files the host user can’t delete; add gray-matter for frontmatter parsing.

Reviewed changes

Copilot reviewed 149 out of 152 changed files in this pull request and generated 6 comments.

File	Description
src/workbench/docker-runner.ts	Adjust permissions and make cleanup resilient to host/container UID mismatches
skills/auto-improve-orchestrator/SKILL.md	New orchestrator skill entrypoint and operator invocation instructions
skills/auto-improve-orchestrator/.smoke-check.mjs	Smoke-check script validating skill structure and template vars
skills/auto-improve-orchestrator/references/contexts/vercel-labs-web-design-guidelines.md	Cached upstream research context for a target
skills/auto-improve-orchestrator/references/contexts/google-labs-code-shadcn-ui.md	Cached upstream research context for shadcn-ui
skills/auto-improve-orchestrator/references/contexts/firebase-firebase-hosting-basics.md	Cached upstream research context for firebase-hosting-basics
package.json	Add `gray-matter` (used by smoke-check)
examples/workbench/supabase-postgres-best-practices/workspace/schema.sql	New seeded SQL workspace for deterministic grading
examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql	New seeded SQL workspace for RLS-focused grading
examples/workbench/supabase-postgres-best-practices/workspace/multi_table_schema.sql	New multi-table RLS enumeration workspace
examples/workbench/supabase-postgres-best-practices/workspace/migrations.sql	New FK/index enumeration workspace
examples/workbench/supabase-postgres-best-practices/workspace/data_migration.sql	New UPDATE-without-WHERE workspace
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md	Vendored skill snapshot for deterministic eval
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-two-pass-review.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md	New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md	New vendored reference template
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md	New vendored section definitions
examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs	Shared grader utilities for this workbench
examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs	Grader for schema.sql case
examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs	Grader for rls_policies.sql case
examples/workbench/supabase-postgres-best-practices/checks/grade-multi-table-rls-findings.mjs	Grader for multi-table RLS case
examples/workbench/supabase-postgres-best-practices/checks/grade-fk-index-audit-findings.mjs	Grader for FK/index enumeration case
examples/workbench/supabase-postgres-best-practices/checks/grade-update-without-where-findings.mjs	Grader for UPDATE-without-WHERE case
examples/workbench/supabase-postgres-best-practices/README.md	Workbench documentation and run instructions
examples/workbench/supabase-postgres-best-practices/analysis.md	Run analysis summary
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md	Packaged upstream-change notes
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md	Packaged before snapshot
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md	Packaged after snapshot
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/monitor-two-pass-review.md	Packaged new upstream reference file
examples/workbench/shadcn-ui/workspace/UserCard.tsx	Seeded code-review fixture
examples/workbench/shadcn-ui/workspace/StatusBadge.tsx	Seeded code-review fixture
examples/workbench/shadcn-ui/checks/_grader-utils.mjs	Shared grader utilities
examples/workbench/shadcn-ui/checks/grade-usercard-findings.mjs	UserCard grader
examples/workbench/shadcn-ui/checks/grade-statusbadge-findings.mjs	StatusBadge grader
examples/workbench/shadcn-ui/suite.yml	New shadcn-ui eval suite
examples/workbench/shadcn-ui/README.md	Workbench documentation and run instructions
examples/workbench/shadcn-ui/analysis.md	Run analysis summary
examples/workbench/shadcn-ui/proposed-upstream-changes/README.md	Packaged upstream-change notes
examples/workbench/firecrawl-build-scrape/workspace/ScrapeService.ts	Seeded code-review fixture
examples/workbench/firecrawl-build-scrape/checks/_grader-utils.mjs	Shared grader utilities
examples/workbench/firecrawl-build-scrape/checks/grade-scrape-service-findings.mjs	ScrapeService grader
examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/node-docs.md	Vendored Firecrawl Node docs snapshot
examples/workbench/firecrawl-build-scrape/suite.yml	New Firecrawl eval suite
examples/workbench/firecrawl-build-scrape/README.md	Workbench documentation and run instructions
examples/workbench/firecrawl-build-scrape/analysis.md	Run analysis summary
examples/workbench/firebase-hosting-basics/workspace/firebase-app/firebase.json	Seeded config fixture
examples/workbench/firebase-hosting-basics/checks/_grader-utils.mjs	Shared grader utilities
examples/workbench/firebase-hosting-basics/checks/grade-firebase-config-findings.mjs	firebase.json grader
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/SKILL.md	Vendored skill snapshot
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/configuration.md	Vendored reference doc
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/deploying.md	Vendored reference doc
examples/workbench/firebase-hosting-basics/suite.yml	New Firebase Hosting eval suite
examples/workbench/firebase-hosting-basics/README.md	Workbench documentation and run instructions
examples/workbench/firebase-hosting-basics/analysis.md	Run analysis summary
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/README.md	Packaged upstream-change notes
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/before-SKILL.md	Packaged before snapshot
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/after-SKILL.md	Packaged after snapshot
examples/workbench/agent-browser/suite.yml	Expanded agent-browser suite
examples/workbench/agent-browser/references/agent-browser/SKILL.md	Vendored skill stub used by workbench
examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md	Vendored core workflow reference
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt	Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt	Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt	Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out	Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt	Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out	Recorded snapshot
examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs	Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs	Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs	Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs	Behavioral grader updates/additions
examples/workbench/agent-browser/checks/_grader-utils.mjs	Shared grader utilities
examples/workbench/agent-browser/proposed-upstream-changes/README.md	Packaged upstream-change notes
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md	Packaged before snapshot
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md	Packaged after snapshot
examples/workbench/agent-browser/analysis.md	Run analysis summary
docs/pilot-runs/README.md	Note about v1.3 orchestrator vs removed wrapper
docs/pilot-runs/upstream-pr-drafts/README.md	Update upstream draft index/process
docs/pilot-runs/upstream-pr-drafts/superseded/README.md	Archive/superseded draft notes
docs/auto-improve-skill-v1.3-validation.md	v1.3 validation notes (needs update for current PR claims)
CLAUDE.md	Add pointer to orchestrator skill
.gitignore	Ignore new generated artifacts/logs


+## Models
+
+The suite runs a 3-provider mid-tier matrix:
+
+- `openrouter/anthropic/claude-sonnet-4-6`
+- `openrouter/openai/gpt-4o-mini`
+- `openrouter/google/gemini-2.5-pro`



+## Models
+
+The suite runs a 3-provider mid-tier matrix:
+
+- `openrouter/anthropic/claude-sonnet-4-5`
+- `openrouter/openai/gpt-4o-mini`
+- `openrouter/google/gemini-2.5-flash`


+# v1.3 validation — deferred to operator
+
+**Date:** 2026-05-12
+**Status:** implementation complete, end-to-end orchestrator dispatch
+deferred to operator's next session.
+


+for (const [file, vars] of Object.entries(expectedVars)) {
+  const content = readFileSync(`${skillRoot}/prompts/${file}`, 'utf-8');
+  for (const v of vars) {
+    check(content.includes(`\${${v}}`), `prompts/${file} contains \${${v}}`);
+  }
+}
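The loop above asserts that each prompt template still contains every `${var}` placeholder it is expected to carry. A minimal standalone sketch of the same idea follows; `missingPlaceholders` and the sample variable names are hypothetical and do not come from the PR, and the real script reads files from `prompts/` rather than an inline string.

```javascript
// Hypothetical helper (not from the PR): return the expected placeholder
// names whose literal `${name}` token does not appear in the template text.
function missingPlaceholders(content, vars) {
  // Build the token by concatenation so it is matched literally
  // instead of being interpolated as a template literal.
  return vars.filter((v) => !content.includes('${' + v + '}'));
}

// Illustrative template text and variable names (assumptions).
const template = 'Improve ${skillName} inside ${worktreePath}.';
const missing = missingPlaceholders(template, ['skillName', 'worktreePath', 'budget']);

// Lists the placeholders that the template fails to mention.
console.log(missing);
```

Returning the missing names, rather than asserting one at a time, would let a smoke check report every gap in a template in a single run.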


  mkdirSync(referencesDir, { recursive: true });
  mkdirSync(workDir, { recursive: true });
  mkdirSync(resultsDir, { recursive: true });
+  chmodSync(workDir, 0o777);
+  chmodSync(resultsDir, 0o777);
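The diff does not say why the `chmodSync` calls were added. One plausible reading, offered here as an assumption, is that the directories must be writable by a process running under a different user, and that `mkdirSync` alone cannot guarantee this because the mode it applies is masked by the process umask. A minimal sketch under that assumption, where `makeSharedDir` is a hypothetical helper and not part of the PR:

```javascript
// Sketch only: `makeSharedDir` is a hypothetical helper, not part of the PR.
const { mkdirSync, chmodSync, statSync, mkdtempSync } = require('node:fs');
const { tmpdir } = require('node:os');
const { join } = require('node:path');

function makeSharedDir(dir) {
  // mkdirSync applies the process umask to the requested mode, so with a
  // typical umask of 022 the new directory ends up at 0o755.
  mkdirSync(dir, { recursive: true });
  // chmodSync sets the permission bits explicitly, independent of the umask.
  chmodSync(dir, 0o777);
  // Return only the permission bits so callers can verify the result.
  return statSync(dir).mode & 0o777;
}

// Use a fresh temporary directory so the sketch does not touch the repo.
const base = mkdtempSync(join(tmpdir(), 'workbench-'));
console.log(makeSharedDir(join(base, 'results')).toString(8));
```

This behaves as described on POSIX systems; on Windows, `chmodSync` only toggles the read-only attribute, so the returned bits may differ.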


Yuqing Zhai and others added 30 commits May 7, 2026 10:58

feat(auto-pilot): tools/auto-improve-skill.mjs + prompt template

6054a09

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(agent-browser-eval): import baseline from eval/auto-pilot/agent-browser

bdb4ed0

Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture) as the starting point for deeper Tier-1 work.

refactor(orchestrator): move lessons.md into new skill

42f45fb

docs(agent-browser): update lessons.md reference path for v1.3 move

38d3c6f

refactor(orchestrator): move contexts/ into new skill

4f10a51

docs: update stale context paths to new orchestrator skill location

049096f

docs(CLAUDE.md): point at new auto-improve-orchestrator skill

c6b7ec0

feat(orchestrator): create SKILL.md with invocation guide

7fcfa56

feat(orchestrator): add workflow.md (human-readable algorithm)

3f346e8

feat(orchestrator): add research-upstream sub-subagent prompt

68d1332

feat(orchestrator): add eval-iterate sub-subagent prompt

1fc195a

Yuqing Zhai and others added 13 commits May 12, 2026 08:00

feat(orchestrator): add skill-iterate sub-subagent prompt

da0c9c9

feat(orchestrator): add orchestrator main prompt template

106e2c0

test(orchestrator): smoke validation script for skill structure

5df98c1

chore: add gray-matter dev dependency for smoke-check script

ca5e347

test(e2e): import supabase workbench for v1.3 validation

2039bc9

docs(contexts): research upstream for google-labs-code/stitch-skills/shadcn-ui (re-fire)

b135d8e

feat(shadcn-ui): iterate 1 — Recipe D: strengthen wrong-location check with first-line path comment instruction and StatusBadge BAD/GOOD example

540028a

docs(contexts): research upstream for firebase/agent-skills/firebase-hosting-basics

96dbe31

Copilot AI review requested due to automatic review settings May 12, 2026 16:21

Copilot started reviewing on behalf of Zhaiyuqing2003 May 12, 2026 16:21

github-code-quality Bot found potential problems May 12, 2026


Zhaiyuqing2003 force-pushed the feat/auto-improve-skill-v1.3 branch from b469dae to 96dbe31 May 12, 2026 16:23

Copilot AI reviewed May 12, 2026


Zhaiyuqing2003 mentioned this pull request May 12, 2026

docs(pilot-runs): 3 strong upstream PR drafts ready for team greenlight #51

Open

6 tasks


feat(auto-improve-orchestrator): v1.3 — operator-dispatched orchestrator skill #50

Zhaiyuqing2003 wants to merge 43 commits into development from feat/auto-improve-skill-v1.3
