
feat(auto-improve-orchestrator): v1.3 — operator-dispatched orchestrator skill#50

Open
Zhaiyuqing2003 wants to merge 43 commits into development from feat/auto-improve-skill-v1.3

Conversation

@Zhaiyuqing2003


Summary

Implements v1.3 of the auto-improve-skill pipeline as documented in docs/auto-improve-skill-v1.3-spec.md (the brainstormed + approved spec) and docs/auto-improve-skill-v1.3-plan.md (the implementation plan).

Architectural shift: the v1.2.1 wrapper-spawned claude -p autonomous pilot is REPLACED by a Claude Code skill (skills/auto-improve-orchestrator/) that an operator's CC session invokes via the Agent tool. Each orchestrator subagent owns one skill end-to-end, runs in its own worktree, and dispatches sub-subagents for research / eval-iteration / skill-iteration tasks. Multiple orchestrators can run in parallel for batch operation.

Validated end-to-end on real skills

In this PR, three v1.3 orchestrator dispatches produced measured results (one with uplift, one with an honest uplift-too-small exit):

| Skill | Status | Baseline → Final per-case-min | Notes |
| --- | --- | --- | --- |
| vercel-labs/agent-browser/agent-browser | uplift-too-small | 0.667 → 0.667 | Frontier matrix; gpt-5 floor on Tier-0; orchestrator caught an eval harness bug + fixed it additively |
| google-labs-code/stitch-skills/shadcn-ui (gpt-5 re-fire) | success | 0.667 → 0.889 (+0.222) | Phase 0 research subagent worked, Recipe D applied, gemini V8 fixed |
| firebase-hosting-basics, firecrawl-build-scrape | (in flight at PR open time) | | Exercising Phase 3.5 (eval-readiness loop) |

The shadcn-ui result is drafted as PR #5 in docs/pilot-runs/upstream-pr-drafts/ for upstream submission.

What's in this PR

New skill: skills/auto-improve-orchestrator/ — SKILL.md + 4 prompt templates (orchestrator + 3 sub-subagents) + references/ (workflow.md, lessons.md moved from tools/, contexts/ moved from tools/).

Removed: tools/auto-improve-skill.mjs and tools/auto-improve-skill-prompt.md (the v1.2.1 wrapper).

Docs:

  • docs/auto-improve-skill-v1.3-design.md — predecessor design draft
  • docs/auto-improve-skill-v1.3-spec.md — approved spec (after brainstorming)
  • docs/auto-improve-skill-v1.3-plan.md — implementation plan (12 tasks, all completed)
  • docs/auto-improve-skill-v1.3-validation.md — validation note (e2e was operator-deferred, then exercised via the dispatches above)
  • docs/pilot-runs/upstream-pr-drafts/5-google-labs-code-stitch-skills-shadcn-ui.md — first v1.3-evidenced PR draft

Test plan

  • Smoke check at skills/auto-improve-orchestrator/.smoke-check.mjs passes (36/36 structural validations)
  • End-to-end on shadcn-ui (gpt-5 frontier matrix): 0.667→0.889
  • End-to-end on agent-browser: cleanly exits uplift-too-small (honest)
  • Spec self-review + final code review (issues addressed in commit 777b245)
  • (Operator) Decide on PR portfolio for upstream submission

🤖 Generated with Claude Code

Yuqing Zhai and others added 30 commits May 7, 2026 10:58
The agent container runs non-root and writes results into a host-mounted
results directory. On Linux the bind-mount inherits the host owner, so
the container couldn't write `result.json`. Fix:

- chmod 0777 on workDir + resultsDir before starting the container
- chmod -R a+rw on resultsDir after `docker cp` so cleanup can read it

Also gitignore .superpowers/ — runtime state from the categorization
pipeline (per-skill JSON cache, progress logs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer flagged that the prompt referenced case-source files that
don't exist on this branch (web-design-guidelines/checks/, find-skills/).
Make the prompt self-sufficient:

- Inline _grader-utils.mjs content under Phase 2 step 4
- Soften 'mirror <path>' references to advisory
- Add minimal Cases-table README skeleton in Phase 2 step 6
- Explicit file list in commit step so .run.log can never sneak in

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default 3.50 (unchanged). Pilot #1 (agent-browser) hit the original
3.50 cap mid-iteration before reaching the "Always: commit" step,
losing the run record. With --budget 15 the same pilot completed
cleanly: 0.56 → 1.00, +0.44 uplift, $3.15 actual spend.

Operator usage:
  node tools/auto-improve-skill.mjs <slug> --budget 15
Three changes informed by the 3-skill pilot batch (PR #47):

1. **"Always: write analysis.md AND commit" merged into a single atomic
   step.** Pilots #1b and #2 wrote analysis.md but ran out of budget
   before reaching the separate commit step, leaving case files
   uncommitted. The merged section explicitly tells the agent to skip
   everything else if budget is low and finish this section first.

2. **Default --max-budget-usd bumped 3.50 → 10.00.** Pilot #1's first
   real-data attempt died at the cap mid-modification. Pilot #1c at
   --budget 15 settled at $3.15 with full success. The prompt's Phase-4
   self-cap also moved from $3.00 to $7.00 to leave a $2-3 buffer for
   the analysis.md + commit cleanup below the wrapper hard cap.

3. **New tools/auto-improve-skill-lessons.md** — living doc the prompt
   reads as Phase-4 prior. Captures recipes A-E (two-pass workflow,
   verify-tool-installed, per-element checklists, BAD/GOOD examples,
   rationale + bug-story) and grader-reliability patterns G1-G6 (line
   tolerance, hyphen regex, per-finding-line matching, keyword variants,
   set-semantics, verbosity floor) with empirical evidence from the
   manual web-design-guidelines run + the 3 auto pilots. Phase 4 of the
   prompt now references the recipes by letter so the auto-pilot doesn't
   rediscover patterns from scratch each run.

Also fixes a slug-parsing regression introduced by the --budget flag
(when --budget was absent, the filter wrongly skipped argv[0]).

Smoke tests pass: bare invocation prints usage, "nope" gives bad-slug,
existing dir gets refused, --budget validates input.
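
The slug-parsing fix can be illustrated with a small sketch. This is not the wrapper's actual code: `parseArgs`, the error messages, and the `owner/repo/skill` slug shape are assumptions made for illustration. Only the `--budget` flag and the 10.00 default come from the commit message above.

```javascript
// Default taken from the commit message; everything else here is illustrative.
const DEFAULT_BUDGET_USD = 10.0;

// Hypothetical parser: separates the optional --budget flag from positionals.
function parseArgs(argv) {
  let budget = DEFAULT_BUDGET_USD;
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--budget') {
      const value = Number(argv[i + 1]);
      if (!Number.isFinite(value) || value <= 0) {
        throw new Error('--budget expects a positive number');
      }
      budget = value;
      i++; // skip the flag's value, and only when the flag is actually present
    } else {
      positional.push(argv[i]);
    }
  }
  const slug = positional[0];
  // Assumed slug shape: three non-empty, slash-separated segments.
  if (!slug || !/^[^/\s]+\/[^/\s]+\/[^/\s]+$/.test(slug)) {
    throw new Error('bad slug');
  }
  return { slug, budget };
}
```

Because positionals are collected independently of the flag, the slug is found whether `--budget` comes first, last, or not at all, which is the class of bug the regression describes.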
Adds three grader-helper utilities to the inlined `_grader-utils.mjs`
content the auto-pilot writes to each new case in Phase 2:

- looseRange(N, tolerance=8) — centered range with default ±8 line
  tolerance. Replaces hand-rolling `range(N-3, N+3)`. Default absorbs
  the LLM line-counting drift seen across all 4 prior pilots.

- fuzzyKeyword(phrase) — hyphen-and-space-tolerant regex builder.
  fuzzyKeyword('empty state') matches "empty state", "empty-state",
  "emptystate". Replaces hand-rolling `/empty[-\s]+state/`.

- tolerantKeyword(stem) — word-stem prefix matcher. tolerantKeyword('cover')
  matches "covering", "covered", "does not cover" but NOT "discovery"
  (word boundary). Replaces alternation regexes for common phrasing
  variants.
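
A minimal sketch of how the three helpers described above might behave. This is not the inlined `_grader-utils.mjs` source: the return type of `looseRange` (an array of line numbers), the clamp at line 1, the case-insensitive flag, and the `escapeRegExp` helper are all assumptions.

```javascript
// Escape regex metacharacters so a phrase is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Line numbers within +/- tolerance of n, clamped so the range never starts below line 1.
function looseRange(n, tolerance = 8) {
  const start = Math.max(1, n - tolerance);
  const lines = [];
  for (let line = start; line <= n + tolerance; line++) lines.push(line);
  return lines;
}

// Words of the phrase may be joined by spaces, hyphens, or nothing at all.
function fuzzyKeyword(phrase) {
  const words = phrase.trim().split(/[-\s]+/).map(escapeRegExp);
  return new RegExp(words.join('[-\\s]*'), 'i');
}

// The stem must start a word; any word characters may follow it.
function tolerantKeyword(stem) {
  return new RegExp(`\\b${escapeRegExp(stem)}\\w*`, 'i');
}
```

With these definitions, `fuzzyKeyword('empty state').test('emptystate')` is true, while `tolerantKeyword('cover').test('discovery')` is false because "cover" does not begin a word there.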

Also updates lessons.md G1 / G2 / G4 to reference the helpers in their
recipes, so the auto-pilot's Phase-4 reading naturally guides it to use
them rather than rediscovering by hand.

Verified end-to-end: extracted the inlined block from the prompt, ran
each helper, confirmed expected behavior on the canonical patterns from
prior pilots.
Moves auto-improve-skill pilot summaries from gitignored
docs/superpowers/pilot-runs/ to tracked docs/pilot-runs/ so the team
can review them in-tree.

Includes:

- docs/pilot-runs/README.md — directory index + reproduction recipe
- 2026-05-08-auto-improve-pilot-summary.md — batch 1 (3 skills, 3/3
  success: agent-browser, supabase, pdf)
- 2026-05-09-auto-improve-batch-2-summary.md — batch 2 (10 skills,
  8/10 success, 0 failures: pptx, next-best-practices, firebase-auth-basics,
  firebase-hosting-basics, building-native-ui, shadcn-ui, native-data-fetching,
  firecrawl-build-scrape, next-upgrade, prd)

Per-skill eval artifacts and proposed-upstream-changes live on
eval/auto-pilot/<skill-id> branches and the consolidated batch branches
(eval/auto-pilot/batch-2026-05-08, eval/auto-pilot/batch-2-2026-05-09).
Operational guide for submitting skill-improvement PRs to the four
repos we're currently working with (vercel-labs/agent-skills,
vercel-labs/web-interface-guidelines, vercel-labs/agent-browser,
supabase/agent-skills).

Per repo: title format, body convention, CI gates, CLA status, merge
style, scope guidance, and any gotchas discovered by reading
AGENTS.md/CONTRIBUTING.md/workflow files plus the last 5–10 merged
PRs.

Future batches: append new repos as their conventions become known.
Polished PR drafts ready for operator review + submission to upstream.
Each draft contains:

- Target repo + base branch
- Title in the repo's preferred convention (see upstream-pr-conventions.md)
- PR body matching the repo's style (formal/casual/terse)
- File diff or path to the full proposed file in our repo
- Caveats and gotchas specific to the repo
- Operator copy-paste shell snippet for fork → branch → commit → push → gh pr create

The 4 PRs cover 3 skills (web-design-guidelines spans 2 repos):

1. vercel-labs/agent-skills — web-design-guidelines SKILL.md two-pass workflow
2. vercel-labs/web-interface-guidelines — per-element checklist + 5 BAD/GOOD examples
3. vercel-labs/agent-browser — Pre-flight section (retargeted to
   skill-data/core/SKILL.md per AGENTS.md)
4. supabase/agent-skills — two-pass review reference (reformulated as a
   new references/ file per CONTRIBUTING.md, not a SKILL.md edit)

Sources:
- PR 1 + 2: manual web-design-guidelines run (eval/web-design-guidelines)
- PR 3: agent-browser v1.2 re-run (the small additive Pre-flight)
- PR 4: supabase batch-1 result (0.54 → 0.86, content reformulated to
  fit repo convention)
Adds a `--context <path>` flag to the auto-pilot wrapper that reads a
markdown file and injects it into the prompt as a "Constraints" section
Phase 4 must respect. Enables steering pilots toward upstream-specific
targets (e.g. fetched rules docs instead of skill SKILL.md) and
encoding architecture intent (additive-only, no restructure, etc.) as
hard constraints.
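
As a rough illustration of the injection step, the sketch below appends a context file's text to a prompt under a "Constraints" heading. The function name `injectConstraints`, the `##` heading level, and the placement at the end of the prompt are assumptions; the commit only states that the content is injected as a "Constraints" section.

```javascript
// Hypothetical helper: returns the prompt with a trailing Constraints section.
// An empty or whitespace-only context leaves the prompt unchanged.
function injectConstraints(prompt, contextMarkdown) {
  const body = contextMarkdown.trim();
  if (body === '') return prompt;
  return `${prompt.trimEnd()}\n\n## Constraints\n\n${body}\n`;
}
```

Reading the file named by `--context` and passing its contents as `contextMarkdown` is left to the caller, which keeps the sketch free of file-system access.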

Phase 4 + Phase 5 updated to honor target-file overrides from the
constraints (e.g. edit `command.md` instead of `SKILL.md` when the
context says so; package files as `before-/after-command.md` under the
correct upstream-repo directory).

Includes the first context file:
`tools/auto-improve-contexts/vercel-web-interface-guidelines.md`,
encoding the vercel research findings — `command.md` is the canonical
source distributed to 7 tools + 10 downstream consumers, restructure
risk is HIGH, additive-only PRs are the merged norm (PR #23 precedent),
and the AGENTS.md / README.md mirrors happen at PR-draft time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Encodes upstream conventions discovered via gh-CLI research:
- All 28 existing references in this skill are single-rule SQL
  anti-pattern fixes with **Incorrect**/**Correct** SQL blocks; meta-workflow
  guidance is shape-novel (MEDIUM-HIGH risk of "fit the convention"
  pushback from gregnr/Rodriguespn).
- Prefixes locked to the 8 in `_sections.md` (`query-`, `conn-`,
  `security-`, `schema-`, `lock-`, `data-`, `monitor-`, `advanced-`); a
  `review-` prefix would require modifying `_sections.md` which is not
  additive-only.
- Required reshape: pick a single concrete SQL anti-pattern that
  two-pass review catches and frame around it (Incorrect = single-pass
  miss, Correct = two-pass catch). If reshape feels contrived, surface
  needs-discussion signal instead of shipping borderline PR.
- Frontmatter spec corrected: 4 fields (`title`, `impact`,
  `impactDescription`, `tags`); previous research missed
  `impactDescription`. `tags` is comma-separated string, not YAML list.
- pnpm test:sanity does NOT validate frontmatter (corrected prior note);
  convention is enforced by maintainer review only.
- Release Please owns metadata.version; do not bump manually (causes
  merge conflicts with bot's release PR).
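
The conventions listed above are enforced by maintainer review, not by tooling. Purely as an illustration, a local pre-submission check could look like the sketch below. `checkReference` and its message strings are hypothetical; the prefix list and the four field names are taken from the bullets above, and the frontmatter is assumed to be already parsed into a plain object.

```javascript
// Prefixes and field names as listed in the commit message.
const ALLOWED_PREFIXES = [
  'query-', 'conn-', 'security-', 'schema-',
  'lock-', 'data-', 'monitor-', 'advanced-',
];
const REQUIRED_FIELDS = ['title', 'impact', 'impactDescription', 'tags'];

// Hypothetical checker: returns a list of human-readable problems (empty when clean).
function checkReference(fileName, frontmatter) {
  const problems = [];
  if (!ALLOWED_PREFIXES.some((prefix) => fileName.startsWith(prefix))) {
    problems.push(`unknown prefix: ${fileName}`);
  }
  for (const field of REQUIRED_FIELDS) {
    if (!(field in frontmatter)) problems.push(`missing field: ${field}`);
  }
  // tags is expected to be a comma-separated string, not a YAML list.
  if ('tags' in frontmatter && typeof frontmatter.tags !== 'string') {
    problems.push('tags must be a comma-separated string');
  }
  return problems;
}
```

For example, a file named `review-two-pass.md` would be flagged for its prefix, which mirrors the point above that a `review-` prefix is not available without modifying `_sections.md`.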

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-browser

Carry over existing Tier-0 eval (navigate-and-report, screenshot-capture)
as the starting point for deeper Tier-1 work.
- Add 4 cases (ref-based-search, ref-disambiguation, output-correctness,
  multi-step-state) that grade snapshot-driven @en ref discipline,
  ambiguous-element resolution, content correctness, and full
  state-machine traversal — none of which the v1 baseline covered.
- Upgrade bin/agent-browser to a stateful playback CLI: URL match -> page,
  per-page transitions.txt drives state changes, snapshot emits the
  recorded accessibility-tree fixture for current (page, state). Falls
  back to the legacy generic snapshot for Tier-0 continuity. Adds AB_WORK
  override so the CLI can be smoke-tested outside Docker.
- Add hand-fabricated recordings for 4 pages (wikipedia, signin-signup,
  blog-article, multistep-form) under references/agent-browser/recordings/.
- Add checks/smoke-graders.mjs running 14 GOOD/BAD assertions against
  hand-crafted ab-calls.log + output-file fixtures; all pass without
  Docker or models.
…er-1 pilot

Encodes constraints for the auto-pilot to run against the hand-built
Tier-1 deeper eval (4 new cases: ref-based-search, ref-disambiguation,
output-correctness, multi-step-state) without rebuilding the workbench.

Key directives:
- Workbench is already built — skip Phase 2 entirely
- Optimization target = references/agent-browser/agent-browser-core.md
  (the workflow content), NOT references/agent-browser/SKILL.md (the
  discovery stub)
- Upstream packaging target = skill-data/core/SKILL.md per AGENTS.md
- Apache-2.0 + conventional commits + ctate same-day merges for clean
  docs-only PRs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper-skill PR target (`vercel-labs/agent-skills/.../web-design-guidelines/SKILL.md`)
is dropped — it's a thin Claude-Code-specific adapter that
WebFetches the rules doc, and editing it is low-leverage. All value
lives in `vercel-labs/web-interface-guidelines/command.md` and its
two stylistic siblings (`AGENTS.md`, `README.md`).

The consolidated draft at #1 carries:
- The auto-pilot's measured 22-line `command.md` insert (eval 0.92→1.00,
  18 trials × 3 frontier models, 6 absence-type misses → 0)
- A MUST/SHOULD/NEVER mirror for `AGENTS.md` (style-faithful, not
  independently measured)
- A prose mirror for `README.md` (style-faithful, not independently
  measured)
- A qualitative pitch as the headline + eval data as supporting
  evidence (matches PR #23 precedent in this repo, which has zero
  quantitative evidence in any merged PR)

Old drafts moved to `superseded/` with a README explaining why each
was retired. Repo PR-drafts README updated to reflect the new
canonical numbering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the two structural lessons from the v1.2.1 pilot session:
1. Research-first context is mandatory (Phase 0): the auto-pilot is
   good at finding what to change, bad at fitting upstream conventions.
   Without a researched context file, output requires manual reformulation.
2. Two-loop iteration on eval AND skill (Phase 3.5): the current
   pipeline can't escape ceiling (>= 0.95) or floor (< 0.50) eval
   baselines because it only iterates the skill, treating the eval as
   fixed.
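
The ceiling and floor thresholds in item 2 can be sketched as a small decision function. The names `chooseLoop`, `band`, and `iterate` are made up for illustration, and mapping both out-of-band cases to the eval loop is one reading of the design note, not a statement of the actual orchestrator logic.

```javascript
// Thresholds quoted in the design note: ceiling >= 0.95, floor < 0.50.
const FLOOR = 0.5;
const CEILING = 0.95;

// Hypothetical decision: out-of-band baselines iterate the eval first,
// in-band baselines go straight to iterating the skill.
function chooseLoop(baseline) {
  if (baseline >= CEILING) return { band: 'ceiling', iterate: 'eval' };
  if (baseline < FLOOR) return { band: 'floor', iterate: 'eval' };
  return { band: 'in-band', iterate: 'skill' };
}
```

Under this sketch a baseline of exactly 0.50 counts as in-band, following the "< 0.50" wording for the floor.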

Backwards compatible — v1.2.1's --context flag continues to work; v1.3
phases are opt-in via --research and --auto-eval flags until validated.

Note: this commit lands on the supabase--v1-shallow branch because the
agent-browser pilot is concurrently active on the main worktree;
branch hygiene (move to docs/auto-pilot-runs) deferred until pilots
finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent-browser deeper-eval pilot timed out at the 90-min wrapper cap
mid-baseline (50/54 trials complete; no Phase 5 commit). However, the
supabase v2 pilot's Phase 4 instruction to append a run-record entry to
lessons.md DID complete and wrote a useful observation about the
'calibrated graders cause baseline ceiling' pattern. Salvaging that
entry here even though the parent agent-browser pilot didn't finalize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#3 (agent-browser): updated to acknowledge that the v1.2.1 deeper-eval
pilot was attempted but timed out at the wrapper's 90-min hard cap
mid-baseline (50/54 trials complete, no Phase 5 commit). Ships the
original v1.0 Pre-flight diff (baseline 0.97; 1/9 Gemini trial used
curl). Partial baseline data preserved at .results/20260512-101220/
for future analysis.

#4 (supabase): replaced the batch-1 draft with the v1.2.1 v2 result.
The auto-pilot reshaped the proposal exactly per the upstream context
file (filename monitor-two-pass-review.md, monitor- prefix, 4-field
frontmatter, **Incorrect**/**Correct** SQL blocks, ~50 lines) so the
file is convention-perfect. Honest framing: per-case breakdown shows
update-without-where at 77.8% (the targeted failure pattern) but
overall 0.97 baseline meant no iteration; auto-pilot's exit logic
uses overall average rather than per-case minimum (v1.3 will fix).

README index updated with evidence-strength column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Approved design (brainstormed 2026-05-12) for converting the
auto-improve-skill workflow from a wrapper-spawned claude -p pilot
into a Claude Code skill (auto-improve-orchestrator) that an
operator's CC session invokes via the Agent tool.

Key architectural shift:
- skill-optimizer stays lean (run-suite, run-case, graders, Docker)
- Orchestration moves OUT into a new skill at
  skills/auto-improve-orchestrator/ that ships subagent prompt
  templates the operator's CC session dispatches via the Agent tool
- Each orchestrator subagent owns one skill end-to-end, runs in its
  own worktree, parallelizable across N skills

Addresses 4 motivators from v1.2.1 work:
1. Research-first context is mandatory (becomes Phase 0 sub-subagent)
2. Two-loop iteration on eval AND skill (Phase 3.5 sub-subagent)
3. Per-case-minimum threshold (computed by orchestrator from suite-result.json)
4. Resume-on-timeout (every phase resume-aware via on-disk artifacts)
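
To make motivator 3 concrete, the sketch below contrasts an overall average with a per-case minimum. The input shape (a plain map from case id to pass rate) and the name `summarizeScores` are assumptions; the real `suite-result.json` layout is not shown in this PR text, and the numbers in the usage note are invented.

```javascript
// Hypothetical input: { caseId: passRate } with pass rates in [0, 1].
function summarizeScores(caseScores) {
  const entries = Object.entries(caseScores);
  if (entries.length === 0) throw new Error('no cases to summarize');
  // Unweighted mean across cases.
  const overall = entries.reduce((sum, [, score]) => sum + score, 0) / entries.length;
  // The single weakest case sets the per-case minimum.
  const [weakestCase, perCaseMin] = entries.reduce((min, entry) =>
    entry[1] < min[1] ? entry : min
  );
  return { overall, perCaseMin, weakestCase };
}
```

With invented scores such as `{ a: 1, b: 1, c: 0.5 }`, the overall average stays above 0.8 while the per-case minimum is 0.5, so a threshold on the minimum surfaces the weak case that an average would hide.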

Predecessor design draft: docs/auto-improve-skill-v1.3-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 bite-sized tasks: file moves/deletes (1-4), new skill structure
(5-10), smoke validation (11), end-to-end test on supabase (12).
Each task is one focused commit. All prompt-template content is
inlined in the plan (no placeholders).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper-spawned claude -p autonomous pilot is replaced by the new
auto-improve-orchestrator Claude Code skill (built in subsequent commits).
Operator now dispatches the orchestrator subagent via the Agent tool
instead of running a Node wrapper.
Yuqing Zhai and others added 13 commits May 12, 2026 08:00
The smoke check (36/36 structural validations) confirms the v1.3
implementation is complete and well-formed. End-to-end orchestrator
dispatch is documented as a focused next-session task with the exact
Agent dispatch payload, expected behavior per phase, and acceptance
criteria.

Workbench imported from eval/auto-pilot/supabase-postgres-best-practices-v2
in the prior commit so the e2e dispatch can resume cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename context files to match the orchestrator's ${OWNER}-${SKILL_ID}.md
  formula so cached contexts are found on existing-skill runs (was
  breaking acceptance criterion #3 — orchestrator was always dispatching
  Phase 0 research even when cached context existed).
- Update smoke check to expect renamed supabase context.
- workflow.md: complete the sub-subagent input lists for Phase 3.5 and
  Phase 4 dispatch lines (previously missing WORKBENCH_DIR / LESSONS_PATH /
  CONTEXT_FILE).
- orchestrator.md hard rules: clarify that context files are written by
  the research sub-subagent, not the orchestrator (was ambiguous).
- SKILL.md: document the optional REFRESH_CONTEXT template var.
- docs/pilot-runs/README.md: add deprecation note pointing at v1.3.
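
A sketch of the `${OWNER}-${SKILL_ID}.md` naming formula from the first bullet. The function name `contextPathForSlug` and the three-part `owner/repo/skill` slug are assumptions inferred from the skill names used elsewhere in this PR; the directory is the contexts folder listed in the PR's file summary.

```javascript
const CONTEXT_DIR = 'skills/auto-improve-orchestrator/references/contexts';

// Hypothetical lookup: derive the cached-context path for a skill slug.
function contextPathForSlug(slug) {
  const parts = slug.split('/');
  if (parts.length !== 3 || parts.some((part) => part === '')) {
    throw new Error(`expected owner/repo/skill, got: ${slug}`);
  }
  // The repo segment is not part of the file name under this formula.
  const [owner, , skillId] = parts;
  return `${CONTEXT_DIR}/${owner}-${skillId}.md`;
}
```

For `google-labs-code/stitch-skills/shadcn-ui` this yields a path ending in `google-labs-code-shadcn-ui.md`, which matches the context file name listed in the review summary.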

NOT FIXED (deliberate): Phase 4 cost tracking is a comment stub vs Phase 3's
full code block. The asymmetry is intentional — Phase 4 runs at most twice,
and a competent orchestrator can infer the pattern from Phase 3 without
duplicating the 10-line block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in the batch-2 shadcn-ui eval workbench (baseline 0.82, batch-2
showed +0.07 uplift) so the v1.3 orchestrator can be dispatched on a
fresh skill (no cached context, will trigger Phase 0 research). Pairs
with the agent-browser dispatch (cached context, will skip Phase 0).
The two together exercise both paths through Phase 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shadcn-ui workbench was cherry-picked from batch-2 which used the
older gpt-4o-mini. Updating to gpt-5 to match the canonical v1.2.1+
frontier-model matrix (sonnet-4.6, gpt-5, gemini-2.5-pro) used by
web-design-guidelines and agent-browser. The just-completed v1.3
orchestrator dispatch saw gpt-4o-mini dominate the per-case-min floor;
re-firing with gpt-5 will produce a result that's apples-to-apples
with the other PR candidates.

Follow-up issue: v1.3 orchestrator should validate the model matrix
against a canonical set before running, OR the operator's pre-flight
should. Current behavior runs whatever is on disk.
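
One possible shape for the follow-up check, offered only as a sketch since the PR does not implement it. The function name `checkModelMatrix` is hypothetical, and the exact model identifier strings as they appear in `suite.yml` are an assumption based on the names quoted in these commit messages.

```javascript
// Canonical frontier set as named in the commit messages (identifier spelling assumed).
const CANONICAL_MODELS = ['claude-sonnet-4.6', 'gpt-5', 'gemini-2.5-pro'];

// Hypothetical pre-flight: compare a suite's model list against the canonical set.
function checkModelMatrix(models, canonical = CANONICAL_MODELS) {
  const unexpected = models.filter((model) => !canonical.includes(model));
  const missing = canonical.filter((model) => !models.includes(model));
  return {
    ok: unexpected.length === 0 && missing.length === 0,
    unexpected,
    missing,
  };
}
```

A matrix that still lists `gpt-4o-mini` in place of `gpt-5` would come back with `ok: false`, naming one unexpected and one missing model, so the mismatch is caught before any trials run.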

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k with first-line path comment instruction and StatusBadge BAD/GOOD example
…patch

Brings firebase-hosting-basics and firecrawl-build-scrape workbenches
from their batch-2 eval branches, with the model matrix updated to
the v1.2.1+ canonical frontier set:
  claude-sonnet-4.6, gpt-5, gemini-2.5-pro

Both evaluate the code-reviewer pattern (read a file, find seeded
violations, write findings.txt). Lightweight — no Python venv or
real API calls.

Both will exercise the v1.3 Phase 3.5 eval-iteration loop if frontier
models hit ceiling (likely; both batch-2 baselines were 0.84-0.89
with the old gpt-4o-mini matrix). This is the v1.3 architectural feature
we haven't validated yet — neither shadcn-ui nor agent-browser
dispatched eval-iterate (both landed in (0.50, 0.95) directly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 16:21
import { spawnSync } from 'node:child_process';
import { mkdtempSync, mkdirSync, writeFileSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join, dirname, resolve } from 'node:path';
import React from "react";
import { Card, CardContent, CardHeader } from "@/components/ui/card";
import { Button } from "@/components/ui/button";
import { Badge } from "@/components/ui/badge";
@Zhaiyuqing2003 Zhaiyuqing2003 force-pushed the feat/auto-improve-skill-v1.3 branch from b469dae to 96dbe31 Compare May 12, 2026 16:23

Copilot AI left a comment


Pull request overview

This PR implements the v1.3 “auto-improve” workflow as an operator-dispatched Claude Code skill (skills/auto-improve-orchestrator/), replacing the prior wrapper approach, and adds multiple new/updated workbenches + graders used to validate (and package) upstream skill improvements. It also hardens Docker workbench cleanup/permissions behavior for runs that write as a non-host UID.

Changes:

  • Add the auto-improve-orchestrator skill (SKILL.md + prompts + references/contexts) plus a node-based smoke-check script.
  • Add/extend several eval workbenches (Supabase Postgres, shadcn-ui, Firebase Hosting, Firecrawl scrape, agent-browser) including suite.yml, graders, fixtures/workspaces, analyses, and proposed-upstream-changes artifacts.
  • Adjust Docker runner permissions/cleanup to reduce failures when containers write files the host user can’t delete; add gray-matter for frontmatter parsing.

Reviewed changes

Copilot reviewed 149 out of 152 changed files in this pull request and generated 6 comments.

File Description
src/workbench/docker-runner.ts Adjust permissions and make cleanup resilient to host/container UID mismatches
skills/auto-improve-orchestrator/SKILL.md New orchestrator skill entrypoint and operator invocation instructions
skills/auto-improve-orchestrator/.smoke-check.mjs Smoke-check script validating skill structure and template vars
skills/auto-improve-orchestrator/references/contexts/vercel-labs-web-design-guidelines.md Cached upstream research context for a target
skills/auto-improve-orchestrator/references/contexts/google-labs-code-shadcn-ui.md Cached upstream research context for shadcn-ui
skills/auto-improve-orchestrator/references/contexts/firebase-firebase-hosting-basics.md Cached upstream research context for firebase-hosting-basics
package.json Add gray-matter (used by smoke-check)
examples/workbench/supabase-postgres-best-practices/workspace/schema.sql New seeded SQL workspace for deterministic grading
examples/workbench/supabase-postgres-best-practices/workspace/rls_policies.sql New seeded SQL workspace for RLS-focused grading
examples/workbench/supabase-postgres-best-practices/workspace/multi_table_schema.sql New multi-table RLS enumeration workspace
examples/workbench/supabase-postgres-best-practices/workspace/migrations.sql New FK/index enumeration workspace
examples/workbench/supabase-postgres-best-practices/workspace/data_migration.sql New UPDATE-without-WHERE workspace
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/SKILL.md Vendored skill snapshot for deterministic eval
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-performance.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-rls-basics.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/security-privileges.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-primary-keys.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-partitioning.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-lowercase-identifiers.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-foreign-key-indexes.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-data-types.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/schema-constraints.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-partial-indexes.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-missing-indexes.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-index-types.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-covering-indexes.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/query-composite-indexes.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-vacuum-analyze.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-two-pass-review.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-pg-stat-statements.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/monitor-explain-analyze.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-skip-locked.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-short-transactions.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-deadlock-prevention.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/lock-advisory.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-upsert.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-pagination.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-n-plus-one.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/data-batch-inserts.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-prepared-statements.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-pooling.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-limits.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/conn-idle-timeout.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-jsonb-indexing.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/advanced-full-text-search.md New vendored reference rule file
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_template.md New vendored reference template
examples/workbench/supabase-postgres-best-practices/references/supabase-postgres-best-practices/references/_sections.md New vendored section definitions
examples/workbench/supabase-postgres-best-practices/checks/_grader-utils.mjs Shared grader utilities for this workbench
examples/workbench/supabase-postgres-best-practices/checks/grade-schema-findings.mjs Grader for schema.sql case
examples/workbench/supabase-postgres-best-practices/checks/grade-rls-findings.mjs Grader for rls_policies.sql case
examples/workbench/supabase-postgres-best-practices/checks/grade-multi-table-rls-findings.mjs Grader for multi-table RLS case
examples/workbench/supabase-postgres-best-practices/checks/grade-fk-index-audit-findings.mjs Grader for FK/index enumeration case
examples/workbench/supabase-postgres-best-practices/checks/grade-update-without-where-findings.mjs Grader for UPDATE-without-WHERE case
examples/workbench/supabase-postgres-best-practices/README.md Workbench documentation and run instructions
examples/workbench/supabase-postgres-best-practices/analysis.md Run analysis summary
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/README.md Packaged upstream-change notes
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/before-SKILL.md Packaged before snapshot
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/after-SKILL.md Packaged after snapshot
examples/workbench/supabase-postgres-best-practices/proposed-upstream-changes/supabase-agent-skills/monitor-two-pass-review.md Packaged new upstream reference file
examples/workbench/shadcn-ui/workspace/UserCard.tsx Seeded code-review fixture
examples/workbench/shadcn-ui/workspace/StatusBadge.tsx Seeded code-review fixture
examples/workbench/shadcn-ui/checks/_grader-utils.mjs Shared grader utilities
examples/workbench/shadcn-ui/checks/grade-usercard-findings.mjs UserCard grader
examples/workbench/shadcn-ui/checks/grade-statusbadge-findings.mjs StatusBadge grader
examples/workbench/shadcn-ui/suite.yml New shadcn-ui eval suite
examples/workbench/shadcn-ui/README.md Workbench documentation and run instructions
examples/workbench/shadcn-ui/analysis.md Run analysis summary
examples/workbench/shadcn-ui/proposed-upstream-changes/README.md Packaged upstream-change notes
examples/workbench/firecrawl-build-scrape/workspace/ScrapeService.ts Seeded code-review fixture
examples/workbench/firecrawl-build-scrape/checks/_grader-utils.mjs Shared grader utilities
examples/workbench/firecrawl-build-scrape/checks/grade-scrape-service-findings.mjs ScrapeService grader
examples/workbench/firecrawl-build-scrape/references/firecrawl-build-scrape/node-docs.md Vendored Firecrawl Node docs snapshot
examples/workbench/firecrawl-build-scrape/suite.yml New Firecrawl eval suite
examples/workbench/firecrawl-build-scrape/README.md Workbench documentation and run instructions
examples/workbench/firecrawl-build-scrape/analysis.md Run analysis summary
examples/workbench/firebase-hosting-basics/workspace/firebase-app/firebase.json Seeded config fixture
examples/workbench/firebase-hosting-basics/checks/_grader-utils.mjs Shared grader utilities
examples/workbench/firebase-hosting-basics/checks/grade-firebase-config-findings.mjs firebase.json grader
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/SKILL.md Vendored skill snapshot
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/configuration.md Vendored reference doc
examples/workbench/firebase-hosting-basics/references/firebase-hosting-basics/deploying.md Vendored reference doc
examples/workbench/firebase-hosting-basics/suite.yml New Firebase Hosting eval suite
examples/workbench/firebase-hosting-basics/README.md Workbench documentation and run instructions
examples/workbench/firebase-hosting-basics/analysis.md Run analysis summary
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/README.md Packaged upstream-change notes
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/before-SKILL.md Packaged before snapshot
examples/workbench/firebase-hosting-basics/proposed-upstream-changes/firebase-agent-skills/after-SKILL.md Packaged after snapshot
examples/workbench/agent-browser/suite.yml Expanded agent-browser suite
examples/workbench/agent-browser/references/agent-browser/SKILL.md Vendored skill stub used by workbench
examples/workbench/agent-browser/references/agent-browser/agent-browser-core.md Vendored core workflow reference
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/transitions.txt Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/wikipedia/snapshot-after-search.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/transitions.txt Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signup.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/signin-signup/snapshot-after-signin.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/transitions.txt Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-submitted.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-name-entered.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/multistep-form/snapshot-email-entered.out Recorded snapshot
examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/transitions.txt Recorded snapshot transition data
examples/workbench/agent-browser/references/agent-browser/recordings/blog-article/snapshot.out Recorded snapshot
examples/workbench/agent-browser/checks/grade-navigate-report-findings.mjs Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-screenshot-capture-findings.mjs Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-ref-disambiguation-findings.mjs Behavioral grader updates/additions
examples/workbench/agent-browser/checks/grade-output-correctness-findings.mjs Behavioral grader updates/additions
examples/workbench/agent-browser/checks/_grader-utils.mjs Shared grader utilities
examples/workbench/agent-browser/proposed-upstream-changes/README.md Packaged upstream-change notes
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/before-SKILL.md Packaged before snapshot
examples/workbench/agent-browser/proposed-upstream-changes/vercel-labs-agent-browser/after-SKILL.md Packaged after snapshot
examples/workbench/agent-browser/analysis.md Run analysis summary
docs/pilot-runs/README.md Note about v1.3 orchestrator vs removed wrapper
docs/pilot-runs/upstream-pr-drafts/README.md Update upstream draft index/process
docs/pilot-runs/upstream-pr-drafts/superseded/README.md Archive/superseded draft notes
docs/auto-improve-skill-v1.3-validation.md v1.3 validation notes (needs update for current PR claims)
CLAUDE.md Add pointer to orchestrator skill
.gitignore Ignore new generated artifacts/logs

Comment on lines +44 to +50
## Models

The suite runs a 3-provider mid-tier matrix:

- `openrouter/anthropic/claude-sonnet-4-6`
- `openrouter/openai/gpt-4o-mini`
- `openrouter/google/gemini-2.5-pro`
Comment on lines +35 to +41
## Models

The suite runs a 3-provider mid-tier matrix:

- `openrouter/anthropic/claude-sonnet-4-6`
- `openrouter/openai/gpt-4o-mini`
- `openrouter/google/gemini-2.5-pro`
Comment on lines +35 to +41
## Models

The suite runs a 3-provider mid-tier matrix:

- `openrouter/anthropic/claude-sonnet-4-5`
- `openrouter/openai/gpt-4o-mini`
- `openrouter/google/gemini-2.5-flash`
Comment on lines +1 to +6
# v1.3 validation — deferred to operator

**Date:** 2026-05-12
**Status:** implementation complete, end-to-end orchestrator dispatch
deferred to operator's next session.

Comment on lines +46 to +51
for (const [file, vars] of Object.entries(expectedVars)) {
  const content = readFileSync(`${skillRoot}/prompts/${file}`, 'utf-8');
  for (const v of vars) {
    check(content.includes(`\${${v}}`), `prompts/${file} contains \${${v}}`);
  }
}
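
The excerpt above depends on `readFileSync`, `skillRoot`, `check`, and `expectedVars`, which are defined elsewhere in the file. A minimal self-contained sketch of the same placeholder check follows. It assumes an in-memory map of template contents instead of files on disk. The template file names, placeholder names, and the `findMissingPlaceholders` helper are hypothetical and chosen only for illustration; they are not taken from the PR.

```javascript
// Hypothetical stand-in for the prompt templates the real check reads from disk.
const templates = {
  'orchestrator.md': 'Improve ${SKILL_NAME} inside ${WORKTREE}.',
  'research.md': 'Research ${SKILL_NAME}.',
};

// Hypothetical map of template file -> placeholder names it must contain.
const expectedVars = {
  'orchestrator.md': ['SKILL_NAME', 'WORKTREE'],
  'research.md': ['SKILL_NAME', 'OUTPUT_PATH'],
};

// Return one message per placeholder that is missing from its template.
function findMissingPlaceholders(templates, expectedVars) {
  const missing = [];
  for (const [file, vars] of Object.entries(expectedVars)) {
    // Treat an absent template as empty so every expected placeholder is reported.
    const content = templates[file] ?? '';
    for (const v of vars) {
      // Build the literal placeholder text, e.g. "${SKILL_NAME}".
      if (!content.includes('${' + v + '}')) {
        missing.push(`${file} is missing \${${v}}`);
      }
    }
  }
  return missing;
}

console.log(findMissingPlaceholders(templates, expectedVars));
```

Returning the list of misses rather than asserting on each one keeps the sketch independent of whatever `check` does in the real test file.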
Comment on lines 410 to +414
mkdirSync(referencesDir, { recursive: true });
mkdirSync(workDir, { recursive: true });
mkdirSync(resultsDir, { recursive: true });
chmodSync(workDir, 0o777);
chmodSync(resultsDir, 0o777);