feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52) by Zhaiyuqing2003 · Pull Request #53 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-21T14:43:54Z

Summary

Implementation of the v1.4 skill-optimizer chain architecture proposed in RFC #52. Decomposes the v1.3 monolithic orchestrator subagent into 9 independent chain skills + 1 auto-pilot driver, each dispatching narrow-context subagents that produce human-reviewable artifacts at convention paths.

Draft because B10 (autopilot SKILL.md), plugin metadata updates, and end-to-end validation runs are still pending. Opening now for early review of the architecture as implemented.

What's in this PR

9 chain skill SKILL.md + 1 auto-pilot pending (B10)

#	Skill	Kind	What it does
1	`investigate-functionality`	fresh-derivation	Research what the skill does
2	`investigate-test-case`	maintenance	Design functionalities to test
3	`investigate-submissions` (optional)	fresh-derivation	Research upstream PR conventions
4	`write-tests`	maintenance	Build per-probe workspace + grader + smoke fixtures
5	`validate-tests`	fresh-derivation	Independent semantic check on probes (anti-ducktape gate)
6	`run-bench`	fresh-derivation (summary)	CLI wrapper for `run-suite`
7	`analyze`	fresh-derivation	Cluster failures into named structural weaknesses
8	`improve`	fresh-derivation	Optimizer drafts principled fix
9	`validate`	fresh-derivation	Independent verdict on the improvement
10 (pending)	`autopilot`	chain driver	Walks 1→9 end-to-end

8 subagent prompt templates (Phase C, just completed)

skills/skill-optimizer-subagents/: each chain skill that dispatches a subagent loads the corresponding prompt template at its dispatch step. Each prompt has consistent structure: role, inputs, what-it-sees / what-it-doesn't-see (the anti-ducktape constraints), output spec, reasoning protocol, edge cases, return summary.

4 shared docs

skills/skill-optimizer-shared/:

iteration-protocol.md — iteration mechanics (step kinds, staleness, destructive-edit checkpoints)
subagent-dispatch.md — dispatch architecture (constraints, directives, no-auto-invocation rule)
frontmatter-discipline.md — runtime facts vs history (git replaces version+archive bookkeeping)
workflow.md — operator-facing chain reference (chain table, re-run triggers, backward triggers)
workbench.md — workbench schema (moved from legacy skills/skill-optimizer/references/; needs rewrite, flagged)

Architecture

Filesystem-as-state for the test tree: tests/<functionality>/<probe>/{spec.yaml, workspace/, grader.mjs, smoke/, checks/}. No picked: [] or built_cases: [] arrays — picked: true|false lives per-functionality spec.yaml; built state is implicit (probe folder + grader.mjs present).
Git-native iteration: no version: fields, no archive/ directories. Git already content-addresses every prior state; git log mtime gives staleness signal for auto-pilot.
Two anti-ducktape gates:
- Step 5 (validate-tests) → step 6: can't measure against bad probes.
- Step 7 (analyze) → step 8 / step 9 (validate) → materialize improved-skill/: can't ship a ducktape patch (analyzer must name a weakness with both general principle AND explicit anti-pattern list; validator checks the optimizer's self-check against that list).
Original skill is never modified: vendored-skill/ stays frozen; improved-skill/ accumulates approved improvements via step 9.

Spec + plan docs

Moved to the superpowers convention:

docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md

What's NOT in this PR (deferred)

B10 skill-optimizer-autopilot/SKILL.md — chain driver still pending
Plugin metadata + CLAUDE.md + README updates pointing at v1.4 (separate cleanup PR per spec)
workbench.md rewrite — currently in legacy shape from the v1.3 era; flagged as future work
End-to-end validation runs on real skills (firecrawl regression case, etc.)
Legacy skills/auto-improve-orchestrator/ removal — separate cleanup

Stats

27 files changed, +6129 / -210 lines
9 chain SKILL.md (avg 150 lines each)
8 subagent prompts (avg 166 lines each, 1324 total)
4 shared docs (avg ~100 lines each)
40 commits, all conventional-format

Test plan

This PR is structurally complete but not yet validated end-to-end. Validation pending in a follow-up:

Manual walk-through of the chain on a small local skill — verify each chain skill dispatches correctly, each subagent prompt produces the expected output shape
Re-run firecrawl (v1.3 regression case) under v1.4 — expect optimizer either produces a principled fix OR honestly refuses (no regression shipped)
Auto-pilot smoke test on a local skill (after B10 lands)
Plugin metadata + CLAUDE.md/README pointer updates so the v1.4 chain is discoverable
Cross-platform plugin packaging verification (npx tsx tests/smoke-skill-distribution.ts, npm pack --dry-run)

Reference

Builds on RFC #52: #52 — that PR carries the architectural rationale, evidence base, and concerns-addressed sections. This PR is the implementation.

🤖 Generated with Claude Code

Approved design (brainstormed 2026-05-12) converting the v1.3 monolithic auto-improve-orchestrator into 7 independent Claude Code skills that chain via the superpowers plugin pattern. Motivation: team review of v1.3 PR drafts surfaced 4 critiques — v1.3 optimizes for incremental numerical uplift without validating test case quality, grader correctness, or improvement principledness. The firecrawl iteration regression (1.0 → 0.44 from piling on Recipe A+D simultaneously to chase a small uplift) is the canonical "ducktape-by-monolithic-orchestrator" case study. Key architectural shifts: - Skills (not slash commands) per superpowers convention - Convention-pathed reports at docs/skill-optimizer/<slug>/... (visible + committable, like docs/superpowers/specs/) - Strict limited-context subagents for all generative work (writer, analyzer, optimizer, validator) — prevents tunnel-vision into ducktape patches - Validator subagent after every improvement (internal + optional external consistency) - Auto-pilot = natural chained invocation, not a separate orchestrator Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

User clarification: the agent asks 'optimize for PR submission?' at step 1 (when the user first provides an upstream skill), not at the end of step 7. This decision determines whether step 3 (investigate-submissions) runs at all. Changes: - 'Two contexts, one workflow' rewritten: PR decision at step 1, recorded in 01-functionality.md frontmatter as pr_submission_intent - Step 1 behavior: explicit PR question for upstream sources - Step 2 handoff: reads pr_submission_intent to decide whether to invoke step 3 - Step 3 'skipped when': now keyed on pr_submission_intent: false - Step 7 handoff: three branches (local / upstream-no-PR / upstream-yes-PR), no late prompts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

15 tasks across 5 phases: - Phase A (4 tasks, mechanical): skill dir shells, subagents/refs dirs, v1.3 deprecation, recipes.md seed - Phase B (7 tasks, INTERACTIVE via writing-skills): one per SKILL.md, NOT subagent-driven per user request - Phase C (6 tasks, mechanical): subagent prompt templates with limited-context constraints - Phase D (1 task, mechanical): references/workflow.md chain diagram - Phase E (2 tasks, E2E validation): local skill + firecrawl re-run (the v1.3 regression case) Plan explicitly marks Phase B as not-for-subagent-driven-development; the user explicitly stated 'writing good skills is HARD' and wants interactive creation per skill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ writing-skills Phases B–D produce v1.4's load-bearing artifacts (description-routed SKILL.md files, subagent prompt templates that hold the anti-ducktape constraints, and the workflow reference doc). Per operator direction, none of these should be subagent-driven — "writing good skills is HARD" and "everything to be precise". Tool assignment: - Phase B (7 SKILL.md): skill-creator as outer loop (description-routing + eval iteration), superpowers:writing-skills as inner loop when behavioral compliance issues surface during eval - Phase C (6 subagent prompts): superpowers:writing-skills only — these are compliance documents and the limited-context constraint must hold under adversarial pressure, pure TDD-pressure-scenario territory - Phase D (workflow.md): no skill tool; direct interactive authoring with section-by-section operator review (matches handoffs from B + constraints from C, which may have shifted during interactive iteration) Only Phase A (file scaffolding) and Phase E (real eval runs) sit outside the interactive flow.

Body content for each SKILL.md will be written interactively in Phase B via superpowers:writing-skills, with the interface contract from docs/skill-optimizer-v1.4-spec.md as the brief. This commit just lays down the structural skeleton and discoverable frontmatter.

…se C+D)

Pulled from feat/auto-improve-skill-v1.3:skills/auto-improve-orchestrator/ references/lessons.md (the v1.3 orchestrator never landed on development, so we source from the experimental branch). Adds v1.4-specific header explaining how the analyzer (step 6) and optimizer (step 7) subagents will use this file. Replaces v1.3 "auto-improve-skill" / "Phase 4" framing with v1.4 step-numbering and subagent-naming. Content body is the v1.3 Recipe A-E + grader patterns G1-G6 + run-record protocol verbatim — that's the cumulative knowledge v1.4 inherits.

…ts dir, move philosophy to docs/ Changes to the v1.4 plan-as-executed, all in one place: - Move skill-writing-philosophy.md from skills/references/ to docs/. The philosophy doc is contributor-facing — we'll distill the load-bearing rules into the optimizer subagent's prompt directly rather than have it load this doc at runtime. - Delete skills/references/recipes.md (and the now-empty references/ dir). The raw seed copied from v1.3 lessons.md is case-study-shaped (accumulated per-pilot observations), which contradicts the "generalize-from-feedback" principle in the philosophy doc. The curated abstract-pattern version needs real end-to-end observations to ground it in — deferred to a follow-up after the chain ships. - Rename skills/subagents/ → skills/skill-optimizer-subagents/ so the dir name is scoped to this plugin and can't collide with subagents that other plugins might ship under a generic name. - Add a "Bootstrapping limit" section to the philosophy doc. The skill-optimizer chain runs an empirical loop on target skills, but that loop can't validate itself — same shape as Thompson's "Reflections on Trusting Trust". The seven meta-skills get authored from philosophy + best judgment; their test is Phase-E real-world runs, not eval data for the meta-skills themselves. - Spec + plan updated: file tree, Task A2 (revised), Task A3 (skipped), Task A4 (deferred), Phase C task file paths (subagents path rename), Phase D location (workflow.md goes to docs/ rather than skills/references/), acceptance criteria #4 (recipes.md deferred), coexistence section (v1.3 orchestrator never landed on development), open questions (recipes.md location TBD on real production). - Add note about post-v1.4 cleanup of the original skills/skill-optimizer/ skill (now redundant with the chain's skill-optimizer-run-bench step). Deferred to its own PR because it touches the public plugin API.

…sification taxonomy + local-with-PR-intent path Two changes to the B1 draft, plus a matching spec sync: 1. Classification: replaced the ad-hoc closed list (code-reviewer | document-producer | tool-use | code-patterns | other) with the v3 taxonomy actually used in the prioritization work — tool-use, code-patterns, document, prose-guidance, meta, interactive — and explicitly grants the subagent sovereignty to write a short descriptive label of its own when none fits, rather than collapsing to "other". A specific label gives downstream steps a real handle to work with. 2. Local-skill PR intent: the prior rule was "local = always pr_submission_intent: false". Revised to default false but treat PR-intent as live when the user explicitly says they want to send the local skill back upstream — in which case we capture the upstream guidelines location (URL / CONTRIBUTING.md path / Slack channel) into the report body as a "PR submission notes" subsection that step 3 (investigate-submissions) reads as its starting point. Also: while reviewing, dropped the v1.3/v1.4 framing from the "Why limited-context dispatch matters" section (version is historical metadata, not skill-functional content) and updated subagent path links to the renamed skill-optimizer-subagents/ directory. Spec §"The 7 skills" #1 step 2 + step 6 (classification field description) updated to match.

…r B2 Spec changes: - New ## Iteration patterns section between ## Subagent constraints and ## The skills. Covers: backtrack-trigger table, per-report versioning mechanism (version + inputs frontmatter, with direct- upstream-only staleness checks), re-entry contract with the ${OPERATOR_DIRECTIVES} slot, latest-plus-archive convention, and the transitive-staleness policy (user judgment trusted; auto-pilot re-runs on direct-upstream version mismatch). - ## The 7 skills → ## The skills. Added one-line "Iteration behavior" notes to each of the existing seven subsections. - Step 2 (investigate-test-case) now dispatches a test-case-designer subagent rather than running in the operator session. Reasoning: cross-iteration contamination — on iter 2+ the operator has seen prior failures and biases test selection toward "what just failed" instead of comprehensive coverage. Isolated subagent fixes this. - New ### 8. skill-optimizer-autopilot subsection. Walks 1→7, consults version mechanism per step (skip-or-dispatch), applies defaults for the three human-gate points (B1 PR-intent flag, B2 pick-top-N, B7 log-and-exit on validator-rejected), bounds at max-iterations-per-step. Replaces the old "## Auto-pilot mode" section (which said "no separate skill, just chained invocation"). - Subagent constraints table updated: test-case-designer added; every reasoning subagent's "Does NOT see" column now includes prior drafts of its own output; every Sees column includes ${OPERATOR_DIRECTIVES}. - Acceptance criteria expanded to 9 items (was 7): new #1 covers 8 SKILL.md files, new #5 covers iteration-mechanism E2E, new #8 covers auto-pilot smoke test. Plan changes: - Phase B count 7 → 8 tasks. New Task B8 (skill-optimizer-autopilot SKILL.md) added with its own A1-equivalent dir-creation step inline (the dir wasn't part of Phase A scope). - Phase C count 6 → 7 tasks. New Task C1b (test-case-designer subagent prompt) inserted between C1 and C2 with a full draft template; suffixed "1b" rather than renumbering existing C2-C6 to keep cross-references stable. - Phase C section header adds a preamble explaining the ${OPERATOR_DIRECTIVES} requirement that applies to every reasoning-subagent prompt template.

…h iteration patterns Five updates to match the spec's new iteration mechanism (§"Iteration patterns" added in commit 0009066): 1. Report frontmatter now includes `version` field. Step 1 has no upstream reports, so `inputs:` is omitted; downstream steps will add it as they're authored. 2. New workflow step 5 — "Handle iteration: archive prior version + collect directives." Checks for existing report at the canonical path; if present, moves it to archive/01-functionality-v<N>.md and bumps version. Also where the operator pre-digests cross-iteration learnings into the OPERATOR_DIRECTIVES bulleted list (atomic new requirements, not a context dump). 3. Step 6 (renumbered from old step 5) dispatches with two new templated inputs: ${VERSION} and ${OPERATOR_DIRECTIVES}. The subagent's "does NOT see" list explicitly includes prior 01-functionality.md drafts under archive/ to prevent the re-derivation from being contaminated by its own past output. 4. New "## Iteration behavior" section after "## Edge cases". Covers: when to re-run, what happens to vendored-skill/ on re-run, the archive convention, and how downstream cascade self-corrects via the version-mismatch check on next invocation. 5. Edge-case for "vendored-skill/ already exists" tightened: default to reuse (same source), re-fetch only on source URL change or explicit user ask. (Previously asked the user every time.) Sets the template for B2-B8.

Authoring the same ~40 lines of archive + version-bump + directives- collection logic into seven chain skills (B1-B7) would mean ~280 duplicated lines. Factor the mechanics into a single shared operational reference that every chain skill loads explicitly. New file: skills/skill-optimizer-shared/iteration-protocol.md (187 lines). Covers: versioning convention, the per-invocation decision tree (no existing report → v1; existing + inputs match → current; existing + mismatch → archive + bump), collecting OPERATOR_DIRECTIVES (atomic new requirements, not context dumps), subagent constraints under iteration (ignore own prior output, read upstream at latest), cascading staleness policy (direct-upstream-only checks), archive table, bootstrapping case, and what the protocol explicitly does NOT cover (bench-results timestamping, auto-pilot summaries, vendored-skill cache). B1 changes: - Step 5 stripped from ~25 lines of inline mechanics to ~10 lines with an explicit "Read this file now" instruction pointing at the protocol doc. The "Read this now" framing is load-bearing — the whole point is that the agent must consult the protocol, not improvise. - "Iteration behavior" section trimmed from ~25 lines of general mechanics to ~10 lines listing only step-1-specific re-run triggers, with a pointer back to the protocol for the general mechanics. - Net: 225 → 208 lines. This sets the pattern for B2-B8: each chain skill's "Handle iteration" step will be a brief pointer-and-step-specific-notes combo, with the heavy mechanics centralized. Spec changes: - Architecture overview file tree: skill-optimizer-shared/ added. - Acceptance criterion 2b added: shared protocol doc exists and is referenced explicitly from each chain skill's iteration step. Plan changes: - File tree updated (skill-optimizer-shared/ + autopilot/ stub). - New Task A5 added: author iteration-protocol.md. Documented as "as-executed (added mid-execution)" because it emerged from B1's revision rather than the initial plan.

…sophy Three issues that the protocol doc was a load-bearing reference for — so they would have propagated into every chain skill's invocation — all caught on read-through: 1. "B1-B7" was project-internal jargon (those are plan task labels, not part of the skill vocabulary). Replaced with "every skill in the chain". The shipped doc should be timeless; task labels live in the plan, not the protocol. 2. The "On every chain-skill invocation" section used an ASCII box- drawing decision tree. Per Anthropic and writing-skills guidance ("use flowcharts ONLY for non-obvious decision points"), this logic is straightforward conditional flow — bullets carry it more cleanly and match standard markdown rendering. Rewritten as nested bullets. 3. The "Bad examples" of operator directives were marked with ❌. Per the global instruction "Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked", these don't belong. Replaced with explicit "Examples that count" / "Examples that do NOT count" section labels. Net: 187 → 180 lines.

…mbering + front-load load-bearing context Two related fixes: 1. Internal workflow step numbering (1-7) collided with chain step numbering (1-7 referring to other skills in the chain). Same word, two meanings — an agent reading "step 5" could plausibly think either "this skill's Handle-iteration step" or "the run-bench skill in the chain". Relabeled internal workflow steps to (a) through (g) so they're visually distinct from chain step references. Cross-references inside the doc updated accordingly. A note in the new "Before you start" section declares the convention explicitly so future readers don't have to infer it. 2. The "Read iteration-protocol.md" instruction was buried at internal step (e) — middle of a 7-step workflow. An agent reading sequentially might glance past it once they're in execution momentum. Added a "## Before you start" section right after the intro paragraph, listing the two load-bearing things to know up front: (1) read the iteration protocol now, (2) you will dispatch a subagent — you don't do the research yourself. Step (e) still requires reading the protocol — the "Before you start" section primes the agent so step (e) becomes reinforcement rather than first contact. 229 lines total (up from 208; "Before you start" earns its place as the discipline frame). Sets the pattern for B2-B8 — each will similarly relabel its workflow steps with (a)-(g) and front-load the two load-bearing context items.

…visibility moved from step 4-blocked to step 4-allowed Two coupled changes — B2's SKILL.md draft and a spec update that makes the source-visibility split consistent across the chain. Spec changes (Subagent constraints table): - Step 2 (test-case designer) — explicitly added "skill source content" to "does NOT see". Coverage design happens at the responsibility level here; concrete fixture details enter at step 4. Without this constraint, designers could gerrymander test cases around the source's literal phrasing instead of reasoning from stated responsibilities. - Step 4 (test writer) — removed "the skill's content" from "does NOT see"; added "skill source content" to "sees". Updated the rationale: fixture writing needs concrete patterns and violation examples, which come from source. The per-case test spec from step 2 constrains what the fixture should test, so the gerrymandering risk is bounded. Still blocks other test cases (prevents copying across the suite) and grader internals (prevents grader-leak hacking). Architecture: source enters the chain at step 1 (research), exits at step 2 (responsibility-level design needs no source), enters at step 4 (fixture writing needs source detail), stays present for step 6 (analyze failures) and step 7 (optimize/validate). The spec's step-4 body description updated to match. B2 SKILL.md: - New file, follows B1's template: front-loaded "Before you start" section, lettered workflow steps (a)-(f), bold discipline markers at action sites, why-this-matters rationale, edge cases, iteration behavior section. - 6 internal steps: (a) confirm prerequisites, (b) handle iteration, (c) dispatch designer subagent, (d) confirm subagent output, (e) user gate (present + collect picks), (f) conditional handoff based on pr_submission_intent. - "picked" frontmatter field is the operator's responsibility — subagent writes proposal with picked: []; operator fills picked after user gate. - Three-response handling in user gate: pick subset/all, ask for revision, pick zero. - 215 lines, description 479 chars.

Step 3 of the chain — OPTIONAL, runs only when 01-functionality.md has pr_submission_intent: true. Researches the upstream repo's PR conventions and writes 03-submissions.md as the verbatim-pastable context block the validator (step 7) uses for external consistency. Follows the B1/B2 template: - Front-loaded "Before you start" with iteration-protocol pointer + dispatch discipline + step-numbering convention - Frontmatter with version, inputs.step_1_functionality, plus step-specific fields: upstream_repo, upstream_branch_target, license, requires_cla - 5 lettered workflow steps: (a) confirm prerequisites including pr_submission_intent gate, (b) handle iteration, (c) dispatch subagent, (d) confirm subagent output with blocker flagging, (e) hand off - Why limited-context dispatch matters: rationale specific to step 3 — the validator's external consistency check depends on the submissions report being neutral upstream facts, not advocacy for the proposed change - Edge cases: private repos, non-GitHub hosts, empty PR history, copyleft license - Iteration behavior: explicitly note this rarely re-runs; upstream conventions change slowly 235 lines, description 578 chars.

…otocol + apply in B2 A chain skill never invokes another chain skill on its own. When a step finds its upstream input unsatisfactory (thin functionality report, too-easy bench, missing test case, wrong-target PR conventions), it surfaces the finding and stops — the user (or auto-pilot driver) decides whether to re-run upstream, accept the situation, or abandon. This was implicit in the architecture but not stated as a rule. Adding it explicitly to iteration-protocol.md as a new section between "Cascading staleness" and "Archive convention". Same section covers the forward-handoff exception (those are normal chain flow, not backward triggers). Updated B2's edge case for "Subagent's proposal has fewer cases than expected" — was "Either re-run step 1 with directives or accept the small proposal", now "Surface this to the user with two options ... Do not re-invoke step 1 yourself — re-runs require an active signal from the user", with reference back to the protocol doc's new section. Audit: only B2 had a wording suggesting backward auto-trigger; B1 and B3 don't suggest re-running upstream steps. Same rule applies to all future chain skills (B4-B8) — the protocol doc is the chain-wide source of truth.

…eview 1. upstream_branch_target placeholder syntax. Was "<main | next | other>" which reads as a closed enum. The actual value is the literal branch name (could be `develop`, `release-2024`, etc.), so the placeholder should describe the field meaning, not enumerate three literal options. Now: "<branch name — usually `main`, sometimes `next` for new-skill repos, or a specific release branch>". 2. Removed the "frontmatter spec includes fields not currently in the vendored skill's frontmatter" blocker. It described a real fact in the report (upstream uses fields the vendored skill lacks) but isn't a blocker — the optimizer (step 7) reads 03-submissions.md and would add missing fields naturally. Doesn't need a separate operator alert. 3. Removed the "License is GPL or other copyleft" blocker. Copyleft licenses don't mechanically block PR submission — the upstream is whatever-licensed; a PR becomes part of that codebase under that license. Organization-policy concerns about contributing to copyleft projects exist but are contributor-side decisions, not chain-level blockers. Step (d) reframed from "flag these blockers" to "verify file + only the CLA fact needs explicit mention since it requires operator-side work outside the chain". Other frontmatter fields (license, repo, branch target) are facts downstream steps consume directly. Edge case for "license is copyleft" replaced with one for unusual frontmatter conventions — that's the actual blocker shape (subagent can't extract a consistent spec, so the optimizer has to make a judgment call).

Per operator review: B2's re-run was modeled as fresh derivation, which loses continuity. The user's `picked` choices and existing case names should survive a re-run; otherwise the user has to re-pick everything every time they add coverage. The "archive as inert audit trail" rule means continuity has to come from the current canonical file, not from peeking at archived prior versions. New concept: chain steps come in two kinds. - Fresh-derivation steps (1, 3, 6, 7): subagent never sees its own step's prior or current file. The "anti-ducktape" rule applies in full — optimizer must not see prior attempts, analyzer must not see prior analyses. - Maintenance steps (2, 4): subagent reads the current canonical file as load-bearing input and produces an extended version of it. Existing entries the user has invested in (picks, manually-added cases) are preserved unless directives explicitly say to revise. Archive still happens for audit; archive is still inert. iteration-protocol changes: - New "Step kinds: fresh-derivation vs maintenance" section classifying each step. - "On every chain-skill invocation" decision tree updated to branch on step kind: fresh-derivation copies-to-archive then derives from scratch; maintenance copies-to-archive then dispatches with current file as input. - "Subagent constraints under iteration" restructured to make the third rule kind-dependent (do/don't read your own canonical file). B2 changes: - Step (b) Handle iteration: explicitly invokes the maintenance pattern. Notes the special case where step 1's version bumped (responsibility set changed) — user decides whether to extend or start fresh by deleting the canonical file before dispatch. - Step (c) Dispatch: new templated input ${EXISTING_CASES_PATH} (empty on iteration 1, current 02-test-case.md on re-runs). Subagent's "sees" list adds the current file with explicit preserve-existing semantics. "Does NOT see" list now correctly excludes archive only (was incorrectly excluding all prior drafts). - Step (e) User gate: response #2 "user wants additions or revisions" now reflects maintenance — user does not need to re-pick everything since existing picks are preserved. Spec changes: - Subagent constraints table: test-case-designer row updated to reflect maintenance pattern. Sees-list adds "current 02-test-case.md (when extending in maintenance mode)"; does-NOT-see-list correctly scoped to "archived prior drafts" only; why-rationale updated. Note: B4 (write-tests) when drafted will also follow the maintenance pattern (workbench/ accumulates per-case files). B6 and B7 stay as fresh-derivation — that's the load-bearing anti-ducktape constraint.

Step 4 of the chain — takes the picked cases from 02-test-case.md and dispatches test-writer subagents in parallel (one per case) to build concrete workspace files + graders. Runs a smoke check against hand-crafted GOOD/BAD/EMPTY fixtures before declaring done. Follows the established template (B1/B2/B3 shape) with several B4-specific additions: - **Maintenance step** (per the recent iteration-protocol update). workbench/ accumulates as new picks get built. The (b) iteration step diffs picked vs prior built_cases and only dispatches test-writers for NEW picks or for cases that directives flag for revision. Existing builds are preserved verbatim. - **Parallel per-case dispatch** in step (d). All test-writers emitted in a single message so they run concurrently. Each sees only its case spec + 01-functionality.md + skill source; does NOT see other cases, other graders, prior failures, or anything under archive/. The per-case isolation prevents both cross-fixture homogenization and grader-leak hacking. - **Source content access** (per the recent spec update). Test writers are the one step where the skill source is in-scope for a generative subagent. Bounded by the per-case spec from step 2. - **Smoke check** (step (e)). New responsibility not in other steps. Verifies each grader against GOOD/BAD/EMPTY fixtures the test-writer also produces. last_smoke_check_passed in frontmatter is only set when all graders pass. - **Two outputs**: 04-tests-plan.md (the meta-report) AND the workbench/ directory (the artifacts the run-bench step executes against). Maintenance protocol covers both via archive/04-tests-plan-vN.md and archive/workbench-vN/. - **User-gate at step (c)** for the planned workbench structure before dispatching. Three response cases (approve / add revisions via directives / reject and abandon-or-replan). - **6 internal steps**: (a) confirm prerequisites, (b) handle iteration with diff logic, (c) plan + user gate, (d) parallel dispatch, (e) smoke check, (f) assemble + commit + hand off. 350 lines, description 492 chars. Larger than B1/B2/B3 because the extra responsibilities (parallel dispatch, smoke check, maintenance diff) each earn their lines.

…o centralized script exists The prior wording referenced a path that didn't exist (`skills/skill-optimizer/references/scripts/smoke-check.mjs`). The actual codebase doesn't have a centralized smoke-check runner — each existing workbench has its own `checks/smoke-graders.mjs` specific to its graders (see e.g. examples/workbench/agent-browser/checks/smoke-graders.mjs). Rewrote step (e) to be honest about this: - Smoke check is per-workbench, not centralized. - The test-writer subagent produces the smoke-check artifact alongside its grader. Shape follows the workbench schema docs. - Two reasonable shapes described (per-case runner vs workbench-level aggregator); the test-writer prompt template (Phase C) will pick one and apply consistently. - Reference example pointed at the actual existing agent-browser/checks/smoke-graders.mjs. Conceptual contract unchanged (GOOD passes, BAD/EMPTY fail); only the execution mechanics corrected.

Step 5 of the skill-optimizer chain. Thin operator-driven CLI step — no subagent dispatch. Two outputs: timestamped raw bench data under 05-bench-results/<ts>/ (preserved naturally, exempt from the standard archive flow) and a versioned 05-bench-summary.md (follows the iteration protocol; archive-and-version on re-run). The summary is the entry point step 6 reads: per-model and per-case pass rates plus a failed-case pointer list into the raw output. The analyzer subagent at step 6 walks that list for trace and findings detail; the summary itself stays short. Handoff branches on overall pass rate: failures route to analyze-result; all-pass surfaces the choice to the user (accept that the picked set didn't expose a weakness, or re-run step 2 with a "make it harder" directive). Chain skills don't auto-invoke.

Two related coordination changes across the chain, plus a deferred limitation note. Rename convention for revised cases at step 2: when a directive asks to revise an existing case (rather than add a new one), the test-case-designer subagent appends a version suffix — `case-x` → `case-x-v2` → `case-x-v3`. Bare name is implicit v1. This gives step 4's diff logic a deterministic signal that the revised case needs a fresh test-writer dispatch (without the rename, the name-match would say "already built" and skip rebuilding the case whose spec actually changed). Encoded in B2's user-gate step and referenced from B4's diff logic so the v-suffixed names are not surprising downstream. Partial re-bench at step 5: deferred. CLI's run-suite does not accept a case filter, so step 5 always re-measures the full workbench. Documented as a known limitation with a roadmap pointer. Operator escape hatch (manual run-case + splice into 05-bench-results/<ts>/) noted as outside the chain.

Replace the version-field + archive-folder iteration model with git as the history mechanism and the filesystem itself as the current state. The prior protocol was reimplementing git in frontmatter — version: ints, inputs.step_N: lineage tracking, archive/<NN>-name-v<N>.md copies — and adding accidental complexity across the chain. Iteration-protocol rewrite: - Drop version field, archive/ subdir, inputs.step_N int tracking - Staleness detection: git log -1 --format=%ct mtime comparison - Two step kinds preserved (fresh-derivation vs maintenance), with the constraint reframed in terms of "does the subagent read its own canonical file" — fresh-derivation says no (anti-ducktape), maintenance says yes (filesystem IS the state) - Subagent constraint added: don't walk git history of any tree file — for fresh-derivation this is the anti-ducktape guarantee, for maintenance this prevents reasoning from prior states - Safe destructive edits: operator session commits a checkpoint before maintenance-step rebuilds so git history has a clean before/after breakpoint Spec doc updates: - State layout: tests/<functionality>/<test>/{spec.yaml, workspace, grader, smoke} replaces 02-test-case.md + 04-tests-plan.md + workbench/. Filesystem-as-state — no picked: [] or built_cases: [] arrays anywhere - B2 output: 00-test-proposals.md (audit) + tests/<func>/spec.yaml per functionality with picked: true|false in each - B4 output: tests/<func>/<test>/ probe folders + generated tests/suite.yml. Per-probe parallel test-writer dispatch - B5 input: tests/suite.yml. Two outputs: timestamped raw + single-canonical 05-bench-summary.md - Subagent constraints table updated for every reasoning subagent to reflect git-history-off-limits rule - Per-step iteration-behavior sections rewritten for the new model Undoes the -v2 rename convention added 30 min ago (no longer needed — filesystem-as-state means revising a spec.yaml in place is naturally detected, and B4 can just rebuild on directive without any special name mangling).

Apply the filesystem-as-state + git-native iteration redesign across all five existing chain SKILL.md files. Drops version: frontmatter fields and archive/ directory references everywhere; reframes the subagent constraints in terms of "don't read own canonical / git history" (fresh-derivation) vs "do read current tree as state" (maintenance). B1 (investigate-functionality, fresh-derivation): drop version field; subagent constraint becomes "don't read own canonical or git history of it" instead of "don't read archive/". B2 (investigate-test-case, maintenance — major rewrite): output shape changed entirely. Was a single 02-test-case.md with picked: [] frontmatter array; now produces 00-test-proposals.md (one-time audit report) plus tests/<functionality>/spec.yaml per proposed functionality, each with picked: true|false in its own frontmatter. User gate is "edit picked: in each spec.yaml" rather than "tell me which names to pick". Subagent reads existing tests/ tree as load-bearing state per the maintenance rule. B3 (investigate-submissions, fresh-derivation): drop version and inputs.step_1_functionality fields; staleness now via git mtime against 01-functionality.md. B4 (write-tests, maintenance — major rewrite): replaced 04-tests-plan.md + workbench/ with tests/<functionality>/<test>/ probe folders. State is implicit: probe folder + grader.mjs present = built; no built_cases: [] array. Per-probe parallel test-writer dispatch; one probe = one test-writer subagent. Step generates tests/suite.yml from picked-functionality probes. Destructive-edit checkpoint pattern: operator commits before rebuilding existing probes. B5 (run-bench, fresh-derivation summary + timestamped raw): drop version and inputs.step_4_tests fields; reads tests/suite.yml as input. Summary 05-bench-summary.md is single-canonical with prior state in git; raw 05-bench-results/<ts>/ stays naturally accumulated and outside the protocol. Net: less bookkeeping, simpler mental model, fewer ways to get state out of sync. Filesystem IS the state across the chain.

Two wording fixes to the fresh-derivation subagent constraint: 1. The prior wording "no reference to what was produced before" read as forbidding even the directive mechanism — which is wrong. The operator session DOES read prior outputs (that's part of its job between iterations) and distills lessons into atomic new requirements. The subagent then satisfies those distilled requirements as fresh constraints without seeing the raw prior content. This separation is what keeps the new derivation from rationalizing the prior one while still letting the chain converge across iterations. 2. Step 7 has a specific carve-out worth stating explicitly: the SKILL itself (the improvement target) is upstream input, not the subagent's own canonical. The optimizer reads the current skill — which may include modifications from prior step-7 runs — and proposes new improvements on top. The "own canonical" that's off-limits is 07-improvement-proposal.md (the reasoning report), not the skill file. The skill accumulates improvements across iterations; the proposal reports do not. Triggered by review of the prior wording on iteration-protocol.md.

Replace the descriptive "there is no version: field" wording with a prescriptive "do not add version-tracking metadata" rule plus a practical decision aid for future SKILL.md authors: When in doubt about whether a field belongs: ask whether a chain skill needs to READ it to do its job right now (yes -> keep), or whether you're recording it for future-debugging / future-audit purposes (no -> that's git's job). The prior wording could be read as describing the current state without prohibiting reintroduction. The new wording makes the prohibition explicit so B6/B7/B8 (yet to be drafted) don't accidentally bring back version-tracking metadata.

B6 (analyze-result): the chain's anti-ducktape gate. Fresh-derivation step. Dispatches an analyzer subagent that reads bench summary + raw trial data + probe specs + skill content — but NOT the test inputs themselves (forces principle-thinking over solution-thinking). Output: 06-analysis.md with has_structural_weakness: true|false in frontmatter and per-weakness sections containing Pattern, Hypothesized cause, Connects to skill section, What WOULD address this (general principle), and What WOULD NOT address this (the explicit anti-ducktape list step 7's optimizer must reckon with). Honest refusal is built in: if no weakness can be articulated, the report says so and has_structural_weakness: false, which step 7 will refuse to fire on. Forced weakness-naming when the analyzer found nothing is the ducktape failure mode this step exists to prevent. B7 (improve-skill): the terminal generative step. Two subagents (optimizer + validator), both fresh-derivation. Refuses to fire if 06-analysis has has_structural_weakness: false (the anti-ducktape gate's downstream half). Optimizer: reads 06-analysis, 01-functionality, current skill, 03-submissions (if PR-bound). Does NOT see raw trials, grader internals, test inputs, or prior proposals. Must apply the analyzer's general principle and self-check against the anti-pattern list explicitly. Validator: reads skill BEFORE + AFTER + 01-functionality + the proposal artifact + 03-submissions (if PR-bound). Internal consistency + external consistency checks. Does NOT see the optimizer's reasoning trace, prior verdicts, or raw trial data. Bounded loop: max 2 revision rounds. If validator still says needs-revision after round 2, surface honestly with three realistic paths; do not loop indefinitely. Three handoff branches: local skill (modify in place), upstream + PR=false (modify vendored copy), upstream + PR=true (write 07-pr-draft.md with operator-steps-to-submit; the chain does NOT submit the PR). Outputs three or four artifacts: 07-improvement-proposal.md, 07-validator-verdict.md, the modified skill file (on approve), and 07-pr-draft.md (PR-bound + approve). Frontmatter on both reports carries runtime-relevant facts only (verdict, addresses_weaknesses, diff_target) per the iteration protocol's discipline rule. Both files follow the established structural template (front-loaded "Before you start", lettered workflow steps, edge cases, iteration behavior section). Pending user review of B1-B7 before B8 and the subagent prompt templates.

…rompt B6 SKILL.md was carrying a full markdown body template (per-weakness section template with placeholders, non-structural noise example, honest-refusal wording verbatim). That's the subagent's concern — the subagent writes the body per its prompt template; the operator session reads only the frontmatter for handoff branching. Keep in the SKILL.md: - Frontmatter contract (operator reads has_structural_weakness for step 7's gate) - Enumeration of the five required parts per weakness entry (the operator session verifies these in step (d)) - The architecture-level rationale on why the anti-pattern list is load-bearing (this is design-decision content, not subagent-side prose — explains WHY the constraint exists for future authors) - Pointer to the subagent prompt template for the full template + reasoning protocol Same audit pass on B1/B2/B4/B7: their What-you-produce sections show structural contracts (frontmatter schemas, file-tree shape) that the operator session actually reads, not narrative body templates the subagent fills in — so they stay as-is.

Two architectural fixes per review: 1. Drop PR draft from B7. PR composition is a separate downstream concern that consumes B7's proposal + 03-submissions.md; the auto-pilot (step 8) or a dedicated composer can handle it if pr_submission_intent: true. B7's job is just to improve the skill — packaging it as a PR is not what improve-skill does. Removed: 07-pr-draft.md as an artifact, the three-branch handoff (local / upstream + PR=false / upstream + PR=true), the operator-steps-to-submit checklist. Collapses to a single handoff message regardless of PR intent. 2. Never modify the original skill. The improved version lives at docs/skill-optimizer/<slug>/improved-skill/ — a separate location that accumulates improvements across iterations. The vendored upstream copy stays frozen; the local source file stays untouched. Git tracks improved-skill/ history. On re-run, the optimizer reads improved-skill/ if it exists (the accumulated state) and proposes the next improvement on top; iteration 1 reads the original source instead. Original is always recoverable; improved evolves under git. Removed: "Written in place for local skills" / "Written to vendored-skill/ for upstream skills" — both wrong now. Updated B7's "Before you start" carve-out summary, step (c) and (d) input descriptions (SKILL_CURRENT_PATH replaces SKILL_SOURCE_PATH; validator BEFORE = optimizer's input), step (f) write outputs (materialize improved-skill/; don't touch source), step (g) handoff (single message). Iteration-protocol's step-7 carve-out updated to match (improved-skill/ is the accumulated state; source stays frozen). Spec doc layout adds improved-skill/ alongside vendored-skill/; B7 section rewritten; subagent constraints table rows for optimizer + validator updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ildTrialSummary

…estration

…st-side now)

Extend TrialAggregate / WorkbenchModelAggregateResult with totalTokens and totalDurationMs, sourced from per-trial result.metrics.tokens.total and result.metrics.durationMs. Missing values fall back to 0 so legacy result.json files remain backward-compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… results Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ts pass through Native ACP agents (claude-agent-acp, codex-acp, gemini, opencode) accept their own model refs (claude-haiku-4-5-20251001, gpt-5-mini, etc.). The OpenRouter-only guard now applies only when agent is pi-acp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Subscription auth files (~/.claude/.credentials.json, ~/.codex/auth.json, etc.) are typically mode 600 on the host. After cpSync into the staging dir that becomes /home/agent in the container, the container's agent user (uid 10001) can't read them. Loosen mode to 644 on the staged copies so the in-container agent can read. Host files unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Baseline is currently a placeholder (captureStatus: placeholder); the test skips when OPENROUTER_API_KEY is unset or when the baseline hasn't been replaced with real captured rates. Once a real run is performed, replace the JSON with the actual perCasePassRate and remove the captureStatus field to enable the check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rmat + agent column - test-writer: do-not-pre-trigger-skill guidance for task prompts - analyzer: read-first pointer to acp-trace-format.md - run-bench: per-(agent + model) summary; surface tokens/duration; drop cost; new image+Dockerfile names Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Architecture summary: host-side ACP client driving per-trial Docker containers - Important files: src/workbench/acp/, src/workbench/agents/, parse-trace.ts - Invariants: agent: required on case.yml, runs: matrix on suite.yml, pi-acp-only openrouter/ prefix - Auth matrix per agent - Docker image+Dockerfile name updated to skill-optimizer-agent:local Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e dirs Three intertwined fixes needed for claude-agent-acp (and codex-acp) subscription auth to work end-to-end: 1. Extract the OAuth bearer from the credentials JSON and pass it via ANTHROPIC_AUTH_TOKEN (claude) / OPENAI_API_KEY (codex). The respective underlying SDKs (@anthropic-ai/claude-agent-sdk and codex-acp's analogue) read env vars, not the credentials file directly. New SubscriptionEnvExtract type lets the registry declare these mappings. 2. Pre-create directories the agent's runtime expects under the staged $HOME. @anthropic-ai/claude-agent-sdk writes debug logs to ~/.claude/debug/<uuid>.txt; session/new fails with "Query closed before response received" if that dir doesn't exist. Use the registry's existing homeDirs field to declare these. 3. Make staged $HOME and its declared subdirs world-writable (mode 0o777). The container's agent user (uid 10001) does not match the host user that owns the staging dir, so without loosened perms the SDK can't write its own session/debug files. Verified: claude-agent-acp smoke probe now completes a real initialize→newSession→prompt cycle and writes /work/out.txt in ~15s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sion Without this, the model: field in case.yml was silently ignored — the agent used its default. Now passes the case's model string to the agent's session when supportsAcpSetModel is true on the agent's registry entry. Errors are re-thrown with the agent name + bad model string so case authors can tell they used an unrecognized ID. Verified with model: opus[1m] on claude-agent-acp; tokens recorded, stopReason end_turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… aliases Closes two spec-satisfaction gaps from the post-implementation audit: - pi-acp now declares OPENROUTER_API_KEY in requiresEnv so resolveAuth fails loudly preflight instead of letting the agent crash mid-trial. opencode keeps requiresEnv: [] intentionally — it's multi-provider and the loginHint guides the user. - AGENT_ALIASES drops 'gemini' (self-pointing) and 'openclaw' (points at non-existent agent). Spec's alias table is canonical: claude, codex, pi.

Fulfills the v1.4 acceptance criterion #8 that was never built. Defines a single user-invocable SKILL.md at skills/autopilot/ that walks the 9-step chain end-to-end on one skill. Defaults to fully automated with optional per-gate breakpoints chosen at startup. Replaces v1.4 §10's single-cycle retry model with a strategy-menu per gate (cheap → expensive); the pilot picks the cheapest unattempted strategy on each gate firing. Caps are per-gate (default 2) + global (default 8); cap exhaustion is an honest exit, not a failure. State lives in a per-run journal (gitignored), promoted to a tracked autopilot-summary-<ts>.md at end-of-run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `skills/autopilot/SKILL.md` as the 10th skill — the chain driver that walks steps 1→9 end-to-end on a single skill. Fully automated by default; operator picks optional per-gate breakpoints at startup. Honest-exits when caps exhaust (per-gate=2, global=8) or step 9 approves. Long static content lives in agents/ so the SKILL.md stays operational: interview-prompt.md (startup UX), recovery-menus.md (strategy menus consulted at each gate firing), surfacing-prompt.md (operator-facing gate prompt format), summary-template.md (end-of-run audit report shape). Plugin metadata + smoke tests updated. README/CONTRIBUTING mention the new driver. Fixed a pre-existing step 2↔3 classification swap in shared/subagent-dispatch.md that this work surfaced. Fulfills v1.4 acceptance criterion #8 (auto-pilot smoke test) — the last criterion from the v1.4 design that was never built. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skills/shared/workbench.md is significantly refreshed to match current truth: drops stale --model/--models flags, replaces models:[...] with runs:[{agent,model}], adds agent: as required on cases, fixes the default image name. Adds a new Per-Agent Model Strings section listing the model-naming convention each agent CLI accepts plus the env vars they require — closes the trace-friction where Claude wrote 'claude-opus-4-7-1m' and the CLI rejected it (valid: opus[1m]). src/workbench/docker-runner.ts: two changes. - copyCaseSupportDirs() now enumerates the case dir dynamically with an exclusion set (references/, node_modules/, hidden dirs), instead of a hardcoded ['checks','fixtures','bin','workspace','mcp'] list. Case authors can add seeds/, data/, helpers/ etc. without editing the runner. - Pre-removal docker exec --user 0 chmod -R 0777 /home/agent /work before docker rm -f, so the host's rmSync(tempDir) doesn't trip EACCES on uid-10001-owned subdirs (.cache, .venv). Best-effort; ignored if the container is already gone. skills/investigate-functionality/SKILL.md: two UX changes. - Wrapper-redirect now keeps the wrapper intact at vendored-skill/ and writes the underlying content to a sibling vendored-underlying/. Tests target vendored-skill/ (the ship target) regardless of re-vendor outcome. Surfaces clearer 3-option prompt. - Branch check is now ALWAYS asked, with current branch surfaced and eval/<slug> recommended as option 1. Replaces the prior 'only ask if on main/master/development' heuristic. skills/shared/workflow.md: state layout adds vendored-underlying/ and notes that test target is vendored-skill/. Addresses 4 of 5 friction points from a manual trial run trace: workbench model-string confusion, wrapper-handoff prose, branching timing ambiguity, tmp /tmp permission noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… worktrees Trial run on web-interface-guidelines surfaced two issues that this commit addresses, plus the long-standing worktree-gitlink clutter. Branch handling. The step 1 branching prompt was "always ask", which interrupts an autopilot full-auto run after the operator already said "all auto". Autopilot now handles branching itself in workflow (a): reads current branch, creates eval/<slug> if not already on a purpose-named branch, logs to the journal, and passes 'branching: already-resolved' in OPERATOR_DIRECTIVES so step 1 skips its prompt. Step 1 still asks when invoked manually (directive absent). CWD discipline. The orchestrator ran 'cd .skill-optimizer/.../vendored-skill && curl ...' to vendor source files. The Bash tool keeps CWD between calls, so the cd persisted; later 'mkdir -p docs/skill-optimizer/<slug>/' resolved under vendored-skill/ silently, while the dispatched subagent (with absolute output path) wrote the canonical 01-functionality.md to the correct main-checkout location. The orchestrator then "couldn't find" the file because its relative-path verification looked in the wrong subtree. Added a Working-directory discipline section to autopilot SKILL.md and a CWD-safe vendor-step note to investigate-functionality: prefer absolute paths or (subshell) scoping; never bare cd that persists. Worktree cleanup. .claude/worktrees/ held 10 stale gitlinks (mode 160000) from old auto-pilot pilot runs in May. All worktrees removed via 'git worktree remove --force'; gitlinks dropped from the tree. Branches themselves stay (eval/auto-pilot/shadcn-ui, eval/auto-pilot/firecrawl-build-scrape, etc.) — if you want them gone, 'git branch -D' separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…st inheritance A validator-approved improvement is now backed by an actual bench comparison before autopilot claims `improved`. Without this, the exit_status `improved` only meant "validator believed the proposal" — no measurement of whether the proposal moved the needle. Autopilot workflow gains phase (e.5) — fires after step 9 approve: 1. Snapshot baseline pass-rate from 06-bench-summary.md. 2. Patch suite.yml with a top-level skillUnderTest pointing at improved-skill/; back the original up to the journal dir. 3. Re-dispatch step 6 against the improved skill. 4. Restore suite.yml; the re-bench raw output lives at bench-results/<ts>-after-improvement/. 5. Per-case comparison: count pass→fail (regressions), fail→pass (new passes), and the net delta. New gate G10-NO-EMPIRICAL-GAIN fires when the re-bench shows pass-rate ≤ baseline or any regression. Recovery menu (3 strategies, cheap→expensive): different fix, reframe weakness, increase test coverage. Cap exhaustion → exit_status: unchanged-empirical. New exit_status values: improved-with-regression (net gain but ≥1 case regressed — surface clearly for review), unchanged-empirical (re-bench didn't demonstrate the validator's claim). Suite-loader change (the only code touched): suite-level skillUnderTest now inherits to inline cases (with per-case override), mirroring how references/env/setup/etc. already work. Lets autopilot patch one top-level field instead of N cases. Test covers inheritance + override + malformed-field rejection. workbench.md Suite Schema doc gains a row for skillUnderTest with its purpose called out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion_target picks the file Replaces the wrapper-detection branch (vendored-skill/ + vendored-underlying/ + 3-option re-vendor prompt) with a single-directory model that's both simpler and more honest about test reproducibility. What changes structurally - vendored-skill/ is now a single flat dir holding the source SKILL.md PLUS any files the source content fetches at runtime (the researcher vendors all referenced URLs as part of step 1). This makes the test bench reproducible without live network — whether the skill is a wrapper or not, the workbench mount is self-contained. - New frontmatter on 01-functionality.md: optimization_target (relative path within vendored-skill/, the file the chain modifies + submits upstream) + optimization_target_source (upstream URL of that file, used by step 2 to decide which repo to research for PR conventions — may differ from skill_source when a wrapper redirects). - optimization_target_candidates lists the viable files the researcher identified; downstream skills ignore it, but step 1's picker uses it. Picker UX - Replaces the 3-option re-vendor prompt with a "which file to optimize" picker. Surfaces only when multiple candidates exist AND no operator directive pre-resolves the choice. - Autopilot full-auto passes 'optimization-target: auto-pick-recommended' in OPERATOR_DIRECTIVES so step 1 honors the researcher's recommendation without surfacing. Same pattern as 'branching: already-resolved'. Per-skill updates - investigate-functionality: vendor-all-references rule + picker; new frontmatter fields documented. likely_wrapper kept as diagnostic only. - research-functionality (subagent): new section "Vendoring referenced content" + "Identifying the optimization target". Researcher fetches external URLs into vendored-skill/, lists candidates, recommends one. - investigate-submissions: reads optimization_target_source (not skill_source) to derive UPSTREAM_REPO for the gh CLI subagent. - improve / optimizer: reads ${OPTIMIZATION_TARGET}; diff scope is that single file. - validate / validator: BEFORE/AFTER comparison at ${OPTIMIZATION_TARGET}; other files in the trees should be identical (if they differ, the optimizer exceeded scope — validator flags). - autopilot: passes 'optimization-target: auto-pick-recommended' alongside the existing 'branching: already-resolved' in step 1's directives. - shared/workflow.md: dropped vendored-underlying/ from state layout; documented the new optimization_target / optimization_target_source contract. The wrapper concept doesn't disappear — likely_wrapper + wrapper_points_to are still recorded as diagnostic info, just no longer load-bearing for behavior. optimization_target is the field that drives the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Yuqing Zhai and others added 30 commits May 19, 2026 06:01

chore(v1.4): create subagents/ and references/ dirs (populated in Pha…

7c0541a

…se C+D)

Yuqing Zhai and others added 30 commits May 25, 2026 13:10

feat(schema): suite.yml uses runs: matrix; legacy models: rejected

f763840

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(test): wire suite-loader runs test into npm test

a100661

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(metrics): thin wrapper over parse-trace; drop cost field

61e8fa3

refactor(metrics): delegate to iterToolCalls; single trace read in bu…

139e85a

…ildTrialSummary

refactor(trace): drop WorkbenchTraceEntry; trace.jsonl is raw ACP

65cc4b4

feat(docker-runner): add runOneAcpTrial helper for host-side ACP orch…

0053d2d

…estration

fix(acp): use SDK PROTOCOL_VERSION constant (integer) for initialize

4c6d8c7

feat(docker-runner): host-side ACP orchestration per trial

64f99a7

fix(docker-runner): restore cache env flags on per-trial container

7b59fb9

refactor(container-runner): drop --agent mode and pi-agent.ts (ACP ho…

c55ef48

…st-side now)

chore(test): wire trials-aggregate test into npm test

20baebe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(run-*): replace models matrix with runs (agent + model) matrix

e567817

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(examples): migrate pdf+mcp suites to runs: schema; remove stale…

62c786e

… results Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(smoke): per-agent end-to-end smoke probes

9faba1e

docs(shared): ACP trace format reference for chain analyzer

0dca012

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52)#53

feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52)#53
Zhaiyuqing2003 wants to merge 121 commits into
developmentfrom
feat/skill-optimizer-v1.4

Zhaiyuqing2003 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Zhaiyuqing2003 commented May 21, 2026

Summary

What's in this PR

9 chain skill SKILL.md + 1 auto-pilot pending (B10)

8 subagent prompt templates (Phase C, just completed)

4 shared docs

Architecture

Spec + plan docs

What's NOT in this PR (deferred)

Stats

Test plan

Reference

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant