Skip to content

feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52)#53

Draft
Zhaiyuqing2003 wants to merge 121 commits into
developmentfrom
feat/skill-optimizer-v1.4
Draft

feat(skill-optimizer): v1.4 chain implementation (draft, per RFC #52)#53
Zhaiyuqing2003 wants to merge 121 commits into
developmentfrom
feat/skill-optimizer-v1.4

Conversation

@Zhaiyuqing2003

Copy link
Copy Markdown

Summary

Implementation of the v1.4 skill-optimizer chain architecture proposed in RFC #52. Decomposes the v1.3 monolithic orchestrator subagent into 9 independent chain skills + 1 auto-pilot driver, each dispatching narrow-context subagents that produce human-reviewable artifacts at convention paths.

Draft because B10 (autopilot SKILL.md), plugin metadata updates, and end-to-end validation runs are still pending. Opening now for early review of the architecture as implemented.

What's in this PR

9 chain skill SKILL.md + 1 auto-pilot pending (B10)

# Skill Kind What it does
1 investigate-functionality fresh-derivation Research what the skill does
2 investigate-test-case maintenance Design functionalities to test
3 investigate-submissions (optional) fresh-derivation Research upstream PR conventions
4 write-tests maintenance Build per-probe workspace + grader + smoke fixtures
5 validate-tests fresh-derivation Independent semantic check on probes (anti-ducktape gate)
6 run-bench fresh-derivation (summary) CLI wrapper for run-suite
7 analyze fresh-derivation Cluster failures into named structural weaknesses
8 improve fresh-derivation Optimizer drafts principled fix
9 validate fresh-derivation Independent verdict on the improvement
10 (pending) autopilot chain driver Walks 1→9 end-to-end

8 subagent prompt templates (Phase C, just completed)

skills/skill-optimizer-subagents/: each chain skill that dispatches a subagent loads the corresponding prompt template at its dispatch step. Each prompt has consistent structure: role, inputs, what-it-sees / what-it-doesn't-see (the anti-ducktape constraints), output spec, reasoning protocol, edge cases, return summary.

4 shared docs

skills/skill-optimizer-shared/:

  • iteration-protocol.md — iteration mechanics (step kinds, staleness, destructive-edit checkpoints)
  • subagent-dispatch.md — dispatch architecture (constraints, directives, no-auto-invocation rule)
  • frontmatter-discipline.md — runtime facts vs history (git replaces version+archive bookkeeping)
  • workflow.md — operator-facing chain reference (chain table, re-run triggers, backward triggers)
  • workbench.md — workbench schema (moved from legacy skills/skill-optimizer/references/; needs rewrite, flagged)

Architecture

  • Filesystem-as-state for the test tree: tests/<functionality>/<probe>/{spec.yaml, workspace/, grader.mjs, smoke/, checks/}. No picked: [] or built_cases: [] arrays — picked: true|false lives per-functionality spec.yaml; built state is implicit (probe folder + grader.mjs present).
  • Git-native iteration: no version: fields, no archive/ directories. Git already content-addresses every prior state; git log mtime gives staleness signal for auto-pilot.
  • Two anti-ducktape gates:
    • Step 5 (validate-tests) → step 6: can't measure against bad probes.
    • Step 7 (analyze) → step 8 / step 9 (validate) → materialize improved-skill/: can't ship a ducktape patch (analyzer must name a weakness with both general principle AND explicit anti-pattern list; validator checks the optimizer's self-check against that list).
  • Original skill is never modified: vendored-skill/ stays frozen; improved-skill/ accumulates approved improvements via step 9.

Spec + plan docs

Moved to the superpowers convention:

  • docs/superpowers/specs/2026-05-19-skill-optimizer-v1.4-design.md
  • docs/superpowers/plans/2026-05-19-skill-optimizer-v1.4.md

What's NOT in this PR (deferred)

  • B10 skill-optimizer-autopilot/SKILL.md — chain driver still pending
  • Plugin metadata + CLAUDE.md + README updates pointing at v1.4 (separate cleanup PR per spec)
  • workbench.md rewrite — currently in legacy shape from the v1.3 era; flagged as future work
  • End-to-end validation runs on real skills (firecrawl regression case, etc.)
  • Legacy skills/auto-improve-orchestrator/ removal — separate cleanup

Stats

  • 27 files changed, +6129 / -210 lines
  • 9 chain SKILL.md (avg 150 lines each)
  • 8 subagent prompts (avg 166 lines each, 1324 total)
  • 4 shared docs (avg ~100 lines each)
  • 40 commits, all conventional-format

Test plan

This PR is structurally complete but not yet validated end-to-end. Validation pending in a follow-up:

  • Manual walk-through of the chain on a small local skill — verify each chain skill dispatches correctly, each subagent prompt produces the expected output shape
  • Re-run firecrawl (v1.3 regression case) under v1.4 — expect optimizer either produces a principled fix OR honestly refuses (no regression shipped)
  • Auto-pilot smoke test on a local skill (after B10 lands)
  • Plugin metadata + CLAUDE.md/README pointer updates so the v1.4 chain is discoverable
  • Cross-platform plugin packaging verification (npx tsx tests/smoke-skill-distribution.ts, npm pack --dry-run)

Reference

Builds on RFC #52: #52 — that PR carries the architectural rationale, evidence base, and concerns-addressed sections. This PR is the implementation.

🤖 Generated with Claude Code

Yuqing Zhai and others added 30 commits May 19, 2026 06:01
Approved design (brainstormed 2026-05-12) converting the v1.3
monolithic auto-improve-orchestrator into 7 independent Claude Code
skills that chain via the superpowers plugin pattern.

Motivation: team review of v1.3 PR drafts surfaced 4 critiques —
v1.3 optimizes for incremental numerical uplift without validating
test case quality, grader correctness, or improvement principledness.
The firecrawl iteration regression (1.0 → 0.44 from piling on
Recipe A+D simultaneously to chase a small uplift) is the canonical
"ducktape-by-monolithic-orchestrator" case study.

Key architectural shifts:
- Skills (not slash commands) per superpowers convention
- Convention-pathed reports at docs/skill-optimizer/<slug>/...
  (visible + committable, like docs/superpowers/specs/)
- Strict limited-context subagents for all generative work (writer,
  analyzer, optimizer, validator) — prevents tunnel-vision into
  ducktape patches
- Validator subagent after every improvement (internal + optional
  external consistency)
- Auto-pilot = natural chained invocation, not a separate orchestrator

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User clarification: the agent asks 'optimize for PR submission?' at
step 1 (when the user first provides an upstream skill), not at the
end of step 7. This decision determines whether step 3
(investigate-submissions) runs at all.

Changes:
- 'Two contexts, one workflow' rewritten: PR decision at step 1,
  recorded in 01-functionality.md frontmatter as pr_submission_intent
- Step 1 behavior: explicit PR question for upstream sources
- Step 2 handoff: reads pr_submission_intent to decide whether to
  invoke step 3
- Step 3 'skipped when': now keyed on pr_submission_intent: false
- Step 7 handoff: three branches (local / upstream-no-PR /
  upstream-yes-PR), no late prompts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 tasks across 5 phases:
- Phase A (4 tasks, mechanical): skill dir shells, subagents/refs dirs, v1.3 deprecation, recipes.md seed
- Phase B (7 tasks, INTERACTIVE via writing-skills): one per SKILL.md, NOT subagent-driven per user request
- Phase C (6 tasks, mechanical): subagent prompt templates with limited-context constraints
- Phase D (1 task, mechanical): references/workflow.md chain diagram
- Phase E (2 tasks, E2E validation): local skill + firecrawl re-run (the v1.3 regression case)

Plan explicitly marks Phase B as not-for-subagent-driven-development;
the user explicitly stated 'writing good skills is HARD' and wants
interactive creation per skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ writing-skills

Phases B–D produce v1.4's load-bearing artifacts (description-routed SKILL.md
files, subagent prompt templates that hold the anti-ducktape constraints, and
the workflow reference doc). Per operator direction, none of these should be
subagent-driven — "writing good skills is HARD" and "everything to be precise".

Tool assignment:
- Phase B (7 SKILL.md): skill-creator as outer loop (description-routing +
  eval iteration), superpowers:writing-skills as inner loop when behavioral
  compliance issues surface during eval
- Phase C (6 subagent prompts): superpowers:writing-skills only — these are
  compliance documents and the limited-context constraint must hold under
  adversarial pressure, pure TDD-pressure-scenario territory
- Phase D (workflow.md): no skill tool; direct interactive authoring with
  section-by-section operator review (matches handoffs from B + constraints
  from C, which may have shifted during interactive iteration)

Only Phase A (file scaffolding) and Phase E (real eval runs) sit outside the
interactive flow.
Body content for each SKILL.md will be written interactively in Phase B
via superpowers:writing-skills, with the interface contract from
docs/skill-optimizer-v1.4-spec.md as the brief. This commit just lays
down the structural skeleton and discoverable frontmatter.
Pulled from feat/auto-improve-skill-v1.3:skills/auto-improve-orchestrator/
references/lessons.md (the v1.3 orchestrator never landed on development,
so we source from the experimental branch). Adds v1.4-specific header
explaining how the analyzer (step 6) and optimizer (step 7) subagents will
use this file. Replaces v1.3 "auto-improve-skill" / "Phase 4" framing
with v1.4 step-numbering and subagent-naming.

Content body is the v1.3 Recipe A-E + grader patterns G1-G6 + run-record
protocol verbatim — that's the cumulative knowledge v1.4 inherits.
…ts dir, move philosophy to docs/

Changes to the v1.4 plan-as-executed, all in one place:

- Move skill-writing-philosophy.md from skills/references/ to docs/. The
  philosophy doc is contributor-facing — we'll distill the load-bearing
  rules into the optimizer subagent's prompt directly rather than have
  it load this doc at runtime.

- Delete skills/references/recipes.md (and the now-empty references/
  dir). The raw seed copied from v1.3 lessons.md is case-study-shaped
  (accumulated per-pilot observations), which contradicts the
  "generalize-from-feedback" principle in the philosophy doc. The
  curated abstract-pattern version needs real end-to-end observations
  to ground it in — deferred to a follow-up after the chain ships.

- Rename skills/subagents/ → skills/skill-optimizer-subagents/ so the
  dir name is scoped to this plugin and can't collide with subagents
  that other plugins might ship under a generic name.

- Add a "Bootstrapping limit" section to the philosophy doc. The
  skill-optimizer chain runs an empirical loop on target skills, but
  that loop can't validate itself — same shape as Thompson's
  "Reflections on Trusting Trust". The seven meta-skills get authored
  from philosophy + best judgment; their test is Phase-E real-world
  runs, not eval data for the meta-skills themselves.

- Spec + plan updated: file tree, Task A2 (revised), Task A3 (skipped),
  Task A4 (deferred), Phase C task file paths (subagents path
  rename), Phase D location (workflow.md goes to docs/ rather than
  skills/references/), acceptance criteria #4 (recipes.md deferred),
  coexistence section (v1.3 orchestrator never landed on development),
  open questions (recipes.md location TBD on real production).

- Add note about post-v1.4 cleanup of the original
  skills/skill-optimizer/ skill (now redundant with the chain's
  skill-optimizer-run-bench step). Deferred to its own PR because it
  touches the public plugin API.
…sification taxonomy + local-with-PR-intent path

Two changes to the B1 draft, plus a matching spec sync:

1. Classification: replaced the ad-hoc closed list
   (code-reviewer | document-producer | tool-use | code-patterns |
   other) with the v3 taxonomy actually used in the prioritization
   work — tool-use, code-patterns, document, prose-guidance, meta,
   interactive — and explicitly grants the subagent sovereignty to
   write a short descriptive label of its own when none fits, rather
   than collapsing to "other". A specific label gives downstream
   steps a real handle to work with.

2. Local-skill PR intent: the prior rule was "local = always
   pr_submission_intent: false". Revised to default false but treat
   PR-intent as live when the user explicitly says they want to send
   the local skill back upstream — in which case we capture the
   upstream guidelines location (URL / CONTRIBUTING.md path / Slack
   channel) into the report body as a "PR submission notes"
   subsection that step 3 (investigate-submissions) reads as its
   starting point.

Also: while reviewing, dropped the v1.3/v1.4 framing from the "Why
limited-context dispatch matters" section (version is historical
metadata, not skill-functional content) and updated subagent path
links to the renamed skill-optimizer-subagents/ directory.

Spec §"The 7 skills" #1 step 2 + step 6 (classification field
description) updated to match.
…r B2

Spec changes:

- New ## Iteration patterns section between ## Subagent constraints
  and ## The skills. Covers: backtrack-trigger table, per-report
  versioning mechanism (version + inputs frontmatter, with direct-
  upstream-only staleness checks), re-entry contract with the
  ${OPERATOR_DIRECTIVES} slot, latest-plus-archive convention, and
  the transitive-staleness policy (user judgment trusted; auto-pilot
  re-runs on direct-upstream version mismatch).

- ## The 7 skills → ## The skills. Added one-line "Iteration
  behavior" notes to each of the existing seven subsections.

- Step 2 (investigate-test-case) now dispatches a test-case-designer
  subagent rather than running in the operator session. Reasoning:
  cross-iteration contamination — on iter 2+ the operator has seen
  prior failures and biases test selection toward "what just failed"
  instead of comprehensive coverage. Isolated subagent fixes this.

- New ### 8. skill-optimizer-autopilot subsection. Walks 1→7,
  consults version mechanism per step (skip-or-dispatch), applies
  defaults for the three human-gate points (B1 PR-intent flag, B2
  pick-top-N, B7 log-and-exit on validator-rejected), bounds at
  max-iterations-per-step. Replaces the old "## Auto-pilot mode"
  section (which said "no separate skill, just chained invocation").

- Subagent constraints table updated: test-case-designer added;
  every reasoning subagent's "Does NOT see" column now includes
  prior drafts of its own output; every Sees column includes
  ${OPERATOR_DIRECTIVES}.

- Acceptance criteria expanded to 9 items (was 7): new #1 covers
  8 SKILL.md files, new #5 covers iteration-mechanism E2E, new #8
  covers auto-pilot smoke test.

Plan changes:

- Phase B count 7 → 8 tasks. New Task B8 (skill-optimizer-autopilot
  SKILL.md) added with its own A1-equivalent dir-creation step
  inline (the dir wasn't part of Phase A scope).

- Phase C count 6 → 7 tasks. New Task C1b (test-case-designer
  subagent prompt) inserted between C1 and C2 with a full draft
  template; suffixed "1b" rather than renumbering existing C2-C6
  to keep cross-references stable.

- Phase C section header adds a preamble explaining the
  ${OPERATOR_DIRECTIVES} requirement that applies to every
  reasoning-subagent prompt template.
…h iteration patterns

Five updates to match the spec's new iteration mechanism (§"Iteration
patterns" added in commit 0009066):

1. Report frontmatter now includes `version` field. Step 1 has no
   upstream reports, so `inputs:` is omitted; downstream steps will
   add it as they're authored.

2. New workflow step 5 — "Handle iteration: archive prior version +
   collect directives." Checks for existing report at the canonical
   path; if present, moves it to archive/01-functionality-v<N>.md
   and bumps version. Also where the operator pre-digests
   cross-iteration learnings into the OPERATOR_DIRECTIVES bulleted
   list (atomic new requirements, not a context dump).

3. Step 6 (renumbered from old step 5) dispatches with two new
   templated inputs: ${VERSION} and ${OPERATOR_DIRECTIVES}. The
   subagent's "does NOT see" list explicitly includes prior
   01-functionality.md drafts under archive/ to prevent the
   re-derivation from being contaminated by its own past output.

4. New "## Iteration behavior" section after "## Edge cases". Covers:
   when to re-run, what happens to vendored-skill/ on re-run, the
   archive convention, and how downstream cascade self-corrects via
   the version-mismatch check on next invocation.

5. Edge-case for "vendored-skill/ already exists" tightened: default
   to reuse (same source), re-fetch only on source URL change or
   explicit user ask. (Previously asked the user every time.)

Sets the template for B2-B8.
Authoring the same ~40 lines of archive + version-bump + directives-
collection logic into seven chain skills (B1-B7) would mean ~280
duplicated lines. Factor the mechanics into a single shared
operational reference that every chain skill loads explicitly.

New file: skills/skill-optimizer-shared/iteration-protocol.md (187
lines). Covers: versioning convention, the per-invocation decision
tree (no existing report → v1; existing + inputs match → current;
existing + mismatch → archive + bump), collecting OPERATOR_DIRECTIVES
(atomic new requirements, not context dumps), subagent constraints
under iteration (ignore own prior output, read upstream at latest),
cascading staleness policy (direct-upstream-only checks), archive
table, bootstrapping case, and what the protocol explicitly does
NOT cover (bench-results timestamping, auto-pilot summaries,
vendored-skill cache).

B1 changes:
- Step 5 stripped from ~25 lines of inline mechanics to ~10 lines
  with an explicit "Read this file now" instruction pointing at the
  protocol doc. The "Read this now" framing is load-bearing — the
  whole point is that the agent must consult the protocol, not
  improvise.
- "Iteration behavior" section trimmed from ~25 lines of general
  mechanics to ~10 lines listing only step-1-specific re-run
  triggers, with a pointer back to the protocol for the general
  mechanics.
- Net: 225 → 208 lines.

This sets the pattern for B2-B8: each chain skill's "Handle
iteration" step will be a brief pointer-and-step-specific-notes
combo, with the heavy mechanics centralized.

Spec changes:
- Architecture overview file tree: skill-optimizer-shared/ added.
- Acceptance criterion 2b added: shared protocol doc exists and is
  referenced explicitly from each chain skill's iteration step.

Plan changes:
- File tree updated (skill-optimizer-shared/ + autopilot/ stub).
- New Task A5 added: author iteration-protocol.md. Documented as
  "as-executed (added mid-execution)" because it emerged from B1's
  revision rather than the initial plan.
…sophy

Three issues that the protocol doc was a load-bearing reference for —
so they would have propagated into every chain skill's invocation —
all caught on read-through:

1. "B1-B7" was project-internal jargon (those are plan task labels,
   not part of the skill vocabulary). Replaced with "every skill in
   the chain". The shipped doc should be timeless; task labels live
   in the plan, not the protocol.

2. The "On every chain-skill invocation" section used an ASCII box-
   drawing decision tree. Per Anthropic and writing-skills guidance
   ("use flowcharts ONLY for non-obvious decision points"), this
   logic is straightforward conditional flow — bullets carry it
   more cleanly and match standard markdown rendering. Rewritten as
   nested bullets.

3. The "Bad examples" of operator directives were marked with ❌.
   Per the global instruction "Only use emojis if the user
   explicitly requests it. Avoid adding emojis to files unless
   asked", these don't belong. Replaced with explicit "Examples
   that count" / "Examples that do NOT count" section labels.

Net: 187 → 180 lines.
…mbering + front-load load-bearing context

Two related fixes:

1. Internal workflow step numbering (1-7) collided with chain step
   numbering (1-7 referring to other skills in the chain). Same word,
   two meanings — an agent reading "step 5" could plausibly think
   either "this skill's Handle-iteration step" or "the run-bench
   skill in the chain". Relabeled internal workflow steps to (a)
   through (g) so they're visually distinct from chain step
   references. Cross-references inside the doc updated accordingly.

   A note in the new "Before you start" section declares the
   convention explicitly so future readers don't have to infer it.

2. The "Read iteration-protocol.md" instruction was buried at
   internal step (e) — middle of a 7-step workflow. An agent
   reading sequentially might glance past it once they're in
   execution momentum. Added a "## Before you start" section right
   after the intro paragraph, listing the two load-bearing things
   to know up front: (1) read the iteration protocol now, (2) you
   will dispatch a subagent — you don't do the research yourself.

   Step (e) still requires reading the protocol — the "Before you
   start" section primes the agent so step (e) becomes reinforcement
   rather than first contact.

229 lines total (up from 208; "Before you start" earns its place as
the discipline frame). Sets the pattern for B2-B8 — each will
similarly relabel its workflow steps with (a)-(g) and front-load
the two load-bearing context items.
…visibility moved from step 4-blocked to step 4-allowed

Two coupled changes — B2's SKILL.md draft and a spec update that
makes the source-visibility split consistent across the chain.

Spec changes (Subagent constraints table):

- Step 2 (test-case designer) — explicitly added "skill source
  content" to "does NOT see". Coverage design happens at the
  responsibility level here; concrete fixture details enter at
  step 4. Without this constraint, designers could gerrymander
  test cases around the source's literal phrasing instead of
  reasoning from stated responsibilities.

- Step 4 (test writer) — removed "the skill's content" from "does
  NOT see"; added "skill source content" to "sees". Updated the
  rationale: fixture writing needs concrete patterns and violation
  examples, which come from source. The per-case test spec from
  step 2 constrains what the fixture should test, so the
  gerrymandering risk is bounded. Still blocks other test cases
  (prevents copying across the suite) and grader internals
  (prevents grader-leak hacking).

Architecture: source enters the chain at step 1 (research), exits
at step 2 (responsibility-level design needs no source), enters at
step 4 (fixture writing needs source detail), stays present for
step 6 (analyze failures) and step 7 (optimize/validate). The
spec's step-4 body description updated to match.

B2 SKILL.md:

- New file, follows B1's template: front-loaded "Before you start"
  section, lettered workflow steps (a)-(f), bold discipline markers
  at action sites, why-this-matters rationale, edge cases,
  iteration behavior section.
- 6 internal steps: (a) confirm prerequisites, (b) handle iteration,
  (c) dispatch designer subagent, (d) confirm subagent output,
  (e) user gate (present + collect picks), (f) conditional handoff
  based on pr_submission_intent.
- "picked" frontmatter field is the operator's responsibility —
  subagent writes proposal with picked: []; operator fills picked
  after user gate.
- Three-response handling in user gate: pick subset/all, ask for
  revision, pick zero.
- 215 lines, description 479 chars.
Step 3 of the chain — OPTIONAL, runs only when 01-functionality.md
has pr_submission_intent: true. Researches the upstream repo's PR
conventions and writes 03-submissions.md as the verbatim-pastable
context block the validator (step 7) uses for external consistency.

Follows the B1/B2 template:
- Front-loaded "Before you start" with iteration-protocol pointer +
  dispatch discipline + step-numbering convention
- Frontmatter with version, inputs.step_1_functionality, plus
  step-specific fields: upstream_repo, upstream_branch_target,
  license, requires_cla
- 5 lettered workflow steps: (a) confirm prerequisites including
  pr_submission_intent gate, (b) handle iteration, (c) dispatch
  subagent, (d) confirm subagent output with blocker flagging,
  (e) hand off
- Why limited-context dispatch matters: rationale specific to step
  3 — the validator's external consistency check depends on the
  submissions report being neutral upstream facts, not advocacy for
  the proposed change
- Edge cases: private repos, non-GitHub hosts, empty PR history,
  copyleft license
- Iteration behavior: explicitly note this rarely re-runs;
  upstream conventions change slowly

235 lines, description 578 chars.
…otocol + apply in B2

A chain skill never invokes another chain skill on its own. When a
step finds its upstream input unsatisfactory (thin functionality
report, too-easy bench, missing test case, wrong-target PR
conventions), it surfaces the finding and stops — the user (or
auto-pilot driver) decides whether to re-run upstream, accept the
situation, or abandon.

This was implicit in the architecture but not stated as a rule.
Adding it explicitly to iteration-protocol.md as a new section
between "Cascading staleness" and "Archive convention". Same
section covers the forward-handoff exception (those are normal
chain flow, not backward triggers).

Updated B2's edge case for "Subagent's proposal has fewer cases
than expected" — was "Either re-run step 1 with directives or
accept the small proposal", now "Surface this to the user with two
options ... Do not re-invoke step 1 yourself — re-runs require an
active signal from the user", with reference back to the protocol
doc's new section.

Audit: only B2 had a wording suggesting backward auto-trigger; B1
and B3 don't suggest re-running upstream steps. Same rule applies
to all future chain skills (B4-B8) — the protocol doc is the
chain-wide source of truth.
…eview

1. upstream_branch_target placeholder syntax. Was "<main | next |
   other>" which reads as a closed enum. The actual value is the
   literal branch name (could be `develop`, `release-2024`, etc.),
   so the placeholder should describe the field meaning, not enumerate
   three literal options. Now: "<branch name — usually `main`,
   sometimes `next` for new-skill repos, or a specific release branch>".

2. Removed the "frontmatter spec includes fields not currently in the
   vendored skill's frontmatter" blocker. It described a real fact in
   the report (upstream uses fields the vendored skill lacks) but
   isn't a blocker — the optimizer (step 7) reads 03-submissions.md
   and would add missing fields naturally. Doesn't need a separate
   operator alert.

3. Removed the "License is GPL or other copyleft" blocker. Copyleft
   licenses don't mechanically block PR submission — the upstream is
   whatever-licensed; a PR becomes part of that codebase under that
   license. Organization-policy concerns about contributing to
   copyleft projects exist but are contributor-side decisions, not
   chain-level blockers.

Step (d) reframed from "flag these blockers" to "verify file + only
the CLA fact needs explicit mention since it requires operator-side
work outside the chain". Other frontmatter fields (license, repo,
branch target) are facts downstream steps consume directly.

Edge case for "license is copyleft" replaced with one for unusual
frontmatter conventions — that's the actual blocker shape (subagent
can't extract a consistent spec, so the optimizer has to make a
judgment call).
Per operator review: B2's re-run was modeled as fresh derivation,
which loses continuity. The user's `picked` choices and existing
case names should survive a re-run; otherwise the user has to re-pick
everything every time they add coverage. The "archive as inert audit
trail" rule means continuity has to come from the current canonical
file, not from peeking at archived prior versions.

New concept: chain steps come in two kinds.

- Fresh-derivation steps (1, 3, 6, 7): subagent never sees its own
  step's prior or current file. The "anti-ducktape" rule applies in
  full — optimizer must not see prior attempts, analyzer must not see
  prior analyses.

- Maintenance steps (2, 4): subagent reads the current canonical file
  as load-bearing input and produces an extended version of it.
  Existing entries the user has invested in (picks, manually-added
  cases) are preserved unless directives explicitly say to revise.
  Archive still happens for audit; archive is still inert.

iteration-protocol changes:

- New "Step kinds: fresh-derivation vs maintenance" section
  classifying each step.
- "On every chain-skill invocation" decision tree updated to branch
  on step kind: fresh-derivation copies-to-archive then derives from
  scratch; maintenance copies-to-archive then dispatches with current
  file as input.
- "Subagent constraints under iteration" restructured to make the
  third rule kind-dependent (do/don't read your own canonical file).

B2 changes:

- Step (b) Handle iteration: explicitly invokes the maintenance
  pattern. Notes the special case where step 1's version bumped
  (responsibility set changed) — user decides whether to extend or
  start fresh by deleting the canonical file before dispatch.
- Step (c) Dispatch: new templated input ${EXISTING_CASES_PATH}
  (empty on iteration 1, current 02-test-case.md on re-runs).
  Subagent's "sees" list adds the current file with explicit
  preserve-existing semantics. "Does NOT see" list now correctly
  excludes archive only (was incorrectly excluding all prior drafts).
- Step (e) User gate: response #2 "user wants additions or revisions"
  now reflects maintenance — user does not need to re-pick everything
  since existing picks are preserved.

Spec changes:

- Subagent constraints table: test-case-designer row updated to
  reflect maintenance pattern. Sees-list adds "current 02-test-case.md
  (when extending in maintenance mode)"; does-NOT-see-list correctly
  scoped to "archived prior drafts" only; why-rationale updated.

Note: B4 (write-tests) when drafted will also follow the maintenance
pattern (workbench/ accumulates per-case files). B6 and B7 stay as
fresh-derivation — that's the load-bearing anti-ducktape constraint.
Step 4 of the chain — takes the picked cases from 02-test-case.md
and dispatches test-writer subagents in parallel (one per case) to
build concrete workspace files + graders. Runs a smoke check
against hand-crafted GOOD/BAD/EMPTY fixtures before declaring done.

Follows the established template (B1/B2/B3 shape) with several
B4-specific additions:

- **Maintenance step** (per the recent iteration-protocol update).
  workbench/ accumulates as new picks get built. The (b) iteration
  step diffs picked vs prior built_cases and only dispatches
  test-writers for NEW picks or for cases that directives flag for
  revision. Existing builds are preserved verbatim.

- **Parallel per-case dispatch** in step (d). All test-writers
  emitted in a single message so they run concurrently. Each sees
  only its case spec + 01-functionality.md + skill source; does
  NOT see other cases, other graders, prior failures, or anything
  under archive/. The per-case isolation prevents both
  cross-fixture homogenization and grader-leak hacking.

- **Source content access** (per the recent spec update). Test
  writers are the one step where the skill source is in-scope for
  a generative subagent. Bounded by the per-case spec from step 2.

- **Smoke check** (step (e)). New responsibility not in other
  steps. Verifies each grader against GOOD/BAD/EMPTY fixtures the
  test-writer also produces. last_smoke_check_passed in frontmatter
  is only set when all graders pass.

- **Two outputs**: 04-tests-plan.md (the meta-report) AND the
  workbench/ directory (the artifacts the run-bench step executes
  against). Maintenance protocol covers both via archive/04-tests-plan-vN.md
  and archive/workbench-vN/.

- **User-gate at step (c)** for the planned workbench structure
  before dispatching. Three response cases (approve / add revisions
  via directives / reject and abandon-or-replan).

- **6 internal steps**: (a) confirm prerequisites, (b) handle
  iteration with diff logic, (c) plan + user gate, (d) parallel
  dispatch, (e) smoke check, (f) assemble + commit + hand off.

350 lines, description 492 chars. Larger than B1/B2/B3 because the
extra responsibilities (parallel dispatch, smoke check, maintenance
diff) each earn their lines.
…o centralized script exists

The prior wording referenced a path that didn't exist
(`skills/skill-optimizer/references/scripts/smoke-check.mjs`). The
actual codebase doesn't have a centralized smoke-check runner —
each existing workbench has its own `checks/smoke-graders.mjs`
specific to its graders (see e.g.
examples/workbench/agent-browser/checks/smoke-graders.mjs).

Rewrote step (e) to be honest about this:

- Smoke check is per-workbench, not centralized.
- The test-writer subagent produces the smoke-check artifact
  alongside its grader. Shape follows the workbench schema docs.
- Two reasonable shapes described (per-case runner vs workbench-level
  aggregator); the test-writer prompt template (Phase C) will pick
  one and apply consistently.
- Reference example pointed at the actual existing
  agent-browser/checks/smoke-graders.mjs.

Conceptual contract unchanged (GOOD passes, BAD/EMPTY fail); only
the execution mechanics corrected.
Step 5 of the skill-optimizer chain. Thin operator-driven CLI step —
no subagent dispatch. Two outputs: timestamped raw bench data under
05-bench-results/<ts>/ (preserved naturally, exempt from the standard
archive flow) and a versioned 05-bench-summary.md (follows the
iteration protocol; archive-and-version on re-run).

The summary is the entry point step 6 reads: per-model and per-case
pass rates plus a failed-case pointer list into the raw output. The
analyzer subagent at step 6 walks that list for trace and findings
detail; the summary itself stays short.

Handoff branches on overall pass rate: failures route to
analyze-result; all-pass surfaces the choice to the user (accept
that the picked set didn't expose a weakness, or re-run step 2 with
a "make it harder" directive). Chain skills don't auto-invoke.
Two related coordination changes across the chain, plus a deferred
limitation note.

Rename convention for revised cases at step 2: when a directive asks
to revise an existing case (rather than add a new one), the
test-case-designer subagent appends a version suffix —
`case-x` → `case-x-v2` → `case-x-v3`. Bare name is implicit v1.
This gives step 4's diff logic a deterministic signal that the
revised case needs a fresh test-writer dispatch (without the rename,
the name-match would say "already built" and skip rebuilding the
case whose spec actually changed). Encoded in B2's user-gate step
and referenced from B4's diff logic so the v-suffixed names are not
surprising downstream.

Partial re-bench at step 5: deferred. CLI's run-suite does not
accept a case filter, so step 5 always re-measures the full
workbench. Documented as a known limitation with a roadmap pointer.
Operator escape hatch (manual run-case + splice into
05-bench-results/<ts>/) noted as outside the chain.
Replace the version-field + archive-folder iteration model with
git as the history mechanism and the filesystem itself as the
current state. The prior protocol was reimplementing git in
frontmatter — version: ints, inputs.step_N: lineage tracking,
archive/<NN>-name-v<N>.md copies — and adding accidental complexity
across the chain.

Iteration-protocol rewrite:
- Drop version field, archive/ subdir, inputs.step_N int tracking
- Staleness detection: git log -1 --format=%ct mtime comparison
- Two step kinds preserved (fresh-derivation vs maintenance), with
  the constraint reframed in terms of "does the subagent read its
  own canonical file" — fresh-derivation says no (anti-ducktape),
  maintenance says yes (filesystem IS the state)
- Subagent constraint added: don't walk git history of any tree
  file — for fresh-derivation this is the anti-ducktape guarantee,
  for maintenance this prevents reasoning from prior states
- Safe destructive edits: operator session commits a checkpoint
  before maintenance-step rebuilds so git history has a clean
  before/after breakpoint

Spec doc updates:
- State layout: tests/<functionality>/<test>/{spec.yaml, workspace,
  grader, smoke} replaces 02-test-case.md + 04-tests-plan.md +
  workbench/. Filesystem-as-state — no picked: [] or built_cases: []
  arrays anywhere
- B2 output: 00-test-proposals.md (audit) + tests/<func>/spec.yaml
  per functionality with picked: true|false in each
- B4 output: tests/<func>/<test>/ probe folders + generated
  tests/suite.yml. Per-probe parallel test-writer dispatch
- B5 input: tests/suite.yml. Two outputs: timestamped raw +
  single-canonical 05-bench-summary.md
- Subagent constraints table updated for every reasoning subagent
  to reflect git-history-off-limits rule
- Per-step iteration-behavior sections rewritten for the new model

Undoes the -v2 rename convention added 30 min ago (no longer
needed — filesystem-as-state means revising a spec.yaml in place
is naturally detected, and B4 can just rebuild on directive
without any special name mangling).
Apply the filesystem-as-state + git-native iteration redesign across
all five existing chain SKILL.md files. Drops version: frontmatter
fields and archive/ directory references everywhere; reframes the
subagent constraints in terms of "don't read own canonical / git
history" (fresh-derivation) vs "do read current tree as state"
(maintenance).

B1 (investigate-functionality, fresh-derivation): drop version
field; subagent constraint becomes "don't read own canonical or
git history of it" instead of "don't read archive/".

B2 (investigate-test-case, maintenance — major rewrite): output
shape changed entirely. Was a single 02-test-case.md with picked:
[] frontmatter array; now produces 00-test-proposals.md (one-time
audit report) plus tests/<functionality>/spec.yaml per proposed
functionality, each with picked: true|false in its own frontmatter.
User gate is "edit picked: in each spec.yaml" rather than "tell me
which names to pick". Subagent reads existing tests/ tree as
load-bearing state per the maintenance rule.

B3 (investigate-submissions, fresh-derivation): drop version and
inputs.step_1_functionality fields; staleness now via git mtime
against 01-functionality.md.

B4 (write-tests, maintenance — major rewrite): replaced
04-tests-plan.md + workbench/ with tests/<functionality>/<test>/
probe folders. State is implicit: probe folder + grader.mjs
present = built; no built_cases: [] array. Per-probe parallel
test-writer dispatch; one probe = one test-writer subagent. Step
generates tests/suite.yml from picked-functionality probes.
Destructive-edit checkpoint pattern: operator commits before
rebuilding existing probes.

B5 (run-bench, fresh-derivation summary + timestamped raw): drop
version and inputs.step_4_tests fields; reads tests/suite.yml as
input. Summary 05-bench-summary.md is single-canonical with prior
state in git; raw 05-bench-results/<ts>/ stays naturally
accumulated and outside the protocol.

Net: less bookkeeping, simpler mental model, fewer ways to get
state out of sync. Filesystem IS the state across the chain.
Two wording fixes to the fresh-derivation subagent constraint:

1. The prior wording "no reference to what was produced before"
   read as forbidding even the directive mechanism — which is
   wrong. The operator session DOES read prior outputs (that's
   part of its job between iterations) and distills lessons into
   atomic new requirements. The subagent then satisfies those
   distilled requirements as fresh constraints without seeing the
   raw prior content. This separation is what keeps the new
   derivation from rationalizing the prior one while still letting
   the chain converge across iterations.

2. Step 7 has a specific carve-out worth stating explicitly: the
   SKILL itself (the improvement target) is upstream input, not
   the subagent's own canonical. The optimizer reads the current
   skill — which may include modifications from prior step-7 runs
   — and proposes new improvements on top. The "own canonical"
   that's off-limits is 07-improvement-proposal.md (the reasoning
   report), not the skill file. The skill accumulates improvements
   across iterations; the proposal reports do not.

Triggered by review of the prior wording on iteration-protocol.md.
Replace the descriptive "there is no version: field" wording with
a prescriptive "do not add version-tracking metadata" rule plus a
practical decision aid for future SKILL.md authors:

  When in doubt about whether a field belongs: ask whether a chain
  skill needs to READ it to do its job right now (yes -> keep), or
  whether you're recording it for future-debugging / future-audit
  purposes (no -> that's git's job).

The prior wording could be read as describing the current state
without prohibiting reintroduction. The new wording makes the
prohibition explicit so B6/B7/B8 (yet to be drafted) don't
accidentally bring back version-tracking metadata.
B6 (analyze-result): the chain's anti-ducktape gate. Fresh-derivation
step. Dispatches an analyzer subagent that reads bench summary +
raw trial data + probe specs + skill content — but NOT the test
inputs themselves (forces principle-thinking over solution-thinking).
Output: 06-analysis.md with has_structural_weakness: true|false in
frontmatter and per-weakness sections containing Pattern,
Hypothesized cause, Connects to skill section, What WOULD address
this (general principle), and What WOULD NOT address this (the
explicit anti-ducktape list step 7's optimizer must reckon with).

Honest refusal is built in: if no weakness can be articulated, the
report says so and has_structural_weakness: false, which step 7
will refuse to fire on. Forced weakness-naming when the analyzer
found nothing is the ducktape failure mode this step exists to
prevent.

B7 (improve-skill): the terminal generative step. Two subagents
(optimizer + validator), both fresh-derivation. Refuses to fire if
06-analysis has has_structural_weakness: false (the anti-ducktape
gate's downstream half).

Optimizer: reads 06-analysis, 01-functionality, current skill,
03-submissions (if PR-bound). Does NOT see raw trials, grader
internals, test inputs, or prior proposals. Must apply the
analyzer's general principle and self-check against the
anti-pattern list explicitly.

Validator: reads skill BEFORE + AFTER + 01-functionality + the
proposal artifact + 03-submissions (if PR-bound). Internal
consistency + external consistency checks. Does NOT see the
optimizer's reasoning trace, prior verdicts, or raw trial data.

Bounded loop: max 2 revision rounds. If validator still says
needs-revision after round 2, surface honestly with three realistic
paths; do not loop indefinitely.

Three handoff branches: local skill (modify in place), upstream +
PR=false (modify vendored copy), upstream + PR=true (write
07-pr-draft.md with operator-steps-to-submit; the chain does NOT
submit the PR).

Outputs three or four artifacts: 07-improvement-proposal.md,
07-validator-verdict.md, the modified skill file (on approve), and
07-pr-draft.md (PR-bound + approve). Frontmatter on both reports
carries runtime-relevant facts only (verdict, addresses_weaknesses,
diff_target) per the iteration protocol's discipline rule.

Both files follow the established structural template (front-loaded
"Before you start", lettered workflow steps, edge cases, iteration
behavior section). Pending user review of B1-B7 before B8 and the
subagent prompt templates.
…rompt

B6 SKILL.md was carrying a full markdown body template (per-weakness
section template with placeholders, non-structural noise example,
honest-refusal wording verbatim). That's the subagent's concern —
the subagent writes the body per its prompt template; the operator
session reads only the frontmatter for handoff branching.

Keep in the SKILL.md:
- Frontmatter contract (operator reads has_structural_weakness for
  step 7's gate)
- Enumeration of the five required parts per weakness entry (the
  operator session verifies these in step (d))
- The architecture-level rationale on why the anti-pattern list is
  load-bearing (this is design-decision content, not subagent-side
  prose — explains WHY the constraint exists for future authors)
- Pointer to the subagent prompt template for the full template +
  reasoning protocol

Same audit pass on B1/B2/B4/B7: their What-you-produce sections
show structural contracts (frontmatter schemas, file-tree shape)
that the operator session actually reads, not narrative body
templates the subagent fills in — so they stay as-is.
Two architectural fixes per review:

1. Drop PR draft from B7. PR composition is a separate downstream
   concern that consumes B7's proposal + 03-submissions.md; the
   auto-pilot (step 8) or a dedicated composer can handle it if
   pr_submission_intent: true. B7's job is just to improve the
   skill — packaging it as a PR is not what improve-skill does.

   Removed: 07-pr-draft.md as an artifact, the three-branch
   handoff (local / upstream + PR=false / upstream + PR=true),
   the operator-steps-to-submit checklist. Collapses to a single
   handoff message regardless of PR intent.

2. Never modify the original skill. The improved version lives at
   docs/skill-optimizer/<slug>/improved-skill/ — a separate
   location that accumulates improvements across iterations. The
   vendored upstream copy stays frozen; the local source file
   stays untouched. Git tracks improved-skill/ history.

   On re-run, the optimizer reads improved-skill/ if it exists
   (the accumulated state) and proposes the next improvement on
   top; iteration 1 reads the original source instead. Original
   is always recoverable; improved evolves under git.

   Removed: "Written in place for local skills" / "Written to
   vendored-skill/ for upstream skills" — both wrong now.

Updated B7's "Before you start" carve-out summary, step (c) and
(d) input descriptions (SKILL_CURRENT_PATH replaces
SKILL_SOURCE_PATH; validator BEFORE = optimizer's input), step (f)
write outputs (materialize improved-skill/; don't touch source),
step (g) handoff (single message). Iteration-protocol's step-7
carve-out updated to match (improved-skill/ is the accumulated
state; source stays frozen). Spec doc layout adds improved-skill/
alongside vendored-skill/; B7 section rewritten; subagent
constraints table rows for optimizer + validator updated.
Yuqing Zhai and others added 30 commits May 25, 2026 13:10
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend TrialAggregate / WorkbenchModelAggregateResult with totalTokens
and totalDurationMs, sourced from per-trial result.metrics.tokens.total
and result.metrics.durationMs. Missing values fall back to 0 so legacy
result.json files remain backward-compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… results

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts pass through

Native ACP agents (claude-agent-acp, codex-acp, gemini, opencode) accept their own model
refs (claude-haiku-4-5-20251001, gpt-5-mini, etc.). The OpenRouter-only guard now applies
only when agent is pi-acp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subscription auth files (~/.claude/.credentials.json, ~/.codex/auth.json, etc.)
are typically mode 600 on the host. After cpSync into the staging dir that becomes
/home/agent in the container, the container's agent user (uid 10001) can't read
them. Loosen mode to 644 on the staged copies so the in-container agent can read.
Host files unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline is currently a placeholder (captureStatus: placeholder); the test skips
when OPENROUTER_API_KEY is unset or when the baseline hasn't been replaced with
real captured rates. Once a real run is performed, replace the JSON with the
actual perCasePassRate and remove the captureStatus field to enable the check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rmat + agent column

- test-writer: do-not-pre-trigger-skill guidance for task prompts
- analyzer: read-first pointer to acp-trace-format.md
- run-bench: per-(agent + model) summary; surface tokens/duration; drop cost; new image+Dockerfile names

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Architecture summary: host-side ACP client driving per-trial Docker containers
- Important files: src/workbench/acp/, src/workbench/agents/, parse-trace.ts
- Invariants: agent: required on case.yml, runs: matrix on suite.yml, pi-acp-only openrouter/ prefix
- Auth matrix per agent
- Docker image+Dockerfile name updated to skill-optimizer-agent:local

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e dirs

Three intertwined fixes needed for claude-agent-acp (and codex-acp) subscription
auth to work end-to-end:

1. Extract the OAuth bearer from the credentials JSON and pass it via
   ANTHROPIC_AUTH_TOKEN (claude) / OPENAI_API_KEY (codex). The respective
   underlying SDKs (@anthropic-ai/claude-agent-sdk and codex-acp's analogue)
   read env vars, not the credentials file directly. New
   SubscriptionEnvExtract type lets the registry declare these mappings.

2. Pre-create directories the agent's runtime expects under the staged $HOME.
   @anthropic-ai/claude-agent-sdk writes debug logs to ~/.claude/debug/<uuid>.txt;
   session/new fails with "Query closed before response received" if that dir
   doesn't exist. Use the registry's existing homeDirs field to declare these.

3. Make staged $HOME and its declared subdirs world-writable (mode 0o777).
   The container's agent user (uid 10001) does not match the host user that
   owns the staging dir, so without loosened perms the SDK can't write its
   own session/debug files.

Verified: claude-agent-acp smoke probe now completes a real
initialize→newSession→prompt cycle and writes /work/out.txt in ~15s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion

Without this, the model: field in case.yml was silently ignored — the agent
used its default. Now passes the case's model string to the agent's session
when supportsAcpSetModel is true on the agent's registry entry. Errors are
re-thrown with the agent name + bad model string so case authors can tell
they used an unrecognized ID.

Verified with model: opus[1m] on claude-agent-acp; tokens recorded, stopReason
end_turn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… aliases

Closes two spec-satisfaction gaps from the post-implementation audit:

- pi-acp now declares OPENROUTER_API_KEY in requiresEnv so resolveAuth
  fails loudly preflight instead of letting the agent crash mid-trial.
  opencode keeps requiresEnv: [] intentionally — it's multi-provider
  and the loginHint guides the user.
- AGENT_ALIASES drops 'gemini' (self-pointing) and 'openclaw' (points
  at non-existent agent). Spec's alias table is canonical: claude,
  codex, pi.
Fulfills the v1.4 acceptance criterion #8 that was never built.
Defines a single user-invocable SKILL.md at skills/autopilot/ that
walks the 9-step chain end-to-end on one skill. Defaults to fully
automated with optional per-gate breakpoints chosen at startup.

Replaces v1.4 §10's single-cycle retry model with a strategy-menu
per gate (cheap → expensive); the pilot picks the cheapest
unattempted strategy on each gate firing. Caps are per-gate
(default 2) + global (default 8); cap exhaustion is an honest exit,
not a failure. State lives in a per-run journal (gitignored),
promoted to a tracked autopilot-summary-<ts>.md at end-of-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `skills/autopilot/SKILL.md` as the 10th skill — the chain
driver that walks steps 1→9 end-to-end on a single skill. Fully
automated by default; operator picks optional per-gate breakpoints
at startup. Honest-exits when caps exhaust (per-gate=2, global=8)
or step 9 approves.

Long static content lives in agents/ so the SKILL.md stays
operational: interview-prompt.md (startup UX), recovery-menus.md
(strategy menus consulted at each gate firing), surfacing-prompt.md
(operator-facing gate prompt format), summary-template.md (end-of-run
audit report shape).

Plugin metadata + smoke tests updated. README/CONTRIBUTING mention
the new driver. Fixed a pre-existing step 2↔3 classification swap
in shared/subagent-dispatch.md that this work surfaced.

Fulfills v1.4 acceptance criterion #8 (auto-pilot smoke test) — the
last criterion from the v1.4 design that was never built.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skills/shared/workbench.md is significantly refreshed to match current
truth: drops stale --model/--models flags, replaces models:[...] with
runs:[{agent,model}], adds agent: as required on cases, fixes the
default image name. Adds a new Per-Agent Model Strings section listing
the model-naming convention each agent CLI accepts plus the env vars
they require — closes the trace-friction where Claude wrote
'claude-opus-4-7-1m' and the CLI rejected it (valid: opus[1m]).

src/workbench/docker-runner.ts: two changes.
  - copyCaseSupportDirs() now enumerates the case dir dynamically with
    an exclusion set (references/, node_modules/, hidden dirs), instead
    of a hardcoded ['checks','fixtures','bin','workspace','mcp'] list.
    Case authors can add seeds/, data/, helpers/ etc. without editing
    the runner.
  - Pre-removal docker exec --user 0 chmod -R 0777 /home/agent /work
    before docker rm -f, so the host's rmSync(tempDir) doesn't trip
    EACCES on uid-10001-owned subdirs (.cache, .venv). Best-effort;
    ignored if the container is already gone.

skills/investigate-functionality/SKILL.md: two UX changes.
  - Wrapper-redirect now keeps the wrapper intact at vendored-skill/
    and writes the underlying content to a sibling vendored-underlying/.
    Tests target vendored-skill/ (the ship target) regardless of
    re-vendor outcome. Surfaces clearer 3-option prompt.
  - Branch check is now ALWAYS asked, with current branch surfaced and
    eval/<slug> recommended as option 1. Replaces the prior
    'only ask if on main/master/development' heuristic.

skills/shared/workflow.md: state layout adds vendored-underlying/ and
notes that test target is vendored-skill/.

Addresses 4 of 5 friction points from a manual trial run trace:
workbench model-string confusion, wrapper-handoff prose, branching
timing ambiguity, tmp /tmp permission noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… worktrees

Trial run on web-interface-guidelines surfaced two issues that this
commit addresses, plus the long-standing worktree-gitlink clutter.

Branch handling. The step 1 branching prompt was "always ask",
which interrupts an autopilot full-auto run after the operator
already said "all auto". Autopilot now handles branching itself
in workflow (a): reads current branch, creates eval/<slug> if not
already on a purpose-named branch, logs to the journal, and
passes 'branching: already-resolved' in OPERATOR_DIRECTIVES so
step 1 skips its prompt. Step 1 still asks when invoked manually
(directive absent).

CWD discipline. The orchestrator ran 'cd .skill-optimizer/.../vendored-skill
&& curl ...' to vendor source files. The Bash tool keeps CWD
between calls, so the cd persisted; later 'mkdir -p
docs/skill-optimizer/<slug>/' resolved under vendored-skill/
silently, while the dispatched subagent (with absolute output
path) wrote the canonical 01-functionality.md to the correct
main-checkout location. The orchestrator then "couldn't find" the
file because its relative-path verification looked in the wrong
subtree.

Added a Working-directory discipline section to autopilot SKILL.md
and a CWD-safe vendor-step note to investigate-functionality:
prefer absolute paths or (subshell) scoping; never bare cd that
persists.

Worktree cleanup. .claude/worktrees/ held 10 stale gitlinks
(mode 160000) from old auto-pilot pilot runs in May. All worktrees
removed via 'git worktree remove --force'; gitlinks dropped from
the tree. Branches themselves stay (eval/auto-pilot/shadcn-ui,
eval/auto-pilot/firecrawl-build-scrape, etc.) — if you want them
gone, 'git branch -D' separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…st inheritance

A validator-approved improvement is now backed by an actual bench
comparison before autopilot claims `improved`. Without this, the
exit_status `improved` only meant "validator believed the proposal"
— no measurement of whether the proposal moved the needle.

Autopilot workflow gains phase (e.5) — fires after step 9 approve:

  1. Snapshot baseline pass-rate from 06-bench-summary.md.
  2. Patch suite.yml with a top-level skillUnderTest pointing at
     improved-skill/; back the original up to the journal dir.
  3. Re-dispatch step 6 against the improved skill.
  4. Restore suite.yml; the re-bench raw output lives at
     bench-results/<ts>-after-improvement/.
  5. Per-case comparison: count pass→fail (regressions),
     fail→pass (new passes), and the net delta.

New gate G10-NO-EMPIRICAL-GAIN fires when the re-bench shows
pass-rate ≤ baseline or any regression. Recovery menu (3 strategies,
cheap→expensive): different fix, reframe weakness, increase test
coverage. Cap exhaustion → exit_status: unchanged-empirical.

New exit_status values: improved-with-regression (net gain but ≥1
case regressed — surface clearly for review), unchanged-empirical
(re-bench didn't demonstrate the validator's claim).

Suite-loader change (the only code touched): suite-level
skillUnderTest now inherits to inline cases (with per-case
override), mirroring how references/env/setup/etc. already work.
Lets autopilot patch one top-level field instead of N cases. Test
covers inheritance + override + malformed-field rejection.

workbench.md Suite Schema doc gains a row for skillUnderTest with
its purpose called out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion_target picks the file

Replaces the wrapper-detection branch (vendored-skill/ + vendored-underlying/
+ 3-option re-vendor prompt) with a single-directory model that's both
simpler and more honest about test reproducibility.

What changes structurally
- vendored-skill/ is now a single flat dir holding the source SKILL.md
  PLUS any files the source content fetches at runtime (the researcher
  vendors all referenced URLs as part of step 1). This makes the test
  bench reproducible without live network — whether the skill is a
  wrapper or not, the workbench mount is self-contained.
- New frontmatter on 01-functionality.md: optimization_target (relative
  path within vendored-skill/, the file the chain modifies + submits
  upstream) + optimization_target_source (upstream URL of that file,
  used by step 2 to decide which repo to research for PR conventions —
  may differ from skill_source when a wrapper redirects).
- optimization_target_candidates lists the viable files the researcher
  identified; downstream skills ignore it, but step 1's picker uses it.

Picker UX
- Replaces the 3-option re-vendor prompt with a "which file to optimize"
  picker. Surfaces only when multiple candidates exist AND no operator
  directive pre-resolves the choice.
- Autopilot full-auto passes 'optimization-target: auto-pick-recommended'
  in OPERATOR_DIRECTIVES so step 1 honors the researcher's recommendation
  without surfacing. Same pattern as 'branching: already-resolved'.

Per-skill updates
- investigate-functionality: vendor-all-references rule + picker; new
  frontmatter fields documented. likely_wrapper kept as diagnostic only.
- research-functionality (subagent): new section "Vendoring referenced
  content" + "Identifying the optimization target". Researcher fetches
  external URLs into vendored-skill/, lists candidates, recommends one.
- investigate-submissions: reads optimization_target_source (not
  skill_source) to derive UPSTREAM_REPO for the gh CLI subagent.
- improve / optimizer: reads ${OPTIMIZATION_TARGET}; diff scope is that
  single file.
- validate / validator: BEFORE/AFTER comparison at ${OPTIMIZATION_TARGET};
  other files in the trees should be identical (if they differ, the
  optimizer exceeded scope — validator flags).
- autopilot: passes 'optimization-target: auto-pick-recommended' alongside
  the existing 'branching: already-resolved' in step 1's directives.
- shared/workflow.md: dropped vendored-underlying/ from state layout;
  documented the new optimization_target / optimization_target_source
  contract.

The wrapper concept doesn't disappear — likely_wrapper + wrapper_points_to
are still recorded as diagnostic info, just no longer load-bearing for
behavior. optimization_target is the field that drives the chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant