Autopilot run: web-design-guidelines (bench 0.40 → 0.60, +1 case)#55
Zhaiyuqing2003 wants to merge 3 commits into
End-to-end skill-optimizer autopilot run, exit_status: improved. Review bundle: improved-skill/ (deliverable), vendored-skill/ (original for before/after), and the 01-09 step reports + autopilot summary. Test machinery (skill-evals probes) and raw bench output omitted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added to Group findings by file. Each finding cites
The full skill-evals tree the autopilot run measured against: 5 built probes (accessibility-rules, clean-file-pass, multi-file-grouping, output-format-contract, triggering) with workspace fixtures, graders, and GOOD/BAD/EMPTY smoke fixtures, plus the 8 functionality spec.yaml files (5 picked + 3 proposed-but-unpicked) and suite.yml. suite.yml is the canonical committed form (pi-acp / openrouter). The operator's as-run substitution to claude-agent-acp / sonnet (after a pi-acp transport crash) is documented in the autopilot summary's "Infra retries" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
import { useEffect, useState } from 'react';

export function UserMenu() {
  const [open, setOpen] = useState(false);
The empirical evidence behind the 0.40 → 0.60 claim, for the 3 meaningful runs (baseline, after-improvement, after-retry). Per run: suite-result.json (aggregate passRate) + per-trial result.json (grading evidence) + workspace findings.txt (the agent's actual output) + the .tsx fixtures. Curated to a secret-free, noise-free subset: excludes agent-internal/ (contained .credentials.json — agent OAuth tokens), workspace .cache/ (npm + claude-ai MCP logs), and the verbose per-trial trace.jsonl. The 3 infra-failed early bench attempts are omitted (documented in the autopilot summary's Infra retries section). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
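As a quick check on the arithmetic, the sketch below shows how a one-case gain on a five-case suite moves the aggregate pass rate from 0.40 to 0.60. It assumes the five built probes are the five bench cases. The TrialResult shape and the passRate and withPassCount helpers are hypothetical (the real suite-result.json and result.json schemas are not shown here), and which probes pass is a placeholder, not the recorded per-probe outcome.

```typescript
// Hypothetical shape; the real result.json schema is not shown in this PR.
interface TrialResult {
  probe: string;
  passed: boolean;
}

// Aggregate pass rate = passed cases / total cases.
function passRate(trials: TrialResult[]): number {
  if (trials.length === 0) return 0;
  const passed = trials.filter((t) => t.passed).length;
  return passed / trials.length;
}

// Probe names as listed in the skill-evals commit.
const probes = [
  "accessibility-rules",
  "clean-file-pass",
  "multi-file-grouping",
  "output-format-contract",
  "triggering",
];

// Placeholder outcomes: mark the first n probes as passing.
// This does NOT reflect which probes actually passed in the run.
function withPassCount(n: number): TrialResult[] {
  return probes.map((probe, i) => ({ probe, passed: i < n }));
}

const baseline = passRate(withPassCount(2));
const improved = passRate(withPassCount(3));
console.log(baseline.toFixed(2), "→", improved.toFixed(2)); // 0.40 → 0.60
```

With five cases, each case is worth 0.20, so "+1 case" and "0.40 → 0.60" describe the same change.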
Re: "two major changes". It's actually one change, and here's the trigger for the
The change is one section rewrite
The diff replaces a single line in
What triggered the
Autopilot run — web-design-guidelines
Full skill-optimizer autopilot run, end-to-end. exit_status: improved.
Target: web-design-guidelines, a wrapper at vercel-labs/agent-skills whose rules live in command.md at vercel-labs/web-interface-guidelines (that underlying file is the optimization + PR target).
Finding: output-format conventions lived only in an example, not in prose; clean-file behavior was shown by a trailing example rather than specified as a rule. The optimizer rewrote command.md's Output Format section into explicit rules: level-2 file headers, a literal - separator (with anti-em-dash guidance), ✓ pass for clean files, no closing summary.
Empirical verification: bench 0.40 → 0.60 (+1 case, zero regressions) after one G10-NO-EMPIRICAL-GAIN recovery cycle. Validator approved both rounds.
Branch = review bundle: improved-skill/ (deliverable), vendored-skill/ (original, for before/after), and the 01-…09 step reports + autopilot summary. The improved diff is ready for an upstream PR to vercel-labs/web-interface-guidelines; this run didn't compose or submit one.
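To make the rewritten Output Format rules concrete, here is a minimal sketch of a formatter that follows them: a level-2 header per file, a plain hyphen separator, ✓ pass for clean files, and no closing summary. The Finding and FileReport types, the formatReport function, the example paths, and the exact path:line layout of each finding are all assumptions for illustration; command.md remains the authoritative wording.

```typescript
// Hypothetical types; not taken from command.md or the skill-evals graders.
interface Finding {
  line: number;
  message: string;
}

interface FileReport {
  path: string;
  findings: Finding[];
}

// Render findings grouped by file, following the rules described above.
function formatReport(files: FileReport[]): string {
  const sections = files.map((file) => {
    const header = `## ${file.path}`; // level-2 file header
    if (file.findings.length === 0) {
      return `${header}\n✓ pass`; // clean file: explicit pass marker
    }
    // Assumed finding layout: "path:line - message", with a plain hyphen
    // as the separator rather than an em-dash.
    const lines = file.findings.map(
      (f) => `${file.path}:${f.line} - ${f.message}`
    );
    return [header, ...lines].join("\n");
  });
  // Sections only; no closing summary is appended.
  return sections.join("\n\n");
}

// Example paths and messages are placeholders, not real fixture content.
const report = formatReport([
  { path: "src/Menu.tsx", findings: [{ line: 12, message: "button has no accessible name" }] },
  { path: "src/Clean.tsx", findings: [] },
]);
console.log(report);
```

Keeping the clean-file case as an explicit branch mirrors the finding above: the behavior is stated as a rule in the code path rather than implied by an example.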