Autopilot run: web-design-guidelines (bench 0.40 → 0.60, +1 case) by Zhaiyuqing2003 · Pull Request #55 · fastxyz/skill-optimizer

Zhaiyuqing2003 · 2026-05-28T16:24:05Z

Autopilot run — web-design-guidelines

Full skill-optimizer autopilot run, end-to-end. exit_status: improved.

Target: web-design-guidelines — a wrapper at vercel-labs/agent-skills whose rules live in command.md at vercel-labs/web-interface-guidelines (that underlying file is the optimization + PR target).

Finding: output-format conventions lived only in an example, not in prose; clean-file behavior was shown by a trailing example rather than specified as a rule. The optimizer rewrote command.md's Output Format section into explicit rules — level-2 file headers, a literal - separator (with anti-em-dash guidance), ✓ pass for clean files, no closing summary.

Empirical verification: bench 0.40 → 0.60 (+1 case, zero regressions) after one G10-NO-EMPIRICAL-GAIN recovery cycle. Validator approved both rounds.

Branch = review bundle: improved-skill/ (deliverable), vendored-skill/ (original, for before/after), and the 01-…09 step reports + autopilot summary. The improved diff is ready for an upstream PR to vercel-labs/web-interface-guidelines; this run didn't compose or submit one.

End-to-end skill-optimizer autopilot run, exit_status: improved. Review bundle: improved-skill/ (deliverable), vendored-skill/ (original for before/after), and the 01-09 step reports + autopilot summary. Test machinery (skill-evals probes) and raw bench output omitted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Zhaiyuqing2003 · 2026-05-28T16:24:20Z

Added to command.md → Output Format (replaces the prior one-liner "Group by file. Use file:line format (VS Code clickable). Terse findings."). Posted as rendered markdown so you can read the new guidance without opening the diff:

Group findings by file. Each finding cites file:line so the path is a VS Code-clickable link. Output follows this shape, every time:

File header is a level-2 markdown heading whose text is the file path (e.g. ## src/Button.tsx). Nothing precedes the first header.
Each finding is one line: <path>:<line> - <issue>. The separator is a literal - (space, hyphen, space); no other whitespace stands in for it. The separator hyphen is an ASCII hyphen-minus (-, U+002D), not an en-dash (–) or em-dash (—); downstream tools split on the literal - sequence, so src/Button.tsx:42 — icon button missing aria-label (em-dash) and src/Button.tsx:42 icon button missing aria-label (two spaces, no hyphen) both break the contract. Issue text is terse — state issue + location, skip explanation unless the fix is non-obvious.
A file with zero rule violations gets its header followed by ✓ pass on the next line. Emit ✓ pass so the reader can tell "considered, nothing to flag" apart from "forgot to look"; omission is not approval. When in doubt whether a rule fires on a given line, ✓ pass is the correct output for that file — do not stretch an ambiguous rule, invert an OR-clause ("X or Y" does not mean "prefer X over Y"), apply a rule outside its stated scope (e.g. a Title-Case rule that names "headings/buttons" does not extend to aria-label text), or flag an adjacent-but-different concern to manufacture a finding. Reviewer credibility comes from precision, not volume.
Nothing follows the last finding: no closing summary, no totals, no recap section.

The full skill-evals tree the autopilot run measured against: 5 built probes (accessibility-rules, clean-file-pass, multi-file-grouping, output-format-contract, triggering) with workspace fixtures, graders, and GOOD/BAD/EMPTY smoke fixtures, plus the 8 functionality spec.yaml files (5 picked + 3 proposed-but-unpicked) and suite.yml. suite.yml is the canonical committed form (pi-acp / openrouter). The operator's as-run substitution to claude-agent-acp / sonnet (after a pi-acp transport crash) is documented in the autopilot summary's "Infra retries" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

+import { useEffect, useState } from 'react';
+
+export function UserMenu() {
+  const [open, setOpen] = useState(false);


The empirical evidence behind the 0.40 → 0.60 claim, for the 3 meaningful runs (baseline, after-improvement, after-retry). Per run: suite-result.json (aggregate passRate) + per-trial result.json (grading evidence) + workspace findings.txt (the agent's actual output) + the .tsx fixtures. Curated to a secret-free, noise-free subset: excludes agent-internal/ (contained .credentials.json — agent OAuth tokens), workspace .cache/ (npm + claude-ai MCP logs), and the verbose per-trial trace.jsonl. The 3 infra-failed early bench attempts are omitted (documented in the autopilot summary's Infra retries section). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Zhaiyuqing2003 · 2026-05-28T23:26:39Z

Re: "two major changes" — it's actually one change, and here's the trigger for the - separator text.

The change is one section rewrite

The diff replaces a single line in command.md → ## Output Format (L158) — the old one-liner "Group by file. Use file:line format (VS Code clickable). Terse findings." — with an explicit rule block (improved-skill/command.md L160–185). That one block covers two of the analyzer's named weaknesses (see 07-analysis.md): the format/separator rule and the ✓ pass clean-file rule. So if it reads as "two changes," that's two rules inside one contiguous edit, not two separate edits.

What triggered the `-` separator text

Test: output-format-contract/format-compliance-single-file (fixture Hero.tsx), reinforced by multi-file-grouping/two-file-mixed-violations.

Grader rule (skill-evals/.../output-format-contract/format-compliance-single-file/grader.mjs L5):

const FINDING_RE = /^(?:- )?\S+\.\w+:\d+\s+-\s+\S/;  // requires "<file>:<line> - <issue>", ASCII hyphen

Bad outputs that drove it (in the bench-results now on this branch):

baseline — bench-results/baseline/.../output-format-contract.../findings.txt:
```
Hero.tsx:6  <img> missing alt attribute (accessibility)
```
Multi-space, no separator → fails FINDING_RE. This is what motivated specifying a literal - separator at all (command.md L165–166).
round 1 (after the first separator prose) — bench-results/after-improvement/.../output-format-contract.../findings.txt:
```
Hero.tsx:6   ACCESSIBILITY — <img> missing `alt` attribute; ...
```
The agent switched to an em-dash (—), still not the ASCII - → still fails. This is why the G10 retry round sharpened the rule with the explicit anti-pattern at command.md L167–171: "ASCII hyphen-minus (-, U+002D), not an en-dash (–) or em-dash (—)".

So the ## src/Button.tsx … :42 - … separator guidance (L165–171) traces directly to output-format-contract failing first on multi-space, then on em-dash. Note it still didn't flip to pass after round 2 — the skill guidance is clearer but doesn't byte-force the separator; the summary flags a grader relaxation (accept — / double-space) as a future probe-tuning question.

Copilot AI review requested due to automatic review settings May 28, 2026 16:24

Copilot started reviewing on behalf of Zhaiyuqing2003 May 28, 2026 16:24 View session

This comment was marked as spam.

Sign in to view

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread ...s/web-design-guidelines/accessibility-rules/a11y-violations-component/workspace/UserMenu.tsx

import { useEffect, useState } from 'react';

export function UserMenu() {

const [open, setOpen] = useState(false);

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Autopilot run: web-design-guidelines (bench 0.40 → 0.60, +1 case)#55

Autopilot run: web-design-guidelines (bench 0.40 → 0.60, +1 case)#55
Zhaiyuqing2003 wants to merge 3 commits into
developmentfrom
autopilot-run/web-design-guidelines

Zhaiyuqing2003 commented May 28, 2026

Uh oh!

Zhaiyuqing2003 commented May 28, 2026

Uh oh!

This comment was marked as spam.

Uh oh!

Zhaiyuqing2003 commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Zhaiyuqing2003 commented May 28, 2026