Autopilot run: web-design-guidelines (bench 0.40 → 0.60, +1 case) #55

Open
Zhaiyuqing2003 wants to merge 3 commits into development from autopilot-run/web-design-guidelines

@Zhaiyuqing2003

Autopilot run — web-design-guidelines

Full skill-optimizer autopilot run, end-to-end. exit_status: improved.

Target: web-design-guidelines — a wrapper at vercel-labs/agent-skills whose rules live in command.md at vercel-labs/web-interface-guidelines (that underlying file is the optimization + PR target).

Finding: output-format conventions lived only in an example, not in prose; clean-file behavior was shown by a trailing example rather than specified as a rule. The optimizer rewrote command.md's Output Format section into explicit rules — level-2 file headers, a literal - separator (with anti-em-dash guidance), ✓ pass for clean files, no closing summary.

Empirical verification: bench 0.40 → 0.60 (+1 case, zero regressions) after one G10-NO-EMPIRICAL-GAIN recovery cycle. Validator approved both rounds.

Branch = review bundle: improved-skill/ (deliverable), vendored-skill/ (original, for before/after), and the 01-…09 step reports + autopilot summary. The improved diff is ready for an upstream PR to vercel-labs/web-interface-guidelines; this run didn't compose or submit one.

End-to-end skill-optimizer autopilot run, exit_status: improved.
Review bundle: improved-skill/ (deliverable), vendored-skill/ (original
for before/after), and the 01-09 step reports + autopilot summary.
Test machinery (skill-evals probes) and raw bench output omitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 16:24
@Zhaiyuqing2003

Author

Added to command.md → Output Format (replaces the prior one-liner "Group by file. Use file:line format (VS Code clickable). Terse findings."). Posted as rendered markdown so you can read the new guidance without opening the diff:


Group findings by file. Each finding cites file:line so the path is a VS Code-clickable link. Output follows this shape, every time:

  • File header is a level-2 markdown heading whose text is the file path (e.g. ## src/Button.tsx). Nothing precedes the first header.
  • Each finding is one line: <path>:<line> - <issue>. The separator is a literal " - " (space, hyphen, space); no other whitespace stands in for it. The separator hyphen is an ASCII hyphen-minus (-, U+002D), not an en-dash (–) or em-dash (—); downstream tools split on the literal " - " sequence, so src/Button.tsx:42 — icon button missing aria-label (em-dash) and src/Button.tsx:42  icon button missing aria-label (two spaces, no hyphen) both break the contract. Issue text is terse — state issue + location, skip explanation unless the fix is non-obvious.
  • A file with zero rule violations gets its header followed by ✓ pass on the next line. Emit ✓ pass so the reader can tell "considered, nothing to flag" apart from "forgot to look"; omission is not approval. When in doubt whether a rule fires on a given line, ✓ pass is the correct output for that file — do not stretch an ambiguous rule, invert an OR-clause ("X or Y" does not mean "prefer X over Y"), apply a rule outside its stated scope (e.g. a Title-Case rule that names "headings/buttons" does not extend to aria-label text), or flag an adjacent-but-different concern to manufacture a finding. Reviewer credibility comes from precision, not volume.
  • Nothing follows the last finding: no closing summary, no totals, no recap section.
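The four rules above can be sketched as a small formatter. This is only an illustration, not part of command.md: `renderFindings`, its input shape, `src/Card.tsx`, and the blank line between file sections are all assumptions made for the example.

```javascript
// Hypothetical helper: render findings in the shape the rules above describe.
// `files` maps each reviewed path to its findings ({ line, issue }); an empty
// array means the file was reviewed and nothing was flagged.
function renderFindings(files) {
  const sections = [];
  for (const [path, findings] of Object.entries(files)) {
    const lines = [`## ${path}`]; // level-2 header whose text is the file path
    if (findings.length === 0) {
      lines.push('✓ pass'); // reviewed, nothing to flag
    } else {
      for (const { line, issue } of findings) {
        // Separator is space, ASCII hyphen-minus (U+002D), space.
        lines.push(`${path}:${line} - ${issue}`);
      }
    }
    sections.push(lines.join('\n'));
  }
  // No preamble before the first header, no summary after the last finding.
  // The blank line between sections is an assumption, not a stated rule.
  return sections.join('\n\n');
}

const report = renderFindings({
  'src/Button.tsx': [{ line: 42, issue: 'icon button missing aria-label' }],
  'src/Card.tsx': [], // hypothetical clean file
});
console.log(report);
```

Building the separator from a template literal keeps it byte-exact, which is the property the grader checks for.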

This comment was marked as spam.

The full skill-evals tree the autopilot run measured against: 5 built
probes (accessibility-rules, clean-file-pass, multi-file-grouping,
output-format-contract, triggering) with workspace fixtures, graders,
and GOOD/BAD/EMPTY smoke fixtures, plus the 8 functionality spec.yaml
files (5 picked + 3 proposed-but-unpicked) and suite.yml.

suite.yml is the canonical committed form (pi-acp / openrouter). The
operator's as-run substitution to claude-agent-acp / sonnet (after a
pi-acp transport crash) is documented in the autopilot summary's
"Infra retries" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The empirical evidence behind the 0.40 → 0.60 claim, for the 3 meaningful
runs (baseline, after-improvement, after-retry). Per run: suite-result.json
(aggregate passRate) + per-trial result.json (grading evidence) +
workspace findings.txt (the agent's actual output) + the .tsx fixtures.

Curated to a secret-free, noise-free subset: excludes agent-internal/
(contained .credentials.json — agent OAuth tokens), workspace .cache/
(npm + claude-ai MCP logs), and the verbose per-trial trace.jsonl. The
3 infra-failed early bench attempts are omitted (documented in the
autopilot summary's Infra retries section).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Zhaiyuqing2003

Author

Re: "two major changes" — it's actually one change, and here's the trigger for the - separator text.

The change is one section rewrite

The diff replaces a single line in command.md → ## Output Format (L158) — the old one-liner "Group by file. Use file:line format (VS Code clickable). Terse findings." — with an explicit rule block (improved-skill/command.md L160–185). That one block covers two of the analyzer's named weaknesses (see 07-analysis.md): the format/separator rule and the ✓ pass clean-file rule. So if it reads as "two changes," that's two rules inside one contiguous edit, not two separate edits.

What triggered the - separator text

Test: output-format-contract/format-compliance-single-file (fixture Hero.tsx), reinforced by multi-file-grouping/two-file-mixed-violations.

Grader rule (skill-evals/.../output-format-contract/format-compliance-single-file/grader.mjs L5):

const FINDING_RE = /^(?:- )?\S+\.\w+:\d+\s+-\s+\S/;  // requires "<file>:<line> - <issue>", ASCII hyphen

Bad outputs that drove it (in the bench-results now on this branch):

  • baseline, in bench-results/baseline/.../output-format-contract.../findings.txt:

    Hero.tsx:6  <img> missing alt attribute (accessibility)
    

    Multi-space, no separator → fails FINDING_RE. This is what motivated specifying a literal - separator at all (command.md L165–166).

  • round 1 (after the first separator prose) — bench-results/after-improvement/.../output-format-contract.../findings.txt:

    Hero.tsx:6   ACCESSIBILITY — <img> missing `alt` attribute; ...
    

    The agent switched to an em-dash (—), still not the ASCII - → still fails. This is why the G10 retry round sharpened the rule with the explicit anti-pattern at command.md L167–171: "ASCII hyphen-minus (-, U+002D), not an en-dash (–) or em-dash (—)".

So the ## src/Button.tsx … :42 - … separator guidance (L165–171) traces directly to output-format-contract failing first on multi-space, then on em-dash. Note it still didn't flip to pass after round 2: the skill guidance is clearer but doesn't byte-force the separator, and the summary flags a grader relaxation (accept em-dash / double-space) as a future probe-tuning question.
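
To see the two failures concretely, the FINDING_RE line quoted from grader.mjs can be run against the two findings.txt lines shown above. This is a standalone sketch, not the grader itself: the `samples` object and the `conforming` line are made up for illustration, and the em-dash is written as \u2014 so the byte in question is unambiguous.

```javascript
// Regex copied from the grader line quoted above.
const FINDING_RE = /^(?:- )?\S+\.\w+:\d+\s+-\s+\S/;

const samples = {
  // baseline output: two spaces, no separator
  baseline: 'Hero.tsx:6  <img> missing alt attribute (accessibility)',
  // round 1 output: em-dash (U+2014) where the ASCII hyphen-minus is required
  round1: 'Hero.tsx:6   ACCESSIBILITY \u2014 <img> missing `alt` attribute; ...',
  // hypothetical line that follows the "<file>:<line> - <issue>" shape
  conforming: 'Hero.tsx:6 - <img> missing alt attribute',
};

const results = Object.fromEntries(
  Object.entries(samples).map(([name, line]) => [name, FINDING_RE.test(line)]),
);
console.log(results);
// { baseline: false, round1: false, conforming: true }
```

In both failing lines the regex stops right after the line number: `\s+` consumes the spaces, and the next character is not a U+002D hyphen.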
