Skip to content

Zane456/skill-doctor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

English | 简体中文

skill-doctor

Reviews your Claude skills so they actually trigger, stay lean, and route cleanly.

Claude Skill Deterministic checks References Runtime deps Python

A skill that audits other skills. It settles the things a machine can settle — character counts, broken routes, listing budget — with scripts and exit codes, and saves your attention for the calls that need judgment: dead instructions, bloat, exit conditions too vague to act on. Every defect it reports has a name and a line number.

See it run · Quick start · What it catches · How it works · Reference


See It Run

A real diagnosis, lightly trimmed:

$ "review my image-tool skill"

[skill-doctor] Auditing: ~/.claude/skills/image-tool  body=143 lines  description=52 chars
[skill-doctor] Budget: 32 skills installed, 11,515 chars vs ≈40,000 → fits
[skill-doctor] Dry-run prompt: "use image-tool to convert this"  broken links=1

[skill-doctor] Diagnosis

❌ Must fix (P0→P3)
  [P0 effect break]  Step 3 reads "the output file" but Step 2 never names one
                     → have Step 2 write a named artifact Step 3 can pick up
  [P1 structure]     description is first person ("I help you resize…")
                     → rewrite in third person ("Resizes and converts images…")
  [P2 specificity]   "process the image" has no command
                     → "convert to PNG with `sips -s format png`"

⚠️ Suggested
  - [no-op] "be careful and thorough" — the model already is → delete the line
  - [sprawl] body 143 lines; a 40-line example block → move to references/

✅ Passed
  - name regex, routing (0 orphans / 0 dangling), YAML, visible output per step

Here's the thing about a broken skill: nothing throws. It just doesn't fire, or Claude reads half of it and quietly moves on. You find out weeks later when the skill you wrote never once triggered. This is the pass that catches that before you ship.


Quick Start

Point it at any skill — in a Claude session, just ask:

review this skill   ·   审查 / 体检 / 修复 this skill   ·   "check my-tool"

Or run one deterministic check straight from the shell, no Claude needed:

python3 scripts/check_routes.py ~/.claude/skills/my-tool
# exit 0 = clean · exit 1 = orphans / dangling links / over the size cap

It reads the target, runs the mechanical checks, walks one real prompt through the workflow, and prints a P0→P3 report. It does not touch your files until you say "fix it".


What It Catches

The failures that don't announce themselves:

Symptom you'd never notice What it flags
Skill never auto-fires description in first person, too weak, or over the listing budget (silently dropped to name-only)
Claude skips half your instructions body past ~500 lines — [sprawl]
A reference file never gets read orphan or dangling route, caught by check_routes.py
A workflow step gets skipped the step prints nothing, so the model walks past it
The agent quits a step early exit condition too vague — [premature-completion]
A line that does nothing the model already obeyed it by default — [no-op]

Every check, threshold, and rule is listed in full under Reference.


How It Works

  your skill  (SKILL.md + references/ + scripts/)
        |
        v
  mechanical checks ........ char / line counts · listing budget
  (scripts, exit-coded)      routing reachability · retrieval eval
        |
        v
  judgment .................. bloat · no-op · sediment · vague exits
  (dimensions, on demand)    loaded only when their condition fires
        |
        v
  dry-run ................... walk one real prompt through the workflow
        |
        v
  P0 -> P3 report .......... must-fix / suggested / passed, each with a line number
        |
        v
  apply fixes .............. only after you confirm

Two design choices make the verdict worth trusting:

  • Deterministic where it can be. Counts, routing, and budget are settled by scripts that return exit codes, not by an opinion that drifts between runs.
  • Honest where it can't. If skill-doctor already edited your skill earlier in the same session, it stamps every line of its report [self-review] and tells you to re-check with a fresh agent. It discounts its own grade on purpose.

Reference

Everything below is lookup material. You don't need to read it to use the skill.

Diagnosis flow

Step Does Output
1 Read the target SKILL.md; run check_listing_budget.py and check_routes.py size + budget + routes verdicts
2 index.py rosters every dimension (FORCED / DEFAULT / SKIP, plus a MODE= dispatch line); read the fired ones one file at a time the roster + one receipt per dimension
2.5 Dry-run: walk one prompt per task shape (write / revise / check) through the body broken links + gates fired per prompt (any break → P0)
2.6 Live-injection: is the target's description actually injected this session, or dropped name-only injected / dropped / skipped
3 Print the P0→P3 diagnosis (❌ must fix · ⚠️ suggested · ✅ passed) the report
4 Apply fixes with Edit — only after you confirm applied-fixes line

Bundled scripts

No LLM and no third-party dependencies (except eval_retrieval.py, which calls a model on request).

Script Checks Command Exit
index.py measures the target and prints the dimension roster (FORCED / DEFAULT / SKIP with reasons) plus the MODE= dispatch line python3 scripts/index.py <skill_dir> 0 roster printed / 2 unreadable
check_routes.py reachability, orphans, dangling links, the compass cap (6000 chars, overridable via a compass-cap: frontmatter field); --before <snapshot> adds a content-conservation check python3 scripts/check_routes.py <skill_dir> 0 clean / 1 issues
check_listing_budget.py total description chars vs the ~1%-of-context listing pool; lists every description over the 300-char band python3 scripts/check_listing_budget.py <root> 0 fits / 1 overflow / 2 unavailable
check_desc_slim.py flags descriptions that can be trimmed without losing trigger strength python3 scripts/check_desc_slim.py <skill_dir> 0 clean / 1 issues
eval_retrieval.py asks a model (GLM, 3× majority vote) whether the router can find each reference by its when-to-read scenario python3 scripts/eval_retrieval.py <skill_dir> --glm 0 clean / 1 misses

Hard rules

A violation here is an error, not a suggestion.

Rule Source
description ≤ 1024 chars (Claude Code ≥ 2.1.105 accepts 1536) Anthropic spec / CC release notes
listing-budget overflow is always reported — it is a population fact, fixed by a global pass, never by editing one skill CC issues #56710 / #47627
description in third person Anthropic best practices
name uses only lowercase letters, digits, hyphens Anthropic spec
body < 500 lines Anthropic / community consensus
routing depth ≤ 2 hops (SKILL.md → index.md → file); nested indexes forbidden structure-surgery (hardened from practice)
zero dangling routes, zero orphan subdocuments structure-surgery iron rule
description on a single logical YAML line (no > or ` `)
every workflow step produces visible output Seleznov experiment
a skill's claim about its own performance must point to in-repo evidence, or be cut evidence-gating

Failure modes it names

A defect gets a name so you can act on it precisely; the P-tier says how bad, the name says what kind.

  • [no-op] — a line the model already obeys by default
  • [sediment] — a stale line that drifted out of relevance
  • [sprawl] — simply too long, even if every line is live
  • [duplication] — the same meaning in more than one place
  • [premature-completion] — an exit condition so vague the agent quits early
  • [weak-leading-word] — an anchor word too weak to beat the model's default
  • [no-floor] — no mandatory read-set, so audit depth varies run to run
  • [gate-gap] — quality gates keyed only to big tasks; the everyday small edit passes zero gates
  • [missing-license] — an editing skill that never grants rewrite authority, so the model's minimal-diff default wins

Priority tiers

Tier Meaning Enters ❌
P0 effect break (the flow doesn't connect when run) always
P1 violates a structural hard rule always
P2 insufficient specificity (no command, no I/O spec) always
P3 readability only when it changes what the next run does; else ⚠️

Dimensions (rostered on demand by scripts/index.py)

scripts/index.py decides which dimension files fire (FORCED / DEFAULT / SKIP); workflow docs load at the step that names them. references/index.md stays the human-readable mirror.

Reference Covers
description-templates.md trigger-strength template + listing-budget math
body-quality-checklist.md body size and what to demote into references
visible-output-rule.md every workflow step must print something
yaml-pitfalls.md frontmatter formatting traps
hard-code-vs-llm-judgment.md script it, or leave it to the model
assets-vs-references.md sorting templates vs reference docs
structure-surgery.md splitting, routing tiers, the 2-hop cap
effect-dry-run.md walking one prompt per task shape through the workflow
runtime-contract.md load floor, gate granularity, edit license: whether the skill still takes effect when it actually runs
priority-tiers.md P0–P3 definitions and placement
predictability-glossary.md the failure-mode vocabulary
hard-rules.md the quantified must-pass rules
exception-fallback.md what to do when a path or script errors
language-policy.md English-by-default and its exceptions
output-format.md the layered shape of the Step 3 report
apply-safety.md size gate, pre-deletion check, closing the loop
live-injection-check.md is the description actually injected this session
rationale.md why the skill exists at all

Repository structure

skill-doctor/
├── SKILL.md                              # the compass: diagnosis flow + routing
├── references/
│   ├── index.md                          # routes to every dimension by when-to-read
│   ├── apply-safety.md
│   ├── assets-vs-references.md
│   ├── body-quality-checklist.md
│   ├── description-templates.md
│   ├── effect-dry-run.md
│   ├── exception-fallback.md
│   ├── hard-code-vs-llm-judgment.md
│   ├── hard-rules.md
│   ├── language-policy.md
│   ├── live-injection-check.md
│   ├── output-format.md
│   ├── predictability-glossary.md
│   ├── priority-tiers.md
│   ├── rationale.md
│   ├── runtime-contract.md
│   ├── structure-surgery.md
│   ├── visible-output-rule.md
│   └── yaml-pitfalls.md
├── scripts/
│   ├── index.py                          # dimension roster: FORCED / DEFAULT / SKIP + MODE=
│   ├── check_routes.py                   # reachability / orphans / dangling / compass cap
│   ├── check_listing_budget.py           # description listing-budget
│   ├── check_desc_slim.py                # description slimming
│   └── eval_retrieval.py                 # retrieval eval (model-backed)
├── README.md
├── README.zh-CN.md
└── LICENSE

Skills fail silently. This is the pass that makes them fail loudly first.

Author: Zane456 · License: MIT

About

Health check for AI agent skills. skill-doctor diagnoses why a SKILL.md won't reliably trigger, runs a routing-recall test on every reference, and restructures the package. 4 deterministic scripts + GLM routing eval. Claude Code & Codex.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages