diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index ee5f161..1bf7dd4 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -10,6 +10,14 @@ "license": "MIT" }, "plugins": [ + { + "name": "agentic-harness", + "source": "./agentic-harness", + "description": "Stand up, assess, and maintain an agentic harness in an existing repo — generate project-specific agent teams and the skills they use, then assess how effectively they are used", + "version": "0.1.0", + "category": "engineering", + "tags": ["harness", "agents", "skills", "scaffolding", "orchestration", "multi-agent", "meta-skill"] + }, { "name": "developer-tools", "source": "./developer-tools", diff --git a/CLAUDE.md b/CLAUDE.md index 7d8b7e2..1a98469 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -47,3 +47,9 @@ This is a static marketplace — no build steps. Plugins are collections of mark ## Shell script conventions Scripts must be macOS (BSD) compatible — no `grep -P`, no `head -n -1`. Use `grep -E`/`grep -oE` and `awk` instead. Avoid `((VAR++))` with `set -e` (fails when VAR=0); use `VAR=$((VAR + 1))`. + +## Git / PR Working Policy +- Worktree usage: disabled +- Worktree location: n/a (topic branch in main checkout) +- Base branch: main +- Recorded on: 2026-06-01 diff --git a/README.md b/README.md index da835bb..5c5c2c4 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,7 @@ A curated collection of [Claude Code](https://claude.com/claude-code) plugins fo | Plugin | Description | Category | Version | |--------|-------------|----------|---------| +| [agentic-harness](./agentic-harness) | Stand up, assess, and maintain an agentic harness — generate project-specific agent teams and the skills they use, then assess how effectively they are used | Engineering | 0.1.0 | | [developer-tools](./developer-tools) | Developer environment tooling — devcontainer generation, stack detection, infrastructure config | Engineering | 2.4.0 | | [human-resources](./human-resources) | HR interview workflow — job descriptions, pre-screening, interview prep, evaluation, compliance | Human Resources | 0.2.0 | | [kaizen](./kaizen) | Continuous improvement loops — recursive optimization engine with profiles for Claude Code usage, refactoring, and process improvement | Engineering | 1.0.0 | @@ -13,6 +14,8 @@ A curated collection of [Claude Code](https://claude.com/claude-code) plugins fo | [project-management](./project-management) | SOW writing, review, estimation, and PMI-compliant PERT analysis — integrated project management pipeline | Operations | 1.0.0 | | [tech-writing](./tech-writing) | Technical writing support — documentation structure, style guides, content review | Documentation | 0.1.0 | +> The [`agentic-harness`](./agentic-harness) plugin is inspired by the open-source `harness` plugin by revfactory (Apache-2.0). It is an independent reimplementation under MIT, with its own structure and prose. + ## Installation Add this marketplace to Claude Code: diff --git a/agentic-harness/.claude-plugin/plugin.json b/agentic-harness/.claude-plugin/plugin.json new file mode 100644 index 0000000..a62ff7f --- /dev/null +++ b/agentic-harness/.claude-plugin/plugin.json @@ -0,0 +1,22 @@ +{ + "name": "agentic-harness", + "version": "0.1.0", + "description": "Stand up, assess, and maintain an agentic harness in an existing repo. A meta-tool that generates project-specific agent teams and the skills they use, then assesses how effectively they are used. Two skills: harness-setup (build, extend, maintain) and harness-review (read-only assessment).", + "author": { + "name": "MrBogomips", + "url": "https://github.com/MrBogomips" + }, + "license": "MIT", + "keywords": [ + "harness", + "agentic-harness", + "agent-team", + "skill-architect", + "meta-skill", + "orchestration", + "scaffolding", + "agent-scaffolding", + "multi-agent", + "claude-code-plugin" + ] +} diff --git a/agentic-harness/README.md b/agentic-harness/README.md new file mode 100644 index 0000000..b28c12d --- /dev/null +++ b/agentic-harness/README.md @@ -0,0 +1,28 @@ +# agentic-harness + +Stand up, assess, and maintain an **agentic harness** inside an existing repository — the project-local agents, skills, and orchestrator that make a repo work well with Claude Code. + +This plugin does not do your domain work. It builds and maintains the agents and skills that do. + +## The two skills + +- **`harness-setup`** — the writer. Analyzes the project, designs an agent team and the skills they use, generates them into `.claude/`, builds an orchestrator, and registers a pointer in `CLAUDE.md`. Also extends an existing harness, applies a review context, and records every change. +- **`harness-review`** — read-only. Inventories the harness, detects drift, and assesses how effectively the skills and agents are actually used (from project memory, the `CLAUDE.md` pointer, and the `.claude/` inventory), then produces a prioritized *review context* that `harness-setup` can act on. + +The loop: **review → setup → review again.** + +## Skills, not commands + +Invoke a skill directly (`/agentic-harness:harness-setup`) or let Claude trigger it from context. This plugin ships no slash commands — Claude Code merged commands into skills. + +## Execution modes + +`harness-setup` defaults to an **agent team** and falls back to **subagents** when the experimental team tools are unavailable. See `shared/execution-modes.md`. + +## Relationship to `kaizen` + +`kaizen` optimizes assets against measured KPIs. `agentic-harness` builds, assesses, and maintains the harness those assets live in. They compose. + +## Inspired by + +The harness concept is inspired by prior work in the community; this is an independent implementation. Credit lives in the repository README. diff --git a/agentic-harness/shared/claude-md-pointer.md b/agentic-harness/shared/claude-md-pointer.md new file mode 100644 index 0000000..d3cdf53 --- /dev/null +++ b/agentic-harness/shared/claude-md-pointer.md @@ -0,0 +1,61 @@ +# The CLAUDE.md pointer + +A target project's `CLAUDE.md` is loaded into context at the start of every session. That +makes it the right place to record that a harness exists and when it should fire — and the +wrong place for anything the file system already holds. + +## Register a minimal pointer + +After a harness is built or changed, write (or update) one short section in the target +project's `CLAUDE.md`. It carries three things and nothing more: + +1. The harness's **goal**, in one line. +2. The **trigger rule** — which orchestrator skill to use, and for which kind of request. +3. A **change-history** table. + +This is enough for a fresh session: the trigger rule routes domain requests to the +orchestrator, and the orchestrator handles the rest from the files under `.claude/`. + +### Template + +````markdown +## Harness: {domain} + +**Goal:** {one line on what this harness produces} + +**Trigger:** For {domain} work, use the `{orchestrator-skill-name}` skill. Answer simple +questions directly. + +**Change history:** +| Date | Change | Target | Reason | +|------|--------|--------|--------| +| {YYYY-MM-DD} | Initial setup | All | — | +```` + +## What not to put here + +Leave these out of `CLAUDE.md`: + +- The agent list or the skill list — they live in `.claude/agents/` and `.claude/skills/`, + and the orchestrator already knows them. Copying them here creates a second source of + truth that drifts. +- The directory structure — readable straight from the file system. +- Detailed execution rules — they belong in the skills and the orchestrator. + +The pointer is a signpost, not a manifest. Keep it small enough that it stays correct. + +## The change-history table + +Every write to the harness appends a row. The columns are fixed: + +| Date | Change | Target | Reason | +|------|--------|--------|--------| +| 2026-04-05 | Initial setup | All | — | +| 2026-04-07 | Added QA agent | `agents/qa.md` | deliverable-quality gaps reported | +| 2026-04-10 | Added tone guidance | `skills/content-writer` | output read as too stiff | + +The table earns its place: it shows how the harness has evolved and why, which makes +regressions visible and gives the next reviewer a starting point. Recording history is a +required step of every harness write — not an optional courtesy. + +On periodic review this table can be pruned, preserving valuable information and keeping it readable. diff --git a/agentic-harness/shared/execution-modes.md b/agentic-harness/shared/execution-modes.md new file mode 100644 index 0000000..3a2baa4 --- /dev/null +++ b/agentic-harness/shared/execution-modes.md @@ -0,0 +1,85 @@ +# Execution modes + +A harness coordinates more than one agent. There are three ways to run that coordination. +Pick one per harness — or, in hybrid, one per phase — and state the choice explicitly in +the orchestrator. + +## Before you choose: the team-tools caveat + +The agent-team mode depends on experimental tools: `TeamCreate`, `SendMessage`, +`TaskCreate`/`TaskUpdate`, `TeamDelete`. These are **not guaranteed to be available** in a +given Claude Code build, session, or permission setup. Treat their presence as a runtime +fact to check, not an assumption. + +Because of that, every team-mode harness must define a **subagent fallback**. The subagent +mode uses only the `Agent` tool, which is broadly available. A harness that can only run as +a team is brittle; a harness that prefers a team and falls back to subagents is not. Build +the fallback at the same time as the team path, not later. + +When the team tools are missing, do not fail — re-route to the subagent equivalent +(`Agent` with `run_in_background`) and note in the run that team coordination was +unavailable. The mapping below makes that re-route mechanical. + +## The three modes + +| Mode | Use when | Mechanism | +|------|----------|-----------| +| **Agent team** (default) | Two or more agents collaborate and benefit from real-time exchange — sharing findings, challenging each other, reconciling conflicts | Members run as peers; coordinate via `TeamCreate` + `SendMessage` + a shared task list (`TaskCreate`/`TaskUpdate`) | +| **Subagent** (fallback and lightweight default) | A single agent's work, or several independent jobs where only the result matters and inter-agent talk would be overhead | The orchestrator calls the `Agent` tool directly; parallelize with `run_in_background` and collect return values | +| **Hybrid** | Phases differ in character — e.g. independent collection, then consensus integration | Choose the mode per phase; state each phase's mode in the orchestrator | + +The team mode is the *preferred* default when agents genuinely need to talk: cross-checking +and shared discovery raise quality in a way isolated subagents cannot. But "preferred" is +conditional on the tools being present and on the work actually needing coordination. When +either is false, subagents are the right call, not a downgrade. + +## Decision order + +1. Is this a single agent? → **subagent**. One agent needs no team. +2. Two or more agents: do they need to exchange information mid-task (challenge findings, + resolve conflicts, hand off partial state)? → if yes, **agent team**; if no — only the + final results combine — **subagent** is enough and cheaper. +3. Are the team tools unavailable? → **subagent fallback**, using the mapping below. +4. Do phases differ markedly in whether coordination helps? → **hybrid**, mode stated per + phase. + +## Team-to-subagent fallback mapping + +When team mode is unavailable, translate it mechanically: + +| Team-mode mechanism | Subagent equivalent | +|---------------------|---------------------| +| `TeamCreate` + member prompts | One `Agent` call per member, each with its role prompt | +| Parallel members | `Agent` calls with `run_in_background: true`, collected after | +| `SendMessage` between members | No direct channel — members can't talk; route shared context through files the orchestrator writes and each agent reads | +| `TaskCreate`/`TaskUpdate` shared list | The orchestrator holds the task list itself and assigns work in the spawn prompts | +| Leader synthesis after team | Orchestrator reads each agent's output file/return value and integrates | + +The cost of the fallback is real: subagents cannot debate or revise each other mid-flight. +Where the team relied on that, compensate by adding an explicit integration or review step +that reconciles the independent outputs afterward. + +## Data-passing per mode + +State the data-passing method inside the orchestrator. Match it to the mode: + +| Method | How | Mode | Fits | +|--------|-----|------|------| +| Message-based | direct member-to-member via `SendMessage` | team | real-time coordination, lightweight state hand-off | +| Task-based | shared work state via `TaskCreate`/`TaskUpdate` | team | progress tracking, dependency management | +| File-based | write and read files at agreed paths | team + subagent | large or structured deliverables, audit trail | +| Return-value-based | the `Agent` tool's return message | subagent | the orchestrator collects results directly | + +- **Team mode:** task-based for coordination + file-based for deliverables + message-based + for live exchange. +- **Subagent mode:** return-value-based for results + file-based for anything large. +- **Hybrid:** apply the matching combination per phase, and check the hand-off at phase + boundaries — when a team phase feeds a subagent phase, the team's output files become the + subagent's inputs. + +### File-passing convention + +- Keep intermediate work under a `_agents_workspace/` directory in the working tree. +- Name files `{phase}_{agent}_{artifact}.{ext}` — e.g. `01_analyst_requirements.md`. +- Write only the final deliverable to the user's target path; preserve `_agents_workspace/` for + later inspection rather than deleting it. diff --git a/agentic-harness/shared/harness-model.md b/agentic-harness/shared/harness-model.md new file mode 100644 index 0000000..ab925e5 --- /dev/null +++ b/agentic-harness/shared/harness-model.md @@ -0,0 +1,55 @@ +# The harness model + +A *harness* is the project-local configuration that lets a repository work well with +agents. It has three parts, and keeping them separate is what makes a harness +maintainable. + +## The three parts + +| Part | Question it answers | Where it lives | +|------|---------------------|----------------| +| **Agent** | *Who* does the work | `.claude/agents/{name}.md` | +| **Skill** | *How* the work is done | `.claude/skills/{name}/SKILL.md` | +| **Orchestrator** | *When*, and *in what order*, the agents collaborate | a skill, usually `.claude/skills/{domain}-orchestrator/` | + +An **agent** is an expert role: a persona, its working principles, its input/output +contract, and — in team mode — how it communicates with other agents. An agent is small. +It says who is acting and what that actor cares about, not the step-by-step procedure. + +A **skill** is procedural knowledge: the workflow an agent follows, the format it +produces, the references it consults. A skill can be large. It says how a job is done, +independent of who does it. + +An **orchestrator** is a skill whose subject is the team itself. Where individual skills +describe one agent's job, the orchestrator describes the collaboration: which agents take +part, what each produces, how their outputs flow together, and how failures are handled. + +## Why separate who from how + +The separation is not bookkeeping. It buys three concrete things: + +- **Reuse.** A skill written for "how to review an API contract" is usable by any agent + that needs it, in this project or the next. Bind the procedure to a single named agent + and it stops being portable. +- **Independent change.** When a deliverable is weak, you revise the skill. When you need + a new kind of reviewer, you add an agent. When the workflow order is wrong, you edit the + orchestrator. Each fault has one obvious place to fix it. +- **Survival across sessions.** An agent defined only inline in a prompt vanishes when the + session ends. An agent defined as a file is there next time, with its protocol intact + and its memory. + +## The file rule + +Every agent is a file under `.claude/agents/`, even when it uses a built-in type +(`general-purpose`, `Explore`, `Plan`). Put the built-in type in the call that spawns the +agent; put the role, principles, and protocol in the file. A role written only into a +spawn prompt is not reusable and carries no collaboration contract — so it is not part of +the harness. + +## How the parts connect + +One agent uses one or more skills; a skill may be shared by several agents. The +orchestrator names the agents and points each at its skills. The pointer in the target +project's `CLAUDE.md` names only the orchestrator and its trigger rule — see +`claude-md-pointer.md`. Nothing in the harness duplicates what the file system already +states: the agent and skill lists live in `.claude/`, not in `CLAUDE.md`. diff --git a/agentic-harness/skills/harness-review/SKILL.md b/agentic-harness/skills/harness-review/SKILL.md new file mode 100644 index 0000000..b0304b7 --- /dev/null +++ b/agentic-harness/skills/harness-review/SKILL.md @@ -0,0 +1,134 @@ +--- +name: harness-review +description: "Assess an existing agentic harness — read-only. Use to review, audit, or assess a harness; to ask how well its skills and agents are actually used; to check for drift between the files and the CLAUDE.md record; or to validate that skills trigger and agents wire up correctly. It inventories .claude/agents and .claude/skills, reads the CLAUDE.md pointer and change history, judges effective usage from project memory and the inventory, and produces a prioritized review context — what works well and what to improve — that hands off to harness-setup. This skill writes nothing; it diagnoses. To build or change a harness, use harness-setup instead." +model: opus +--- + +# Harness review — assess an existing harness (read-only) + +This skill diagnoses an existing agentic harness and produces a *review context*: a +prioritized account of what works well and what to improve. It is the **reader** half of the +plugin — it writes nothing. The fixes it identifies are carried out by `harness-setup`. + +The harness model it assesses against is in `${CLAUDE_PLUGIN_ROOT}/shared/harness-model.md` +(agent = who, skill = how, orchestrator = when/order). + +## This skill vs harness-setup + +`harness-review` reads and judges; `harness-setup` writes. A request to "review / assess / +audit a harness" or "how well is the harness used" is this skill. A request that creates or +changes anything is `harness-setup`. The output here is a review context that hands off to +`harness-setup` — this skill never applies a change itself. + +## The read-only contract + +Make no edits — not to `.claude/agents/`, not to `.claude/skills/`, not to `CLAUDE.md`, not +to project memory. Run nothing that mutates the project. When a fix is obvious, write it into +the review context as a recommendation for `harness-setup`; do not apply it. The value of a +reader is that its findings are trustworthy precisely because it changed nothing. + +## Step 1: Inventory the harness + +1. List `.claude/agents/` and `.claude/skills/`; identify the orchestrator skill. +2. For each agent, read its frontmatter and role; for each skill, read its name and + description. Note the declared execution mode. +3. Map which skills each agent uses and how the orchestrator wires them together. + +## Step 2: Read the record + +Read the harness section of `CLAUDE.md` — the goal, the trigger rule, and the change-history +table. The convention is in `${CLAUDE_PLUGIN_ROOT}/shared/claude-md-pointer.md`. The history +tells you how the harness has evolved and where recent changes concentrated. + +## Step 3: Detect drift + +Compare three views of the harness: + +- the files in `.claude/agents/` and `.claude/skills/`, +- the orchestrator's stated composition, +- the `CLAUDE.md` pointer and change history. + +List every discrepancy: an agent or skill present in the files but absent from the record or +the orchestrator; something the record names that no longer exists; an orchestrator that +references a member it never forms. Do not reconcile drift — report it. Reconciliation is a +write, and belongs to `harness-setup`. + +## Step 4: Assess effective usage + +Drift tells you what *exists*; usage tells you what is *used*. Judge effective usage from a +fixed, deterministic signal set — read each, infer nothing you cannot ground in one: + +- **Project auto-memory** at `~/.claude/projects//memory/` — what the project + has recorded about how work actually happens. +- **The `CLAUDE.md` pointer and change history** — whether the trigger rule is current and + what has been changing. +- **The `.claude/` inventory** — what the harness offers. +- **The tools registry**, if present (a `tools.md` in the orchestrator's `references/` + directory) — which roles are filled by which tools, and their alternatives. + +From these, classify each skill and agent as **used**, **unused**, **bypassed** (the work +happens but around the harness), or **drifted** (present but out of sync). Apply the same lens +to registered tools: is each one still used, and is there now a better alternative for its +role? Flag tools that look unnecessary or superseded — as findings for `harness-setup`, not +changes you make. The method for reading each signal and making the call is in +`references/usage-assessment.md`. This step is strictly read-only. + +## Step 5: Validate + +- **Structural.** Agents are files in the right place (including built-in types); skill + frontmatter has `name` and `description`; cross-references between agents are consistent; + no `commands/` directory was generated. +- **Triggering.** For each skill description, write should-trigger and should-NOT-trigger + queries — the should-NOT set built from near-misses, including ones that belong to a + *different* skill. Confirm the descriptions separate cleanly and don't collide. The method + is in `references/skill-testing-guide.md`. +- **Effectiveness.** Where feasible, compare the deliverable with the skill against the same + task without it, to confirm the skill adds value. Also in `references/skill-testing-guide.md`. +- **Dry-run.** Walk the orchestrator: is the phase order logical, do data-passing paths have + no dead links, does each phase's input match the previous phase's output, is each error + fallback executable? Check the declared execution mode against + `${CLAUDE_PLUGIN_ROOT}/shared/execution-modes.md` — in particular, that a team-mode + harness has the subagent fallback it needs. +- **QA agent, if present.** Check it against the QA methodology — does it compare across + boundaries rather than only confirm existence, and run incrementally rather than once at + the end? See `references/qa-agent-guide.md`. + +## Step 6: Emit the review context + +Produce a prioritized review context — the deliverable of this skill — and hand it to +`harness-setup`: + +````markdown +## Harness review: {domain} — {date} + +### Works well (keep) +- {finding} — {brief evidence} + +### To improve (act on) +| Priority | Finding | Target | Why | +|----------|---------|--------|-----| +| high | {what's wrong} | {agent / skill / orchestrator / CLAUDE.md} | {evidence and impact} | +| medium | ... | ... | ... | + +### Drift +- {discrepancy between files, orchestrator, and record} + +### Suggested next step +Hand this context to `harness-setup` to act on the "to improve" items in priority order. +```` + +Order the "to improve" items by impact, name a concrete target for each (so `harness-setup` +knows where the fix lands), and ground each in evidence from the steps above — not in +speculation. Keep "works well" honest: it tells `harness-setup` what not to touch. + +## References + +- `references/usage-assessment.md` — the deterministic signal set and how to read each to + classify a skill or agent as used, unused, bypassed, or drifted. +- `references/skill-testing-guide.md` — trigger validation (should / should-not), with-skill + vs baseline effectiveness checks, and the iterative-improvement method. +- `references/qa-agent-guide.md` — what a good QA agent does (cross-boundary comparison, + boundary-bug patterns, incremental QA), for assessing a harness that includes one. +- `${CLAUDE_PLUGIN_ROOT}/shared/harness-model.md`, + `${CLAUDE_PLUGIN_ROOT}/shared/claude-md-pointer.md`, + `${CLAUDE_PLUGIN_ROOT}/shared/execution-modes.md` — shared concepts. diff --git a/agentic-harness/skills/harness-review/references/qa-agent-guide.md b/agentic-harness/skills/harness-review/references/qa-agent-guide.md new file mode 100644 index 0000000..af75fcd --- /dev/null +++ b/agentic-harness/skills/harness-review/references/qa-agent-guide.md @@ -0,0 +1,137 @@ +# QA agent guide + +What a good QA agent does, so you can assess one in an existing harness or specify one for +`harness-setup` to build. The method is drawn from common boundary-bug patterns — recurring +failure modes that surface where two correctly-built components meet. + +## Table of contents + +1. [The bugs QA agents miss](#1-the-bugs-qa-agents-miss) +2. [Why static checks don't catch them](#2-why-static-checks-dont-catch-them) +3. [Cross-boundary verification](#3-cross-boundary-verification) +4. [Design principles](#4-design-principles) +5. [Integration-consistency checklist](#5-integration-consistency-checklist) +6. [QA agent definition template](#6-qa-agent-definition-template) + +--- + +## 1. The bugs QA agents miss + +The most common defect is a **boundary mismatch**: two components are each correct on their +own, but the contract between them is broken at the seam. Each passes its own review; nobody +compares the two sides. Recurring forms: + +| Boundary | Mismatch | Why it's missed | +|----------|----------|-----------------| +| response → consumer type | the producer returns `{ items: [...] }`; the consumer expects a bare array | each is fine alone; no one compares the shapes | +| field name across layers | one layer uses camelCase, another snake_case for the same field | a generic cast hides it from the compiler | +| route path → link target | a page lives under one path prefix; a link points at another | file layout and link values aren't cross-checked | +| state map → update sites | the transition map allows a move the code never performs (or vice versa) | only the map's existence is confirmed, not every update site | +| endpoint → caller | an endpoint exists but nothing calls it (or a caller hits a missing one) | the endpoint list and caller list aren't mapped 1:1 | +| immediate vs async result | the immediate response lacks a field the consumer reads off it | only the type is checked, not sync vs async timing | + +## 2. Why static checks don't catch them + +- **Generic casts defeat the type checker.** A typed fetch annotated to return one shape will + compile even when the runtime value has another — the annotation is a claim, not a check. +- **A passing build is not correct behavior.** With casts, `any`, or generics in play, the + build succeeds and the code still fails at runtime. +- **Existence is not connection.** "Does the endpoint exist?" and "does its response match + what the caller expects?" are different questions; only the second catches boundary bugs. + +## 3. Cross-boundary verification + +The core technique: compare the two sides of a contract directly, never one in isolation. + +- **Response shape ↔ consumer type.** Extract the shape the producer actually returns and the + type the consumer expects, and compare — including any wrapper (does the producer return + `{ data: [...] }` while the consumer reads a bare array?), and field-name casing across + layers, and the difference between an immediate acknowledgement and the eventual result. +- **Route path ↔ link target.** Derive each page's real path from the file layout and compare + it against every link, redirect, and navigation target in the code; account for path + prefixes and dynamic segments. +- **State map ↔ update sites.** List the transitions the map allows, then find every place the + code changes state; flag transitions the code performs that the map forbids, and transitions + the map allows that the code never performs. +- **Endpoint ↔ caller.** List every endpoint and every call site and pair them; an endpoint no + one calls is either dead or a missing wire-up — decide which. + +## 4. Design principles + +- **Use the `general-purpose` type, not `Explore`.** Effective QA greps for patterns, runs + comparison scripts, and sometimes fixes — `Explore` is read-only and can't. Give the agent a + "verify → report → request fix" protocol instead of read-only tooling. +- **Prefer cross-boundary comparison over existence checks.** "Does the response shape match + the consumer's type?" beats "does the endpoint exist?"; "does every update obey the state + map?" beats "is the map defined?". +- **Read both sides at once.** To catch a seam bug, open the producer and the consumer + together — the route and its caller, the state map and the update code, the file layout and + the links. State this explicitly in the agent definition. +- **Run incrementally, not just at the end.** QA placed only after everything is built lets + early mismatches propagate and raises the cost of every fix. Verify each module's boundaries + as it lands. + +## 5. Integration-consistency checklist + +A template to fold into a QA agent for a typical web application: + +```markdown +### Integration consistency + +#### Response ↔ consumer +- [ ] Every response shape matches the consumer's expected type +- [ ] Wrapped responses ({ items: [...] }) are unwrapped on the consumer side +- [ ] camelCase/snake_case is consistent across layers +- [ ] Immediate acknowledgements and eventual results are distinguished on the consumer +- [ ] Every endpoint has a caller, and every caller hits a real endpoint + +#### Routing +- [ ] Every link / redirect / navigation target resolves to a real page path +- [ ] Path prefixes and route groups are accounted for +- [ ] Dynamic segments are filled with the right parameters + +#### State machine +- [ ] Every transition the code performs is allowed by the map (no unauthorized moves) +- [ ] Every transition the map allows is reachable in the code (no dead transitions) +- [ ] Intermediate-to-final transitions are present, not missing +- [ ] State-based branches on the consumer are actually reachable + +#### Data flow +- [ ] Field names match from storage through response to consumer type +- [ ] Optional-field null/undefined handling is consistent on both sides +``` + +## 6. QA agent definition template + +```markdown +--- +name: qa-inspector +description: "QA verification specialist. Checks spec compliance, integration consistency, and design quality across module boundaries." +model: opus +--- + +# QA inspector + +## Core role +Verify implementation quality against the spec, and **integration consistency between +modules** — boundary mismatch is the leading cause of runtime failure. + +## Verification priority +1. Integration consistency (highest) +2. Functional spec compliance +3. Design quality +4. Code quality (dead code, naming) + +## Method: read both sides at once +| Target | Producer side | Consumer side | +|--------|---------------|---------------| +| response shape | where the response is built | where it is consumed and typed | +| routing | page path from the file layout | link / redirect / navigation target | +| state | the transition map | the state-update sites | +| data | storage field names | response field → consumer type | + +## Team communication protocol +- On a finding, send a specific fix request (location + fix) to the responsible agent. +- For a boundary issue, notify **both** sides. +- To the leader: a report separating passed / failed / unverified. +``` diff --git a/agentic-harness/skills/harness-review/references/skill-testing-guide.md b/agentic-harness/skills/harness-review/references/skill-testing-guide.md new file mode 100644 index 0000000..50d5fa7 --- /dev/null +++ b/agentic-harness/skills/harness-review/references/skill-testing-guide.md @@ -0,0 +1,103 @@ +# Skill testing guide + +How to test whether a harness's skills trigger correctly and add value, and how to improve +them. Supplements the validation step. + +## Table of contents + +1. [Two kinds of evaluation](#1-two-kinds-of-evaluation) +2. [Writing test prompts](#2-writing-test-prompts) +3. [With-skill vs baseline](#3-with-skill-vs-baseline) +4. [Assertion-based scoring](#4-assertion-based-scoring) +5. [Trigger validation](#5-trigger-validation) +6. [The improvement loop](#6-the-improvement-loop) + +--- + +## 1. Two kinds of evaluation + +Skill quality is judged two ways, and most skills need both: + +| Kind | How | Fits | +|------|-----|------| +| Qualitative | the user reads the deliverable | subjective quality — prose, design, judgment calls | +| Quantitative | assertion-based scoring | objectively checkable results — a file created, data extracted, code generated | + +The loop is the same either way: **write a test → run it → evaluate → improve → re-run**. + +## 2. Writing test prompts + +A test prompt should read like a real request a real user would type — specific and natural. +Abstract prompts ("process the data", "generate a chart") have little test value. + +A good prompt carries concrete detail: + +```text +In the quarterly sales sheet in my downloads folder, add a margin column from the revenue and +cost columns, then sort by margin descending. +``` + +Vary the prompts so the set covers more than the happy path: + +- Mix formal and casual phrasing. +- Mix explicit intent (states the file type) with implicit (must be inferred from context). +- Mix simple and multi-step tasks. +- Start with two or three: one core case, one edge case, optionally one composite. + +## 3. With-skill vs baseline + +To confirm a skill actually adds value, run the same prompt twice in parallel — once with the +skill, once without — and compare: + +- **With-skill:** the agent reads the skill and does the work. +- **Baseline:** the same prompt, no skill. When improving an *existing* skill, the baseline is + the previous version (keep a snapshot), not an empty run. + +Keep each run's outputs in their own directory so iterations don't overwrite each other. If +you capture timing or token counts, save them from the completion notification immediately — +that data is available only at that moment and cannot be recovered later. + +## 4. Assertion-based scoring + +When the deliverable is objectively checkable, write assertions. A good assertion can be +judged true or false, has a descriptive name, and tests the skill's core value. A bad one is +subjective ("it reads well") or passes regardless of the skill ("output exists"). + +Watch for **non-discriminating** assertions — ones that pass in both the with-skill and the +baseline run. They measure nothing about the skill; drop them or replace them with a harder +one. Where an assertion can be checked by code, script it: faster, repeatable, and reusable +across iterations. Record results with the field names `text`, `passed`, `evidence` (the same +schema the setup-side skill-writing guide defines). + +## 5. Trigger validation + +A skill's description is its only trigger. Validate it with two query sets: + +- **Should-trigger (8–10):** the same need phrased many ways — formal and casual, explicit and + implicit, including cases where this skill should win over a competing one. +- **Should-NOT-trigger (8–10):** **near-misses are the point.** A query with overlapping + keywords but where a *different* tool is the right fit tests the boundary; an obviously + unrelated query ("write a Fibonacci function") tests nothing. Build the should-NOT set from + queries that belong to adjacent skills — especially, for this plugin's own two skills, a + "review my harness" that must not fire `harness-setup` and an "extend my harness" that must + not fire `harness-review`. + +Then check for collisions: confirm the should-trigger queries don't wrongly fire a *different* +existing skill. Where two descriptions overlap, sharpen the boundary wording in both. + +## 6. The improvement loop + +When a test surfaces a problem: + +1. **Generalize the fix.** Repair the principle, not the one example that failed — a narrow + patch that fits only the test case is overfitting. +2. **Cut what doesn't earn its place.** If the transcript shows the skill making the agent do + unproductive work, remove that part. +3. **Explain the why.** Even terse feedback has a reason behind it; fold the reason into the + skill so it generalizes. +4. **Bundle repetition.** If every run writes the same helper, pre-bundle it under `scripts/`. + +Re-run all cases in a fresh iteration directory, show the user the result against the previous +one, and repeat. Stop when the user is satisfied, when feedback comes back empty, or when +there is no more meaningful improvement to make. Draft, then re-read with fresh eyes — don't +expect the first pass to be the final one. diff --git a/agentic-harness/skills/harness-review/references/usage-assessment.md b/agentic-harness/skills/harness-review/references/usage-assessment.md new file mode 100644 index 0000000..fc3ab8e --- /dev/null +++ b/agentic-harness/skills/harness-review/references/usage-assessment.md @@ -0,0 +1,80 @@ +# Usage assessment + +How to judge whether a harness is actually *used*, not just whether it *exists*. This reads a +fixed, deterministic signal set and classifies each skill, agent, and tool. It changes +nothing — every output is a finding for `harness-setup`, never an edit. + +## Why a fixed signal set + +Usage is inferred, not measured directly — there is no usage counter. So ground every +judgment in a concrete signal rather than a guess. Four signals, all read-only. Read each, +state what it shows, and make the call only from evidence one of them supports. When the +signals are thin, say so and lower your confidence rather than inventing certainty. + +## Signal 1: project auto-memory + +Location: `~/.claude/projects//memory/`. The `` is the project's +absolute path with the separators turned into dashes — e.g. a project at +`/Users/me/work/app` becomes a directory like `-Users-me-work-app`. Find the one whose slug +matches the project you are reviewing. + +Read `MEMORY.md` (the index) and the memory files it points to. This is where the project has +recorded how work actually happens. What to look for: + +- Does the memory mention the orchestrator skill, the agents, or the harness at all? Silence + is itself a signal — a harness the project never records using is likely unused. +- Does it describe work being done **by hand** in the harness's domain? That points to the + orchestrator being bypassed. +- Does it record friction ("the agent kept doing X")? That points to a quality or trigger + problem to flag. + +## Signal 2: the CLAUDE.md pointer and change history + +Read the harness section of `CLAUDE.md`. The pointer's trigger rule tells you what *should* +route to the orchestrator; compare it against how the project actually describes its work +(Signal 1). The change-history table tells you the evolution: + +- A pointer that names an orchestrator the files no longer contain → drift. +- A long history that stops abruptly → the harness may have been abandoned. +- History concentrated on one agent or skill → that area has been unstable; worth a look. + +## Signal 3: the `.claude/` inventory + +List `.claude/agents/` and `.claude/skills/`, and read the orchestrator's composition. This +is the ground truth of what the harness offers. Cross-check it against the other signals: +something in the inventory that no signal ever mentions is a candidate for "unused"; something +the memory or pointer references that is missing from the inventory is drift. + +## Signal 4: the tools registry + +If the orchestrator has a `tools.md` registry (from the tool-discovery step), read it: the +roles, the preferred tool and alternative per role, and the `Last reviewed` dates. What to +look for: + +- A registered tool no agent or skill references by role → unused. +- A `Last reviewed` date that is far in the past → due for re-evaluation. +- A role whose preferred tool the project's memory shows failing, so the alternative is what + runs in practice → flag the preferred tool as a candidate to swap. + +## Classifying from the signals + +Combine the signals and label each skill, agent, and tool: + +| Class | What the signals show | +|-------|------------------------| +| **Used** | the inventory has it, and memory or history shows it being invoked or changed | +| **Unused** | it exists in the inventory, but no other signal ever references it | +| **Bypassed** | the work in its domain happens (memory shows it), but around the harness rather than through the orchestrator | +| **Drifted** | the pointer, history, or orchestrator names it, but the files disagree (missing, moved, or out of sync) | + +A label is only as good as its evidence. For each, name the signal that supports it. "Unused +— not referenced in memory, history, or the orchestrator" is a finding; "probably unused" with +nothing behind it is not. + +## Staying read-only + +This assessment writes nothing — not the files, not `CLAUDE.md`, not memory. When the right +fix is obvious (retire an unused agent, swap a superseded tool, widen a trigger that memory +shows missing), record it as a prioritized recommendation in the review context and hand it to +`harness-setup`. The separation is the point: a reader that never writes can be trusted, and +its findings stay clean for the writer to act on. diff --git a/agentic-harness/skills/harness-setup/SKILL.md b/agentic-harness/skills/harness-setup/SKILL.md new file mode 100644 index 0000000..6407e3d --- /dev/null +++ b/agentic-harness/skills/harness-setup/SKILL.md @@ -0,0 +1,221 @@ +--- +name: harness-setup +description: "Build, extend, and maintain a project's agentic harness — the agents, skills, and orchestrator under .claude/. This skill writes files. Use it to set up or build a harness, scaffold or design an agent team and the skills they use, add or change an agent or skill, update or rebuild the harness, sync it after drift, or apply a review context produced by harness-review. On explicit request it can also discover and configure external tools — MCP servers and plugins — that fit the project, and register the approved ones in the harness tools registry. Also triggers on follow-ups such as 'extend the harness', 'the harness needs a new agent', 're-run setup', 'act on the review', and 'find tools or MCPs for this project'. For read-only assessment of how well an existing harness is used, use harness-review instead — this skill is the writer, that one is the reader." +model: opus +--- + +# Harness setup — build and maintain the agent team + +This skill builds and maintains a project-local **agentic harness**: a team of agent +definitions, the skills those agents use, an orchestrator that coordinates them, and a +pointer in the project's `CLAUDE.md`. It writes files. It does not do the project's domain +work — it builds the agents and skills that do. + +Read the model first: `${CLAUDE_PLUGIN_ROOT}/shared/harness-model.md` defines the three +parts (agent = who, skill = how, orchestrator = when/order) and why they stay separate. + +## This skill vs harness-review + +`harness-setup` is the **writer** — every change to the harness goes through it. `harness-review` +is the **reader** — it assesses an existing harness and writes nothing. When a request is +"assess / review / audit / how well is it used," that is `harness-review`. When a request +creates or changes anything, it is this skill. After `harness-review` hands off a *review +context*, this skill is what acts on it. + +## Step 0: Orient before writing + +Always check the current state first, then pick the path. Do not start generating until the +plan is confirmed. + +1. Read `.claude/agents/`, `.claude/skills/`, and the harness section of `CLAUDE.md`. +2. Branch on what you find: + - **New build** — no harness, or empty agent/skill dirs → run Steps 1–7 in full. + - **Extend** — a harness exists and the request adds or changes an agent or skill → run + only the needed steps, per the extension matrix in `references/maintenance.md`. + - **Apply a review context** — `harness-review` produced a prioritized list → work it as + an interactive improvement pass (see `references/maintenance.md`). + - **Sync** — the files and the `CLAUDE.md` record disagree → reconcile and record the + correction. +3. Summarize what you found and the plan you propose; get the user's confirmation. + +## Step 1: Analyze the domain + +1. Identify the domain and the core task types (creation, validation, editing, analysis). +2. Explore the codebase — tech stack, data models, key modules — so agents and skills fit + the project rather than a generic template. +3. Check for conflicts or overlap with any existing agents and skills from Step 0. +4. Read the user's technical level from the conversation and match your wording to it. + Explain a term like "assertion" or "schema" when the cues suggest it is unfamiliar. + +## Step 1b: Discover tools — optional, on request + +Run this only when the user asks for it ("find tools / MCPs / plugins for this project"). +It is never automatic. It can run inside a build or standalone against an existing harness. + +This skill proposes nothing of its own — the candidates come from a live search, not a +built-in catalog: + +1. Hand a **search-optimized subagent** (`general-purpose`, with web search) a tight context + — the project's domain, stack, and task types from Step 1 — and have it find candidate + MCP servers and plugins that fit. Also inspect the local and session configuration for + tools already available, so you don't propose what is already there. +2. Present each candidate with the **role** it would fill, what it does, and its trade-off. + The user **accepts or rejects each one explicitly**. Adopt only what is accepted. +3. Register accepted tools **by role** in the tools registry under the orchestrator — a + `tools.md` file in `.claude/skills/{domain}-orchestrator/references/`. It is a lookup of + role → preferred tool → alternative (for when the preferred one is unavailable) → status. + Agents and skills reference a tool by its **role**, never by a hard tool name, so the + harness falls back to the alternative when a tool is missing. + +The subagent's context template, the acceptance flow, and the registry schema are in +`references/tool-discovery.md`. Registered tools are reviewed periodically — see +`references/maintenance.md`. + +## Step 2: Choose the execution mode and the architecture pattern + +**Execution mode.** Default to an **agent team** when two or more agents genuinely need to +exchange information mid-task; fall back to **subagents** when they do not, or when the +experimental team tools are unavailable. The team-tools caveat and the mechanical fallback +mapping are in `${CLAUDE_PLUGIN_ROOT}/shared/execution-modes.md` — read it and decide the +mode before designing the team, because the mode shapes the agent definitions and the +orchestrator. + +**Architecture pattern.** Decompose the work into areas of expertise and pick a structure. +The six patterns — Pipeline, Fan-out/Fan-in, Expert Pool, Producer-Reviewer, Supervisor, +Hierarchical Delegation — with their fit and their team-mode suitability are in +`references/agent-design-patterns.md`. Composite patterns are common; the same reference +covers them. + +**Split agents** along four axes — expertise, parallelism, context, reusability. The +criteria table is in `references/agent-design-patterns.md`. Prefer a few focused agents over +many thin ones; coordination cost grows with team size. + +## Step 3: Generate the agent definitions + +Write every agent as a file under `.claude/agents/{name}.md` — including agents that use a +built-in type (`general-purpose`, `Explore`, `Plan`). Put the built-in type in the spawn +call; put the role, principles, and protocol in the file. The reason is in the harness +model: a role defined only inline is not reusable next session and carries no collaboration +contract. + +Each agent file states: core role, working principles, input/output protocol, error +handling, and collaboration. In team mode, add a **team communication protocol** section — +who it messages, who messages it, and what it claims from the shared task list. The +definition template and worked agent files are in `references/agent-design-patterns.md` and +`references/team-examples.md`. + +**Model.** Set each agent's model explicitly, in both the agent file and the spawn call. A +harness's quality tracks its agents' reasoning, so use the strongest reasoning model for +roles that depend on judgment rather than throughput. + +**If the team includes a QA agent.** Use the `general-purpose` type (`Explore` is read-only +and cannot run validation). Make its core method *cross-boundary comparison* — read both +sides of a contract together (the producer and the consumer), not each in isolation — and +run it incrementally as each module lands, not once at the end. The full methodology, +boundary-bug patterns, and a QA agent template are in the `qa-agent-guide` reference under +the `harness-review` skill. + +## Step 4: Create the skills + +Create each skill the agents use at `.claude/skills/{name}/SKILL.md`. The authoring guide — +description writing, body principles, progressive disclosure, data-schema standards — is in +`references/skill-writing-guide.md`. The essentials: + +- **Description.** It is the only trigger mechanism. Write it to be specific about what the + skill does and when it should fire, slightly pushy to offset conservative triggering, and + worded to stay clear of skills that should *not* fire on the same request. +- **Body.** Explain the *why* rather than issuing bare "ALWAYS/NEVER" rules — an agent that + understands the reason handles edge cases correctly. Keep it lean (aim under 500 lines; + move detail to `references/`). Generalize to the principle instead of overfitting to one + example. Write imperatively. +- **Progressive disclosure.** Metadata is always in context; the body loads on trigger; + `references/` load only when needed. Split large or domain-specific detail into + `references/` so only the relevant file loads. +- **Linking.** One agent uses one or more skills; a skill may be shared across agents. The + skill holds *how*; the agent holds *who*. + +## Step 5: Build the orchestrator and register the pointer + +The orchestrator is a skill whose subject is the team: which agents take part, what each +produces, how outputs flow, and how failures are handled. Templates for team, subagent, and +hybrid modes — with data-passing, error handling, and test scenarios — are in +`references/orchestrator-template.md`. + +Build into the orchestrator: + +- **A context-check first phase** so it distinguishes an initial run from a follow-up or a + partial re-run (branch on whether `_agents_workspace/` already exists). +- **Follow-up trigger keywords** in its description ("re-run", "update", "modify", + "supplement", "improve the previous result", and everyday domain phrasings). Without + these, the harness goes unused after its first run. +- **Data-passing** stated explicitly, matched to the mode — see + `${CLAUDE_PLUGIN_ROOT}/shared/execution-modes.md`. +- **Error handling** that does not assume success: retry once, then proceed without the + missing result and note the omission; never delete conflicting data — record it with its + source. +- **The tools registry**, when tool discovery (Step 1b) has run: it lives in this + orchestrator's `references/` directory as `tools.md`, and agents and skills reference tools + by role from it. + +When extending rather than building new, modify the existing orchestrator — do not create a +second one. Reflect a new agent in the team composition, task assignment, data flow, and +trigger keywords. + +Then **register the pointer** in the project's `CLAUDE.md`: goal, trigger rule, and the +change-history table — and nothing the file system already holds. The convention and the +template are in `${CLAUDE_PLUGIN_ROOT}/shared/claude-md-pointer.md`. + +## Step 6: Record the change in history + +Every write to the harness appends a row to the change-history table in `CLAUDE.md` +(`Date | Change | Target | Reason`). This is a required step, not optional — it is how +evolution stays visible and regressions stay catchable. The table format is in +`${CLAUDE_PLUGIN_ROOT}/shared/claude-md-pointer.md`. + +## Step 7: Capture feedback + +A harness is a system that keeps changing, not a one-time artifact. After a run, offer the +user the chance to feed back ("anything to improve in the result, the team, or the +workflow?"). If there is none, move on — do not force it. When there is, route it to the +right place: output quality → the relevant skill; missing role → a new agent definition; +wrong order → the orchestrator; missing trigger → a description. The routing table and the +evolution triggers (recurring feedback, repeated failures, the user bypassing the +orchestrator) are in `references/maintenance.md`. + +## Deliverable checklist + +Before calling a setup or change complete: + +- [ ] Every agent is a file under `.claude/agents/` — including built-in types. +- [ ] Skills exist under `.claude/skills/` with valid `name` + `description` frontmatter. +- [ ] One orchestrator, with data flow, error handling, and test scenarios. +- [ ] Execution mode is stated (team / subagent / hybrid; per-phase if hybrid), with the + subagent fallback covered when team mode is the default. +- [ ] Each agent's model is set explicitly. +- [ ] No `commands/` directory was generated. +- [ ] No conflict with existing agents or skills. +- [ ] Skill and orchestrator descriptions are pushy and include follow-up keywords. +- [ ] Each SKILL.md body is within ~500 lines; overflow moved to `references/`. +- [ ] The orchestrator's first phase does a context check (initial / follow-up / partial). +- [ ] The `CLAUDE.md` pointer is registered (goal + trigger + change history). +- [ ] The change-history table records this change. +- [ ] If tool discovery ran: nothing was adopted without explicit approval, and accepted + tools are registered by role (with alternatives) in the orchestrator's `tools.md`. + +## References + +- `references/agent-design-patterns.md` — execution-mode comparison, the six architecture + patterns, agent-split criteria, the agent-definition template. +- `references/team-examples.md` — worked agent teams across generic domains, with full + sample agent files. +- `references/orchestrator-template.md` — orchestrator templates by mode, with data-passing, + error handling, and test scenarios. +- `references/skill-writing-guide.md` — skill authoring: descriptions, body style, + progressive disclosure, data-schema standards. +- `references/maintenance.md` — extending an existing harness (the extension matrix), + applying a review context, syncing drift, feedback routing, and periodic tool review. +- `references/tool-discovery.md` — the optional, on-request tool-discovery step: the + search subagent's context, the explicit-acceptance flow, and the `tools.md` registry schema. +- `${CLAUDE_PLUGIN_ROOT}/shared/harness-model.md`, + `${CLAUDE_PLUGIN_ROOT}/shared/execution-modes.md`, + `${CLAUDE_PLUGIN_ROOT}/shared/claude-md-pointer.md` — shared concepts. diff --git a/agentic-harness/skills/harness-setup/references/agent-design-patterns.md b/agentic-harness/skills/harness-setup/references/agent-design-patterns.md new file mode 100644 index 0000000..7a62e61 --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/agent-design-patterns.md @@ -0,0 +1,223 @@ +# Agent design patterns + +How to structure a team, which architecture pattern fits the work, when to split a role +into its own agent, and the shape of an agent definition file. + +## Table of contents + +1. [Execution modes, in brief](#1-execution-modes-in-brief) +2. [The six architecture patterns](#2-the-six-architecture-patterns) +3. [Composite patterns](#3-composite-patterns) +4. [Agent type selection](#4-agent-type-selection) +5. [Splitting vs merging agents](#5-splitting-vs-merging-agents) +6. [Skill vs agent](#6-skill-vs-agent) +7. [Agent definition template](#7-agent-definition-template) + +--- + +## 1. Execution modes, in brief + +Two modes, with a third that mixes them. The full decision logic, the experimental +team-tools caveat, and the team-to-subagent fallback mapping are in the shared execution-modes +doc — read that before designing the team. The short version: + +- **Agent team** — members run as peers and coordinate directly (`SendMessage`) over a + shared task list (`TaskCreate`). Use it when agents need to exchange information while they + work: sharing findings, challenging each other, reconciling conflicts. Only one team is + active per session, but a team can be disbanded between phases and a new one formed. The + team tools are experimental — always design a subagent fallback. +- **Subagent** — the orchestrator spawns each agent with the `Agent` tool; the agent returns + its result and does not talk to siblings. Use it for a single agent, or for independent + jobs where only the combined result matters. Parallelize with `run_in_background`. + +Each pattern below notes whether a team earns its cost or a subagent is the better fit. + +## 2. The six architecture patterns + +### Pipeline + +Stages run in order; each consumes the previous stage's output. + +``` +[Analyze] → [Design] → [Build] → [Verify] +``` + +Fits work with strong sequential dependency. A slow stage stalls the whole line, so make +each stage as independent as the work allows. Team mode adds little to a pure sequence — +unless a stage has parallel sub-work, in which case a team helps there. + +### Fan-out / Fan-in + +Independent work in parallel, then integration. + +``` + ┌→ [Specialist A] ─┐ +[Distribute]┼→ [Specialist B] ─┼→ [Integrate] + └→ [Specialist C] ─┘ +``` + +Fits when one input needs several independent perspectives. The integration stage governs +final quality. This is the most natural fit for a team: members surface findings to each +other and one member's discovery can redirect another's work mid-flight, which a set of +isolated subagents cannot do. Build it as a team when the tools allow. + +### Expert Pool + +Route to the one specialist the input needs. + +``` +[Router] → { Specialist A | Specialist B | Specialist C } +``` + +Fits when different input types need different handling. The router's classification is the +weak point. Subagents usually fit better — only the chosen specialist runs, so a standing +team is wasted. + +### Producer–Reviewer + +A producer and a reviewer work as a pair, looping until the deliverable passes. + +``` +[Produce] → [Review] → (issues?) → back to [Produce] +``` + +Fits when quality matters and there are objective review criteria. Cap the retry count (two +or three) to avoid an endless loop. A team lets producer and reviewer exchange feedback +directly; subagents work too when a written review hand-off is enough. + +### Supervisor + +A central agent holds the work state and hands out work as it goes. + +``` + ┌→ [Worker A] +[Supervisor] ─┼→ [Worker B] + └→ [Worker C] +``` + +Fits variable workloads where assignment is decided at runtime, not fixed up front (the +difference from fan-out). Keep delegation units large enough that the supervisor is not the +bottleneck. The team's shared task list matches this naturally: register work, let members +claim it. + +### Hierarchical Delegation + +A higher agent decomposes a problem and delegates to lower agents, recursively. + +``` +[Lead] → [Sub-lead A] → [Worker A1], [Worker A2] + → [Sub-lead B] → [Worker B1] +``` + +Fits problems that decompose cleanly into a tree. Keep it to two levels — three or more adds +latency and loses context. Teams cannot nest (a member cannot form its own team), so +implement level one as a team and level two as subagents, or flatten to a single team. + +## 3. Composite patterns + +Real harnesses usually combine patterns: + +| Composite | Shape | Example domain | +|-----------|-------|----------------| +| Fan-out + Producer–Reviewer | parallel production, then per-item review | translate several variants in parallel, each reviewed separately | +| Pipeline + Fan-out | parallelize one stage of a sequence | sequential analysis → parallel build → sequential integration test | +| Supervisor + Expert Pool | supervisor routes to the right specialist on demand | inbound triage that assigns each case to a domain specialist | + +Default composites to a team when members benefit from talking; drop to subagents only for a +stage that is genuinely isolated and one-off. + +## 4. Agent type selection + +Pass the type via the `Agent` tool's `subagent_type`. Built-in types: + +| Type | Tool access | Fits | +|------|-------------|------| +| `general-purpose` | full (incl. web, scripts) | research, validation, any work needing tools | +| `Explore` | read-only | reading and analyzing a codebase | +| `Plan` | read-only | design and planning | + +A custom type is an agent you defined under `.claude/agents/{name}.md`, invoked with +`subagent_type: "{name}"`; it has full tool access. Choose by: + +| Situation | Choice | +|-----------|--------| +| Complex role reused across sessions | custom type (a file) | +| Simple collection where a prompt suffices | `general-purpose` + a detailed prompt | +| Read-only analysis or review | `Explore` | +| Design or planning only | `Plan` | +| Work that modifies files | custom type | + +Whatever the type, write the agent as a file (see the harness model's file rule). For a QA +agent specifically, use `general-purpose` — `Explore` cannot run validation. + +## 5. Splitting vs merging agents + +Judge along four axes: + +| Axis | Split when | Merge when | +|------|------------|------------| +| Expertise | the domains differ | the domains overlap | +| Parallelism | they can run independently | they are sequentially dependent | +| Context | the context load is large | both are light and fast | +| Reusability | other teams use it too | only this team uses it | + +Prefer a few focused agents over many thin ones — coordination cost rises with team size. + +## 6. Skill vs agent + +| Aspect | Skill | Agent | +|--------|-------|-------| +| What it is | procedural knowledge + tools | an expert persona + principles | +| Lives in | `.claude/skills/` | `.claude/agents/` | +| Triggered by | matching a request | explicit invocation via the `Agent` tool | +| Size | small to large (a workflow) | small (a role) | +| Answers | *how* | *who* | + +An agent leverages a skill in one of three ways: + +| Way | How | Fits | +|-----|-----|------| +| Skill invocation | the agent prompt says to invoke `/skill-name` via the Skill tool | reusable, independently invocable skills | +| Inline | the skill content sits inside the agent definition | short (≤50 lines), exclusive to that agent | +| Reference load | the agent reads a `references/` file when needed | large, only conditionally relevant content | + +## 7. Agent definition template + +```markdown +--- +name: agent-name +description: "One or two sentences on the role. List the trigger keywords." +model: opus +--- + +# Agent Name — one-line role + +You are an expert [role] in [domain]. + +## Core role +1. ... +2. ... + +## Working principles +- ... + +## Input / output protocol +- Input: where it reads from, and what +- Output: where it writes, and what +- Format: file format and structure + +## Team communication protocol (team mode only) +- Receives: from whom, and what +- Sends: to whom, and what +- Claims: what it takes from the shared task list + +## Error handling +- on failure: ... +- on timeout: ... + +## Collaboration +- relationships with the other agents +``` + +Set `model` explicitly. For roles whose quality depends on judgment rather than throughput, +use the strongest reasoning model available. diff --git a/agentic-harness/skills/harness-setup/references/maintenance.md b/agentic-harness/skills/harness-setup/references/maintenance.md new file mode 100644 index 0000000..08da66e --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/maintenance.md @@ -0,0 +1,117 @@ +# Maintaining a harness + +A harness is a system that keeps changing, not a one-time build. This covers the four ways it +changes after the first build: extending it, applying a review context, syncing drift, and +folding in feedback. Every one of them ends by recording the change in history. + +## Table of contents + +1. [The extension matrix](#1-the-extension-matrix) +2. [Applying a review context](#2-applying-a-review-context) +3. [Syncing drift](#3-syncing-drift) +4. [Feedback routing](#4-feedback-routing) +5. [Evolution triggers](#5-evolution-triggers) +6. [The operations workflow](#6-the-operations-workflow) +7. [Periodic tool review](#7-periodic-tool-review) + +--- + +## 1. The extension matrix + +When extending an existing harness, do not re-run the full build. Run only the steps the +change needs (step numbers are from the `harness-setup` SKILL.md): + +| Change | Domain analysis (1) | Mode/pattern (2) | Agents (3) | Skills (4) | Orchestrator (5) | History (6) | +|--------|---------------------|------------------|------------|------------|------------------|-------------| +| Add an agent | reuse Step 0 | placement only | yes | only if it needs a new skill | update composition + triggers | yes | +| Add / change a skill | skip | skip | skip | yes | only if wiring changes | yes | +| Architecture change | skip | yes | only affected agents | only affected skills | yes | yes | + +When adding an agent, modify the existing orchestrator — do not spawn a second one. Reflect +the new agent in the team composition, task assignment, data flow, and trigger keywords. + +## 2. Applying a review context + +`harness-review` hands off a prioritized list: what works well (leave it), and what to +improve (act on it). Treat it as an interactive improvement pass, not a batch rewrite: + +1. Read the review context and confirm the priority order with the user. +2. Take items one at a time, highest priority first. +3. For each, identify the right target with the routing table below, make the smallest change + that addresses it, and record it in history. +4. After each change, re-check the specific concern the item raised before moving on. + +Working one item at a time keeps each change attributable and easy to roll back. + +## 3. Syncing drift + +Drift is when the files and the `CLAUDE.md` record disagree — an agent or skill exists that +the record doesn't mention, or the record names something no longer present. + +1. List `.claude/agents/` and `.claude/skills/`; compare against the orchestrator's + composition and the `CLAUDE.md` pointer. +2. For each discrepancy, decide with the user which side is correct — the files or the + record. +3. Reconcile toward the correct side, then record the correction in history. + +The files are usually the truth; the record is what falls behind. But confirm rather than +assume — a missing file can mean a deletion that was never recorded *or* an accidental loss. + +## 4. Feedback routing + +Different feedback lands in different places. Route by type: + +| Feedback | Fix where | Example | +|----------|-----------|---------| +| Output quality | the agent's skill | "the analysis is too shallow" → add depth criteria to the skill | +| Missing role | a new agent definition | "we also need a security pass" → add an agent | +| Wrong order | the orchestrator | "validation should come first" → reorder the phases | +| Team composition | orchestrator + agents | "merge these two" → combine the agents | +| Missing trigger | the skill description | "it doesn't fire when I phrase it this way" → widen the description | + +The reason this table exists is the separation in the harness model: who, how, and when each +have one home, so each kind of fault has one place to fix it. + +## 5. Evolution triggers + +Propose a change not only when the user asks, but when the signals say it is due: + +- The same kind of feedback recurs two or more times. +- An agent fails the same way repeatedly. +- The user is working around the orchestrator by hand — a sign it does not fit the real task. + +When you see these, raise it; don't wait to be told. + +## 6. The operations workflow + +For an audit-fix-sync request on an existing harness: + +1. **Audit.** Compare `.claude/agents/` and `.claude/skills/` against the orchestrator's + composition; produce a discrepancy list; report it to the user. (Read-only inventory and + usage assessment are `harness-review`'s job — call it for the deeper read.) +2. **Change incrementally.** Add, modify, or remove one agent or skill at a time. Sync after + each change rather than batching. +3. **Record history.** Append the date, change, target, and reason to the `CLAUDE.md` table. +4. **Validate.** Re-check the changed agents and skills structurally; if the change affects + triggering, re-check the descriptions; for large changes (an architecture change, or + adding/removing several agents), re-run the relevant tests. Finally, confirm the + `CLAUDE.md` record matches the actual files. + +## 7. Periodic tool review + +If the harness has a tools registry (`references/tools.md` under the orchestrator, produced +by the tool-discovery step), the tools in it are not permanent. Re-evaluate them +periodically — not on a fixed clock, but when a signal warrants it: + +- A registered tool is no longer being used by any agent or skill. +- A tool has stopped being maintained, or a better alternative has appeared. +- The preferred tool fails often enough that agents are running on its alternative in + practice. + +The split follows the rest of this plugin: assessing whether a tool is still earning its +place is a read activity, so it belongs to `harness-review` (it reads the registry as one of +its usage signals); swapping, adding, or retiring a tool is a write, so it comes back here. +When you change the registry, update the affected row's `Last reviewed` date, keep every +role's alternative current, and record the change in the `CLAUDE.md` history like any other. +Because agents and skills reference tools by role, swapping the tool behind a role needs no +edits to their files. diff --git a/agentic-harness/skills/harness-setup/references/orchestrator-template.md b/agentic-harness/skills/harness-setup/references/orchestrator-template.md new file mode 100644 index 0000000..b78f365 --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/orchestrator-template.md @@ -0,0 +1,208 @@ +# Orchestrator template + +The orchestrator is the top-level skill that coordinates the whole team. It comes in three +shapes, one per execution mode. Pick the shape from the mode chosen in Step 2; for hybrid, +state the mode per phase. + +## Table of contents + +- [Orchestrator template](#orchestrator-template) + - [Table of contents](#table-of-contents) + - [Template A — agent team (default)](#template-a--agent-team-default) + - [Template B — subagent (fallback / lightweight)](#template-b--subagent-fallback--lightweight) + - [Template C — hybrid](#template-c--hybrid) + - [Authoring rules](#authoring-rules) + - [Follow-up keywords](#follow-up-keywords) + +--- + +## Template A — agent team (default) + +The first choice when two or more agents need to talk while they work. Build the team with +`TeamCreate`; coordinate over a shared task list and `SendMessage`. + +````markdown +--- +name: {domain}-orchestrator +description: "Coordinates the {domain} agent team to produce {deliverable}. {initial keywords}. Also use for follow-ups: re-run, update, modify, supplement, improve the previous result, and everyday {domain} requests." +model: opus +--- + +# {Domain} orchestrator + +Coordinates the {domain} agent team to produce {final deliverable}. + +## Execution mode: agent team + +## Team + +| Member | Type | Role | Skill | Output | +|--------|------|------|-------|--------| +| {member-1} | {custom or built-in} | {role} | {skill} | {file} | +| {member-2} | ... | ... | ... | ... | + +## Workflow + +### Phase 0: context check +Branch on whether prior work exists: +- `_agents_workspace/` absent → initial run; go to Phase 1. +- `_agents_workspace/` present + a partial-change request → partial re-run; re-invoke only + the affected member and overwrite only its output. +- `_agents_workspace/` present + new input → new run; move the old `_agents_workspace/` to + `_agents_workspace/archive/{YYYYMMDD_HHMMSS}/`, then go to Phase 1. +On a partial re-run, pass the prior output path into the member's prompt so it reads the +existing result and folds in the change. + +### Phase 1: prepare +1. Read the input and identify {what to identify}. +2. Create `_agents_workspace/` (or recreate it after archiving the old one on a new run). +3. Save the input under `_agents_workspace/00_input/`. + +### Phase 2: form the team +1. `TeamCreate(team_name, members: [...])` — each member with its name, type, model, and a + role prompt. +2. `TaskCreate(tasks: [...])` — roughly five to six tasks per member; declare dependencies + with `depends_on`. + +### Phase 3: {the main work} +Members claim tasks from the shared list and work independently. State the communication +rules: who passes what to whom via `SendMessage`; that each member saves its output to a +file and notifies the leader on completion; that a member requesting another's result asks +over `SendMessage`. The leader watches progress (it is notified when a member goes idle), +nudges or reassigns a stuck member, and checks state with `TaskGet`. + +| Member | Output path | +|--------|-------------| +| {member-1} | `_agents_workspace/{phase}_{member-1}_{artifact}.md` | + +### Phase 4: integrate +1. Wait for all tasks to complete (`TaskGet`). +2. Read each member's output. +3. Apply the integration logic; where members conflict, record both with their sources. +4. Write the final deliverable to the user's target path. + +### Phase 5: tear down +1. Ask members to stop (`SendMessage`); `TeamDelete`. +2. Keep `_agents_workspace/` (do not delete intermediate work). +3. Report a summary to the user. + +## Error handling + +| Situation | Strategy | +|-----------|----------| +| One member fails or stalls | leader checks via `SendMessage`, restarts or replaces it | +| Most members fail | tell the user, confirm whether to continue | +| Timeout | use the partial results, stop the unfinished members | +| Members disagree | record both with sources; do not delete either | +| Task state lags | leader confirms with `TaskGet`, corrects with `TaskUpdate` | + +## Test scenarios +- **Normal:** input → analysis → team of N + M tasks → self-coordinated work → integration → + teardown → the deliverable exists at its path. +- **Error:** a member stops mid-phase → leader gets the idle notice → restart attempt → + on repeated failure, reassign its task → integrate the rest → note the gap in the report. +```` + +## Template B — subagent (fallback / lightweight) + +When team communication is unnecessary, or the team tools are unavailable. Spawn each agent +with the `Agent` tool and collect return values. + +````markdown +--- +name: {domain}-orchestrator +description: "Coordinates {domain} agents to produce {deliverable}. {initial keywords} + follow-up keywords." +model: opus +--- + +## Execution mode: subagent + +## Agents + +| Agent | subagent_type | Role | Skill | Output | +|-------|---------------|------|-------|--------| +| {agent-1} | {built-in or custom} | {role} | {skill} | {file} | + +## Workflow + +### Phase 0: context check +Same branch as Template A — on whether `_agents_workspace/` exists. + +### Phase 1: prepare +Read the input; create `_agents_workspace/` (archiving any old one on a new run). + +### Phase 2: run in parallel +Invoke the agents in a single message: + +| Agent | Input | Output | model | run_in_background | +|-------|-------|--------|-------|-------------------| +| {agent-1} | {source} | `_agents_workspace/{phase}_{agent}_{artifact}.md` | opus | true | + +### Phase 3: integrate +Collect each return value; read file outputs; apply the integration logic; write the final +deliverable. + +### Phase 4: finish +Keep `_agents_workspace/`; report a summary. + +## Error handling +- One agent fails: retry once; on a second failure, note the omission and continue. +- Most fail: tell the user and confirm. +- Timeout: use the partial results. +```` + +## Template C — hybrid + +A different mode per phase. State `**Execution mode:** {team | subagent}` at the top of each +phase. + +````markdown +--- +name: {domain}-orchestrator +description: "{domain} orchestrator (hybrid). {keywords} + follow-up keywords." +model: opus +--- + +## Execution mode: hybrid + +| Phase | Mode | Why | +|-------|------|-----| +| Phase 2 (parallel collection) | subagent | independent collection, no coordination needed | +| Phase 3 (consensus integration) | agent team | reconcile conflicting inputs by discussion | +| Phase 4 (independent verification) | subagent | one QA agent verifies objectively | + +### Phase 2: collect — **Execution mode:** subagent +Invoke N agents in parallel (`run_in_background: true`); save each to +`_agents_workspace/02_{agent}_raw.md`. + +### Phase 3: integrate — **Execution mode:** agent team +`TeamCreate` an integration team; `TaskCreate` work that reads the Phase 2 files; members +reconcile conflicts via `SendMessage`; write `_agents_workspace/03_integrated.md`; +`TeamDelete`. + +### Phase 4: verify — **Execution mode:** subagent +One QA subagent reads `_agents_workspace/03_integrated.md` and writes a verification report. +```` + +**Transition rules.** Team → subagent: `TeamDelete` before any `Agent` call. Subagent → +team: pass the subagent's file outputs to members as read paths. Team → team: tear down the +old team before the next `TeamCreate` (only one team is active per session). + +## Authoring rules + +1. State the execution mode at the top. For hybrid, a per-phase mode table is required. +2. For team mode, be concrete about `TeamCreate` / `TaskCreate` / `SendMessage` usage. +3. For subagent mode, fully specify each `Agent` call (name, type, prompt, model, + `run_in_background`). +4. Use clear paths under `_agents_workspace/`; avoid ambiguous relative paths. +5. State inter-phase dependencies — which phase needs which phase's output. For hybrid, + call out the transition points. +6. Make error handling realistic — do not assume everything succeeds. +7. Include at least one normal and one error test scenario. + +## Follow-up keywords + +Initial-run keywords alone leave the harness unused after its first run. Put follow-up +phrasings in the description: re-run, run again, update, modify, supplement; "only the +{part} again"; "based on the previous result", "improve the result"; and everyday domain +requests (for a launch-planning harness: "launch", "promotion", and the like). diff --git a/agentic-harness/skills/harness-setup/references/skill-writing-guide.md b/agentic-harness/skills/harness-setup/references/skill-writing-guide.md new file mode 100644 index 0000000..b968501 --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/skill-writing-guide.md @@ -0,0 +1,154 @@ +# Skill writing guide + +How to write the skills a generated harness uses, so they trigger reliably and earn their +context cost. Supplements Step 4. + +## Table of contents + +- [Skill writing guide](#skill-writing-guide) + - [Table of contents](#table-of-contents) + - [1. Writing the description](#1-writing-the-description) + - [2. Body style](#2-body-style) + - [3. Output formats and examples](#3-output-formats-and-examples) + - [4. Progressive disclosure](#4-progressive-disclosure) + - [5. When to bundle a script](#5-when-to-bundle-a-script) + - [6. Data-schema standards](#6-data-schema-standards) + - [7. What to leave out](#7-what-to-leave-out) + +--- + +## 1. Writing the description + +The description is the only thing that decides whether a skill triggers. The model judges +from the name and description alone, and it tends to skip a skill for a task it thinks it can +handle bare-handed. So: + +1. Say both **what the skill does** and **the specific situations that should trigger it**. +2. Mark the boundary — name the near-miss cases where a *different* tool is the right fit, so + this skill does not fire on them. +3. Lean slightly pushy, to offset the conservative default. + +**Weak:** `"A skill that processes data"` — which data, which task? Too vague to trigger. + +**Strong:** `"All spreadsheet operations — add columns, compute formulas, format, chart, +clean data — for .xlsx / .csv / .tsv. Use whenever the user mentions a spreadsheet file, +even in passing. Not for a Word document or a PDF: use the matching skill for those."` + +The strong version enumerates concrete actions, states the trigger, and draws the boundary. + +## 2. Body style + +**Explain the why.** A rule the agent understands generalizes; a bare command does not. + +Weak: +```text +ALWAYS use a table-aware extractor. NEVER use a plain text extractor for tables. +``` +Strong: +```text +Use a table-aware extractor for tables: a plain text extractor loses the row and column +structure, while a table-aware one recognizes cell boundaries and returns structured rows. +``` + +**Generalize, don't overfit.** Fix at the level of the principle, not the one example that +failed. + +Weak: `If a column is named "Q4 Revenue", convert it to numbers.` +Strong: `If a column name implies a numeric quantity (revenue, amount, count), convert it to +a number; if conversion fails, keep the original value.` + +**Write imperatively.** A skill is an instruction sheet: "do X", not "the skill does X". + +**Spend context deliberately.** The window is shared. For each sentence ask: does the agent +already know this (cut it), would it err without it (keep it), would one example replace the +explanation (swap it)? + +## 3. Output formats and examples + +When the deliverable's shape matters, show the template: + +```text +## Report structure +# {Title} +## Summary +## Findings +## Recommendations +``` + +Keep it short — a concrete example teaches faster than a long spec. For transformations, +pair input with output: + +```text +Input: Add token-based user authentication +Output: feat(auth): add token-based authentication +``` + +## 4. Progressive disclosure + +A skill loads in three stages: metadata (always in context), the SKILL.md body (on trigger), +and `references/` files (only when read). Use the stages: + +- As the body nears ~500 lines, move detail into `references/` and leave a one-line pointer + that says when to read it. +- Give any reference over ~300 lines a table of contents at the top. +- Split domain- or framework-specific variants into separate reference files, so only the + relevant one loads: + +```text +deploy-skill/ +├── SKILL.md (workflow + which file to load) +└── references/ + ├── provider-a.md + ├── provider-b.md + └── provider-c.md +``` + +## 5. When to bundle a script + +Watch the agents' transcripts during testing. Bundle when a pattern repeats: + +| Signal | Action | +|--------|--------| +| The same helper script is written in every test run | bundle it under `scripts/` | +| The same dependency is installed every time | state the install step in the skill | +| The same multi-step approach recurs | write it up as a standard procedure in the body | +| The same workaround follows the same error every time | document the issue and its fix | +| The same document or report is read or produced | write it up and define a json data schema for input/output payload to interact with it | + +A bundled script must pass an execution test of its own. + +## 6. Data-schema standards + +When a harness tests its own generated skills, a shared schema keeps results comparable. + +Test-case metadata: +```json +{ + "eval_id": 0, + "eval_name": "descriptive-name", + "prompt": "the user's task prompt", + "assertions": ["the deliverable contains X", "a file in format Y was produced"] +} +``` + +Assertion-based grading — use the field names `text`, `passed`, `evidence` exactly: +```json +{ + "expectations": [ + { "text": "a margin column was added", "passed": true, "evidence": "column E holds margin_pct" } + ], + "summary": { "passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67 } +} +``` + +Timing — capture from the completion notification immediately; it cannot be recovered later: +```json +{ "total_tokens": 84852, "duration_ms": 23332 } +``` + +## 7. What to leave out + +- Supplementary docs (README, CHANGELOG, install guides). +- Meta-notes about how the skill was built (test logs, iteration history). +- End-user manuals — a skill is an instruction sheet for an agent, not a human. +- General knowledge the model already has. diff --git a/agentic-harness/skills/harness-setup/references/team-examples.md b/agentic-harness/skills/harness-setup/references/team-examples.md new file mode 100644 index 0000000..4e56718 --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/team-examples.md @@ -0,0 +1,157 @@ +# Team examples + +Worked examples across generic domains. Each shows the architecture pattern, the execution +mode, the agent composition, and the coordination shape. Adapt them — do not copy them +literally. + +## Table of contents + +- [Team examples](#team-examples) + - [Table of contents](#table-of-contents) + - [1. Multi-source research (team, fan-out/fan-in)](#1-multi-source-research-team-fan-outfan-in) + - [2. Long-form authoring (team, pipeline + fan-out)](#2-long-form-authoring-team-pipeline--fan-out) + - [A full agent file — `consistency-reviewer.md`](#a-full-agent-file--consistency-reviewermd) + - [3. Generate-and-review (subagent, producer–reviewer)](#3-generate-and-review-subagent-producerreviewer) + - [4. Code review (team, fan-out/fan-in + debate)](#4-code-review-team-fan-outfan-in--debate) + - [5. Large migration (team, supervisor)](#5-large-migration-team-supervisor) + - [What every example shares](#what-every-example-shares) + +--- + +## 1. Multi-source research (team, fan-out/fan-in) + +Several researchers cover different source types in parallel; the leader integrates. + +| Member | Type | Scope | Output | +|--------|------|-------|--------| +| primary-source-researcher | general-purpose | official / first-party sources | `research_primary.md` | +| press-researcher | general-purpose | reporting and analysis | `research_press.md` | +| community-researcher | general-purpose | forums and social discussion | `research_community.md` | +| context-researcher | general-purpose | background, comparables, prior art | `research_context.md` | +| (leader = orchestrator) | — | integrate into one report | `synthesis-report.md` | + +The researchers use the `general-purpose` built-in type but are still defined as files — +each file states the scope and the team communication protocol so the roles are reusable and +the collaboration has a contract. + +Coordination: the four work independently, but pass relevant finds to each other over +`SendMessage` (the press researcher hands an acquisition rumor to the context researcher; +the community researcher flags a press item to the press researcher). Conflicting claims are +debated directly and, in the final report, recorded side by side with their sources. This +cross-talk is why it is a team rather than four isolated subagents. + +## 2. Long-form authoring (team, pipeline + fan-out) + +A document built in stages, with parallel groundwork. + +```text +Phase 1 (team, parallel): structure-planner + source-curator + outline-builder + — keep each other consistent via SendMessage +Phase 2 (subagent): section-writer drafts from the Phase 1 outputs +Phase 3 (team, parallel): fact-checker + consistency-reviewer review the draft + — share findings via SendMessage +Phase 4 (subagent): section-writer revises to reflect the review +``` + +Phase 1 forms a team, then tears it down. Phase 2 needs no team (one writer working alone), +so it runs as a subagent reading the Phase 1 files from `_agents_workspace/`. Phase 3 forms a +new review team (only one team is active at a time, but Phase 1's was already disbanded). +Phase 4 is again a lone subagent. + +### A full agent file — `consistency-reviewer.md` + +```markdown +--- +name: consistency-reviewer +description: "Checks a long-form draft for internal consistency — terminology, claims, and cross-references that contradict each other." +model: opus +--- + +# Consistency reviewer + +You check a draft for internal contradictions. You do not judge prose quality — that is the +writer's concern. You judge whether the document agrees with itself. + +## Core role +1. Flag terms used with two different meanings. +2. Flag claims in one section that a later section contradicts. +3. Flag cross-references that point to the wrong place or to nothing. + +## Working principles +- Report file and location for every issue, so the writer can act without searching. +- Distinguish a true contradiction from a stylistic variation; report only the former. + +## Input / output protocol +- Input: the draft at `_agents_workspace/02_section-writer_draft.md`. +- Output: `_agents_workspace/03_consistency_report.md`. +- Format: one entry per issue — location, the two conflicting statements, suggested fix. + +## Team communication protocol +- To fact-checker: SendMessage when a consistency issue depends on a factual question. +- To the leader: a report distinguishing confirmed issues from open questions. + +## Error handling +- If the draft is missing, report that and stop rather than inventing content. + +## Collaboration +- Works alongside fact-checker; the two divide internal vs external correctness. +``` + +## 3. Generate-and-review (subagent, producer–reviewer) + +Two agents — a producer and a reviewer — looping until the output passes. With only two +agents and a hand-off that matters more than conversation, subagent mode fits; cap the loop +at two or three rounds. + +| Agent | subagent_type | Role | +|-------|---------------|------| +| asset-producer | custom | generate the batch of outputs | +| asset-reviewer | custom | inspect each, mark PASS / FIX / REDO | + +The reviewer judges on objective criteria (completeness, consistency, legibility), not +taste, and writes a per-item verdict with a specific fix instruction. Items marked REDO go +back to the producer with that instruction; after the retry cap, a still-failing item is +passed with a warning. If a large fraction needs REDO, the reviewer proposes revising the +generation prompt rather than looping further. + +## 4. Code review (team, fan-out/fan-in + debate) + +Reviewers with different lenses examine the same change and talk to each other directly. + +```text +[leader] → TeamCreate(review-team) + ├── security-reviewer: vulnerabilities, input handling, authz + ├── performance-reviewer: hot paths, query patterns, allocations + └── test-reviewer: coverage and meaningful assertions + → reviewers cross-message; leader synthesizes +``` + +The value is in the cross-talk: the security reviewer flags an injectable query and asks the +performance reviewer to weigh in; the performance reviewer finds an N+1 query and asks the +test reviewer whether a test would have caught it. Reviewers reach each other without routing +through the leader, which catches cross-domain issues a set of siloed reviews would miss. + +## 5. Large migration (team, supervisor) + +A supervisor analyzes the file set and hands out batches dynamically. + +```text +[supervisor] → analyze file list → register batches as tasks + ├→ migrator-1 (claims a batch) + ├→ migrator-2 (claims a batch) + └→ migrator-3 (claims a batch) + ← on completion, claims the next; on failure, supervisor diagnoses and reassigns +``` + +Unlike fan-out, batches are not fixed in advance — workers claim the next item from the +shared task list as they free up, which keeps fast workers busy and lets the supervisor +rebalance around failures. When all batches are done, the supervisor runs the integration +test. + +## What every example shares + +- Every agent is a file under `.claude/agents/`, with the required sections and — in team + mode — a team communication protocol. +- Skills live under `.claude/skills/{name}/SKILL.md`. +- The orchestrator names the agents and the workflow and states the execution mode. See + `references/orchestrator-template.md`. diff --git a/agentic-harness/skills/harness-setup/references/tool-discovery.md b/agentic-harness/skills/harness-setup/references/tool-discovery.md new file mode 100644 index 0000000..8cd3a60 --- /dev/null +++ b/agentic-harness/skills/harness-setup/references/tool-discovery.md @@ -0,0 +1,96 @@ +# Tool discovery + +How the optional, on-request tool-discovery step works: how to brief the search subagent, +how candidates are accepted, and how accepted tools are registered so agents and skills can +use them without hard-coding a tool name. Supplements Step 1b. + +## When it runs + +Only when the user asks ("find tools / MCPs / plugins for this project"). It is never part of +an automatic build. It runs equally well during an initial build or standalone against an +existing harness. This skill ships **no catalog of recommendations** — every candidate comes +from a live search and the local configuration, and nothing is adopted without the user's +explicit approval. + +## Step 1: Brief the search subagent + +Dispatch a search-optimized subagent and give it a tight context so its search is grounded in +this project, not generic. Use the `general-purpose` type (it has web search and fetch); run +it in the background while you continue. The brief: + +```text +You are searching for external tools — MCP servers and Claude Code plugins — that would help +this project's harness. Project context: +- Domain: {from Step 1} +- Stack / languages / frameworks: {from Step 1} +- Core task types: {creation / validation / analysis / ...} +- Pain points or repeated manual work the user mentioned: {if any} + +For each candidate, return: the ROLE it fills (e.g. "knowledge base", "language server", +"issue tracker CLI"), the concrete tool, what it does, its maturity/maintenance signal, and +its main trade-off or cost. Do not rank or recommend — just surface grounded candidates with +evidence. Prefer well-maintained, widely-used tools; flag anything experimental. +``` + +The categories are illustrative, not prescriptive — a document-heavy project might benefit +from a markup-conversion MCP, a code project from a language-server tool, a project tracked +in an external issue tracker from that tracker's CLI or MCP. What actually fits is whatever the +search and the project context turn up. + +## Step 2: Check the local configuration + +Before proposing anything, inspect what is already available in the local and session +configuration — connected MCP servers, installed plugins, CLIs on the path. The point is +twofold: don't propose a tool the project already has, and surface useful tools already +present that the harness isn't using yet. Fold both into the candidate list. + +## Step 3: Get explicit acceptance + +Present the candidates to the user, each with its role, what it does, and its trade-off. The +user **accepts or rejects each one individually**. Adopt only the accepted ones. Do not +batch-accept, and do not adopt a tool on the user's behalf because it "seems useful" — an +external tool is a dependency and a trust decision, and that decision is the user's. + +For an accepted tool that needs configuration (an MCP server, a plugin), set it up only as +far as the user authorizes, and note any credentials or installation the user must complete +themselves. + +## Step 4: Register accepted tools + +Record accepted tools **by role** in the registry under the orchestrator: +`.claude/skills/{domain}-orchestrator/references/tools.md`. One registry per harness; the +orchestrator owns it. The schema: + +```markdown +# Tools registry + +Agents and skills reference a tool by its **role**, never by a hard tool name. When the +preferred tool is unavailable, fall back to the alternative. Reviewed on the dates below. + +| Role | Preferred tool | Alternative (if unavailable) | Status | Last reviewed | +|------|----------------|------------------------------|--------|---------------| +| knowledge-base | {accepted MCP} | built-in file search | active | {YYYY-MM-DD} | +| language-server | {accepted tool} | manual code reading | active | {YYYY-MM-DD} | +| issue-tracker | {accepted CLI} | none — note the gap | active | {YYYY-MM-DD} | +``` + +Every role needs an **alternative**, even if the alternative is "none, and here is what the +agent does without it." The alternative is what keeps the harness working when a tool is +absent or unreachable — the same graceful-degradation idea the marketplace applies to its own +optional connectors. + +## Step 5: Reference tools by role + +In an agent definition or a skill, name the **role**, not the tool. For example: "retrieve +context using the `knowledge-base` tool from the orchestrator's tools registry; if it is +unavailable, use the registered alternative." This keeps the concrete tool swappable: change +the registry row and every agent and skill follows, with no edits to their files. + +## Step 6: Periodic review + +Registered tools are not permanent. A tool can fall out of use, stop being maintained, or be +overtaken by a better option. Re-evaluate the registry periodically — whether each tool is +still earning its place, and whether a better alternative now exists — and update the `Last +reviewed` date. Assessing tool usage is a read activity that belongs to `harness-review`; +acting on it (swapping or retiring a tool) is a write that comes back here. The cadence and +the triggers are in `references/maintenance.md`.