Multi-page mode: sitemap vs crawl-graph diff

## Context

crawl-sim v1 is explicitly a **single-page audit**. The design spec lists multi-page crawling as a post-v1 item. This issue scopes what multi-page mode should look like when we get there.

A proper multi-page audit should surface **two kinds of discovery gaps**:

1. **Orphaned URLs** — listed in `sitemap.xml` but **not reachable** from any internal link starting at the homepage. These are URLs the sitemap claims exist but crawlers can't actually find by following links. Indexing risk.

2. **Missing-from-sitemap URLs** — discovered by BFS crawling but **not listed** in `sitemap.xml`. These are real pages that won't be submitted for indexing via the sitemap. Discovery risk.

A Codex-run audit of [almostimpossible.agency](https://www.almostimpossible.agency/) used exactly this diff to find internal link hygiene issues (apex-host links, shortened work slugs, Cloudflare email-protection URLs getting scooped up by BFS but not in sitemap). The output was:

```
CRAWLED 30
SITEMAP_URLS 23
EXTRA_COUNT 7   (crawled but not in sitemap)
MISSING_COUNT 0 (sitemap but not crawled)
```

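That format works well: two small lists plus counts. Note the naming: `EXTRA` in that output corresponds to "missing from sitemap" above, and `MISSING` corresponds to "orphaned".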
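As a rough illustration of the two-set diff, here is a minimal sketch that reproduces the count format above. It assumes both URL sets are already available as plain text files with one URL per line; `diff_urls` and the file layout are hypothetical, and a real implementation would more likely derive the lists from `sitemap.xml` and `crawl-graph.json`. It uses `sort` and `comm` from coreutils rather than jq purely to keep the sketch short.

```shell
#!/usr/bin/env bash
# Sketch: two-set diff between sitemap URLs and crawled URLs.
# Assumption: each input file holds one URL per line.

diff_urls() {
  local sitemap_file=$1 crawled_file=$2
  local sitemap_sorted crawled_sorted
  sitemap_sorted=$(mktemp)
  crawled_sorted=$(mktemp)

  # comm needs sorted input; -u also drops duplicate URLs.
  LC_ALL=C sort -u "$sitemap_file" > "$sitemap_sorted"
  LC_ALL=C sort -u "$crawled_file" > "$crawled_sorted"

  echo "CRAWLED $(wc -l < "$crawled_sorted" | tr -d ' ')"
  echo "SITEMAP_URLS $(wc -l < "$sitemap_sorted" | tr -d ' ')"

  # comm -13 keeps lines unique to the second file: crawled but not in sitemap.
  echo "EXTRA_COUNT $(LC_ALL=C comm -13 "$sitemap_sorted" "$crawled_sorted" | wc -l | tr -d ' ')"
  # comm -23 keeps lines unique to the first file: in sitemap but not crawled.
  echo "MISSING_COUNT $(LC_ALL=C comm -23 "$sitemap_sorted" "$crawled_sorted" | wc -l | tr -d ' ')"

  # The two small lists themselves, one prefixed URL per line.
  LC_ALL=C comm -13 "$sitemap_sorted" "$crawled_sorted" | sed 's/^/EXTRA /'
  LC_ALL=C comm -23 "$sitemap_sorted" "$crawled_sorted" | sed 's/^/MISSING /'

  rm -f "$sitemap_sorted" "$crawled_sorted"
}
```

Calling `diff_urls sitemap-urls.txt crawled-urls.txt` (both file names are placeholders) prints the four counts followed by the two lists. URL normalization (trailing slashes, www/apex) is deliberately left out here and would need to happen before the comparison.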

## Proposal

New mode flag: `/crawl-sim <url> --crawl` triggers the multi-page audit.

### Architecture

- New `scripts/crawl-site.sh <base-url> [--max-depth N] [--max-pages N]` — BFS crawler, outputs `crawl-graph.json` with:
  - `visited: { <url>: { depth, status, final_url, title, canonical, wordCount, internal_links: [...] } }`
  - `failed: [<url>]`
  - `stats: { crawled, max_depth_reached, unique_hosts }`
- Extend `check-sitemap.sh` to accept `--diff-against <crawl-graph.json>` and output the two-set diff
- `compute-score.sh` reads both and adds new findings for orphaned/missing URLs
- SKILL.md `--crawl` mode orchestrates this after the per-bot single-page checks
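To make the BFS part of `crawl-site.sh` concrete, here is a minimal sketch of the traversal with depth and page limits. Everything in it is an assumption for illustration: `bfs_crawl` is a hypothetical name, and `get_links` is a stub over a tiny hard-coded site standing in for "fetch the page and extract its internal links". The sketch prints `<depth><TAB><url>` lines instead of assembling `crawl-graph.json`, and it omits same-host filtering and rate limiting.

```shell
#!/usr/bin/env bash
# Stub standing in for fetching a page and extracting its internal links.
# Hypothetical data; a real crawler would call the link extractor here.
get_links() {
  case $1 in
    https://example.com/)  printf '%s\n' https://example.com/a https://example.com/b ;;
    https://example.com/a) printf '%s\n' https://example.com/b https://example.com/c ;;
    https://example.com/b) printf '%s\n' https://example.com/ ;;
    https://example.com/c) printf '%s\n' https://example.com/d ;;
  esac
}

# Usage: bfs_crawl <start-url> <max-depth> <max-pages>
# Prints "<depth><TAB><url>" for each visited page, in BFS order.
bfs_crawl() {
  local start=$1 max_depth=$2 max_pages=$3
  local queue seen line depth url link count=0
  local TAB NL
  TAB=$(printf '\t')
  NL='
'
  queue="0$TAB$start"   # FIFO queue, one "depth<TAB>url" entry per line
  seen=$start           # every URL ever queued, one per line

  while [ -n "$queue" ] && [ "$count" -lt "$max_pages" ]; do
    # Pop the first entry off the queue.
    line=${queue%%"$NL"*}
    case $queue in
      *"$NL"*) queue=${queue#*"$NL"} ;;
      *)       queue= ;;
    esac
    depth=${line%%"$TAB"*}
    url=${line#*"$TAB"}

    printf '%s\t%s\n' "$depth" "$url"
    count=$((count + 1))
    # A real crawler would sleep here to honor the rate limit.

    # Do not expand links of pages that sit at the depth limit.
    if [ "$depth" -ge "$max_depth" ]; then continue; fi

    while IFS= read -r link; do
      [ -n "$link" ] || continue
      # Skip URLs that were already queued or visited.
      if printf '%s\n' "$seen" | grep -Fqx -- "$link"; then continue; fi
      seen="$seen$NL$link"
      queue="${queue:+$queue$NL}$((depth + 1))$TAB$link"
    done <<EOF
$(get_links "$url")
EOF
  done
}
```

With the stub above, `bfs_crawl https://example.com/ 2 50` visits the homepage, `/a`, `/b`, and `/c`, and stops before `/d` because `/c` already sits at the depth limit. The string-based queue keeps the sketch free of bash 4 associative arrays; it is quadratic in the number of URLs, which seems acceptable for a 50-page default cap.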

### Scope guardrails

- Default max-depth: 3
- Default max-pages: 50
- Rate limit: 1 request per second unless `--fast`
- Same-host only (with www/apex handled as same host)
- Crawling uses a generic UA to build the graph. The per-bot simulation is not re-run on every crawled page: it applies to the primary URL, and the per-bot comparison then runs only on selected pages from the crawl (e.g., top 10 by inbound link count)
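One way the "www/apex handled as same host" rule could work is to normalize both hosts before comparing them. The sketch below is an assumption, not existing crawl-sim code: `normalize_host` and `same_site` are hypothetical helper names, and the parsing is plain parameter expansion that does not handle IPv6 literals.

```shell
#!/usr/bin/env bash
# Reduce a URL to a comparable host: lowercase, no port, no leading "www.".
normalize_host() {
  local url=$1 host
  host=${url#*://}        # drop the scheme
  host=${host%%/*}        # drop the path
  host=${host%%"?"*}      # drop a query string that follows the host directly
  host=${host%%"#"*}      # drop a fragment that follows the host directly
  host=${host##*@}        # drop userinfo
  host=${host%%:*}        # drop the port
  host=$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "${host#www.}"
}

# Succeeds when both URLs belong to the same site under the www/apex rule.
same_site() {
  [ "$(normalize_host "$1")" = "$(normalize_host "$2")" ]
}
```

Under this rule `same_site https://www.example.com/a https://example.com/b` succeeds, while other subdomains such as `cdn.example.com` are treated as a different host and would be skipped by the crawler.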

## Acceptance criteria

- [ ] `scripts/crawl-site.sh` exists with BFS, depth/page limits, same-host enforcement
- [ ] `check-sitemap.sh` accepts a `--diff-against` flag and returns orphaned/missing arrays
- [ ] New SKILL.md section documenting `--crawl` flag
- [ ] New scoring findings: "Orphaned sitemap URLs" (high severity) and "Missing from sitemap" (medium severity)
- [ ] Integration test verifies the diff on a known site with at least one orphan and one missing URL
- [ ] Still bash + jq only — no new dependencies

## Out of scope

- robots.txt compliance enforcement during BFS (nice to have but not required for v2)
- Full-site per-bot re-fetching (stays on selected representative pages)
- Depth-weighted scoring

## Prior art

See the Codex-run audit notes — the script `/tmp/crawler_audit.py` was the reference implementation. We can port its BFS + diff logic to bash without the HTMLParser dependency by reusing the existing `extract-links.sh` for per-page link extraction.
