Context
crawl-sim v1 is explicitly a single-page audit. The design spec lists multi-page crawling as a post-v1 item. This issue scopes what multi-page should look like when we get there.
A proper multi-page audit should surface two kinds of discovery gaps:
- Orphaned URLs: listed in sitemap.xml but not reachable from any internal link starting at the homepage. These are URLs the sitemap claims exist but crawlers can't actually find by following links. Indexing risk.
- Missing-from-sitemap URLs: discovered by BFS crawling but not listed in sitemap.xml. These are real pages that won't be submitted for indexing via the sitemap. Discovery risk.
A Codex-run audit of almostimpossible.agency used exactly this diff to find internal link hygiene issues (apex-host links, shortened work slugs, Cloudflare email-protection URLs getting scooped up by BFS but not in sitemap). The output was:
CRAWLED 30
SITEMAP_URLS 23
EXTRA_COUNT 7 (crawled but not in sitemap)
MISSING_COUNT 0 (sitemap but not crawled)
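That format works well: two small lists plus counts. In that output, EXTRA corresponds to missing-from-sitemap URLs and MISSING corresponds to orphaned URLs.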
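The two-set diff behind that report is small enough to sketch. The Python below is only an illustration, assuming both URL lists are already normalized the same way; the function names and the per-URL EXTRA/MISSING lines are hypothetical and not taken from the reference script.

```python
def sitemap_diff(crawled, sitemap):
    """Return (extra, missing) as sorted lists.

    extra   = crawled but not in sitemap (missing-from-sitemap URLs)
    missing = in sitemap but not crawled (orphaned URLs)
    """
    crawled, sitemap = set(crawled), set(sitemap)
    return sorted(crawled - sitemap), sorted(sitemap - crawled)


def format_report(crawled, sitemap):
    """Render counts in the style shown above, followed by the two lists."""
    extra, missing = sitemap_diff(crawled, sitemap)
    lines = [
        f"CRAWLED {len(set(crawled))}",
        f"SITEMAP_URLS {len(set(sitemap))}",
        f"EXTRA_COUNT {len(extra)} (crawled but not in sitemap)",
        f"MISSING_COUNT {len(missing)} (sitemap but not crawled)",
    ]
    # Hypothetical list format: one labelled URL per line.
    lines += [f"EXTRA {url}" for url in extra]
    lines += [f"MISSING {url}" for url in missing]
    return "\n".join(lines)
```

Plain set difference keeps the two lists independent of crawl order, which should make the report stable between runs.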
Proposal
New subcommand mode: /crawl-sim <url> --crawl triggers a multi-page audit.
Architecture
- New scripts/crawl-site.sh <base-url> [--max-depth N] [--max-pages N]: BFS crawler, outputs crawl-graph.json with:
  - visited: { <url>: { depth, status, final_url, title, canonical, wordCount, internal_links: [...] } }
  - failed: [<url>]
  - stats: { crawled, max_depth_reached, unique_hosts }
- Extend check-sitemap.sh to accept --diff-against <crawl-graph.json> and output the two-set diff
- compute-score.sh reads both and adds new findings for orphaned/missing URLs
- SKILL.md: --crawl mode orchestrates this after the per-bot single-page checks
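As a rough illustration of the crawl-graph.json shape, here is a minimal BFS sketch in Python. It is not the proposed bash script: fetch_page is a hypothetical callable standing in for the fetch plus link-extraction step, and the meanings given to max_depth_reached (deepest visited page) and unique_hosts (distinct hosts among visited URLs) are assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_site(base_url, fetch_page, max_depth=3, max_pages=50):
    """Breadth-first crawl starting at base_url.

    fetch_page(url) is a hypothetical callable returning a dict with
    status, final_url, title, canonical, wordCount and internal_links,
    or raising an exception when the fetch fails.
    """
    visited, failed = {}, []
    queue = deque([(base_url, 0)])
    seen = {base_url}  # URLs already queued, so each is fetched at most once
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        try:
            page = fetch_page(url)
        except Exception:
            failed.append(url)
            continue
        visited[url] = {"depth": depth, **page}
        if depth >= max_depth:
            continue  # record the page but do not follow its links
        for link in page["internal_links"]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    stats = {
        "crawled": len(visited),
        "max_depth_reached": max((p["depth"] for p in visited.values()), default=0),
        "unique_hosts": len({urlsplit(u).netloc for u in visited}),
    }
    return {"visited": visited, "failed": failed, "stats": stats}


# Usage with an in-memory stand-in for a site (hypothetical URLs).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b",
                             "https://example.com/gone"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fake_fetch(url):
    links = SITE[url]  # KeyError simulates a failed fetch
    return {"status": 200, "final_url": url, "title": "", "canonical": url,
            "wordCount": 0, "internal_links": links}


graph = crawl_site("https://example.com/", fake_fetch, max_depth=1)
```

Injecting the fetch step keeps the traversal logic testable without network access; a bash port could play the same role with the existing extract-links.sh.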
Scope guardrails
- Default max-depth: 3
- Default max-pages: 50
- Rate limit: 1 request per second unless --fast is passed
- Same-host only (with www/apex handled as same host)
- Only apply the bot simulation to the single primary URL; crawling uses a generic UA to build the graph, then the per-bot comparison runs on selected pages from the crawl (e.g., top 10 by inbound link count)
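Two of these guardrails, same-host matching and picking representative pages, can be sketched as follows. This is an assumption-laden Python illustration: treating www/apex as equal by stripping a leading "www." is one possible rule, inbound links are counted once per linking page, and all function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_key(url):
    """Lowercased hostname with a leading 'www.' removed, so www and apex match."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_site(url, base_url):
    """True when url is on the same host as base_url (www/apex treated as equal)."""
    return host_key(url) == host_key(base_url)


def top_pages_by_inbound(visited, limit=10):
    """Rank crawled URLs by how many distinct other crawled pages link to them."""
    inbound = Counter()
    for source, page in visited.items():
        for target in set(page["internal_links"]):
            if target != source and target in visited:
                inbound[target] += 1
    # Ties are broken alphabetically so the selection is deterministic.
    ranked = sorted(visited, key=lambda url: (-inbound[url], url))
    return ranked[:limit]
```

Other subdomains (for example a blog host) are treated as different sites under this rule, which matches the same-host-only guardrail.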
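Acceptance criteria
- scripts/crawl-site.sh exists with BFS, depth/page limits, same-host enforcement
- check-sitemap.sh with --diff-against flag returning orphaned/missing arrays
- --crawl flag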
Out of scope
- robots.txt compliance enforcement during BFS (nice to have but not required for v2)
- Full-site per-bot re-fetching (stays on selected representative pages)
- Depth-weighted scoring
Prior art
See the Codex-run audit notes — the script /tmp/crawler_audit.py was the reference implementation. We can port its BFS + diff logic to bash without the HTMLParser dependency by reusing the existing extract-links.sh for per-page link extraction.