Multi-page mode: sitemap vs crawl-graph diff #6

@BraedenBDev

Context

crawl-sim v1 is explicitly a single-page audit tool. The design spec lists multi-page crawling as a post-v1 item. This issue scopes what multi-page mode should look like when we get there.

A proper multi-page audit should surface two kinds of discovery gaps:

  1. Orphaned URLs — listed in sitemap.xml but not reachable from any internal link starting at the homepage. These are URLs the sitemap claims exist but crawlers can't actually find by following links. Indexing risk.

  2. Missing-from-sitemap URLs — discovered by BFS crawling but not listed in sitemap.xml. These are real pages that won't be submitted for indexing via the sitemap. Discovery risk.

A Codex-run audit of almostimpossible.agency used exactly this diff to find internal link hygiene issues: apex-host links, shortened work slugs, and Cloudflare email-protection URLs that BFS picked up but that are not in the sitemap. The output was:

CRAWLED 30
SITEMAP_URLS 23
EXTRA_COUNT 7   (crawled but not in sitemap)
MISSING_COUNT 0 (sitemap but not crawled)

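That format works well: two small lists plus counts. Note the naming: EXTRA_COUNT corresponds to the missing-from-sitemap set (2) above, and MISSING_COUNT corresponds to the orphaned set (1).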
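As a rough sketch of how that diff could be computed with coreutils alone, the function below takes two plain-text URL lists and prints counts in the format above. The function names, the one-URL-per-line input files, and the `EXTRA `/`MISSING ` line prefixes for the two lists are assumptions for illustration; the issue does not specify them, and URL normalization is assumed to have happened already.

```bash
#!/usr/bin/env bash

# Count non-empty lines on stdin. grep -c exits 1 when the count is 0, so mask it.
count_lines() { grep -c . || true; }

# sitemap_diff <sitemap-urls-file> <crawled-urls-file>   (hypothetical helper)
# Each file holds one already-normalized URL per line.
sitemap_diff() {
  local sitemap=$1 crawled=$2 extra missing
  # comm needs both inputs sorted under the same collation, hence LC_ALL=C.
  # -13 keeps lines only in the second input: crawled but not in sitemap.
  extra=$(LC_ALL=C comm -13 <(LC_ALL=C sort -u "$sitemap") <(LC_ALL=C sort -u "$crawled"))
  # -23 keeps lines only in the first input: in sitemap but not crawled (orphans).
  missing=$(LC_ALL=C comm -23 <(LC_ALL=C sort -u "$sitemap") <(LC_ALL=C sort -u "$crawled"))

  printf 'CRAWLED %s\n' "$(LC_ALL=C sort -u "$crawled" | count_lines)"
  printf 'SITEMAP_URLS %s\n' "$(LC_ALL=C sort -u "$sitemap" | count_lines)"
  printf 'EXTRA_COUNT %s\n' "$(printf '%s\n' "$extra" | count_lines)"
  printf 'MISSING_COUNT %s\n' "$(printf '%s\n' "$missing" | count_lines)"
  if [[ -n $extra ]]; then printf '%s\n' "$extra" | sed 's/^/EXTRA /'; fi
  if [[ -n $missing ]]; then printf '%s\n' "$missing" | sed 's/^/MISSING /'; fi
}

# Usage: sitemap_diff sitemap-urls.txt crawled-urls.txt
```

Sorting with `sort -u` before `comm` also deduplicates each list, so a URL that appears twice in the sitemap is counted once.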

Proposal

New mode: running /crawl-sim <url> --crawl triggers the multi-page audit.

Architecture

  • New scripts/crawl-site.sh <base-url> [--max-depth N] [--max-pages N] — BFS crawler, outputs crawl-graph.json with:
    • visited: { <url>: { depth, status, final_url, title, canonical, wordCount, internal_links: [...] } }
    • failed: [<url>]
    • stats: { crawled, max_depth_reached, unique_hosts }
  • Extend check-sitemap.sh to accept --diff-against <crawl-graph.json> and output the two-set diff
  • compute-score.sh reads both and adds new findings for orphaned/missing URLs
  • SKILL.md --crawl mode orchestrates this after the per-bot single-page checks
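A minimal sketch of the BFS core that scripts/crawl-site.sh might use, showing only the queue, the visited set, and the depth/page limits. It is not the proposed implementation: `get_links` is a hypothetical offline stand-in for extract-links.sh (whose interface is not described in this issue), pages are represented as bare paths, and the `<depth> <url>` output is a simplification of crawl-graph.json. It needs bash 4+ for associative arrays.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for extract-links.sh: prints the internal links of a page.
# A fixed in-memory graph keeps the sketch runnable without network access.
get_links() {
  case "$1" in
    /)       printf '%s\n' /about /work ;;
    /about)  printf '%s\n' / /team ;;
    /work)   printf '%s\n' /work/a ;;
    /work/a) printf '%s\n' /work/a/deep ;;
    *)       ;;
  esac
}

# bfs_crawl <start> <max-depth> <max-pages>
# Prints "<depth> <url>" for each visited page, in BFS order.
bfs_crawl() {
  local start=$1 max_depth=$2 max_pages=$3
  local -A seen=()              # URLs already queued, to avoid revisits
  local -a queue=("0 $start")   # FIFO of "<depth> <url>" entries
  local next=0 count=0 entry depth url link
  seen["$start"]=1
  while (( next < ${#queue[@]} && count < max_pages )); do
    entry=${queue[next]}
    next=$((next + 1))
    depth=${entry%% *}
    url=${entry#* }
    printf '%s %s\n' "$depth" "$url"
    count=$((count + 1))
    # Pages at the depth limit are recorded but not expanded.
    if (( depth >= max_depth )); then continue; fi
    while IFS= read -r link; do
      if [[ -z $link || -n ${seen[$link]:-} ]]; then continue; fi
      seen["$link"]=1
      queue+=("$((depth + 1)) $link")
    done < <(get_links "$url")
  done
}

# Usage: bfs_crawl / 3 50
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps a page that is linked from several places from being queued more than once.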

Scope guardrails

  • Default max-depth: 3
  • Default max-pages: 50
  • Rate limit: 1 request per second unless --fast
  • Same-host only (with www/apex handled as same host)
  • Crawling uses a generic UA to build the graph. Full bot simulation runs on the primary URL; the per-bot comparison then runs only on selected pages from the crawl (e.g., top 10 by inbound link count), not on every crawled page
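The same-host rule could be sketched as a small host-normalization helper. The function names here are hypothetical, and the sketch assumes that treating www and apex as one host means stripping a single leading `www.`. It does not handle IPv6 literals, and it needs bash 4+ for `${var,,}`.

```bash
#!/usr/bin/env bash

# host_of <url>: prints the lowercase host with scheme, userinfo, port, path,
# query, fragment, and one leading "www." removed.
host_of() {
  local host=$1
  host=${host#*://}     # drop scheme, if present
  host=${host%%/*}      # drop path
  host=${host%%\?*}     # drop query when there is no path
  host=${host%%\#*}     # drop fragment when there is no path
  host=${host##*@}      # drop userinfo
  host=${host%%:*}      # drop port
  host=${host,,}        # hostnames are case-insensitive
  printf '%s\n' "${host#www.}"
}

# same_site <url-a> <url-b>: succeeds when both URLs normalize to the same host.
same_site() {
  [[ $(host_of "$1") == "$(host_of "$2")" ]]
}

# Usage: same_site "$base_url" "$link" || continue   # skip off-site links during BFS
```

Under this rule other subdomains such as blog.example.com still count as a different host from example.com.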

Acceptance criteria

  • scripts/crawl-site.sh exists with BFS, depth/page limits, same-host enforcement
  • check-sitemap.sh accepts a --diff-against flag and returns orphaned/missing arrays
  • New SKILL.md section documenting --crawl flag
  • New scoring findings: "Orphaned sitemap URLs" (high severity) and "Missing from sitemap" (medium severity)
  • Integration test verifies the diff on a known site with at least one orphan and one missing URL
  • Still bash + jq only — no new dependencies

Out of scope

  • robots.txt compliance enforcement during BFS (nice to have but not required for v2)
  • Full-site per-bot re-fetching (stays on selected representative pages)
  • Depth-weighted scoring

Prior art

See the Codex-run audit notes — the script /tmp/crawler_audit.py was the reference implementation. We can port its BFS + diff logic to bash without the HTMLParser dependency by reusing the existing extract-links.sh for per-page link extraction.

Labels: enhancement (New feature or request), v2-future (Scheduled for post-v1 release)
