Skip to content

Add soft-404 detection check #5

@BraedenBDev

Description

@BraedenBDev

Context

Real-world SEO bug pattern: a page returns HTTP 200 but the body is actually a "not found" fallback (e.g., <h1>Project Not Found</h1>) and the canonical points to the homepage instead of the requested URL. This is classic soft-404 behavior and it's terrible for crawl budget, internal link equity, and indexing.

Generic crawlers and uptime checks miss this because they only look at the status code. Search engines usually detect it, but inconsistently, and it causes ranking issues either way.

A recent independent audit of almostimpossible.agency (by a Codex-run crawler used as a comparison benchmark for crawl-sim) caught exactly this bug on several old case-study URLs — they returned 200 OK with "Project Not Found" in the body and canonicalized to /. After the owner fixed it, the same crawler confirmed the redirects were now 308 → correct slug.

Proposal

Add soft-404 detection to the scoring pipeline. Heuristics:

  1. Canonical vs requested URL mismatch — if the requested path is /work/foo but the canonical is /, flag it.
  2. Known fallback stringsProject Not Found, Page Not Found, 404, Not Found, case-insensitive. Configurable list in a new scripts/_soft-404-patterns.txt.
  3. H1 vs title mismatch — if H1 is generic ("Not found") while title matches the URL slug, flag it.
  4. Thin body + generic contentwordCount < 200 AND body contains one of the fallback strings = high-confidence soft-404.

Output field on the fetch result:

{
  "status": 200,
  "softFourOhFour": {
    "detected": true,
    "confidence": "high",
    "reasons": ["canonical_mismatch", "generic_h1"]
  }
}

Subtract from the Accessibility category score when detected.

Where to implement

Option A: extend scripts/fetch-as-bot.sh to run the detection inline (adds latency per bot).
Option B: new scripts/check-soft-404.sh that takes a body file + requested URL and outputs a small JSON object (cleaner separation, called once per bot by the orchestrator).

Recommend B for testability and reusability.

Acceptance criteria

  • scripts/check-soft-404.sh <url> <body-file> exists, outputs JSON to stdout
  • Detects canonical-mismatch case (test with a known bad URL)
  • Detects thin-body + fallback-string case
  • Reports confidence: high | medium | low based on how many heuristics match
  • compute-score.sh reads the result and subtracts 20+ points from Accessibility on high-confidence detection
  • SKILL.md orchestration calls the new script after extract-meta.sh
  • Fallback-string list is configurable via a plain text file, not hardcoded

Prior art

The Codex audit session that inspired this check is documented in the crawl-sim notes — the script ran curl -I -L redirect chains plus rg -n 'Project Not Found|rel="canonical"|<h1' against suspect URLs and correlated the signals. Same approach fits a bash implementation cleanly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions