Context
Real-world SEO bug pattern: a page returns HTTP 200 but the body is actually a "not found" fallback (e.g., <h1>Project Not Found</h1>) and the canonical points to the homepage instead of the requested URL. This is classic soft-404 behavior and it's terrible for crawl budget, internal link equity, and indexing.
Generic crawlers and uptime checks miss this because they only look at the status code. Search engines usually detect it, but inconsistently, and it causes ranking issues either way.
A recent independent audit of almostimpossible.agency (by a Codex-run crawler used as a comparison benchmark for crawl-sim) caught exactly this bug on several old case-study URLs — they returned 200 OK with "Project Not Found" in the body and canonicalized to /. After the owner fixed it, the same crawler confirmed the redirects were now 308 → correct slug.
Proposal
Add soft-404 detection to the scoring pipeline. Heuristics:
- Canonical vs requested URL mismatch — if the requested path is
/work/foo but the canonical is /, flag it.
- Known fallback strings —
Project Not Found, Page Not Found, 404, Not Found, case-insensitive. Configurable list in a new scripts/_soft-404-patterns.txt.
- H1 vs title mismatch — if H1 is generic ("Not found") while title matches the URL slug, flag it.
- Thin body + generic content —
wordCount < 200 AND body contains one of the fallback strings = high-confidence soft-404.
Output field on the fetch result:
{
"status": 200,
"softFourOhFour": {
"detected": true,
"confidence": "high",
"reasons": ["canonical_mismatch", "generic_h1"]
}
}
Subtract from the Accessibility category score when detected.
Where to implement
Option A: extend scripts/fetch-as-bot.sh to run the detection inline (adds latency per bot).
Option B: new scripts/check-soft-404.sh that takes a body file + requested URL and outputs a small JSON object (cleaner separation, called once per bot by the orchestrator).
Recommend B for testability and reusability.
Acceptance criteria
Prior art
The Codex audit session that inspired this check is documented in the crawl-sim notes — the script ran curl -I -L redirect chains plus rg -n 'Project Not Found|rel="canonical"|<h1' against suspect URLs and correlated the signals. Same approach fits a bash implementation cleanly.
Context
Real-world SEO bug pattern: a page returns HTTP 200 but the body is actually a "not found" fallback (e.g.,
<h1>Project Not Found</h1>) and the canonical points to the homepage instead of the requested URL. This is classic soft-404 behavior and it's terrible for crawl budget, internal link equity, and indexing.Generic crawlers and uptime checks miss this because they only look at the status code. Search engines usually detect it, but inconsistently, and it causes ranking issues either way.
A recent independent audit of almostimpossible.agency (by a Codex-run crawler used as a comparison benchmark for crawl-sim) caught exactly this bug on several old case-study URLs — they returned 200 OK with "Project Not Found" in the body and canonicalized to
/. After the owner fixed it, the same crawler confirmed the redirects were now 308 → correct slug.Proposal
Add soft-404 detection to the scoring pipeline. Heuristics:
/work/foobut the canonical is/, flag it.Project Not Found,Page Not Found,404,Not Found, case-insensitive. Configurable list in a newscripts/_soft-404-patterns.txt.wordCount < 200AND body contains one of the fallback strings = high-confidence soft-404.Output field on the fetch result:
{ "status": 200, "softFourOhFour": { "detected": true, "confidence": "high", "reasons": ["canonical_mismatch", "generic_h1"] } }Subtract from the Accessibility category score when detected.
Where to implement
Option A: extend
scripts/fetch-as-bot.shto run the detection inline (adds latency per bot).Option B: new
scripts/check-soft-404.shthat takes a body file + requested URL and outputs a small JSON object (cleaner separation, called once per bot by the orchestrator).Recommend B for testability and reusability.
Acceptance criteria
scripts/check-soft-404.sh <url> <body-file>exists, outputs JSON to stdoutconfidence: high | medium | lowbased on how many heuristics matchcompute-score.shreads the result and subtracts 20+ points from Accessibility on high-confidence detectionPrior art
The Codex audit session that inspired this check is documented in the crawl-sim notes — the script ran
curl -I -Lredirect chains plusrg -n 'Project Not Found|rel="canonical"|<h1'against suspect URLs and correlated the signals. Same approach fits a bash implementation cleanly.