
Add content-type to robots.txt custom metric#193

Merged
tunetheweb merged 1 commit into main from add-content-type-to-robots.txt on Feb 24, 2026

Conversation

@tunetheweb
Member

Some recent analysis of robots.txt files, following #191, shows a lot of rubbish in them.

Some can be filtered out by checking for a 200 status code (AND INT64(custom_metrics.robots_txt.status) = 200), but some sites return an HTML document with a 200 status code :-(

Let's add the content-type header so we can remove HTML pages if we want, as they are likely not robots.txt files.

FYI @garyillyes
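For illustration, here is a minimal Python sketch of the kind of query-side filter this change enables. The field names match the metric output shown in the test results below; restricting to text/plain is one possible policy a query author might choose, not part of the metric itself:

```python
def looks_like_robots_txt(metric):
    """Illustrative filter over the _robots_txt custom metric output.

    Keeps responses that returned a 200 status and a plain-text
    content type; the exact policy is up to whoever writes the query.
    """
    if metric.get("status") != 200:
        return False
    # content_type may carry parameters, e.g. "text/plain; charset=utf-8"
    content_type = (metric.get("content_type") or "").split(";")[0].strip().lower()
    return content_type == "text/plain"

# Records matching the test websites below
print(looks_like_robots_txt({"status": 200, "content_type": "text/plain; charset=utf-8"}))  # True
print(looks_like_robots_txt({"status": 200, "content_type": "text/html; charset=UTF-8"}))   # False
```

The same logic could be written directly in a BigQuery WHERE clause against the custom metrics column.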


Test websites:

@github-actions

https://almanac.httparchive.org/en/2022/

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/plain; charset=utf-8",
    "size": 76,
    "size_kib": 0.07421875,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "user_agent": 1,
        "allow": 1,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "allow": 1
        }
      }
    }
  }
}
https://example.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 404,
    "content_type": "text/html",
    "size": 528,
    "size_kib": 0.515625,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "other": 1
      },
      "by_useragent": {}
    }
  }
}
https://recruitment.uniuyo.edu.ng

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/html; charset=UTF-8",
    "size": 11539,
    "size_kib": 11.2685546875,
    "over_google_limit": false,
    "comment_count": 14,
    "record_counts": {
      "by_type": {
        "other": 302,
        "background": 13,
        "background_size": 2,
        "background_position": 1,
        "color": 4,
        "padding": 6,
        "position": 5,
        "overflow": 1,
        "content": 2,
        "top": 4,
        "left": 3,
        "right": 1,
        "bottom": 2,
        "border": 5,
        "border_radius": 8,
        "display": 3,
        "margin_bottom": 2,
        "backdrop_filter": 1,
        "transition": 2,
        "box_shadow": 3,
        "height": 4,
        "border_color": 2,
        "transform": 1,
        "border_left": 2,
        "padding_left": 1,
        "width": 3,
        "align_items": 2,
        "justify_content": 2,
        "font_size": 3,
        "text_align": 3,
        "font_weight": 2,
        "margin": 2,
        "animation": 1,
        "_webkit_background_clip": 1,
        "_webkit_text_fill_color": 1,
        "behavior": 1,
        "threshold": 1,
        "rootmargin": 1
      },
      "by_useragent": {}
    }
  }
}
https://www.google.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/plain",
    "size": 6502,
    "size_kib": 6.349609375,
    "over_google_limit": false,
    "comment_count": 7,
    "record_counts": {
      "by_type": {
        "user_agent": 6,
        "disallow": 176,
        "allow": 64,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "disallow": 167,
          "allow": 61
        },
        "yandex": {
          "disallow": 169,
          "allow": 61
        },
        "adsbot-google": {
          "disallow": 4,
          "allow": 1
        },
        "facebookexternalhit": {
          "allow": 2,
          "disallow": 3
        },
        "twitterbot": {
          "allow": 2,
          "disallow": 3
        }
      }
    }
  }
}

@tunetheweb tunetheweb requested a review from pmeenan February 23, 2026 12:19
@pmeenan
Member

pmeenan commented Feb 23, 2026

Instead of relying on the content-type, would it be cleaner to just check the first non-whitespace character and make sure it's not <?

AFAIK, anything HTML-like should require that for either `<html>` or `<!DOCTYPE html>`, and it shouldn't be at the start of a robots.txt file.
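The suggested check is tiny; a sketch in Python (the function name is made up for illustration):

```python
def probably_html(body: str) -> bool:
    """True if the first non-whitespace character is '<'.

    Both <!DOCTYPE html> and <html> begin with '<', which should
    never legitimately start a robots.txt file.
    """
    return body.lstrip().startswith("<")

print(probably_html("  \n<!DOCTYPE html><html></html>"))  # True
print(probably_html("User-agent: *\nDisallow: /"))        # False
```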

@tunetheweb
Member Author

Could do. I was trying to avoid getting too heuristic-y, and instead just give non-opinionated information and let the person writing the query choose whether to use this info or not. See also an internal email I've just sent you.

Another option would be to look for at least one User-Agent line, or to exclude files with too many unknown values. But I think once we go down the heuristics path, it gets very messy, very quickly.

Like, what to make of this one: https://plumaagroavicola.com.br/robots.txt, which neither has an HTML content type nor starts with a <, but is a hybrid of a robots.txt file and an HTML file?!!?

So I think, all in all, just exposing what we do know (the status code, which is already exposed, and now the content type) is open to less interpretation by the crawler and leaves the decision to the query writer.
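For context, the heuristics mentioned above could be sketched like this against the record_counts structure the metric already emits. This is illustrative only; the threshold is made up, and none of this logic is in the metric:

```python
def heuristic_verdict(record_counts, max_unknown_ratio=0.5):
    """Illustrative only: classify a parsed robots.txt using the
    by_type counts the metric emits. The threshold is arbitrary.
    """
    by_type = record_counts.get("by_type", {})
    total = sum(by_type.values())
    if total == 0:
        return "empty"
    # "at least one User-Agent line" heuristic
    if by_type.get("user_agent", 0) == 0:
        return "no-user-agent"
    # "too many unknown values" heuristic
    known = sum(by_type.get(k, 0) for k in ("user_agent", "allow", "disallow", "sitemap"))
    unknown_ratio = 1 - known / total
    return "suspect" if unknown_ratio > max_unknown_ratio else "plausible"

# The google.com record above passes; the example.com 404 page does not
print(heuristic_verdict({"by_type": {"user_agent": 6, "disallow": 176, "allow": 64, "sitemap": 1}}))  # plausible
print(heuristic_verdict({"by_type": {"other": 1}}))  # no-user-agent
```

The hybrid file in the URL above would likely land in "suspect", which shows how quickly these rules become judgment calls.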

@pmeenan
Member

pmeenan commented Feb 24, 2026

SGTM. FWIW, I'd report that hybrid one as a robots.txt under the assumption that a crawler would ignore the invalid chunks, though maybe some would throw the whole thing away.

In either case, sending the whole thing back makes sense for research purposes since they're likely sending those busted robots files to the crawlers too.

@tunetheweb tunetheweb merged commit 1e9b5f6 into main Feb 24, 2026
4 checks passed
@tunetheweb tunetheweb deleted the add-content-type-to-robots.txt branch February 24, 2026 15:23