
Add content-type to robots.txt custom metric#193

Merged
tunetheweb merged 1 commit into main from add-content-type-to-robots.txt on Feb 24, 2026

Conversation

@tunetheweb
Member

Some recent analysis of robots.txt files, following #191, shows a lot of rubbish in them.

Some can be filtered out by checking for a 200 status code (AND INT64(custom_metrics.robots_txt.status) = 200), but some sites return an HTML document with a 200 status code :-(

Let's add the content-type header so we can remove HTML pages if we want, as they are likely not robots.txt files.

FYI @garyillyes
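For illustration, here is a minimal Python sketch of the kind of query-side filter this change enables. The field names match the metric output shown in the test results below; restricting to text/plain is one possible policy a query author might choose, not part of the metric itself:

```python
def looks_like_robots_txt(metric):
    """Illustrative filter over the _robots_txt custom metric output.

    Keeps responses that returned a 200 status and a plain-text
    content type; the exact policy is up to whoever writes the query.
    """
    if metric.get("status") != 200:
        return False
    # content_type may carry parameters, e.g. "text/plain; charset=utf-8"
    content_type = (metric.get("content_type") or "").split(";")[0].strip().lower()
    return content_type == "text/plain"

# Records matching the test websites below
print(looks_like_robots_txt({"status": 200, "content_type": "text/plain; charset=utf-8"}))  # True
print(looks_like_robots_txt({"status": 200, "content_type": "text/html; charset=UTF-8"}))   # False
```

The same logic could be written directly in a BigQuery WHERE clause against the custom metrics column.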


Test websites:

@github-actions

https://almanac.httparchive.org/en/2022/

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/plain; charset=utf-8",
    "size": 76,
    "size_kib": 0.07421875,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "user_agent": 1,
        "allow": 1,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "allow": 1
        }
      }
    }
  }
}
https://example.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 404,
    "content_type": "text/html",
    "size": 528,
    "size_kib": 0.515625,
    "over_google_limit": false,
    "comment_count": 0,
    "record_counts": {
      "by_type": {
        "other": 1
      },
      "by_useragent": {}
    }
  }
}
https://recruitment.uniuyo.edu.ng

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/html; charset=UTF-8",
    "size": 11539,
    "size_kib": 11.2685546875,
    "over_google_limit": false,
    "comment_count": 14,
    "record_counts": {
      "by_type": {
        "other": 302,
        "background": 13,
        "background_size": 2,
        "background_position": 1,
        "color": 4,
        "padding": 6,
        "position": 5,
        "overflow": 1,
        "content": 2,
        "top": 4,
        "left": 3,
        "right": 1,
        "bottom": 2,
        "border": 5,
        "border_radius": 8,
        "display": 3,
        "margin_bottom": 2,
        "backdrop_filter": 1,
        "transition": 2,
        "box_shadow": 3,
        "height": 4,
        "border_color": 2,
        "transform": 1,
        "border_left": 2,
        "padding_left": 1,
        "width": 3,
        "align_items": 2,
        "justify_content": 2,
        "font_size": 3,
        "text_align": 3,
        "font_weight": 2,
        "margin": 2,
        "animation": 1,
        "_webkit_background_clip": 1,
        "_webkit_text_fill_color": 1,
        "behavior": 1,
        "threshold": 1,
        "rootmargin": 1
      },
      "by_useragent": {}
    }
  }
}
https://www.google.com

WPT result details

Changed custom metrics values:

{
  "_robots_txt": {
    "redirected": false,
    "status": 200,
    "content_type": "text/plain",
    "size": 6502,
    "size_kib": 6.349609375,
    "over_google_limit": false,
    "comment_count": 7,
    "record_counts": {
      "by_type": {
        "user_agent": 6,
        "disallow": 176,
        "allow": 64,
        "sitemap": 1
      },
      "by_useragent": {
        "*": {
          "disallow": 167,
          "allow": 61
        },
        "yandex": {
          "disallow": 169,
          "allow": 61
        },
        "adsbot-google": {
          "disallow": 4,
          "allow": 1
        },
        "facebookexternalhit": {
          "allow": 2,
          "disallow": 3
        },
        "twitterbot": {
          "allow": 2,
          "disallow": 3
        }
      }
    }
  }
}

@tunetheweb tunetheweb requested a review from pmeenan February 23, 2026 12:19
@pmeenan
Member

pmeenan commented Feb 23, 2026

Instead of relying on the content-type, would it be cleaner to just check the first non-whitespace character and make sure it's not <?

AFAIK, anything HTML-like should require that for either `<html>` or `<!DOCTYPE html>`, and it shouldn't be at the start of a robots.txt file.
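The suggested check is tiny; a sketch in Python (the function name is made up for illustration):

```python
def probably_html(body: str) -> bool:
    """True if the first non-whitespace character is '<'.

    Both <!DOCTYPE html> and <html> begin with '<', which should
    never legitimately start a robots.txt file.
    """
    return body.lstrip().startswith("<")

print(probably_html("  \n<!DOCTYPE html><html></html>"))  # True
print(probably_html("User-agent: *\nDisallow: /"))        # False
```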

@tunetheweb
Member Author

Could do. I was trying to avoid getting too heuristic-y, and instead just give non-opinionated information and let the person writing the query choose whether to use this info or not. See also an internal email I've just sent you.

Another option would be to look for at least one User-Agent line, or to exclude files with too many unknown values. But I think once we go down the heuristics path, it gets very messy, very quickly.

Like, what to make of this one: https://plumaagroavicola.com.br/robots.txt, which neither has an HTML content type nor starts with a <, but is a hybrid of a robots.txt file and an HTML file?!!?

So I think, all in all, just exposing what we do know (the status code, which is already exposed, and now the content type) is open to less interpretation by the crawler and leaves the decision to the query writer.
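For context, the heuristics mentioned above could be sketched like this against the record_counts structure the metric already emits. This is illustrative only; the threshold is made up, and none of this logic is in the metric:

```python
def heuristic_verdict(record_counts, max_unknown_ratio=0.5):
    """Illustrative only: classify a parsed robots.txt using the
    by_type counts the metric emits. The threshold is arbitrary.
    """
    by_type = record_counts.get("by_type", {})
    total = sum(by_type.values())
    if total == 0:
        return "empty"
    # "at least one User-Agent line" heuristic
    if by_type.get("user_agent", 0) == 0:
        return "no-user-agent"
    # "too many unknown values" heuristic
    known = sum(by_type.get(k, 0) for k in ("user_agent", "allow", "disallow", "sitemap"))
    unknown_ratio = 1 - known / total
    return "suspect" if unknown_ratio > max_unknown_ratio else "plausible"

# The google.com record above passes; the example.com 404 page does not
print(heuristic_verdict({"by_type": {"user_agent": 6, "disallow": 176, "allow": 64, "sitemap": 1}}))  # plausible
print(heuristic_verdict({"by_type": {"other": 1}}))  # no-user-agent
```

The hybrid file in the URL above would likely land in "suspect", which shows how quickly these rules become judgment calls.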

@pmeenan
Member

pmeenan commented Feb 24, 2026

SGTM. FWIW, I'd report that hybrid one as a robots.txt under the assumption that a crawler would ignore the invalid chunks, though maybe some would throw the whole thing away.

In either case, sending the whole thing back makes sense for research purposes since they're likely sending those busted robots files to the crawlers too.

@tunetheweb tunetheweb merged commit 1e9b5f6 into main Feb 24, 2026
4 checks passed
@tunetheweb tunetheweb deleted the add-content-type-to-robots.txt branch February 24, 2026 15:23