Add content-type to robots.txt custom metric #193
https://almanac.httparchive.org/en/2022/
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"content_type": "text/plain; charset=utf-8",
"size": 76,
"size_kib": 0.07421875,
"over_google_limit": false,
"comment_count": 0,
"record_counts": {
"by_type": {
"user_agent": 1,
"allow": 1,
"sitemap": 1
},
"by_useragent": {
"*": {
"allow": 1
}
}
}
}
}
https://example.com
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 404,
"content_type": "text/html",
"size": 528,
"size_kib": 0.515625,
"over_google_limit": false,
"comment_count": 0,
"record_counts": {
"by_type": {
"other": 1
},
"by_useragent": {}
}
}
}
https://recruitment.uniuyo.edu.ng
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"content_type": "text/html; charset=UTF-8",
"size": 11539,
"size_kib": 11.2685546875,
"over_google_limit": false,
"comment_count": 14,
"record_counts": {
"by_type": {
"other": 302,
"background": 13,
"background_size": 2,
"background_position": 1,
"color": 4,
"padding": 6,
"position": 5,
"overflow": 1,
"content": 2,
"top": 4,
"left": 3,
"right": 1,
"bottom": 2,
"border": 5,
"border_radius": 8,
"display": 3,
"margin_bottom": 2,
"backdrop_filter": 1,
"transition": 2,
"box_shadow": 3,
"height": 4,
"border_color": 2,
"transform": 1,
"border_left": 2,
"padding_left": 1,
"width": 3,
"align_items": 2,
"justify_content": 2,
"font_size": 3,
"text_align": 3,
"font_weight": 2,
"margin": 2,
"animation": 1,
"_webkit_background_clip": 1,
"_webkit_text_fill_color": 1,
"behavior": 1,
"threshold": 1,
"rootmargin": 1
},
"by_useragent": {}
}
}
}
https://www.google.com
Changed custom metrics values: {
"_robots_txt": {
"redirected": false,
"status": 200,
"content_type": "text/plain",
"size": 6502,
"size_kib": 6.349609375,
"over_google_limit": false,
"comment_count": 7,
"record_counts": {
"by_type": {
"user_agent": 6,
"disallow": 176,
"allow": 64,
"sitemap": 1
},
"by_useragent": {
"*": {
"disallow": 167,
"allow": 61
},
"yandex": {
"disallow": 169,
"allow": 61
},
"adsbot-google": {
"disallow": 4,
"allow": 1
},
"facebookexternalhit": {
"allow": 2,
"disallow": 3
},
"twitterbot": {
"allow": 2,
"disallow": 3
}
}
}
}
}
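The `size_kib` values in these results are consistent with dividing `size` (in bytes) by 1024. A minimal sketch of that arithmetic follows. The 500 KiB threshold is the robots.txt size limit Google documents, but whether the custom metric computes `over_google_limit` exactly this way is an assumption, and `size_fields` is a hypothetical helper name.

```python
GOOGLE_LIMIT_KIB = 500  # Google's documented robots.txt size limit, in KiB


def size_fields(size_bytes: int) -> dict:
    """Derive size_kib and over_google_limit from a byte count (assumed logic)."""
    size_kib = size_bytes / 1024
    return {
        "size": size_bytes,
        "size_kib": size_kib,
        "over_google_limit": size_kib > GOOGLE_LIMIT_KIB,
    }


# Byte counts taken from the four test results above.
for size in (76, 528, 11539, 6502):
    print(size_fields(size))
```

None of the four test files comes close to the limit, which matches the `false` reported for each.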
Instead of relying on the content-type, would it be cleaner to just check the first non-whitespace character and make sure it's not `<`? AFAIK, anything HTML-like has to start with that, and it shouldn't be at the start of a robots.txt file.
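A minimal sketch of that first-character check, assuming the character in question is `<` and that the robots.txt body is already available as a string. `starts_like_html` is a hypothetical helper, not part of the custom metric.

```python
def starts_like_html(body: str) -> bool:
    """Return True if the first non-whitespace character is '<'."""
    # Drop a leading byte order mark first, since it is not whitespace
    # and would otherwise hide the '<'.
    stripped = body.lstrip("\ufeff").lstrip()
    return stripped.startswith("<")


print(starts_like_html("User-agent: *\nAllow: /\n"))
print(starts_like_html("  \n<!DOCTYPE html><html></html>"))
```

A file that opens with valid directives and only contains markup further down would not be flagged by this check.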
Could do. I was trying to avoid getting too heuristicy and just give non-opinionated information, letting the person writing the query choose whether to use this info or not. See also an internal email I've just sent you.
And another option would be to look for at least one
Like, what to make of this one: https://plumaagroavicola.com.br/robots.txt, which is neither HTML content type, nor starting with a `<`?
So I think, all in all, just exposing what we do know (Status-Code, which is already exposed, and now adding Content-Type) is probably open to less interpretation from the crawler and leaves it up to the query-er.
SGTM. FWIW, I'd report that hybrid one as a robots.txt under the assumption that a crawler would ignore the invalid chunks, but maybe some would throw the whole thing away. In either case, sending the whole thing back makes sense for research purposes since they're likely sending those busted robots files to the crawlers too.
Some recent analysis of robots.txt files following #191 shows a lot of rubbish in them.
Some can be filtered out by checking for a 200 status code (`AND INT64(custom_metrics.robots_txt.status) = 200`), but some return an HTML doc with a 200 status code :-(
Let's add the `content-type` header so we can remove HTML pages if we want, as they are likely not robots.txt files.
FYI @garyillyes
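As a rough illustration of the filter this enables, here is a sketch in Python that applies the same idea to already-exported metric values rather than in SQL. The helper names `media_type` and `is_plausible_robots_txt` are hypothetical, and treating a missing Content-Type as "not HTML" is an assumption. The sample values are the `status` and `content_type` fields from the test results in this PR.

```python
def media_type(content_type: str) -> str:
    """Lowercased media type without parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def is_plausible_robots_txt(metric: dict) -> bool:
    """Keep 200 responses whose Content-Type is not HTML (assumed filter)."""
    content_type = metric.get("content_type") or ""
    return metric.get("status") == 200 and media_type(content_type) != "text/html"


# status and content_type copied from the test results in this PR.
samples = {
    "https://almanac.httparchive.org/en/2022/": {
        "status": 200, "content_type": "text/plain; charset=utf-8"},
    "https://example.com": {
        "status": 404, "content_type": "text/html"},
    "https://recruitment.uniuyo.edu.ng": {
        "status": 200, "content_type": "text/html; charset=UTF-8"},
    "https://www.google.com": {
        "status": 200, "content_type": "text/plain"},
}

kept = [url for url, metric in samples.items() if is_plausible_robots_txt(metric)]
print(kept)
```

The status check alone would still keep the third site, which serves an HTML page with a 200; the content-type check is what removes it.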
Test websites: