
Robots.txt parsing fails when one rule line is invalid #111788

@gyula-lakatos

Bug report

Bug description:

Parsing a robots.txt file with urllib.robotparser fails entirely if a single line in the file cannot be parsed. I don't think this is valid behavior. Ideally, unparsable or invalid lines should be skipped. The norobots-rfc says much the same: "Implementors should pay particular attention to the robustness in parsing of the /robots.txt file."

  File "/usr/local/lib/python3.11/urllib/robotparser.py", line 123, in parse
    entry.rulelines.append(RuleLine(line[1], False))
                           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/robotparser.py", line 222, in __init__
    path = urllib.parse.urlunparse(urllib.parse.urlparse(path))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/usr/local/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '[routes.productDetail(product.sku, product.slug)' does not appear to be an IPv4 or IPv6 address

I know [routes.productDetail(product.sku, product.slug) is clearly not a valid URL, but I don't think the whole parsing should error out because of this one line.
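Until the parser skips such lines itself, one possible workaround is to pre-filter the file before handing it to RobotFileParser.parse(). The sketch below is only an illustration under a few assumptions: the helper name drop_unparsable_rules is hypothetical, example.com is a placeholder, and the offending Disallow line is a guess at what an unrendered template might look like, since the original robots.txt line is not shown above. The helper applies the same urlparse call that appears in the traceback and drops any Allow/Disallow line for which it raises ValueError.

```python
import urllib.parse
import urllib.robotparser


def drop_unparsable_rules(lines):
    """Return the lines, minus Allow/Disallow rules whose path urlparse rejects.

    Hypothetical helper, not part of the standard library.
    """
    kept = []
    for line in lines:
        # Ignore trailing comments when inspecting the line.
        content = line.split("#", 1)[0]
        key, sep, value = content.partition(":")
        if sep and key.strip().lower() in ("allow", "disallow"):
            try:
                # Same call that raises in RuleLine.__init__ in the traceback.
                urllib.parse.urlparse(urllib.parse.unquote(value.strip()))
            except ValueError:
                continue  # skip only this rule, keep the rest of the file
        kept.append(line)
    return kept


# Assumed sample file; the third line stands in for the invalid rule.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: //[[routes.productDetail(product.sku, product.slug)]]
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(drop_unparsable_rules(robots_txt.splitlines()))

blocked = parser.can_fetch("*", "https://example.com/private/page")
allowed = parser.can_fetch("*", "https://example.com/public")
```

With this filter in place the valid rules are still honored, which is the behavior I would expect from the parser by default.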

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Labels

3.13 (bugs and security fixes), 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)
