web.search_urls() is too lenient #2475

@dgw

The regex pattern here:

re_url = r'((?:%s)(?::\/\/\S+))' % schemes_patterns

doesn't ignore all IRC formatting characters. Because \S+ matches any non-whitespace character, the control codes for bold (\x02) and monospace (\x11) formatting, for example, are included in the match when they sit right up against the end of the URL.
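A minimal sketch of the trailing-formatting problem, assuming a scheme list of http, https, and ftp (the real schemes_patterns value is built elsewhere in Sopel and may differ). It rebuilds the pattern quoted above with plain re, without importing Sopel, and feeds it text wrapped in the IRC bold and monospace control codes:

```python
import re

# Assumed scheme list for illustration; Sopel builds schemes_patterns itself.
schemes_patterns = "|".join(re.escape(s) for s in ("http", "https", "ftp"))
re_url = r'((?:%s)(?::\/\/\S+))' % schemes_patterns

BOLD = "\x02"       # IRC bold control code
MONOSPACE = "\x11"  # IRC monospace control code

bold_hits = re.findall(re_url, "see " + BOLD + "https://example.com" + BOLD + " now")
mono_hits = re.findall(re_url, MONOSPACE + "https://example.com/a" + MONOSPACE)

# Each match ends with the formatting character instead of stopping before it.
for hit in bold_hits + mono_hits:
    print(repr(hit))
```

The leading control code is skipped only because the match starts at the scheme; nothing in the pattern stops the trailing one from being captured.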

Additionally, text immediately before the protocol (with no whitespace or other word boundary) isn't ignored either. For example, "bold of youhttps://github.com/sopel-irc/sopel" (note the lack of a space between "you" and "https") resulted in the url plugin trying to fetch a page, when it should have been ignored because youhttps is not a valid protocol. Hypothetically, there could be custom protocols that end with one of the known strings, and Sopel should make sure the whole protocol string matches what it's looking for.
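One possible direction, sketched here under assumptions and not taken from Sopel's code: add a negative lookbehind so the scheme cannot be preceded by another character that is valid inside a URI scheme (letters, digits, "+", "-", "."), and stop the match at ASCII control characters. The names lenient and strict and the scheme list are hypothetical.

```python
import re

# Assumed scheme list for illustration only.
scheme_alt = "|".join(re.escape(s) for s in ("http", "https", "ftp"))

# The pattern quoted in this issue.
lenient = re.compile(r'((?:%s)(?::\/\/\S+))' % scheme_alt)

# Hypothetical stricter variant:
#  - (?<![A-Za-z0-9+.\-]) rejects a scheme glued to preceding scheme characters
#  - [^\s\x00-\x1f]+ stops at whitespace and at ASCII control characters
strict = re.compile(r'(?<![A-Za-z0-9+.\-])((?:%s)://[^\s\x00-\x1f]+)' % scheme_alt)

glued = "bold of youhttps://github.com/sopel-irc/sopel"
bolded = "see \x02https://example.com\x02 now"

lenient_glued = lenient.findall(glued)
strict_glued = strict.findall(glued)
strict_bolded = strict.findall(bolded)

print(lenient_glued)   # the lenient pattern still finds a URL inside "youhttps://..."
print(strict_glued)    # the stricter sketch finds nothing
print(strict_bolded)   # and drops the trailing bold control code
```

This sketch does not handle trailing punctuation or other cleanup, and excluding every control character may be broader than what the project wants; it only illustrates the two checks described above.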

This leniency affects core message dispatch because web.search_urls() is used to build PreTrigger.urls:

sopel/sopel/trigger.py

Lines 269 to 271 in bc5843b

# Search URLs after CTCP parsing
self.urls = tuple(
    web.search_urls(self.args[-1], schemes=url_schemes))

Of note: web.search_urls() has a clean parameter that causes it to run web.trim_url() on each found match, which the PreTrigger code doesn't make use of.

Labels

Bug: Things to squish; generally used for issues
