`web.search_urls()` is too lenient #2475

The regex pattern here:

https://github.com/sopel-irc/sopel/blob/bc5843bd26bfc17d3082bd704f09b4aa58ce1a50/sopel/tools/web.py#L253

doesn't ignore all IRC formatting characters. Bold and monospace formatting characters, for example, are included in the match when they sit directly against the end of the URL.
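To illustrate the trailing-formatting problem, here is a minimal sketch. Both patterns are simplified and hypothetical (neither is Sopel's actual regex), and the names `LENIENT`, `STRICTER`, and `IRC_FORMATTING` are made up for this example. It only shows how a plain "non-whitespace" match swallows a trailing bold or monospace control code, and how excluding those codes from the character class avoids it.

```python
import re

# IRC formatting control codes: bold, color, hex color, reset, monospace,
# reverse, italic, strikethrough, underline.
IRC_FORMATTING = "\x02\x03\x04\x0f\x11\x16\x1d\x1e\x1f"

# Hypothetical, simplified patterns for illustration; NOT Sopel's real regex.
LENIENT = re.compile(r"https?://\S+")
STRICTER = re.compile(r"https?://[^\s" + re.escape(IRC_FORMATTING) + r"]+")

# A URL wrapped in bold (\x02), with the closing code right against the URL.
message = "see \x02https://github.com/sopel-irc/sopel\x02"

print(LENIENT.findall(message))   # match keeps the trailing \x02
print(STRICTER.findall(message))  # match stops before the trailing \x02
```

Whether to fix this in the pattern itself or by stripping formatting from the line before searching is a design choice; the sketch only demonstrates the first option.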

Additionally, text immediately before the protocol (with no whitespace or other word boundary in between) doesn't prevent a match. For example, "**bold of youhttps://github.com/sopel-irc/sopel**" (note lack of space between `you` and `https`) resulted in `url` trying to fetch a page, when the link should have been ignored because `youhttps` is not a valid protocol. Hypothetically, there could be custom protocols that end with one of the known strings, so Sopel should make sure the _whole_ protocol string matches what it's looking for.
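One possible way to require that the whole protocol string matches is a negative lookbehind that rejects any preceding character that could itself be part of a URI scheme (letters, digits, `+`, `-`, `.`). The sketch below assumes a hypothetical helper named `find_urls`; it is not Sopel's implementation, and the `\S+` tail is deliberately simplistic.

```python
import re


def find_urls(text, schemes=("http", "https")):
    """Find URLs whose scheme is not glued onto preceding scheme characters.

    Hypothetical helper for illustration only; not part of Sopel.
    """
    scheme_alt = "|".join(re.escape(scheme) for scheme in schemes)
    pattern = re.compile(
        # The previous character must not be one that could extend the
        # scheme to the left (letters, digits, "+", ".", "-").
        r"(?<![A-Za-z0-9+.\-])"
        r"(?:" + scheme_alt + r")://\S+"
    )
    return pattern.findall(text)


# "youhttps" is rejected; a properly separated "https" is accepted.
print(find_urls("bold of youhttps://github.com/sopel-irc/sopel"))
print(find_urls("bold of you https://github.com/sopel-irc/sopel"))
```

Because IRC formatting codes are not in the lookbehind's character class, a URL that directly follows a bold or monospace code would still be found with this approach.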

This leniency affects core message dispatch because `web.search_urls()` is used to build `PreTrigger.urls`:

https://github.com/sopel-irc/sopel/blob/bc5843bd26bfc17d3082bd704f09b4aa58ce1a50/sopel/trigger.py#L269-L271

Of note: `web.search_urls()` has a `clean` parameter that causes it to run `web.trim_urls()` on the found matches, which the `PreTrigger` code doesn't make use of.


The `PreTrigger` code in question:

```python
# Search URLs after CTCP parsing
self.urls = tuple(
    web.search_urls(self.args[-1], schemes=url_schemes))
```
