Fix ReDoS in HTML tokenizer regex (#633) #634

Crozzers · 2025-07-12T16:42:29Z

This PR fixes #633, which reports a ReDoS vulnerability in _sorta_html_tokenize_re. It also adds a regression test suite for ReDoS attacks and integrates it into the CI/CD checks.

The fix

The problem was with the section of regex that matches tags with attributes, specifically line 1276:

python-markdown2/lib/markdown2.py

Lines 1272 to 1278 in adf4e81

    (?:                                         # attributes
        \s+                                     # whitespace after tag
        (?:[^\t<>"'=/]+:)?
        [^<>"'=/]+=                             # attr name
        (?:".*?"|'.*?'|[^<>"'=/\s]+)            # value, quoted or unquoted. If unquoted, no spaces allowed
    )*
    \s*/?>

The problem was with the section that matches quoted attribute values using ".*?". Take the following input:

<p m="1"<p m="1"<p m="1"<p m="1"<p m="1"<p m="1"<p m="1" ...[x 5000]...         </div

The first m="1" matches the attribute regex just fine, but then we hit another <p right after. The regex expects a closing bracket for this tag, and since . also matches quote characters, ".*?" is free to extend past the closing quote, so the regex treats what follows as part of the attribute value as well. It ends up consuming the whole string, matching m="1"<p m="1"<p m="1" ...[x 5000]...<p m="1" as the attribute until it reaches the end of the input, at which point it fails to find a match and catastrophically backtracks.

By changing the quoted-value pattern from ".*?" to "[^"]*?", we avoid this: a quoted value can no longer extend past its closing quote. The regex reads m="1" as the attribute, the following <p breaks the match immediately, and the regex can exit.
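The difference can be reproduced outside the library with a cut-down pattern. This is only a sketch: the pattern below is rebuilt from the excerpt above rather than copied from markdown2, every name in it (TAG_TEMPLATE, OLD_TAG_RE, NEW_TAG_RE, build_attack_input, time_search) is made up for illustration, and applying the same change to the single-quoted alternative is an assumption.

```python
import re
import time

# Simplified stand-in for the tag branch of the tokenizer regex, rebuilt from
# the excerpt above. All names here are illustrative, not markdown2 identifiers.
TAG_TEMPLATE = r"""
    </?\w+                          # tag name
    (?:                             # attributes
        \s+                         # whitespace after tag
        (?:[^\t<>"'=/]+:)?
        [^<>"'=/]+=                 # attr name
        (?:@QUOTED@|[^<>"'=/\s]+)   # value, quoted or unquoted
    )*
    \s*/?>
"""

# Old form: "." also matches quotes, so a value can run past its closing quote.
OLD_QUOTED = '".*?"' + "|" + "'.*?'"
# New form: a value ends at the first matching quote.
# (Changing the single-quoted alternative too is an assumption of this sketch.)
NEW_QUOTED = '"[^"]*?"' + "|" + "'[^']*?'"

OLD_TAG_RE = re.compile(TAG_TEMPLATE.replace("@QUOTED@", OLD_QUOTED), re.X)
NEW_TAG_RE = re.compile(TAG_TEMPLATE.replace("@QUOTED@", NEW_QUOTED), re.X)


def build_attack_input(repeats):
    """Repeat an unclosed tag, then end without any closing bracket."""
    return '<p m="1"' * repeats + " " * 9 + "</div"


def time_search(pattern, text):
    """Return the search result and the wall-clock seconds it took."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


if __name__ == "__main__":
    sample = '<p m="1"<p m="2">'
    print("old:", OLD_TAG_RE.search(sample).group(0))
    print("new:", NEW_TAG_RE.search(sample).group(0))
    for name, pattern in (("old", OLD_TAG_RE), ("new", NEW_TAG_RE)):
        match, elapsed = time_search(pattern, build_attack_input(1000))
        print(name, match, f"{elapsed:.3f}s")
```

On the short sample, the old pattern matches the entire string as a single tag, because the quoted value stretches across the second <p, while the new pattern matches only <p m="2">. On the generated input neither pattern finds a match; the timings show how much work each one does before giving up.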

This is what I believe happened, based on what I saw stepping through it with debuggex.

The new test suite

Since a lot of these ReDoS attacks rely on creating a massive string to force extensive backtracking, I thought it would be inefficient to include them with the normal test case files.

Instead, I've added a separate test suite that generates these inputs on the fly and passes them to the library. It uses a time limit to decide pass/fail, with no test case allowed to take longer than 3s.
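A minimal sketch of what a time-limited check of this kind could look like. The helper names (build_repeated_input, run_with_time_limit) are hypothetical and the actual suite in this PR may be structured differently; only the 3s limit comes from the description above.

```python
import time

TIME_LIMIT_SECONDS = 3.0  # limit quoted in the PR description


def build_repeated_input(fragment, count, suffix=""):
    """Generate a large input on the fly instead of storing it in a file."""
    return fragment * count + suffix


def run_with_time_limit(convert, text, limit=TIME_LIMIT_SECONDS):
    """Call convert(text) and report whether it finished within the limit.

    Returns (passed, elapsed_seconds). The call is only measured, not
    interrupted, so a truly runaway input still has to finish on its own.
    """
    start = time.perf_counter()
    convert(text)
    elapsed = time.perf_counter() - start
    return elapsed <= limit, elapsed


if __name__ == "__main__":
    # str.upper stands in for the function under test in this sketch.
    text = build_repeated_input('<p m="1"', 5000, " " * 9 + "</div")
    passed, elapsed = run_with_time_limit(str.upper, text)
    print("pass" if passed else "FAIL", f"{elapsed:.3f}s")
```

In a real run, convert would be the library entry point, for example markdown2.markdown. Measuring elapsed time is the simplest option; running each case in a subprocess with a hard timeout would be the alternative if a hung test must be killed.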

I searched the repo for any ReDoS related issues/PRs and added a test case for each of them.

I've added this to the Makefile as make testredos and also added it to the CI/CD workflow.

Crozzers · 2025-07-12T17:14:36Z

@nicholasserra on a related note, I was looking at the pending changes and thinking it might be time for a release. I can see 1 other security fix in there, as well as some other stuff. According to PyPI the last release was in January, so we're probably due for one.

nicholasserra · 2025-07-27T16:13:10Z

Thanks for the writeup on this, and the new test suite! Also first time seeing that debuggex tool, that's pretty awesome. Definitely gonna use that at some point.

I'll get a release out after this merge

Crozzers added 4 commits July 12, 2025 17:06

Fix reDOS in HTML tokenizer regex

034d126

Add test suite for ReDoS attacks

74fc8cd

Adjust size of output in redos test case

68885ac

Update changelog

101f1eb

nicholasserra merged commit 4840300 into trentm:master Jul 27, 2025
15 checks passed
