Fix ReDoS in HTML tokenizer regex (#633) #634
Merged
This PR fixes #633, which showcases a ReDoS vulnerability in `_sorta_html_tokenize_re`. It also adds a regression test suite for ReDoS attacks and integrates it into the CI/CD checks.
The fix
The problem was with the section of regex that matches tags with attributes, specifically line 1276:
python-markdown2/lib/markdown2.py
Lines 1272 to 1278 in adf4e81
More specifically, the problem is the part that matches quoted attribute values using `".*?"`. Take an input that repeats `<p m="1"` thousands of times and never closes the tag.
The first `m="1"` matches the attribute regex just fine, but then we hit another `<p` right after. The regex expects a closing bracket for this tag, so it assumes the `<p` is part of the attribute as well. It ends up consuming the whole string, matching `m="1"<p m="1"<p m="1" ...[x5000]...<p m="1"` as the attribute until it reaches the end of the input, at which point it fails to find a match and catastrophically backtracks.
By changing the attribute matching criteria to `"[^"]*?"`, we prevent this. The regex reads `m="1"` as the attribute, the following `<p` breaks the match immediately, and the regex can exit.
This is what I believe happened, based on what I saw stepping through it with debuggex.
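The difference between the two quantifiers can be seen on a tiny input. The sketch below uses simplified stand-in patterns (`LAZY_DOT` and `NEGATED` are hypothetical names, and neither is the actual `_sorta_html_tokenize_re`); it only illustrates how `".*?"` can stretch across a closing quote while `"[^"]*?"` cannot.

```python
import re

# Simplified stand-ins for the "tag with attributes" part of a tokenizer regex.
# These are illustrative assumptions, not the pattern used in markdown2.
LAZY_DOT = re.compile(r'<\w+(?:\s+\w+=".*?")*\s*>')
NEGATED = re.compile(r'<\w+(?:\s+\w+="[^"]*?")*\s*>')

sample = '<p m="1"<p m="2">'

# With ".*?" the engine backtracks and lets the attribute value grow past
# the first closing quote, so the second "<p" ends up inside the value.
print(LAZY_DOT.search(sample).group(0))  # <p m="1"<p m="2">

# With "[^"]*?" the value can never contain a quote, so the attempt that
# starts at the first "<p" fails right away and only the second tag matches.
print(NEGATED.search(sample).group(0))   # <p m="2">
```

On an input that repeats the unclosed tag thousands of times, every extra closing quote gives the `".*?"` variant another place to try ending the value, which is where the backtracking cost comes from.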
The new test suite
Since a lot of these ReDoS attacks rely on creating a massive string to force extensive backtracking, I thought it would be inefficient to include them with the normal test case files.
Instead, I've added a separate test suite that generates these inputs on the fly and passes them to the library. It uses a time limit to decide pass/fail: a test case fails if it takes longer than 3s.
I searched the repo for any ReDoS-related issues/PRs and added a test case for each of them.
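As a rough illustration of a time-limited check with generated input, here is a minimal sketch. It is not the test suite added in this PR: `runs_within_limit`, `SNIPPET` and `payload` are hypothetical names, and running each case in a child interpreter is just one way to enforce a hard limit.

```python
import subprocess
import sys

TIME_LIMIT_S = 3  # the limit mentioned in the description above


def runs_within_limit(snippet: str, payload: str, limit: float = TIME_LIMIT_S) -> bool:
    """Run `snippet` in a child interpreter with `payload` on stdin.

    Returns True only if the child exits cleanly before `limit` seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            input=payload,
            text=True,
            capture_output=True,
            timeout=limit,  # the child is killed once the limit is exceeded
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Payload built on the fly instead of being stored as a large fixture file.
payload = '<p m="1"' * 5000

# Hypothetical case body using a stand-in pattern; a real case would instead
# pass sys.stdin.read() to the library under test.
SNIPPET = r'''
import re, sys
re.search(r'<\w+(?:\s+\w+="[^"]*?")*\s*>', sys.stdin.read())
'''

if __name__ == "__main__":
    print("pass" if runs_within_limit(SNIPPET, payload) else "fail")
```

A separate process is used here because a regex that is stuck backtracking cannot be interrupted from another thread in the same interpreter, whereas a child process can simply be killed when the limit expires.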
I've added this to the makefile as `make testredos` and also added it to the CI/CD workflow.