Fix UTF-16 surrogate pair validation #52

Open
atgreen wants to merge 1 commit into edicl:master from atgreen:fix-utf16-surrogate-validation

atgreen commented Nov 21, 2025

Summary

This PR fixes a bug in the UTF-16 decoder where low surrogates (#xdc00-#xdfff) were incorrectly accepted as the first word of a surrogate pair. According to the UTF-16 specification, only high surrogates (#xd800-#xdbff) are valid as the first word.

Problem

The UTF-16 decoder in decode.lisp used the range check (<= #xd800 word #xdfff) to detect the first word of a surrogate pair. This range covers the entire surrogate block, which includes both:

  • High surrogates: #xd800-#xdbff (valid as first word)
  • Low surrogates: #xdc00-#xdfff (invalid as first word)

As a result, malformed UTF-16 input in which a low surrogate appears where a high surrogate is required was decoded as if it were a valid surrogate pair, which could produce meaningless code points.
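
To make the failure mode concrete, here is a minimal sketch that assumes the textbook UTF-16 combination formula. The function names are hypothetical, and the arithmetic actually used in decode.lisp is not shown in this PR and may differ (for example, it may mask bits instead of subtracting), in which case the bogus result would be a different value.

```lisp
;; Hypothetical helpers for illustration; not taken from decode.lisp.

(defun old-first-word-p (word)
  "The check before the fix: matches the whole surrogate block."
  (<= #xd800 word #xdfff))

(defun high-surrogate-p (word)
  "The check after the fix: matches high surrogates only."
  (<= #xd800 word #xdbff))

(defun combine-surrogates (first-word second-word)
  "Textbook UTF-16 pair combination, performed without any validation."
  (+ #x10000
     (ash (- first-word #xd800) 10)
     (- second-word #xdc00)))

;; A well-formed pair yields the expected code point (U+1F600).
(combine-surrogates #xd83d #xde00)

;; A low surrogate passes the old check, and combining it with another
;; low surrogate yields #x110000, which is above the Unicode maximum
;; of #x10FFFF.
(old-first-word-p #xdc00)
(combine-surrogates #xdc00 #xdc00)
```

Under this assumption, the narrowed check is what prevents #xdc00 from ever reaching the combination step as a first word.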

Solution

Changed the validation in both UTF-16 LE and BE decoders (lines 366 and 397 in decode.lisp) from:

(cond ((<= #xd800 word #xdfff)

to:

(cond ((<= #xd800 word #xdbff)

This ensures only high surrogates are accepted as the first word of a surrogate pair.
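
The effect of the narrowed range can be sketched with a small standalone decoder. This is not the code in decode.lisp: the function name, the vector-of-words input, and the use of a plain ERROR in place of the library's encoding-error handling are all assumptions made for illustration.

```lisp
;; Hypothetical standalone decoder; the real one in decode.lisp reads
;; octets from a stream and reports errors through its own mechanism.

(defun decode-utf-16-words (words)
  "Decode a vector of 16-bit code units into a list of code points.
Signals an ERROR on any unpaired surrogate."
  (let ((i 0)
        (n (length words))
        (result '()))
    (loop while (< i n)
          do (let ((word (aref words i)))
               (incf i)
               (cond ((<= #xd800 word #xdbff) ; high surrogate only
                      (when (>= i n)
                        (error "Truncated surrogate pair after #x~X" word))
                      (let ((next (aref words i)))
                        (unless (<= #xdc00 next #xdfff)
                          (error "Expected low surrogate, got #x~X" next))
                        (incf i)
                        (push (+ #x10000
                                 (ash (- word #xd800) 10)
                                 (- next #xdc00))
                              result)))
                     ((<= #xdc00 word #xdfff) ; low surrogate in first position
                      (error "Unpaired low surrogate #x~X" word))
                     (t (push word result)))))
    (nreverse result)))

;; A valid sequence decodes normally.
(decode-utf-16-words #(#x41 #xd83d #xde00))

;; A sequence that starts with a low surrogate is rejected.
(handler-case (decode-utf-16-words #(#xdc00 #xdc00))
  (error (e) (format nil "~A" e)))
```

In this sketch the low-surrogate case is rejected by an explicit second branch. How decode.lisp handles a word that no longer matches the first branch is not visible in the two changed lines, so that part is a guess at the intended behavior described above.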

Testing

Added 8 comprehensive test cases to test/test.lisp that verify:

  • Sequences that start with a low surrogate are rejected with encoding errors
  • Both UTF-16 LE and BE are properly validated
  • Various low surrogate values throughout the range are tested
  • Edge cases like low surrogate followed by valid non-surrogate characters
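
Inputs for cases like these can be built by serializing 16-bit words in either byte order. The helper below is a hypothetical sketch and is not taken from test/test.lisp, whose actual test format is not shown in this PR.

```lisp
;; Hypothetical helper for constructing malformed UTF-16 test input.

(defun words-to-octets (words endianness)
  "Serialize a list of 16-bit WORDS as octets. ENDIANNESS is :LE or :BE."
  (let ((octets (make-array (* 2 (length words))
                            :element-type '(unsigned-byte 8)))
        (pos 0))
    (dolist (word words octets)
      (let ((hi (ldb (byte 8 8) word))  ; upper 8 bits
            (lo (ldb (byte 8 0) word))) ; lower 8 bits
        (ecase endianness
          (:le (setf (aref octets pos) lo
                     (aref octets (1+ pos)) hi))
          (:be (setf (aref octets pos) hi
                     (aref octets (1+ pos)) lo)))
        (incf pos 2)))))

;; Low surrogate followed by "A", little-endian: octets DC00 as 00 DC.
(words-to-octets '(#xdc00 #x0041) :le)

;; The same words, big-endian.
(words-to-octets '(#xdc00 #x0041) :be)
```

Each resulting octet vector would then be passed to the decoder under test with the matching UTF-16 LE or BE external format, and the test would expect an encoding error.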

All existing tests pass, and the new tests correctly fail before the fix and pass after.

Test Results

Before fix: 7 new tests failed (incorrectly decoded invalid data)
After fix: All tests pass (properly reject invalid sequences)

Files Changed

  • decode.lisp: Fixed surrogate validation in UTF-16 LE and BE decoders (2 lines)
  • test/test.lisp: Added 8 test cases for low surrogate validation (14 lines)

The UTF-16 decoder was incorrectly accepting low surrogates
(#xdc00-#xdfff) as the first word of a surrogate pair. According to
the UTF-16 specification, only high surrogates (#xd800-#xdbff) are
valid as the first word.

This fix changes the validation in both UTF-16 LE and BE decoders to
only accept the correct range of high surrogates, properly rejecting
invalid sequences where low surrogates appear as the first word.

Also added comprehensive test cases to verify that low surrogate pairs
are correctly rejected with encoding errors.