Fix UTF-16 surrogate pair validation by atgreen · Pull Request #52 · edicl/flexi-streams

atgreen · 2025-11-21T23:37:43Z

Summary

This PR fixes a bug in the UTF-16 decoder where low surrogates (#xdc00-#xdfff) were incorrectly accepted as the first word of a surrogate pair. According to the UTF-16 specification, only high surrogates (#xd800-#xdbff) are valid as the first word.

Problem

The UTF-16 decoder in decode.lisp was using the range check (<= #xd800 word #xdfff) to detect surrogate pairs. This range includes both:

High surrogates: #xd800-#xdbff (valid as first word)
Low surrogates: #xdc00-#xdfff (invalid as first word)

This allowed malformed UTF-16 data whose first word is a low surrogate to be decoded as if it were a valid surrogate pair, potentially producing garbage Unicode code points.
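To make the failure mode concrete, here is a minimal sketch of how a mask-based combine step treats a low surrogate in the first position. The helper `combine-surrogates` is hypothetical and is not copied from decode.lisp; it only assumes the usual approach of keeping the low 10 bits of each word.

```lisp
;; Hypothetical helper (an assumption, not code from decode.lisp):
;; combine two 16-bit words by keeping the low 10 bits of each.
(defun combine-surrogates (first-word second-word)
  (+ #x10000
     (ash (logand first-word #x3ff) 10)
     (logand second-word #x3ff)))

;; Well-formed pair: high surrogate #xd83d, low surrogate #xde00.
(combine-surrogates #xd83d #xde00) ; => #x1f600 (U+1F600)

;; Malformed pair: #xdc3d is a LOW surrogate, but its low 10 bits match
;; those of #xd83d, so masking yields the same plausible-looking result.
(combine-surrogates #xdc3d #xde00) ; => #x1f600
```

Under that assumption the masking discards the bits that distinguish high from low surrogates, so the range check on the first word is the only place where the malformed input can be caught.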

Solution

Changed the validation in both UTF-16 LE and BE decoders (lines 366 and 397 in decode.lisp) from:

(cond ((<= #xd800 word #xdfff)

to:

(cond ((<= #xd800 word #xdbff)

This ensures only high surrogates are accepted as the first word of a surrogate pair.
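As a rough, self-contained illustration of the narrowed check, the sketch below decodes a list of 16-bit words. The function name `decode-utf-16-words`, the use of plain `error`, and the explicit branch that rejects a leading low surrogate are all assumptions made for this example; the PR itself only changes the upper bound of the range check.

```lisp
;; Illustrative sketch only; names and error handling are assumptions,
;; not the flexi-streams implementation.
(defun decode-utf-16-words (words)
  "Decode a list of 16-bit WORDS into a list of code points.
Signals an error on an unpaired or misordered surrogate."
  (loop while words
        collect (let ((word (pop words)))
                  (cond ((<= #xd800 word #xdbff)
                         ;; High surrogate: the next word must be a low surrogate.
                         (let ((next (pop words)))
                           (unless (and next (<= #xdc00 next #xdfff))
                             (error "High surrogate #x~X not followed by a low surrogate."
                                    word))
                           (+ #x10000
                              (ash (logand word #x3ff) 10)
                              (logand next #x3ff))))
                        ((<= #xdc00 word #xdfff)
                         ;; Low surrogate in first position: always malformed.
                         (error "Unpaired low surrogate #x~X." word))
                        ;; Any other word is a code point in its own right.
                        (t word)))))
```

With this sketch, `(decode-utf-16-words '(#x41 #xd83d #xde00))` returns the code points for "A" and U+1F600, while `(decode-utf-16-words '(#xdc00 #xdc00))` signals an error instead of producing a code point.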

Testing

Added 8 comprehensive test cases to test/test.lisp that verify:

Low surrogate pairs are correctly rejected with encoding errors
Both UTF-16 LE and BE are properly validated
Various low surrogate values throughout the range are tested
Edge cases like low surrogate followed by valid non-surrogate characters

All existing tests pass. Of the 8 new tests, 7 fail before the fix, and all of them pass after it.
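The kinds of cases listed above could be laid out as a small table of octet sequences. The sketch below is not the content of test/test.lisp: the helper names, the case list, and the choice to check only the first word are hypothetical and meant to show the shape of such tests for both byte orders.

```lisp
;; Hypothetical test sketch; not the cases added in test/test.lisp.
(defun utf-16-word (octets little-endian-p)
  "Assemble the first two OCTETS into one 16-bit word."
  (let ((a (elt octets 0))
        (b (elt octets 1)))
    (if little-endian-p
        (logior a (ash b 8))
        (logior (ash a 8) b))))

(defun valid-first-word-p (word)
  "True unless WORD is a low surrogate, which may never come first."
  (not (<= #xdc00 word #xdfff)))

;; Each entry: octets, little-endian flag, expected acceptance.
(defparameter *first-word-cases*
  '((#(#x00 #xdc #x00 #xdc) t   nil)   ; LE, low surrogate pair
    (#(#xdc #x00 #xdc #x00) nil nil)   ; BE, low surrogate pair
    (#(#xff #xdf #x41 #x00) t   nil)   ; LE, #xdfff followed by "A"
    (#(#xdd #x80 #x00 #x41) nil nil)   ; BE, #xdd80 followed by "A"
    (#(#x3d #xd8 #x00 #xde) t   t)     ; LE, well-formed pair
    (#(#xd8 #x3d #xde #x00) nil t)))   ; BE, well-formed pair

(defun run-first-word-cases ()
  "Return true when every case matches its expected acceptance."
  (loop for (octets little-endian-p expected) in *first-word-cases*
        always (eq expected
                   (valid-first-word-p (utf-16-word octets little-endian-p)))))
```

Calling `(run-first-word-cases)` returns true when every malformed sequence is rejected and every well-formed one is accepted.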

Test Results

Before fix: 7 new tests failed (incorrectly decoded invalid data)
After fix: All tests pass (properly reject invalid sequences)

Files Changed

decode.lisp: Fixed surrogate validation in UTF-16 LE and BE decoders (2 lines)
test/test.lisp: Added 8 test cases for low surrogate validation (14 lines)

The UTF-16 decoder was incorrectly accepting low surrogates (#xdc00-#xdfff) as the first word of a surrogate pair. According to the UTF-16 specification, only high surrogates (#xd800-#xdbff) are valid as the first word. This fix changes the validation in both UTF-16 LE and BE decoders to only accept the correct range of high surrogates, properly rejecting invalid sequences where low surrogates appear as the first word. Also added comprehensive test cases to verify that low surrogate pairs are correctly rejected with encoding errors.

atgreen wants to merge 1 commit into edicl:master from atgreen:fix-utf16-surrogate-validation