Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users#16520

Open
msooseth wants to merge 18 commits into develop from fix-utf8

Conversation

Contributor

@msooseth msooseth commented Mar 12, 2026

Fixes #16519

Problem

The compiler crashes with an unhandled nlohmann::json::type_error when a source file
contains invalid UTF-8 bytes in a NatSpec comment (e.g. /// \xF7).

The scanner copies raw source bytes into the NatSpec comment literal without UTF-8
validation. These bytes end up in the metadata JSON via natspecUser/natspecDev,
and jsonPrint's dump(ensure_ascii=true) then throws type_error::316 on the
invalid byte.

A secondary issue: even after adding the scanner error, the compiler reported a
misleading "Expected pragma, import directive..." message instead of the real error,
because Token::Illegal fell into the default: branch of the top-level parser switch
which ignores currentError().

Fix

Scanner (liblangutil/Scanner.cpp): After scanning each NatSpec comment literal,
validate it using the existing solidity::util::validateUTF8 from libsolutil/UTF8.h
(which liblangutil already links against). If invalid UTF-8 is found, report the new
ScannerError::InvalidUTF8InComment error.

  • scanSingleLineDocComment: validate after literal.complete(); set
    m_skippedComments[NextNext].error if invalid. scanSlash checks and propagates
    this as Token::Illegal.
  • scanMultiLineDocComment: validate after literal.complete(); return
    setError(ScannerError::InvalidUTF8InComment) if invalid.

Parser (libsolidity/parsing/Parser.cpp): Add an explicit case Token::Illegal:
in the top-level source-unit switch that calls fatalParserError with
to_string(m_scanner->currentError()), matching the pattern already used elsewhere
in the parser.

Output

Output now:

./solc/solc --optimize --ir minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33

Warning: This is a pre-release compiler version, please do not use it in production.

Error: Invalid UTF-8 sequence in NatSpec comment.
 --> minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33:1:1:
  |
1 | ///
  | ^ (Relevant source part starts here and spans across multiple lines).

Disclaimers

Written by Claude Code, reviewed by myself.

@msooseth msooseth requested a review from clonker March 12, 2026 13:59
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Mar 12, 2026
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users Mar 12, 2026
Member

@clonker clonker left a comment


Some minor adjustments: you can use the single-argument version of validateUTF8 if you're not using the invalidPos anyway. And I think you can get away without using m_skippedComments[...].error as an intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.

Contributor Author

@msooseth msooseth left a comment


Thanks, very good points! I think the last one is better left as-is though?

I have also fixed the Python script for the error checking, please double-check that one... I don't want to accidentally disable all tests or something :S

Comment on lines +162 to +163
if basename(f) == "invalid_utf8_sequence.sol":
continue # ignore the test with broken utf-8 encoding
if "invalid_utf8" in basename(f):
continue # ignore tests with invalid utf-8 encoding
Collaborator


Please don't generalize this hack :)

The issue is in the splitting script - it should just handle invalid UTF-8 properly and not generate UnicodeDecode error ever (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)
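One possible shape for the proper fix described above is to open test files with a tolerant error handler, so invalid bytes survive the round trip instead of raising UnicodeDecodeError. This is a hedged sketch, not the actual splitting script; read_source_tolerant is a hypothetical helper name:

```python
import os
import tempfile

def read_source_tolerant(path):
    """Read a test file without ever raising UnicodeDecodeError:
    surrogateescape maps each invalid byte to a lone surrogate that
    round-trips exactly on re-encode."""
    with open(path, "r", encoding="utf-8", errors="surrogateescape") as f:
        return f.read()

# Demonstrate on a file containing the crashing byte from the issue.
raw = b"/// \xf7\ncontract C {}\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".sol") as tmp:
    tmp.write(raw)
    path = tmp.name
try:
    text = read_source_tolerant(path)
    # Byte 0xF7 became the surrogate U+DCF7 and re-encodes losslessly.
    assert "\udcf7" in text
    assert text.encode("utf-8", errors="surrogateescape") == raw
finally:
    os.remove(path)
```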

Contributor Author


I fixed this I hope. Please check :)

}
literal.complete();
if (!util::validateUTF8(m_skippedComments[NextNext].literal))
setError(ScannerError::InvalidUTF8InComment);
Collaborator


What about normal, non-Natspec comments? I think this change only makes these chars illegal in Natspec, but that's not enough. If we cannot have them in metadata then they must be banned everywhere because it is possible to get whole sources into metadata by compiling with --metadata-literal option.

Collaborator


The only other case I can think of are strings. I'm actually surprised that we do not validate those in the parser. We instead do that during analysis, in SyntaxChecker, but also with redundant checks in other places:

These checks at a glance look like they should have been asserts, it's possible that they've been added to plug corner cases that were discovered after the fact.

They can also be bypassed though if you use --stop-after parsing. The only thing that saves us here is that you cannot request the --metadata output if you use that option. I would not be surprised if there are other ways to bypass it I can't think of now. Overall, trying to check it in analysis feels like whack-a-mole. I think we should move all such checks to the parser and also have some sanity check that every file is either free of invalid UTF-8 or parsing ended with an error.

Collaborator


Actually, it's more complicated. We do check strings in the parser, but not unicode"" ones.

Still, I don't see a good reason not to do that for unicode"" strings as well if there are no situations where those would be valid anyway.

Contributor Author


I think I made it work now for illegal UTF-8 in normal comments, too. Please check. I also added tests for it.

Collaborator


We should also test this in Yul (and maybe also inline assembly). If invalid UTF-8 crashes JSON generation then an easy way to trigger that is to request Yul output in Standard JSON mode.

Contributor Author


I believe I have added tests for this too. Please check :)

except (UnicodeDecodeError, UnicodeEncodeError):
print(f"Warning: Test case in {f} contains invalid UTF-8 characters, skipping.")
continue

Contributor Author


@cameel I think I need help with how to do this better. Sorry :S

Collaborator

@cameel cameel Mar 16, 2026


Just to avoid stalling this PR too much, maybe let's restore your original hack and move this bit to a separate PR? I'd rather avoid that hack long-term, but it's fine to have it temporarily and it's independent of the bugfix.

Contributor Author

@msooseth msooseth Mar 17, 2026


Thanks. I went back to the old hack :) I had to rename a file that contained invalid UTF-8 so that the encoding issue is reflected in the filename, and then it's properly filtered out by the hack.

size_t invalidPos;
if (!util::validateUTF8(m_source.source().substr(startPosition, m_source.position() - startPosition), invalidPos))
{
m_tokens[NextNext].location.start = static_cast<int>(startPosition + invalidPos);
Member


This messes up rescan, which is called by setScannerMode after an assembly block. Try running this through solc and observe the output:

contract C { function f() public { assembly { let x := 1 }
/*�*/
} }

yields "Error: Invalid token." instead of the expected "Error: Invalid UTF-8 sequence in comment.". Same for single-line comments with //.



Development

Successfully merging this pull request may close these issues.

Crash on invalid UTF-8 bytes in Natspec