Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users#16520

Open
msooseth wants to merge 18 commits into develop from fix-utf8

Conversation

Contributor

@msooseth msooseth commented Mar 12, 2026

Fixes #16519

Problem

The compiler crashes with an unhandled nlohmann::json::type_error when a source file
contains invalid UTF-8 bytes in a NatSpec comment (e.g. /// \xF7).

The scanner copies raw source bytes into the NatSpec comment literal without UTF-8
validation. These bytes end up in the metadata JSON via natspecUser/natspecDev,
and jsonPrint's dump(ensure_ascii=true) then throws type_error::316 on the
invalid byte.

A secondary issue: even after adding the scanner error, the compiler reported a
misleading "Expected pragma, import directive..." message instead of the real error,
because Token::Illegal fell into the default: branch of the top-level parser switch
which ignores currentError().

Fix

Scanner (liblangutil/Scanner.cpp): After scanning each NatSpec comment literal,
validate it using the existing solidity::util::validateUTF8 from libsolutil/UTF8.h
(which liblangutil already links against). If invalid UTF-8 is found, report the new
ScannerError::InvalidUTF8InComment error.

  • scanSingleLineDocComment: validate after literal.complete(); set
    m_skippedComments[NextNext].error if invalid. scanSlash checks and propagates
    this as Token::Illegal.
  • scanMultiLineDocComment: validate after literal.complete(); return
    setError(ScannerError::InvalidUTF8InComment) if invalid.

Parser (libsolidity/parsing/Parser.cpp): Add an explicit case Token::Illegal:
in the top-level source-unit switch that calls fatalParserError with
to_string(m_scanner->currentError()), matching the pattern already used elsewhere
in the parser.

Output

Output now:

./solc/solc --optimize --ir minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33

Warning: This is a pre-release compiler version, please do not use it in production.

Error: Invalid UTF-8 sequence in NatSpec comment.
 --> minimized-from-ab4ee1498fff0c71961d81fc308f1dee0be16f33:1:1:
  |
1 | ///
  | ^ (Relevant source part starts here and spans across multiple lines).

Disclaimers

Written by Claude Code, reviewed by myself.

@msooseth msooseth requested a review from clonker March 12, 2026 13:59
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Mar 12, 2026
@msooseth msooseth changed the title Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling for correct error message to users Fix crash on invalid UTF-8 bytes in NatSpec comments. ALSO fix missing Token::Illegal handling in switch, giving misleading error to users Mar 12, 2026
Member

@clonker clonker left a comment


Some minor adjustments: you can use the single-argument version of validateUTF8 if you're not using the invalidPos anyway. And I think you can get away without using m_skippedComments[...].error as an intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.

Contributor Author

@msooseth msooseth left a comment


Thanks, very good points! I think the last one is better left as-is though?

I have also fixed the Python script for the error checking, please double-check that one... I don't want to accidentally disable all tests or something :S

Comment on lines +162 to +163
if basename(f) == "invalid_utf8_sequence.sol":
continue # ignore the test with broken utf-8 encoding
if "invalid_utf8" in basename(f):
continue # ignore tests with invalid utf-8 encoding
Collaborator


Please don't generalize this hack :)

The issue is in the splitting script - it should just handle invalid UTF-8 properly and not generate UnicodeDecode error ever (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)
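One possible shape for the proper fix described above is to open test files with a tolerant error handler, so invalid bytes survive the round trip instead of raising UnicodeDecodeError. This is a hedged sketch, not the actual splitting script; read_source_tolerant is a hypothetical helper name:

```python
import os
import tempfile

def read_source_tolerant(path):
    """Read a test file without ever raising UnicodeDecodeError:
    surrogateescape maps each invalid byte to a lone surrogate that
    round-trips exactly on re-encode."""
    with open(path, "r", encoding="utf-8", errors="surrogateescape") as f:
        return f.read()

# Demonstrate on a file containing the crashing byte from the issue.
raw = b"/// \xf7\ncontract C {}\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".sol") as tmp:
    tmp.write(raw)
    path = tmp.name
try:
    text = read_source_tolerant(path)
    # Byte 0xF7 became the surrogate U+DCF7 and re-encodes losslessly.
    assert "\udcf7" in text
    assert text.encode("utf-8", errors="surrogateescape") == raw
finally:
    os.remove(path)
```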

Contributor Author


I fixed this I hope. Please check :)

}
literal.complete();
if (!util::validateUTF8(m_skippedComments[NextNext].literal))
setError(ScannerError::InvalidUTF8InComment);
Collaborator


What about normal, non-Natspec comments? I think this change only makes these chars illegal in Natspec, but that's not enough. If we cannot have them in metadata then they must be banned everywhere because it is possible to get whole sources into metadata by compiling with --metadata-literal option.

Collaborator


The only other case I can think of are strings. I'm actually surprised that we do not validate those in the parser. We instead do that during analysis, in SyntaxChecker, but also with redundant checks in other places:

These checks at a glance look like they should have been asserts, it's possible that they've been added to plug corner cases that were discovered after the fact.

They can also be bypassed though if you use --stop-after parsing. The only thing that saves us here is that you cannot request the --metadata output if you use that option. I would not be surprised if there are other ways to bypass it I can't think of now. Overall, trying to check it in analysis feels like whack-a-mole. I think we should move all such checks to the parser and also have some sanity check that every file is either free of invalid UTF-8 or parsing ended with an error.

Collaborator


Actually, it's more complicated. We do check strings in the parser, but not unicode"" ones.

Still, I don't see a good reason not to do that for unicode"" strings as well if there are no situations where those would be valid anyway.

Contributor Author


I think I made it work now for illegal UTF-8 in normal comments, too. Please check. I also added tests for it.

Collaborator


We should also test this in Yul (and maybe also inline assembly). If invalid UTF-8 crashes JSON generation then an easy way to trigger that is to request Yul output in Standard JSON mode.

Contributor Author


I believe I have added tests for this too. Please check :)

except (UnicodeDecodeError, UnicodeEncodeError):
print(f"Warning: Test case in {f} contains invalid UTF-8 characters, skipping.")
continue

Contributor Author


@cameel I think I need help with how to do this better. Sorry :S

Collaborator

@cameel cameel Mar 16, 2026


Just to avoid stalling this PR too much, maybe let's restore your original hack and move this bit to a separate PR? I'd rather avoid that hack long-term, but it's fine to have it temporarily and it's independent of the bugfix.

Contributor Author

@msooseth msooseth Mar 17, 2026


Thanks. I went back to the old hack :) I had to rename a file that contained invalid UTF-8 so that the encoding issue is reflected in the filename, and then it's properly filtered out by the hack.

size_t invalidPos;
if (!util::validateUTF8(m_source.source().substr(startPosition, m_source.position() - startPosition), invalidPos))
{
m_tokens[NextNext].location.start = static_cast<int>(startPosition + invalidPos);
Member


This messes up rescan, which is called by setScannerMode after an assembly block. Try running this through solc and observe the output:

contract C { function f() public { assembly { let x := 1 }
/*�*/
} }

yields "Error: Invalid token." instead of the expected "Error: Invalid UTF-8 sequence in comment.". Same for single-line comments with //.



Development

Successfully merging this pull request may close these issues.

Crash on invalid UTF-8 bytes in Natspec