Fix crash on invalid UTF-8 bytes in NatSpec comments; also fix missing Token::Illegal handling in switch, giving misleading error to users #16520
Conversation
clonker left a comment:
Some minor adjustments: you can use the single-argument version of validateUTF8 if you're not using invalidPos anyway. And I think you can get away without using m_skippedComments[...].error as an intermediary if you set the error directly when it is encountered and then return Token::Illegal in scanSlash.
msooseth left a comment:
Thanks, very good points! I think the last one is better left as-is though?
I have also fixed the Python script for the error checking; please double-check that one... don't want to accidentally disable all tests or something :S
scripts/isolate_tests.py
Outdated
```diff
-if basename(f) == "invalid_utf8_sequence.sol":
-    continue  # ignore the test with broken utf-8 encoding
+if "invalid_utf8" in basename(f):
+    continue  # ignore tests with invalid utf-8 encoding
```
Please don't generalize this hack :)
The issue is in the splitting script — it should just handle invalid UTF-8 properly and never raise a UnicodeDecodeError (see #9710 (comment)). It's very low priority, but if it's getting in the way, we should just do a proper fix instead of making the hack more elaborate :)
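A proper fix along those lines could read the test files as bytes and decode with a tolerant error handler, so invalid UTF-8 never raises. A minimal sketch (hypothetical helper, not the actual `isolate_tests.py` code):

```python
from pathlib import Path

def read_source(path):
    """Read a test file without ever raising UnicodeDecodeError.

    Invalid bytes are mapped to lone surrogates via surrogateescape, so
    the content round-trips byte-for-byte when re-encoded with the same
    error handler. (Hypothetical helper, not the actual script code.)
    """
    return Path(path).read_bytes().decode("utf-8", errors="surrogateescape")
```

Writing extracted tests back with `errors="surrogateescape"` preserves the original bytes, so even the invalid-UTF-8 test cases survive the round trip instead of being skipped.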
I fixed this I hope. Please check :)
liblangutil/Scanner.cpp
Outdated
```diff
 }
 literal.complete();
+if (!util::validateUTF8(m_skippedComments[NextNext].literal))
+    setError(ScannerError::InvalidUTF8InComment);
```
What about normal, non-NatSpec comments? I think this change only makes these characters illegal in NatSpec, but that's not enough. If we cannot have them in metadata, then they must be banned everywhere, because it is possible to get whole sources into the metadata by compiling with the --metadata-literal option.
The only other case I can think of is strings. I'm actually surprised that we do not validate those in the parser. We instead do that during analysis, in SyntaxChecker, but also with redundant checks in other places:

- `SyntaxChecker::visit(Literal)`
- `StringLiteralType::isImplicitlyConvertibleTo()`
- `ASTJsonExporter::visit(Literal)`
- `AsmJsonConverter::operator()(Literal)`

These checks at a glance look like they should have been asserts; it's possible that they've been added to plug corner cases that were discovered after the fact.
They can also be bypassed if you use --stop-after parsing. The only thing that saves us here is that you cannot request the --metadata output if you use that option. I would not be surprised if there are other ways to bypass it that I can't think of right now. Overall, trying to check this during analysis feels like whack-a-mole. I think we should move all such checks into the parser and also add a sanity check that every file is either free of invalid UTF-8 or parsing ended with an error.
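The sanity check suggested above needs exactly the contract that validateUTF8 provides: a validity flag plus the offset of the first bad byte. A Python sketch of that contract (illustration only; the real check is `util::validateUTF8` in C++):

```python
def validate_utf8(data: bytes):
    """Return (True, None) for valid UTF-8, otherwise (False, offset of
    the first invalid byte). Python's strict decoder rejects the same
    inputs a standard-conforming validator does: stray continuation
    bytes, truncated sequences, overlong encodings, surrogates, and
    code points above U+10FFFF."""
    try:
        data.decode("utf-8")
        return True, None
    except UnicodeDecodeError as exc:
        return False, exc.start
```

A whole-file check of the "every file is valid or parsing errored" kind could then assert `validate_utf8(source)[0] or scanner_reported_error` after each run.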
Actually, it's more complicated. We do check strings in the parser, but not unicode"" ones.
Still, I don't see a good reason not to do that for unicode"" strings as well if there are no situations where those would be valid anyway.
I think I have now made it work for invalid UTF-8 in normal comments too. Please check. I also added tests for it.
We should also test this in Yul (and maybe also inline assembly). If invalid UTF-8 crashes JSON generation then an easy way to trigger that is to request Yul output in Standard JSON mode.
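A request exercising that path might look like the following; note that the invalid-UTF-8 source cannot be inlined as `"content"` (JSON strings must themselves be valid UTF-8), so it has to be referenced by URL. The file name here is made up:

```python
import json

# Standard JSON input requesting Yul ("ir") output; the source with the
# invalid UTF-8 comment is referenced by URL rather than inlined.
request = {
    "language": "Solidity",
    "sources": {"C.sol": {"urls": ["./natspec_invalid_utf8.sol"]}},
    "settings": {"outputSelection": {"*": {"*": ["ir"]}}},
}
print(json.dumps(request, indent=4))
```

Before the fix, a request like this would hit the crash while serializing the output JSON instead of returning a proper error entry.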
I believe I have added tests for this too. Please check :)
```diff
+except (UnicodeDecodeError, UnicodeEncodeError):
+    print(f"Warning: Test case in {f} contains invalid UTF-8 characters, skipping.")
+    continue
```
@cameel I think I need help with how to do this better. Sorry :S
Just to avoid stalling this PR too much, maybe let's restore your original hack and move this bit to a separate PR? I'd rather avoid that hack long-term, but it's fine to have it temporarily and it's independent of the bugfix.
Thanks. I went back to the old hack :) I had to rename a file so that the invalid UTF-8 is reflected in its name, and then it's properly filtered out by the hack.
```cpp
size_t invalidPos;
if (!util::validateUTF8(m_source.source().substr(startPosition, m_source.position() - startPosition), invalidPos))
{
	m_tokens[NextNext].location.start = static_cast<int>(startPosition + invalidPos);
```
This messes up the rescan that setScannerMode triggers after an assembly block. Try running this through solc and observe the output:

```solidity
contract C { function f() public { assembly { let x := 1 }
/*�*/
} }
```

It yields `Error: Invalid token.` instead of the expected `Error: Invalid UTF-8 sequence in comment.`. The same happens for single-line comments with `//`.
Fixes #16519

Problem

The compiler crashes with an unhandled `nlohmann::json::type_error` when a source file contains invalid UTF-8 bytes in a NatSpec comment (e.g. `/// \xF7`). The scanner copies raw source bytes into the NatSpec comment literal without UTF-8 validation. These bytes end up in the metadata JSON via `natspecUser`/`natspecDev`, and `jsonPrint`'s `dump(ensure_ascii=true)` then throws `type_error::316` on the invalid byte.
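The byte in question is genuinely malformed UTF-8: 0xF7 is the lead byte of a 4-byte sequence that is never completed. Python's strict decoder rejects it at the same offset a conforming validator would (a stand-in illustration, not the compiler's code path):

```python
raw = b"/// \xf7\n"  # NatSpec comment containing the invalid byte

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xF7 opens a 4-byte sequence but no continuation bytes follow.
    print(f"invalid byte 0x{raw[exc.start]:02x} at offset {exc.start}")
    # prints: invalid byte 0xf7 at offset 4
```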
A secondary issue: even after adding the scanner error, the compiler reported a misleading "Expected pragma, import directive..." message instead of the real error, because `Token::Illegal` fell into the `default:` branch of the top-level parser switch, which ignores `currentError()`.

Fix
Scanner (`liblangutil/Scanner.cpp`): after scanning each NatSpec comment literal, validate it using the existing `solidity::util::validateUTF8` from `libsolutil/UTF8.h` (which `liblangutil` already links against). If invalid UTF-8 is found, report the new `ScannerError::InvalidUTF8InComment` error.

- `scanSingleLineDocComment`: validate after `literal.complete()`; set `m_skippedComments[NextNext].error` if invalid. `scanSlash` checks this and propagates it as `Token::Illegal`.
- `scanMultiLineDocComment`: validate after `literal.complete()`; return `setError(ScannerError::InvalidUTF8InComment)` if invalid.

Parser (`libsolidity/parsing/Parser.cpp`): add an explicit `case Token::Illegal:` to the top-level source-unit switch that calls `fatalParserError` with `to_string(m_scanner->currentError())`, matching the pattern already used elsewhere in the parser.
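The shape of the parser-side fix, reduced to a Python sketch (names are illustrative; the real code is the C++ switch in `Parser.cpp`):

```python
def parse_source_unit_item(token, scanner_error):
    """Dispatch on the current token, surfacing the scanner's real
    error for Illegal tokens instead of the generic fallback message."""
    if token == "Illegal":
        # Previously this fell into the default branch below and the
        # real cause (e.g. invalid UTF-8 in a comment) was lost.
        raise SyntaxError(scanner_error)
    if token in ("pragma", "import", "contract"):
        return f"parsed {token}"
    raise SyntaxError("Expected pragma, import directive or contract definition.")
```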
Output
Output now:
Disclaimers
Written by Claude Code, reviewed by myself.