-
-
Notifications
You must be signed in to change notification settings - Fork 185
Fix UTF8 encoding of invalid character #3191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughReimplements UTF-8/UTF-16 core conversion and validation in CLR_RT_UnicodeHelper, adding explicit replacement (U+FFFD) on invalid input and robust forward/backward movement; adds surrogate-aware MatchString and refactors System.String.IndexOf to use it; plus defensive initializations and minor formatting/cleanup. Changes
Sequence Diagram(s)sequenceDiagram
participant App
participant Encoding as System.Text.Encoding.UTF8
participant Helper as CLR_RT_UnicodeHelper
participant StringLib as System.String
App->>Encoding: GetString(bytes, index, count)
Encoding->>Helper: ConvertFromUTF8(iMaxChars, fJustMove=false, iMaxBytes=count)
alt valid UTF-8
Helper-->>Encoding: UTF-16 chars
else invalid sequences
Helper-->>Encoding: emit U+FFFD for invalid bytes
end
Encoding-->>App: Managed String
App->>StringLib: IndexOf(searchUtf8, start, count)
StringLib->>Helper: create input iterator at candidate position
StringLib->>StringLib: MatchString(helperIter, searchUtf8, len)
alt match
StringLib-->>App: return position
else no match
StringLib-->>App: continue / return -1
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (3)
89-104
: Remove commented-out code.The commented code block is unnecessary and reduces readability.
else { // Advance over valid continuation bytes in invalid sequence - // int advance = 0; - // if (UTF8_VALID_CONTINUATION(pSrc[0])) { - // advance++; - // if (UTF8_VALID_CONTINUATION(pSrc[1])) { - // advance++; - // } - // } - // pSrc += advance; - // remaining -= advance; - pSrc += validCount; remaining -= validCount; }
277-286
: Remove commented-out code.Clean up the commented code for better readability.
int validCount = 0; if (UTF8_VALID_CONTINUATION(ch2)) validCount++; if (UTF8_VALID_CONTINUATION(ch3)) validCount++; if (validCount <2) { - //inputUTF8 += validCount; - //iMaxBytes -= validCount; goto invalid_sequence; }
329-334
: Remove commented-out code.Clean up the commented code.
if (validCount < 3) { // Not all continuation bytes are valid - //inputUTF8 += validCount; - //iMaxBytes -= validCount; goto invalid_sequence; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CLR/CorLib/corlib_native_System_String.cpp
(1 hunks)src/CLR/Core/CLR_RT_UnicodeHelper.cpp
(7 hunks)src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp
(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
src/CLR/CorLib/corlib_native_System_String.cpp (1)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (8)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ST_NUCLEO64_F091RC)
- GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ORGPAL_PALX)
- GitHub Check: nf-interpreter (Nightly build) (Build_STM32_targets ORGPAL_PALTHREE)
- GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
- GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (10)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (6)
10-22
: Improved UTF-8 validation macros.The new positive validation macros are clearer and follow UTF-8 specifications correctly.
156-192
: Correct UTF-16 byte counting with proper surrogate handling.The implementation correctly handles surrogate pairs and invalid sequences, outputting appropriate byte counts for each case.
410-486
: Well-implemented UTF-8 backward traversal.The rewritten function correctly identifies UTF-8 sequence boundaries when moving backward and properly validates sequences.
488-637
: Robust UTF-16 to UTF-8 conversion with proper error handling.The rewritten function correctly handles surrogate pairs, unpaired surrogates, and buffer management. The replacement character (U+FFFD) is properly encoded for invalid sequences.
734-742
: Proper error handling for UTF-8 conversion failure.Good addition of error handling that properly cleans up allocated memory on conversion failure.
754-756
: Fixed memory leak in Release method.Setting
m_wCharArray
to NULL after release prevents use-after-free issues.src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp (2)
158-158
: Correct use of actual UTF-8 character count.Using the actual character count from the UTF-8 string ensures accurate iteration limits for font rendering.
329-329
: Consistent character counting in width calculation.Good consistency with the StringOut method - both now use the accurate UTF-8 character count.
src/CLR/CorLib/corlib_native_System_String.cpp (2)
588-628
: New string matching helper looks correct.The
MatchString
function properly handles UTF-8 to UTF-16 conversion and surrogate pairs for accurate string comparison.
245-425
: Please verify the IndexOf helper implementation still existsI searched for the static helper declared in
src/CLR/CorLib/corlib_native.h
:
Header declares
static HRESULT IndexOf(CLR_RT_StackFrame &stack, int mode);
I could not locate its definition in any
.cpp
file undersrc/CLR/CorLib
Without that implementation, all the public
IndexOf
overloads would be broken.
Can you confirm whether the helper was moved or intentionally removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (3)
74-75
: Fix duplicate continuation byte check.The code checks
pSrc[0]
twice instead of checking both continuation bytes.- int validCount = 0; - if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++; - if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++; + int validCount = 0; + if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++; + if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++;
115-121
: Fix duplicate continuation byte checks for 4-byte sequences.Similar to the 3-byte case, this checks
pSrc[0]
multiple times instead of checking all three continuation bytes.- if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++; - if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++; - if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1]) && - UTF8_VALID_CONTINUATION(pSrc[2])) validCount++; + if (UTF8_VALID_CONTINUATION(pSrc[0])) validCount++; + if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1])) validCount++; + if (validCount == 2 && UTF8_VALID_CONTINUATION(pSrc[2])) validCount++;
324-327
: Fix duplicate continuation byte validation.The code checks
ch2
twice instead of checking different continuation bytes.- if (UTF8_VALID_CONTINUATION(ch2)) validCount++; - if (UTF8_VALID_CONTINUATION(ch2) && UTF8_VALID_CONTINUATION(ch3)) validCount++; - if (UTF8_VALID_CONTINUATION(ch2) &&UTF8_VALID_CONTINUATION(ch3) && - UTF8_VALID_CONTINUATION(ch4)) validCount++; + if (UTF8_VALID_CONTINUATION(ch2)) validCount++; + if (validCount == 1 && UTF8_VALID_CONTINUATION(ch3)) validCount++; + if (validCount == 2 && UTF8_VALID_CONTINUATION(ch4)) validCount++;
🧹 Nitpick comments (2)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (2)
31-31
: Consider potential integer overflow edge case.While
0x7fffffff
is a reasonable large value, consider usingINT_MAX
for consistency and clarity about the intent.- int remaining = (max < 0) ? 0x7fffffff : max; + int remaining = (max < 0) ? INT_MAX : max;
214-216
: Consider using INT_MAX for consistency.For better code clarity and consistency with standard practices.
- if (iMaxBytes == -1) { - iMaxBytes = INT_MAX; - } + if (iMaxBytes == -1) { + iMaxBytes = INT_MAX; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp
(7 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3144
File: azure-pipelines-templates/download-install-cmake.yml:1-3
Timestamp: 2025-04-07T14:39:17.549Z
Learning: In the nanoframework/nf-interpreter repository, Unix-style line endings (\n) are not mandatory, even for YAML files, despite static analysis tools like YAMLlint flagging them as errors.
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (10)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In CLR_RT_GarbageCollector::ValidateCluster, DataSize() validation is already handled by ValidateBlock() and other caller code, making additional size checks redundant.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
- GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (7)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7)
10-11
: LGTM: Improved UTF-8 validation macros.The change from negative to positive validation checks makes the code more explicit and readable.
173-185
: LGTM: Proper surrogate pair and invalid surrogate handling.The logic correctly handles valid surrogate pairs (4 bytes) and invalid/unpaired surrogates (3 bytes for replacement character).
374-387
: LGTM: Proper invalid sequence handling with replacement character.The use of U+FFFD (0xFFFD) as replacement character aligns with Unicode standards and .NET 8 behavior as stated in the PR objectives.
411-486
: LGTM: Robust backward UTF-8 navigation.The rewritten function properly handles UTF-8 sequence boundaries, validates continuation bytes, and ensures atomic character movement. The logic correctly treats invalid lead bytes as single characters.
534-608
: LGTM: Comprehensive surrogate pair handling.The code properly handles both valid surrogate pairs and unpaired surrogates by outputting replacement characters (U+FFFD) for invalid cases, aligning with Unicode standards.
734-742
: LGTM: Proper error handling in conversion.The function correctly resets the UTF-8 input before conversion and properly cleans up allocated memory on conversion failure, preventing memory leaks.
754-756
: LGTM: Consistent resource cleanup.The Release method properly nulls the pointer and resets the length, ensuring consistent state after cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Albert, this is looking great. Well done!
Mostly style comments in the review. Nothing personal, just trying to keep the consistency and code quality. 😅
src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (7)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (7)
33-33
: Clarify variable name and magic number.The variable name
remainingChar
could be more descriptive, and the magic number0x7fffffff
needs explanation.Consider renaming to
remainingChars
ormaxRemaining
and adding a comment explaining why0x7fffffff
is used as the default maximum.- int remainingChar = (max < 0) ? 0x7fffffff : max; + int remainingChars = (max < 0) ? 0x7fffffff : max; // Use max int when no limit specified
40-43
: Add explicit braces for consistency.Missing braces around the single-statement if block.
- if (ch == 0) + if (ch == 0) + { break; + }
122-122
: Add explicit braces for consistency.Missing braces around the single-statement if block.
- if (UTF8_VALID_CONTINUATION(pSrc[0])) + if (UTF8_VALID_CONTINUATION(pSrc[0])) + { validCount++; + }
273-276
: Add explicit braces for consistency.Missing braces around the goto statement.
- if (iMaxBytes < 1) + if (iMaxBytes < 1) + { goto invalid_sequence; + }
471-473
: Add explicit braces for consistency.Missing braces around the return statement.
- if (outputUTF16_size < 1) + if (outputUTF16_size < 1) + { return false; + }
78-87
: Fix flawed continuation byte validation logic.The 3-byte sequence validation has a critical flaw - it checks both
pSrc[0]
andpSrc[1]
independently but only incrementsvalidCount
once when both are valid. This doesn't properly track individual byte validity.Apply this fix to validate continuation bytes sequentially:
int validCount = 0; if (UTF8_VALID_CONTINUATION(pSrc[0])) { validCount++; } - if (UTF8_VALID_CONTINUATION(pSrc[0]) && UTF8_VALID_CONTINUATION(pSrc[1])) + if (validCount == 1 && UTF8_VALID_CONTINUATION(pSrc[1])) { validCount++; }
324-333
: Fix duplicate continuation byte validation in 3-byte sequences.Similar to the CountNumberOfCharacters function, this validation logic is flawed - it checks
ch2
andch3
independently rather than sequentially.Apply this fix to validate continuation bytes properly:
int validCount = 0; if (UTF8_VALID_CONTINUATION(ch2)) { validCount++; } - if (UTF8_VALID_CONTINUATION(ch3)) + if (validCount == 1 && UTF8_VALID_CONTINUATION(ch3)) { validCount++; }
🧹 Nitpick comments (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)
175-179
: Improve readability of counter logic.The counter decrement logic could be clearer about when max is decremented.
CLR_UINT16 ch = *pSrc++; - if (max > 0) - max--; + if (max > 0) + { + max--; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/CLR/CorLib/corlib_native_System_String.cpp
(5 hunks)src/CLR/Core/CLR_RT_UnicodeHelper.cpp
(9 hunks)src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/nanoFramework.Graphics/Graphics/Core/Support/Fonts/Font.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CLR/CorLib/corlib_native_System_String.cpp
🧰 Additional context used
🧠 Learnings (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (11)
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In nanoFramework's memory management code, DataSize() validation is comprehensively handled through CLR_RT_HeapCluster::ValidateBlock() and other caller code. Additional size checks in ValidateCluster() are redundant as the validation is already performed at multiple levels.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-09-23T17:56:37.570Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: frobijn
PR: nanoframework/nf-interpreter#3023
File: CMake/Modules/FindNF_NativeAssemblies.cmake:95-106
Timestamp: 2024-10-08T15:52:09.445Z
Learning: The `AddCorLibAssemblyVersion` macro in `CMake/Modules/FindNF_NativeAssemblies.cmake` does not require additional error handling for regex failures because the associated *.cpp files contain warnings against modification. Any changes to these files should be accompanied by corresponding updates to the CMake file.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3074
File: src/CLR/Core/GarbageCollector_Info.cpp:107-167
Timestamp: 2025-01-22T03:38:57.394Z
Learning: In CLR_RT_GarbageCollector::ValidateCluster, DataSize() validation is already handled by ValidateBlock() and other caller code, making additional size checks redundant.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3158
File: src/CLR/CorLib/corlib_native_System_Guid.cpp:29-30
Timestamp: 2025-04-29T11:34:00.204Z
Learning: For UUID version 4 (random) generation in .NET's byte order representation, the version bits should be set in byte 7 using `(buf[7] & 0x0F) | 0x40` and the variant bits should be set in byte 9 using `(buf[9] & 0x3F) | 0x80`. This differs from the standard RFC 4122 byte layout due to .NET's byte-swapping behavior.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-08T15:52:09.445Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: In `nanoCLR_GetNativeAssemblyInformation`, there is no need to return the number of bytes written, as the memory buffer is zeroed, making the string buffer terminated.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3172
File: src/CLR/Core/TypeSystem.cpp:971-999
Timestamp: 2025-05-14T17:26:54.181Z
Learning: In the nanoFramework codebase, when handling HRESULT values from function calls like `parser.Advance(elem)`, it's preferable to use `if (parser.Advance(elem) != S_OK)` for local error handling rather than using the `NANOCLR_CHECK_HRESULT` macro, which would immediately propagate errors up the call stack.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3190
File: src/CLR/Core/TypeSystem.cpp:0-0
Timestamp: 2025-06-26T09:16:55.184Z
Learning: In nanoFramework's CLR attribute parsing (src/CLR/Core/TypeSystem.cpp), the sentinel value 0xFFFF in string tokens represents a null string. When encountered, this should result in a true null reference (using SetObjectReference(nullptr)) rather than an empty string instance, and the boxing operation should be skipped via early return.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-09-25T11:28:38.536Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
Learnt from: josesimoes
PR: nanoframework/nf-interpreter#3023
File: targets/netcore/nanoFramework.nanoCLR/nanoCLR_native.cpp:191-225
Timestamp: 2024-10-12T19:00:39.000Z
Learning: When working with `nanoCLR_GetNativeAssemblyInformation`, fixed-size assembly names are required, so code that deals with variable-length names cannot be used.
🧬 Code Graph Analysis (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)
src/CLR/Include/nanoCLR_Runtime.h (4)
utf8StringStart
(940-940)string
(959-959)m_wCharArray
(960-963)m_length
(964-967)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
- GitHub Check: nf-interpreter (Nightly build) (Check_Build_Options)
- GitHub Check: nf-interpreter (Check_Build_Options)
🔇 Additional comments (9)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (9)
9-25
: LGTM: Well-defined UTF-8 validation macros.The explicit positive validation macros for ASCII and continuation bytes are clear and correct, replacing the previous negative checks.
203-206
: LGTM: Proper handling of unpaired low surrogates.The addition of explicit handling for unpaired low surrogates (counting them as 3 bytes for replacement character) is correct and aligns with Unicode standards.
448-466
: LGTM: Structured error handling for invalid sequences.The replacement of goto-based error handling with structured control flow and consistent U+FFFD output is a significant improvement over the previous implementation.
490-582
: LGTM: Robust backward UTF-8 navigation.The complete rewrite of
MoveBackwardInUTF8
with proper sequence boundary detection, validation, and boundary checking is a significant improvement. The logic correctly handles:
- UTF-8 sequence start detection
- Continuation byte validation
- Boundary checking
- Partial move handling
596-597
: LGTM: Proper embedded null handling.Removing the early break on null characters allows proper handling of embedded nulls in UTF-16 strings, which is correct behavior.
630-647
: LGTM: Correct unpaired surrogate handling.The explicit handling of unpaired high surrogates with U+FFFD replacement character output aligns with Unicode standards and .NET behavior.
687-703
: LGTM: Complete surrogate pair handling.The addition of explicit unpaired low surrogate handling completes the surrogate pair validation and ensures consistent U+FFFD output for invalid sequences.
829-837
: LGTM: Improved error handling in Assign method.The addition of input reset and proper cleanup on conversion failure prevents memory leaks and ensures consistent state.
849-851
: LGTM: Complete state reset in Release method.Setting both
m_wCharArray
to NULL andm_length
to 0 ensures complete object state reset.
@alberk8 please take a look at coderabbitai comments above. It may have found some edge cases... |
@josesimoes, Is there anything else do I need to do. Thanks. |
@alberk8 I'll be reviewing this later today. Apologies for the delay, I've been awfully busy with a deadline and the generics matter... |
Remove seqStart
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Automated fixes for code style.
Automated fixes for code style. # Conflicts: # src/CLR/Core/CLR_RT_UnicodeHelper.cpp
Automated fixes for code style.
Automated fixes for code style.
Automated fixes for code style.
…hm, it is helpful to have comments explaining the logic)
Automated fixes for code style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (3)
124-124
: Add explicit braces for consistency.Following the previous review feedback from @Ellerbach, please add explicit braces to this if statement.
- if (UTF8_VALID_CONTINUATION(pSrc[0])) + if (UTF8_VALID_CONTINUATION(pSrc[0])) + { validCount++; + }
279-279
: Add missing braces for consistency.Following the previous review feedback from @josesimoes, please add explicit braces to this if statement.
- if (iMaxBytes < 1) + if (iMaxBytes < 1) + { goto invalid_sequence; + }
505-505
: Add missing braces for consistency.Following the previous review feedback from @josesimoes, please add explicit braces to this if statement.
- else if ((leadByte & 0xF8) == 0xF0) + else if ((leadByte & 0xF8) == 0xF0) + { expectedLength = 4; // 4-byte sequence + }
🧹 Nitpick comments (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)
9-9
: Consider using a more descriptive constant name.The constant
MAX_INT
could be renamed to better reflect its purpose in this context, such asUNLIMITED_CHARACTER_COUNT
orMAX_CHARACTER_COUNT
.-#define MAX_INT 0x7fffffff +#define MAX_CHARACTER_COUNT 0x7fffffff
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/CLR/CorLib/corlib_native_System_String.cpp
(5 hunks)src/CLR/Core/CLR_RT_UnicodeHelper.cpp
(9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/CLR/CorLib/corlib_native_System_String.cpp
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (1)
src/CLR/Include/nanoCLR_Runtime.h (1)
HIGH_SURROGATE_START
(895-930)
🔇 Additional comments (8)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (8)
11-25
: LGTM! UTF-8 validation macros are well-designed.The new macros provide clear, explicit range checks and properly handle continuation byte validation. The separation of concerns between validation and loading operations improves code maintainability.
28-169
: Excellent UTF-8 character counting with proper validation.The implementation correctly handles all UTF-8 sequence types with proper validation and fallback behavior. The logic for counting supplementary characters as 2 characters (Line 145) correctly represents surrogate pair behavior in UTF-16. The partial validation handling (Lines 106-108, 154-156) properly advances over valid continuation bytes even in invalid sequences.
232-494
: Robust UTF-8 to UTF-16 conversion with proper error handling.The implementation correctly handles all UTF-8 sequence types with strict validation and proper replacement character output (0xFFFD). The use of lookahead validation (Lines 283, 329-330, 386-388) and the structured approach with
goto invalid_sequence
provides clear error handling flow. The 4-byte sequence handling correctly produces surrogate pairs for valid supplementary characters.Key improvements:
- Proper boundary checking before accessing continuation bytes
- Strict range validation for overlong sequences
- Correct surrogate pair generation for 4-byte sequences
- Consistent replacement character output for all invalid sequences
501-593
: Well-implemented backward UTF-8 navigation.The implementation correctly handles the complex task of moving backward in UTF-8 strings by:
- Properly identifying sequence boundaries using continuation byte patterns
- Validating complete sequences before accepting them
- Handling edge cases like invalid lead bytes and incomplete sequences
- Returning appropriate success/failure status based on whether the full character count was moved
The logic for treating invalid lead bytes as single characters (Lines 548-550) provides robust fallback behavior.
595-743
: Comprehensive UTF-16 to UTF-8 conversion with surrogate handling.The implementation correctly handles:
- All BMP characters with proper 1-3 byte UTF-8 encoding
- Valid surrogate pairs producing 4-byte UTF-8 sequences
- Unpaired surrogates generating replacement characters (0xEF, 0xBF, 0xBD)
- Embedded nulls (Line 607 comment confirms intentional handling)
The removal of null termination on embedded nulls and the consistent replacement character output for invalid surrogates aligns well with the overall error handling strategy.
840-848
: Good defensive programming in conversion error handling.The implementation properly handles conversion failures by cleaning up allocated memory and resetting state before returning an error. The reset of input UTF-8 (Line 840) ensures consistent state for the conversion operation.
860-862
: Consistent state cleanup in Release method.The explicit null assignment and length reset provide proper cleanup, ensuring the object is in a valid state after release.
171-218
: Surrogate handling is consistent across the codebaseWe verify that in CountNumberOfBytes unpaired surrogates add 3 bytes, and in CLR_RT_UnicodeHelper.cpp both unpaired high and low surrogates are encoded as U+FFFD (0xEF 0xBF 0xBD), a 3-byte sequence:
- CountNumberOfBytes:
num += 3
for any lone surrogate- UTF-16→UTF-8 conversion: emits
0xEF, 0xBF, 0xBD
and decrements buffer by 3No further changes are needed here.
…9-39d7-4334-82aa-80740b54925a' into albert-main
Description
Motivation and Context
How Has This Been Tested?
Screenshots
Types of changes
Checklist
Summary by CodeRabbit
Bug Fixes
Refactor