Fix ISO Latin-1 decoder sign extension #660

Merged
swebb2066 merged 2 commits into apache:master from metsw24-max:isolatin-decoder-signext
May 12, 2026
@metsw24-max
Contributor

Fix sign-extension corruption in ISOLatinCharsetDecoder::decode when decoding bytes >= 0x80.

charsetdecoder.cpp previously converted input bytes using:

static_cast<unsigned int>(*src)

Because char is signed by default on common MSVC/GCC/Clang x86/x64 builds, bytes in the range 0x80..0xFF were sign-extended to 0xFFFFFFxx. This caused Transcoder::encode to treat them as invalid Unicode values, producing U+FFFD on UTF-8 builds (or invalid wchar_t values on WCHAR builds).

The fix casts through unsigned char first, matching the unsigned-byte handling already used elsewhere in the codebase.
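To illustrate the difference between the two casts, here is a minimal sketch. The helper names widenDirect and widenViaUnsignedChar are hypothetical and are not part of log4cxx. The parameter is declared as signed char so the result does not depend on the platform's default char signedness, and the 0xFFFFFFxx value assumes a 32-bit unsigned int.

```cpp
// Hypothetical helpers for illustration only; not the log4cxx source.

// Mirrors the old pattern: a signed byte is converted straight to
// unsigned int, so negative values wrap to large numbers
// (0xFFFFFFxx when unsigned int is 32 bits wide).
inline unsigned int widenDirect(signed char byte)
{
    return static_cast<unsigned int>(byte);
}

// Mirrors the fix: reinterpret the byte as unsigned char first,
// which keeps the result in the range 0x00..0xFF.
inline unsigned int widenViaUnsignedChar(signed char byte)
{
    return static_cast<unsigned int>(static_cast<unsigned char>(byte));
}
```

For the Latin-1 byte 0xE9 (which is -23 as a signed char), widenDirect yields 0xFFFFFFE9 under the 32-bit assumption, while widenViaUnsignedChar yields 0xE9, the code point U+00E9.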

Impact

properties.cpp selects this decoder for Java .properties parsing, so any Latin-1 characters above ASCII in log4cxx configuration files could be silently corrupted. This affects cases such as:

  • accented file paths
  • logger names
  • layout patterns
  • localized configuration text

Changes

  • Fix signed-char sign extension in ISOLatinCharsetDecoder::decode
  • Add testISOLatinHighBytes
  • Verify round-trip behavior for all bytes 0x80..0xFF
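The kind of check that a high-byte test performs can be sketched as follows. This is a standalone approximation and not the actual testISOLatinHighBytes code: decodeLatin1 and highBytesDecodeCleanly are hypothetical names, and the sketch decodes to char32_t code points rather than going through the log4cxx decoder and Transcoder classes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a Latin-1 decoder: each input byte maps to
// the Unicode code point with the same numeric value.
std::u32string decodeLatin1(const std::string& bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (char c : bytes)
    {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
    }
    return out;
}

// Returns true when every byte in 0x80..0xFF decodes to the code point
// of the same value (U+0080..U+00FF).
bool highBytesDecodeCleanly()
{
    std::string input;
    for (int b = 0x80; b <= 0xFF; ++b)
    {
        input.push_back(static_cast<char>(b));
    }

    const std::u32string decoded = decodeLatin1(input);
    if (decoded.size() != input.size())
    {
        return false;
    }
    for (std::size_t i = 0; i < decoded.size(); ++i)
    {
        if (decoded[i] != static_cast<char32_t>(0x80 + i))
        {
            return false;
        }
    }
    return true;
}
```

Removing the inner unsigned char cast in decodeLatin1 makes the check fail on platforms where char is signed, which is the regression the new test is meant to catch.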

@swebb2066 swebb2066 merged commit 0e24ad2 into apache:master May 12, 2026
17 checks passed