fix: add UTF-16 encoding detection and conversion to prevent assertion failures #4347

gaborbernat · 2025-11-13T07:39:43Z

Universal Ctags crashed with an assertion failure in `vStringPutImpl()` when encountering UTF-16 encoded files. The assertion `c >= 0 && c <= 0xff` failed because ctags expects every character to fit in a single byte, but UTF-16 files contain multi-byte sequences that violate this assumption.

This fix adds:

- Detection of UTF-16 BOM (both LE and BE) in file reading
- Automatic conversion from UTF-16 to UTF-8 using iconv when UTF-16 is detected
- Force memory stream processing for UTF-16 files to enable conversion
- Test cases for both UTF-16 LE and BE files
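
The BOM check in the list above can be sketched as follows. This is a minimal illustration and not the code from this PR: the names `Utf16Bom` and `detectUtf16Bom` are hypothetical, and only the two byte patterns (FF FE for LE, FE FF for BE) are taken from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the identifiers used in ctags. */
enum Utf16Bom { BOM_NONE = 0, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark. */
static enum Utf16Bom detectUtf16Bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len < 2)
        return BOM_NONE;            /* too short to carry a BOM */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;        /* FF FE: little endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;        /* FE FF: big endian */
    return BOM_NONE;
}
```

Note that FF FE 00 00 is also the UTF-32 LE BOM, so a check based on two bytes alone cannot tell the two encodings apart; the sketch ignores that case.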

Resolves issue #4342

Signed-off-by: Bernát Gábor [email protected]


codecov · 2025-11-13T08:43:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.88%. Comparing base (d48558f) to head (fa7d3e7).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4347   +/-   ##
=======================================
  Coverage   85.87%   85.88%           
=======================================
  Files         252      252           
  Lines       62597    62631   +34     
=======================================
+ Hits        53755    53789   +34     
  Misses       8842     8842


Enables test execution for existing UTF-16 test files by adding the required args.ctags configuration file. This ensures the UTF-16 LE and UTF-16 BE files are processed during test runs, improving code coverage for the UTF-16 to UTF-8 conversion functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Adds test for UTF-16 conversion failure path using malformed UTF-16 data with invalid surrogate sequences. This triggers the `iconv()` failure path and tests the fallback mechanism that preserves original data when UTF-16 to UTF-8 conversion fails.

This ensures 100% coverage of the UTF-16 conversion error handling code including the `eFree(converted_data)` cleanup logic.

Signed-off-by: Bernát Gábor <[email protected]>
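
The conversion and its failure fallback can be sketched roughly as below. This is an assumption-laden illustration, not the PR's implementation: `utf16ToUtf8` is a hypothetical helper, it uses plain `malloc`/`free` rather than the ctags allocation wrappers, and it assumes a POSIX `iconv` that accepts the encoding names "UTF-16LE", "UTF-16BE" and "UTF-8".

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a UTF-16 payload (BOM already skipped) to a
 * NUL-terminated UTF-8 buffer. Returns NULL on any failure so the caller
 * can keep using the original, unconverted data. */
static char *utf16ToUtf8(const char *src, size_t srcLen, int bigEndian,
                         size_t *outLen)
{
    assert(src != NULL && outLen != NULL);

    iconv_t cd = iconv_open("UTF-8", bigEndian ? "UTF-16BE" : "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* Worst case: every 2-byte code unit becomes 3 bytes of UTF-8. */
    size_t cap = (srcLen / 2) * 3 + 1;
    char *dst = malloc(cap);
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;   /* iconv takes a non-const pointer */
    size_t inLeft = srcLen;
    char *out = dst;
    size_t outLeft = cap - 1; /* reserve room for the terminating NUL */

    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inLeft != 0) {
        free(dst);            /* release the partial result on failure */
        return NULL;
    }

    *outLen = (size_t)(out - dst);
    dst[*outLen] = '\0';
    return dst;
}
```

A caller would typically replace its input buffer only when the helper returns a non-NULL pointer, which mirrors the "preserve original data" fallback described in the commit message.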

Adds specific test for UTF-16 Big Endian BOM detection (FE FF) to ensure complete coverage of line 899: `(bom[0] == 0xFE && bom[1] == 0xFF)`.

This test completes 100% coverage of all UTF-16 BOM detection paths including both LE (FF FE) and BE (FE FF) byte order markers.

Signed-off-by: Bernát Gábor <[email protected]>

gaborbernat · 2025-11-26T22:44:32Z

@masatake any updates on this?

masatake · 2025-11-27T10:39:10Z

Sorry for the late response. I will work on this request next.

gaborbernat · 2025-11-27T15:21:13Z

Ideally you can just review and accept this PR. Anything wrong with the solution in it? 🤔

masatake · 2025-11-27T19:30:13Z

The change for `getMioFull()` is excellent.
Could you describe this change in docs/news/HEAD.rst?

This change requires a new section like:

Bug fixes
-----------------------------

I need time to think about the new test cases.
I struggled with this once in #4268 but burned out.
Now is the time to focus on the topic again: what we should do with .gitattributes.

gaborbernat force-pushed the fix-utf16-encoding-crash branch from 55764aa to c041872 Compare November 13, 2025 08:18

gaborbernat and others added 3 commits November 13, 2025 06:45

gaborbernat mentioned this pull request Nov 17, 2025

fix(cxx): handle complex template specializations in scope stack #4348

Closed

masatake self-assigned this Nov 26, 2025
