refactor(es/minifier): Improve tpl to str #11415

cuyl · 2026-01-03T16:15:02Z

Description:

Enhance Pure::convert_tpl_to_str to prefer cooked values as much as possible.
Improve Str::from_tpl_raw to return Wtf8Atom instead of Atom.
Add more test cases.

changeset-bot · 2026-01-03T16:15:07Z

⚠️ No Changeset found

Latest commit: 2a37002

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


github-actions · 2026-01-03T16:23:53Z

Binary Sizes

File	Size
`swc.linux-x64-gnu.node`	28M (28446152 bytes)

Commit: b56885d

codspeed-hq · 2026-01-03T16:27:30Z

Merging this PR will not alter performance

Summary

✅ 185 untouched benchmarks

Comparing cuyl:refact-convert-tpl-to-str (2a37002) with main (dd007c6)

cuyl · 2026-01-04T01:58:40Z

I'm not sure why we have force_str_for_tpl here. I can see two definitions:

Minification::force_str_for_tpl returns false
Eval::force_str_for_tpl returns true

convert_str_value_to_tpl_cooked contains some encoding logic. Why do we need it to convert Str::value to Tpl::cooked? Aren't they the same thing?
Can I just remove convert_str_value_to_tpl_cooked?
@kdy1

kdy1 · 2026-01-05T04:06:42Z

Does CI pass if you remove it?

cuyl · 2026-01-05T04:08:37Z

I have changed some expected outputs here, but the changes do not look like bugs.
@kdy1

kdy1 · 2026-01-05T05:05:38Z

@claude Review this PR. My concerns are

the correctness of compress_array_join_as_tpl in crates/swc_ecma_minifier/src/compress/pure/misc.rs
(as many as) edge cases you can think of, based on the code changes

claude · 2026-01-05T05:06:03Z

Claude encountered an error

Command failed: git fetch origin --depth=20 refact-convert-tpl-to-str

I'll analyze this and get back to you.

kdy1 · 2026-01-05T05:09:36Z

@claude Review this PR. My concerns are

the correctness of compress_array_join_as_tpl in crates/swc_ecma_minifier/src/compress/pure/misc.rs
(as many as) edge cases you can think of, based on the code changes

You should do gh pr checkout 11415 to checkout the PR branch

claude · 2026-01-05T05:10:02Z

Claude encountered an error

Command failed: git fetch origin --depth=20 refact-convert-tpl-to-str

I'll analyze this and get back to you.

kdy1

Review Summary

I have identified several critical issues that need to be addressed:

CRITICAL: Compilation error in misc.rs line 1429 - calls removed function
HIGH: Surrogate pair handling logic has multiple bugs in lit.rs
MEDIUM: Error handling uses panic instead of graceful degradation
LOW: Missing test coverage for edge cases

Detailed comments are provided on each file. Please address the critical and high priority issues before merging.

crates/swc_ecma_minifier/src/compress/pure/misc.rs

crates/swc_ecma_ast/src/lit.rs

crates/swc_ecma_minifier/tests/eval.rs

kdy1 · 2026-01-05T05:21:28Z

Correction on my first comment:

You are absolutely right that my comment about line 1429 in misc.rs was incorrect. Looking at the diff more carefully, the change on that line is:

- cur_cooked.push_wtf8(&convert_str_value_to_tpl_cooked(&s.value));
+ cur_cooked.push_wtf8(&Cow::Borrowed(&s.value));

This shows you already fixed this correctly in the PR - the new version uses Cow::Borrowed. I misread the diff and thought the old function was still being called. My apologies for the confusion!

The other comments about surrogate pair handling, iterator advancement, and panic usage remain valid concerns for the from_tpl_raw implementation in lit.rs.

cuyl · 2026-01-05T05:43:22Z

I assumed the panic was unreachable code. I will add error handling later.

Copilot

Pull request overview

This PR refactors the template literal to string conversion logic in the minifier to improve Unicode handling and simplify the code. The main improvements include a comprehensive rewrite of the from_tpl_raw function with proper Unicode escape sequence parsing and surrogate pair handling.

Key changes:

Rewrote Str::from_tpl_raw() to properly handle Unicode escape sequences (\u, \x, \u{...}) and UTF-16 surrogate pairs
Simplified template-to-string conversion logic by removing the convert_str_value_to_tpl_cooked function and directly using string values
Added comprehensive test cases for Unicode character handling including emojis and surrogate pairs

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
crates/swc_ecma_ast/src/lit.rs	Completely rewrote `from_tpl_raw()` function to handle Unicode escapes, surrogate pairs, and various escape sequences; added error emission helper
crates/swc_ecma_minifier/src/compress/pure/strings.rs	Simplified template-to-string conversion by removing `convert_str_value_to_tpl_cooked()` and updating calls to use the improved `from_tpl_raw()`
crates/swc_ecma_minifier/src/compress/pure/misc.rs	Updated imports to remove unused `convert_str_value_to_tpl_cooked` and simplified template cooked string handling
crates/swc_ecma_minifier/tests/eval.rs	Added test cases for emoji and Unicode escape sequence handling in template literals
crates/swc_ecma_minifier/tests/terser/compress/template_string/special_chars_in_string/output.js	Updated expected output to reflect improved template-to-string conversion
crates/swc_ecma_minifier/tests/terser/compress/template_string/allow_null_character/output.js	Updated expected output for null character handling in templates
crates/swc/tests/fixture/issues-8xxx/8496/output/8.js	Updated expected output for hex escape sequences in templates

Comments suppressed due to low confidence (1)

crates/swc_ecma_ast/src/lit.rs:446

Potential bug: If the template string ends with a high surrogate stored in prev_result, it will not be appended to the buffer. After the loop completes at line 444, there should be a check to append any pending high surrogate from prev_result before returning the buffer at line 446.

        }

        buf.into()
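The concern about a trailing high surrogate can be illustrated outside the PR code. The following is a minimal sketch, not the `from_tpl_raw` implementation: it uses the standard library's `std::char::decode_utf16`, and the helper name `decode_units_keep_lone` is hypothetical. It shows one way to make sure an unpaired surrogate, including one at the very end of the input, is kept rather than silently dropped.

```rust
/// Decode UTF-16 code units into code point values.
/// Properly paired surrogates are combined; unpaired surrogates
/// (including a trailing high surrogate) are passed through unchanged.
fn decode_units_keep_lone(units: &[u16]) -> Vec<u32> {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| match r {
            // A valid scalar value (possibly built from a surrogate pair).
            Ok(c) => c as u32,
            // A lone surrogate: keep its raw value instead of discarding it.
            Err(e) => e.unpaired_surrogate() as u32,
        })
        .collect()
}
```

With this helper, the input `[0xD83D, 0xDE00, 0xD800]` yields `[0x1F600, 0xD800]`: the pair is combined and the dangling high surrogate still appears in the output.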


crates/swc_ecma_ast/src/lit.rs

kdy1

Code Review Summary

This is a good refactoring that improves Unicode handling and simplifies the code. However, I've identified one critical bug in the surrogate pair decoding formula and several other issues that should be addressed.

Critical Issues:

⚠️ Incorrect surrogate pair decoding formula at lit.rs:384-385
⚠️ Overly permissive character acceptance in convert_tpl_to_str at strings.rs:228-231
⚠️ Unsafe code needs better validation at lit.rs:398-400

Other Issues:

Test structure issue in lit.rs:104-110
Misleading comment at lit.rs:383

I'll add detailed comments on specific lines. Please review and address these concerns before merging.

Positive Points:

✅ Comprehensive test coverage
✅ Proper error handling with emit_span_error
✅ Good use of WTF-8 to preserve invalid Unicode
✅ Code simplification by removing unnecessary conversions

kdy1 · 2026-01-07T02:26:27Z

1. CRITICAL - Incorrect Surrogate Pair Decoding Formula

Location: crates/swc_ecma_ast/src/lit.rs:384-385

The bitwise operation is incorrect:

result = 0x10000 | ((result & 0x3ff) | ((prev_result & 0x3ff) << 10));

Problem: Using bitwise OR (|) instead of addition (+) will produce incorrect results for certain surrogate pairs where the bits overlap.

Should be:

result = 0x10000 + ((prev_result & 0x3ff) << 10) + (result & 0x3ff);

This is the standard UTF-16 surrogate pair decoding formula. The current code may work for some cases due to bit alignment, but will fail for others.

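Example: \uD800\uDC00 decodes to U+10000 (65536) with either formula, but \uD840\uDC00 should decode to U+20000, and the OR-based formula yields U+10000 because the 0x10000 bit is already set in the combined 20-bit value.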
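To make the difference concrete, here is a small standalone sketch comparing the two expressions. The function names are hypothetical and this is not the code in lit.rs; it only reproduces the arithmetic quoted in this comment.

```rust
/// Standard UTF-16 surrogate pair decoding, using addition for the offset.
fn decode_pair_add(high: u32, low: u32) -> u32 {
    0x10000 + ((high & 0x3ff) << 10) + (low & 0x3ff)
}

/// Same shape as the expression quoted in the review, using bitwise OR
/// to apply the 0x10000 offset.
fn decode_pair_or(high: u32, low: u32) -> u32 {
    0x10000 | ((low & 0x3ff) | ((high & 0x3ff) << 10))
}
```

The two agree whenever bit 16 of the combined 20-bit value is clear (for example `0xD83D, 0xDE00` gives `0x1F600` in both), and diverge once it is set: `decode_pair_add(0xD840, 0xDC00)` is `0x20000`, while `decode_pair_or(0xD840, 0xDC00)` is `0x10000`.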

kdy1 · 2026-01-07T02:26:37Z

2. MEDIUM - Overly Permissive Character Acceptance

Location: crates/swc_ecma_minifier/src/compress/pure/strings.rs:228-231

The new logic accepts ALL characters when force_str_for_tpl is false:

if c.chars().all(|c| match c {
    '\n' | '\r' => self.config.force_str_for_tpl,
    _ => true,  // ⚠️ This accepts everything!
}) {

Problem: This is less conservative than the old code which only accepted ASCII printable characters:

// Old code
'\u{0020}'..='\u{007e}' => true,
_ => false,

Impact: May convert template literals with special characters to strings when they shouldn't be, potentially breaking code with unusual Unicode characters or control characters.

Recommendation: Review if this change is intentional. If so, document why it's safe. Otherwise, restore the ASCII printable check or add proper validation.
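As a rough way to compare the two checks, the sketch below restates them as standalone predicates. It assumes the two quoted arms are the whole of the old match, which may not be true of the real code, and the function names are hypothetical.

```rust
/// Old check as quoted above: only printable ASCII is accepted.
/// Assumption: the real match had no other arms.
fn old_accepts(s: &str) -> bool {
    s.chars().all(|c| matches!(c, '\u{0020}'..='\u{007e}'))
}

/// New check as quoted above: every character is accepted except
/// '\n' and '\r', which are accepted only when `force_str_for_tpl` is set.
fn new_accepts(s: &str, force_str_for_tpl: bool) -> bool {
    s.chars().all(|c| match c {
        '\n' | '\r' => force_str_for_tpl,
        _ => true,
    })
}
```

Under these assumptions, `"é"` is rejected by the old predicate and accepted by the new one regardless of the flag, while `"a\nb"` is accepted by the new predicate only when `force_str_for_tpl` is true.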

kdy1 · 2026-01-07T02:26:47Z

3. MEDIUM - Unsafe Code Needs Better Validation

Location: crates/swc_ecma_ast/src/lit.rs:398-400

if result <= 0x10ffff {
    // We can be sure result is a valid code point here
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
}

Problem: The check result <= 0x10ffff alone doesn't guarantee a valid code point. The surrogate range (0xD800-0xDFFF) is also invalid for direct code points.

Should be:

if result <= 0x10ffff && !(0xd800..=0xdfff).contains(&result) {
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
} else {
    emit_span_error(span, "Uncaught SyntaxError: Undefined Unicode code-point");
}

Or handle the surrogate case separately before reaching this code. Please verify the safety invariants are correct.

kdy1 · 2026-01-07T02:26:57Z

4. LOW - Test Structure Issue

Location: crates/swc_ecma_ast/tests/lit.rs:104-110

#[test]
#[should_panic]
fn should_panic() {
    test_from_tpl_raw("\\01", "");
    test_from_tpl_raw("\\2", "");   // ⚠️ Never reached
    test_from_tpl_raw("\\7", "");   // ⚠️ Never reached
}

Problem: Only the first assertion will run because the function panics and exits. The other two assertions are never executed.

Should be:

#[test]
#[should_panic]
fn should_panic_octal_01() {
    test_from_tpl_raw("\\01", "");
}

#[test]
#[should_panic]
fn should_panic_octal_2() {
    test_from_tpl_raw("\\2", "");
}

#[test]
#[should_panic]
fn should_panic_octal_7() {
    test_from_tpl_raw("\\7", "");
}

This ensures all three octal cases are actually tested.

kdy1 · 2026-01-07T02:27:07Z

5. LOW - Misleading Comment

Location: crates/swc_ecma_ast/src/lit.rs:383

// Low surrogate pair
// result is less than 0x1ffff here

Problem: The comment is misleading. At this point, result is a low surrogate in the range 0xDC00-0xDFFF, not "less than 0x1ffff".

After the calculation on line 384-385, the final result would be in the range 0x10000-0x10FFFF (supplementary planes), but the comment appears before that calculation.

Suggested fix:

// Low surrogate pair (0xDC00-0xDFFF)
// Decode to supplementary plane code point (0x10000-0x10FFFF)

kdy1 · 2026-01-07T02:27:19Z

✅ Positive Aspects of This PR

Despite the issues noted above, this PR has many strengths:

Excellent Test Coverage: The 11 test cases cover a wide range of edge cases including emoji, surrogate pairs, hex escapes, and invalid sequences. This is exactly what's needed for Unicode handling code.
Proper Error Handling: Using emit_span_error instead of panicking is the right approach for a production minifier. This allows the tool to report errors gracefully without crashing.
WTF-8 Support: Returning Wtf8Atom instead of Atom is crucial for preserving invalid Unicode sequences, which is important for minification correctness.
Code Simplification: Removing the unnecessary convert_str_value_to_tpl_cooked function and using Cow::Borrowed directly eliminates redundant conversions and improves performance.
Comprehensive Escape Sequence Handling: The implementation handles line continuations (\r, \n), Unicode escapes (\u, \x), and octal validation correctly.

Once the critical issues (especially #1) are addressed, this will be a solid improvement to the minifier. Great work on tackling this complex Unicode handling! 🦀

kdy1 · 2026-01-07T02:27:32Z

Recommended Additional Testing

After fixing the issues, please add these test cases to verify correctness:

1. Test for surrogate pair decoding (to catch issue #1):

#[test]
fn surrogate_pair_boundary_cases() {
    // First supplementary plane character
    test_from_tpl_raw("\\uD800\\uDC00", "\u{10000}");
    
    // Last valid character
    test_from_tpl_raw("\\uDBFF\\uDFFF", "\u{10FFFF}");
    
    // Various combinations to ensure formula is correct
    test_from_tpl_raw("\\uD801\\uDC37", "\u{10437}"); // 𐐷
    test_from_tpl_raw("\\uD852\\uDF62", "\u{24B62}"); // 𤭢
}

2. Test for special character handling (related to issue #2):

#[test]
fn special_chars_in_tpl() {
    // Test that control characters are handled correctly
    let result = minify("`\\u0000\\u0001\\u001f`");
    // Verify behavior matches expectations
}

3. Run execution tests:
According to crates/swc_ecma_minifier/AGENTS.md, you should run:

./scripts/exec.sh

This will verify the changes don't break actual JavaScript execution.

cuyl · 2026-01-07T02:49:09Z

3. MEDIUM - Unsafe Code Needs Better Validation

Location: crates/swc_ecma_ast/src/lit.rs:398-400
if result <= 0x10ffff {
    // We can be sure result is a valid code point here
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
}
Problem: The check result <= 0x10ffff alone doesn't guarantee a valid code point. The surrogate range (0xD800-0xDFFF) is also invalid for direct code points.

Should be:
if result <= 0x10ffff && !(0xd800..=0xdfff).contains(&result) {
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
} else {
    emit_span_error(span, "Uncaught SyntaxError: Undefined Unicode code-point");
}
Or handle the surrogate case separately before reaching this code. Please verify the safety invariants are correct.

According to CodePoint::from_u32, any result <= 0x10ffff is in the valid code point range:

swc/crates/hstr/src/wtf8/mod.rs

Lines 80 to 85 in 9632cd1

    
pub const fn from_u32(value: u32) -> Option<CodePoint> {
    match value {
        0..=0x10ffff => Some(CodePoint { value }),
        _ => None,
    }
}

Also, the case (0xd800..=0xdfff).contains(&result) has already been handled earlier, in crates/swc_ecma_ast/src/lit.rs:362-391.
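The distinction being discussed can be sketched in isolation. The `CodePoint` type below is a hypothetical stand-in that only mirrors the quoted `from_u32` snippet; it is not the hstr type. It contrasts a WTF-8 style code point, which admits surrogates, with Rust's `char`, which does not.

```rust
/// Hypothetical stand-in mirroring the quoted `from_u32` snippet:
/// any value up to 0x10ffff is accepted, surrogates included.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CodePoint {
    value: u32,
}

impl CodePoint {
    const fn from_u32(value: u32) -> Option<CodePoint> {
        match value {
            0..=0x10ffff => Some(CodePoint { value }),
            _ => None,
        }
    }

    /// Raw numeric value of the code point.
    const fn to_u32(self) -> u32 {
        self.value
    }
}

/// True when `value` is acceptable as a code point in this sketch
/// but is not a Unicode scalar value (that is, it is a surrogate).
fn is_surrogate_code_point(value: u32) -> bool {
    CodePoint::from_u32(value).is_some() && char::from_u32(value).is_none()
}
```

So `0xD800` passes `CodePoint::from_u32` but fails `char::from_u32`, which is why the extra surrogate check would matter for `char` but not for a type with this range.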

cuyl · 2026-01-07T02:55:28Z

I will add the edge-case tests.

cuyl · 2026-01-07T03:17:47Z

Thanks, I have already applied some changes.

2. MEDIUM - Overly Permissive Character Acceptance

Location: crates/swc_ecma_minifier/src/compress/pure/strings.rs:228-231

The new logic accepts ALL characters when force_str_for_tpl is false:
if c.chars().all(|c| match c {
    '\n' | '\r' => self.config.force_str_for_tpl,
    _ => true,  // ⚠️ This accepts everything!
}) {
Problem: This is less conservative than the old code which only accepted ASCII printable characters:
// Old code
'\u{0020}'..='\u{007e}' => true,
_ => false,
Impact: May convert template literals with special characters to strings when they shouldn't be, potentially breaking code with unusual Unicode characters or control characters.

Recommendation: Review if this change is intentional. If so, document why it's safe. Otherwise, restore the ASCII printable check or add proper validation.

I think the new logic accepts ALL characters only when force_str_for_tpl is true; when it is false, '\n' and '\r' are still rejected.
Again, I don't really know the reason for the old code, or why we have force_str_for_tpl. Do you have any opinion on this?
@kdy1

kdy1 · 2026-01-07T03:20:07Z

IIRC, force_str_for_tpl was necessary for Evaluator mainly because the evaluator didn't work well with Expr::Tpl. But as #11411 is merged, I don't think force_str_for_tpl is required anymore

cuyl · 2026-01-07T03:25:54Z

So, can I try to remove force_str_for_tpl in this PR?

kdy1 · 2026-01-07T03:52:45Z

Yep, that would be nice. Thank you!

cuyl · 2026-01-07T06:37:21Z

It seems Optimizer::reduce_escaped_newline_for_str_lit is the inverse operation of Pure::convert_tpl_to_str, so I added some code to prevent an infinite loop.

cuyl · 2026-01-08T02:01:20Z

Removing force_str_for_tpl is complicated, so I will open another PR for it instead.

kdy1 · 2026-01-08T02:04:12Z

I see, it sounds good. Thanks!

cuyl force-pushed the refact-convert-tpl-to-str branch from 7eb2654 to e65312b Compare January 4, 2026 00:03

refactor(es/minifier): Improve tpl to str

7c9954b

cuyl force-pushed the refact-convert-tpl-to-str branch from e65312b to 7c9954b Compare January 4, 2026 01:02

cuyl force-pushed the refact-convert-tpl-to-str branch from 400c7bb to b94f214 Compare January 5, 2026 02:56

remove convert_str_value_to_tpl_cooked

f5de72d

cuyl force-pushed the refact-convert-tpl-to-str branch from b94f214 to f5de72d Compare January 5, 2026 03:17

cuyl marked this pull request as ready for review January 5, 2026 04:07

cuyl requested a review from a team as a code owner January 5, 2026 04:07