
Conversation

@cuyl
Contributor

@cuyl cuyl commented Jan 3, 2026

Description:

  • Enhance Pure::convert_tpl_to_str to prefer cooked values as much as possible.
  • Improve Str::from_tpl_raw to return Wtf8Atom instead of Atom.
  • Add more test cases.

@changeset-bot

changeset-bot bot commented Jan 3, 2026

⚠️ No Changeset found

Latest commit: 2a37002

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@github-actions
Contributor

github-actions bot commented Jan 3, 2026

Binary Sizes

| File | Size |
| --- | --- |
| swc.linux-x64-gnu.node | 28M (28446152 bytes) |

Commit: b56885d

@codspeed-hq

codspeed-hq bot commented Jan 3, 2026

Merging this PR will not alter performance

Summary

✅ 185 untouched benchmarks


Comparing cuyl:refact-convert-tpl-to-str (2a37002) with main (dd007c6)


@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from 7eb2654 to e65312b Compare January 4, 2026 00:03
@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from e65312b to 7c9954b Compare January 4, 2026 01:02
@cuyl
Contributor Author

cuyl commented Jan 4, 2026

I'm not sure why we have force_str_for_tpl here. I can see two definitions:

  1. Minification::force_str_for_tpl returns false
  2. Eval::force_str_for_tpl returns true

There is some encoding logic in convert_str_value_to_tpl_cooked. Why do we need to convert Str::value to Tpl::cooked with convert_str_value_to_tpl_cooked? Aren't they the same thing?
Can I just remove convert_str_value_to_tpl_cooked?
@kdy1

@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from 400c7bb to b94f214 Compare January 5, 2026 02:56
@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from b94f214 to f5de72d Compare January 5, 2026 03:17
@kdy1
Member

kdy1 commented Jan 5, 2026

Does CI pass if you remove it?

@cuyl cuyl marked this pull request as ready for review January 5, 2026 04:07
@cuyl cuyl requested a review from a team as a code owner January 5, 2026 04:07
@cuyl
Contributor Author

cuyl commented Jan 5, 2026

I have changed some expected outputs here, but the differences don't look like bugs.
@kdy1

@kdy1
Member

kdy1 commented Jan 5, 2026

@claude Review this PR. My concerns are

  • the correctness of compress_array_join_as_tpl in crates/swc_ecma_minifier/src/compress/pure/misc.rs
  • as many edge cases as you can think of, based on the code changes

@claude
Contributor

claude bot commented Jan 5, 2026

Claude encountered an error —— View job

Command failed: git fetch origin --depth=20 refact-convert-tpl-to-str

I'll analyze this and get back to you.

@kdy1
Member

kdy1 commented Jan 5, 2026

@claude Review this PR. My concerns are

  • the correctness of compress_array_join_as_tpl in crates/swc_ecma_minifier/src/compress/pure/misc.rs
  • as many edge cases as you can think of, based on the code changes

You should run gh pr checkout 11415 to check out the PR branch.

@claude
Contributor

claude bot commented Jan 5, 2026

Claude encountered an error —— View job

Command failed: git fetch origin --depth=20 refact-convert-tpl-to-str

I'll analyze this and get back to you.

Member

@kdy1 kdy1 left a comment


Review Summary

I have identified several critical issues that need to be addressed:

  1. CRITICAL: Compilation error in misc.rs line 1429 - calls removed function
  2. HIGH: Surrogate pair handling logic has multiple bugs in lit.rs
  3. MEDIUM: Error handling uses panic instead of graceful degradation
  4. LOW: Missing test coverage for edge cases

Detailed comments are provided on each file. Please address the critical and high priority issues before merging.

@kdy1
Member

kdy1 commented Jan 5, 2026

Correction on my first comment:

You are absolutely right that my comment about line 1429 in misc.rs was incorrect. Looking at the diff more carefully, the change on that line is:

- cur_cooked.push_wtf8(&convert_str_value_to_tpl_cooked(&s.value));
+ cur_cooked.push_wtf8(&Cow::Borrowed(&s.value));

This shows you already fixed this correctly in the PR - the new version uses Cow::Borrowed. I misread the diff and thought the old function was still being called. My apologies for the confusion!

The other comments about surrogate pair handling, iterator advancement, and panic usage remain valid concerns for the from_tpl_raw implementation in lit.rs.

@cuyl
Contributor Author

cuyl commented Jan 5, 2026

I assumed the panic was unreachable code. I will add error handling later.

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the template literal to string conversion logic in the minifier to improve Unicode handling and simplify the code. The main improvements include a comprehensive rewrite of the from_tpl_raw function with proper Unicode escape sequence parsing and surrogate pair handling.

Key changes:

  • Rewrote Str::from_tpl_raw() to properly handle Unicode escape sequences (\u, \x, \u{...}) and UTF-16 surrogate pairs
  • Simplified template-to-string conversion logic by removing the convert_str_value_to_tpl_cooked function and directly using string values
  • Added comprehensive test cases for Unicode character handling including emojis and surrogate pairs

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| crates/swc_ecma_ast/src/lit.rs | Completely rewrote from_tpl_raw() function to handle Unicode escapes, surrogate pairs, and various escape sequences; added error emission helper |
| crates/swc_ecma_minifier/src/compress/pure/strings.rs | Simplified template-to-string conversion by removing convert_str_value_to_tpl_cooked() and updating calls to use the improved from_tpl_raw() |
| crates/swc_ecma_minifier/src/compress/pure/misc.rs | Updated imports to remove unused convert_str_value_to_tpl_cooked and simplified template cooked string handling |
| crates/swc_ecma_minifier/tests/eval.rs | Added test cases for emoji and Unicode escape sequence handling in template literals |
| crates/swc_ecma_minifier/tests/terser/compress/template_string/special_chars_in_string/output.js | Updated expected output to reflect improved template-to-string conversion |
| crates/swc_ecma_minifier/tests/terser/compress/template_string/allow_null_character/output.js | Updated expected output for null character handling in templates |
| crates/swc/tests/fixture/issues-8xxx/8496/output/8.js | Updated expected output for hex escape sequences in templates |
Comments suppressed due to low confidence (1)

crates/swc_ecma_ast/src/lit.rs:446

  • Potential bug: If the template string ends with a high surrogate stored in prev_result, it will not be appended to the buffer. After the loop completes at line 444, there should be a check to append any pending high surrogate from prev_result before returning the buffer at line 446.
        }

        buf.into()
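
The pending-surrogate concern above can be sketched in isolation. This is a minimal illustration, not the code in lit.rs: decode_units and pending_high are hypothetical names, and the input is assumed to be already-parsed UTF-16 code units rather than raw template text. Lone surrogates are kept as-is, which is the WTF-8 view of the data, and the final flush after the loop is the step the comment says is missing.

```rust
/// Decode UTF-16 code units into code points, keeping lone surrogates.
/// Hypothetical helper for illustration only.
fn decode_units(units: &[u16]) -> Vec<u32> {
    let mut out = Vec::new();
    // A high surrogate waiting to see whether a low surrogate follows.
    let mut pending_high: Option<u32> = None;

    for &unit in units {
        let u = unit as u32;
        match (pending_high.take(), u) {
            // High surrogate followed by a low surrogate: combine them.
            (Some(h), 0xDC00..=0xDFFF) => {
                out.push(0x10000 + ((h & 0x3ff) << 10) + (u & 0x3ff));
            }
            (prev, _) => {
                // A pending high surrogate that was not completed is emitted alone.
                if let Some(h) = prev {
                    out.push(h);
                }
                if (0xD800..=0xDBFF).contains(&u) {
                    pending_high = Some(u);
                } else {
                    out.push(u);
                }
            }
        }
    }

    // Flush a trailing high surrogate so it is not silently dropped.
    if let Some(h) = pending_high {
        out.push(h);
    }
    out
}
```

With this sketch, decode_units(&[0x61, 0xD800]) returns [0x61, 0xD800]; without the final flush the trailing 0xD800 would be lost.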


@kdy1 kdy1 added this to the Planned milestone Jan 5, 2026
@cuyl cuyl marked this pull request as ready for review January 6, 2026 02:16
Member

@kdy1 kdy1 left a comment


Code Review Summary

This is a good refactoring that improves Unicode handling and simplifies the code. However, I've identified one critical bug in the surrogate pair decoding formula and several other issues that should be addressed.

Critical Issues:

  1. ⚠️ Incorrect surrogate pair decoding formula at lit.rs:384-385
  2. ⚠️ Overly permissive character acceptance in convert_tpl_to_str at strings.rs:228-231
  3. ⚠️ Unsafe code needs better validation at lit.rs:398-400

Other Issues:

  1. Test structure issue in lit.rs:104-110
  2. Misleading comment at lit.rs:383

I'll add detailed comments on specific lines. Please review and address these concerns before merging.

Positive Points:

✅ Comprehensive test coverage
✅ Proper error handling with emit_span_error
✅ Good use of WTF-8 to preserve invalid Unicode
✅ Code simplification by removing unnecessary conversions

@kdy1
Member

kdy1 commented Jan 7, 2026

1. CRITICAL - Incorrect Surrogate Pair Decoding Formula

Location: crates/swc_ecma_ast/src/lit.rs:384-385

The bitwise operation is incorrect:

result = 0x10000 | ((result & 0x3ff) | ((prev_result & 0x3ff) << 10));

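Problem: Using bitwise OR (|) instead of addition (+) to apply the 0x10000 offset produces incorrect results whenever (prev_result & 0x3ff) << 10 already has bit 16 set, because the OR then leaves that bit unchanged instead of carrying into bit 17.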

Should be:

result = 0x10000 + ((prev_result & 0x3ff) << 10) + (result & 0x3ff);

This is the standard UTF-16 surrogate pair decoding formula. The current code works when bit 16 of the shifted high-surrogate payload is clear, but loses the 0x10000 offset when it is set.

Example: \uD800\uDC00 decodes to U+10000 (65536) under both formulas, whereas \uD840\uDC00 should decode to U+20000 but the OR version yields U+10000.
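
A small standalone comparison of the two formulas, as a sketch: decode_pair_add and decode_pair_or are hypothetical names, with high and low standing in for prev_result and result from the snippet above.

```rust
/// Standard UTF-16 decoding: the 0x10000 offset is added.
fn decode_pair_add(high: u32, low: u32) -> u32 {
    0x10000 + ((high & 0x3ff) << 10) + (low & 0x3ff)
}

/// The reviewed variant: the 0x10000 offset is OR-ed in.
/// The shifted high payload occupies bits 10..=19, so it can already
/// have bit 16 set, in which case the OR is a no-op.
fn decode_pair_or(high: u32, low: u32) -> u32 {
    0x10000 | ((low & 0x3ff) | ((high & 0x3ff) << 10))
}
```

The two agree on (0xD800, 0xDC00), where both give 0x10000. They differ on (0xD840, 0xDC00): decode_pair_add gives 0x20000 and decode_pair_or gives 0x10000.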

@kdy1
Member

kdy1 commented Jan 7, 2026

2. MEDIUM - Overly Permissive Character Acceptance

Location: crates/swc_ecma_minifier/src/compress/pure/strings.rs:228-231

The new logic accepts ALL characters when force_str_for_tpl is false:

if c.chars().all(|c| match c {
    '\n' | '\r' => self.config.force_str_for_tpl,
    _ => true,  // ⚠️ This accepts everything!
}) {

Problem: This is less conservative than the old code which only accepted ASCII printable characters:

// Old code
'\u{0020}'..='\u{007e}' => true,
_ => false,

Impact: May convert template literals with special characters to strings when they shouldn't be, potentially breaking code with unusual Unicode characters or control characters.

Recommendation: Review if this change is intentional. If so, document why it's safe. Otherwise, restore the ASCII printable check or add proper validation.
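
To make the difference concrete, here is a sketch of the two per-character predicates built only from the arms quoted above. old_accepts and new_accepts are hypothetical names, and the full old match in strings.rs may have had additional arms that are not shown in this thread.

```rust
/// Old check, using only the quoted arms: printable ASCII passes, all else fails.
/// Assumption: any other arms of the original match are omitted here.
fn old_accepts(c: char) -> bool {
    matches!(c, '\u{0020}'..='\u{007e}')
}

/// New check as quoted: everything passes except CR and LF,
/// which pass only when force_str_for_tpl is set.
fn new_accepts(c: char, force_str_for_tpl: bool) -> bool {
    match c {
        '\n' | '\r' => force_str_for_tpl,
        _ => true,
    }
}
```

Under this sketch, a NUL character or a non-ASCII letter such as 'é' is rejected by the old check and accepted by the new one regardless of the flag. Only '\n' and '\r' still depend on force_str_for_tpl.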

@kdy1
Member

kdy1 commented Jan 7, 2026

3. MEDIUM - Unsafe Code Needs Better Validation

Location: crates/swc_ecma_ast/src/lit.rs:398-400

if result <= 0x10ffff {
    // We can be sure result is a valid code point here
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
}

Problem: The check result <= 0x10ffff alone doesn't guarantee a valid code point. The surrogate range (0xD800-0xDFFF) is also invalid for direct code points.

Should be:

if result <= 0x10ffff && !(0xd800..=0xdfff).contains(&result) {
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
} else {
    emit_span_error(span, "Uncaught SyntaxError: Undefined Unicode code-point");
}

Or handle the surrogate case separately before reaching this code. Please verify the safety invariants are correct.

@kdy1
Member

kdy1 commented Jan 7, 2026

4. LOW - Test Structure Issue

Location: crates/swc_ecma_ast/tests/lit.rs:104-110

#[test]
#[should_panic]
fn should_panic() {
    test_from_tpl_raw("\\01", "");
    test_from_tpl_raw("\\2", "");   // ⚠️ Never reached
    test_from_tpl_raw("\\7", "");   // ⚠️ Never reached
}

Problem: Only the first assertion will run because the function panics and exits. The other two assertions are never executed.

Should be:

#[test]
#[should_panic]
fn should_panic_octal_01() {
    test_from_tpl_raw("\\01", "");
}

#[test]
#[should_panic]
fn should_panic_octal_2() {
    test_from_tpl_raw("\\2", "");
}

#[test]
#[should_panic]
fn should_panic_octal_7() {
    test_from_tpl_raw("\\7", "");
}

This ensures all three octal cases are actually tested.
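
An alternative to three separate tests, sketched under assumptions: std::panic::catch_unwind can check each input independently inside one test. reject_octal below is a hypothetical stand-in for the real parser and only models "panics on a legacy octal escape". It is not the from_tpl_raw implementation and does not handle escaped backslashes.

```rust
use std::panic;

/// Hypothetical stand-in: panics on `\1`..`\9` or on `\0` followed by a digit.
fn reject_octal(raw: &str) {
    let bytes = raw.as_bytes();
    for i in 0..bytes.len() {
        if bytes[i] != b'\\' {
            continue;
        }
        match (bytes.get(i + 1).copied(), bytes.get(i + 2).copied()) {
            (Some(b'1'..=b'9'), _) => panic!("octal escape in template"),
            (Some(b'0'), Some(d)) if d.is_ascii_digit() => panic!("octal escape in template"),
            _ => {}
        }
    }
}

/// Returns true if parsing `raw` panics, without aborting the caller.
fn panics(raw: &str) -> bool {
    panic::catch_unwind(|| reject_octal(raw)).is_err()
}
```

A single test can then assert panics("\\01"), panics("\\2") and panics("\\7") in turn, and each case is actually exercised. The trade-off is that a failure is reported per assertion rather than per named test, so separate #[should_panic] tests remain the simpler choice when there are only a few cases.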

@kdy1
Member

kdy1 commented Jan 7, 2026

5. LOW - Misleading Comment

Location: crates/swc_ecma_ast/src/lit.rs:383

// Low surrogate pair
// result is less than 0x1ffff here

Problem: The comment is misleading. At this point, result is a low surrogate in the range 0xDC00-0xDFFF, not "less than 0x1ffff".

After the calculation on line 384-385, the final result would be in the range 0x10000-0x10FFFF (supplementary planes), but the comment appears before that calculation.

Suggested fix:

// Low surrogate pair (0xDC00-0xDFFF)
// Decode to supplementary plane code point (0x10000-0x10FFFF)

@kdy1
Member

kdy1 commented Jan 7, 2026

✅ Positive Aspects of This PR

Despite the issues noted above, this PR has many strengths:

  1. Excellent Test Coverage: The 11 test cases cover a wide range of edge cases including emoji, surrogate pairs, hex escapes, and invalid sequences. This is exactly what's needed for Unicode handling code.

  2. Proper Error Handling: Using emit_span_error instead of panicking is the right approach for a production minifier. This allows the tool to report errors gracefully without crashing.

  3. WTF-8 Support: Returning Wtf8Atom instead of Atom is crucial for preserving invalid Unicode sequences, which is important for minification correctness.

  4. Code Simplification: Removing the unnecessary convert_str_value_to_tpl_cooked function and using Cow::Borrowed directly eliminates redundant conversions and improves performance.

  5. Comprehensive Escape Sequence Handling: The implementation handles line continuations (\r, \n), Unicode escapes (\u, \x), and octal validation correctly.

Once the critical issues (especially #1) are addressed, this will be a solid improvement to the minifier. Great work on tackling this complex Unicode handling! 🦀

@kdy1
Member

kdy1 commented Jan 7, 2026

Recommended Additional Testing

After fixing the issues, please add these test cases to verify correctness:

1. Test for surrogate pair decoding (to catch issue #1):

#[test]
fn surrogate_pair_boundary_cases() {
    // First supplementary plane character
    test_from_tpl_raw("\\uD800\\uDC00", "\u{10000}");
    
    // Last valid character
    test_from_tpl_raw("\\uDBFF\\uDFFF", "\u{10FFFF}");
    
    // Various combinations to ensure formula is correct
    test_from_tpl_raw("\\uD801\\uDC37", "\u{10437}"); // 𐐷
    test_from_tpl_raw("\\uD852\\uDF62", "\u{24B62}"); // 𤭢
}

2. Test for special character handling (related to issue #2):

#[test]
fn special_chars_in_tpl() {
    // Test that control characters are handled correctly
    let result = minify("`\\u0000\\u0001\\u001f`");
    // Verify behavior matches expectations
}

3. Run execution tests:
According to crates/swc_ecma_minifier/AGENTS.md, you should run:

./scripts/exec.sh

This will verify the changes don't break actual JavaScript execution.

@cuyl
Contributor Author

cuyl commented Jan 7, 2026

3. MEDIUM - Unsafe Code Needs Better Validation

Location: crates/swc_ecma_ast/src/lit.rs:398-400

if result <= 0x10ffff {
    // We can be sure result is a valid code point here
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
}

Problem: The check result <= 0x10ffff alone doesn't guarantee a valid code point. The surrogate range (0xD800-0xDFFF) is also invalid for direct code points.

Should be:

if result <= 0x10ffff && !(0xd800..=0xdfff).contains(&result) {
    buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
} else {
    emit_span_error(span, "Uncaught SyntaxError: Undefined Unicode code-point");
}

Or handle the surrogate case separately before reaching this code. Please verify the safety invariants are correct.

Per CodePoint::from_u32, every value with result <= 0x10ffff is a valid code point:

pub const fn from_u32(value: u32) -> Option<CodePoint> {
    match value {
        0..=0x10ffff => Some(CodePoint { value }),
        _ => None,
    }
}

We have also already handled the case (0xd800..=0xdfff).contains(&result) earlier, in crates/swc_ecma_ast/src/lit.rs:362-391.
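
The distinction behind this reply can be sketched with the standard library alone. The CodePoint type below is a hypothetical local mirror of the from_u32 quoted above, not the real type from the crate, and to_char is an illustrative helper. A WTF-8 code point admits surrogates, while a Rust char (a Unicode scalar value) does not.

```rust
/// Hypothetical mirror of the quoted CodePoint: any value up to 0x10ffff is allowed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CodePoint {
    value: u32,
}

impl CodePoint {
    const fn from_u32(value: u32) -> Option<CodePoint> {
        match value {
            0..=0x10ffff => Some(CodePoint { value }),
            _ => None,
        }
    }

    /// Illustrative helper: converts to a Unicode scalar value.
    /// Returns None for surrogates, per std's char::from_u32.
    fn to_char(self) -> Option<char> {
        char::from_u32(self.value)
    }
}
```

Under this sketch, CodePoint::from_u32(0xD800) is Some, but its to_char() is None. So a check of result <= 0x10ffff is sufficient for a code point in the WTF-8 sense and would not be sufficient for a char.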

@cuyl
Contributor Author

cuyl commented Jan 7, 2026

I will add the edge-case tests.

@cuyl
Contributor Author

cuyl commented Jan 7, 2026

Thanks, I have already applied some of the changes.

2. MEDIUM - Overly Permissive Character Acceptance

Location: crates/swc_ecma_minifier/src/compress/pure/strings.rs:228-231

The new logic accepts ALL characters when force_str_for_tpl is false:

if c.chars().all(|c| match c {
    '\n' | '\r' => self.config.force_str_for_tpl,
    _ => true,  // ⚠️ This accepts everything!
}) {

Problem: This is less conservative than the old code which only accepted ASCII printable characters:

// Old code
'\u{0020}'..='\u{007e}' => true,
_ => false,

Impact: May convert template literals with special characters to strings when they shouldn't be, potentially breaking code with unusual Unicode characters or control characters.

Recommendation: Review if this change is intentional. If so, document why it's safe. Otherwise, restore the ASCII printable check or add proper validation.

I think the new logic accepts ALL characters only when force_str_for_tpl is true; when it is false, '\n' and '\r' are still rejected.
Again, I don't really know the reason for the old code, or why we have force_str_for_tpl. Do you have any opinion on this?
@kdy1

@kdy1
Member

kdy1 commented Jan 7, 2026

IIRC, force_str_for_tpl was necessary for Evaluator mainly because the evaluator didn't work well with Expr::Tpl. But now that #11411 is merged, I don't think force_str_for_tpl is required anymore.

@cuyl
Contributor Author

cuyl commented Jan 7, 2026

So, can I try to remove force_str_for_tpl in this PR?

@kdy1
Member

kdy1 commented Jan 7, 2026

Yep, that would be nice. Thank you!

@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from 766606f to 627a9f4 Compare January 7, 2026 06:33
@cuyl
Contributor Author

cuyl commented Jan 7, 2026

It seems Optimizer::reduce_escaped_newline_for_str_lit is the inverse operation of Pure::convert_tpl_to_str, so I added some code to prevent an infinite loop.

@cuyl cuyl force-pushed the refact-convert-tpl-to-str branch from 627a9f4 to 2a37002 Compare January 8, 2026 01:57
@cuyl
Contributor Author

cuyl commented Jan 8, 2026

Removing force_str_for_tpl is complicated, so I will open another PR for it instead.

@kdy1
Member

kdy1 commented Jan 8, 2026

I see, that sounds good. Thanks!

@cuyl cuyl requested a review from kdy1 January 8, 2026 09:43