-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
refactor(es/minifier): Improve tpl to str #11415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Binary Sizes
Commit: b56885d |
7eb2654 to
e65312b
Compare
e65312b to
7c9954b
Compare
|
I'm not sure why we have
There are some encoding logic in |
400c7bb to
b94f214
Compare
b94f214 to
f5de72d
Compare
|
Does CI passes, if you remove it? |
|
I have changed some outputs here, but looks not a bug. |
|
@claude Review this PR. My concerns are
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
@claude Review this PR. My concerns are
You should do |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
kdy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Summary
I've identified several critical issues that need to be addressed:
- CRITICAL: Compilation error in - calls removed function
- HIGH: Surrogate pair handling logic has multiple bugs in
- MEDIUM: Error handling uses instead of graceful degradation
- LOW: Missing test coverage for edge cases
Detailed comments are provided on each file. Please address the critical and high priority issues before merging.
kdy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Summary
I have identified several critical issues that need to be addressed:
- CRITICAL: Compilation error in misc.rs line 1429 - calls removed function
- HIGH: Surrogate pair handling logic has multiple bugs in lit.rs
- MEDIUM: Error handling uses panic instead of graceful degradation
- LOW: Missing test coverage for edge cases
Detailed comments are provided on each file. Please address the critical and high priority issues before merging.
|
Correction on my first comment: You are absolutely right that my comment about line 1429 in misc.rs was incorrect. Looking at the diff more carefully, the change on that line is: - cur_cooked.push_wtf8(&convert_str_value_to_tpl_cooked(&s.value));
+ cur_cooked.push_wtf8(&Cow::Borrowed(&s.value));This shows you already fixed this correctly in the PR - the new version uses The other comments about surrogate pair handling, iterator advancement, and panic usage remain valid concerns for the |
|
I assumed the panic is unreachable code, I will add error handing later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the template literal to string conversion logic in the minifier to improve Unicode handling and simplify the code. The main improvements include a comprehensive rewrite of the from_tpl_raw function with proper Unicode escape sequence parsing and surrogate pair handling.
Key changes:
- Rewrote
Str::from_tpl_raw()to properly handle Unicode escape sequences (\u,\x,\u{...}) and UTF-16 surrogate pairs - Simplified template-to-string conversion logic by removing the
convert_str_value_to_tpl_cookedfunction and directly using string values - Added comprehensive test cases for Unicode character handling including emojis and surrogate pairs
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/swc_ecma_ast/src/lit.rs | Completely rewrote from_tpl_raw() function to handle Unicode escapes, surrogate pairs, and various escape sequences; added error emission helper |
| crates/swc_ecma_minifier/src/compress/pure/strings.rs | Simplified template-to-string conversion by removing convert_str_value_to_tpl_cooked() and updating calls to use the improved from_tpl_raw() |
| crates/swc_ecma_minifier/src/compress/pure/misc.rs | Updated imports to remove unused convert_str_value_to_tpl_cooked and simplified template cooked string handling |
| crates/swc_ecma_minifier/tests/eval.rs | Added test cases for emoji and Unicode escape sequence handling in template literals |
| crates/swc_ecma_minifier/tests/terser/compress/template_string/special_chars_in_string/output.js | Updated expected output to reflect improved template-to-string conversion |
| crates/swc_ecma_minifier/tests/terser/compress/template_string/allow_null_character/output.js | Updated expected output for null character handling in templates |
| crates/swc/tests/fixture/issues-8xxx/8496/output/8.js | Updated expected output for hex escape sequences in templates |
Comments suppressed due to low confidence (1)
crates/swc_ecma_ast/src/lit.rs:446
- Potential bug: If the template string ends with a high surrogate stored in
prev_result, it will not be appended to the buffer. After the loop completes at line 444, there should be a check to append any pending high surrogate fromprev_resultbefore returning the buffer at line 446.
}
buf.into()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
kdy1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review Summary
This is a good refactoring that improves Unicode handling and simplifies the code. However, I've identified one critical bug in the surrogate pair decoding formula and several other issues that should be addressed.
Critical Issues:
⚠️ Incorrect surrogate pair decoding formula at lit.rs:384-385⚠️ Overly permissive character acceptance in convert_tpl_to_str at strings.rs:228-231⚠️ Unsafe code needs better validation at lit.rs:398-400
Other Issues:
- Test structure issue in lit.rs:104-110
- Misleading comment at lit.rs:383
I'll add detailed comments on specific lines. Please review and address these concerns before merging.
Positive Points:
✅ Comprehensive test coverage
✅ Proper error handling with emit_span_error
✅ Good use of WTF-8 to preserve invalid Unicode
✅ Code simplification by removing unnecessary conversions
1. CRITICAL - Incorrect Surrogate Pair Decoding FormulaLocation: The bitwise operation is incorrect: result = 0x10000 | ((result & 0x3ff) | ((prev_result & 0x3ff) << 10));Problem: Using bitwise OR ( Should be: result = 0x10000 + ((prev_result & 0x3ff) << 10) + (result & 0x3ff);This is the standard UTF-16 surrogate pair decoding formula. The current code may work for some cases due to bit alignment, but will fail for others. Example: |
2. MEDIUM - Overly Permissive Character AcceptanceLocation: The new logic accepts ALL characters when if c.chars().all(|c| match c {
'\n' | '\r' => self.config.force_str_for_tpl,
_ => true, // ⚠️ This accepts everything\!
}) {Problem: This is less conservative than the old code which only accepted ASCII printable characters: // Old code
'\u{0020}'..='\u{007e}' => true,
_ => false,Impact: May convert template literals with special characters to strings when they shouldn't be, potentially breaking code with unusual Unicode characters or control characters. Recommendation: Review if this change is intentional. If so, document why it's safe. Otherwise, restore the ASCII printable check or add proper validation. |
3. MEDIUM - Unsafe Code Needs Better ValidationLocation: if result <= 0x10ffff {
// We can be sure result is a valid code point here
buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
}Problem: The check Should be: if result <= 0x10ffff && !(0xd800..=0xdfff).contains(&result) {
buf.push(unsafe { CodePoint::from_u32_unchecked(result) });
} else {
emit_span_error(span, "Uncaught SyntaxError: Undefined Unicode code-point");
}Or handle the surrogate case separately before reaching this code. Please verify the safety invariants are correct. |
4. LOW - Test Structure IssueLocation: #[test]
#[should_panic]
fn should_panic() {
test_from_tpl_raw("\\01", "");
test_from_tpl_raw("\\2", ""); // ⚠️ Never reached
test_from_tpl_raw("\\7", ""); // ⚠️ Never reached
}Problem: Only the first assertion will run because the function panics and exits. The other two assertions are never executed. Should be: #[test]
#[should_panic]
fn should_panic_octal_01() {
test_from_tpl_raw("\\01", "");
}
#[test]
#[should_panic]
fn should_panic_octal_2() {
test_from_tpl_raw("\\2", "");
}
#[test]
#[should_panic]
fn should_panic_octal_7() {
test_from_tpl_raw("\\7", "");
}This ensures all three octal cases are actually tested. |
5. LOW - Misleading CommentLocation: // Low surrogate pair
// result is less than 0x1ffff hereProblem: The comment is misleading. At this point, After the calculation on line 384-385, the final result would be in the range 0x10000-0x10FFFF (supplementary planes), but the comment appears before that calculation. Suggested fix: // Low surrogate pair (0xDC00-0xDFFF)
// Decode to supplementary plane code point (0x10000-0x10FFFF) |
✅ Positive Aspects of This PRDespite the issues noted above, this PR has many strengths:
Once the critical issues (especially #1) are addressed, this will be a solid improvement to the minifier. Great work on tackling this complex Unicode handling! 🦀 |
Recommended Additional TestingAfter fixing the issues, please add these test cases to verify correctness: 1. Test for surrogate pair decoding (to catch issue #1): #[test]
fn surrogate_pair_boundary_cases() {
// First supplementary plane character
test_from_tpl_raw("\\uD800\\uDC00", "\u{10000}");
// Last valid character
test_from_tpl_raw("\\uDBFF\\uDFFF", "\u{10FFFF}");
// Various combinations to ensure formula is correct
test_from_tpl_raw("\\uD801\\uDC37", "\u{10437}"); // 𐐷
test_from_tpl_raw("\\uD852\\uDF62", "\u{24B62}"); // 𤭢
}2. Test for special character handling (related to issue #2): #[test]
fn special_chars_in_tpl() {
// Test that control characters are handled correctly
let result = minify("`\\u0000\\u0001\\u001f`");
// Verify behavior matches expectations
}3. Run execution tests: ./scripts/exec.shThis will verify the changes don't break actual JavaScript execution. |
Refer to swc/crates/hstr/src/wtf8/mod.rs Lines 80 to 85 in 9632cd1
And we have handled the case (0xd800..=0xdfff).contains(&result) before in crates/swc_ecma_ast/src/lit.rs:362-391
|
|
I will add the edge test cases |
|
Thanks, already applied some changes.
I think the new logic accepts ALL characters when |
|
IIRC, |
|
So, Can I try to remove |
|
Yeap, it would be nice. Thank you! |
766606f to
627a9f4
Compare
|
Seems |
627a9f4 to
2a37002
Compare
|
Remove force_str_for_tpl is complicated, I would open another PR instead. |
|
I see, it sounds good. Thanks! |
Description:
Pure::convert_tpl_to_strto prefercookedvalues as much as possible.Str::from_tpl_rawto returnWtf8Atominstead ofAtom.