Matching on WTF-8 strings and ECMAScript RegExp simulation #1279
-
| Hey, I'm using the regex crate to implement the ECMAScript RegExp type in the Nova JavaScript engine; the engine uses WTF-8 as the internal representation for strings and of course tries to implement the ECMAScript RegExp specification as faithfully as possible.

I am of course aware that regex does not aim to match any specific language's regular expression syntax, and I'm not exactly looking to change that. I am rather trying to ponder what it would mean to try to simulate WTF-8 matching (by changing patterns), or what it would mean to do that fundamentally in the regex crate or in a fork of it.

What I mean by WTF-8 matching is, first and foremost, to allow matching on "unmatched surrogates" in a WTF-8 byte sequence. For example:

use regex::bytes::Regex;
fn main() {
    let haystack = [237, 161, 130];
    let re = Regex::new(r".").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..3));
}

Here we attempt to match an unmatched surrogate (U+D842) […]

#!/usr/bin/env node
console.log(/./v.test("\ud842")); // true

To "correctly" match these, I'd need to rewrite the […]

use regex::bytes::Regex;
fn main() {
    let haystack = [237, 161, 130];
    let re = Regex::new(r"[\u0128-\uffff]").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..3));
}

#!/usr/bin/env node
console.log(/[\u0128-\uffff]/v.test("\ud842")); // true

Now I would need to rewrite the range to a character class matching any UTF-8 character in this range, or a WTF-8 lone surrogate in this range. Again, complicated but possible. It's worth noting here that the same patterns in JavaScript work regardless of the […]

ECMAScript RegExp of course also allows unmatched surrogates to appear in Unicode escapes. For example, this pattern fails to compile:

use regex::bytes::Regex;
fn main() {
    let haystack = "𠮷".as_bytes();
    let re = Regex::new(r"\udfb7").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..4));
}

But in JS it's considered fine:

#!/usr/bin/env node
console.log(/\udfb7/v.test("𠮷")); // false
console.log(/\udfb7/u.test("𠮷")); // false
console.log(/\udfb7/.test("𠮷"));  // true

Note that in the ECMAScript RegExp Unicode modes the lone surrogate in the pattern does not match that same surrogate as part of a surrogate pair, but in the old "UCS-2" mode it does match. Lovely, innit? This same complication with the non-Unicode mode extends to the […]

A final complication is that at its core, I do want to use regex's Unicode mode always (as I am matching on nearly-UTF-8 strings), but unfortunately it also changes the meaning of e.g. […]

Now, I will reiterate that I do not expect regex to suddenly jump to my aid and implement any or all of this: it's not even exactly clear to me what that would mean! (Matching WTF-8 encoded unpaired surrogates and allowing them in Unicode escape input I would assume isn't too big of a difference, but "splitting" of Unicode characters into two unpaired WTF-8 surrogates is not possible with the […])

Sorry to be a bother, and cheers <3

Originally posted by @aapoalas in #1253 (comment) |
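To make the byte-level picture in the post concrete, here is a minimal, std-only Rust sketch. It assumes WTF-8 encodes a lone surrogate with the ordinary three-byte UTF-8 bit layout, which is how the haystack `[237, 161, 130]` above comes out of U+D842. The helpers `encode_wtf8`, `contains_bytes`, and `contains_code_unit` are hypothetical names made up for this illustration; they are not part of regex or of Nova.

```rust
/// Encode a code point with the generalized UTF-8 bit layout that WTF-8
/// uses. Unlike `char::encode_utf8`, this accepts lone surrogates
/// (U+D800..=U+DFFF). Hypothetical helper for illustration only.
fn encode_wtf8(cp: u32) -> Vec<u8> {
    match cp {
        0x0000..=0x007F => vec![cp as u8],
        0x0080..=0x07FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        0x0800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        0x1_0000..=0x10_FFFF => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => panic!("not a Unicode code point: {cp:#x}"),
    }
}

/// Naive byte-substring search; good enough for a demonstration.
/// An empty needle is treated as always present.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// "UCS-2 mode" view of a string: does it contain this UTF-16 code unit,
/// even when that unit is one half of a surrogate pair?
fn contains_code_unit(s: &str, unit: u16) -> bool {
    s.encode_utf16().any(|u| u == unit)
}
```

With these helpers, `encode_wtf8(0xD842)` yields `[237, 161, 130]`, and `encode_wtf8(0x20BB7)` yields the same four bytes as `"𠮷".as_bytes()`. The bytes for the lone trail surrogate U+DFB7 (`ED BE B7`) do not occur anywhere inside the four-byte encoding of 𠮷, so `contains_bytes("𠮷".as_bytes(), &encode_wtf8(0xDFB7))` is `false`, which lines up with the `u` and `v` results above. By contrast, `contains_code_unit("𠮷", 0xDFB7)` is `true`, which is the flagless "UCS-2" behaviour. A byte-level matcher gets the Unicode-mode answer for free; the non-Unicode answer needs a code-unit view of the haystack.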
Replies: 1 comment 2 replies
-
| 
Shooting from the hip here, my best guess is that you might need something like regex_syntax::utf8, but for WTF-8. Specifically, that module provides APIs for taking sequences of Unicode scalar values to a corresponding byte-based automaton (see the doc examples in that module). Notice that it takes scalar values. Since WTF-8 is specifically designed to encode unpaired surrogates, you'd need something that takes in all possible Unicode codepoints. Arguably, you could do this by copying that module …

For reference, here is where the […] regex/regex-automata/src/nfa/thompson/compiler.rs, lines 1360 to 1447 in 01e2330.

You can! e.g., […] Now, you can't control this within a character class. That is, you can't do […]

So if I were you and I wanted to implement this in the quickest, cheapest way possible... what would I do? The requirement for supporting […]

I think what I'd do is something like this: […]

A possible alternative I considered was to make the changes to the AST, but before translating to HIR, expand all of the Unicode codepoint ranges into an equivalent HIR (an alternation of concatenations of […]

It's quite possible that this is a lot more work than I let on. It seems like you might already know this, but there are considerable differences between ECMAScript regexes and this crate. The Unicode surrogate codepoint handling is just one of them. Before spending a bunch of time trying to paper over this particular incompatibility with […] |
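As a rough illustration of what "regex_syntax::utf8, but for WTF-8" could produce for the surrogate block, here is a std-only Rust sketch. To my knowledge the real module yields sequences of byte ranges for ranges of scalar values via `Utf8Sequences`. Everything below (`ByteRange`, the three constants, `encode3`, `matches_sequence`, `to_pattern`) is a hypothetical stand-in written for this example and is not an API of regex-syntax.

```rust
/// An inclusive byte range. Hypothetical type for this sketch, loosely
/// modelled on the idea of one byte range per encoded byte position.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteRange {
    start: u8,
    end: u8,
}

const fn br(start: u8, end: u8) -> ByteRange {
    ByteRange { start, end }
}

/// All surrogates U+D800..=U+DFFF in WTF-8: ED [A0-BF] [80-BF].
const ANY_SURROGATE: [ByteRange; 3] = [br(0xED, 0xED), br(0xA0, 0xBF), br(0x80, 0xBF)];
/// Lead surrogates U+D800..=U+DBFF: ED [A0-AF] [80-BF].
const LEAD_SURROGATE: [ByteRange; 3] = [br(0xED, 0xED), br(0xA0, 0xAF), br(0x80, 0xBF)];
/// Trail surrogates U+DC00..=U+DFFF: ED [B0-BF] [80-BF].
const TRAIL_SURROGATE: [ByteRange; 3] = [br(0xED, 0xED), br(0xB0, 0xBF), br(0x80, 0xBF)];

/// Three-byte generalized UTF-8 encoding, valid for U+0800..=U+FFFF and,
/// unlike `char::encode_utf8`, including the surrogate block.
fn encode3(cp: u32) -> [u8; 3] {
    assert!((0x0800..=0xFFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

/// True if `bytes` has the same length as `seq` and every byte falls in
/// the range at the same position.
fn matches_sequence(seq: &[ByteRange], bytes: &[u8]) -> bool {
    seq.len() == bytes.len()
        && seq.iter().zip(bytes).all(|(r, &b)| r.start <= b && b <= r.end)
}

/// Render a sequence as a byte-oriented pattern fragment with Unicode
/// mode switched off, e.g. `(?-u:\xED[\xA0-\xBF][\x80-\xBF])`.
fn to_pattern(seq: &[ByteRange]) -> String {
    let mut out = String::from("(?-u:");
    for r in seq {
        if r.start == r.end {
            out.push_str(&format!("\\x{:02X}", r.start));
        } else {
            out.push_str(&format!("[\\x{:02X}-\\x{:02X}]", r.start, r.end));
        }
    }
    out.push(')');
    out
}
```

The fragment returned by `to_pattern(&ANY_SURROGATE)` is meant to be spliced into a `regex::bytes` pattern next to an ordinary Unicode class, so that a rewritten `.` or range can also accept a lone surrogate. Whether that rewriting happens on the pattern string, the AST, or the HIR is exactly the design question discussed in the reply; this sketch only shows the byte ranges involved.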