Why does iteration with bytes::Regex yield empty matches that can split a codepoint, even when Unicode mode is enabled?
#1276
What version of regex are you using?
1.11.1
Describe the bug at a high level.
When using
What are the steps to reproduce the behavior?
let re = regex::bytes::RegexBuilder::new(r"").unicode(true).build().unwrap();
let subject = "😃".as_bytes(); // I.e. U+1F603
assert_eq!(subject, b"\xF0\x9F\x98\x83"); // 4 UTF-8 bytes
for m in re.find_iter(subject) {
    println!("{:?}", m);
}
let res = re.replace_all(subject, b"<$0>");
println!("{}", String::from_utf8_lossy(&res));
What is the actual behavior?
The above prints
In other words, both
What is the expected behavior?
I expected the above to print:
which is exactly what happens when I use a
If I change the regex from
You may just tell me "don't use
Replies: 2 comments 2 replies
I noticed this issue is fixed on the similar
This is very intentional behavior, although I note that it isn't documented. A great deal of care and attention is paid to this in the implementation inside of regex-automata. Notably, the semantics you want are called "UTF-8 mode," which is distinct and orthogonal from Unicode mode. UTF-8 mode has to do with guaranteeing that match spans always fall on UTF-8 boundaries. Unicode mode has to do with whether the regex pattern has Unicode features available. For example, see:
meta::Config::utf8_empty
nfa::thompson::Config::utf8
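To make the distinction concrete, here is a minimal standard-library sketch of what "match spans always fall on UTF-8 boundaries" means for an empty pattern. It is only a model of the semantics, not how regex-automata implements them, and the helper names empty_matches_bytes and empty_matches_utf8 are made up for illustration. It assumes the haystack is valid UTF-8, which is why the second helper takes a &str.

```rust
/// Offsets where an empty pattern matches when the haystack is treated as
/// raw bytes: every position from 0 through haystack.len(), inclusive.
/// (Hypothetical helper; models a search without UTF-8 mode.)
fn empty_matches_bytes(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len()).collect()
}

/// Offsets where an empty pattern matches when match spans must fall on
/// UTF-8 codepoint boundaries.
/// (Hypothetical helper; models a search with UTF-8 mode on valid UTF-8.)
fn empty_matches_utf8(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}

fn main() {
    let subject = "😃"; // U+1F603, encoded as 4 UTF-8 bytes
    assert_eq!(subject.as_bytes(), b"\xF0\x9F\x98\x83");

    // Byte-oriented view: offsets 1, 2 and 3 fall inside the codepoint.
    println!("{:?}", empty_matches_bytes(subject.as_bytes())); // [0, 1, 2, 3, 4]

    // Boundary-respecting view: only the start and end of the codepoint.
    println!("{:?}", empty_matches_utf8(subject)); // [0, 4]
}
```

Whether the pattern may use Unicode features is a separate question. In this model, toggling Unicode mode would change neither list, which is consistent with the behavior reported in the question.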
And special handling of empty matches when UTF-8 mode is enabled versus not:
regex/regex-automata/src/hybrid/dfa.rs
Lines 588 to 618 in 1a069b9