Why does iteration with `bytes::Regex` yield empty matches that can split a codepoint, even when Unicode mode is enabled? #1276

IsaacOscar · 2025-08-05T11:40:49Z

What version of regex are you using?

1.11.1

Describe the bug at a high level.

When using regex::bytes with Unicode mode enabled (https://docs.rs/regex/latest/regex/bytes/struct.RegexBuilder.html#method.unicode), iterating over matches does not respect Unicode character boundaries; it iterates over the raw bytes instead.

What are the steps to reproduce the behavior?

let re = regex::bytes::RegexBuilder::new(r"").unicode(true).build().unwrap();
let subject = "😃".as_bytes(); // I.e. U+1F603
assert_eq!(subject, b"\xF0\x9F\x98\x83"); // 4 UTF-8 bytes
for m in re.find_iter(subject) {
    println!("{:?}", m);
}
let res = re.replace_all(subject, b"<$0>");
println!("{}", String::from_utf8_lossy(&res));

What is the actual behavior?

The above prints

Match { start: 0, end: 0, bytes: "" }
Match { start: 1, end: 1, bytes: "" }
Match { start: 2, end: 2, bytes: "" }
Match { start: 3, end: 3, bytes: "" }
Match { start: 4, end: 4, bytes: "" }
<>�<>�<>�<>�<>

In other words, both find_iter and replace_all are operating on the individual byte level and not the UTF-8 character level.

What is the expected behavior?

I expected the above to print:

Match { start: 0, end: 0, string: "" }
Match { start: 4, end: 4, string: "" }
<>😃<>

which is exactly what happens when I use a regex::RegexBuilder instead of a regex::bytes::RegexBuilder.

If I change the regex from "" to ".", both work properly:

Match { start: 0, end: 4, string: "😃" }
<😃>

You may just tell me "don't use regex::bytes", but that is not a solution if what I'm matching over contains a mix of valid and invalid UTF-8, whereas "." works correctly there.

Answered by BurntSushi

IsaacOscar · 2025-08-05T12:38:26Z

Author

I noticed this issue is fixed in the similar rust-pcre2 crate by pull request BurntSushi/rust-pcre2#36. Maybe you could do something similar?

1 reply

BurntSushi Aug 5, 2025
Maintainer

I'm not convinced that PR is correct. It at the very least has unspecified behavior when UTF mode is enabled and the haystack isn't valid UTF-8.

BurntSushi · 2025-08-05T14:04:49Z

Maintainer

This is very intentional behavior, although I note that it isn't documented. A great deal of care and attention is paid to this in the implementation inside of regex-automata. Notably, the semantics you want are called "UTF-8 mode," which is distinct and orthogonal from Unicode mode. UTF-8 mode has to do with guaranteeing that match spans always fall on UTF-8 boundaries. Unicode mode has to do with whether the regex pattern has Unicode features available. For example, see:

And special handling of empty matches when UTF-8 mode is enabled versus not:

regex/regex-automata/src/hybrid/dfa.rs

Lines 588 to 618 in 1a069b9

    
#[inline]
pub fn try_search_fwd(
    &self,
    cache: &mut Cache,
    input: &Input<'_>,
) -> Result<Option<HalfMatch>, MatchError> {
    let utf8empty = self.get_nfa().has_empty() && self.get_nfa().is_utf8();
    let hm = match search::find_fwd(self, cache, input)? {
        None => return Ok(None),
        Some(hm) if !utf8empty => return Ok(Some(hm)),
        Some(hm) => hm,
    };
    // We get to this point when we know our DFA can match the empty string
    // AND when UTF-8 mode is enabled. In this case, we skip any matches
    // whose offset splits a codepoint. Such a match is necessarily a
    // zero-width match, because UTF-8 mode requires the underlying NFA
    // to be built such that all non-empty matches span valid UTF-8.
    // Therefore, any match that ends in the middle of a codepoint cannot
    // be part of a span of valid UTF-8 and thus must be an empty match.
    // In such cases, we skip it, so as not to report matches that split a
    // codepoint.
    //
    // Note that this is not a checked assumption. Callers *can* provide an
    // NFA with UTF-8 mode enabled but produces non-empty matches that span
    // invalid UTF-8. But doing so is documented to result in unspecified
    // behavior.
    empty::skip_splits_fwd(input, hm, hm.offset(), |input| {
        let got = search::find_fwd(self, cache, input)?;
        Ok(got.map(|hm| (hm, hm.offset())))
    })
}
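
To make the effect of that skipping concrete, here is a minimal, std-only sketch of the idea for the empty pattern "". It is a simplified model, not the regex-automata implementation: `empty_match_offsets` is a hypothetical helper, and `is_boundary` is a copy of the predicate quoted later in this answer. With UTF-8 mode off, every offset is a match; with it on, offsets that split a codepoint are dropped.

```rust
/// Same check as the `is_boundary` routine quoted later in this thread: true
/// when `i` is the end of `bytes` or `bytes[i]` is not a continuation byte.
fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        None => i == bytes.len(),
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}

/// Hypothetical helper (not part of regex-automata): the offsets at which the
/// empty pattern "" would be reported, with or without UTF-8 mode.
fn empty_match_offsets(haystack: &[u8], utf8_mode: bool) -> Vec<usize> {
    (0..=haystack.len())
        // Without UTF-8 mode every offset is kept. With it, offsets that
        // fall in the middle of a codepoint are skipped.
        .filter(|&i| !utf8_mode || is_boundary(haystack, i))
        .collect()
}
```

For the haystack "😃".as_bytes(), `empty_match_offsets(hay, false)` gives [0, 1, 2, 3, 4], matching the bytes::Regex output in the original report, while `empty_match_offsets(hay, true)` gives [0, 4], matching the output expected there.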

And then finally, you can read this module dedicated to handling this case for all of the engines inside of regex-automata:

https://github.com/rust-lang/regex/blob/1a069b9232c607b34c4937122361aa075ef573fa/regex-automata/src/util/empty.rs

If you read the above, you'll notice that enabling UTF-8 mode while providing a haystack that is invalid UTF-8 results in unspecified behavior. Specifically, the reason for unspecified behavior is that the "is char boundary" predicate has unspecified behavior:

regex/regex-automata/src/util/utf8.rs

Lines 117 to 137 in 1a069b9

    
/// Returns true if and only if the given offset in the given bytes falls on a
/// valid UTF-8 encoded codepoint boundary.
///
/// If `bytes` is not valid UTF-8, then the behavior of this routine is
/// unspecified.
#[cfg_attr(feature = "perf-inline", inline(always))]
pub(crate) fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        // The position at the end of the bytes always represents an empty
        // string, which is a valid boundary. But anything after that doesn't
        // make much sense to call valid a boundary.
        None => i == bytes.len(),
        // Other than ASCII (where the most significant bit is never set),
        // valid starting bytes always have their most significant two bits
        // set, where as continuation bytes never have their second most
        // significant bit set. Therefore, this only returns true when bytes[i]
        // corresponds to a byte that begins a valid UTF-8 encoding of a
        // Unicode scalar value.
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}
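
As a rough illustration of why invalid UTF-8 is a problem for this predicate, the std-only sketch below copies it and lists the offsets it accepts. The helpers `boundaries` and `std_boundaries` are hypothetical names introduced only for this example. On valid UTF-8 the predicate agrees with `str::is_char_boundary`. On invalid input it still returns an answer, but that answer just reflects the bit pattern of each byte and carries no guarantee.

```rust
/// Copy of the predicate quoted above, reproduced so this sketch compiles on
/// its own.
fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        None => i == bytes.len(),
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}

/// Hypothetical helper: every offset in `bytes` that the predicate accepts.
fn boundaries(bytes: &[u8]) -> Vec<usize> {
    (0..=bytes.len()).filter(|&i| is_boundary(bytes, i)).collect()
}

/// Hypothetical helper: the same listing via the standard library, which is
/// only available for haystacks that are already known to be valid UTF-8.
fn std_boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}
```

For "a😃é", both helpers return [0, 1, 5, 7]. For the invalid haystack b"a\x80b", `boundaries` returns [0, 2, 3]: the stray continuation byte at offset 1 is treated as the interior of a codepoint, so an empty match at offset 1 would be skipped. For b"\xFF\x80" it returns [0, 2], even though 0xFF can never appear in valid UTF-8.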

I don't know if you require UTF-8 mode on a haystack that is invalid UTF-8, but if so, it requires reckoning with what it means to be a codepoint boundary on arbitrary byte sequences. The unspecified behavior in regex-automata may be acceptable to you, in which case you can use meta::Regex directly.

1 reply

IsaacOscar Aug 5, 2025
Author

Wow, thanks for the detailed reply.
For anyone else who comes across this problem, I found more discussion in #484 (which basically makes my initial issue report a duplicate).

The documentation should really be updated to clarify this.
To quote https://docs.rs/regex/latest/regex/index.html#unicode:

The top-level Regex runs searches as if iterating over each of the codepoints in the haystack. That is, the fundamental atom of matching is a single codepoint.

bytes::Regex, in contrast, permits disabling Unicode mode for part of all of your pattern in all cases. When Unicode mode is disabled, then a search is run as if iterating over each byte in the haystack. That is, the fundamental atom of matching is a single byte. (A top-level Regex also permits disabling Unicode and thus matching as if it were one byte at a time, but only when doing so wouldn’t permit matching invalid UTF-8.)

Emphasis added. To me that says that when Unicode mode is enabled, it will iterate over Unicode characters, not single bytes.
