repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7

arthurgousset · 2025-05-29T22:07:43Z

Warning

DO NOT MERGE. The goal is not to merge these tests. The tests are created to reproduce bugs. This helps verify that the patch that is created fixes the bug. The intention is to share a "proof" to reduce the cognitive load on the review and visualise that it correctly identified a bug, reproduced it, and patched it.

Cyrillic Typo Tolerance Bug Successfully Reproduced

Root cause analysis: https://hyperdrive.engineering/#report-4722be5b-5b9d-4000-8383-71b5a4296231

Bug Description

Successfully reproduced the Cyrillic typo tolerance bug where the number_of_typos_allowed function in milli uses byte count (word.len()) instead of character count (word.chars().count()) to determine typo tolerance. This causes words with multi-byte Unicode characters (like Cyrillic) to receive incorrect typo tolerance compared to ASCII words with the same character count.

Commands to Reproduce Manually

Navigate to the milli directory:
```
cd crates/milli
```

Create a test file to reproduce the bug:

cat > tests/cyrillic_bug_test.rs << 'EOF'
// Test to validate the Cyrillic typo tolerance bug exists by demonstrating the problematic logic
#
fn test_cyrillic_char_count_bug() {
    // ASCII word "doggy" (5 chars, 5 bytes)
    let ascii_word = "doggy";
    let ascii_byte_len = ascii_word.len();
    
    // Cyrillic word "собак" (5 chars, 10 bytes)  
    let cyrillic_word = "собак";
    let cyrillic_byte_len = cyrillic_word.len();
    
    // Simulate the buggy logic with default settings (oneTypo=5, twoTypos=9)
    let min_len_one_typo = 5;
    let min_len_two_typos = 9;
    
    // Current buggy implementation uses word.len() (byte count)
    let ascii_typos_buggy = if ascii_byte_len < min_len_one_typo {
        0
    } else if ascii_byte_len < min_len_two_typos {
        1
    } else {
        2
    };
    
    let cyrillic_typos_buggy = if cyrillic_byte_len < min_len_one_typo {
        0
    } else if cyrillic_byte_len < min_len_two_typos {
        1
    } else {
        2
    };
    
    eprintln!("ASCII '{}': byte_len={}, typos={}", ascii_word, ascii_byte_len, ascii_typos_buggy);
    eprintln!("Cyrillic '{}': byte_len={}, typos={}", cyrillic_word, cyrillic_byte_len, cyrillic_typos_buggy);
    
    // This assertion will FAIL, proving the bug exists
    assert_eq!(ascii_typos_buggy, cyrillic_typos_buggy,
        "BUG: ASCII and Cyrillic words with same character count get different typo tolerance");
}
EOF

Run the test to reproduce the bug:

cargo test --test cyrillic_bug_test test_cyrillic_char_count_bug -- --nocapture

Expected vs Actual Behavior

Expected Behavior:

Both "doggy" (5 ASCII chars) and "собак" (5 Cyrillic chars) should receive the same typo tolerance (1 typo) since they both have 5 characters
The test should pass if character counting was implemented correctly

Actual Behavior (Bug):

"doggy" (5 chars, 5 bytes) gets 1 typo tolerance
"собак" (5 chars, 10 bytes) gets 2 typos tolerance
The test FAILS with assertion error showing different typo counts: left: 1 right: 2
Output shows: ASCII 'doggy': byte_len=5, typos=1 and Cyrillic 'собак': byte_len=10, typos=2

Bug Location

File: ./crates/milli/src/search/new/query_term/parse_query.rs
Function: number_of_typos_allowed (lines 194-215)
Buggy Lines: 205 and 209 where word.len() is used instead of word.chars().count()

Impact

This bug affects all multi-byte Unicode text including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, accented characters, and emoji - causing them to receive incorrect typo tolerance in search operations.

repro(DO NOT MERGE): cyrillic bug

2989316

arthurgousset changed the title ~~repro(5594): Cyrillic has different typo tolerance due to byte counting bug~~ repro(5594): Cyrillic has different typo tolerance due to byte counting bug (DO NOT MERGE) May 29, 2025

arthurgousset changed the title ~~repro(5594): Cyrillic has different typo tolerance due to byte counting bug (DO NOT MERGE)~~ repro(5594): Cyrillic has different typo tolerance due to byte counting bug May 29, 2025

arthurgousset mentioned this pull request May 30, 2025

Wrong minWordSizeForTypos processing for cyrillic language meilisearch/meilisearch#5594

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7

repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7

Uh oh!

arthurgousset commented May 29, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7

Are you sure you want to change the base?

repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7

Uh oh!

Conversation

arthurgousset commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Cyrillic Typo Tolerance Bug Successfully Reproduced

Bug Description

Commands to Reproduce Manually

Expected vs Actual Behavior

Bug Location

Impact

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arthurgousset commented May 29, 2025 •

edited

Loading