Skip to content
This repository was archived by the owner on Jul 31, 2025. It is now read-only.

Conversation

@arthurgousset
Copy link
Member

@arthurgousset arthurgousset commented May 29, 2025

Warning

DO NOT MERGE. The goal is not to merge these tests. The tests are created to reproduce bugs. This helps verify that the patch that is created fixes the bug. The intention is to share a "proof" to reduce the cognitive load on the review and visualise that it correctly identified a bug, reproduced it, and patched it.

Cyrillic Typo Tolerance Bug Successfully Reproduced

Root cause analysis: https://hyperdrive.engineering/#report-4722be5b-5b9d-4000-8383-71b5a4296231

Bug Description

Successfully reproduced the Cyrillic typo tolerance bug where the number_of_typos_allowed function in milli uses byte count (word.len()) instead of character count (word.chars().count()) to determine typo tolerance. This causes words with multi-byte Unicode characters (like Cyrillic) to receive incorrect typo tolerance compared to ASCII words with the same character count.

Commands to Reproduce Manually

  1. Navigate to the milli directory:

    cd crates/milli
  2. Create a test file to reproduce the bug:

    cat > tests/cyrillic_bug_test.rs << 'EOF'
    // Test to validate the Cyrillic typo tolerance bug exists by demonstrating the problematic logic
    #
    fn test_cyrillic_char_count_bug() {
        // ASCII word "doggy" (5 chars, 5 bytes)
        let ascii_word = "doggy";
        let ascii_byte_len = ascii_word.len();
        
        // Cyrillic word "собак" (5 chars, 10 bytes)  
        let cyrillic_word = "собак";
        let cyrillic_byte_len = cyrillic_word.len();
        
        // Simulate the buggy logic with default settings (oneTypo=5, twoTypos=9)
        let min_len_one_typo = 5;
        let min_len_two_typos = 9;
        
        // Current buggy implementation uses word.len() (byte count)
        let ascii_typos_buggy = if ascii_byte_len < min_len_one_typo {
            0
        } else if ascii_byte_len < min_len_two_typos {
            1
        } else {
            2
        };
        
        let cyrillic_typos_buggy = if cyrillic_byte_len < min_len_one_typo {
            0
        } else if cyrillic_byte_len < min_len_two_typos {
            1
        } else {
            2
        };
        
        eprintln!("ASCII '{}': byte_len={}, typos={}", ascii_word, ascii_byte_len, ascii_typos_buggy);
        eprintln!("Cyrillic '{}': byte_len={}, typos={}", cyrillic_word, cyrillic_byte_len, cyrillic_typos_buggy);
        
        // This assertion will FAIL, proving the bug exists
        assert_eq!(ascii_typos_buggy, cyrillic_typos_buggy,
            "BUG: ASCII and Cyrillic words with same character count get different typo tolerance");
    }
    EOF
  3. Run the test to reproduce the bug:

    cargo test --test cyrillic_bug_test test_cyrillic_char_count_bug -- --nocapture

Expected vs Actual Behavior

Expected Behavior:

  • Both "doggy" (5 ASCII chars) and "собак" (5 Cyrillic chars) should receive the same typo tolerance (1 typo) since they both have 5 characters
  • The test should pass if character counting was implemented correctly

Actual Behavior (Bug):

  • "doggy" (5 chars, 5 bytes) gets 1 typo tolerance
  • "собак" (5 chars, 10 bytes) gets 2 typos tolerance
  • The test FAILS with assertion error showing different typo counts: left: 1 right: 2
  • Output shows: ASCII 'doggy': byte_len=5, typos=1 and Cyrillic 'собак': byte_len=10, typos=2

Bug Location

  • File: ./crates/milli/src/search/new/query_term/parse_query.rs
  • Function: number_of_typos_allowed (lines 194-215)
  • Buggy Lines: 205 and 209 where word.len() is used instead of word.chars().count()

Impact

This bug affects all multi-byte Unicode text including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, accented characters, and emoji - causing them to receive incorrect typo tolerance in search operations.

@arthurgousset arthurgousset changed the title repro(5594): Cyrillic has different typo tolerance due to byte counting bug repro(5594): Cyrillic has different typo tolerance due to byte counting bug (DO NOT MERGE) May 29, 2025
@arthurgousset arthurgousset changed the title repro(5594): Cyrillic has different typo tolerance due to byte counting bug (DO NOT MERGE) repro(5594): Cyrillic has different typo tolerance due to byte counting bug May 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants