This repository was archived by the owner on Jul 31, 2025. It is now read-only.
  
  
  
  
repro(5594): Cyrillic has different typo tolerance due to byte counting bug #7
Warning
DO NOT MERGE. These tests exist to reproduce a bug, not to be merged. They make it possible to verify that the patch actually fixes the bug. The intention is to share a "proof" that reduces the cognitive load on the reviewer and shows that the bug was correctly identified, reproduced, and patched.
Cyrillic Typo Tolerance Bug Successfully Reproduced
Root cause analysis: https://hyperdrive.engineering/#report-4722be5b-5b9d-4000-8383-71b5a4296231
Bug Description
Successfully reproduced the Cyrillic typo tolerance bug where the number_of_typos_allowed function in milli uses byte count (word.len()) instead of character count (word.chars().count()) to determine typo tolerance. This causes words with multi-byte Unicode characters (like Cyrillic) to receive incorrect typo tolerance compared to ASCII words with the same character count.
Commands to Reproduce Manually
Navigate to the milli directory:
cd crates/milli
Create a test file to reproduce the bug:
Run the test to reproduce the bug:
cargo test --test cyrillic_bug_test test_cyrillic_char_count_bug -- --nocapture
Expected vs Actual Behavior
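Expected Behavior:
'doggy' and 'собак' each have 5 characters, so both should be allowed the same number of typos.
Actual Behavior (Bug):
left: 1 right: 2
ASCII 'doggy': byte_len=5, typos=1 and Cyrillic 'собак': byte_len=10, typos=2
Bug Location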
./crates/milli/src/search/new/query_term/parse_query.rs
number_of_typos_allowed (lines 194-215)
word.len() is used instead of word.chars().count()
Impact
This bug affects all multi-byte Unicode text including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, accented characters, and emoji - causing them to receive incorrect typo tolerance in search operations.
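The difference between the two counting strategies can be sketched in isolation. The snippet below is not milli's code: the function names and the thresholds (5 characters for one typo, 9 for two, assumed to match Meilisearch's default minimum word sizes) are illustrative assumptions chosen so that the output lines up with the numbers reported above.

```rust
// Hypothetical thresholds, assumed to mirror Meilisearch's default
// minimum word sizes for one and two typos. Not taken from milli's source.
const ONE_TYPO_MIN_LEN: usize = 5;
const TWO_TYPOS_MIN_LEN: usize = 9;

/// Illustrative stand-in for a typo budget derived from a word length.
fn typos_for_len(len: usize) -> u8 {
    if len >= TWO_TYPOS_MIN_LEN {
        2
    } else if len >= ONE_TYPO_MIN_LEN {
        1
    } else {
        0
    }
}

/// Budget based on the UTF-8 byte length, the behaviour this report describes.
fn typos_by_bytes(word: &str) -> u8 {
    typos_for_len(word.len())
}

/// Budget based on the number of Unicode scalar values in the word.
fn typos_by_chars(word: &str) -> u8 {
    typos_for_len(word.chars().count())
}
```

Under these assumptions, typos_by_bytes("собак") returns 2 because the word occupies 10 bytes, while typos_by_chars("собак") returns 1, the same budget that both functions give for "doggy". Note that chars().count() counts Unicode scalar values, so text built from combining sequences may still differ from what a user perceives as a character.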