diff --git a/README.md b/README.md
index b50037a..9729c34 100644
--- a/README.md
+++ b/README.md
@@ -302,6 +302,11 @@ Adjusting all implementation to the same tokenization scheme, one my experience
 | | 86.80% collisions | 93.21% collisions |
 | | 0.9992 entropy | 0.9967 entropy |
 
+The trickiest part, however, is analyzing the retrieval quality of those fingerprints and comparing them to other approaches.
+So, how many bits per fingerprint are needed to achieve a specific recall rate for a given dataset?
+Or, how does the average Levenshtein distance among the top-k nearest neighbors change with the fingerprint size?
+It must clearly decrease, but how fast, and how does that compare to ground truth?
+
 ## Replicating the Results
 
 ### Replicating the Results in Rust 🦀
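The questions added above imply a concrete evaluation loop. Below is a minimal, self-contained Rust sketch of that loop, not the library's actual method: the `fingerprint` function here is a placeholder that hashes character trigrams into a 64-bit sketch, standing in for the real fingerprinting scheme. It retrieves the top-k candidates by Hamming distance between fingerprints, measures their average Levenshtein distance to the query, and compares that against the exhaustive ground-truth top-k.

```rust
/// Classic dynamic-programming Levenshtein distance.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1).min(curr[j - 1] + 1).min(prev[j - 1] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Placeholder fingerprint: hash every character trigram into a 64-bit sketch.
/// The real evaluation would plug in the library's fingerprints and sweep
/// the number of bits instead of fixing it at 64.
fn fingerprint(s: &str) -> u64 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let chars: Vec<char> = s.chars().collect();
    let mut bits = 0u64;
    for window in chars.windows(3) {
        let mut hasher = DefaultHasher::new();
        window.hash(&mut hasher);
        bits |= 1 << (hasher.finish() % 64);
    }
    bits
}

/// Indices of the `k` entries with the smallest score.
fn top_k_by<F: Fn(usize) -> usize>(n: usize, k: usize, score: F) -> Vec<usize> {
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by_key(|&i| score(i));
    order.truncate(k);
    order
}

fn main() {
    let dataset = ["fingerprint", "fingerprints", "finger", "footprint", "paperclip", "fungerprint"];
    let query = "fingerprnt";
    let k = 3;

    let sketches: Vec<u64> = dataset.iter().map(|s| fingerprint(s)).collect();
    let query_sketch = fingerprint(query);

    // Approximate retrieval: rank by Hamming distance between fingerprints.
    let approx = top_k_by(dataset.len(), k, |i| (sketches[i] ^ query_sketch).count_ones() as usize);
    // Ground truth: rank by exact Levenshtein distance.
    let exact = top_k_by(dataset.len(), k, |i| levenshtein(dataset[i], query));

    // Average Levenshtein distance of a retrieved set to the query.
    let mean = |ids: &[usize]| {
        ids.iter().map(|&i| levenshtein(dataset[i], query)).sum::<usize>() as f64 / ids.len() as f64
    };
    println!("avg Levenshtein of approximate top-{k}: {:.2}", mean(&approx));
    println!("avg Levenshtein of exact top-{k}:       {:.2}", mean(&exact));

    // Recall@k: how many of the true nearest neighbors the fingerprints found.
    let recall = approx.iter().filter(|&&i| exact.contains(&i)).count() as f64 / k as f64;
    println!("recall@{k}: {:.2}", recall);
}
```

Sweeping the sketch width (fixed at 64 bits here) over a real dataset would yield the recall-versus-bits and distance-versus-fingerprint-size curves that the added README text asks about.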