@@ -221,19 +221,28 @@ Performing in-place lookups in a precomputed table of 256 bytes:
221221
222222Edit Distance calculation is a common component of Search Engines, Data Cleaning, Natural Language Processing, and Bioinformatics.
223223It's a computationally expensive operation, generally implemented using dynamic programming, with a quadratic time complexity upper bound.
224+ For biological sequences, the Needleman-Wunsch and Smith-Waterman algorithms are more appropriate, as they allow overriding the default substitution costs.
225+ Each of those has two flavors: one with linear gap penalties and one with affine gap penalties, the latter also known as the "Gotoh" variation.
226+
227+ - byte-level and Unicode [Levenshtein](#levenshtein) distance;
228+ - [Needleman-Wunsch](#needleman-wunsch), [Needleman-Wunsch-Gotoh](#needleman-wunsch-gotoh);
229+ - [Smith-Waterman](#smith-waterman), [Smith-Waterman-Gotoh](#smith-waterman-gotoh).
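
To make the difference between these variants concrete, here is a minimal pure-Python sketch of the three global-alignment recurrences: unit-cost Levenshtein, Needleman-Wunsch with a caller-supplied substitution cost and a linear gap penalty, and the Gotoh variation with separate gap-open and gap-extend penalties. The function names, parameter names, and default penalties are illustrative assumptions, not the API of any benchmarked library, and the code favors readability over speed.

```python
NEG_INF = float("-inf")


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance using a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def needleman_wunsch(a, b, substitution, gap=-1):
    """Global alignment score with a linear gap penalty (hypothetical helper)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            current.append(max(
                previous[j] + gap,                       # gap in `b`
                current[j - 1] + gap,                    # gap in `a`
                previous[j - 1] + substitution(ca, cb),  # aligned pair
            ))
        previous = current
    return previous[-1]


def needleman_wunsch_gotoh(a, b, substitution, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gaps (hypothetical helper).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n = len(b)
    # Row 0: the empty prefix of `a` aligned against one long gap.
    score = [0] + [gap_open + (j - 1) * gap_extend for j in range(1, n + 1)]
    vertical = [NEG_INF] * (n + 1)  # best score per column ending in a gap in `b`
    for i, ca in enumerate(a, start=1):
        diagonal = score[0]                        # value from row i-1, column 0
        score[0] = gap_open + (i - 1) * gap_extend
        horizontal = NEG_INF                       # best score ending in a gap in `a`
        for j, cb in enumerate(b, start=1):
            # score[j] still holds row i-1 here; score[j-1] already holds row i.
            vertical[j] = max(score[j] + gap_open, vertical[j] + gap_extend)
            horizontal = max(score[j - 1] + gap_open, horizontal + gap_extend)
            best = max(diagonal + substitution(ca, cb), vertical[j], horizontal)
            diagonal = score[j]                    # becomes the diagonal for column j+1
            score[j] = best
    return score[n]


def unit_cost(x, y):
    """Substitution scores that mirror Levenshtein: 0 for a match, -1 otherwise."""
    return 0 if x == y else -1


def match_mismatch(x, y):
    """A simple +1 / -1 scoring scheme, used only for illustration."""
    return 1 if x == y else -1
```

With `unit_cost` and `gap=-1`, the Needleman-Wunsch score is the negated Levenshtein distance, and setting `gap_open == gap_extend` in the Gotoh version reduces it to the linear-gap case. Each inner-loop iteration fills one cell of the `len(a) × len(b)` matrix, which is presumably the unit counted by the MCUPS (millions of cell updates per second) figures in the tables.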
230+
231+ ### Levenshtein
224232
225233| Library | ≅ 100 bytes lines | ≅ 1'000 bytes lines |
226234| ---------------------------------------------------- | ----------------: | ------------------: |
227235| Rust 🦀 | | |
228- | ` rapidfuzz::levenshtein<Bytes> ` | 4'633 MCUPS | 14'316 MCUPS |
236+ | ` bio::levenshtein ` on 1x SPR | 428 MCUPS | 823 MCUPS |
237+ | ` rapidfuzz::levenshtein<Bytes> ` on 1x SPR | 4'633 MCUPS | 14'316 MCUPS |
238+ | ` rapidfuzz::levenshtein<Chars> ` on 1x SPR | 3'877 MCUPS | 13'179 MCUPS |
229239| ` stringzillas::LevenshteinDistances ` on 1x SPR | 3'315 MCUPS | 13'084 MCUPS |
240+ | ` stringzillas::LevenshteinDistancesUtf8 ` on 1x SPR | 3'283 MCUPS | 11'690 MCUPS |
230241| ` stringzillas::LevenshteinDistances ` on 16x SPR | 29'430 MCUPS | 105'400 MCUPS |
242+ | ` stringzillas::LevenshteinDistancesUtf8 ` on 16x SPR | 38'954 MCUPS | 103'500 MCUPS |
231243| ` stringzillas::LevenshteinDistances ` on RTX6000 | __ 32'030 MCUPS__ | __ 901'990 MCUPS__ |
232244| ` stringzillas::LevenshteinDistances ` on H100 | __ 31'913 MCUPS__ | __ 925'890 MCUPS__ |
233245| ` stringzillas::LevenshteinDistances ` on 384x GNR | __ 114'190 MCUPS__ | __ 3'084'270 MCUPS__ |
234- | ` rapidfuzz::levenshtein<Chars> ` | 3'877 MCUPS | 13'179 MCUPS |
235- | ` stringzillas::LevenshteinDistancesUtf8 ` on 1x SPR | 3'283 MCUPS | 11'690 MCUPS |
236- | ` stringzillas::LevenshteinDistancesUtf8 ` on 16x SPR | 38'954 MCUPS | 103'500 MCUPS |
237246| ` stringzillas::LevenshteinDistancesUtf8 ` on 384x GNR | __ 103'590 MCUPS__ | __ 2'938'320 MCUPS__ |
238247| | | |
239248| Python 🐍 | | |
@@ -250,42 +259,61 @@ It's a computationally expensive operation, generally implemented using dynamic
250259| ` stringzillas.LevenshteinDistances ` batch on 16x SPR | 3'762 MCUPS | 119'261 MCUPS |
251260| ` stringzillas.LevenshteinDistances ` batch on H100 | __ 18'081 MCUPS__ | __ 320'109 MCUPS__ |
252261
253-
254- For biological sequences, the Needleman-Wunsch and Smith-Waterman algorithms are more appropriate, as they allow overriding the default substitution costs.
255- Another common adaptation is to used Gotoh's affine gap penalties, which better model the evolutionary events in DNA and Protein sequences.
262+ ### Needleman-Wunsch
256263
257264| Library | ≅ 100 bytes lines | ≅ 1'000 bytes lines |
258265| ----------------------------------------------------- | ----------------: | ------------------: |
259- | Rust 🦀 with linear gaps | |
266+ | Rust 🦀 | | |
267+ | ` bio::pairwise::global ` on 1x SPR | 51 MCUPS | 57 MCUPS |
260268| ` stringzillas::NeedlemanWunschScores ` on 1x SPR | 278 MCUPS | 612 MCUPS |
261269| ` stringzillas::NeedlemanWunschScores ` on 16x SPR | 4'057 MCUPS | 8'492 MCUPS |
262270| ` stringzillas::NeedlemanWunschScores ` on 384x GNR | __ 64'290 MCUPS__ | __ 331'340 MCUPS__ |
263271| ` stringzillas::NeedlemanWunschScores ` on H100 | 131 MCUPS | __ 12'113 MCUPS__ |
264- | ` stringzillas::SmithWatermanScores ` on 1x SPR | 263 MCUPS | 552 MCUPS |
265- | ` stringzillas::SmithWatermanScores ` on 16x SPR | 3'883 MCUPS | 8'011 MCUPS |
266- | ` stringzillas::SmithWatermanScores ` on 384x GNR | __ 58'880 MCUPS__ | __ 285'480 MCUPS__ |
267- | ` stringzillas::SmithWatermanScores ` on H100 | 143 MCUPS | __ 12'921 MCUPS__ |
268272| | | |
269- | Python 🐍 with linear gaps | | |
273+ | Python 🐍 | | |
270274| ` biopython.PairwiseAligner.score ` on 1x SPR | 95 MCUPS | 557 MCUPS |
271275| ` stringzillas.NeedlemanWunschScores ` on 1x SPR | 30 MCUPS | 481 MCUPS |
272276| ` stringzillas.NeedlemanWunschScores ` batch on 1x SPR | 246 MCUPS | 570 MCUPS |
273277| ` stringzillas.NeedlemanWunschScores ` batch on 16x SPR | 3'103 MCUPS | 9'208 MCUPS |
274278| ` stringzillas.NeedlemanWunschScores ` batch on H100 | 127 MCUPS | 12'246 MCUPS |
275- | ` stringzillas.SmithWatermanScores ` on 1x SPR | 28 MCUPS | 440 MCUPS |
276- | ` stringzillas.SmithWatermanScores ` batch on 1x SPR | 255 MCUPS | 582 MCUPS |
277- | ` stringzillas.SmithWatermanScores ` batch on 16x SPR | __ 3'535 MCUPS__ | 8'235 MCUPS |
278- | ` stringzillas.SmithWatermanScores ` batch on H100 | 130 MCUPS | __ 12'702 MCUPS__ |
279- | | | |
280- | Rust 🦀 with affine gaps | | |
281- | ` stringzillas::NeedlemanWunschScores ` on 1x SPR | 83 MCUPS | 354 MCUPS |
282- | ` stringzillas::NeedlemanWunschScores ` on 16x SPR | 1'267 MCUPS | 4'694 MCUPS |
283- | ` stringzillas::NeedlemanWunschScores ` on 384x GNR | __ 42'050 MCUPS__ | __ 155'920 MCUPS__ |
284- | ` stringzillas::NeedlemanWunschScores ` on H100 | 128 MCUPS | __ 13'799 MCUPS__ |
285- | ` stringzillas::SmithWatermanScores ` on 1x SPR | 79 MCUPS | 284 MCUPS |
286- | ` stringzillas::SmithWatermanScores ` on 16x SPR | 1'026 MCUPS | 3'776 MCUPS |
287- | ` stringzillas::SmithWatermanScores ` on 384x GNR | __ 38'430 MCUPS__ | __ 129'140 MCUPS__ |
288- | ` stringzillas::SmithWatermanScores ` on H100 | 127 MCUPS | __ 13'205 MCUPS__ |
279+
280+ ### Smith-Waterman
281+
282+ | Library | ≅ 100 bytes lines | ≅ 1'000 bytes lines |
283+ | --------------------------------------------------- | ----------------: | ------------------: |
284+ | Rust 🦀 | | |
285+ | ` bio::pairwise::local ` on 1x SPR | 49 MCUPS | 50 MCUPS |
286+ | ` stringzillas::SmithWatermanScores ` on 1x SPR | 263 MCUPS | 552 MCUPS |
287+ | ` stringzillas::SmithWatermanScores ` on 16x SPR | 3'883 MCUPS | 8'011 MCUPS |
288+ | ` stringzillas::SmithWatermanScores ` on 384x GNR | __ 58'880 MCUPS__ | __ 285'480 MCUPS__ |
289+ | ` stringzillas::SmithWatermanScores ` on H100 | 143 MCUPS | __ 12'921 MCUPS__ |
290+ | | | |
291+ | Python 🐍 | | |
292+ | ` biopython.PairwiseAligner.score ` on 1x SPR | 95 MCUPS | 557 MCUPS |
293+ | ` stringzillas.SmithWatermanScores ` on 1x SPR | 28 MCUPS | 440 MCUPS |
294+ | ` stringzillas.SmithWatermanScores ` batch on 1x SPR | 255 MCUPS | 582 MCUPS |
295+ | ` stringzillas.SmithWatermanScores ` batch on 16x SPR | __ 3'535 MCUPS__ | 8'235 MCUPS |
296+ | ` stringzillas.SmithWatermanScores ` batch on H100 | 130 MCUPS | __ 12'702 MCUPS__ |
297+
298+ ### Needleman-Wunsch-Gotoh
299+
300+ | Library | ≅ 100 bytes lines | ≅ 1'000 bytes lines |
301+ | ------------------------------------------------- | ----------------: | ------------------: |
302+ | Rust 🦀 | | |
303+ | ` stringzillas::NeedlemanWunschScores ` on 1x SPR | 83 MCUPS | 354 MCUPS |
304+ | ` stringzillas::NeedlemanWunschScores ` on 16x SPR | 1'267 MCUPS | 4'694 MCUPS |
305+ | ` stringzillas::NeedlemanWunschScores ` on 384x GNR | __ 42'050 MCUPS__ | __ 155'920 MCUPS__ |
306+ | ` stringzillas::NeedlemanWunschScores ` on H100 | 128 MCUPS | __ 13'799 MCUPS__ |
307+
308+ ### Smith-Waterman-Gotoh
309+
310+ | Library | ≅ 100 bytes lines | ≅ 1'000 bytes lines |
311+ | ----------------------------------------------- | ----------------: | ------------------: |
312+ | Rust 🦀 | | |
313+ | ` stringzillas::SmithWatermanScores ` on 1x SPR | 79 MCUPS | 284 MCUPS |
314+ | ` stringzillas::SmithWatermanScores ` on 16x SPR | 1'026 MCUPS | 3'776 MCUPS |
315+ | ` stringzillas::SmithWatermanScores ` on 384x GNR | __ 38'430 MCUPS__ | __ 129'140 MCUPS__ |
316+ | ` stringzillas::SmithWatermanScores ` on H100 | 127 MCUPS | __ 13'205 MCUPS__ |
289317
290318## Byte-level Fingerprinting & Sketching Benchmarks
291319