Add SIMD-accelerated APIs by ashvardanian · Pull Request #7 · MrPowers/ceja

ashvardanian · 2024-02-19T04:25:44Z

This was intended as a small patch upgrading from JellyFish to StringZilla to accelerate some of the slowest and most frequently used string similarity measures. Along the way I've patched a few minor things.

Hamming and Levenshtein support SIMD and buffers.
Added docstrings for all APIs.
Fixed non-standard 5-char indent in functions.py.
Upgraded PyTest for compatibility with newer Python.
Added pkg_resources for setuptools for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.

	Example	Jellyfish	Levenshtein	RapidFuzz	EditDistance	NLTK	StringZilla (Unicode)	StringZilla (Bytes)
0	apple vs aple	1	1	1	1	1	1	1
1	αβγδ vs αγδ	1	1	1	1	1	1	2
2	école vs école	1	2	2	2	2	2	3
3	Schön vs Schön	1	2	2	2	2	2	3
4	💖 vs 💗	1	1	1	1	1	1	1
5	𠜎𠜱𠝹𠱓 vs 𠜎𠜱𠝹𠱓	3	3	3	3	3	3	3
6	München vs Muenchen	2	2	2	2	2	2	2
7	façade vs facade	1	1	1	1	1	1	2
8	こんにちは世界 vs こんばんは世界	2	2	2	2	2	2	3
9	👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦	1	1	1	1	1	1	1
10	Data科学123 vs Data科學321	3	3	3	3	3	3	3
11	🙂🌍🚀 vs 🙂🌎✨	2	2	2	2	2	2	5
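The gap between the "Unicode" and "Bytes" columns can be reproduced without any of the libraries above. The sketch below is a plain dynamic-programming Levenshtein distance, not StringZilla's SIMD implementation; the helper name `levenshtein` is hypothetical. It runs once over code points (`str`) and once over UTF-8 bytes (`bytes`). It also assumes that the two visually identical strings in rows 2 and 3 differ only in Unicode normalization form (NFC vs NFD), which is consistent with the reported distances but is not stated in the table.

```python
import unicodedata


def levenshtein(a, b):
    """Classic two-row Levenshtein distance over any pair of sequences.

    Works on `str` (compares code points) and on `bytes` (compares bytes).
    Illustrative reference code only, not how StringZilla computes it.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def both_distances(first, second):
    """Return (code-point distance, UTF-8 byte distance) for two strings."""
    return (
        levenshtein(first, second),
        levenshtein(first.encode("utf-8"), second.encode("utf-8")),
    )


# Assumption: rows 2 and 3 compare a precomposed (NFC) string against
# its decomposed (NFD) form.
ecole_nfc = unicodedata.normalize("NFC", "\u00e9cole")
ecole_nfd = unicodedata.normalize("NFD", "\u00e9cole")

print(both_distances("apple", "aple"))
print(both_distances("αβγδ", "αγδ"))
print(both_distances("fa\u00e7ade", "facade"))
print(both_distances(ecole_nfc, ecole_nfd))
```

Deleting `β` is one edit over code points but two over bytes, because `β` occupies two bytes in UTF-8. The same effect explains the larger byte-level values for the accented and Japanese examples.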

This patch introduces several SIMD-accelerated APIs for strings and raw byte-arrays, compatible with PySpark v2 and v3. In more detail:

- Hamming and Levenshtein support SIMD and buffers.
- Fixed non-standard 5-char indent in `functions.py`.
- Upgraded PyTest for compatibility with newer Python.
- Added `pkg_resources` for `setuptools` for tests.

On typical English words StringZilla is 15x faster than JellyFish on both x86 and Arm CPUs.

MrPowers · 2024-02-21T14:03:22Z

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

ashvardanian · 2024-02-22T20:47:18Z

@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?

PS: There is also a portability issue I haven't mentioned. It seems jellyfish builds only 65 wheels, while today PyPI expects 105 targets. StringZilla publishes all of them.
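One way to get a benchmark specific to a given workload is a small `timeit` harness that accepts any set of distance callables. The sketch below is hypothetical and not part of this PR: `bench` and `reference_levenshtein` are illustrative names, and the pure-Python function is only a stand-in so the snippet runs without third-party packages. In practice one would pass the callables under comparison, for example `jellyfish.levenshtein_distance` and StringZilla's edit-distance function, assuming both are available in the installed versions.

```python
import timeit


def reference_levenshtein(a, b):
    """Pure-Python baseline; a stand-in for the library functions under test."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def bench(functions, pairs, repeat=5, number=100):
    """Return the best observed seconds per call for each named callable.

    `functions` maps a label to a callable taking two strings;
    `pairs` is a list of (first, second) string tuples.
    """
    results = {}
    for name, fn in functions.items():
        timer = timeit.Timer(lambda fn=fn: [fn(a, b) for a, b in pairs])
        best = min(timer.repeat(repeat=repeat, number=number))
        results[name] = best / (number * len(pairs))
    return results


if __name__ == "__main__":
    word_pairs = [("apple", "aple"), ("München", "Muenchen"), ("façade", "facade")]
    timings = bench({"pure-python": reference_levenshtein}, word_pairs)
    for label, seconds in timings.items():
        print(f"{label}: {seconds * 1e6:.2f} µs per call")
```

Taking the minimum over repeats reduces noise from other processes. The word pairs should be replaced with strings representative of the real workload, since relative speedups can vary with string length.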

ashvardanian wants to merge 1 commit into MrPowers:master from ashvardanian:master
