Skip to content

Add SIMD-accelerated APIs#7

Open
ashvardanian wants to merge 1 commit intoMrPowers:masterfrom
ashvardanian:master
Open

Add SIMD-accelerated APIs#7
ashvardanian wants to merge 1 commit intoMrPowers:masterfrom
ashvardanian:master

Conversation

@ashvardanian
Copy link
Copy Markdown

This was indented as a small path upgrading from JellyFish to StringZilla to accelerate some of the slowest and frequently used string similarity measures. Along the way I've patched a few minor things.

  • Hamming and Levenshtein support SIMD and buffers.
  • Added docstrings for all APIs.
  • Fixed non-standard 5-char indent in functions.py.
  • Upgraded PyTest for compatibility with newer Pyhton.
  • Added pkg_resources for setuptools for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.

Example Jellyfish Levenshtein RapidFuzz EditDistance NLTK StringZilla (Unicode) StringZilla (Bytes)
0 apple vs aple 1 1 1 1 1 1 1
1 αβγδ vs αγδ 1 1 1 1 1 1 2
2 école vs école 1 2 2 2 2 2 3
3 Schön vs Schön 1 2 2 2 2 2 3
4 💖 vs 💗 1 1 1 1 1 1 1
5 𠜎 𠜱 𠝹 𠱓 vs 𠜎𠜱𠝹𠱓 3 3 3 3 3 3 3
6 München vs Muenchen 2 2 2 2 2 2 2
7 façade vs facade 1 1 1 1 1 1 2
8 こんにちは世界 vs こんばんは世界 2 2 2 2 2 2 3
9 👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦 1 1 1 1 1 1 1
10 Data科学123 vs Data科學321 3 3 3 3 3 3 3
11 🙂🌍🚀 vs 🙂🌎✨ 2 2 2 2 2 2 5

This patch introduces several SIMD-accelerated APIs
for strings and raw byte-arrays, compatible with
PySpark v2 and v3. In more detail:

- Hamming and Levenshtein support SIMD and buffers.
- Fixed non-standard 5-char indent in `functions.py`.
- Upgraded PyTest for compatibility with newer Pyhton.
- Added `pkg_resources` for `setuptools` for tests.

On typical English words StringZilla is 15x faster than
JellyFish on both x86 and Arm CPUs.
@MrPowers
Copy link
Copy Markdown
Owner

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

@ashvardanian
Copy link
Copy Markdown
Author

ashvardanian commented Feb 22, 2024

@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?


PS: There is also a portability issue I haven't referenced. Seems like jellyfish builds only 65 wheels, while today PyPi expects 105 targets. StringZilla publishes all of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants