Open
This patch introduces several SIMD-accelerated APIs for strings and raw byte arrays, compatible with PySpark v2 and v3. In more detail:

- Hamming and Levenshtein support SIMD and buffers.
- Fixed non-standard 5-char indent in `functions.py`.
- Upgraded PyTest for compatibility with newer Python.
- Added `pkg_resources` for `setuptools` for tests.

On typical English words StringZilla is 15x faster than JellyFish on both x86 and Arm CPUs.
Owner
@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?
Author
@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at Is there some specific benchmark you have in mind? PS: There is also a portability issue I haven't referenced. Seems like
This was intended as a small patch upgrading from JellyFish to StringZilla to accelerate some of the slowest and most frequently used string similarity measures. Along the way I've patched a few minor things.
- Fixed non-standard 5-char indent in `functions.py`.
- Added `pkg_resources` for `setuptools` for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.
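One way to see why packages can disagree on non-ASCII input is that the same pair of strings has a different edit distance depending on whether it is compared by Unicode code point or by UTF-8 byte. The sketch below is a plain-Python reference, not the StringZilla or JellyFish implementation; the `levenshtein` helper and the sample words are assumptions chosen for illustration only.

```python
def levenshtein(a, b):
    """Reference Wagner-Fischer edit distance over any two sequences.

    Hypothetical helper for illustration; works on `str` (code points)
    and on `bytes` (individual byte values) alike.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


left, right = "café", "cafe"

# Compared as code points, only the last character differs.
by_code_point = levenshtein(left, right)

# Compared as UTF-8 bytes, "é" occupies two bytes, so one substitution
# and one deletion are needed.
by_utf8_byte = levenshtein(left.encode("utf-8"), right.encode("utf-8"))

print(by_code_point, by_utf8_byte)
```

A reference like this is slow, but it gives a fixed expectation to check any accelerated implementation against when strings contain multi-byte characters.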