@farhan-syah farhan-syah commented Dec 2, 2025

Summary

  • Switch default regex backend from PCRE2 to regexr (pure Rust with JIT/SIMD)
  • Make PCRE2 an optional feature (pcre2) for benchmarking comparisons
  • Use regexr from crates.io (0.1.0-beta.2) instead of path dependency
  • Add benchmark scripts for comparing regexr vs PCRE2 performance

Changes

  • Refactored tokenizer.rs to use regexr as the default backend
  • Added .cargo/ to .gitignore for local dev overrides
  • Added benchmark visualization scripts for backend comparison

Test plan

  • Run cargo test to verify all tests pass
  • Run maturin develop --release and test Python bindings
  • Run benchmark scripts to compare performance

Switch from PCRE2 to regexr as the default regex backend, making splintr
a pure-Rust tokenizer with no C dependencies. PCRE2 remains available as
an optional backend via the 'pcre2' feature flag.

Key changes:
- Regexr backend: Pure Rust with JIT compilation and SIMD acceleration
- PCRE2 backend: Now optional, enabled via --features pcre2
- Runtime switching: Added .pcre2() method to switch backends
- Unified implementation: Merged Tokenizer and TokenizerRegexr into a
  single Tokenizer class with a RegexBackend enum
- Documentation: Updated README, API docs, and benchmarks to reflect
  new default backend
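
As a rough illustration of the unified design listed above, the sketch below shows one `Tokenizer` type that owns a `RegexBackend` enum and switches engines through a builder-style `.pcre2()` call. The stand-in types `RegexrPattern` and `Pcre2Pattern`, the `backend_name` helper, and the exact signature of `pcre2()` are assumptions made for illustration; the real types come from the regexr and pcre2 crates, and in splintr the PCRE2 path sits behind the `pcre2` feature flag.

```rust
// Hypothetical stand-ins for each engine's compiled pattern type; the real
// types come from the regexr and pcre2 crates and are not shown in this PR.
#[allow(dead_code)]
struct RegexrPattern(String);
#[allow(dead_code)]
struct Pcre2Pattern(String);

/// One enum instead of two tokenizer types: each variant owns a compiled pattern.
enum RegexBackend {
    Regexr(RegexrPattern),
    Pcre2(Pcre2Pattern),
}

/// Single tokenizer type that dispatches on its backend.
struct Tokenizer {
    pattern: String,
    backend: RegexBackend,
}

impl Tokenizer {
    /// Builds a tokenizer on the default (regexr) backend.
    fn new(pattern: &str) -> Self {
        Tokenizer {
            pattern: pattern.to_string(),
            backend: RegexBackend::Regexr(RegexrPattern(pattern.to_string())),
        }
    }

    /// Builder-style switch to PCRE2 (assumed signature); reuses the stored pattern.
    fn pcre2(mut self) -> Self {
        self.backend = RegexBackend::Pcre2(Pcre2Pattern(self.pattern.clone()));
        self
    }

    /// Reports which engine this tokenizer currently uses (illustrative helper).
    fn backend_name(&self) -> &'static str {
        match self.backend {
            RegexBackend::Regexr(_) => "regexr",
            RegexBackend::Pcre2(_) => "pcre2",
        }
    }
}
```

With this shape, `Tokenizer::new(pattern)` yields the regexr backend and `Tokenizer::new(pattern).pcre2()` yields the PCRE2 one, so callers keep a single type regardless of engine.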

Benchmarking tools:
- benchmark_regexr_comparison.py: Compare regexr vs PCRE2 performance
- benchmark_regexr_viz.py: Visual comparison with charts

This change eliminates C dependencies while maintaining performance
through regexr's JIT and SIMD optimizations. Users requiring PCRE2
can opt in via feature flags or runtime switching.

Switch regexr dependency from local path to published version 0.1.0-beta.2
on crates.io. Add .cargo/ to .gitignore to support local development
overrides via cargo config patches without committing them.

This enables publishing splintr while maintaining flexibility for local
development with unpublished regexr changes.

Configure regexr with conditional compilation:
- Unix platforms: enable jit + simd features
- Windows: disable jit feature (simd only)

This prevents ABI crashes on Windows x86_64 where JIT-compiled code
causes segmentation faults. The platform-specific dependency ensures
JIT is only enabled where it works reliably.
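
The change described here lives in the dependency manifest, but the same platform gate can be sketched at source level. The snippet below is only an analogue under that assumption: `jit_enabled` and `enabled_features` are hypothetical names, not part of splintr or regexr.

```rust
/// Hypothetical helper: true only on platforms where the JIT is treated as safe.
#[cfg(unix)]
fn jit_enabled() -> bool {
    true
}

/// On non-Unix targets (including Windows) the JIT stays off.
#[cfg(not(unix))]
fn jit_enabled() -> bool {
    false
}

/// Lists the acceleration features this build would use (illustrative only).
fn enabled_features() -> Vec<&'static str> {
    // SIMD is kept on every platform; JIT is added only where the gate allows it.
    let mut features = vec!["simd"];
    if jit_enabled() {
        features.push("jit");
    }
    features
}
```

Because the gate is resolved at compile time, the unsupported branch is not compiled into a Windows build at all.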

Bump regexr to 0.1.0-beta.3 for both targets.

Replace per-test tokenizer construction with static LazyLock instances.
Each test file now creates the tokenizer once on first access instead
of reconstructing it for every test function.

This optimization reduces test suite execution time from 60+ seconds
to under 1 second by amortizing expensive regex compilation and
vocabulary loading across all tests in each file.
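
A minimal sketch of the shared-instance pattern described above, assuming a toolchain where `std::sync::LazyLock` is stable (Rust 1.80 or later). The `Tokenizer` struct, its `vocab_size` field, and the `BUILD_COUNT` counter are placeholders for illustration; the real test helpers load a vocabulary and compile regexes instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts constructions so the "built once" property can be observed.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for the real tokenizer, which is expensive to construct.
struct Tokenizer {
    vocab_size: usize,
}

/// Implementation function: performs the expensive construction.
fn build_tokenizer() -> Tokenizer {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    Tokenizer { vocab_size: 3 }
}

/// Shared instance, initialized on first access and reused afterwards.
static TOKENIZER: LazyLock<Tokenizer> = LazyLock::new(build_tokenizer);

/// Accessor function: every test borrows the same instance.
fn tokenizer() -> &'static Tokenizer {
    &TOKENIZER
}
```

Each test calls `tokenizer()` rather than constructing its own instance, so the construction cost is paid once per test binary. `LazyLock` also synchronizes initialization, which matters because the Rust test harness runs tests on multiple threads by default.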

Changes:
- Add LazyLock static for shared tokenizer instance
- Split helper into accessor and implementation functions
- Preserve existing API for variant-specific tests

Move regexr dependency from platform-specific targets to main dependencies
section where it belongs, and update to version 0.1.0-beta.4. This fixes
the crate not being linked properly.

Box the RegexrRegex variant in RegexBackend enum to resolve clippy warning
about large enum variant size difference (2912 bytes vs 64 bytes).
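
To see why boxing helps, the sketch below compares two enums whose payload sizes mirror the figures in the clippy warning. `LargeState` and `SmallState` are hypothetical byte arrays standing in for the real regex types; only the sizes are taken from the message above.

```rust
use std::mem::size_of;

/// Stand-in for the large variant payload (2912 bytes, per the clippy warning).
#[allow(dead_code)]
struct LargeState([u8; 2912]);

/// Stand-in for the small variant payload (64 bytes).
#[allow(dead_code)]
struct SmallState([u8; 64]);

/// Every value is as large as the biggest variant, even when it holds the small one.
#[allow(dead_code)]
enum Unboxed {
    Large(LargeState),
    Small(SmallState),
}

/// Boxing moves the large payload to the heap, leaving a pointer inline.
#[allow(dead_code)]
enum Boxed {
    Large(Box<LargeState>),
    Small(SmallState),
}

/// Returns (unboxed, boxed) enum sizes in bytes for comparison.
fn enum_sizes() -> (usize, usize) {
    (size_of::<Unboxed>(), size_of::<Boxed>())
}
```

The trade-off is one heap allocation and one pointer indirection for the boxed variant, in exchange for a much smaller enum that is cheaper to move and store.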
@farhan-syah farhan-syah merged commit 5d376ce into main Dec 2, 2025
5 checks passed
@farhan-syah farhan-syah deleted the feat/regexr-default-backend branch December 2, 2025 17:50