tokenize() issue with SPACE_OR_PUNCTUATION and quotes #309

@clibu

Description

SPACE_OR_PUNCTUATION has changed at some point and is now the regexp: /[\n\r\p{Z}\p{P}]+/u
This matches the quotation characters " and ', which produces, for me at least, unwanted and incorrect results, because \p{P} matches all Unicode punctuation characters.

For example: song's is now tokenized as "song" and "s", so every lone 's' character in a document matches. Further, documents which don't include "song" but do include "s" match. You can see this using: Demo
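The behaviour above is easy to reproduce with the regexp in isolation; a minimal sketch of a split-based tokenizer, assuming tokens are simply the substrings between separator runs:

```javascript
// \p{P} includes the apostrophe and quotation marks, so "song's"
// is split into two tokens.
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u;

const tokenize = (text) =>
  text.split(SPACE_OR_PUNCTUATION).filter((t) => t.length > 0);

console.log(tokenize("song's"));
// → ["song", "s"]
console.log(tokenize('He said: "hello, world"'));
// → ["He", "said", "hello", "world"]
```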

The older SPACE_OR_PUNCTUATION regexp did not use the Unicode categories, so it did not match " and ' etc.

From reading: Unicode Character Categories, I can't see how any of the punctuation categories can safely be used for SPACE_OR_PUNCTUATION. That said, I hadn't come across these categories before now.
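One possible workaround, as a sketch rather than a proposed library fix: instead of splitting on punctuation, extract runs of letters and digits plus the apostrophe characters ' and ’, so intra-word apostrophes survive while quotes and other punctuation still act as separators. The function name here is hypothetical, not part of the library's API:

```javascript
// Keep letters, digits, and apostrophes together as one token;
// everything else (spaces, quotes, commas, ...) separates tokens.
const WORD_WITH_APOSTROPHE = /[\p{L}\p{N}'’]+/gu;

const tokenizeKeepApostrophes = (text) =>
  text.match(WORD_WITH_APOSTROPHE) ?? [];

console.log(tokenizeKeepApostrophes('song\'s, "quoted" text'));
// → ["song's", "quoted", "text"]
```

A trade-off of this approach is that leading/trailing apostrophes (e.g. 'quoted') are kept inside the token, so a real fix might trim them afterwards.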
