tokenize() issue with SPACE_OR_PUNCTUATION and quotes #309

@clibu

Description

SPACE_OR_PUNCTUATION has changed at some point and is now the regexp: /[\n\r\p{Z}\p{P}]+/u
This matches the quotation characters " and ', which produces, for me at least, unwanted and incorrect results, because \p{P} matches all Unicode punctuation characters.

For example: song's is now tokenized as "song" and "s", so every lone 's' character in a document matches. Further, documents which don't include "song" but do include "s" match. You can see this using: Demo
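The behaviour above is easy to reproduce with the regexp in isolation; a minimal sketch of a split-based tokenizer, assuming tokens are simply the substrings between separator runs:

```javascript
// \p{P} includes the apostrophe and quotation marks, so "song's"
// is split into two tokens.
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u;

const tokenize = (text) =>
  text.split(SPACE_OR_PUNCTUATION).filter((t) => t.length > 0);

console.log(tokenize("song's"));
// → ["song", "s"]
console.log(tokenize('He said: "hello, world"'));
// → ["He", "said", "hello", "world"]
```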

The older SPACE_OR_PUNCTUATION regexp did not use the Unicode categories, so it did not match " and ' etc.

From reading: Unicode Character Categories, I can't see how any of the punctuation categories can safely be used for SPACE_OR_PUNCTUATION. That said, I hadn't come across these categories before now.
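One possible workaround, as a sketch rather than a proposed library fix: instead of splitting on punctuation, extract runs of letters and digits plus the apostrophe characters ' and ’, so intra-word apostrophes survive while quotes and other punctuation still act as separators. The function name here is hypothetical, not part of the library's API:

```javascript
// Keep letters, digits, and apostrophes together as one token;
// everything else (spaces, quotes, commas, ...) separates tokens.
const WORD_WITH_APOSTROPHE = /[\p{L}\p{N}'’]+/gu;

const tokenizeKeepApostrophes = (text) =>
  text.match(WORD_WITH_APOSTROPHE) ?? [];

console.log(tokenizeKeepApostrophes('song\'s, "quoted" text'));
// → ["song's", "quoted", "text"]
```

A trade-off of this approach is that leading/trailing apostrophes (e.g. 'quoted') are kept inside the token, so a real fix might trim them afterwards.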
