
Incompatible types in benchmarks.word_tokenization #1030

@bact

Description

MyPy reports several typing issues in pythainlp/benchmarks/word_tokenization.py.

Expected results

  • All functions have explicit type hints
  • No incompatible-type errors

Current results

For example, in these two lines ref_sample is seen as str, which has no shape attribute:

c_pos_pred = c_pos_pred[c_pos_pred < ref_sample.shape[0]]
c_neg_pred = c_neg_pred[c_neg_pred < ref_sample.shape[0]]
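As a rough illustration of what these two lines appear to do, the sketch below applies the same boolean-mask filter to made-up data. The array values are hypothetical and are not taken from the library; the point is only that the expression works when ref_sample is a NumPy array and that a plain str offers no shape attribute.

```python
import numpy as np

# Hypothetical values, for illustration only (not from pythainlp).
ref_sample = np.array([1, 0, 1, 0, 0])  # 0/1 flags, length 5
c_pos_pred = np.array([0, 2, 5, 7])     # candidate indices, some out of range

# Same pattern as in the issue: keep only indices inside ref_sample.
in_range = c_pos_pred[c_pos_pred < ref_sample.shape[0]]

# A plain str has no .shape attribute, which is what MyPy complains about.
str_has_shape = hasattr("10100", "shape")
```

Here in_range keeps only the indices 0 and 2, and str_has_shape is False.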

But judging from the _binary_representation function, it may actually be a NumPy ndarray.

However, the _binary_representation type hints and docstring say str:

def _binary_representation(txt: str, verbose: bool = False):
    """
    Transform text into {0, 1} sequence.
    where (1) indicates that the corresponding character is the beginning of
    a word. For example, ผม|ไม่|ชอบ|กิน|ผัก -> 10100...
    :param str txt: input text that we want to transform
    :param bool verbose: for debugging purposes
    :return: {0, 1} sequence
    :rtype: str
    """
    chars = np.array(list(txt))

So the declared types and the actual values disagree, and this needs to be fixed.
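One possible direction, sketched under assumptions: if the function really does return an array, the return annotation could say so explicitly. The function below is a hypothetical stand-in written only from the docstring's description (word boundaries marked by "|", 1 at the first character of each word). It is not the library's implementation, and the name binary_representation_sketch and the sep parameter are invented for illustration.

```python
import numpy as np
import numpy.typing as npt


def binary_representation_sketch(
    txt: str, sep: str = "|"
) -> npt.NDArray[np.int_]:
    """Return a 0/1 array with 1 at each character that begins a word.

    Hypothetical stand-in, not the pythainlp implementation.
    """
    flags: list[int] = []
    at_word_start = True  # the first character always begins a word
    for ch in txt:
        if ch == sep:
            # A separator means the next character begins a new word.
            at_word_start = True
            continue
        flags.append(1 if at_word_start else 0)
        at_word_start = False
    return np.array(flags, dtype=np.int_)


sample = binary_representation_sketch("ab|c|de")
```

With an ndarray return annotation like this, an expression such as sample.shape[0] should type-check, whereas it cannot when the value is declared as str.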

Steps to reproduce

Run MyPy on pythainlp/benchmarks/word_tokenization.py.

PyThaiNLP version

5

Python version

any

Operating system and version

any

More info

No response

Possible solution

No response

Files

No response
