Norwegian is not supported #65

@emilmuller

I'm doing keyword extraction in Norwegian. Without Pattern installed, the extracted keywords include stop words. For example, take the first paragraph of the Norwegian Wikipedia article on Albert Einstein:

Albert Einstein var en tyskfødt teoretisk fysiker og nobelprisvinner som er mest kjent for å ha formulert relativitetsteorien og vist at masse og energi er ekvivalente ved masseenergiloven, E = mc2. Gjennom den spesielle relativitetsteorien revolusjonerte han mekanikken og presiserte tidsbegrepet. Han var sentral i utviklingen av kvantemekanikken og er grunnleggeren av moderne kosmologi. Han regnes for å være en av de mest betydningsfulle vitenskapsmenn i det 20. århundre.

Extracting keywords gives me the following:

  • i
  • og
  • han
  • hans
  • av
  • for å
  • ble
  • om
  • einstein var en
  • ved
  • som er mest
  • relativitetsteorien
  • det
  • fysikk
  • med
  • den
  • verden
  • verdens
  • enn
  • vitenskapelige
  • århundre
  • århundrets
  • person
  • første årene
  • professor

The words i, og, av, for, å, ble and om (among others) are stop words, so the result is unusable.
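
As a stopgap, the extracted phrases could be post-filtered against a Norwegian stop-word list. The sketch below is only an illustration: `NORWEGIAN_STOPWORDS` is a small set hand-picked for this example (not a complete list), and `strip_stopwords` is a hypothetical helper, not part of summa.

```python
# Small stop-word set hand-picked for this example (an assumption,
# not a complete Norwegian stop-word list).
NORWEGIAN_STOPWORDS = {
    "i", "og", "han", "hans", "av", "for", "å", "ble", "om", "var", "en",
    "ved", "som", "er", "mest", "det", "med", "den", "enn",
}


def strip_stopwords(phrases, stopwords=NORWEGIAN_STOPWORDS):
    """Remove stop words from each phrase; drop phrases that end up empty."""
    cleaned = []
    for phrase in phrases:
        # Keep only the words that are not in the stop-word set.
        kept = [w for w in phrase.split() if w.lower() not in stopwords]
        if kept:
            candidate = " ".join(kept)
            # Avoid duplicates while preserving the original order.
            if candidate not in cleaned:
                cleaned.append(candidate)
    return cleaned


extracted = ["i", "og", "for å", "einstein var en", "som er mest",
             "relativitetsteorien", "fysikk", "første årene"]
print(strip_stopwords(extracted))
# → ['einstein', 'relativitetsteorien', 'fysikk', 'første årene']
```

This only cleans up the output. The stop words still take part in the ranking, so it does not replace proper language support in the extraction step.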

With Pattern installed, importing summa fails with:

>>> from summa.summarizer import summarize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\__init__.py", line 1, in <module>
    from summa import commons, graph, keywords, pagerank_weighted, \
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\keywords.py", line 5, in <module>
    from .preprocessing.textcleaner import clean_text_by_word as _clean_text_by_word
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\preprocessing\textcleaner.py", line 8, in <module>
    from pattern.en import tag
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\wordnet\__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 673, in find
    return find(modified_name, paths)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

So I cannot use Pattern (see issue #30), which leaves Norwegian unusable and effectively unsupported. I assume the same applies to other languages.
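
The `BadZipFile` error is raised while `nltk.data.find` opens a corpus archive, so one possible cause (an assumption, not confirmed here) is a corrupted or incomplete `.zip` in a data directory. A minimal sketch for spotting such files, using only the standard library; `find_bad_zips` is a hypothetical helper and the directory below is a guess that varies per machine:

```python
import zipfile
from pathlib import Path


def find_bad_zips(directory):
    """Return the .zip files under `directory` that are not valid zip archives."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*.zip") if not zipfile.is_zipfile(str(p))
    )


if __name__ == "__main__":
    # Hypothetical location; the real data directory depends on the setup.
    data_dir = Path.home() / "nltk_data" / "corpora"
    if data_dir.is_dir():
        for path in find_bad_zips(data_dir):
            print("not a valid zip:", path)
```

If a file is reported, deleting it and downloading the corpus again may be worth trying before concluding that Pattern itself is broken.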
