Description
I'm doing keyword extraction in Norwegian. Without Pattern installed, the extracted keywords include stop words. For example, if I extract the keywords from the first paragraph of the Norwegian Wikipedia article on Albert Einstein:
Albert Einstein var en tyskfødt teoretisk fysiker og nobelprisvinner som er mest kjent for å ha formulert relativitetsteorien og vist at masse og energi er ekvivalente ved masseenergiloven, E = mc2. Gjennom den spesielle relativitetsteorien revolusjonerte han mekanikken og presiserte tidsbegrepet. Han var sentral i utviklingen av kvantemekanikken og er grunnleggeren av moderne kosmologi. Han regnes for å være en av de mest betydningsfulle vitenskapsmenn i det 20. århundre.
I get the following keywords:
- i
- og
- han
- hans
- av
- for å
- ble
- om
- einstein var en
- ved
- som er mest
- relativitetsteorien
- det
- fysikk
- med
- den
- verden
- verdens
- enn
- vitenskapelige
- århundre
- århundrets
- person
- første årene
- professor
The words i, og, av, for, å, ble, om, etc. are stop words, so the result is unusable.
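One possible stopgap that does not depend on Pattern is to post-filter the returned keywords against a stop word list. The sketch below is only an illustration: `strip_stopwords` is a hypothetical helper, not part of summa, and `NORWEGIAN_STOPWORDS` is a small hand-picked set based on the words above, not a complete or authoritative list.

```python
# Hypothetical, hand-picked sample of Norwegian stop words (assumption:
# a real list would be much longer and come from a maintained source).
NORWEGIAN_STOPWORDS = {
    "i", "og", "han", "hans", "av", "for", "å", "ble", "om",
    "var", "en", "ved", "som", "er", "det", "med", "den", "enn",
}


def strip_stopwords(keywords, stopwords=NORWEGIAN_STOPWORDS):
    """Trim stop words from both ends of each keyword phrase.

    Phrases made up only of stop words are dropped, and duplicates
    produced by the trimming are removed while keeping the input order.
    """
    cleaned = []
    for phrase in keywords:
        tokens = phrase.lower().split()
        # Remove leading stop words.
        while tokens and tokens[0] in stopwords:
            tokens.pop(0)
        # Remove trailing stop words.
        while tokens and tokens[-1] in stopwords:
            tokens.pop()
        if tokens:
            candidate = " ".join(tokens)
            if candidate not in cleaned:
                cleaned.append(candidate)
    return cleaned


raw = ["i", "og", "einstein var en", "for å",
       "relativitetsteorien", "første årene"]
print(strip_stopwords(raw))
# → ['einstein', 'relativitetsteorien', 'første årene']
```

This only cleans up the output. It does not change how the keywords are ranked, so stop words may still crowd out better candidates during extraction.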
After installing Pattern, importing summa fails with:
>>> from summa.summarizer import summarize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\__init__.py", line 1, in <module>
    from summa import commons, graph, keywords, pagerank_weighted, \
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\keywords.py", line 5, in <module>
    from .preprocessing.textcleaner import clean_text_by_word as _clean_text_by_word
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\preprocessing\textcleaner.py", line 8, in <module>
    from pattern.en import tag
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\wordnet\__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 673, in find
    return find(modified_name, paths)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
So I cannot use Pattern (issue #30), which leaves Norwegian unusable and effectively unsupported. I assume the same goes for other languages as well.
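The traceback ends in `zipfile.BadZipFile`, raised while NLTK was opening a corpus archive. That might point to a corrupt or partially downloaded `.zip` among the NLTK data files, though this is only a guess from the traceback. A minimal sketch for checking this with the standard library follows; `find_bad_zips` is a hypothetical helper, and the directory to scan is whatever folder holds the NLTK data on the machine.

```python
import zipfile
from pathlib import Path


def find_bad_zips(root):
    """Return the *.zip files under `root` that are not valid zip archives."""
    bad = []
    for path in sorted(Path(root).rglob("*.zip")):
        # is_zipfile() checks the archive's magic number without extracting it.
        if not zipfile.is_zipfile(str(path)):
            bad.append(path)
    return bad


# Usage (the path is an assumption; substitute the actual NLTK data folder):
# for path in find_bad_zips(r"C:\path\to\nltk_data"):
#     print("not a valid zip:", path)
```

Any file reported this way would be a candidate for deleting and downloading again, but whether that resolves the import error is untested here.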