This repository is used to try Apache Tika on different document formats.
The aim is to collect all words, excluding stopwords, from the content of the documents and to create files with the resulting metadata.
Afterwards, you can search the generated CSV, or draw treemaps or wordclouds based on it.
Stopwords can be changed to another language; lists are available on this site: https://countwordsfree.com/stopwords/french
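
As an illustration of the stopword filtering described above, here is a minimal sketch using only the standard library. The function name `count_words` and the small `FRENCH_STOPWORDS` sample are assumptions made for this example; they are not taken from the notebooks.

```python
import re
from collections import Counter

# Hypothetical sample; a real list would be loaded from a stopwords file.
FRENCH_STOPWORDS = {"le", "la", "les", "et", "de", "un", "une"}


def count_words(text, stopwords):
    """Count the lowercase words of `text`, skipping any word in `stopwords`."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3).
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in stopwords)


counts = count_words("Le chat et le chien de la maison", FRENCH_STOPWORDS)
print(counts.most_common())
```

Switching language then only means passing a different stopword set.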
The drawings are based on this article: https://towardsdatascience.com/beyond-the-cloud-4-visualizations-to-use-instead-of-word-cloud-960dd516f215
Wordcloud tutorial: https://www.youtube.com/watch?v=l7w7unBNAeU
A category is needed. It is a subdirectory of the home directory that contains all the files (images, documents) to be converted into text.
The category is defined in the file python/global_variables.py (this file can be moved directly into the jupyter_files directory if needed).
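
A possible way to collect the files of a category is sketched below. The function name `list_category_files` and its `base_dir` parameter are assumptions for illustration; the actual variable names in global_variables.py may differ.

```python
from pathlib import Path


def list_category_files(base_dir, category):
    """Return all regular files found under base_dir/category, sorted by path."""
    category_dir = Path(base_dir) / category
    if not category_dir.is_dir():
        raise FileNotFoundError(f"Category directory not found: {category_dir}")
    # rglob("*") also walks nested subdirectories.
    return sorted(p for p in category_dir.rglob("*") if p.is_file())
```

For example, `list_category_files(Path.home(), "invoices")` would list every file under a hypothetical `invoices` category in the user's home directory.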
New in this version (tika3):
- Changed the way of working by using dictionaries and parallelisation, which increases speed considerably.
  - The parallel parameter can be found in global_variables.py.
- Simplification.
- PySimpleGUI interface to choose whether to remove data from or add data to the file.
- Comparison of the files already processed in the result file with the current files, so that missing or newly added files can be picked up.
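
The combination of dictionaries and parallelisation mentioned in the list above could look roughly like the following sketch. It is not the notebook's code: `PARALLEL_WORKERS`, `count_words_in_files` and the `extract_text` callback are hypothetical names, and a thread pool is only one possible choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical name; the real parallel parameter is defined in global_variables.py.
PARALLEL_WORKERS = 4


def count_words_in_files(paths, extract_text, workers=PARALLEL_WORKERS):
    """Extract text from each path in parallel and return {path: Counter of words}."""

    def process(path):
        # One dictionary-like Counter per file, built from whitespace-separated words.
        return path, Counter(extract_text(path).lower().split())

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions here.
        return dict(pool.map(process, paths))
```

With the tika package, `extract_text` could be something like `lambda p: parser.from_file(str(p))["content"] or ""`, since `tika.parser.from_file` returns a dictionary whose `"content"` entry may be `None` for files without text.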
This change was made following a Udemy training: "Python Coder un dashboard de Rachid." It seems that this training is not yet open to trainees.
The following files are already developed:
| File | Description |
|---|---|
| jupyter_files/00_Step1_OCR_files_metadata with_tika.ipynb | Tika parsing of the files in a category directory; the output is a file with all the metadata found, including the quantity per file |
| jupyter_files/01_wordccloud_of_tika_result.ipynb | Generates a wordcloud from the result of 00_Step1_OCR_files_metadata with_tika.ipynb |
Results will be provided in a subdirectory named: category_results
| File | Description |
|---|---|
| category_results/category__metadatas.csv | File listing all processed files, with the data format: ['category','file','metadata','count','timestamp'] |
| category_results/category__wordcloud1.png | Wordcloud output of the metadata collected |
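
To illustrate how the CSV columns listed above can be used to detect missing or newly added files, here is a small sketch based on the standard `csv` module. It assumes the CSV starts with a header row containing those five column names; the helper names `write_rows`, `processed_files` and `diff_files` are hypothetical.

```python
import csv

COLUMNS = ["category", "file", "metadata", "count", "timestamp"]


def write_rows(csv_path, rows):
    """Write a list of dicts keyed by COLUMNS to csv_path, with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def processed_files(csv_path):
    """Return the set of file names already present in the results CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row["file"] for row in csv.DictReader(handle)}


def diff_files(csv_path, current_files):
    """Return (new, missing): files not yet in the CSV, and CSV files no longer present."""
    done = processed_files(csv_path)
    current = set(current_files)
    return current - done, done - current
```

Files in the first set still need to be parsed; files in the second set could be removed from the results.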
- Ubuntu 23 with Python 3.11 and OpenJDK 17.
- JupyterLab.
- Working in a venv (virtual environment).
- To use the Jupyter files, add the python subdirectory to the PYTHONPATH variable; this is where global_variables.py is located.
- Tika requires a Java JDK.
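
If setting PYTHONPATH is not convenient, a notebook cell can extend `sys.path` instead. This is an alternative sketch, not the method used in the repository; `add_python_dir` and its `project_root` argument are hypothetical.

```python
import sys
from pathlib import Path


def add_python_dir(project_root):
    """Put <project_root>/python at the front of sys.path if it is not there yet."""
    python_dir = str(Path(project_root).resolve() / "python")
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    return python_dir
```

From a notebook stored in jupyter_files, calling `add_python_dir("..")` before `import global_variables` should have the same effect for that session, assuming the notebook's working directory is jupyter_files.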
To be installed in the virtual environment with pip:
- tika
- jupyterlab
- pandas
- pysimplegui
To change the log level and get more output:
- Change the DEBUG_OL value to 2 in the file ./python/global_variables.py
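
How a numeric level such as DEBUG_OL can gate output is sketched below. This only illustrates the general idea; the real handling of DEBUG_OL in the project may differ, and `make_debug_printer` is a hypothetical helper.

```python
def make_debug_printer(debug_level):
    """Return a print helper that is silent unless debug_level reaches the message level."""

    def debug_print(message, level=1):
        # Assumption: a higher level means more detailed output.
        if debug_level >= level:
            print(f"[debug {level}] {message}")
            return True
        return False

    return debug_print


debug_print = make_debug_printer(2)
debug_print("parsing started", level=2)
```

With a level of 0 the same call prints nothing, which matches the idea that raising the value gives more output.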
Functions or individual steps of the code can be checked inside JupyterLab by executing selected steps and writing unit tests to run.
Local follow-up is in the file:
backlog & issues