README.md

Introduction

This repository is used to try Apache Tika on different document formats.
The aim is to collect all the words, excluding stopwords, from the content of the documents and to create files with metadata.
Afterwards, you can search, or draw treemaps or word clouds, based on the generated CSV.
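
As an illustration of the word collection described above, here is a minimal sketch, not the repository's actual code. The `STOPWORDS` set and the `count_words` function are hypothetical names chosen for this example; in practice the stopword list would be loaded from a file.

```python
import re
from collections import Counter

# Hypothetical stopword set, kept tiny for the example.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}

def count_words(text, stopwords=STOPWORDS):
    """Return a Counter of the lowercase words in `text`, skipping stopwords."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word for word in words if word not in stopwords)

counts = count_words("The aim of the project is to collect words and count words.")
print(counts.most_common(3))
```

A `Counter` is convenient here because the per-word counts can later be written out row by row.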

Stopwords can be changed to another language; see this site: https://countwordsfree.com/stopwords/french

Drawings are based on this information: https://towardsdatascience.com/beyond-the-cloud-4-visualizations-to-use-instead-of-word-cloud-960dd516f215

wordcloud tutorial: https://www.youtube.com/watch?v=l7w7unBNAeU

A category is required. A category is a subdirectory of the home directory that contains all the files (images, documents) to be converted into text.
The category is defined in the file python/global_variables.py (this file can be moved directly into the jupyter_files directory if needed).

New in this version (tika3):
- Changed the way of working by using dictionaries and parallelisation, which increases speed a lot.
> The parallel parameter can be found in global_variables.py
- Simplification.
- PySimpleGUI interface to choose whether to remove or add data in the file.
- Comparison of the files already treated in the result file with the current files, to detect missing or newly added files.
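
The dictionary and parallelisation approach mentioned in the list above could look roughly like the sketch below. It is an assumption, not the repository's implementation: `count_all`, `count_words_in_text`, `texts_by_file` and `max_workers` are hypothetical names, and a thread pool is only one possible way to parallelise the work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words_in_text(text):
    """Count whitespace-separated lowercase words in one text."""
    return Counter(text.lower().split())

def count_all(texts_by_file, max_workers=4):
    """Map each file name to its word counts, processing texts in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so zip pairs names with results.
        counters = pool.map(count_words_in_text, texts_by_file.values())
        return dict(zip(texts_by_file.keys(), counters))

results = count_all({"a.txt": "tika tika pdf", "b.txt": "pdf"}, max_workers=2)
```

Here `max_workers` plays the role of the parallel parameter; the name actually used in global_variables.py may differ.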

These changes were made following a Udemy training: "Python Coder un dashboard de Rachid." It seems that this training is not yet open to trainees.

Files usage

The following files are already developed:

File	Description
jupyter_files/00_Step1_OCR_files_metadata with_tika.ipynb	Tika parsing of the files in a category directory; the output is a file with all metadata found, including the quantity per file
jupyter_files/01_wordccloud_of_tika_result.ipynb	Generates a word cloud based on the result of 00_Step1_OCR_files_metadata with_tika.ipynb
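
For readers new to the tika Python package, the parsing step can be sketched as follows. This is a minimal example, not the notebook's code: `extract_text` is a hypothetical helper, and the `parse` argument exists only so that the function can be tried without a running Tika server.

```python
def extract_text(path, parse=None):
    """Return the text content Tika extracts from `path` ('' if none)."""
    if parse is None:
        # Imported lazily: needs `pip install tika` and a Java runtime.
        from tika import parser
        parse = parser.from_file
    parsed = parse(path)
    # The parsed result is a dict; "content" can be None for empty documents.
    return (parsed.get("content") or "").strip()

# Example call (hypothetical path): extract_text("my_category/report.pdf")
```

The dictionary returned by `parser.from_file` also has a "metadata" key, which is not used in this sketch.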

result

Results will be provided in a subdirectory named: category_results

File	Description
category_results/category__metadatas.csv	File listing all treated files, with the data format: ['category','file','metadata','count','timestamp']
category_results/category__wordcloud1.png	Word cloud output of the metadata collected
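
Given the column layout listed above, the treated files can be compared with the files currently on disk. The sketch below assumes the CSV is comma-separated with a header row; the sample rows and the names `files_in_results` and `compare` are made up for illustration.

```python
import csv
import io

def files_in_results(csv_file):
    """Return the set of file names recorded in the 'file' column."""
    return {row["file"] for row in csv.DictReader(csv_file)}

def compare(on_disk, treated):
    """Return (files not yet treated, treated files no longer on disk)."""
    return sorted(on_disk - treated), sorted(treated - on_disk)

# Made-up sample standing in for category_results/category__metadatas.csv
sample = io.StringIO(
    "category,file,metadata,count,timestamp\n"
    "docs,a.pdf,tika,3,2024-01-01\n"
    "docs,b.pdf,python,1,2024-01-01\n"
)
treated = files_in_results(sample)
new_files, missing_files = compare({"a.pdf", "c.pdf"}, treated)
print(new_files, missing_files)
```

With a real result file, `sample` would be replaced by an open file handle.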

Development environment

Using Ubuntu 23 with Python 3.11 and OpenJDK 17.
Using JupyterLab.
Working in a venv (virtual environment).
To use the Jupyter files, you need to add the python subdirectory to the PYTHONPATH variable; this is where global_variables.py is located.
For Tika, a Java JDK is needed.
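
If setting PYTHONPATH in the shell is not convenient, the same effect can be obtained from inside a notebook. This is a sketch under the assumption that the notebook runs from the repository root; `add_to_python_path` is a hypothetical helper.

```python
import sys
from pathlib import Path

def add_to_python_path(directory):
    """Prepend `directory` to sys.path if it is not already there."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved

# Assumes the current directory is the repository root; adjust if not.
python_dir = add_to_python_path("python")
# import global_variables  # resolves once the directory is on sys.path
```

When the notebook is started from jupyter_files instead, the argument would be "../python".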

additional libraries used

To be added to the virtual environment with pip.

tika
jupyterlab
pandas
pysimplegui

log mode

To change the log level and get more printed output:

change the DEBUG_OL value to 2 in the file ./python/global_variables.py
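
How DEBUG_OL is used inside the code is not described here. One plausible pattern, shown purely as an assumption, is a helper that prints a message only when the configured level is high enough; `debug_print` and its `level` argument are hypothetical.

```python
# In the repository this value lives in python/global_variables.py.
DEBUG_OL = 2

def debug_print(message, level=1):
    """Print `message` only when DEBUG_OL is at least `level`."""
    if DEBUG_OL >= level:
        print(message)
        return True
    return False

debug_print("parsing started", level=1)      # printed
debug_print("raw tika response", level=3)    # skipped at DEBUG_OL = 2
```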

debug mode

To check functions or individual steps of the code, you can work inside JupyterLab by executing selected steps and creating unit tests to run.

backlog and issues

Local follow-up is in the file:
backlog & issues
