oliboub/tika3

Introduction

This repository is used to try Apache Tika on different document formats.
The aim is to collect all words, excluding stopwords, from the content of the documents and to create files with the resulting metadata.
Afterwards, you can search the generated CSV or draw treemaps or word clouds from it.
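
To make the idea concrete, here is a minimal sketch of the word-collection step. It is not the notebook's actual code: the function names `count_words` and `extract_text` and the tiny stopword set are assumptions made for illustration. `extract_text` relies on `parser.from_file` from the tika package and is only defined here, not called, because it needs Java and a Tika server.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); a real run would load a full list.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}


def count_words(text, stopwords=STOPWORDS):
    """Count lowercase words in `text`, skipping stopwords and single letters."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if len(w) > 1 and w not in stopwords)


def extract_text(path):
    """Return the text Tika extracts from `path` (needs the tika package and Java)."""
    from tika import parser  # imported lazily so count_words works without Tika

    parsed = parser.from_file(path)
    # "content" can be None when Tika finds no text in the file.
    return parsed.get("content") or ""


counts = count_words("The aim of the project is to count the words in the documents.")
print(counts.most_common(3))
```

In a real run, `count_words(extract_text(path))` would be applied to each file of the category.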

The stopwords can be changed to another language; lists are available on this site: https://countwordsfree.com/stopwords/french

Drawings are based on this article: https://towardsdatascience.com/beyond-the-cloud-4-visualizations-to-use-instead-of-word-cloud-960dd516f215

Word cloud tutorial: https://www.youtube.com/watch?v=l7w7unBNAeU

A category is needed. It is a subdirectory of the home directory containing all the files (images, documents) to be converted into text.
The category is defined in the file python/global_variables.py (this file can be moved directly into the jupyter_files directory if needed).
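
As an illustration of how a category could map to paths, here is a small sketch. The helper `category_paths`, the example category `invoices`, and the choice to place the results directory next to the category directory are all assumptions; the actual variable names in global_variables.py may differ.

```python
from pathlib import Path

# Hypothetical stand-in for the value set in python/global_variables.py.
category = "invoices"


def category_paths(category, home=None):
    """Return (source_dir, results_dir, metadata_csv) for a category."""
    home = Path(home) if home is not None else Path.home()
    source_dir = home / category
    # Assumption: the results directory sits beside the category directory.
    results_dir = home / f"{category}_results"
    metadata_csv = results_dir / f"{category}__metadatas.csv"
    return source_dir, results_dir, metadata_csv


source_dir, results_dir, metadata_csv = category_paths(category, home="/tmp/home")
print(source_dir, results_dir, metadata_csv, sep="\n")
```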


New in this version (tika3):
- Changed the way of working by using dictionaries and parallelisation, which increases speed considerably.
> The parallel parameter can be found in global_variables.py
- Simplification.
- PySimpleGUI interface to choose whether to remove or add data in the result file.
- Comparison of the processed files against the result file, to detect missing or newly added files.
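
The dictionary-plus-parallelisation approach can be sketched as follows. This is only an illustration under stated assumptions: `PARALLEL`, `count_file_words`, and `count_all` are hypothetical names, the input is plain text rather than real Tika output, and a thread pool is assumed to be adequate because the tika package delegates parsing to a separate Tika server process.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

PARALLEL = 4  # hypothetical name for the parallel setting in global_variables.py


def count_file_words(item):
    """Turn one (file name, text) pair into (file name, word counts)."""
    name, text = item
    return name, Counter(text.lower().split())


def count_all(texts, workers=PARALLEL):
    """Process every file in parallel and collect the results in one dictionary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(count_file_words, texts.items()))


results = count_all({"a.txt": "tika parses files", "b.txt": "files files everywhere"})
print(results["b.txt"])
```

Keying the results by file name keeps each file's counts separate, so a file can be removed or re-added later without recomputing the others.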

These changes were made following a Udemy training course: "Python Coder un dashboard de Rachid." It seems that this course is not yet open to trainees.

Files usage

The following files are already developed:

| File | Description |
| --- | --- |
| jupyter_files/00_Step1_OCR_files_metadata with_tika.ipynb | Tika parsing of the files in a category directory; the output is a file with all metadata found, including the quantity per file |
| jupyter_files/01_wordccloud_of_tika_result.ipynb | Generates a word cloud from the result of 00_Step1_OCR_files_metadata with_tika.ipynb |

Results

Results will be provided in a subdirectory named: category_results

| File | Description |
| --- | --- |
| category_results/category__metadatas.csv | File listing all processed files, with the data format ['category','file','metadata','count','timestamp'] |
| category_results/category__wordcloud1.png | Word cloud output of the collected metadata |
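
The comparison between processed files and files on disk could look like the sketch below. It uses the standard csv module and assumes the CSV starts with a header row naming the columns listed above. The sample rows and the function names `processed_files` and `compare` are invented for illustration.

```python
import csv
import io


def processed_files(csv_text):
    """Return the set of file names found in the 'file' column of the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["file"] for row in reader}


def compare(csv_text, files_on_disk):
    """Return (new files not yet in the CSV, files in the CSV but gone from disk)."""
    done = processed_files(csv_text)
    on_disk = set(files_on_disk)
    return sorted(on_disk - done), sorted(done - on_disk)


# Invented sample data with the columns category,file,metadata,count,timestamp.
sample = (
    "category,file,metadata,count,timestamp\n"
    "invoices,a.pdf,total,3,2024-01-01T10:00:00\n"
    "invoices,b.pdf,amount,1,2024-01-01T10:00:05\n"
)
new_files, missing_files = compare(sample, ["a.pdf", "c.png"])
print(new_files, missing_files)
```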

Development environment

  • Ubuntu 23 with Python 3.11 and OpenJDK 17

  • JupyterLab

  • A venv (virtual environment)

  • To use the Jupyter files, add the python subdirectory to the PYTHONPATH variable; this is where global_variables.py is located.

  • Tika requires a Java JDK.
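
As an alternative to setting PYTHONPATH in the shell, the directory can be added to `sys.path` from the first notebook cell. This is a sketch, not the project's own setup: the helper `add_to_sys_path` is hypothetical, and it assumes the notebook runs from jupyter_files/ with the python/ directory beside it.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path once and return its absolute path."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Assumes the notebook runs from jupyter_files/, next to the python/ directory.
added = add_to_sys_path(Path.cwd().parent / "python")
print(added)
```

After this cell, `import global_variables` should work in the notebook, provided the file really is in that directory.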

Additional libraries used

To be added to the virtual environment with pip:

  • tika
  • jupyterlab
  • pandas
  • pysimplegui

Log mode

To change the log level and get more output:

  • set DEBUG_OL to 2 in the file ./python/global_variables.py
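
One way a level such as DEBUG_OL can gate output is sketched below. The meaning of the levels (0 for quiet, higher values for more detail) and the helper `debug_print` are assumptions; the project may use the value differently.

```python
DEBUG_OL = 2  # assumed meaning: 0 = quiet, higher values = more output


def debug_print(level, *args):
    """Print `args` only when DEBUG_OL is at least `level`; report whether it printed."""
    if DEBUG_OL >= level:
        print(*args)
        return True
    return False


debug_print(1, "starting the Tika parsing")        # shown when DEBUG_OL >= 1
debug_print(2, "parsed file:", "example.pdf")      # shown when DEBUG_OL >= 2
debug_print(3, "very detailed trace information")  # hidden while DEBUG_OL is 2
```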

Debug mode

To check functions or individual steps of the code, work inside JupyterLab: execute the steps one at a time and create unit tests to launch.

Backlog and issues

Local follow-up is in the file:
backlog & issues

About

Improvement of tika2 by changing the way to process files
