This project is a Python-based solution for extracting text from PDF files, preprocessing the text, vectorizing it using Cohere embeddings, and storing the vectors in Pinecone for further use.
- Initialization: The script begins by initializing the necessary libraries. These include:
  - PyMuPDF for PDF processing
  - pytesseract for Optical Character Recognition (OCR)
  - spaCy for Natural Language Processing (NLP)
  - Cohere for generating text embeddings
  - Pinecone for vector storage
- Text Extraction: The `extract_text_from_page` function extracts text from each page of the PDF. It uses PyMuPDF for text extraction and Tesseract (via pytesseract) for OCR when the page contains scanned images.
- Text Preprocessing: The `preprocess_text` function uses spaCy to normalize and clean the extracted text. The `chunk_text` function then divides the cleaned text into smaller pieces for efficient processing.
- Vectorization: The `vectorize_text` function takes the preprocessed text chunks and generates vector embeddings using the Cohere model.
- Upload to Pinecone: The `upload_vectors` function takes the generated vectors and uploads them to a Pinecone index for storage and retrieval.
- Process PDF: The `process_pdf` function orchestrates the entire workflow for each PDF file. It extracts, preprocesses, and vectorizes the text from each page, and then uploads the vectors to Pinecone.
- Main Function: The `main` function is the entry point of the script. It iterates through a specified directory, identifies all PDF files, and processes each one with the `process_pdf` function.
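
To make the chunking and directory-scanning steps in the list above more concrete, here is a minimal sketch using only the standard library. It is not the project's actual code: the word-based splitting, the `max_words` parameter, its default of 200, and the helper name `find_pdf_files` are all assumptions made for illustration. The real `chunk_text` may split on sentences, tokens, or characters instead.

```python
from __future__ import annotations

from pathlib import Path


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Assumption: word-count chunking; the project may use a different rule.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def find_pdf_files(directory: str) -> list[Path]:
    """Return the PDF files directly inside directory, sorted by path.

    Hypothetical helper: matches the .pdf extension case-insensitively
    and does not descend into subdirectories.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

In a `main` function, each path returned by `find_pdf_files` would be passed to `process_pdf`, and each page's cleaned text would go through `chunk_text` before vectorization.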
To use this script, specify the directory containing your PDF files in the `main` function and run the script. Ensure that all necessary environment variables are set in your `.env` file.
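
Since a missing key usually only surfaces once the first API call fails, it can help to check the environment up front. The sketch below shows one way to do that. The variable names `COHERE_API_KEY` and `PINECONE_API_KEY` are hypothetical placeholders, not names confirmed by this project, so substitute whatever keys your `.env` file defines. The check reads `os.environ`, so it assumes the `.env` file has already been loaded into the process environment (for example by a package such as python-dotenv, if the script uses one).

```python
import os

# Hypothetical variable names; replace with the keys your .env file defines.
REQUIRED_VARS = ("COHERE_API_KEY", "PINECONE_API_KEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def check_environment():
    """Exit with a readable message if any required variable is missing."""
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Calling `check_environment()` at the top of `main` would stop the run before any PDF is processed if a key is absent.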