This project enhances the readability and text extraction capability of scanned newspaper PDFs. It first applies color-preserving image preprocessing and then adds an invisible OCR-based text layer so that users can search and copy text from the resulting PDF without altering the original visual appearance.
- High-resolution image conversion (300 DPI)
- Color-preserving contrast enhancement & denoising
- Image reassembly into a PDF while maintaining full color
- Text-layer embedding using ocrmypdf
- Searchable and selectable final PDF output
Python 3.x
Python packages:
- ocrmypdf
- pdf2image
- opencv-python
- numpy
- Pillow
Poppler utilities:
- macOS: brew install poppler
- Ubuntu: sudo apt-get install poppler-utils
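pdf2image relies on Poppler's command-line tools (pdftoppm and pdfinfo) being on the PATH, so it can help to confirm the installation before running the script. The snippet below is a minimal sketch of such a check; the helper name find_poppler_tools is hypothetical and not part of this project.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler tool name to its path on PATH, or None if it is missing.

    Hypothetical helper for illustration; not part of the project script.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    missing = [name for name, path in find_poppler_tools().items() if path is None]
    if missing:
        print("Missing Poppler tools: " + ", ".join(missing))
    else:
        print("Poppler tools found on PATH")
```

If either tool is reported missing, install Poppler with the command for your platform above and run the check again.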
- convert_from_path(): Converts PDF pages to high-resolution color images
- cv2.convertScaleAbs(): Enhances contrast and brightness
- cv2.fastNlMeansDenoisingColored(): Denoises while preserving color
- Image.fromarray(): Converts NumPy array back to PIL image for saving
- ocrmypdf.ocr(): Adds an invisible, selectable OCR text layer to the PDF
- Place the original scanned newspaper PDF in the ./data folder.
- Update the file name in the script (input_pdf variable).
- Run the Python script:
  python main.py