sowada23/CS4Good

📰 Newspaper OCR Enhancement & Text-Layering Tool

This project makes scanned newspaper PDFs easier to read and to extract text from. It first applies color-preserving image preprocessing, then adds an invisible OCR-based text layer, so users can search and copy text in the resulting PDF without altering its original visual appearance.


📌 Features

✅ High-resolution image conversion (300 DPI)
✅ Color-preserving contrast enhancement & denoising
✅ Image reassembly into a PDF while maintaining full color
✅ Text-layer embedding using ocrmypdf
✅ Searchable and selectable final PDF output
⚠️ Note: Download the output PDF files to try text extraction.


📅 Requirements

Python 3.x

Python packages:

  • ocrmypdf
  • pdf2image
  • opencv-python
  • numpy
  • Pillow

Poppler utilities:

  • macOS: brew install poppler
  • Ubuntu: sudo apt-get install poppler-utils

🔍 Main Function Explanations

  • convert_from_path() : Converts each PDF page to a high-resolution color image (from pdf2image)
  • cv2.convertScaleAbs() : Adjusts contrast and brightness by scaling pixel values (alpha) and adding an offset (beta)
  • cv2.fastNlMeansDenoisingColored() : Removes noise with non-local means filtering while preserving color
  • Image.fromarray() : Converts a NumPy array back to a PIL image for saving (from Pillow)
  • ocrmypdf.ocr() : Adds an invisible, selectable OCR text layer to the PDF

🔄 How to Run

  1. Place the original scanned newspaper PDF in the ./data folder.

  2. Update the file name in the script (input_pdf variable).

  3. Run the Python script:

    python main.py
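
For orientation, a script following these steps might be laid out like the sketch below. Only the `./data` folder and the `input_pdf` variable come from the steps above; the file name `newspaper.pdf`, the helper `resolve_paths`, and the `_images` / `_searchable` output names are hypothetical.

```python
from pathlib import Path

DATA_DIR = Path("data")        # folder named in step 1
input_pdf = "newspaper.pdf"    # hypothetical file name; edit as in step 2


def resolve_paths(name: str, data_dir: Path = DATA_DIR) -> tuple[Path, Path]:
    """Return (source, output) paths for a PDF placed in the data folder."""
    source = data_dir / name
    if source.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got {source.name!r}")
    # Assumed naming scheme: write the result next to the input.
    return source, source.with_name(f"{source.stem}_searchable.pdf")


def main() -> None:
    source, output = resolve_paths(input_pdf)
    if not source.is_file():
        raise SystemExit(f"Input not found: {source}")

    # Imported here so a wrong file name is reported before the heavy libraries load.
    import ocrmypdf
    from pdf2image import convert_from_path

    # One PIL image per page, rendered at 300 DPI.
    pages = convert_from_path(str(source), dpi=300)

    # (Per-page contrast enhancement and denoising would go here.)

    # Reassemble the page images into a single color PDF.
    image_pdf = source.with_name(f"{source.stem}_images.pdf")
    pages[0].save(image_pdf, "PDF", save_all=True,
                  append_images=pages[1:], resolution=300.0)

    # Add the invisible OCR text layer.
    ocrmypdf.ocr(image_pdf, output)
```

Calling `main()` at the bottom of the script, under an `if __name__ == "__main__":` guard, would run the whole pipeline when `python main.py` is executed.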
