@gracetyy

This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.

  • All extracted images are saved in a subfolder named PDF within the input root directory by default (customizable via --out).
  • Each PDF file is organized into its own folder, containing all images extracted from that document.
  • The script supports an optional --dedup flag to enable per-PDF deduplication of images.
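
For reference, the CLI described above might be wired up roughly as follows. This is only a sketch: the `--out` and `--dedup` flags and the default `PDF` subfolder come from the description, while the positional name `pdf_dir` and the helper names `build_parser` and `resolve_output_root` are assumptions made for illustration and may not match the actual script.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options named in the PR description."""
    parser = argparse.ArgumentParser(
        description="Recursively extract embedded images from PDF files."
    )
    # Positional input root; the name `pdf_dir` is an assumption.
    parser.add_argument("pdf_dir", help="root directory to scan for PDF files")
    parser.add_argument(
        "--out",
        default=None,
        help="output root directory (default: <pdf_dir>/PDF)",
    )
    parser.add_argument(
        "--dedup",
        action="store_true",
        help="skip duplicate images within each PDF",
    )
    return parser


def resolve_output_root(args: argparse.Namespace) -> str:
    """Fall back to a 'PDF' subfolder of the input root when --out is omitted."""
    return args.out if args.out else os.path.join(args.pdf_dir, "PDF")
```

With this sketch, parsing `["docs", "--dedup"]` would enable deduplication and resolve the output root to `docs/PDF`, while passing `--out elsewhere` would override the default.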

Additional notes:

  • Please let me know if you’d like any changes to the folder naming or CLI options.
  • Happy to update documentation or add more examples if needed.

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new utility script that recursively extracts all embedded images from PDF files in a directory tree. The script uses PyMuPDF (fitz) to process PDFs and supports optional deduplication of images per document.

Key changes:

  • Adds pdf_image_extractor.py with command-line interface for PDF image extraction
  • Includes comprehensive README with usage examples and documentation
  • Supports customizable output directory and per-PDF deduplication options

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
PDF Image Extractor/pdf_image_extractor.py | Main script implementing recursive PDF scanning and image extraction logic with CLI argument parsing
PDF Image Extractor/README.md | Documentation covering requirements, usage, CLI options, and output structure

Code Quality Observations:

The implementation is generally well-structured with clear separation of concerns. However, there are a few technical issues to address:

  1. Potential crash with os.path.commonpath() (lines 14-16): The code calls os.path.commonpath([pdf_path, output_root]), which raises a ValueError on Windows when the paths are on different drives, and on any platform when the list mixes absolute and relative paths. The script would therefore crash whenever a user points --out at a different drive. The logic appears intended to mirror the input directory structure, but using the common path as the base is fragile. A simpler approach is to compute each PDF's path relative to the input directory pdf_dir.

  2. Inefficient directory creation logic (lines 35-36): The condition if img_count == 0 and not os.path.exists(output_folder) creates the directory only just before the first image is written. Because os.makedirs() is already called with exist_ok=True, the os.path.exists() check is redundant. It would be clearer to create the directory once before the loop if there are images to extract.

  3. Redundant deduplication checks (lines 27-30): The code tests if dedup twice: once to skip duplicates and again to add the image to the set. Both steps could be merged into a single conditional block.

  4. Missing requirements.txt: Several other projects in this repository include a requirements.txt file (e.g., PDF Merger, Image Watermarker, Image to ASCII). Adding one for this project would improve consistency and make dependency installation clearer for users.

  5. Missing error handling for image extraction: If doc.extract_image(xref) fails (line 32), the script will crash. While PyMuPDF is generally robust, adding a try-except block would make the script more resilient.
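
The suggestions in items 1, 2, 3, and 5 could be combined along the following lines. This is a minimal sketch rather than the script's actual code: the function names `output_folder_for` and `save_images`, the `image_<n>.<ext>` file naming, and hashing the image bytes for deduplication are all assumptions. The `extract` argument stands in for a callable such as PyMuPDF's `doc.extract_image`, which returns a dict containing the `image` bytes and the `ext` file extension.

```python
import hashlib
import os


def output_folder_for(pdf_path, pdf_dir, output_root):
    """Mirror the PDF's location under pdf_dir inside output_root.

    pdf_path is found by walking pdf_dir, so both are on the same drive
    and os.path.relpath is safe here, unlike commonpath with output_root.
    """
    rel = os.path.relpath(pdf_path, start=pdf_dir)
    return os.path.join(output_root, os.path.splitext(rel)[0])


def save_images(xrefs, extract, output_folder, dedup=False):
    """Write the image for each xref and return how many were saved.

    `extract` is any callable mapping an xref to a dict with "image"
    (bytes) and "ext" (str) keys, e.g. doc.extract_image in PyMuPDF.
    """
    seen = set()
    saved = 0
    for xref in xrefs:
        try:
            info = extract(xref)
        except Exception as exc:  # keep going if one image is unreadable
            print(f"skipping xref {xref}: {exc}")
            continue
        if not info:
            continue
        data = info["image"]
        if dedup:
            # Single conditional block: check and record in one place.
            # Hashing the bytes is an assumption about how dedup works.
            digest = hashlib.sha256(data).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
        if saved == 0:
            # Create the folder once, with no separate exists() check.
            os.makedirs(output_folder, exist_ok=True)
        saved += 1
        name = f"image_{saved}.{info['ext']}"
        with open(os.path.join(output_folder, name), "wb") as fh:
            fh.write(data)
    return saved
```

Creating the folder lazily on the first saved image, rather than unconditionally before the loop, avoids leaving empty folders for PDFs whose images are all skipped; creating it up front when the xref list is non-empty would be an equally reasonable choice.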

Documentation:
The README is well-written with clear examples and appropriate detail. The structure follows good practices with separate sections for requirements, usage, examples, and output structure.
