HocrTransform.to_pdf: add parameters: scale_factor pil_image_save_kwargs #1586
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
- Coverage   89.72%   89.55%   -0.18%
==========================================
  Files          96       96
  Lines        7185     7202      +17
  Branches      735      739       +4
==========================================
+ Hits         6447     6450       +3
- Misses        529      542      +13
- Partials      209      210       +1
The intended workflow in ocrmypdf is to use higher optimization level settings when you have very high DPI images. The hocrtransform step is only intended to copy the image from the input file at its existing quality level. The optimizer is the place where it would make sense to introduce an option for downsampling.

ocrmypdf does not convert images to JPEG2000. Apart from JPEG2000's superior handling of >32k pixel images, modern JPEG codecs exceed it in quality, are significantly less complex, and are significantly faster to render.
but then why does the file size explode from 133MB to 554MB?

pil_image_save_kwargs={
    # 554M 100-hocr2pdf.jpg.q50.pdf (wtf?)
    # 133M 110-tiff2jpg-q50
    "format": "JPEG",
    "quality": 50,
},
img2jpg.py

#!/usr/bin/env python3
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

from PIL import Image
import psutil

# --- Configuration ---
# QUALITY = 50 # Compression quality (0–100)
# QUALITY = 20
# QUALITY = 10
# QUALITY = 5
# QUALITY = 2
QUALITY = 50
INPUT_DIR = "070-deskew"
OUTPUT_DIR = os.path.splitext(os.path.basename(__file__))[0]
# OUTPUT_DIR = os.path.splitext(os.path.basename(__file__))[0] + f"-q{QUALITY}" # compare qualities
MAX_WORKERS = psutil.cpu_count(logical=False) or 1  # One worker per physical core
SCALE_FACTOR = 0.5

# use imagemagick to compare jpeg qualities around 50%
r'''
src=001.tiff;
for (( q=10; q<=80; q+=10 )); do
  dst="$src".magick.s50.q$(printf %03d "$q").jpg;
  [ -e "$dst" ] && continue;
  magick "$src" -scale 50% -quality ${q}% "$dst";
done
'''

os.makedirs(OUTPUT_DIR, exist_ok=True)

def compress_tiff_to_jpeg(filename):
    """Convert a TIFF image to JPEG optimized for text/graphics."""
    input_path = os.path.join(INPUT_DIR, filename)
    output_name = os.path.splitext(filename)[0] + ".jpg"
    output_path = os.path.join(OUTPUT_DIR, output_name)
    # Skip files that were already converted in a previous run.
    if os.path.exists(output_path):
        return f"keeping {output_path}"
    try:
        with Image.open(input_path) as img:
            # JPEG only supports RGB and grayscale here, so convert other modes.
            if img.mode not in ("RGB", "L"):
                img = img.convert("RGB")
            if SCALE_FACTOR != 1.0:
                new_size = (
                    int(img.width * SCALE_FACTOR),
                    int(img.height * SCALE_FACTOR),
                )
                img = img.resize(new_size, Image.LANCZOS)
            img.save(
                output_path,
                format="JPEG",
                quality=QUALITY,
            )
        return f"writing {output_path}"
    except Exception as e:
        # Report the failure as a status string so one bad file does not
        # abort the whole batch (the original `raise` made this unreachable).
        return f"error {filename}: {e}"

def main():
    tiff_files = sorted(
        f for f in os.listdir(INPUT_DIR) if f.lower().endswith((".tif", ".tiff"))
    )
    if not tiff_files:
        print("no TIFF files found.")
        return
    print(f"processing {len(tiff_files)} files using {MAX_WORKERS} workers...")
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(compress_tiff_to_jpeg, f): f for f in tiff_files}
        for future in as_completed(futures):
            print(future.result())


if __name__ == "__main__":
    main()

1. scan book pages to 600 dpi tiff images
2. run tesseract on the tiff images to produce hocr files
3. publish the hocr files in a git repo to track the progress of proofreading
4. run hocr-to-epub-fxl to convert hocr files to a 300 dpi fixed-layout epub file

now the missing part is where i convert hocr files to a pdf file

this works with my expected interface:
the input type should be detected from the image paths should be parsed from the hocr files
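As a rough illustration of parsing image paths from hocr files, here is a minimal stdlib-only sketch. It assumes the page element carries class `ocr_page` and a `title` attribute with `image "..."` and `bbox x0 y0 x1 y1` properties, as tesseract typically writes them. The names `OcrPageParser` and `parse_hocr_pages` and the sample path are hypothetical and not part of this PR.

```python
import re
from html.parser import HTMLParser


class OcrPageParser(HTMLParser):
    """Collect the image path and bbox of every ocr_page element."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only page-level elements carry the image property.
        if "ocr_page" not in (attrs.get("class") or "").split():
            return
        title = attrs.get("title") or ""
        image = re.search(r'image\s+"([^"]*)"', title)
        bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
        self.pages.append({
            "image": image.group(1) if image else None,
            "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
        })


def parse_hocr_pages(hocr_text):
    """Return one dict per ocr_page found in the hOCR markup."""
    parser = OcrPageParser()
    parser.feed(hocr_text)
    return parser.pages


# Hypothetical sample: one page scanned at 600 dpi.
SAMPLE = (
    "<div class='ocr_page' id='page_1' "
    "title='image \"070-deskew/001.tiff\"; bbox 0 0 4960 7016; ppageno 0'></div>"
)
pages = parse_hocr_pages(SAMPLE)
print(pages)
```

With something like this, a converter would only need the hocr file names on the command line and could look up each page image itself.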
help convert hocr files to pdf
status: abandoned draft
i store my raw scan images with 600 dpi (because why not...)
but to release an ebook, i want to downscale to 300 dpi
and apply some image compression
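The arithmetic behind the 600 to 300 dpi downscale can be sketched as follows. A PDF page is measured in points (72 per inch), so halving the pixel dimensions keeps the physical page size only if the resolution is halved too. The helper names and the A4-sized example dimensions are assumptions for illustration, not code from this PR.

```python
def page_size_pt(width_px, height_px, dpi):
    """Physical page size in PDF points (1 pt = 1/72 inch)."""
    return (width_px / dpi * 72, height_px / dpi * 72)


def downscale(width_px, height_px, dpi, scale_factor):
    """Scale the pixel dimensions and the resolution by the same factor."""
    return (
        int(width_px * scale_factor),
        int(height_px * scale_factor),
        dpi * scale_factor,
    )


# Hypothetical A4 page scanned at 600 dpi.
scan = (4960, 7016, 600)
small = downscale(*scan, 0.5)

full_page = page_size_pt(*scan)
same_page = page_size_pt(*small)
# If the pixels are halved but the resolution is still treated as 600 dpi,
# the computed page shrinks to half the width and height.
shrunk_page = page_size_pt(small[0], small[1], 600)

print(small, full_page, same_page, shrunk_page)
```

Any bounding boxes taken from the hocr file are in the original pixel units, so they would need the same scale factor applied.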
example use
hocr2pdf.py
... but what i dont understand:
apparently the compressed image_to_draw is not used by canvas.do.draw_image in src/ocrmypdf/hocrtransform/_hocr.py
... so there seems to be another decode-encode step
which stores the images with a different quality
so with format="JPEG" and quality=50 i get a PDF with 554MB instead of 150MB
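To illustrate how a second decode-encode step can inflate file size, here is a small Pillow sketch: it compresses a synthetic image at quality 50, then decodes and re-encodes the result at quality 95. Whether the PDF writer actually does this is not confirmed here. The helper `jpeg_bytes`, the noise image, and the quality value 95 are assumptions chosen only for the demonstration.

```python
import io
import random

from PIL import Image


def jpeg_bytes(img, quality):
    """Encode an image as JPEG in memory and return the raw bytes."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()


# Synthetic grayscale noise stands in for a scanned page (hypothetical data).
rng = random.Random(0)
pixels = bytes(rng.randrange(256) for _ in range(256 * 256))
page = Image.frombytes("L", (256, 256), pixels)

# First pass: the compression the user asked for.
first = jpeg_bytes(page, 50)

# Second pass: decode the compressed JPEG and encode it again at a
# higher quality, as an intermediate re-encoding step might do.
decoded = Image.open(io.BytesIO(first))
decoded.load()
second = jpeg_bytes(decoded, 95)

print(len(first), len(second))
```

The re-encoded bytes come out larger than the first pass even though no image detail was regained, which matches the kind of size increase described above.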
fixme:
with format="JPEG" the scaling does not work:
the content is scaled, but the page is not scaled
and the content is at the bottom-left of the page
fixme:
why is JPEG2000 compression in imagemagick so much better than in pillow
img2jp2.py
alternative:
hocr-to-epub-fxl - convert hocr files of a scanned book to a fixed-layout epub
this creates about 2x smaller files due to the AVIF image format
which is not supported in PDF files
(EPUB is the future...)
related issues