Skip to content

Conversation

@bigbruno
Copy link

In some situations, RapidOCR has much more accurate results than Tesseract. I have created an implementation that makes it simple to use with OCRmyPDF.

Congratulations on OCRmyPDF; it is the most practical way to make scanned PDF files searchable.

Copy link
Collaborator

@jbarlow83 jbarlow83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR which is definitely a serious effort and will be a valuable contribution.

I do have some review comments and questions I hope you can answer (especially about the interactions with GPU) - the RapidOCR AI docs are all Chinese, which I do not read.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in ocrmypdf._exec is for managing subprocesses we interact with. Since RapidOCR is a library and not a process, everything here should be moved to either builtin_plugins.rapidocr_engine or ocr_engine.rapidocr

if not RAPIDOCR_AVAILABLE:
error_msg = f"rapidocr_onnxruntime is not installed: {IMPORT_ERROR_MESSAGE}"
log.error(error_msg)
raise ImportError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise ocrmypdf.exceptions.MissingDependencyError instead.

f"{error_msg}\nInstall it with: pip install rapidocr_onnxruntime"
)

try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be restructured like

try:
    self._generate_hocr(...)
except Exception ...:
    


try:
# Load image
image = Image.open(input_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image is never closed - can lead to high memory usage. Use with Image.open(...) as image: instead.

Comment on lines +137 to +141
padded = Image.new(
image.mode, (image.width, image.height + 20), (255, 255, 255)
)
padded.paste(image, (0, 0))
img_array = np.array(padded)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use with Image.new() as padded: to ensure disposal of intermediate image after creating img_array.

Comment on lines +162 to +166
elif isinstance(result, tuple):
log.info(f"Got tuple with {len(result)} items")
for i, item in enumerate(result):
if i < 2: # Only log first two items to avoid excessive output
log.info(f"Tuple item {i}: type={type(item)}, value={item}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to ask rapidocr to always return the same data type?
If not, turn the tuple into a list with something like
result = [result] and eliminate all of the branching based on return value.

Comment on lines +172 to +186
# Format appears to be direct data from PaddleOCR
# Each element is [box_coordinates, text, confidence]
text_results = []
boxes_list = []

for item in result:
if len(item) == 3: # [box, text, confidence]
box_coords, text, confidence = item
text_results.append((text, float(confidence)))
boxes_list.append(box_coords)
log.debug(f"Added text '{text}' with box {box_coords}")
else:
log.warning(f"Unexpected result item format: {item}")

log.info(f"Extracted {len(text_results)} text elements from RapidOCR")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this code into a separate method to reduce complexity.

Comment on lines +188 to +215
# The output is a tuple with (result_list, timing_info)
# Where result_list is a list of [box_coords, text, confidence] triplets
result_list = result[0]
if isinstance(result_list, list):
text_results = []
boxes_list = []

log.info(
f"Processing tuple result with {len(result_list)} text items"
)

# Process each detection item from tuple result
for item in result_list:
if isinstance(item, list) and len(item) == 3:
box_coords, text, confidence = item
text_results.append((text, float(confidence)))
boxes_list.append(box_coords)
log.debug(
f"Added text from tuple: '{text}' with confidence {confidence}"
)
else:
log.warning(
f"Unexpected item format in tuple result: {item}"
)

log.info(
f"Extracted {len(text_results)} text elements from tuple result"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to me this branch can be deleted if the return type from rapidocr is unified.


# Use the existing HocrTransform for converting HOCR to PDF
# This is a simplified approach - a full implementation would create PDF directly
from ocrmypdf.hocrtransform import HocrTransform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move import to top of file

Comment on lines +422 to +425
try:
conf = float(confidence)
except (TypeError, ValueError):
conf = 0.9
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to omit confidence than report a fake value?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants