Added support for rapidocr-onnxruntime#1502
Conversation
jbarlow83
left a comment
There was a problem hiding this comment.
Thank you for this PR which is definitely a serious effort and will be a valuable contribution.
I do have some review comments and questions I hope you can answer (especially about the interactions with GPU) - the RapidOCR AI docs are all Chinese, which I do not read.
There was a problem hiding this comment.
Everything in ocrmypdf._exec is for managing subprocesses we interact with. Since RapidOCR is a library and not a process, everything here should be moved to either builtin_plugins.rapidocr_engine or ocr_engine.rapidocr
| if not RAPIDOCR_AVAILABLE: | ||
| error_msg = f"rapidocr_onnxruntime is not installed: {IMPORT_ERROR_MESSAGE}" | ||
| log.error(error_msg) | ||
| raise ImportError( |
There was a problem hiding this comment.
Please raise ocrmypdf.exceptions.MissingDependencyError instead.
| f"{error_msg}\nInstall it with: pip install rapidocr_onnxruntime" | ||
| ) | ||
|
|
||
| try: |
There was a problem hiding this comment.
This should be restructured like
try:
self._generate_hocr(...)
except Exception ...:
|
|
||
| try: | ||
| # Load image | ||
| image = Image.open(input_file) |
There was a problem hiding this comment.
Image is never closed - can lead to high memory usage. Use with Image.open(...) as image: instead.
| padded = Image.new( | ||
| image.mode, (image.width, image.height + 20), (255, 255, 255) | ||
| ) | ||
| padded.paste(image, (0, 0)) | ||
| img_array = np.array(padded) |
There was a problem hiding this comment.
Use with Image.new() as padded: to ensure disposal of intermediate image after creating img_array.
| elif isinstance(result, tuple): | ||
| log.info(f"Got tuple with {len(result)} items") | ||
| for i, item in enumerate(result): | ||
| if i < 2: # Only log first two items to avoid excessive output | ||
| log.info(f"Tuple item {i}: type={type(item)}, value={item}") |
There was a problem hiding this comment.
Is there a way to ask rapidocr to always return the same data type?
If not, turn the tuple into a list with something like
result = [result] and eliminate all of the branching based on return value.
| # Format appears to be direct data from PaddleOCR | ||
| # Each element is [box_coordinates, text, confidence] | ||
| text_results = [] | ||
| boxes_list = [] | ||
|
|
||
| for item in result: | ||
| if len(item) == 3: # [box, text, confidence] | ||
| box_coords, text, confidence = item | ||
| text_results.append((text, float(confidence))) | ||
| boxes_list.append(box_coords) | ||
| log.debug(f"Added text '{text}' with box {box_coords}") | ||
| else: | ||
| log.warning(f"Unexpected result item format: {item}") | ||
|
|
||
| log.info(f"Extracted {len(text_results)} text elements from RapidOCR") |
There was a problem hiding this comment.
Move this code into a separate method to reduce complexity.
| # The output is a tuple with (result_list, timing_info) | ||
| # Where result_list is a list of [box_coords, text, confidence] triplets | ||
| result_list = result[0] | ||
| if isinstance(result_list, list): | ||
| text_results = [] | ||
| boxes_list = [] | ||
|
|
||
| log.info( | ||
| f"Processing tuple result with {len(result_list)} text items" | ||
| ) | ||
|
|
||
| # Process each detection item from tuple result | ||
| for item in result_list: | ||
| if isinstance(item, list) and len(item) == 3: | ||
| box_coords, text, confidence = item | ||
| text_results.append((text, float(confidence))) | ||
| boxes_list.append(box_coords) | ||
| log.debug( | ||
| f"Added text from tuple: '{text}' with confidence {confidence}" | ||
| ) | ||
| else: | ||
| log.warning( | ||
| f"Unexpected item format in tuple result: {item}" | ||
| ) | ||
|
|
||
| log.info( | ||
| f"Extracted {len(text_results)} text elements from tuple result" | ||
| ) |
There was a problem hiding this comment.
It appears to me this branch can be deleted if the return type from rapidocr is unified.
|
|
||
| # Use the existing HocrTransform for converting HOCR to PDF | ||
| # This is a simplified approach - a full implementation would create PDF directly | ||
| from ocrmypdf.hocrtransform import HocrTransform |
There was a problem hiding this comment.
Move import to top of file
| try: | ||
| conf = float(confidence) | ||
| except (TypeError, ValueError): | ||
| conf = 0.9 |
There was a problem hiding this comment.
Wouldn't it be better to omit confidence than report a fake value?
In some situations, RapidOCR has much more accurate results than Tesseract. I have created an implementation that makes it simple to use with OCRmyPDF.
Congratulations on OCRmyPDF; it is the most practical way to make scanned PDF files searchable.