-
-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Added support for rapidocr-onnxruntime #1502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
jbarlow83
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR which is definitely a serious effort and will be a valuable contribution.
I do have some review comments and questions I hope you can answer (especially about the interactions with GPU) - the RapidOCR AI docs are all Chinese, which I do not read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in ocrmypdf._exec is for managing subprocesses we interact with. Since RapidOCR is a library and not a process, everything here should be moved to either builtin_plugins.rapidocr_engine or ocr_engine.rapidocr
| if not RAPIDOCR_AVAILABLE: | ||
| error_msg = f"rapidocr_onnxruntime is not installed: {IMPORT_ERROR_MESSAGE}" | ||
| log.error(error_msg) | ||
| raise ImportError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please raise ocrmypdf.exceptions.MissingDependencyError instead.
| f"{error_msg}\nInstall it with: pip install rapidocr_onnxruntime" | ||
| ) | ||
|
|
||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be restructured like
try:
self._generate_hocr(...)
except Exception ...:
|
|
||
| try: | ||
| # Load image | ||
| image = Image.open(input_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Image is never closed - can lead to high memory usage. Use with Image.open(...) as image: instead.
| padded = Image.new( | ||
| image.mode, (image.width, image.height + 20), (255, 255, 255) | ||
| ) | ||
| padded.paste(image, (0, 0)) | ||
| img_array = np.array(padded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use with Image.new() as padded: to ensure disposal of intermediate image after creating img_array.
| elif isinstance(result, tuple): | ||
| log.info(f"Got tuple with {len(result)} items") | ||
| for i, item in enumerate(result): | ||
| if i < 2: # Only log first two items to avoid excessive output | ||
| log.info(f"Tuple item {i}: type={type(item)}, value={item}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to ask rapidocr to always return the same data type?
If not, turn the tuple into a list with something like
result = [result] and eliminate all of the branching based on return value.
| # Format appears to be direct data from PaddleOCR | ||
| # Each element is [box_coordinates, text, confidence] | ||
| text_results = [] | ||
| boxes_list = [] | ||
|
|
||
| for item in result: | ||
| if len(item) == 3: # [box, text, confidence] | ||
| box_coords, text, confidence = item | ||
| text_results.append((text, float(confidence))) | ||
| boxes_list.append(box_coords) | ||
| log.debug(f"Added text '{text}' with box {box_coords}") | ||
| else: | ||
| log.warning(f"Unexpected result item format: {item}") | ||
|
|
||
| log.info(f"Extracted {len(text_results)} text elements from RapidOCR") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this code into a separate method to reduce complexity.
| # The output is a tuple with (result_list, timing_info) | ||
| # Where result_list is a list of [box_coords, text, confidence] triplets | ||
| result_list = result[0] | ||
| if isinstance(result_list, list): | ||
| text_results = [] | ||
| boxes_list = [] | ||
|
|
||
| log.info( | ||
| f"Processing tuple result with {len(result_list)} text items" | ||
| ) | ||
|
|
||
| # Process each detection item from tuple result | ||
| for item in result_list: | ||
| if isinstance(item, list) and len(item) == 3: | ||
| box_coords, text, confidence = item | ||
| text_results.append((text, float(confidence))) | ||
| boxes_list.append(box_coords) | ||
| log.debug( | ||
| f"Added text from tuple: '{text}' with confidence {confidence}" | ||
| ) | ||
| else: | ||
| log.warning( | ||
| f"Unexpected item format in tuple result: {item}" | ||
| ) | ||
|
|
||
| log.info( | ||
| f"Extracted {len(text_results)} text elements from tuple result" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears to me this branch can be deleted if the return type from rapidocr is unified.
|
|
||
| # Use the existing HocrTransform for converting HOCR to PDF | ||
| # This is a simplified approach - a full implementation would create PDF directly | ||
| from ocrmypdf.hocrtransform import HocrTransform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move import to top of file
| try: | ||
| conf = float(confidence) | ||
| except (TypeError, ValueError): | ||
| conf = 0.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to omit confidence than report a fake value?
In some situations, RapidOCR has much more accurate results than Tesseract. I have created an implementation that makes it simple to use with OCRmyPDF.
Congratulations on OCRmyPDF; it is the most practical way to make scanned PDF files searchable.