Added support for rapidocr-onnxruntime #1502

Open

bigbruno wants to merge 2 commits into ocrmypdf:main from

bigbruno commented Mar 30, 2025

In some situations, RapidOCR has much more accurate results than Tesseract. I have created an implementation that makes it simple to use with OCRmyPDF.

Congratulations on OCRmyPDF; it is the most practical way to make scanned PDF files searchable.

bigbruno added 2 commits

March 30, 2025 17:35


          Add support for rapidocr-onnxruntime

3aa66a4


          Cleanning

6c1b727

jbarlow83 reviewed

Collaborator

jbarlow83 left a comment

Thank you for this PR, which is clearly a serious effort and will be a valuable contribution.

I do have some review comments and questions that I hope you can answer (especially about the interactions with GPU). The RapidOCR AI docs are all in Chinese, which I do not read.

src/ocrmypdf/_exec/rapidocr.py

Collaborator

jbarlow83 Mar 30, 2025

Everything in ocrmypdf._exec is for managing subprocesses we interact with. Since RapidOCR is a library and not a process, everything here should be moved to either builtin_plugins.rapidocr_engine or ocr_engine.rapidocr.

src/ocrmypdf/ocr_engine/rapidocr.py

+                      if not RAPIDOCR_AVAILABLE:
+                          error_msg = f"rapidocr_onnxruntime is not installed: {IMPORT_ERROR_MESSAGE}"
+                          log.error(error_msg)
+                          raise ImportError(

Collaborator

jbarlow83 Mar 30, 2025

Please raise ocrmypdf.exceptions.MissingDependencyError instead.
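
A minimal sketch of what that change could look like. The helper name require_module and its install_hint parameter are hypothetical, and the fallback class exists only so the snippet runs where OCRmyPDF is not installed; in the plugin itself the real ocrmypdf.exceptions.MissingDependencyError would be imported directly.

```python
import importlib

try:
    from ocrmypdf.exceptions import MissingDependencyError
except ImportError:
    # Stand-in so this sketch runs without OCRmyPDF installed.
    class MissingDependencyError(Exception):
        """Placeholder for ocrmypdf.exceptions.MissingDependencyError."""


def require_module(name: str, install_hint: str):
    """Import and return module `name`, or raise MissingDependencyError.

    `require_module` and `install_hint` are illustrative names, not part
    of OCRmyPDF.
    """
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise MissingDependencyError(
            f"{name} is not installed: {e}\nInstall it with: {install_hint}"
        ) from e
```

A caller would use it as, for example, require_module("rapidocr_onnxruntime", "pip install rapidocr_onnxruntime").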

src/ocrmypdf/ocr_engine/rapidocr.py

+                              f"{error_msg}\nInstall it with: pip install rapidocr_onnxruntime"
+                          )
+                      try:

Collaborator

jbarlow83 Mar 30, 2025

This should be restructured like

try:
    self._generate_hocr(...)
except Exception ...:
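
One way to read that suggestion is a thin public method that only wraps a private worker in a single try block. The sketch below assumes this reading. The class name, the OcrFailedError type, and the choice to log and re-raise are all hypothetical, since the comment leaves the except body unspecified.

```python
import logging

log = logging.getLogger(__name__)


class OcrFailedError(Exception):
    """Hypothetical error type used only in this sketch."""


class EngineSketch:
    """Illustrative stand-in for the engine class in the PR."""

    def generate_hocr(self, input_file, output_hocr):
        # The public method holds the only try/except; all real work
        # lives in _generate_hocr, which stays free of nested handlers.
        try:
            self._generate_hocr(input_file, output_hocr)
        except Exception as e:
            log.error("RapidOCR failed on %s: %s", input_file, e)
            raise OcrFailedError(f"OCR failed for {input_file}") from e

    def _generate_hocr(self, input_file, output_hocr):
        raise NotImplementedError("load image, run OCR, write hOCR")
```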

src/ocrmypdf/ocr_engine/rapidocr.py

+                      try:
+                          # Load image
+                          image = Image.open(input_file)

Collaborator

jbarlow83 Mar 30, 2025

The image is never closed, which can lead to high memory usage. Use a context manager instead: with Image.open(...) as image:

src/ocrmypdf/ocr_engine/rapidocr.py

Comment on lines +137 to +141

+                          padded = Image.new(
+                              image.mode, (image.width, image.height + 20), (255, 255, 255)
+                          )
+                          padded.paste(image, (0, 0))
+                          img_array = np.array(padded)

Collaborator

jbarlow83 Mar 30, 2025

Wrap this in a context manager (with Image.new(...) as padded:) so that the intermediate image is disposed of once img_array has been created.
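
Both image comments could be addressed together along these lines. This is a sketch, not the PR's code: the function name load_padded_array and the pad_bottom parameter are hypothetical, and the white fill tuple assumes an RGB input, as in the diff.

```python
import numpy as np
from PIL import Image


def load_padded_array(input_file, pad_bottom=20):
    """Return the page as an array with `pad_bottom` white rows appended.

    Assumes an RGB image, because the fill color is a 3-tuple.
    """
    with Image.open(input_file) as image:
        size = (image.width, image.height + pad_bottom)
        with Image.new(image.mode, size, (255, 255, 255)) as padded:
            padded.paste(image, (0, 0))
            # np.array copies the pixel data, so the array stays valid
            # after both images are closed on leaving the with blocks.
            return np.array(padded)
```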

src/ocrmypdf/ocr_engine/rapidocr.py

Comment on lines +162 to +166

+                          elif isinstance(result, tuple):
+                              log.info(f"Got tuple with {len(result)} items")
+                              for i, item in enumerate(result):
+                                  if i < 2:  # Only log first two items to avoid excessive output
+                                      log.info(f"Tuple item {i}: type={type(item)}, value={item}")

Collaborator

jbarlow83 Mar 30, 2025

Is there a way to ask rapidocr to always return the same data type?
If not, turn the tuple into a list with something like result = [result] and eliminate all of the branching based on return value.
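
If RapidOCR cannot be asked for a single return type, a small normalizing helper is one option. The sketch below assumes only the two shapes described in the diff's own comments: a bare list of [box, text, confidence] triplets, or a (result_list, timing_info) tuple. The function name is hypothetical, and the shapes have not been checked against the RapidOCR documentation.

```python
def normalize_result(result):
    """Return a plain list of [box, text, confidence] items.

    Accepts None, a bare list of items, or a tuple whose first element
    is that list (the second being timing info, which is dropped).
    """
    if isinstance(result, tuple):
        # Assumed shape: (result_list, timing_info)
        result = result[0] if result else None
    if not result:
        return []
    return list(result)
```

With this in place, the caller can iterate over one list and the isinstance branches on the return value are no longer needed.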

src/ocrmypdf/ocr_engine/rapidocr.py

Comment on lines +172 to +186

+                              # Format appears to be direct data from PaddleOCR
+                              # Each element is [box_coordinates, text, confidence]
+                              text_results = []
+                              boxes_list = []
+                              for item in result:
+                                  if len(item) == 3:  # [box, text, confidence]
+                                      box_coords, text, confidence = item
+                                      text_results.append((text, float(confidence)))
+                                      boxes_list.append(box_coords)
+                                      log.debug(f"Added text '{text}' with box {box_coords}")
+                                  else:
+                                      log.warning(f"Unexpected result item format: {item}")
+                              log.info(f"Extracted {len(text_results)} text elements from RapidOCR")

Collaborator

jbarlow83 Mar 30, 2025

Move this code into a separate method to reduce complexity.
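
A possible shape for the extracted helper, written as a sketch: the name split_text_and_boxes is hypothetical, and the item layout follows the diff's comment that each element is [box_coordinates, text, confidence].

```python
import logging

log = logging.getLogger(__name__)


def split_text_and_boxes(items):
    """Split OCR items into (text, confidence) pairs and their boxes.

    Items that are not 3-element sequences are logged and skipped.
    """
    text_results = []
    boxes_list = []
    for item in items:
        if isinstance(item, (list, tuple)) and len(item) == 3:
            box_coords, text, confidence = item
            text_results.append((text, float(confidence)))
            boxes_list.append(box_coords)
        else:
            log.warning("Unexpected result item format: %r", item)
    log.info("Extracted %d text elements from RapidOCR", len(text_results))
    return text_results, boxes_list
```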

src/ocrmypdf/ocr_engine/rapidocr.py

Comment on lines +188 to +215

+                              # The output is a tuple with (result_list, timing_info)
+                              # Where result_list is a list of [box_coords, text, confidence] triplets
+                              result_list = result[0]
+                              if isinstance(result_list, list):
+                                  text_results = []
+                                  boxes_list = []
+                                  log.info(
+                                      f"Processing tuple result with {len(result_list)} text items"
+                                  )
+                                  # Process each detection item from tuple result
+                                  for item in result_list:
+                                      if isinstance(item, list) and len(item) == 3:
+                                          box_coords, text, confidence = item
+                                          text_results.append((text, float(confidence)))
+                                          boxes_list.append(box_coords)
+                                          log.debug(
+                                              f"Added text from tuple: '{text}' with confidence {confidence}"
+                                          )
+                                      else:
+                                          log.warning(
+                                              f"Unexpected item format in tuple result: {item}"
+                                          )
+                                  log.info(
+                                      f"Extracted {len(text_results)} text elements from tuple result"
+                                  )

Collaborator

jbarlow83 Mar 30, 2025

It appears to me this branch can be deleted if the return type from rapidocr is unified.

src/ocrmypdf/ocr_engine/rapidocr.py

+                      # Use the existing HocrTransform for converting HOCR to PDF
+                      # This is a simplified approach - a full implementation would create PDF directly
+                      from ocrmypdf.hocrtransform import HocrTransform

Collaborator

jbarlow83 Mar 30, 2025

Move this import to the top of the file.

src/ocrmypdf/ocr_engine/rapidocr.py

Comment on lines +422 to +425

+                  try:
+                      conf = float(confidence)
+                  except (TypeError, ValueError):
+                      conf = 0.9

Collaborator

jbarlow83 Mar 30, 2025

Wouldn't it be better to omit confidence than report a fake value?
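
Omitting the value could look like the sketch below, which returns None for an unparseable confidence and leaves the hOCR x_wconf property out of the title attribute in that case. Both function names are hypothetical, and the scaling assumes RapidOCR reports confidence in the range 0 to 1 while x_wconf is expressed as 0 to 100.

```python
from typing import Optional, Sequence


def parse_confidence(value) -> Optional[float]:
    """Return the confidence as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def hocr_word_title(bbox: Sequence[int], confidence=None) -> str:
    """Build an hOCR title attribute, adding x_wconf only when known."""
    x0, y0, x1, y1 = bbox
    title = f"bbox {x0} {y0} {x1} {y1}"
    conf = parse_confidence(confidence)
    if conf is not None:
        # Assumption: input is 0..1, hOCR x_wconf is 0..100.
        title += f"; x_wconf {round(conf * 100)}"
    return title
```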
