Commit 406339a

Lntanohuang, Asksksn, and claude authored
Fix(paddleocr): load all PDF pages for image cropping instead of first 100 (#13811)
## Summary

Closes #13803

The `__images__` method in `paddleocr_parser.py` defaulted to `page_to=100`, only loading the first 100 pages for image cropping. However, the PaddleOCR API processes **all** pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation, causing content loss.

## Root Cause

```
__images__(page_to=100) → loads pages 0-99 → page_images has 100 entries
PaddleOCR API → processes all 226 pages → tags reference pages 1-226
extract_positions() → converts tag "101" to index 100
crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range"
```

## Fix

Changed `page_to` default from `100` to `10**9`, so all PDF pages are loaded for cropping. Python's list slicing safely handles oversized indices.

## Test plan

- [ ] Parse a PDF with >100 pages using PaddleOCR: no more "out of range" warnings
- [ ] Parse a PDF with <100 pages: behavior unchanged
- [ ] Verify cropped images are generated correctly for all pages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
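As an illustration of the root cause and fix described in the commit message, here is a minimal, self-contained sketch. The helpers `load_pages` and `valid_indices` are hypothetical stand-ins and are not the actual code in `paddleocr_parser.py`; they only assume that pages are selected with a slice and that crop validation checks `0 <= idx < len(page_images)`.

```python
# Hypothetical stand-ins for illustration only; not the real parser code.

def load_pages(all_pages, page_from=0, page_to=10**9):
    """Return pages in [page_from, page_to); slicing clamps an oversized end."""
    return all_pages[page_from:page_to]


def valid_indices(page_indices, page_images):
    """Keep only indices that pass the 0 <= idx < len(page_images) check."""
    return [i for i in page_indices if 0 <= i < len(page_images)]


# A 226-page document, matching the example in the commit message.
pdf = [f"page-{n}" for n in range(1, 227)]

old = load_pages(pdf, page_to=100)  # previous default: only pages 0-99
new = load_pages(pdf)               # new default: every page is loaded

tag_index = 101 - 1  # tag "101" is one-based, so it maps to index 100

print(len(old), valid_indices([tag_index], old))  # 100 []
print(len(new), valid_indices([tag_index], new))  # 226 [100]
```

With the old default, index 100 fails validation because only 100 page images exist. With the oversized upper bound, the slice simply stops at the end of the list, so the same index passes.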
1 parent 992a151 commit 406339a

1 file changed, +1 -1 lines changed

deepdoc/parser/paddleocr_parser.py

Lines changed: 1 addition & 1 deletion
@@ -422,7 +422,7 @@ def _transfer_to_tables(self, result: dict[str, Any]) -> list[TableTuple]:
         """Convert API response to table tuples."""
         return []
 
-    def __images__(self, fnm, page_from=0, page_to=100, callback=None):
+    def __images__(self, fnm, page_from=0, page_to=10**9, callback=None):
         """Generate page images from PDF for cropping."""
         self.page_from = page_from
         self.page_to = page_to
