fix(paddleocr): load all PDF pages for image cropping instead of first 100 #13811

Merged
yingfeng merged 1 commit into infiniflow:main from
Lntanohuang:fix/paddleocr-page-index-out-of-range
Mar 27, 2026

Conversation

@Lntanohuang
Contributor

Summary

Closes #13803

The __images__ method in paddleocr_parser.py defaulted to page_to=100, only loading the first 100 pages for image cropping. However, the PaddleOCR API processes all pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation, causing content loss.

Root Cause

__images__(page_to=100) → loads pages 0-99 → page_images has 100 entries
PaddleOCR API → processes all 226 pages → tags reference pages 1-226
extract_positions() → converts tag "101" to index 100
crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range"
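The chain above can be reproduced in a few lines. This is only a minimal sketch, not the code from paddleocr_parser.py: `load_pages` and `valid_indices` are hypothetical stand-ins for `__images__` and the range check in `crop()`, and the page list is a placeholder for rendered page images.

```python
# Hypothetical stand-in for __images__: keeps only pages in [page_from, page_to).
def load_pages(all_pages, page_from=0, page_to=100):
    return all_pages[page_from:page_to]

# Hypothetical stand-in for the crop() range check: drops indices with no loaded page.
def valid_indices(page_indices, page_images):
    return [i for i in page_indices if 0 <= i < len(page_images)]

pdf_pages = [f"page-{n}" for n in range(1, 227)]  # a 226-page PDF
page_images = load_pages(pdf_pages)               # old default: only 100 pages loaded

tag_page = 101            # 1-based page number taken from a tag
index = tag_page - 1      # converted to the 0-based index 100
kept = valid_indices([index], page_images)

print(len(page_images))   # 100
print(kept)               # [] because 0 <= 100 < 100 is False
```

With every page loaded, the same index passes the check, which is the behavior the fix aims for.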

Fix

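Changed the page_to default from 100 to 10**9, so all PDF pages are loaded for cropping. Python's list slicing safely handles oversized indices: a slice stop beyond the end of the list is clamped to the list length instead of raising IndexError.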
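A quick way to check the slicing claim, assuming the pages are held in a plain Python list (the variable names here are illustrative only):

```python
pages = list(range(226))          # placeholder for 226 loaded pages

old_default = pages[0:100]        # previous default, page_to=100
new_default = pages[0:10**9]      # new default, page_to=10**9

print(len(old_default))           # 100
print(len(new_default))           # 226, the stop is clamped to len(pages)
```

A very large sentinel keeps the signature unchanged for callers that still pass an explicit page_to.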

Test plan

  • Parse a PDF with >100 pages using PaddleOCR — no more "out of range" warnings
  • Parse a PDF with <100 pages — behavior unchanged
  • Verify cropped images are generated correctly for all pages

🤖 Generated with Claude Code

…t 100

The __images__ method defaulted to page_to=100, but the PaddleOCR API
processes all pages of the PDF. For PDFs with more than 100 pages, page
indices beyond 99 were rejected as out of range during crop validation.

Closes infiniflow#13803

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. 🐞 bug Something isn't working, pull request that fix bug. labels Mar 26, 2026
@Magicbook1108 Magicbook1108 added the ci Continue Integration label Mar 26, 2026
@yingfeng yingfeng marked this pull request as draft March 26, 2026 13:13
@yingfeng yingfeng marked this pull request as ready for review March 26, 2026 13:13
@codecov

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.72%. Comparing base (1b29522) to head (7f010bc).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #13811       +/-   ##
===========================================
+ Coverage   49.52%   96.72%   +47.19%     
===========================================
  Files          45       10       -35     
  Lines        9657      702     -8955     
  Branches      112      112               
===========================================
- Hits         4783      679     -4104     
+ Misses       4864        5     -4859     
- Partials       10       18        +8     

@yingfeng yingfeng merged commit 406339a into infiniflow:main Mar 27, 2026
1 check passed

Labels

🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: [PaddleOCR] All page indices [101] out of range for 100 pages; skipping.

3 participants