fix(paddleocr): load all PDF pages for image cropping instead of first 100 by Lntanohuang · Pull Request #13811 · infiniflow/ragflow

Lntanohuang · 2026-03-26T09:05:46Z

Summary

The __images__ method in paddleocr_parser.py defaulted to page_to=100, only loading the first 100 pages for image cropping. However, the PaddleOCR API processes all pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation, causing content loss.

Root Cause

__images__(page_to=100) → loads pages 0-99 → page_images has 100 entries
PaddleOCR API → processes all 226 pages → tags reference pages 1-226
extract_positions() → converts tag "101" to index 100
crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range"
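
The trace above can be reproduced in isolation. The sketch below is a minimal illustration, not the code in paddleocr_parser.py: `load_page_images`, `tag_to_index`, and `in_range` are hypothetical stand-ins for the page loader, the tag conversion, and the crop validation, and the 226-page count is taken from the example above.

```python
# Hypothetical stand-ins that mimic the behaviour described in the trace;
# none of these names come from the real parser.

def load_page_images(total_pages, page_from=0, page_to=100):
    """Simulate page loading: one placeholder per page in [page_from, page_to)."""
    all_pages = [f"page-{i}" for i in range(total_pages)]
    return all_pages[page_from:page_to]


def tag_to_index(tag):
    """Convert a 1-based page tag such as "101" to a 0-based list index."""
    return int(tag) - 1


def in_range(index, page_images):
    """Same shape of check as the crop validation: 0 <= index < len(page_images)."""
    return 0 <= index < len(page_images)


page_images = load_page_images(total_pages=226)  # old default page_to=100
idx = tag_to_index("101")                        # first page past the loaded window

print(len(page_images), idx, in_range(idx, page_images))
```

With the old default, only 100 images are loaded, so index 100 fails the bounds check even though the page exists in the PDF.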

Fix

Changed the page_to default from 100 to 10**9, so all PDF pages are loaded for cropping. Python list slicing clamps an oversized stop index to the length of the list, so the large default raises no error and loads no extra pages.
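
A quick way to see why the oversized default is safe: slicing a list with a stop index larger than its length returns everything up to the end. This is plain Python behaviour shown on a stand-in list, not the parser's own data.

```python
# Stand-in for a 226-page document; the integers play the role of page images.
pages = list(range(226))

first_hundred = pages[0:100]    # old default: truncated to 100 entries
everything = pages[0:10**9]     # new default: stop is clamped to len(pages)

print(len(first_hundred), len(everything))

# A short document is unaffected by the larger default.
short_doc = list(range(42))
print(short_doc[0:10**9] == short_doc)
```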

Test plan

Parse a PDF with >100 pages using PaddleOCR — no more "out of range" warnings
Parse a PDF with <100 pages — behavior unchanged
Verify cropped images are generated correctly for all pages
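
The test plan can also be approximated without a real PDF. The following is a sketch under stated assumptions: `out_of_range_tags` is a hypothetical helper that simulates page loading with a list slice and applies the same `0 <= index < len(page_images)` check described under Root Cause. It does not call PaddleOCR or the actual parser.

```python
DEFAULT_PAGE_TO = 10**9  # default after the fix, per the PR description


def out_of_range_tags(total_pages, page_to=DEFAULT_PAGE_TO):
    """Return the 1-based page tags whose 0-based index is outside the loaded pages."""
    loaded = list(range(total_pages))[0:page_to]  # simulated page_images
    return [
        tag
        for tag in range(1, total_pages + 1)
        if not 0 <= tag - 1 < len(loaded)
    ]


# More than 100 pages: nothing is rejected with the new default.
print(out_of_range_tags(226))

# Fewer than 100 pages: behaviour is unchanged.
print(out_of_range_tags(42))

# Old default for comparison: tags 101..226 are rejected.
rejected = out_of_range_tags(226, page_to=100)
print(rejected[0], rejected[-1], len(rejected))
```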

🤖 Generated with Claude Code

…t 100 The __images__ method defaulted to page_to=100, but the PaddleOCR API processes all pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation. Closes infiniflow#13803 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-26T14:02:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.72%. Comparing base (1b29522) to head (7f010bc).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #13811       +/-   ##
===========================================
+ Coverage   49.52%   96.72%   +47.19%     
===========================================
  Files          45       10       -35     
  Lines        9657      702     -8955     
  Branches      112      112               
===========================================
- Hits         4783      679     -4104     
+ Misses       4864        5     -4859     
- Partials       10       18        +8


dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. 🐞 bug Something isn't working, pull request that fix bug. labels Mar 26, 2026

Magicbook1108 added the ci Continue Integration label Mar 26, 2026

yingfeng marked this pull request as draft March 26, 2026 13:13

yingfeng marked this pull request as ready for review March 26, 2026 13:13

yingfeng merged commit 406339a into infiniflow:main Mar 27, 2026
1 check passed

yingfeng merged 1 commit into infiniflow:main from Lntanohuang:fix/paddleocr-page-index-out-of-range
