Fix O(n) memory growth in PDF conversion by calling page.close() afte… #1612

Open
lesyk wants to merge 4 commits into microsoft:main from lesyk:u/vilesyk/perf

Conversation

@lesyk (Contributor) commented Mar 12, 2026

Problem
Fixes #1611. Since v0.1.5, converting large PDFs causes memory usage to grow linearly with page count (~1.1 MiB/page). A 400-page PDF consumes ~458 MiB instead of the expected ~7 MiB. The root cause is that pdfplumber's per-page cached properties (_rect_edges, _curve_edges, _edges, _objects, _layout) and get_textmap LRU cache are never freed during conversion.

Benchmark Results

| Pages | Before (peak) | After (peak) | Saving |
| ----- | ------------- | ------------ | ------ |
| 100   | 109.3 MiB     | 3.2 MiB      | 97%    |
| 200   | 225.6 MiB     | 4.5 MiB      | 98%    |
| 400   | 458.1 MiB     | 6.8 MiB      | 99%    |
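For reviewers, here is a minimal sketch of the pattern this PR describes: closing each pdfplumber page as soon as it has been converted, so its cached properties and `get_textmap` LRU entry are freed immediately. This is an illustration, not the actual `_pdf_converter.py` diff; the function name `pdf_to_text` is made up for the example.

```python
def pdf_to_text(path: str) -> str:
    """Extract text from a PDF with O(1) memory in the page count.

    Illustrative sketch of the fix in this PR, assuming pdfplumber's
    Page.close() API, which releases the page's cached objects/layout.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    chunks = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            # Without this call, each Page keeps its cached _objects,
            # _layout, edges, and textmap alive until the PDF is closed,
            # producing the ~1.1 MiB/page linear growth reported in #1611.
            page.close()
    return "\n".join(chunks)
```

The key point is that the cleanup happens inside the loop, per page, rather than once at the end when the whole document is closed.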

@lesyk (Contributor, Author) commented Mar 12, 2026

Waiting for verification from @smarsou.

@smarsou commented Mar 13, 2026

Awesome @lesyk !

My issue #1611 is fixed by these changes.

I'm not sure whether I tested this correctly: I just copied your new version of _pdf_converter.py into the markitdown library (v0.1.5) in my venv, and it works perfectly! Thanks a lot!

@smarsou left a review comment

Fixes #1611.

Development

Successfully merging this pull request may close these issues.

Memory usage regression in 0.1.5 during PDF text extraction

3 participants