Memory usage regression in 0.1.5 during PDF text extraction #1611

@smarsou

Context

Hello,

We are using markitdown to extract text from PDFs in a VM with 2 GB of RAM.

Everything worked correctly with markitdown 0.1.4, but after upgrading to 0.1.5 our Python process started failing with exit code 137 (128 + 9, i.e. SIGKILL), which typically indicates that the process was killed by the out-of-memory (OOM) killer.

After profiling the memory usage, it seems that markitdown consumes significantly more memory starting from version 0.1.5.


Observed behavior

Memory usage when converting a 400-page PDF to Markdown:

Version | Peak memory
0.1.4   | ~200 MiB
0.1.5   | ~2.7 GiB

This makes the new version unusable in low-memory environments.

Memory profiling with markitdown==0.1.4

[Image: memory usage plot for markitdown==0.1.4]

Memory profiling with markitdown==0.1.5

[Image: memory usage plot for markitdown==0.1.5]

Minimal reproduction

  1. Use a PDF with many pages (~100+)
  2. Install memory-profiler
  3. Install either version to compare behavior:
    • markitdown[all]==0.1.4
    • markitdown[all]==0.1.5

Run the script with memory-profiler:

mprof run python my_script.py

where my_script.py contains:

from markitdown import MarkItDown

# Replace with the path to the PDF under test.
PDF_PATH = "<PDF_PATH>"

def main() -> None:
    md = MarkItDown()
    md.convert(PDF_PATH)

if __name__ == "__main__":
    main()

Generate the memory plot:

mprof plot
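
If a single number is easier to compare than a plot, the peak can also be captured from inside the process. The sketch below uses the standard-library tracemalloc module; the helper names peak_python_memory and profile_pdf are mine and not part of markitdown. Note that tracemalloc only counts allocations made through Python's allocator, so the figure will not match the RSS values reported by mprof exactly.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def peak_python_memory(fn: Callable[[], T]) -> Tuple[T, int]:
    """Run fn and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = fn()
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def profile_pdf(pdf_path: str) -> float:
    """Hypothetical helper: peak MiB while converting one PDF with markitdown."""
    # Imported lazily so the helper above works without markitdown installed.
    from markitdown import MarkItDown

    _, peak = peak_python_memory(lambda: MarkItDown().convert(pdf_path))
    return peak / (1024 * 1024)
```

Running profile_pdf on the same file under 0.1.4 and 0.1.5 should give two directly comparable values.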

Possible source of the issue

It seems this regression might be related to PR #1499, which introduced a new method for table extraction.

During profiling, I noticed that the following line appears to significantly increase memory usage:

words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)

File:

packages/markitdown/src/markitdown/converters/_pdf_converter.py

Line:

132

From the profiling results, memory usage increases by roughly 5 MiB per page when this line is executed. At that rate a 400-page PDF accumulates about 2 GiB, which accounts for most of the ~2.7 GiB peak reported above.
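
For anyone who wants to check the per-page growth independently, here is a rough sketch of how it can be measured. It records how much traced memory is still held after each item is processed. The function name retained_growth_per_item is hypothetical, and the usage shown in the comment assumes the page objects come from pdfplumber, which I have not confirmed is what the converter uses at that line.

```python
import tracemalloc
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def retained_growth_per_item(
    items: Iterable[T], process: Callable[[T], object]
) -> List[int]:
    """Return the change in traced memory (bytes) after processing each item."""
    growth: List[int] = []
    tracemalloc.start()
    try:
        previous, _ = tracemalloc.get_traced_memory()
        for item in items:
            process(item)  # the result is discarded, only retained memory counts
            current, _ = tracemalloc.get_traced_memory()
            growth.append(current - previous)
            previous = current
    finally:
        tracemalloc.stop()
    return growth


# Hypothetical usage, assuming pdfplumber page objects:
#   import pdfplumber
#   with pdfplumber.open("document.pdf") as pdf:
#       growth = retained_growth_per_item(
#           pdf.pages,
#           lambda p: p.extract_words(
#               keep_blank_chars=True, x_tolerance=3, y_tolerance=3
#           ),
#       )
#   print(sum(growth) / len(growth) / 2**20, "MiB per page on average")
```

A steady positive value per page would suggest that something keeps per-page data alive after the call returns, rather than a one-time allocation.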


Expected behavior

Memory usage similar to 0.1.4 (or at least within the same order of magnitude), allowing the library to run in environments with limited memory.


If helpful, I can provide additional profiling data or test cases.
