Description
Context
Hello,
We are using markitdown to extract text from PDFs in a VM with 2 GB of RAM.
Everything was working correctly with markitdown 0.1.4, but after upgrading to 0.1.5, our Python process started failing with exit code 137, which typically indicates the process was killed due to out-of-memory (OOM).
After profiling the memory usage, it seems that markitdown consumes significantly more memory starting from version 0.1.5.
Observed behavior
Memory usage when converting a 400-page PDF to Markdown:
| Version | Peak memory |
|---|---|
| 0.1.4 | ~200 MiB |
| 0.1.5 | ~2.7 GiB |
This makes the new version unusable in low-memory environments.
Memory profiling with markitdown==0.1.4
Memory profiling with markitdown==0.1.5
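The figures in this issue come from memory-profiler. As a rough cross-check that needs no extra dependency, a minimal sketch using the standard library's `tracemalloc` is shown below. The helper name `measure_peak_mib` is hypothetical (it is not part of markitdown). Note that `tracemalloc` only tracks allocations made through Python's allocators, so its numbers may be lower than the process RSS that `mprof` reports.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")
MIB = 1024 * 1024


def measure_peak_mib(func: Callable[[], T]) -> Tuple[T, float]:
    """Run func() and return (result, peak traced memory in MiB).

    Hypothetical helper for illustration; it measures memory traced by
    tracemalloc, not the resident set size of the process.
    """
    tracemalloc.start()
    try:
        result = func()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / MIB


if __name__ == "__main__":
    # Stand-in workload: allocate 8 MiB. To profile a conversion instead,
    # pass something like: lambda: MarkItDown().convert(PDF_PATH)
    _, peak = measure_peak_mib(lambda: bytearray(8 * MIB))
    print(f"peak traced memory: {peak:.1f} MiB")
```

Running the same call under both 0.1.4 and 0.1.5 should give a comparable pair of numbers, although absolute values will likely differ from the `mprof` plots.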
Minimal reproduction
- Use a PDF with many pages (~100+)
- Install `memory-profiler`
- Install either version to compare behavior:
  - `markitdown[all]==0.1.4`
  - `markitdown[all]==0.1.5`

Run the script with memory-profiler:

```
mprof run python my_script.py
```

where `my_script.py` contains:

```python
from markitdown import MarkItDown

PDF_PATH = "<PDF_PATH>"  # replace with the path to a large PDF


def main() -> None:
    md = MarkItDown()
    md.convert(PDF_PATH)


if __name__ == "__main__":
    main()
```

Generate the memory plot:

```
mprof plot
```

Possible source of the issue
It seems this regression might be related to PR #1499, which introduced a new method for table extraction.
During profiling, I noticed that the following line appears to significantly increase memory usage:
```python
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
```

- File: `packages/markitdown/src/markitdown/converters/_pdf_converter.py`
- Line: 132
From the profiling results, memory usage increases by roughly 5 MiB per page when this line is executed, which scales quickly for large PDFs.
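To make the scaling concrete, here is a back-of-the-envelope sketch using only the figures reported in this issue (about 5 MiB per page, a 400-page PDF). The function names are hypothetical, and the 2048 MiB budget is an assumption standing in for the 2 GB VM; it ignores memory used by the OS and the rest of the process.

```python
MIB_PER_GIB = 1024


def projected_growth_mib(pages: int, mib_per_page: float) -> float:
    """Memory growth if each page retains mib_per_page MiB (linear model)."""
    return pages * mib_per_page


def max_pages_within(budget_mib: float, mib_per_page: float) -> int:
    """Largest page count whose projected growth fits in budget_mib."""
    return int(budget_mib // mib_per_page)


if __name__ == "__main__":
    growth = projected_growth_mib(pages=400, mib_per_page=5.0)
    # 400 pages * 5 MiB/page = 2000 MiB, i.e. about 1.95 GiB
    print(f"projected growth: {growth:.0f} MiB (~{growth / MIB_PER_GIB:.2f} GiB)")

    # Assumed budget: 2 GiB expressed in MiB, with no allowance for anything else.
    print(f"pages that fit in 2048 MiB: {max_pages_within(2048, 5.0)}")
```

Under this linear model the per-page growth alone approaches the VM's total memory at around 400 pages, which is the same order of magnitude as the ~2.7 GiB peak measured above. The estimate does not account for the remaining difference.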
Expected behavior
Memory usage similar to 0.1.4 (or at least within the same order of magnitude), allowing the library to run in environments with limited memory.
If helpful, I can provide additional profiling data or test cases.