Memory usage regression in 0.1.5 during PDF text extraction #1611

## Context

Hello,

We are using `markitdown` to extract text from PDFs in a VM with **2 GB of RAM**.

Everything worked correctly with **markitdown 0.1.4**, but after upgrading to **0.1.5** our Python process started failing with **exit code 137**, which typically indicates that the process was killed for running **out of memory (OOM)**.
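
For reference, exit code 137 follows the shell convention of 128 plus the signal number, so it corresponds to signal 9. Here is a minimal sketch for decoding such codes on a POSIX system (the helper name `signal_from_exit_code` is illustrative, not part of any library):

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name for a shell-style exit code above 128, else None."""
    if exit_code <= 128:
        return None
    try:
        # Shells report "killed by signal N" as exit status 128 + N.
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The remainder is not a signal number known on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_code(137))  # SIGKILL on Linux
```

`SIGKILL` alone does not prove an OOM kill, but together with the memory profile it is the most likely explanation.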

Profiling shows that `markitdown` consumes **significantly more memory** from version `0.1.5` onward (figures below).

---

## Observed behavior

Memory usage when converting a **400-page PDF to Markdown**:

| Version | Peak Memory |
|-------|-------------|
| `0.1.4` | ~200 MiB |
| `0.1.5` | ~2.7 GiB |

The `0.1.5` peak exceeds the 2 GB available on our VM, which makes the new version unusable in low-memory environments.

### Memory profiling with `markitdown==0.1.4`
<img width="832" height="455" alt="Image" src="https://github.com/user-attachments/assets/822726b0-f5de-452e-8d1d-da84f6b0a620" />

### Memory profiling with `markitdown==0.1.5`
<img width="848" height="448" alt="Image" src="https://github.com/user-attachments/assets/e7b51dc3-6e3f-4a79-864f-677183e1d085" />

---

## Minimal reproduction

1. Use a PDF with many pages (roughly 100 or more)
2. Install `memory-profiler`
3. Install either version to compare behavior:
   - `markitdown[all]==0.1.4`
   - `markitdown[all]==0.1.5`

Run the script below (saved as `my_script.py`) with `memory-profiler`:

```bash
mprof run python my_script.py
```

```python
from markitdown import MarkItDown

PDF_PATH = "<PDF_PATH>"  # placeholder: replace with the path to your PDF

def main() -> None:
    md = MarkItDown()
    md.convert(PDF_PATH)

if __name__ == "__main__":
    main()
```

Generate the memory plot:

```bash
mprof plot
```
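
If `mprof` is not available, a rough cross-check is possible with the standard library alone. The sketch below uses `tracemalloc`, which only counts allocations made through Python's memory allocator, so its numbers will not match the process-level figures from `mprof` exactly. The helper name `peak_python_memory` and the stand-in workload are assumptions for illustration:

```python
import tracemalloc
from typing import Any, Callable, Tuple


def peak_python_memory(func: Callable[[], Any]) -> Tuple[Any, int]:
    """Run func and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    # Stand-in workload; for this issue it would be something like
    # lambda: MarkItDown().convert(PDF_PATH)
    _, peak = peak_python_memory(lambda: [bytes(1024) for _ in range(10_000)])
    print(f"peak traced memory: {peak / 2**20:.1f} MiB")
```

Running this once per installed version gives two numbers that can be compared directly.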

---

## Possible source of the issue

This regression may be related to **PR #1499**, which introduced a new method for table extraction.

During profiling, I noticed that the following line appears to significantly increase memory usage:

```python
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
```

Location: `packages/markitdown/src/markitdown/converters/_pdf_converter.py`, line 132

From the profiling results, memory usage increases by roughly **5 MiB per page** when this line is executed, which scales quickly for large PDFs.
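
As a rough sanity check, the per-page figure can be extrapolated to the 400-page test document. This sketch assumes growth is linear and that nothing is released between pages, which is a simplification. The function name is illustrative:

```python
def estimated_growth_mib(pages: int, mib_per_page: float = 5.0) -> float:
    """Extra memory in MiB, assuming linear growth that is never released."""
    return pages * mib_per_page


growth = estimated_growth_mib(400)
print(f"{growth:.0f} MiB (~{growth / 1024:.2f} GiB)")  # 2000 MiB (~1.95 GiB)
```

Added to the ~200 MiB seen with `0.1.4`, this lands in the same range as the ~2.7 GiB peak observed with `0.1.5`, so the per-page growth plausibly accounts for most of the regression.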

---

## Expected behavior

Memory usage similar to `0.1.4` (or at least within the same order of magnitude), allowing the library to run in environments with limited memory.
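
To make "same order of magnitude" concrete, here is one possible acceptance check, taking it to mean less than 10x the baseline. The threshold and the helper name are assumptions, not an existing test in the project:

```python
def within_order_of_magnitude(baseline_mib: float, measured_mib: float) -> bool:
    """True if the measured peak is less than 10x the baseline peak."""
    return measured_mib < 10 * baseline_mib


BASELINE_MIB = 200         # ~200 MiB reported above for 0.1.4
MEASURED_MIB = 2.7 * 1024  # ~2.7 GiB reported above for 0.1.5

print(within_order_of_magnitude(BASELINE_MIB, MEASURED_MIB))  # False
```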

---

If helpful, I can provide additional profiling data or test cases.


