Bug Description
The `extract_text_from_pdf` function in `pdf.py` (lines 91-100) creates PyMuPDF document objects but never closes them, causing memory leaks when processing multiple PDFs.
Current Implementation
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)  # ← Document opened
        pages = range(doc.page_count)

        resume_text = to_markdown(
            doc,
            pages=pages,
        )
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        # ← Missing: doc.close()
    except (FileNotFoundError, ValueError) as e:
        # Handle validation errors with specific messages
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        # Handle PyMuPDF and other processing errors
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
```
Problem
- Memory Leak: PyMuPDF document objects are not properly closed after processing
- Resource Exhaustion: When processing multiple PDFs, memory usage keeps increasing
- System Instability: Can lead to out-of-memory errors in production environments
- Poor Resource Management: Violates best practices for file handle management
Impact
- Memory Usage: Continuous memory growth when processing multiple PDFs
- Performance: System becomes slower over time
- Reliability: Can cause crashes in memory-constrained environments
- Scalability: Prevents processing large batches of resumes efficiently
Reproduction Steps
- Process multiple PDF files in sequence using the same `PDFHandler` instance
- Monitor memory usage (e.g., with `psutil` or a system monitor); a reproduction sketch follows this list
- Observe continuous memory growth that does not get released
- Memory usage keeps increasing until the system runs out of memory
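A minimal reproduction sketch, assuming `psutil` is installed and that `PDFHandler` can be imported from `pdf` with a no-argument constructor; the import path and sample file name are placeholders:
```python
# Reproduction sketch: the import path, constructor, and file name are assumptions.
import os

import psutil
from pdf import PDFHandler  # hypothetical import path

handler = PDFHandler()
process = psutil.Process(os.getpid())

for i in range(100):
    handler.extract_text_from_pdf("sample_resume.pdf")
    rss_mb = process.memory_info().rss / (1024 * 1024)
    # With the current implementation, RSS keeps climbing because documents are never closed.
    print(f"iteration {i}: RSS = {rss_mb:.1f} MiB")
```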
Expected Behavior
The function should:
- Open the PDF document
- Process the content
- Close the document to free memory
- Handle exceptions while ensuring cleanup
Proposed Solution
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    doc = None
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)
        pages = range(doc.page_count)

        resume_text = to_markdown(doc, pages=pages)
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
    finally:
        # Ensure document is always closed
        if doc is not None:
            doc.close()
```
Alternative Solution (Context Manager)
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        self._validate_pdf_file(pdf_path)

        with pymupdf.open(pdf_path) as doc:  # Context manager ensures cleanup
            pages = range(doc.page_count)
            resume_text = to_markdown(doc, pages=pages)
            logger.debug(
                f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
            )
            return resume_text
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
```
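Either fix works; the context-manager form is a bit simpler because current PyMuPDF document objects support the `with` statement, so the file is closed automatically even if `to_markdown` raises, and no separate `doc = None` bookkeeping is needed.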
Additional Considerations
- Testing: Add memory usage tests to prevent regression (a rough test sketch follows this list)
- Monitoring: Consider adding memory usage logging for production
- Documentation: Update code comments to emphasize resource cleanup
- Code Review: Ensure all file operations follow proper cleanup patterns
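A possible regression test, assuming `pytest` and `psutil` are available in the test environment; the fixture path, growth threshold, and `PDFHandler` import are placeholders:
```python
# Hypothetical memory-regression test. Assumes a small fixture PDF exists at
# tests/fixtures/sample_resume.pdf; adjust paths and thresholds to the project.
import os

import psutil
from pdf import PDFHandler  # hypothetical import path


def test_repeated_extraction_does_not_grow_memory():
    handler = PDFHandler()
    process = psutil.Process(os.getpid())

    # Warm up so one-time allocations (imports, caches) are not counted as a leak.
    for _ in range(5):
        handler.extract_text_from_pdf("tests/fixtures/sample_resume.pdf")
    baseline = process.memory_info().rss

    for _ in range(50):
        handler.extract_text_from_pdf("tests/fixtures/sample_resume.pdf")
    growth_mb = (process.memory_info().rss - baseline) / (1024 * 1024)

    # Threshold is a loose heuristic; tune it to the fixture size and environment.
    assert growth_mb < 20, f"memory grew by {growth_mb:.1f} MiB across repeated extractions"
```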
Files Affected
- `pdf.py` (primary)
- Any code that processes multiple PDFs in sequence
Priority
High - This is a memory leak that affects system stability and performance, especially in production environments processing multiple resumes.
Related Issues
This is similar to general resource management issues that should be addressed across the codebase.