Skip to content

[Bug] : Memory leak in PDF processing due to unclosed PyMuPDF document objects #102

@deepakstwt

Description

@deepakstwt

Bug Description

The extract_text_from_pdf function in pdf.py (lines 91-100) creates PyMuPDF document objects but never closes them, causing memory leaks when processing multiple PDFs.

Current Implementation

def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)  # ← Document opened
        pages = range(doc.page_count)
        resume_text = to_markdown(
            doc,
            pages=pages,
        )
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        # ← Missing: doc.close()
    except (FileNotFoundError, ValueError) as e:
        # Handle validation errors with specific messages
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        # Handle PyMuPDF and other processing errors
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None

Problem

  1. Memory Leak: PyMuPDF document objects are not properly closed after processing
  2. Resource Exhaustion: When processing multiple PDFs, memory usage keeps increasing
  3. System Instability: Can lead to out-of-memory errors in production environments
  4. Poor Resource Management: Violates best practices for file handle management

Impact

  • Memory Usage: Continuous memory growth when processing multiple PDFs
  • Performance: System becomes slower over time
  • Reliability: Can cause crashes in memory-constrained environments
  • Scalability: Prevents processing large batches of resumes efficiently

Reproduction Steps

  1. Process multiple PDF files in sequence using the same PDFHandler instance
  2. Monitor memory usage (e.g., using psutil or system monitor)
  3. Observe continuous memory growth that doesn't get released
  4. Memory usage will keep increasing until system runs out of memory

Expected Behavior

The function should:

  • Open the PDF document
  • Process the content
  • Close the document to free memory
  • Handle exceptions while ensuring cleanup

Proposed Solution

def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    doc = None
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)
        pages = range(doc.page_count)
        resume_text = to_markdown(doc, pages=pages)
        
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
    finally:
        # Ensure document is always closed
        if doc is not None:
            doc.close()

Alternative Solution (Context Manager)

def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        self._validate_pdf_file(pdf_path)
        
        with pymupdf.open(pdf_path) as doc:  # Context manager ensures cleanup
            pages = range(doc.page_count)
            resume_text = to_markdown(doc, pages=pages)
            
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None

Additional Considerations

  1. Testing: Add memory usage tests to prevent regression
  2. Monitoring: Consider adding memory usage logging for production
  3. Documentation: Update code comments to emphasize resource cleanup
  4. Code Review: Ensure all file operations follow proper cleanup patterns

Files Affected

  • pdf.py (primary)
  • Any code that processes multiple PDFs in sequence

Priority

High - This is a memory leak that affects system stability and performance, especially in production environments processing multiple resumes.

Related Issues

This is similar to general resource management issues that should be addressed across the codebase.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions