[Bug] : Memory leak in PDF processing due to unclosed PyMuPDF document objects

##  Bug Description
The `extract_text_from_pdf` function in `pdf.py` (lines 91-100) creates PyMuPDF document objects but never closes them, causing memory leaks when processing multiple PDFs.

## Current Implementation
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)  # ← Document opened
        pages = range(doc.page_count)
        resume_text = to_markdown(
            doc,
            pages=pages,
        )
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        # ← Missing: doc.close()
    except (FileNotFoundError, ValueError) as e:
        # Handle validation errors with specific messages
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        # Handle PyMuPDF and other processing errors
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
```

## Problem
1. **Memory Leak**: PyMuPDF document objects are not properly closed after processing
2. **Resource Exhaustion**: When processing multiple PDFs, memory usage keeps increasing
3. **System Instability**: Can lead to out-of-memory errors in production environments
4. **Poor Resource Management**: Violates best practices for file handle management

## Impact
- **Memory Usage**: Continuous memory growth when processing multiple PDFs
- **Performance**: System becomes slower over time
- **Reliability**: Can cause crashes in memory-constrained environments
- **Scalability**: Prevents processing large batches of resumes efficiently

## Reproduction Steps
1. Process multiple PDF files in sequence using the same PDFHandler instance
2. Monitor memory usage (e.g., using `psutil` or system monitor)
3. Observe continuous memory growth that doesn't get released
4. Memory usage will keep increasing until system runs out of memory

## Expected Behavior
The function should:
- Open the PDF document
- Process the content
- **Close the document** to free memory
- Handle exceptions while ensuring cleanup

## Proposed Solution
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    doc = None
    try:
        # Validate PDF file before processing
        self._validate_pdf_file(pdf_path)

        doc = pymupdf.open(pdf_path)
        pages = range(doc.page_count)
        resume_text = to_markdown(doc, pages=pages)
        
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
    finally:
        # Ensure document is always closed
        if doc is not None:
            doc.close()
```

## Alternative Solution (Context Manager)
```python
def extract_text_from_pdf(self, pdf_path: str) -> Optional[str]:
    try:
        self._validate_pdf_file(pdf_path)
        
        with pymupdf.open(pdf_path) as doc:  # Context manager ensures cleanup
            pages = range(doc.page_count)
            resume_text = to_markdown(doc, pages=pages)
            
        logger.debug(
            f"Extracted text from PDF: {len(resume_text) if resume_text else 0} characters"
        )
        return resume_text
        
    except (FileNotFoundError, ValueError) as e:
        logger.error(f"PDF validation error: {e}")
        return None
    except Exception as e:
        logger.error(f"An error occurred while reading the PDF: {e}")
        return None
```

## Additional Considerations
1. **Testing**: Add memory usage tests to prevent regression
2. **Monitoring**: Consider adding memory usage logging for production
3. **Documentation**: Update code comments to emphasize resource cleanup
4. **Code Review**: Ensure all file operations follow proper cleanup patterns

## Files Affected
- `pdf.py` (primary)
- Any code that processes multiple PDFs in sequence

## Priority
**High** - This is a memory leak that affects system stability and performance, especially in production environments processing multiple resumes.

## Related Issues
This is similar to general resource management issues that should be addressed across the codebase.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] : Memory leak in PDF processing due to unclosed PyMuPDF document objects #102

Bug Description

Current Implementation

Problem

Impact

Reproduction Steps

Expected Behavior

Proposed Solution

Alternative Solution (Context Manager)

Additional Considerations

Files Affected

Priority

Related Issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug] : Memory leak in PDF processing due to unclosed PyMuPDF document objects #102

Description

Bug Description

Current Implementation

Problem

Impact

Reproduction Steps

Expected Behavior

Proposed Solution

Alternative Solution (Context Manager)

Additional Considerations

Files Affected

Priority

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions