Improve PDF Parser Robustness for Diverse Resume Layouts and Styles #137

### **Summary of the Issue**

The current PDF parsing pipeline is fragile and struggles with common resume formats, leading to incomplete data extraction and inaccurate candidate evaluations. The system fails on resumes with multi-column layouts or alternative styling for section headers, which can cause qualified candidates to be incorrectly filtered out.

### **The Problem in Detail**

After a deep dive into the parsing logic, I've identified three core issues:

1.  **Sensitivity to Text Extraction Order:** The parser expects a strict, linear, top-to-bottom text flow. It fails on PDFs with common two-column headers because the PDF's internal text blocks are not stored in visual order, so they are read out jumbled (e.g., left column -> middle of the page -> right column). This breaks the section-finding logic.

2.  **Brittle Header Identification:** The current header detection logic relies solely on **font size**, assuming headers are always larger than body text. This is a fragile assumption. Many modern resumes use other visual cues for headers, such as **bolding**, **all-caps**, or **background colors**, sometimes with a *smaller* font. The current parser misidentifies these as regular text.

3.  **Inefficient Extraction Strategy:** The agent sends the entire document to an LLM repeatedly, once for each section it extracts. This is a costly, high-latency, and non-deterministic approach for a simple parsing task. It is also fragile: whenever the LLM fails to return perfect JSON, extraction fails for that section.
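To make the JSON fragility in point 3 concrete, here is a minimal sketch. The reply string and the `parse_section_strict` helper are hypothetical and are not taken from the project's code; they only illustrate how a strict `json.loads` call rejects a reply that is valid JSON wrapped in a Markdown fence.

```python
import json

# Hypothetical LLM reply: the JSON itself is valid, but it is wrapped
# in a Markdown code fence, which models sometimes add.
llm_reply = '```json\n{"section": "Experience", "items": ["Acme Corp"]}\n```'


def parse_section_strict(reply: str):
    """Return the parsed object, or None when the reply is not bare JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Any surrounding text makes the whole section extraction fail.
        return None


fenced_result = parse_section_strict(llm_reply)
bare_result = parse_section_strict('{"section": "Experience"}')
```

Here `fenced_result` is `None` even though the payload is well formed, while `bare_result` parses normally.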

### **Impact**

This fragility means the system is likely rejecting a significant number of qualified candidates simply due to their resume's formatting. This creates a negative candidate experience and can cause the company to miss out on top talent whose resumes don't conform to the parser's rigid, unstated expectations.

### **Proposed Solution**

The PDF-to-Markdown conversion script (`pymupdf_rag.py`) can be significantly improved to handle these edge cases. I propose the following enhancements:

1.  **Create an `AdvancedHeaderIdentifier`:** Implement a new class that uses a multi-heuristic scoring system to identify headers based on a combination of **boldness**, **all-caps text**, and **relative font size**. This is far more resilient than relying on font size alone.

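2.  **Enforce a Strict Reading Order:** After the initial text blocks are identified, explicitly sort them by their `(top, left)` coordinates. This gives the parser a deterministic, top-to-bottom, left-to-right text flow that fixes the jumbled two-column headers described above. Full-height side-by-side columns would additionally need to be grouped by column before sorting, since a plain `(top, left)` sort interleaves their lines.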
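The multi-heuristic scoring from the first enhancement could look roughly like the sketch below. It is not the proof-of-concept mentioned in this issue: the `Span` type, the weights, the 1.15 size ratio, and the threshold are all assumptions chosen for illustration. In practice the `bold` field would have to be derived from the font information the PDF library reports for each span.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    text: str
    size: float   # font size in points
    bold: bool    # assumed to be derived from the span's font flags


class AdvancedHeaderIdentifier:
    """Scores spans on several visual cues instead of font size alone."""

    def __init__(self, spans, threshold=2.0):
        # Treat the median font size as the body-text size.
        self.body_size = median(s.size for s in spans)
        self.threshold = threshold

    def score(self, span):
        text = span.text.strip()
        letters = [c for c in text if c.isalpha()]
        score = 0.0
        if span.bold:
            score += 1.0
        if letters and all(c.isupper() for c in letters):
            score += 1.0  # all-caps text
        if span.size >= self.body_size * 1.15:
            score += 2.0  # clearly larger than body text
        if len(text.split()) > 8:
            score -= 2.0  # long lines are rarely headers
        return score

    def is_header(self, span):
        return bool(span.text.strip()) and self.score(span) >= self.threshold


spans = [
    Span("EXPERIENCE", 9.0, True),          # smaller font, bold + caps
    Span("Built data pipelines in Python.", 10.0, False),
    Span("Acme Corp", 10.0, True),          # bold inline text, not a header
    Span("Education", 14.0, False),         # larger font only
    Span("B.Sc. in Computer Science", 10.0, False),
]
identifier = AdvancedHeaderIdentifier(spans)
headers = [s.text for s in spans if identifier.is_header(s)]
```

With these illustrative weights, the bold all-caps header is detected despite its smaller font, while a single bold phrase in body text is not. Background-color cues are not covered by this sketch.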

These changes would make the initial parsing step robust and deterministic, providing a clean, well-structured input for the downstream evaluation logic.
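As a rough sketch of the reading-order step, the function below sorts blocks by `(top, left)` with a small vertical tolerance, so that blocks on the same visual row are read left to right even when their top coordinates differ by a point or two. The simplified `(x0, y0, x1, y1, text)` tuples, the function name, and the 3-point tolerance are assumptions for illustration, not the project's actual data structures.

```python
def sort_reading_order(blocks, y_tolerance=3.0):
    """Order (x0, y0, x1, y1, text) blocks top-to-bottom, then left-to-right.

    Blocks whose top edges lie within y_tolerance of the first block in a
    row are treated as the same row.
    """
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    rows = []
    for block in ordered:
        if rows and abs(block[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(block)   # same visual row
        else:
            rows.append([block])     # start a new row
    # Within each row, read left to right.
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


# Blocks in a jumbled extraction order, with a two-column header row.
blocks = [
    (50.0, 40.0, 250.0, 60.0, "Jane Doe"),
    (50.0, 120.0, 550.0, 300.0, "EXPERIENCE"),
    (350.0, 39.0, 550.0, 60.0, "jane@example.com"),
    (50.0, 80.0, 550.0, 110.0, "SUMMARY"),
]
reading_order = [b[4] for b in sort_reading_order(blocks)]
```

Without the tolerance, the right-hand header block (top at 39.0) would sort before the name (top at 40.0). This sketch only handles row-aligned blocks; it does not group full-height columns.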

### **Next Steps**

I have already developed and tested a proof-of-concept for these improvements. I would like to submit a Pull Request to address this issue.
