Description
Summary of the Issue
The current PDF parsing pipeline is fragile and struggles with common resume formats, leading to incomplete data extraction and inaccurate candidate evaluations. The system fails on resumes with multi-column layouts or alternative styling for section headers, which can cause qualified candidates to be incorrectly filtered out.
The Problem in Detail
After a deep dive into the parsing logic, I've identified three core issues:
- Sensitivity to Text Extraction Order: The parser expects a strict, linear, top-to-bottom text flow. It fails on PDFs with common two-column headers because the internal text blocks are read in a jumbled order (e.g., left column -> middle of the page -> right column). This breaks the section-finding logic.
- Brittle Header Identification: The current header detection logic relies solely on font size, assuming headers are always larger than body text. This is a fragile assumption. Many modern resumes use other visual cues for headers, such as bolding, all-caps, or background colors, sometimes with a smaller font. The current parser misidentifies these as regular text.
- Inefficient Extraction Strategy: The agent sends the entire document to an LLM multiple times to extract each section. This is a costly, high-latency, and non-deterministic approach for a simple parsing task. The fragility of this method is evident when the LLM fails to return perfect JSON, causing the extraction to fail for that section.
Impact
This fragility means the system is likely rejecting a significant number of qualified candidates simply due to their resume's formatting. This creates a negative candidate experience and can cause the company to miss out on top talent whose resumes don't conform to the parser's rigid, unstated expectations.
Proposed Solution
The PDF-to-Markdown conversion script (pymupdf_rag.py) can be significantly improved to handle these edge cases. I propose the following enhancements:
- Create an AdvancedHeaderIdentifier: Implement a new class that uses a multi-heuristic scoring system to identify headers based on a combination of boldness, all-caps text, and relative font size. This is far more resilient than relying on font size alone.
- Enforce a Strict Reading Order: After the initial text blocks are identified, explicitly sort them by their (top, left) coordinates. This gives the parser a deterministic, linear text flow and fixes layouts where side-by-side blocks, such as a two-column header, are extracted out of order.
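To make the reading-order idea concrete, here is a minimal sketch and not the actual proof-of-concept. It assumes blocks shaped like the tuples PyMuPDF returns from `page.get_text("blocks")`, i.e. `(x0, y0, x1, y1, text, block_no, block_type)`. The function name `sort_reading_order` and the `row_tolerance` parameter are hypothetical, introduced only for illustration. Blocks whose tops fall into the same vertical band are treated as one row, so a left header block is read before the right one even if it starts a point or two lower.

```python
from typing import List, Tuple

# Assumed block shape, mirroring PyMuPDF's page.get_text("blocks") output:
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def sort_reading_order(blocks: List[Block], row_tolerance: float = 5.0) -> List[Block]:
    """Sort blocks top-to-bottom, then left-to-right within a row.

    row_tolerance (hypothetical parameter) is the height in points of the
    vertical band used to decide that two blocks sit on the same row.
    """
    def key(block: Block) -> Tuple[int, float]:
        x0, y0 = block[0], block[1]
        # Quantize the top coordinate so small vertical offsets between
        # side-by-side blocks do not flip their order.
        return (round(y0 / row_tolerance), x0)

    return sorted(blocks, key=key)


# A two-column header extracted in a jumbled order.
jumbled = [
    (300.0, 50.0, 550.0, 70.0, "jane@example.com", 0, 0),   # right header column
    (50.0, 120.0, 550.0, 140.0, "EXPERIENCE", 1, 0),        # section header below
    (50.0, 52.0, 250.0, 72.0, "Jane Doe", 2, 0),            # left header column
]

for block in sort_reading_order(jumbled):
    print(block[4])
```

A plain `(top, left)` sort like this interleaves the columns of a full-page multi-column body, so that case would likely need blocks grouped into columns before sorting.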
These changes would make the initial parsing step robust and deterministic, providing a clean, well-structured input for the downstream evaluation logic.
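The multi-heuristic header scoring could look roughly like the sketch below. It is an illustration under stated assumptions, not the submitted implementation: the `Span` dataclass, the weights, and the threshold are all hypothetical, and the only PyMuPDF detail relied on is that bit 4 (value 16) of a span's `flags` field marks bold text. With these example weights, a larger font alone still qualifies as a header, bold plus all-caps qualifies even at a smaller font size, and bold alone does not, which avoids flagging bolded words inside body text.

```python
from dataclasses import dataclass

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" value indicates bold


@dataclass
class Span:
    """Hypothetical stand-in for one text span (text, font size, style flags)."""
    text: str
    size: float
    flags: int = 0


class AdvancedHeaderIdentifier:
    """Scores a span on several visual cues instead of font size alone."""

    def __init__(self, body_font_size: float, threshold: float = 2.0):
        self.body_font_size = body_font_size
        self.threshold = threshold  # illustrative value, would need tuning

    def score(self, span: Span) -> float:
        text = span.text.strip()
        if not text:
            return 0.0
        score = 0.0
        # Cue 1: bold text.
        if span.flags & BOLD_FLAG:
            score += 1.0
        # Cue 2: all alphabetic characters are upper-case.
        letters = [c for c in text if c.isalpha()]
        if letters and all(c.isupper() for c in letters):
            score += 1.0
        # Cue 3: font noticeably larger than the body text.
        if span.size >= self.body_font_size * 1.15:
            score += 2.0
        return score

    def is_header(self, span: Span) -> bool:
        return self.score(span) >= self.threshold


identifier = AdvancedHeaderIdentifier(body_font_size=11.0)
for span in [
    Span("EXPERIENCE", 10.0, BOLD_FLAG),          # bold + caps, smaller font
    Span("Education", 14.0),                      # larger font only
    Span("Led a team of five engineers.", 11.0),  # plain body text
]:
    print(span.text, identifier.is_header(span))
```

In practice the body font size would be estimated per document, for example as the most common span size, and the weights tuned against a set of sample resumes.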
Next Steps
I have already developed and tested a proof-of-concept for these improvements. I would like to submit a Pull Request to address this issue.