Resume data showing as null due to stale cache files #135

@Mohd-Mursaleen

Bug Description

In development mode, the hiring agent caches resume extraction results to avoid re-processing the same PDF. However, if the initial extraction fails or produces invalid data, the resulting corrupted cache file persists and causes every subsequent run to return null resume data, even after the underlying issue is fixed.

Expected Behavior

  • If cache files contain valid resume data, use them to speed up processing
  • If cache files are corrupted/invalid, automatically detect this and reprocess the PDF
  • Users should get valid resume data regardless of cache state

Actual Behavior

  • System loads corrupted cache files with all null values
  • Resume data remains null even after fixing environment/configuration issues
  • No automatic detection or recovery from stale cache

Reproduction Steps

  1. Setup: Ensure DEVELOPMENT_MODE = True in config.py

  2. Create corrupted cache: Run the system with invalid configuration (e.g., wrong LLM settings) to generate a failed cache file

  3. Fix configuration: Update environment variables, install missing dependencies, etc.

  4. Run system: Execute python score.py data/resume.pdf

  5. Observe: Resume data shows as null despite valid configuration

Root Cause Analysis

The issue occurs in score.py around lines 210-213:

if DEVELOPMENT_MODE and os.path.exists(cache_filename):
    print(f"Loading cached data from {cache_filename}")
    cached_data = json.loads(Path(cache_filename).read_text())
    resume_data = JSONResume(**cached_data)

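The code checks only that the cache file exists; it never validates that the cached data is usable. A cache file in which every field is null is therefore loaded as the resume data, and the PDF is never reprocessed.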

Impact

  • Severity: High - Completely blocks resume processing in development mode
  • User Experience: Confusing - users may think their PDF is corrupted or system is broken
  • Debugging Difficulty: Hard to diagnose without inspecting cache files manually

Example of Corrupted Cache File

{
  "basics": null,
  "work": null,
  "volunteer": null,
  "education": null,
  "awards": null,
  "certificates": null,
  "publications": null,
  "skills": null,
  "languages": null,
  "interests": null,
  "references": null,
  "projects": null
}
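A cache like the one above could be detected with a small check before it is used. This is a minimal sketch, not code from the repository: the function name `is_empty_resume_cache` is hypothetical, and it assumes that a payload whose top-level values are all null or empty can be treated as a failed extraction.

```python
import json


def is_empty_resume_cache(cached):
    """Return True when the cached payload carries no usable resume data.

    Hypothetical helper: assumes an all-null or all-empty payload
    means the original extraction failed.
    """
    # Anything other than a non-empty JSON object cannot be a resume.
    if not isinstance(cached, dict) or not cached:
        return True
    # Treat null, "", [] and {} as "no data"; one populated section is enough.
    return all(value in (None, "", [], {}) for value in cached.values())


corrupted = json.loads('{"basics": null, "work": null, "skills": null}')
populated = json.loads('{"basics": {"name": "Jane Doe"}, "work": null}')

print(is_empty_resume_cache(corrupted))  # True
print(is_empty_resume_cache(populated))  # False
```

A stricter check (for example, requiring `basics` to be present) may suit the project better; that choice depends on which sections the scoring step actually needs.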

Proposed Solution

  1. Cache Validation: Add validation logic to detect corrupted/empty cache files
  2. Automatic Recovery: If cache is invalid, automatically delete it and reprocess
  3. CLI Options: Add command-line options to manage cache (clear, validate, force refresh)
  4. Better Logging: Improve cache-related log messages for better debugging
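One possible shape for points 1, 2 and 4 is sketched below. It is an illustration under assumptions, not the project's implementation: `load_valid_cache`, the `force_refresh` parameter and the logger name are all hypothetical, and "invalid" is assumed to mean unreadable JSON or a payload with no populated top-level section.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("resume_cache")  # hypothetical logger name


def load_valid_cache(cache_filename, force_refresh=False):
    """Return cached data as a dict, or None if the PDF should be reprocessed.

    Invalid cache files are deleted so that they cannot poison later runs.
    """
    path = Path(cache_filename)
    if not path.exists():
        logger.info("No cache at %s; extracting from PDF", path)
        return None
    if force_refresh:
        logger.info("Force refresh requested; deleting %s", path)
        path.unlink(missing_ok=True)
        return None
    try:
        cached = json.loads(path.read_text())
    except (OSError, ValueError) as exc:  # ValueError covers JSON and decoding errors
        logger.warning("Unreadable cache %s (%s); deleting", path, exc)
        path.unlink(missing_ok=True)
        return None
    # Assumption: no populated top-level section means the extraction failed.
    if not isinstance(cached, dict) or not any(cached.values()):
        logger.warning("Cache %s has no resume data; deleting", path)
        path.unlink(missing_ok=True)
        return None
    logger.info("Loaded cached data from %s", path)
    return cached
```

In `score.py`, the existing branch could then build `JSONResume(**cached)` only when this function returns a dict and fall through to the normal extraction path when it returns `None`. A command-line flag for point 3 could be wired to `force_refresh`; the flag name is left open here.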

Environment

  • OS: macOS
  • Python: 3.x
  • Development Mode: Enabled
  • LLM Provider: Google Gemini

Workaround

Manually delete cache files: rm cache/resumecache_*.json cache/githubcache_*.json
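The same cleanup can be scripted in Python, which may be convenient as the basis for a "clear cache" option. This is a rough sketch: `clear_cache` is a hypothetical helper, and the directory and file patterns are taken from the `rm` command above.

```python
from pathlib import Path

# File patterns copied from the manual workaround.
CACHE_PATTERNS = ("resumecache_*.json", "githubcache_*.json")


def clear_cache(cache_dir="cache"):
    """Delete cached resume and GitHub JSON files; return the removed paths."""
    removed = []
    for pattern in CACHE_PATTERNS:
        # glob() yields nothing if the directory does not exist.
        for path in Path(cache_dir).glob(pattern):
            path.unlink()
            removed.append(path)
    return removed
```

Calling `clear_cache()` from the project root removes only files matching those two patterns and leaves anything else in the cache directory untouched.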
