oroldan1/or_veryfi


This document describes the approach taken in the InvoiceParser implementation for parsing Switch invoices using Veryfi OCR text. It covers the overall architecture, format detection logic, parsing strategies, error handling, and testing considerations.

1. Code paradigm and architecture

  • Paradigm: The solution follows an object‑oriented design centered around the InvoiceParser class, which encapsulates all behavior needed to process a single invoice: calling Veryfi, validating the format, and extracting structured fields from ocr_text.
  • Single responsibility per method:
    • process_document() orchestrates the full pipeline: calls the Veryfi API, runs the format detector, then triggers parsing steps.
    • parse_vendor, parse_bill_to, parse_general_fields, parse_line_items, and parse_total each handle one logical section of the invoice.
    • matches_switch_format() is a separate pure function responsible only for deciding whether an ocr_text belongs to the “Switch invoice” layout.
  • Configuration via environment variables: Veryfi credentials are read from environment variables and validated in __init__, so secrets are not hard‑coded and failures are explicit.
  • Data modeling:
    • Simple dictionaries (vendor, bill_to, general_fields) hold top‑level fields.
    • Line items are stored as a list of dicts with explicit keys (sku, description, quantity, tax_rate, price, total), which later can be converted to a pandas DataFrame.
    • DetectResult is a @dataclass, making the format detector’s return value explicit (is_supported, reason).
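
The configuration and data-modeling choices above can be sketched roughly as follows. This is an illustration, not the repository's code: the environment variable names, the load_credentials helper, and the empty_line_item helper are assumptions made for the example. Only the DetectResult fields and the line-item keys come from the description above.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; the README does not say which ones are used.
REQUIRED_ENV_VARS = (
    "VERYFI_CLIENT_ID",
    "VERYFI_CLIENT_SECRET",
    "VERYFI_USERNAME",
    "VERYFI_API_KEY",
)


@dataclass
class DetectResult:
    """Return value of the format detector."""
    is_supported: bool
    reason: str


def load_credentials(env=None):
    """Read credentials from the environment and fail loudly if any is missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if missing:
        raise EnvironmentError("Missing Veryfi credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_ENV_VARS}


def empty_line_item():
    """One line item as a plain dict with the explicit keys listed above."""
    keys = ("sku", "description", "quantity", "tax_rate", "price", "total")
    return {key: None for key in keys}
```

A list of such dicts can be passed directly to pandas.DataFrame(...) when a tabular view is needed.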

2. Format detection (“exclusion” logic)

  • The matches_switch_format(ocr_text) function implements the requirement “support any document with the same format while excluding other documents”.
  • It computes several boolean indicators:
    • Vendor markers: presence of "switch" and the specific "PO Box 674592" string.
    • Header labels: "Invoice Date", "Due Date", "Invoice No".
    • Line‑item table headers: "Description", "Quantity", "Rate", "Amount".
    • Footer marker: "Please update your system".
    • Vendor line structure: reuses the same header logic as the parser (slice between the first and second "Invoice", remove "Page X of Y", and search for Invoice\n<name>\t<city, ST ZIP>).
  • These indicators are aggregated into a score. The document is accepted only if:
    • score >= 4,
    • the vendor line is present, and
    • header labels are present.
  • process_document() calls matches_switch_format immediately after obtaining ocr_text. If is_supported is False, it raises ValueError("Document format not supported: ...") and does not run any parsing. This cleanly separates supported invoices from all other documents (including the candidate’s own test document).
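
A minimal sketch of the scoring logic described in this section is shown below. It assumes each of the five indicators contributes one point to the score and that the vendor markers count as a single indicator; the README does not state the exact weighting, so treat both as assumptions. The helper _header_slice and the wording of the reason strings are likewise hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class DetectResult:
    is_supported: bool
    reason: str


def _header_slice(ocr_text):
    """Text from the first "Invoice" up to the second, without "Page X of Y"."""
    first = ocr_text.find("Invoice")
    if first == -1:
        return ""
    second = ocr_text.find("Invoice", first + len("Invoice"))
    header = ocr_text[first:second] if second != -1 else ocr_text[first:]
    return re.sub(r"Page\s+\d+\s+of\s+\d+", "", header)


def matches_switch_format(ocr_text):
    if not ocr_text:
        return DetectResult(False, "empty ocr_text")
    indicators = {
        "vendor_markers": "switch" in ocr_text.lower() and "PO Box 674592" in ocr_text,
        "header_labels": all(
            label in ocr_text for label in ("Invoice Date", "Due Date", "Invoice No")
        ),
        "table_headers": all(
            label in ocr_text for label in ("Description", "Quantity", "Rate", "Amount")
        ),
        "footer": "Please update your system" in ocr_text,
        "vendor_line": re.search(
            r"Invoice\s*\n([^\t\n]+)\t([^\n]+)", _header_slice(ocr_text)
        ) is not None,
    }
    # Assumption: one point per indicator.
    score = sum(indicators.values())
    accepted = score >= 4 and indicators["vendor_line"] and indicators["header_labels"]
    if accepted:
        return DetectResult(True, f"score={score}")
    missing = ", ".join(name for name, hit in indicators.items() if not hit)
    return DetectResult(False, f"score={score}, missing: {missing}")
```

With a detector of this shape, process_document() only needs to check is_supported and raise ValueError with the reason before any parse_* method runs.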

3. Parsing strategy and regex usage

  • Parsing operates only on OCR text returned by Veryfi (self.ocr_text), never on raw PDFs.
  • Each section of the invoice is parsed using targeted regex aware of layout:
    • Vendor:
      • Slice a header substring between first and second "Invoice" and remove "Page X of Y".
      • Extract vendor_name and vendor_city_state with Invoice\s*\n([^\t\n]+)\t([^\n]+).
      • Extract PO Box with (PO Box\s*\d+)[^\n]* and combine with the city/state line into a single address.
    • Bill‑to block:
      • Slice between "Invoice No." and "Account No.".
      • Extract three consecutive non‑empty lines (name + two address lines) with \n([^\n]+)\n([^\n]+)\n([^\n]+)\n.
    • General fields:
      • Invoice and due dates and invoice number from the header table using a strict pattern for dates and numeric ID.
      • Account number and PO number from "Account No." block using Account No\.[^\n]*\n[^\n]*\n([A-Z0-9\-]+)\s+([A-Z0-9\-]+).
    • Line items:
      • Slice items_block between "Description" and "Please update your system", then drop the header row.
      • Split into physical lines and classify each line as:
        • an item line (contains three decimal numbers) or
        • a continuation line (no triple‑number pattern).
      • Build “logical items” by:
        • For each new item line, closing the previous item and starting a new description + numeric part.
        • Appending continuation lines to the current description.
      • Use a second regex on the numeric part to extract quantity, rate, and amount as floats.
  • This two‑pass approach (line classification + numeric parsing) makes the solution robust to variations like:
    • different numbers of tabs between columns,
    • long descriptions that wrap to the next line, and
    • slight OCR differences between invoices.
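
The two-pass line-item strategy above can be sketched as follows. The regular expressions are assumptions chosen for illustration (the README does not give the exact patterns used for line items), as is the mapping of rate to price and amount to total; sku and tax_rate are left out to keep the example short.

```python
import re

# Assumed numeric format: optional sign, digits with optional thousands
# separators, and a mandatory decimal part (e.g. 1,250.00).
NUM = r"-?\d[\d,]*\.\d+"
# Pass 1: an item line ends with three decimal numbers.
ITEM_LINE = re.compile(rf"^(?P<desc>.*?)\s*(?P<nums>{NUM}\s+{NUM}\s+{NUM})\s*$")
# Pass 2: split the numeric part into quantity, rate, amount.
NUMS = re.compile(rf"({NUM})\s+({NUM})\s+({NUM})")


def slice_items_block(ocr_text):
    """Text between "Description" and the footer marker, minus the header row."""
    start = ocr_text.find("Description")
    end = ocr_text.find("Please update your system")
    if start == -1 or end == -1 or end <= start:
        return ""
    _, _, rest = ocr_text[start:end].partition("\n")  # drop the header row
    return rest


def parse_line_items(ocr_text):
    logical_items = []
    current = None
    for raw in slice_items_block(ocr_text).splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ITEM_LINE.match(line)
        if match:
            # New item line: close the previous item and start a new one.
            if current is not None:
                logical_items.append(current)
            current = {"description": match.group("desc"), "nums": match.group("nums")}
        elif current is not None:
            # Continuation line: append to the current description.
            current["description"] += " " + line
    if current is not None:
        logical_items.append(current)

    parsed = []
    for item in logical_items:
        match = NUMS.search(item["nums"])
        if match is None:
            continue
        quantity, rate, amount = (float(g.replace(",", "")) for g in match.groups())
        parsed.append({
            "description": " ".join(item["description"].split()),
            "quantity": quantity,
            "price": rate,
            "total": amount,
        })
    return parsed or None
```

Because columns are matched with \s+ rather than a fixed number of tabs, the same patterns tolerate uneven column spacing, and wrapped descriptions are absorbed by the continuation branch. Continuation lines that appear before the first item line are ignored in this sketch.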

4. Error handling and robustness

  • Missing OCR text: every parse_* method first checks if not self.ocr_text; when the text is missing, it sets its output to None or an empty structure, which prevents attribute errors.
  • Missing matches: regex searches are always guarded:
    • if match is None, the corresponding fields are set to None instead of trying to access .group.
    • If no valid line items are found, self.line_items is set to None.
  • API credentials: __init__ validates that all Veryfi credentials are present in environment variables and raises immediately if anything is missing.
  • Reusability: reset() clears all parsed state, and set_invoice() builds a fresh invoice dict, so the same parser instance can be reused across multiple documents.
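
The guard pattern described in this section might look like the sketch below. ParserSketch is a hypothetical stand-in for InvoiceParser, the dictionary keys name and address are assumptions, and for brevity the vendor regex is applied to the whole text rather than to the header slice.

```python
import re


class ParserSketch:
    """Hypothetical, reduced stand-in for InvoiceParser (vendor section only)."""

    def __init__(self, ocr_text=None):
        self.ocr_text = ocr_text
        self.reset()

    def reset(self):
        # Clear all parsed state so the instance can be reused.
        self.vendor = {"name": None, "address": None}
        self.line_items = None

    def parse_vendor(self):
        # Start from an empty result so every exit path leaves valid state.
        self.vendor = {"name": None, "address": None}
        if not self.ocr_text:
            return self.vendor

        match = re.search(r"Invoice\s*\n([^\t\n]+)\t([^\n]+)", self.ocr_text)
        if match is None:
            # No match: keep the fields as None instead of calling .group().
            return self.vendor

        name = match.group(1).strip()
        city_state = match.group(2).strip()
        po_box = re.search(r"(PO Box\s*\d+)[^\n]*", self.ocr_text)
        address = f"{po_box.group(1)}, {city_state}" if po_box else city_state
        self.vendor = {"name": name, "address": address}
        return self.vendor
```

Calling reset() between documents, after assigning a new ocr_text, returns the instance to the same state as a freshly constructed one.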

5. Unit‑testing strategy

Tests were written in the tests/ folder prior to the final class implementation. More tests need to be added in the future.
