README

This document describes the approach taken in the InvoiceParser implementation for parsing Switch invoices using Veryfi OCR text. It covers the overall architecture, format detection logic, parsing strategies, error handling, and testing considerations.

1. Code paradigm and architecture

Paradigm: The solution follows an object‑oriented design centered around the InvoiceParser class, which encapsulates all behavior needed to process a single invoice: calling Veryfi, validating the format, and extracting structured fields from ocr_text.
Single responsibility per method:
- process_document() orchestrates the full pipeline: calls the Veryfi API, runs the format detector, then triggers parsing steps.
- parse_vendor, parse_bill_to, parse_general_fields, parse_line_items, and parse_total each handle one logical section of the invoice.
- matches_switch_format() is a separate pure function responsible only for deciding whether an ocr_text belongs to the “Switch invoice” layout.
Configuration via environment variables: Veryfi credentials are read from environment variables and validated in __init__, so secrets are not hard‑coded and failures are explicit.
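
As a rough illustration of this kind of up-front validation, the sketch below checks a set of environment variables and fails with an explicit message. The variable names, the helper name load_veryfi_credentials, and the choice of RuntimeError are assumptions made for this example; the README does not state which names or exception type the real __init__ uses.

```python
import os

# Hypothetical variable names, chosen only for this sketch.
REQUIRED_ENV_VARS = (
    "VERYFI_CLIENT_ID",
    "VERYFI_CLIENT_SECRET",
    "VERYFI_USERNAME",
    "VERYFI_API_KEY",
)


def load_veryfi_credentials(env=None):
    """Return the credentials as a dict, or raise if any is missing or empty."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if missing:
        # The exception type is an assumption; the point is to fail early
        # and name exactly which variables are absent.
        raise RuntimeError("Missing Veryfi credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_ENV_VARS}
```

Listing every missing variable in one message avoids a fix-one-rerun-find-the-next loop when setting up a new environment.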
Data modeling:
- Simple dictionaries (vendor, bill_to, general_fields) hold top‑level fields.
- Line items are stored as a list of dicts with explicit keys (sku, description, quantity, tax_rate, price, total), which can later be converted to a pandas DataFrame.
- DetectResult is a @dataclass, making the format detector’s return value explicit (is_supported, reason).
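
The data shapes listed above might look roughly like the following. The two DetectResult fields and the six line-item keys come from this README; the helper make_line_item and its default values are hypothetical and exist only to show the dict layout.

```python
from dataclasses import dataclass


@dataclass
class DetectResult:
    """Return value of the format detector."""
    is_supported: bool
    reason: str


def make_line_item(description, quantity, price, total, sku=None, tax_rate=None):
    """Hypothetical helper that builds one line-item dict with the explicit keys."""
    return {
        "sku": sku,
        "description": description,
        "quantity": quantity,
        "tax_rate": tax_rate,
        "price": price,
        "total": total,
    }
```

Because every item carries the same keys, a list of such dicts can be passed directly to pandas.DataFrame(...) to obtain one row per line item.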

2. Format detection (“exclusion” logic)

The matches_switch_format(ocr_text) function implements the requirement “support any document with the same format while excluding other documents”.
It computes several boolean indicators:
- Vendor markers: presence of "switch" and the specific "PO Box 674592" string.
- Header labels: "Invoice Date", "Due Date", "Invoice No".
- Line‑item table headers: "Description", "Quantity", "Rate", "Amount".
- Footer marker: "Please update your system".
- Vendor line structure: reuses the same header logic as the parser (slice between the first and second "Invoice", remove "Page X of Y", and search for Invoice\n<name>\t<city, ST ZIP>).
These indicators are aggregated into a score. The document is accepted only if:
- score >= 4,
- the vendor line is present, and
- header labels are present.
process_document() calls matches_switch_format immediately after obtaining ocr_text. If is_supported is False, it raises ValueError("Document format not supported: ...") and does not run any parsing. This cleanly separates supported invoices from all other documents (including the candidate’s own test document).
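
The scoring rule described above can be sketched as follows. This is a minimal approximation, not the project's actual code: case-insensitive substring matching, the helper name has_vendor_line, and the wording of the reason strings are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class DetectResult:
    is_supported: bool
    reason: str


def has_vendor_line(ocr_text: str) -> bool:
    """Check for Invoice\\n<name>\\t<city, ST ZIP> in the header slice."""
    first = ocr_text.find("Invoice")
    if first == -1:
        return False
    # Slice from the first "Invoice" up to (not including) the second one.
    second = ocr_text.find("Invoice", first + len("Invoice"))
    header = ocr_text[first:second] if second != -1 else ocr_text[first:]
    header = re.sub(r"Page\s+\d+\s+of\s+\d+", "", header)
    return re.search(r"Invoice\s*\n[^\t\n]+\t[^\n]+", header) is not None


def matches_switch_format(ocr_text: str) -> DetectResult:
    if not ocr_text:
        return DetectResult(False, "empty OCR text")
    lowered = ocr_text.lower()  # assumption: markers are matched case-insensitively
    indicators = {
        "vendor_markers": "switch" in lowered and "po box 674592" in lowered,
        "header_labels": all(
            label in lowered for label in ("invoice date", "due date", "invoice no")
        ),
        "table_headers": all(
            label in lowered for label in ("description", "quantity", "rate", "amount")
        ),
        "footer_marker": "please update your system" in lowered,
        "vendor_line": has_vendor_line(ocr_text),
    }
    score = sum(indicators.values())
    # Two indicators are mandatory on top of the minimum score.
    accepted = score >= 4 and indicators["vendor_line"] and indicators["header_labels"]
    if accepted:
        return DetectResult(True, f"score={score}")
    failed = ", ".join(name for name, hit in indicators.items() if not hit)
    return DetectResult(False, f"score={score}; failed: {failed}")
```

Requiring the vendor line and the header labels in addition to the score means a document cannot pass on generic table words such as "Description" or "Amount" alone.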

3. Parsing strategy and regex usage

Parsing operates only on OCR text returned by Veryfi (self.ocr_text), never on raw PDFs.
Each section of the invoice is parsed with targeted, layout‑aware regular expressions:
- Vendor:
  - Slice a header substring between first and second "Invoice" and remove "Page X of Y".
  - Extract vendor_name and vendor_city_state with Invoice\s*\n([^\t\n]+)\t([^\n]+).
  - Extract PO Box with (PO Box\s*\d+)[^\n]* and combine with the city/state line into a single address.
- Bill‑to block:
  - Slice between "Invoice No." and "Account No.".
  - Extract three consecutive non‑empty lines (name + two address lines) with \n([^\n]+)\n([^\n]+)\n([^\n]+)\n.
- General fields:
  - Invoice date, due date, and invoice number are extracted from the header table using a strict pattern for the dates and the numeric ID.
  - Account number and PO number from "Account No." block using Account No\.[^\n]*\n[^\n]*\n([A-Z0-9\-]+)\s+([A-Z0-9\-]+).
- Line items:
  - Slice items_block between "Description" and "Please update your system", then drop the header row.
  - Split into physical lines and classify each line as:
    - an item line (contains three decimal numbers) or
    - a continuation line (no triple‑number pattern).
  - Build “logical items” by:
    - For each new item line, closing the previous item and starting a new description + numeric part.
    - Appending continuation lines to the current description.
  - Use a second regex on the numeric part to extract quantity, rate, and amount as floats.
This two‑pass approach (line classification + numeric parsing) makes the solution robust to variations like:
- different numbers of tabs between columns,
- long descriptions that wrap to the next line, and
- slight OCR differences between invoices.
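
A simplified sketch of the two passes is shown below. The exact regular expressions are assumptions (three decimal numbers at the end of a line, optionally with thousands separators), as is the mapping of rate to the price key and amount to the total key; continuation lines that appear before any item line are ignored here.

```python
import re

NUM = r"-?[\d,]+\.\d+"
# Pass 1: an item line ends with three decimal numbers.
ITEM_LINE = re.compile(rf"^(?P<desc>.*?)\s*(?P<nums>{NUM}\s+{NUM}\s+{NUM})\s*$")
# Pass 2: pull quantity, rate, and amount out of the numeric part.
NUMBERS = re.compile(rf"({NUM})\s+({NUM})\s+({NUM})")


def to_float(text):
    return float(text.replace(",", ""))


def parse_line_items(ocr_text):
    if not ocr_text:
        return None
    start = ocr_text.find("Description")
    if start == -1:
        return None
    end = ocr_text.find("Please update your system", start)
    if end == -1:
        return None
    lines = ocr_text[start:end].split("\n")[1:]  # drop the header row

    # Pass 1: classify physical lines and group them into logical items.
    logical, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ITEM_LINE.match(line)
        if match:
            if current is not None:
                logical.append(current)  # close the previous item
            current = {"description": match.group("desc"), "nums": match.group("nums")}
        elif current is not None:
            current["description"] += " " + line  # wrapped description text
    if current is not None:
        logical.append(current)

    # Pass 2: convert the numeric part of each logical item to floats.
    items = []
    for entry in logical:
        match = NUMBERS.search(entry["nums"])
        if match is None:
            continue
        quantity, rate, amount = (to_float(g) for g in match.groups())
        items.append({
            "sku": None,
            "description": " ".join(entry["description"].split()),
            "quantity": quantity,
            "tax_rate": None,
            "price": rate,    # assumption: rate maps to the price key
            "total": amount,  # assumption: amount maps to the total key
        })
    return items or None
```

Matching the columns with \s+ rather than a fixed number of tabs is what lets the same pattern handle rows where OCR emits one tab, several tabs, or plain spaces between values.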

4. Error handling and robustness

Missing OCR text: every parse_* method first checks whether self.ocr_text is empty and, if so, initializes its output to None or an empty structure, which prevents attribute errors.
Missing matches: regex searches are always guarded:
- If the match is None, the corresponding fields are set to None instead of trying to access .group.
- If no valid line items are found, self.line_items is set to None.
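
The guard pattern can be illustrated with the account and PO number regex from section 3. The function name parse_account_fields and the output key names are hypothetical; only the regular expression itself is taken from this README.

```python
import re

ACCOUNT_PO = re.compile(
    r"Account No\.[^\n]*\n[^\n]*\n([A-Z0-9\-]+)\s+([A-Z0-9\-]+)"
)


def parse_account_fields(ocr_text):
    """Return account and PO numbers, or None values when they cannot be found."""
    # Start from the "nothing found" result so every exit path has the same shape.
    fields = {"account_number": None, "po_number": None}
    if not ocr_text:
        return fields
    match = ACCOUNT_PO.search(ocr_text)
    if match is None:
        # Never call .group on a failed search.
        return fields
    fields["account_number"] = match.group(1)
    fields["po_number"] = match.group(2)
    return fields
```

Callers can then test individual values for None without wrapping each parse call in a try/except block.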
API credentials: __init__ validates that all Veryfi credentials are present in environment variables and raises immediately if anything is missing.
Idempotence: reset() clears all parsed state, and set_invoice() builds a fresh invoice dict; the parser can be reused across multiple documents.
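
One way reset() and set_invoice() could fit together is sketched below. ParserState is a hypothetical stand-in for the stateful part of InvoiceParser, and the attribute list and invoice dict layout are assumptions based on the fields named in this README.

```python
class ParserState:
    """Hypothetical stand-in for the per-document state held by the parser."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear everything derived from the previous document.
        self.ocr_text = None
        self.vendor = None
        self.bill_to = None
        self.general_fields = None
        self.line_items = None
        self.total = None
        self.invoice = None

    def set_invoice(self):
        # Build a new dict on every call so that results already handed
        # to a caller are not overwritten by the next document.
        self.invoice = {
            "vendor": self.vendor,
            "bill_to": self.bill_to,
            "general_fields": self.general_fields,
            "line_items": self.line_items,
            "total": self.total,
        }
        return self.invoice
```

Calling reset() before each document keeps a field that failed to parse from silently inheriting the value found in the previous invoice.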

5. Unit‑testing strategy

Tests were written in the tests/ folder before the final class implementation. More tests need to be added in the future.
