This repository was archived by the owner on Dec 11, 2025. It is now read-only.

Enterprise document processing with OpenAI Responses API - PDF metadata extraction pipeline

walksalot/autoD

Paper Autopilot - Automatic Document Processing

CI | Pre-commit | codecov | Python 3.9+ | Code style: black | MyPy: strict

Automatic PDF processing from your ScanSnap scanner using OpenAI Responses API

Paper Autopilot continuously monitors your scanner's inbox folder and automatically processes PDFs as they arrive. No manual intervention needed - just scan your documents and let the autopilot handle the rest.

What's New

Week 3 - Vector Store Observability (January 2025)

Observability:

  • Real-time cost tracking ($0.10/GB/day after 1GB free tier)
  • Performance metrics (P50/P95/P99 search latency)
  • Upload success rate monitoring (target >95%)
  • Structured JSON logging with embedded metrics
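
The figures in the list above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the project's own metrics code: the function names are hypothetical, and only the pricing ($0.10/GB/day after a 1 GB free tier) and the P50/P95/P99 percentiles come from this README.

```python
import statistics
from typing import Dict, List

FREE_TIER_GB = 1.0       # from this README: the first 1 GB is free
RATE_PER_GB_DAY = 0.10   # from this README: $0.10 per GB per day


def daily_storage_cost(stored_gb: float) -> float:
    """Estimated daily cost in USD for a vector store of the given size."""
    billable_gb = max(0.0, stored_gb - FREE_TIER_GB)
    return billable_gb * RATE_PER_GB_DAY


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return P50/P95/P99 for a list of search latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def upload_success_rate(succeeded: int, attempted: int) -> float:
    """Fraction of uploads that succeeded (1.0 when nothing was attempted)."""
    return succeeded / attempted if attempted else 1.0
```

For example, a 3.5 GB store has 2.5 billable GB, which comes to about $0.25 per day under these rates.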

Documentation:

  • ADR-030: Vector Store Integration (458 lines)
  • Complete usage guide (683 lines) with troubleshooting
  • Cost management strategies and optimization tips

Discovery:

  • Phase 4 (Error Handling) already complete from Wave 1 ✓
  • CompensatingTransaction pattern with LIFO rollback
  • 30 tests (100% passing) for transaction safety
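
To illustrate what "LIFO rollback" means in a compensating transaction, here is a minimal sketch. It borrows the CompensatingTransaction name from the list above, but the interface (run, the context-manager usage) is an assumption made for illustration and may differ from the class in this repository.

```python
from typing import Callable, List


class CompensatingTransaction:
    """Run steps in order; on failure, undo completed steps in reverse order."""

    def __init__(self) -> None:
        self._undo_stack: List[Callable[[], None]] = []

    def run(self, action: Callable[[], None], compensate: Callable[[], None]) -> None:
        """Execute a step and remember how to undo it if a later step fails."""
        action()
        # Registered only after the action succeeded, so a failed step
        # is never "undone".
        self._undo_stack.append(compensate)

    def __enter__(self) -> "CompensatingTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None:
            # LIFO: the most recently completed step is compensated first.
            while self._undo_stack:
                undo = self._undo_stack.pop()
                try:
                    undo()
                except Exception:
                    pass  # keep rolling back the remaining steps
        return False  # never swallow the original exception


def demo() -> List[str]:
    """Three hypothetical steps; the third one fails."""
    events: List[str] = []

    def failing_step() -> None:
        raise RuntimeError("database write failed")

    try:
        with CompensatingTransaction() as tx:
            tx.run(lambda: events.append("upload"), lambda: events.append("undo upload"))
            tx.run(lambda: events.append("index"), lambda: events.append("undo index"))
            tx.run(failing_step, lambda: events.append("undo db"))
    except RuntimeError:
        pass
    return events
```

In demo(), the third step fails, so the two completed steps are undone in reverse: "undo index" first, then "undo upload".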

Wave 2 - Type Safety & Cache (October 2025)

Type Safety:

  • 100% type annotation coverage with MyPy strict mode
  • Zero Any type leakage from external libraries
  • Pre-commit hooks enforce type safety

Performance:

  • Production-ready embedding cache with <0.1ms latency
  • 70%+ cache hit rate with temporal locality
  • 1M lookups/sec throughput
  • SHA-256 cache keys with LRU eviction
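
The combination of SHA-256 keys and LRU eviction can be sketched with the standard library alone. This is an illustrative sketch, not the contents of src/cache.py: the class name, method names, and default capacity are assumptions.

```python
import hashlib
from collections import OrderedDict
from typing import List, Optional


class EmbeddingCache:
    """Small LRU cache keyed by the SHA-256 digest of the input text."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(text: str) -> str:
        """Fixed-length key, independent of how long the text is."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Optional[List[float]]:
        key = self.make_key(text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        return None

    def put(self, text: str, embedding: List[float]) -> None:
        key = self.make_key(text)
        self._entries[key] = embedding
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

An OrderedDict keeps both lookup and eviction at constant time, which is why it is a common choice for this kind of cache.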

Quality:

  • 41 new cache tests (unit + integration + performance)
  • Property-based testing with Hypothesis framework
  • 6 new ADRs documenting technical decisions (ADR-027 through ADR-030)

See CHANGELOG.md for full details.

What It Does

  1. Watches your scanner's inbox folder (/Users/krisstudio/Paper/InboxA)
  2. Detects new PDFs instantly using filesystem events
  3. Validates PDF integrity and waits for scanner to finish writing
  4. Processes documents using OpenAI Responses API for metadata extraction
  5. Stores results in SQLite database with full audit trail
  6. Uploads to OpenAI vector store for semantic search
  7. Moves processed PDFs to organized folders
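
Step 3 is the subtle one, because a scanner may write a PDF in several phases. One common approach, sketched below under stated assumptions, is to poll the file size until it stops changing and then run a cheap header/trailer check. The function names, polling interval, and thresholds are hypothetical, and the check is a sanity test rather than full PDF validation.

```python
import time
from pathlib import Path


def wait_until_stable(path: Path, checks: int = 3, interval: float = 0.5,
                      timeout: float = 60.0) -> bool:
    """Return True once the file size is non-zero and unchanged for
    `checks` consecutive polls; return False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable_polls = 0
    while time.monotonic() < deadline:
        try:
            size = path.stat().st_size
        except FileNotFoundError:
            size = -1  # file not there (yet); keep waiting
        if size > 0 and size == last_size:
            stable_polls += 1
            if stable_polls >= checks:
                return True
        else:
            stable_polls = 0
        last_size = size
        time.sleep(interval)
    return False


def looks_like_pdf(path: Path) -> bool:
    """Cheap integrity check: '%PDF-' header and an '%%EOF' marker near the end."""
    with path.open("rb") as handle:
        header = handle.read(5)
        handle.seek(0, 2)                 # jump to the end of the file
        size = handle.tell()
        handle.seek(max(0, size - 1024))  # read at most the last 1 KiB
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```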

Quick Start

1. Install Dependencies

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install required packages
pip install -r requirements.txt

2. Configure API Key

The daemon will automatically load your OpenAI API key from ~/.OPENAI_API_KEY:

# Save your API key to the file
echo "sk-your-actual-key-here" > ~/.OPENAI_API_KEY
chmod 600 ~/.OPENAI_API_KEY

Or set it as an environment variable:

export OPENAI_API_KEY=sk-your-actual-key-here
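
For reference, a key lookup covering both options might look like the sketch below. The function name is hypothetical, and the precedence (environment variable first, then the key file) is an assumption; this README does not say which source wins when both are set.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 key_file: Optional[Path] = None) -> str:
    """Return the API key from OPENAI_API_KEY or from ~/.OPENAI_API_KEY."""
    env = os.environ if env is None else env
    key_file = Path.home() / ".OPENAI_API_KEY" if key_file is None else key_file

    # Assumed precedence: an explicit environment variable wins.
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key

    if key_file.is_file():
        key = key_file.read_text(encoding="utf-8").strip()
        if key:
            return key

    raise RuntimeError(
        "No API key found: set OPENAI_API_KEY or create ~/.OPENAI_API_KEY"
    )
```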

3. Run the Daemon

# Start the automatic processing daemon
python3 run_daemon.py

The daemon will:

  • Create necessary directories if they don't exist
  • Start watching /Users/krisstudio/Paper/InboxA for new PDFs
  • Process documents automatically as they arrive
  • Log all activities to logs/paper_autopilot.log

4. Configure Your Scanner

Set your ScanSnap scanner to save PDFs to:

/Users/krisstudio/Paper/InboxA

See docs/scansnap-ix1600-setup.md for detailed scanner configuration.

Automatic Startup (macOS)

To have Paper Autopilot start automatically on login:

# Copy LaunchAgent plist
cp com.paperautopilot.daemon.plist ~/Library/LaunchAgents/

# Load and start the daemon
launchctl load ~/Library/LaunchAgents/com.paperautopilot.daemon.plist

# Verify it's running
launchctl list | grep paperautopilot

The daemon will now start automatically every time you log in to your Mac.

Repository Structure

.
├── run_daemon.py          # Entry point for automatic daemon
├── src/
│   ├── daemon.py          # File watching and automatic processing
│   ├── processor.py       # Document processing pipeline
│   ├── config.py          # Configuration management (Pydantic V2)
│   ├── cache.py           # LRU embedding cache (NEW)
│   ├── database.py        # SQLite database operations
│   ├── api_client.py      # OpenAI Responses API client
│   └── vector_store.py    # Vector store management
├── docs/
│   ├── DAEMON_MODE.md     # Detailed daemon setup guide
│   ├── RUNBOOK.md         # Production operations guide
│   ├── DEVELOPMENT_MODEL.md  # Parallel execution guide (NEW)
│   └── scansnap-ix1600-setup.md  # Scanner configuration
└── com.paperautopilot.daemon.plist  # macOS LaunchAgent config

Configuration

All settings can be configured via environment variables:

# Required
OPENAI_API_KEY=sk-...              # OpenAI API key

# Paths (defaults shown)
PAPER_AUTOPILOT_INBOX_PATH=/Users/krisstudio/Paper/InboxA
PAPER_AUTOPILOT_DB_URL=sqlite:///paper_autopilot.db

# Processing
OPENAI_MODEL=gpt-5-mini           # gpt-5-mini, gpt-5-nano, gpt-5, gpt-5-pro, gpt-4.1
API_TIMEOUT_SECONDS=300           # API call timeout (30-600s)
MAX_RETRIES=5                     # Retry attempts (1-10)

# Logging
LOG_LEVEL=INFO                    # DEBUG, INFO, WARNING, ERROR
LOG_FORMAT=json                   # json or text

See docs/DAEMON_MODE.md for complete configuration reference.
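
The defaults and ranges listed above can be expressed as a small validation layer. The project itself uses Pydantic V2 in src/config.py; the sketch below uses only the standard library to show the idea, and the Settings class and load_settings function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

ALLOWED_MODELS = ("gpt-5-mini", "gpt-5-nano", "gpt-5", "gpt-5-pro", "gpt-4.1")


@dataclass(frozen=True)
class Settings:
    model: str
    api_timeout_seconds: int
    max_retries: int
    log_level: str
    log_format: str


def _int_in_range(env: Mapping[str, str], name: str, default: int,
                  low: int, high: int) -> int:
    raw = env.get(name)
    value = default if raw is None else int(raw)
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the range {low}-{high}")
    return value


def _choice(env: Mapping[str, str], name: str, default: str,
            allowed: Tuple[str, ...]) -> str:
    value = env.get(name, default)
    if value not in allowed:
        raise ValueError(f"{name}={value!r} must be one of {allowed}")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        model=_choice(env, "OPENAI_MODEL", "gpt-5-mini", ALLOWED_MODELS),
        api_timeout_seconds=_int_in_range(env, "API_TIMEOUT_SECONDS", 300, 30, 600),
        max_retries=_int_in_range(env, "MAX_RETRIES", 5, 1, 10),
        log_level=_choice(env, "LOG_LEVEL", "INFO",
                          ("DEBUG", "INFO", "WARNING", "ERROR")),
        log_format=_choice(env, "LOG_FORMAT", "json", ("json", "text")),
    )
```

Failing fast on an out-of-range value at startup is usually easier to debug than a daemon that starts and then misbehaves.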

Monitoring

View daemon logs in real-time:

# Application logs (structured JSON)
tail -f logs/paper_autopilot.log | jq .

# Daemon stdout
tail -f logs/daemon_stdout.log

# Daemon errors
tail -f logs/daemon_stderr.log

Check daemon status:

# macOS LaunchAgent
launchctl list | grep paperautopilot

# View recent activity
grep "Processing complete" logs/paper_autopilot.log | tail -10
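
If jq is not available, the JSON log can also be summarized with a few lines of Python. This sketch assumes each log line is a JSON object with "level" and "message" fields; those field names are a guess, not something this README documents, so adjust them to match the actual log schema.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_log(lines: Iterable[str]) -> Tuple[Dict[str, int], int]:
    """Count log records per level and count 'Processing complete' messages."""
    levels: Counter = Counter()
    completed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            levels["unparsed"] += 1  # e.g. a stray traceback line
            continue
        if not isinstance(record, dict):
            levels["unparsed"] += 1
            continue
        # "level" and "message" are assumed field names.
        levels[str(record.get("level", "unknown"))] += 1
        if "Processing complete" in str(record.get("message", "")):
            completed += 1
    return dict(levels), completed
```

Calling it with an open file works directly, for example `summarize_log(open("logs/paper_autopilot.log", encoding="utf-8"))`.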

Folder Organization

/Users/krisstudio/Paper/
├── InboxA/          # Scanner drops PDFs here
├── Processed/       # Successfully processed PDFs
└── Failed/          # PDFs that failed processing

The daemon automatically moves PDFs to the appropriate folder after processing.
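
The move step can be sketched as follows. The function name is hypothetical, and the collision handling (appending a numeric suffix when a file of the same name already exists) is an assumption about how name clashes could be resolved, not a description of what the daemon does.

```python
import shutil
from pathlib import Path


def route_pdf(pdf: Path, root: Path, succeeded: bool) -> Path:
    """Move a PDF into root/Processed or root/Failed and return its new path."""
    target_dir = root / ("Processed" if succeeded else "Failed")
    target_dir.mkdir(parents=True, exist_ok=True)

    # Assumed behavior: never overwrite an earlier scan with the same name.
    target = target_dir / pdf.name
    counter = 1
    while target.exists():
        target = target_dir / f"{pdf.stem}-{counter}{pdf.suffix}"
        counter += 1

    shutil.move(str(pdf), str(target))
    return target
```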

Supported Models

Paper Autopilot uses only OpenAI Frontier models per project requirements:

  • gpt-5-mini (default) - Fast, cost-efficient
  • gpt-5-nano - Fastest, most cost-efficient
  • gpt-5 - Best for coding and agentic tasks
  • gpt-5-pro - Smarter and more precise
  • gpt-4.1 - Smartest non-reasoning model

Important: Never use gpt-4o or any model outside the list above. Paper Autopilot calls only the Responses API endpoint (/v1/responses), never the Chat Completions API.
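
A policy like this is easy to enforce in code before any request is built. The sketch below is illustrative only: validate_model and build_request are hypothetical helpers, not the project's api_client.py or scripts/check_model_policy.py, and nothing is sent over the network. The "model" and "input" fields are the basic request parameters of the Responses API.

```python
from typing import Dict

ALLOWED_MODELS = frozenset(
    {"gpt-5-mini", "gpt-5-nano", "gpt-5", "gpt-5-pro", "gpt-4.1"}
)
RESPONSES_ENDPOINT = "/v1/responses"


def validate_model(model: str) -> str:
    """Return the model name unchanged, or raise if it is not on the list."""
    if model not in ALLOWED_MODELS:
        allowed = ", ".join(sorted(ALLOWED_MODELS))
        raise ValueError(f"Model {model!r} is not allowed; use one of: {allowed}")
    return model


def build_request(model: str, prompt: str) -> Dict[str, object]:
    """Describe a Responses API call without performing it."""
    return {
        "endpoint": RESPONSES_ENDPOINT,
        "body": {"model": validate_model(model), "input": prompt},
    }
```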

Documentation

Contributing

Review AGENTS.md for project conventions, testing expectations, and security practices. Key points:

  • Follow PEP 8, run black before commits
  • Use pytest for automated testing
  • Never commit sample PDFs or raw API responses
  • Keep model selections aligned with Frontier models only
  • Run policy checks before PRs:
    python scripts/check_model_policy.py --diff
    pytest tests/test_model_policy.py

Architecture

Paper Autopilot implements a production-grade document processing pipeline:

  1. File Watching: Real-time detection with filesystem events (watchdog library)
  2. File Stabilization: Handles scanner's phased writes (waits for OCR completion)
  3. Deduplication: SHA-256 hash-based duplicate detection
  4. Processing Pipeline: Responses API → Schema Validation → Database Storage
  5. Vector Search: Automatic upload to OpenAI vector store with LRU embedding cache
  6. Audit Trail: Complete processing history with costs and timing
  7. Error Handling: Automatic retries with exponential backoff
  8. Type Safety: MyPy strict mode with 100% annotation coverage
  9. Performance: <0.1ms cache latency, 70%+ hit rate, >1M ops/sec throughput
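
Two of the steps above, hash-based deduplication (step 3) and retries with exponential backoff (step 7), are sketched below using only the standard library. All names, the base delay, the delay cap, and the jitter are assumptions; the sketch also assumes MAX_RETRIES counts retries after the first attempt, which this README does not specify.

```python
import hashlib
import random
import time
from pathlib import Path
from typing import Callable, Set, TypeVar

T = TypeVar("T")


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans are never loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, seen_hashes: Set[str]) -> bool:
    """Return True if identical content was seen before; otherwise record it."""
    file_hash = file_sha256(path)
    if file_hash in seen_hashes:
        return True
    seen_hashes.add(file_hash)
    return False


def retry_with_backoff(call: Callable[[], T], max_retries: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`; on failure wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

In a real pipeline the set of seen hashes would live in the database rather than in memory, and the retry wrapper would catch only errors known to be transient.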

License

See LICENSE file for details.


Maintained By: Platform Engineering Team
Version: 1.0.0
