chore(scheduled-ops): COURIER hardening (retries, observability, single-flight) #285

@rz1989s

Background

Spec 5 PR-C. Sipher's COURIER crank (`packages/agent/src/crank.ts:99-113`) is coarse: it handles happy-path scheduled ops but is not production-hardened. This work lands after PR-A and PR-B unblock the actual broadcast paths.

Issues to fix

  1. No retries on transient failures: a network blip during broadcast leaves the op as `pending`, and the next tick retries immediately with no backoff. During an outage this could overload (DoS) the RPC endpoint.
  2. No deduplication: concurrent ticks could both pick up the same op. Solana dedupes by signature on-chain, but the duplicate work is wasteful and race-prone.
  3. Limited observability: only a single line is logged per tick. No per-op latency, no correlation with growth-hook emission, and no failure-rate signal for dashboards.
  4. Race conditions on op_status: two cranks running in parallel (e.g., a multi-process deploy) could both pick up the same op, because the status update is not an atomic compare-and-set (CAS).
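
To make items 2 and 4 concrete, here is a minimal in-memory model of the race. It is not the code in `crank.ts`: `FakeStore`, `naiveTick`, and `casTick` are hypothetical names, and the store only stands in for the scheduled-ops table. The `await` between reading and writing the status models the gap in which a second tick can pick up the same op.

```typescript
type Status = "pending" | "executing" | "done";

// Hypothetical stand-in for the scheduled-ops table (assumption, not the real schema).
class FakeStore {
  private status = new Map<string, Status>();
  broadcasts = 0;

  constructor(opIds: string[]) {
    for (const id of opIds) this.status.set(id, "pending");
  }

  async read(id: string): Promise<Status | undefined> {
    await Promise.resolve(); // yield, as a real query would
    return this.status.get(id);
  }

  async write(id: string, next: Status): Promise<void> {
    await Promise.resolve();
    this.status.set(id, next);
  }

  // Compare-and-set: changes the row only if it still has the expected status.
  // Returns the number of rows changed (0 or 1), like an UPDATE's rowcount.
  async compareAndSet(id: string, expected: Status, next: Status): Promise<number> {
    await Promise.resolve();
    if (this.status.get(id) !== expected) return 0;
    this.status.set(id, next);
    return 1;
  }
}

// Check-then-act: two ticks can both see "pending" before either one writes.
async function naiveTick(store: FakeStore, id: string): Promise<void> {
  if ((await store.read(id)) !== "pending") return;
  await store.write(id, "executing");
  store.broadcasts += 1; // stands in for the real broadcast
  await store.write(id, "done");
}

// Claim-then-act: only the tick whose compare-and-set changed a row proceeds.
async function casTick(store: FakeStore, id: string): Promise<void> {
  if ((await store.compareAndSet(id, "pending", "executing")) !== 1) return;
  store.broadcasts += 1;
  await store.write(id, "done");
}

// Runs two concurrent ticks against each variant and reports broadcast counts.
async function demo(): Promise<{ naive: number; cas: number }> {
  const a = new FakeStore(["op-1"]);
  await Promise.all([naiveTick(a, "op-1"), naiveTick(a, "op-1")]);
  const b = new FakeStore(["op-1"]);
  await Promise.all([casTick(b, "op-1"), casTick(b, "op-1")]);
  return { naive: a.broadcasts, cas: b.broadcasts };
}
```

In this model the naive variant broadcasts twice and the compare-and-set variant broadcasts once, because the status check and the status change happen in one uninterruptible step.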

What to do

  • Exponential backoff per op: `next_exec = now + min(max_backoff, base * 2^attempts)`
  • Single-flight per op: lock by op-id in a Map for the duration of a tick
  • Structured logging: emit per-op start/end events with latency + result
  • Atomic status CAS: SQLite `UPDATE ... WHERE status='pending'`; check the returned rowcount and proceed only if rowcount=1
  • Metrics endpoint: `GET /admin/api/courier/stats`, returning a last-N-tick summary
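
The first two bullets can be sketched as follows. This is a sketch under assumptions, not the planned implementation: `BASE_MS` and `MAX_BACKOFF_MS` are placeholder values the issue does not specify, and `nextExec`, `inFlight`, and `runSingleFlight` are hypothetical names.

```typescript
// Assumed tuning values; the issue does not specify them.
const BASE_MS = 1_000;
const MAX_BACKOFF_MS = 5 * 60_000;

// next_exec = now + min(max_backoff, base * 2^attempts)
function nextExec(
  nowMs: number,
  attempts: number,
  baseMs: number = BASE_MS,
  maxMs: number = MAX_BACKOFF_MS,
): number {
  const delay = Math.min(maxMs, baseMs * 2 ** attempts);
  return nowMs + delay;
}

// In-process single-flight: at most one run per op id at a time.
const inFlight = new Map<string, Promise<void>>();

function runSingleFlight(opId: string, run: () => Promise<void>): Promise<void> {
  const existing = inFlight.get(opId);
  if (existing) return existing; // join the run already in progress

  // Release the lock whether the run succeeds or fails.
  const p = run().finally(() => {
    inFlight.delete(opId);
  });
  inFlight.set(opId, p);
  return p;
}
```

With these placeholder values, an op that has failed three times would wait 8 seconds before its next attempt, and the delay stops growing at 5 minutes. A `Map` like this only guards a single process; protection across processes would have to come from the status CAS.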

Why

  • Production-grade reliability for unattended operation
  • Prevents RPC quota exhaustion during outages
  • Operator visibility into crank behavior

Cost

~1-2 weeks (one PR). Depends on PR-A + PR-B for the full validation surface, since no scheduled ops are actually broadcasting today.

References

Acceptance

  • Exponential backoff verified via unit test (consecutive failures increase the delay before the op's next attempt)
  • Single-flight lock verified: a concurrent-tick simulation produces exactly one broadcast
  • `/admin/api/courier/stats` returns sane summary (per-op latency, success/fail counts, last tick timestamps)
  • No regression in existing scheduled-op creation tests
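
As one possible reading of the stats criterion, the summary could be built from per-op events recorded on each tick. The shapes below (`OpEvent`, `TickRecord`, `CourierStats`) and the `summarize` helper are assumptions for illustration; the issue does not define the response format.

```typescript
// Hypothetical per-op event, as structured logging might record it.
interface OpEvent {
  opId: string;
  startedAtMs: number;
  endedAtMs: number;
  ok: boolean;
}

// Hypothetical record of one crank tick.
interface TickRecord {
  tickAtMs: number;
  ops: OpEvent[];
}

// Hypothetical response body for GET /admin/api/courier/stats.
interface CourierStats {
  ticks: number;
  lastTickAtMs: number[];
  success: number;
  failed: number;
  latencyMsByOp: Record<string, number>;
}

// Summarizes the most recent `lastN` ticks of the history.
function summarize(history: TickRecord[], lastN: number): CourierStats {
  // slice(-0) would return the whole array, so guard the zero case.
  const recent = lastN > 0 ? history.slice(-lastN) : [];
  const stats: CourierStats = {
    ticks: recent.length,
    lastTickAtMs: recent.map((t) => t.tickAtMs),
    success: 0,
    failed: 0,
    latencyMsByOp: {},
  };
  for (const tick of recent) {
    for (const op of tick.ops) {
      if (op.ok) stats.success += 1;
      else stats.failed += 1;
      // Keep the latency of the most recent attempt for each op.
      stats.latencyMsByOp[op.opId] = op.endedAtMs - op.startedAtMs;
    }
  }
  return stats;
}
```

A unit test for the endpoint could then feed a fixed history into `summarize` and compare the counts and timestamps against known values.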

Labels: enhancement
