Background
Spec 5 PR-C. Sipher's COURIER crank (`packages/agent/src/crank.ts:99-113`) is coarse: it works for happy-path scheduled ops but isn't production-hardened. This PR lands after PR-A and PR-B unblock the actual broadcast paths.
Issues to fix
- No retry policy for transient failures: a network blip during broadcast leaves the op as `pending`, so the next tick retries immediately with no backoff. This could DoS the RPC during outages.
- No deduplication: concurrent ticks could both pick up the same op. Solana dedupes by signature on-chain, but the duplicate broadcast is wasteful and race-prone.
- Limited observability: a single log line per tick. No per-op latency, no growth-hook emission correlation, no failure-rate signal for dashboards.
- Race conditions on op_status: two cranks running in parallel (e.g., a multi-process deploy) could both pick up the same op, because the status update isn't an atomic CAS.
What to do
- Exponential backoff per op: `next_exec = now + min(max_backoff, base * 2^attempts)`
- Single-flight per op: lock by op-id in a Map for the duration of a tick
- Structured logging: emit per-op start/end events with latency + result
- Atomic status CAS: SQLite `UPDATE ... WHERE status='pending'`, then check the affected row count and proceed only if rowcount=1
- Metrics endpoint: `GET /admin/api/courier/stats` — last-N-tick summary
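The first, second, and fourth items above can be sketched together as follows. This is an illustration under stated assumptions, not the code in `crank.ts`: every function name, the backoff constants, and the `running`/`done` statuses are hypothetical, and an in-memory Map stands in for the SQLite table so the sketch runs without a database driver.

```typescript
// Sketch only: all names and constants here are hypothetical, not taken from crank.ts.
const BASE_BACKOFF_MS = 1_000;     // assumed base delay
const MAX_BACKOFF_MS = 5 * 60_000; // assumed cap (5 minutes)

// next_exec = now + min(max_backoff, base * 2^attempts)
function nextExecAt(
  nowMs: number,
  attempts: number,
  baseMs: number = BASE_BACKOFF_MS,
  maxMs: number = MAX_BACKOFF_MS,
): number {
  // 2 ** attempts may overflow to Infinity for huge counts; Math.min still yields the cap.
  return nowMs + Math.min(maxMs, baseMs * 2 ** Math.max(0, attempts));
}

// Single-flight: an in-process Map keyed by op id. It only guards overlapping ticks
// inside ONE process; cross-process safety comes from the status CAS below.
const inFlight = new Map<string, number>(); // opId -> time the lock was taken

function tryAcquire(opId: string, nowMs: number): boolean {
  if (inFlight.has(opId)) return false;
  inFlight.set(opId, nowMs);
  return true;
}

function release(opId: string): void {
  inFlight.delete(opId);
}

// Status CAS. Against SQLite this would be one statement along the lines of
//   UPDATE <ops table> SET status = 'running' WHERE id = ? AND status = 'pending'
// followed by a check that exactly one row changed. 'running' and 'done' are
// assumed status values; only 'pending' appears in the spec.
type OpStatus = "pending" | "running" | "done";
const statusById = new Map<string, OpStatus>();

function casStatus(opId: string, from: OpStatus, to: OpStatus): number {
  if (statusById.get(opId) !== from) return 0; // rowcount 0: someone else claimed it
  statusById.set(opId, to);
  return 1; // rowcount 1: this caller owns the op
}

// One crank's attempt at an op; returns true only if it actually broadcast.
function processOp(opId: string, nowMs: number, broadcast: (id: string) => void): boolean {
  if (!tryAcquire(opId, nowMs)) return false;
  try {
    if (casStatus(opId, "pending", "running") !== 1) return false;
    try {
      broadcast(opId);
    } catch {
      // Transient failure: hand the op back; the caller would reschedule via nextExecAt.
      casStatus(opId, "running", "pending");
      return false;
    }
    casStatus(opId, "running", "done");
    return true;
  } finally {
    release(opId);
  }
}
```

With the assumed constants, `nextExecAt(0, 3)` yields a delay of 8 seconds and any large attempt count saturates at the 5-minute cap. The Map lock and the CAS are deliberately layered: the Map is cheap and covers overlapping ticks in one process, while the CAS is the only guard that still holds in a multi-process deploy.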
Why
- Production-grade reliability for unattended operation
- Prevents RPC quota exhaustion during outages
- Operator visibility into crank behavior
Cost
~1-2 weeks (one PR). Depends on PR-A + PR-B for the full validation surface (no scheduled ops actually broadcast today).
References
Acceptance
- Exponential backoff verified via unit test (consecutive failures increase the op's retry delay, up to the cap)
- Single-flight lock verified: a concurrent-tick simulation produces exactly one broadcast
- `/admin/api/courier/stats` returns a sane summary (per-op latency, success/fail counts, last tick timestamps)
- No regression in existing scheduled-op creation tests
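As a rough illustration of what a "sane summary" from the stats endpoint might aggregate, the sketch below keeps the last N tick records in memory and reduces them to counts, latency figures, and the latest tick timestamps. The record shapes, field names, and the value of N are all assumptions; the spec only lists the kinds of data the response should include.

```typescript
// Hypothetical shapes: the spec names the data to expose, not these field names.
interface OpResult { opId: string; latencyMs: number; ok: boolean; }
interface TickRecord { startedAt: number; finishedAt: number; ops: OpResult[]; }

interface CourierStats {
  ticks: number;
  lastTickStartedAt: number | null;
  lastTickFinishedAt: number | null;
  success: number;
  fail: number;
  avgLatencyMs: number | null;
  maxLatencyMs: number | null;
  perOpLatencyMs: { opId: string; latencyMs: number }[];
}

const MAX_TICKS = 50; // assumed value of N in "last-N-tick summary"
const recentTicks: TickRecord[] = [];

// Append a finished tick and drop the oldest one once the window is full.
function recordTick(tick: TickRecord): void {
  recentTicks.push(tick);
  if (recentTicks.length > MAX_TICKS) recentTicks.shift();
}

// Reduce the retained ticks to the summary a stats handler could serialize as JSON.
function summarize(ticks: readonly TickRecord[]): CourierStats {
  let success = 0;
  let fail = 0;
  let totalLatency = 0;
  let maxLatency: number | null = null;
  const perOp: { opId: string; latencyMs: number }[] = [];

  for (const tick of ticks) {
    for (const op of tick.ops) {
      if (op.ok) success++; else fail++;
      totalLatency += op.latencyMs;
      if (maxLatency === null || op.latencyMs > maxLatency) maxLatency = op.latencyMs;
      perOp.push({ opId: op.opId, latencyMs: op.latencyMs });
    }
  }

  const opCount = success + fail;
  const last = ticks.length > 0 ? ticks[ticks.length - 1] : null;
  return {
    ticks: ticks.length,
    lastTickStartedAt: last ? last.startedAt : null,
    lastTickFinishedAt: last ? last.finishedAt : null,
    success,
    fail,
    avgLatencyMs: opCount > 0 ? totalLatency / opCount : null,
    maxLatencyMs: maxLatency,
    perOpLatencyMs: perOp,
  };
}
```

Returning `null` rather than `0` for the latency fields when no ops ran keeps an idle crank distinguishable from a fast one, which matters if the summary feeds a failure-rate dashboard.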