Write-through sync in brain.correct() — eliminate hook+session dependency
Problem
Current cloud sync only fires inside gradata.hooks.session_close, which requires:
hermes --continue= sessions never end)
GRADATA_API_KEY in subprocess env (often missing)
Four independent failure points, all silent. Discovered 2026-05-14 after our 27-agent fleet hadn't synced for ~5 days despite the sync infrastructure being "operational."
Band-aid landed: cron job gradata-sync-cron runs every 60s, reads system.db, POSTs deltas to /api/v1/sync. Works today but adds another moving part.
Proposal
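brain.correct(draft, final) POSTs the correction to api.gradata.ai immediately after the local SQLite write. Fire-and-forget HTTP via background thread or asyncio. Local SQLite remains the source of truth; the cloud is a write-through replica.
Architecture
brain.correct(draft, final)
├── write to system.db (existing)
├── append to events.jsonl (existing)
└── enqueue to sync_queue table (NEW)
    └── background drain (NEW)
        └── POST /api/v1/ingest (NEW, lightweight single-correction endpoint)
            ├── on success: mark synced_at
            └── on failure: retry on next correct() call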
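The write path proposed above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `WriteThroughSketch`, its `post` callback and `flush()` are hypothetical names, an in-memory `queue.Queue` stands in for the sync_queue table, and a plain list stands in for system.db. It only shows that correct() finishes its local write and returns without waiting on the network.

```python
import queue
import threading
import time


class WriteThroughSketch:
    """Hypothetical stand-in for Brain: local write first, cloud POST off-thread."""

    def __init__(self, post):
        self.local = []                 # stands in for system.db
        self.synced = []                # rows the "cloud" has acknowledged
        self._pending = queue.Queue()   # stands in for the sync_queue table
        self._post = post               # stands in for POST /api/v1/ingest
        worker = threading.Thread(target=self._drain_forever, daemon=True)
        worker.start()

    def correct(self, draft, final):
        event = {"draft": draft, "final": final}
        self.local.append(event)        # local write stays in the critical path
        self._pending.put(event)        # enqueue only; no network call here

    def _drain_forever(self):
        while True:
            event = self._pending.get()
            try:
                ok = self._post(event)
            except Exception:
                ok = False
            if ok:
                self.synced.append(event)
            else:
                time.sleep(0.05)            # brief pause before retrying
                self._pending.put(event)    # keep it queued for a later attempt
            self._pending.task_done()

    def flush(self):
        """Block until every queued event has been acknowledged."""
        self._pending.join()
```

A caller would pass whatever function performs the HTTP request as `post`. Because the worker is a daemon thread, anything still queued at interpreter exit is lost in this sketch, which is the gap the persistent sync_queue table is meant to close.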
Files
New
src/gradata/_sync_queue.py (~80 LOC): SQLite table + drain function
src/gradata/_sync_worker.py (~50 LOC): background daemon thread, 30s timer
Modified
src/gradata/brain.py:
  Brain.__init__: start sync worker if GRADATA_API_KEY set
  Brain.correct: enqueue after local write
  Brain.end_session: drain synchronously (preserve batch compat)
src/gradata/_migrations/: new migration for sync_queue table
cloud/app/routes/sync.py: add POST /api/v1/ingest for single-correction writes
cloud/app/models.py: IngestRequest (single CorrectionPayload, no batching)
Schema for sync_queue table
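The schema itself is not reproduced here, so the following is only a guess at its shape. The event_id and synced_at columns come from the proposal text. The payload and enqueued_at columns, and the helper names `open_queue` and `pending`, are assumptions made for illustration. `enqueue` and `drain` mirror the function names planned for _sync_queue.py, but their signatures are invented.

```python
import sqlite3
import time

# Assumed shape of the table; only event_id and synced_at appear in the proposal.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sync_queue (
    event_id    TEXT PRIMARY KEY,  -- dedup key
    payload     TEXT NOT NULL,     -- serialized correction (assumed JSON)
    enqueued_at REAL NOT NULL,     -- hypothetical column
    synced_at   REAL               -- NULL until the cloud acknowledges the row
)
"""


def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, event_id, payload):
    # INSERT OR IGNORE makes a repeated event_id a no-op.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO sync_queue (event_id, payload, enqueued_at) "
            "VALUES (?, ?, ?)",
            (event_id, payload, time.time()),
        )


def drain(conn, post):
    """Try each pending row once; mark synced_at on success, leave failures queued."""
    rows = conn.execute(
        "SELECT event_id, payload FROM sync_queue "
        "WHERE synced_at IS NULL ORDER BY enqueued_at"
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        try:
            ok = post(payload)
        except Exception:
            ok = False
        if ok:
            with conn:
                conn.execute(
                    "UPDATE sync_queue SET synced_at = ? WHERE event_id = ?",
                    (time.time(), event_id),
                )
            sent += 1
    return sent


def pending(conn):
    """Backlog size, the number a `sync_queue: N pending` report would show."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sync_queue WHERE synced_at IS NULL"
    ).fetchone()
    return row[0]
```

Keeping failed rows in place with synced_at still NULL means a later drain picks them up again without separate retry bookkeeping.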
Implementation plan (5-day target)
Day 1: Schema + queue primitives
_sync_queue.py::enqueue() and _sync_queue.py::drain()
Day 2: Cloud /ingest endpoint
IngestRequest model (single CorrectionPayload)
POST /api/v1/ingest handler: same projector path as /sync but single-row
Day 3: Wire into Brain class
Brain.__init__ starts _sync_worker thread if API key present
Brain.correct enqueues after local write (non-blocking)
Brain.close() drains synchronously before exit
end_session() still triggers full sync
Day 4: Fleet deployment
data-engineer)
brain.correct() call shows up in dashboard in <5s
Day 5: Hook deprecation
cloud_sync_tick from session_close.py (now redundant)
gradata-sync-cron (replaced by SDK write-through)
Update gradata install --agent X to skip writing the on_session_end cloud-sync hook entry
Acceptance criteria
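brain.correct(draft, final) returns in <50ms (no network in critical path)
Cloud dashboard reflects the correction within 5 seconds of the call
If the cloud is unreachable, the correction is queued locally and synced on next opportunity
Re-running brain.correct() with the same content is idempotent (existing dedup via event_id)
gradata doctor reports sync_queue: N pending so users can see backlog
Removing the hook entry from ~/.hermes/config.yaml does NOT break cloud sync
Risks
Background thread crashes → corrections accumulate in queue. Mitigation: queue depth metric in gradata doctor, watchdog cron alerts on backlog >1000.
Rate limit at /ingest → bursts of corrections (e.g. fleet restart) overflow. Mitigation: client-side batching (group corrections enqueued within 1s into one POST).
API key rotation → all corrections fail until key updated. Mitigation: log 401s loudly to local file, surface in gradata doctor.
Schema drift between SDK and cloud → 422 validation errors. Mitigation: version the IngestRequest model, server accepts N and N-1.
Rollback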
If write-through causes issues:
Set GRADATA_DISABLE_WRITE_THROUGH=1 env var → SDK falls back to hook+session_close path
gradata-sync-cron cron job
Refs
/home/olive/gradata-office-hours-memo.md
/home/olive/.gradata/sync_cron.py, job id ad2bacd12fdf
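The client-side batching mitigation listed under Risks (group corrections enqueued within 1s into one POST) might look something like the sketch below. `batch_by_window` is a hypothetical helper, not part of the SDK, and the (timestamp, payload) input shape is an assumption. Since IngestRequest is described as carrying a single CorrectionPayload, sending a batch would presumably need a different request shape. This sketch covers only the grouping step.

```python
def batch_by_window(items, window=1.0):
    """Group (timestamp, payload) pairs into batches.

    A batch starts at its first item's timestamp and accepts every item
    enqueued no more than `window` seconds after that start.
    """
    batches = []
    current = []
    start = None
    for ts, payload in sorted(items, key=lambda item: item[0]):
        if start is None or ts - start > window:
            if current:
                batches.append(current)
            current = []
            start = ts
        current.append(payload)
    if current:
        batches.append(current)
    return batches
```

Anchoring each window at the first item's timestamp, and not sliding it with each new item, bounds how long any single correction can wait before its batch is sent.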