fix(daemon): surface OpenCode error frames + treat empty-output runs as failed#700
Conversation
…as failed Closes nexu-io#691. OpenCode runs would silently complete in ~3 seconds without producing any visible chat output and still be rendered as a successful turn — three independent bugs along the structured-stream path conspired to produce this silent-failure shape. ## Bug 1 — `apps/daemon/src/json-event-stream.ts:85-91` OpenCode emits structured error frames on stdout (e.g. provider auth failures, network errors, schema mismatches) and still exits 0. The parser was downgrading these to `{type: 'raw', line: ...}`, which the chat UI does not render as an assistant message. The error string was discarded as "no-op output." Fix: emit a proper `{type: 'error', message, raw}` event matching the qoder-stream contract that the daemon's existing error-handling path already recognises. ## Bug 2 — `apps/daemon/src/server.ts:4199-4205` Even after Bug 1 was fixed, the json-event-stream branch wired the parser to a bare `(ev) => send('agent', ev)` lambda — bypassing the `sendAgentEvent` wrapper that interprets `type:'error'` events and sets the `agentStreamError` flag the close handler reads to flip the run to `failed`. So an emitted `error` event would just be forwarded as a no-op `agent` SSE event with no lifecycle effect. Fix: route json-event-stream through `sendAgentEvent`, mirroring the qoder-stream-json wiring at line 4175. ## Bug 3 — `apps/daemon/src/server.ts:4220-4234` Even after Bugs 1 and 2 are fixed, there's still a class of runs where OpenCode never emits any error frame, never emits any substantive event, and exits 0. Pre-fix this was marked `succeeded` and the user saw a blank chat with no diagnostic. Fix: track `agentProducedOutput` inside `sendAgentEvent` (set on `text_delta`, `thinking_delta`, `tool_use`, `tool_result`, `artifact` — deliberately NOT on `status` / `usage`, since a model can emit token-usage numbers for an empty completion). When the close handler sees `code === 0 && trackingSubstantiveOutput && !agentProducedOutput` the run is marked `failed` with an explicit AGENT_EXECUTION_FAILED SSE error so the chat shows a clear reason instead of a silent empty turn. The check is gated by `trackingSubstantiveOutput` so it only fires on streams that actually contribute to the output flag (currently qoder-stream-json and json-event-stream). ACP sessions and plain stdout streams keep their existing success/failure determination. ## Tests - 3 new unit tests in `apps/daemon/tests/json-event-stream.test.ts` pin the OpenCode error event shape: full repro (`error.data.message`), `error.name` fallback, and the generic-fallback shape when `error` is empty. - All 60 daemon test files (851 tests) pass on `pnpm --filter @open-design/daemon test`. All 42 web test files (309 tests) pass on `pnpm --filter @open-design/web test`. - Full repo `pnpm typecheck` clean. ## Live verification Verified end-to-end via a stub `opencode` binary that mimics each of the failure shapes against `pnpm tools-dev run web`: 1. Stub emits `{"type":"error",...}` then `exit 0` — run now ends as `failed` with the OpenCode error message surfaced as an SSE `error` event. Pre-fix this was `succeeded` with an empty chat. 2. Stub emits nothing then `exit 0` — run now ends as `failed` with "Agent completed without producing any output…" diagnostic. Pre-fix this was `succeeded` with an empty chat. 3. Stub emits a normal `step_start` / `text` / `step_finish` sequence then `exit 0` — run still succeeds. (Regression check.) ## Out of scope (mentioned for the next person) - `claude-stream-json` and `copilot-stream-json` still wire to a bare `(ev) => send('agent', ev)` and don't currently parse `type:'error'` frames. If their CLIs ever start emitting structured error events the same pattern (route through `sendAgentEvent` + emit proper `type:'error'`) applies. Not in scope here because we have no evidence those CLIs do this today, and changing the wiring without a confirmed failure mode risks regressing currently-working flows. - ACP sessions (`pi-rpc`, `acp-json-rpc`) own their own success / failure determination via `acpSession?.hasFatalError()` and the empty-output guard explicitly skips them via `trackingSubstantiveOutput`. - Plain stdout streams have no event-level tracking, so the empty- output guard skips them too. Diagnosing a no-output plain-stream agent is a separate problem that needs different signals.
|
CI failure is pre-existing on `main` HEAD, not from this PR. The failing test is `e2e/tests/localized-content.test.ts` complaining that `de` / `ru` / `fr` locale content covers 67 skills but the canonical list has 68: ``` I rebased onto upstream/main HEAD (`9af2886`) and ran the same test locally — it fails identically on a clean main checkout, no diff applied. The 68th skill is `social-media-dashboard` from PR #678 (`42e4d08 feat(skills): add social-media-dashboard skill + Totality Festival design system`), which added the skill to the canonical list but didn't backfill the three localized content files. Confirmed scope of this PR is unaffected:
The localized-content gap should be backfilled in a separate PR (or rolled into #678's follow-up). Happy to spin one up if helpful — just say the word. |
lefarcen
left a comment
There was a problem hiding this comment.
Hey @Sid-Qin! 👋
This is a really solid fix — three independent bugs along the stream path, each one analyzed at the right layer with clear post-mortem documentation. The empty-output guard is particularly clever: gating it with trackingSubstantiveOutput keeps ACP/plain-stdout streams from false-positiving, while the SUBSTANTIVE_AGENT_EVENT_TYPES set correctly distinguishes "model emitted nothing" from "model emitted lifecycle noise."
What I like:
- The three-level error fallback (
error.data.message→error.name→ generic) is thorough - Routing json-event-stream through
sendAgentEventmirrors the qoder-stream pattern — consistent architecture - Test coverage pins the error shape regressions
- The PR body doubles as a post-mortem for whoever audits the next stream adapter
One observation (non-blocking):
The empty-output guard path in server.ts:4256-4280 has no direct unit test — it's an integration scenario that needs a full daemon + child process setup to exercise. That's probably fine given the live e2e verification table in the PR body, but if you ever add daemon integration tests, this would be a good candidate.
No security / correctness issues found. The raw: stringifyContent(obj) in the error event is just the original JSON string OpenCode already emitted, and the Set lookup on SUBSTANTIVE_AGENT_EVENT_TYPES is negligible for stream volumes.
Looks good to me; deferring final approval to a maintainer. 🎉
|
Confirmed — your diagnosis is spot-on. The locale gap from PR #678 is independent of this fix, and your scope (daemon stream handling) is unaffected. No need to block this PR on the locale backfill — that can land separately. If you want to spin up a quick fix for the // locale files, feel free, but it's not a prerequisite here. The daemon fix itself is clean and ready for maintainer review once CI alignment is sorted (either via the locale backfill or temporarily skipping the failing test). |
mrcfps
left a comment
There was a problem hiding this comment.
@Sid-Qin I reviewed the daemon stream handling changes for OpenCode error frames, the json-event-stream wiring through sendAgentEvent, the empty-output failure guard, and the new parser regression coverage. The implementation cleanly surfaces structured error frames and avoids the silent-success path while keeping ACP/plain stdout streams gated out; the changed ranges look safe and maintainable. Thanks for the careful post-mortem and live verification work here — this should make the failure mode much clearer for users. 🙂
|
Thanks @mrcfps 🙏 — appreciate the close read on the gating logic. The `trackingSubstantiveOutput` flag was the part I went back and forth on the most; landing it as opt-in (qoder-stream-json + json-event-stream) felt safer than fanning out to every stream when only OpenCode has a confirmed need today, and the comment in the close handler should make it cheap for the next person to flip on for claude/copilot when those CLIs justify it. #702 is the locale backfill that unblocks CI on rebase here. Either order works for landing — admin-merge this one now or wait for #702 → green CI on rebase, the daemon-side outcome is identical. |
|
Merged! 🎉 Thanks @Sid-Qin for the careful post-mortem and the three-layer fix — this closes a subtle failure mode that was hard to diagnose. The structured error surfacing + empty-output guard should make OpenCode runs much clearer for users going forward. |
Summary
Closes #691. OpenCode runs would silently complete in ~3 seconds without producing any visible chat output and still be rendered as a successful turn — three independent bugs along the structured-stream path conspired to produce this silent-failure shape. Fix lands all three at the right layer with full live-traffic validation.
PR description == post-mortem for whoever audits the next stream adapter that hits the same shape.
cc @lefarcen — your diagnosis on #691 cited the exact file:line for two of the three bugs; this implements that plus the empty-output guard you flagged in the third bullet.
Root cause (three independent bugs)
Bug 1 — `apps/daemon/src/json-event-stream.ts:85-91`
OpenCode emits structured error frames on stdout (provider auth failures, network errors, schema mismatches) and still exits 0. The parser was downgrading these to `{type: 'raw', line: ...}`, which the chat UI does not render as an assistant message. The error string was effectively discarded.
Fix: emit a proper `{type: 'error', message, raw}` event matching the qoder-stream contract that the daemon's existing error-handling path already recognises (server.ts:4156-4168).
Bug 2 — `apps/daemon/src/server.ts:4199-4205`
Even with Bug 1 fixed, the json-event-stream branch wired the parser to a bare `(ev) => send('agent', ev)` lambda — bypassing the `sendAgentEvent` wrapper that interprets `type:'error'` and sets the `agentStreamError` flag the close handler reads to flip the run to `failed`. So an emitted `error` event would just be forwarded as a no-op `agent` SSE event with no lifecycle effect.
Fix: route json-event-stream through `sendAgentEvent`, mirroring the qoder-stream-json wiring at line 4175.
Bug 3 — `apps/daemon/src/server.ts:4220-4234` (the empty-output case)
Even with Bugs 1 and 2 fixed, there's still a class of runs where the upstream emits no error frame, no substantive event, and exits 0. Pre-fix this was marked `succeeded` and the user saw a blank chat with no diagnostic.
Fix: track `agentProducedOutput` inside `sendAgentEvent` (set on `text_delta`, `thinking_delta`, `tool_use`, `tool_result`, `artifact` — deliberately NOT on `status` / `usage`, since a model can emit token-usage numbers for an empty completion). When the close handler sees `code === 0 && trackingSubstantiveOutput && !agentProducedOutput`, the run is marked `failed` with an explicit `AGENT_EXECUTION_FAILED` SSE error so the chat shows a clear reason instead of an empty success.
The check is gated by `trackingSubstantiveOutput` so it only fires on streams that actually contribute to the output flag — currently `qoder-stream-json` and `json-event-stream`. ACP sessions and plain stdout streams keep their existing success/failure determination.
Live verification
Verified end-to-end via a stub `opencode` binary against `pnpm tools-dev run web` for all three failure shapes:
The exit code is propagated correctly in all three cases (the actual code from the child process, not a placeholder).
Test plan
Out of scope (called out for the next person who audits this surface)