Claude Code wrapper: distinguish semantic completion from transport failure and improve recovery #353

@Th0rgal

Summary

Claude Code reliability in sandboxed.sh still appears meaningfully worse than it should be for long-running missions. The codebase already contains a large amount of Claude-specific compensation logic, which is good evidence that the failure modes are real, but the current wrapper is still brittle enough that missions can stop even when Claude is only in an intermediate/non-final state.

The fix is not simply "use the CLI directly." The current implementation wraps the Claude CLI heavily: per-mission auth/home isolation, session continuation, PTY execution, event normalization, startup and idle timeouts, and process-exit heuristics. That adds resilience in some cases, but it also creates more edges where partial/intermediate Claude behavior can be mistaken for terminal completion or a fatal stall.

Evidence in current code

  • Claude client still uses the vendor CLI with --print --output-format stream-json --include-partial-messages in src/backend/claudecode/client.rs.
  • Mission runner has Claude-specific session handling, PTY spawning, timeout logic, and non-JSON capture in src/api/mission_runner.rs.
  • The runner explicitly documents known bad states:
    • stdin piping causing "Agent is working..." hangs
    • hangs when stdout is a normal pipe
    • startup timeout when no parseable stream events arrive
    • idle timeout termination
    • old/stale session recovery paths

Relevant areas include:

  • src/api/mission_runner.rs:2900
  • src/api/mission_runner.rs:3088
  • src/api/mission_runner.rs:3264
  • src/api/mission_runner.rs:3289

Problem

The current wrapper seems too willing to infer failure or completion from transport-level symptoms:

  • no parseable stream event yet
  • no output within idle timeout
  • PTY child exited before a final result was observed
  • resumed session produced unexpected/non-JSON output

Those conditions are reasonable signals, but Claude Code is inconsistent enough that they should not always be treated as hard terminal conditions. In practice this can surface as missions stopping on intermediate messages or appearing to fail even though the underlying task is still recoverable.

Proposed improvements

1. Separate transport health from semantic completion

Do not treat process exit, lack of parseable JSON, or temporary silence as equivalent to task completion unless a Claude-level terminal event has been observed.

Introduce an explicit completion state machine for Claude runs:

  • Starting
  • Streaming
  • WaitingForTool
  • ToolRunningNoOutput
  • NeedsResume
  • SemanticallyComplete
  • TransportFailed

The important change is that only Claude-semantic completion should mark the mission complete. Transport failures should move into a resumable/repairable state, not immediately flatten into generic failure.
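A minimal sketch of such a state machine, assuming the wrapper can reduce what it observes to a small set of normalized signals. The state names come from the list above; the `Signal` enum, its variants, and `next_state` are hypothetical and do not exist in the codebase.

```rust
/// Run states, named after the list in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunState {
    Starting,
    Streaming,
    WaitingForTool,
    ToolRunningNoOutput,
    NeedsResume,
    SemanticallyComplete,
    TransportFailed,
}

/// Hypothetical normalized inputs to the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    /// Any parseable non-final stream event (partial text, thinking).
    PartialEvent,
    ToolCallStarted,
    ToolResult,
    /// A Claude-level terminal event (the final result).
    FinalResult,
    /// Silence exceeded the budget for the current state.
    SilenceExceeded,
    /// The PTY child exited.
    ProcessExited,
    /// An automatic resume attempt failed.
    ResumeFailed,
}

/// Pure transition function: transport symptoms never produce
/// `SemanticallyComplete`; only `FinalResult` does.
pub fn next_state(state: RunState, signal: Signal) -> RunState {
    use RunState::*;
    use Signal::*;
    match (state, signal) {
        // Terminal states absorb every further signal.
        (SemanticallyComplete, _) | (TransportFailed, _) => state,
        (_, FinalResult) => SemanticallyComplete,
        // Only a failed resume turns a repairable state into a hard failure.
        (NeedsResume, ResumeFailed) => TransportFailed,
        (_, ResumeFailed) => state,
        (_, ProcessExited) => NeedsResume,
        // A silent tool gets one extra stage before recovery starts.
        (WaitingForTool, SilenceExceeded) => ToolRunningNoOutput,
        (_, SilenceExceeded) => NeedsResume,
        (_, ToolCallStarted) => WaitingForTool,
        (_, ToolResult) | (_, PartialEvent) => Streaming,
    }
}
```

Keeping the transition function pure means it can be tested without spawning a PTY, and a process exit during `Streaming` lands in `NeedsResume` rather than in a completed or failed mission.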

2. Preserve and surface raw PTY transcript for recovery

When parsing fails or the stream becomes ambiguous, keep the raw PTY transcript as a first-class artifact:

  • persist recent raw lines
  • expose them in mission diagnostics
  • use them during automatic recovery/classification

Right now some non-JSON output is collected, but the recovery model still looks narrow compared to the range of irregular Claude output seen in production.
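One possible shape for "persist recent raw lines" is a bounded ring buffer that the PTY reader feeds before any parsing happens. The `RawTranscript` type and its methods below are hypothetical, and the "starts with `{`" check is only a rough heuristic for spotting non-JSON lines, not a parser.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded buffer of the most recent raw PTY lines.
pub struct RawTranscript {
    lines: VecDeque<String>,
    capacity: usize,
}

impl RawTranscript {
    pub fn new(capacity: usize) -> Self {
        Self { lines: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record a raw line, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, line: &str) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_owned());
    }

    /// The most recent `n` lines, oldest first, for diagnostics output.
    pub fn tail(&self, n: usize) -> Vec<&str> {
        let skip = self.lines.len().saturating_sub(n);
        self.lines.iter().skip(skip).map(String::as_str).collect()
    }

    /// Number of retained lines that do not look like JSON objects.
    pub fn non_json_lines(&self) -> usize {
        self.lines
            .iter()
            .filter(|l| !l.trim_start().starts_with('{'))
            .count()
    }
}
```

Because the buffer stores lines before normalization, it stays useful even when the stream parser is the component that is failing.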

3. Improve resume strategy instead of hard-killing aggressively

Current startup/idle timeouts are understandable, but for Claude they should prefer a staged recovery sequence:

  1. detect lack of semantic progress
  2. inspect session marker/session files
  3. attempt resume/continue once automatically
  4. only then terminate or ask the user to intervene

This matters because Claude sessions often remain recoverable even after streaming breaks.
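The staged sequence above could be expressed as a small decision function that the runner consults instead of killing the process directly. Everything here is an illustrative assumption: `RecoveryContext`, `RecoveryAction`, and the `MAX_AUTO_RESUMES` constant are hypothetical names, and how "session looks resumable" is determined is left to the existing session marker logic.

```rust
/// Hypothetical next step in the staged recovery sequence.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryAction {
    KeepWaiting,
    InspectSession,
    AttemptResume,
    /// Terminate or ask the user to intervene.
    Escalate,
}

/// Hypothetical snapshot of what the runner knows about a stalled run.
pub struct RecoveryContext {
    pub semantic_progress_stalled: bool,
    pub session_inspected: bool,
    pub session_looks_resumable: bool,
    pub resume_attempts: u32,
}

/// The issue proposes attempting resume once automatically.
pub const MAX_AUTO_RESUMES: u32 = 1;

/// Walks the four stages in order; escalation is only reachable after
/// inspection and (where possible) one resume attempt.
pub fn next_recovery_action(ctx: &RecoveryContext) -> RecoveryAction {
    if !ctx.semantic_progress_stalled {
        return RecoveryAction::KeepWaiting;
    }
    if !ctx.session_inspected {
        return RecoveryAction::InspectSession;
    }
    if ctx.session_looks_resumable && ctx.resume_attempts < MAX_AUTO_RESUMES {
        return RecoveryAction::AttemptResume;
    }
    RecoveryAction::Escalate
}
```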

4. Make timeout policy state-aware

A single idle timeout is too coarse for long tasks. Timeout behavior should depend on what Claude was doing last:

  • if a tool just started, allow longer silent windows
  • if there was active thinking, allow different thresholds
  • if the process exited but tool children may still be draining output, preserve a longer structured grace period
  • if the run was in a resumed session with stale history, classify as NeedsResume instead of generic LLM failure
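A sketch of what a state-aware policy might look like: the silence budget is looked up from the last observed activity rather than taken from one global constant. The `LastActivity` and `Expiry` enums are hypothetical, and the durations are placeholders chosen only to show the relative ordering, not recommended values.

```rust
use std::time::Duration;

/// Hypothetical summary of what Claude was doing when output stopped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LastActivity {
    ToolStarted,
    Thinking,
    /// The process exited but tool children may still be draining output.
    ExitedWithToolsDraining,
    StreamingText,
}

/// How long silence is tolerated after the given activity.
/// All values are placeholders for illustration.
pub fn silence_budget(last: LastActivity) -> Duration {
    match last {
        LastActivity::ToolStarted => Duration::from_secs(600),
        LastActivity::Thinking => Duration::from_secs(180),
        LastActivity::ExitedWithToolsDraining => Duration::from_secs(30),
        LastActivity::StreamingText => Duration::from_secs(60),
    }
}

/// Hypothetical classification once a budget has expired.
#[derive(Debug, PartialEq, Eq)]
pub enum Expiry {
    NeedsResume,
    GenericTimeout,
}

/// A resumed session with stale history is routed to recovery
/// instead of being reported as a generic failure.
pub fn classify_expiry(resumed_with_stale_history: bool) -> Expiry {
    if resumed_with_stale_history {
        Expiry::NeedsResume
    } else {
        Expiry::GenericTimeout
    }
}
```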

5. Add explicit detection for intermediate-message false terminals

The user-visible failure here is "Claude stopped even though it should not actually stop; it only emitted an intermediate message."

That suggests a missing distinction between:

  • partial content / thinking / tool progress
  • final assistant result
  • process transport ending unexpectedly

The wrapper should record the last semantically meaningful Claude event and refuse to mark completion unless a final-result condition is met.
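The three-way distinction could be enforced by a small gate that tracks the last meaningful event and only reports completion after a final result. This is a sketch under assumptions: the event type strings (`"result"`, `"assistant"`, `"user"`, `"system"`, `"stream_event"`) are assumed top-level `type` values in the stream-json output and should be verified against the CLI version in use; `CompletionGate` and the other names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventClass {
    /// Partial content, thinking, or tool progress.
    Intermediate,
    /// The final assistant result.
    FinalResult,
    Unrecognized,
}

/// Maps an event `type` string to a class. The strings are assumptions.
pub fn classify(event_type: &str) -> EventClass {
    match event_type {
        "result" => EventClass::FinalResult,
        "assistant" | "user" | "system" | "stream_event" => EventClass::Intermediate,
        _ => EventClass::Unrecognized,
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    StillRunning,
    Complete,
    /// The transport ended without a final result being observed.
    EndedWithoutResult,
}

/// Hypothetical gate that refuses completion without a final result.
#[derive(Debug, Default)]
pub struct CompletionGate {
    last_meaningful: Option<EventClass>,
    saw_final: bool,
}

impl CompletionGate {
    pub fn observe(&mut self, event_type: &str) {
        let class = classify(event_type);
        if class != EventClass::Unrecognized {
            self.last_meaningful = Some(class);
        }
        if class == EventClass::FinalResult {
            self.saw_final = true;
        }
    }

    pub fn last_meaningful(&self) -> Option<EventClass> {
        self.last_meaningful
    }

    /// Process exit alone never yields `Complete`.
    pub fn outcome(&self, process_exited: bool) -> Outcome {
        match (self.saw_final, process_exited) {
            (true, _) => Outcome::Complete,
            (false, true) => Outcome::EndedWithoutResult,
            (false, false) => Outcome::StillRunning,
        }
    }
}
```

With this shape, an intermediate assistant message followed by a PTY exit produces `EndedWithoutResult`, which the runner can route to recovery instead of treating it as a finished mission.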

6. Strengthen observability around Claude missions

For each Claude mission, log and expose:

  • exact CLI args used
  • whether prompt was passed via argv or stdin
  • session id and whether resume was attempted
  • last parseable event type
  • last raw PTY line timestamp
  • timeout reason
  • whether process exit preceded semantic completion

This will make real-world failures much easier to classify.
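The fields listed above map naturally onto a single per-run record. The struct below is a hypothetical sketch, not an existing type in the codebase; the field names and the summary format are illustrative.

```rust
use std::time::SystemTime;

/// How the prompt reached the CLI.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromptChannel {
    Argv,
    Stdin,
}

/// Hypothetical per-mission diagnostics record for a Claude run.
#[derive(Debug, Clone)]
pub struct ClaudeRunDiagnostics {
    pub cli_args: Vec<String>,
    pub prompt_channel: PromptChannel,
    pub session_id: Option<String>,
    pub resume_attempted: bool,
    pub last_event_type: Option<String>,
    pub last_raw_line_at: Option<SystemTime>,
    pub timeout_reason: Option<String>,
    pub exit_preceded_completion: bool,
}

impl ClaudeRunDiagnostics {
    /// One-line summary suitable for a log entry.
    pub fn summary(&self) -> String {
        format!(
            "session={} prompt_via={:?} resume_attempted={} last_event={} timeout={} exit_before_completion={}",
            self.session_id.as_deref().unwrap_or("none"),
            self.prompt_channel,
            self.resume_attempted,
            self.last_event_type.as_deref().unwrap_or("none"),
            self.timeout_reason.as_deref().unwrap_or("none"),
            self.exit_preceded_completion,
        )
    }
}
```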

7. Add replayable fixtures/tests from bad Claude sessions

The wrapper logic is now complex enough that it needs regression fixtures for actual broken transcripts:

  • no JSON after init
  • partial/thinking events followed by silence
  • tool call followed by PTY exit
  • non-JSON interleaving
  • resumed stale session
  • intermediate assistant text without final result

Without fixture-driven tests, this area will keep regressing as CLI behavior changes.
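A fixture-driven test could replay stored transcript lines through the classification logic and assert on the verdict. The sketch below is self-contained for illustration only: the fixtures are hand-written stand-ins rather than captured sessions, `event_type` is a naive substring match standing in for real JSON parsing, and the `"result"` type string is an assumption about the stream-json output.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Complete,
    Resumable,
}

/// Naive stand-in for JSON parsing: returns the first `"type":"..."`
/// value on lines that look like JSON objects.
fn event_type(line: &str) -> Option<&str> {
    let line = line.trim();
    if !line.starts_with('{') {
        return None;
    }
    let rest = line.split_once("\"type\":\"")?.1;
    rest.split_once('"').map(|(t, _)| t)
}

/// Replays a transcript; only a final result yields `Complete`.
pub fn replay(lines: &[&str]) -> Verdict {
    let saw_final = lines
        .iter()
        .filter_map(|l| event_type(l))
        .any(|t| t == "result");
    if saw_final {
        Verdict::Complete
    } else {
        Verdict::Resumable
    }
}

// Hand-written illustrative fixtures, one per failure shape.
pub const NO_JSON_AFTER_INIT: &[&str] = &[
    r#"{"type":"system"}"#,
    "Agent is working...",
];
pub const INTERMEDIATE_TEXT_ONLY: &[&str] = &[
    r#"{"type":"system"}"#,
    r#"{"type":"assistant"}"#,
];
pub const NON_JSON_INTERLEAVING: &[&str] = &[
    r#"{"type":"assistant"}"#,
    "warning: something unexpected",
    r#"{"type":"result"}"#,
];
```

In a real suite the fixtures would more plausibly be files captured from failing missions, so a change in CLI behavior shows up as a failing replay rather than as a production regression.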

Suggested implementation areas

  • src/api/mission_runner.rs
  • src/backend/claudecode/client.rs
  • any mission event normalization layer that decides terminal vs resumable state

Acceptance criteria

  • Claude missions are not marked complete unless a semantic terminal event is observed.
  • Transport failures/timeouts can move the mission into a resumable state instead of immediate hard failure where possible.
  • The raw PTY transcript is available for diagnostics and recovery.
  • Timeout behavior is state-aware rather than a single coarse idle threshold.
  • There are regression tests/fixtures covering partial/intermediate Claude stream behavior.

Why this matters

sandboxed.sh is doing more than emdash, so some extra complexity is unavoidable. But that makes wrapper correctness much more important. If Claude reliability remains weak, the stronger orchestration model becomes less valuable because the core long-task experience still feels untrustworthy.
