Claude Code wrapper: distinguish semantic completion from transport failure and improve recovery #353

## Summary

Claude Code reliability in `sandboxed.sh` still appears noticeably worse than it should be for long-running missions. The codebase already contains a large amount of Claude-specific compensation logic, which is good evidence that the failure modes are real. Even so, the current wrapper is brittle enough that a mission can stop while Claude is only in an intermediate, non-final state.

The fix is not simply "use the CLI directly." The current implementation heavily wraps the Claude CLI with per-mission auth/home isolation, session continuation, PTY execution, event normalization, startup and idle timeouts, and process-exit heuristics. That adds resilience in some cases, but it also creates more edge cases where partial/intermediate Claude behavior can be mistaken for terminal completion or a fatal stall.

## Evidence in current code

- Claude client still uses the vendor CLI with `--print --output-format stream-json --include-partial-messages` in [src/backend/claudecode/client.rs](src/backend/claudecode/client.rs).
- Mission runner has Claude-specific session handling, PTY spawning, timeout logic, and non-JSON capture in [src/api/mission_runner.rs](src/api/mission_runner.rs).
- The runner explicitly documents known bad states:
  - stdin piping causing "Agent is working..." hangs
  - hangs when stdout is a normal pipe
  - startup timeout when no parseable stream events arrive
  - idle timeout termination
  - old/stale session recovery paths

Relevant areas include:

- `src/api/mission_runner.rs:2900`
- `src/api/mission_runner.rs:3088`
- `src/api/mission_runner.rs:3264`
- `src/api/mission_runner.rs:3289`

## Problem

The current wrapper seems too willing to infer failure or completion from transport-level symptoms:

- no parseable stream event yet
- no output within idle timeout
- PTY child exited before a final result was observed
- resumed session produced unexpected/non-JSON output

Those conditions are reasonable signals, but Claude Code is inconsistent enough that they should not always be treated as hard terminal conditions. In practice this can surface as missions stopping on intermediate messages or appearing to fail even though the underlying task is still recoverable.

## Proposed improvements

### 1. Separate transport health from semantic completion

Do not treat process exit, lack of parseable JSON, or temporary silence as equivalent to task completion unless a Claude-level terminal event has been observed.

Introduce an explicit completion state machine for Claude runs:

- `Starting`
- `Streaming`
- `WaitingForTool`
- `ToolRunningNoOutput`
- `NeedsResume`
- `SemanticallyComplete`
- `TransportFailed`

The important change is that only Claude-semantic completion should mark the mission complete. Transport failures should move into a resumable/repairable state, not immediately flatten into generic failure.
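A minimal sketch of that state machine, to make the intended transitions concrete. The state names come from the list above; the `Signal` enum, `next_state`, and every transition choice are hypothetical and are not names or behavior taken from `mission_runner.rs`.

```rust
/// Run states from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunState {
    Starting,
    Streaming,
    WaitingForTool,
    ToolRunningNoOutput,
    NeedsResume,
    SemanticallyComplete,
    TransportFailed,
}

/// Hypothetical inputs to the state machine (assumed names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    /// Any parseable, non-final stream event (partial text, thinking).
    StreamEvent,
    /// Claude started a tool call.
    ToolCallStarted,
    /// A tool result came back.
    ToolResult,
    /// No output within the silence budget for the current state.
    Silence,
    /// Claude-level terminal event (final result).
    FinalResult,
    /// The CLI process exited.
    ProcessExited,
    /// Automatic resume was tried and did not work.
    ResumeFailed,
}

/// Pure transition function: only `FinalResult` can reach `SemanticallyComplete`.
pub fn next_state(state: RunState, signal: Signal) -> RunState {
    use RunState::*;
    use Signal::*;
    match (state, signal) {
        // Terminal states absorb every later signal.
        (SemanticallyComplete, _) | (TransportFailed, _) => state,
        (_, FinalResult) => SemanticallyComplete,
        // Hard failure is only reached after recovery has been tried.
        (_, ResumeFailed) => TransportFailed,
        // Process exit without a final result is repairable, not terminal.
        (_, ProcessExited) => NeedsResume,
        // A silent tool gets one extra stage before recovery kicks in.
        (WaitingForTool, Silence) => ToolRunningNoOutput,
        (_, Silence) => NeedsResume,
        (_, ToolCallStarted) => WaitingForTool,
        (_, StreamEvent) | (_, ToolResult) => Streaming,
    }
}
```

Keeping the transition function pure (no I/O, no clocks) is a deliberate choice in this sketch: it lets the same logic be driven by recorded transcripts in tests.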

### 2. Preserve and surface raw PTY transcript for recovery

When parsing fails or the stream becomes ambiguous, keep the raw PTY transcript as a first-class artifact:

- persist recent raw lines
- expose them in mission diagnostics
- use them during automatic recovery/classification

Right now some non-JSON output is collected, but the recovery model still looks narrow compared to the range of Claude weirdness seen in production.
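One possible shape for "persist recent raw lines" is a bounded ring buffer that the PTY reader appends to regardless of whether a line parsed. `RawTranscript` and its methods are hypothetical names for illustration; the existing non-JSON capture in the runner may be structured differently.

```rust
use std::collections::VecDeque;
use std::time::SystemTime;

/// Hypothetical bounded buffer of the most recent raw PTY lines.
pub struct RawTranscript {
    capacity: usize,
    lines: VecDeque<(SystemTime, String)>,
    dropped: u64,
}

impl RawTranscript {
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1);
        Self { capacity, lines: VecDeque::with_capacity(capacity), dropped: 0 }
    }

    /// Record a raw line; the oldest line is evicted once the buffer is full.
    pub fn push(&mut self, line: &str) {
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
            self.dropped += 1;
        }
        self.lines.push_back((SystemTime::now(), line.to_string()));
    }

    /// The last `n` lines in arrival order, for diagnostics or recovery.
    pub fn tail(&self, n: usize) -> Vec<&str> {
        let skip = self.lines.len().saturating_sub(n);
        self.lines.iter().skip(skip).map(|(_, l)| l.as_str()).collect()
    }

    /// How many lines were evicted, so diagnostics can flag truncation.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }

    /// Timestamp of the most recent raw line, if any.
    pub fn last_line_at(&self) -> Option<SystemTime> {
        self.lines.back().map(|(t, _)| *t)
    }
}
```

A fixed capacity keeps memory bounded on very long missions, while the `dropped` counter makes it visible when the surfaced transcript is only a tail.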

### 3. Improve resume strategy instead of hard-killing aggressively

Current startup/idle timeouts are understandable, but for Claude they should prefer a staged recovery sequence:

1. detect lack of semantic progress
2. inspect session marker/session files
3. attempt resume/continue once automatically
4. only then terminate or ask the user to intervene

This matters because Claude sessions often remain recoverable even after streaming breaks.
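The staged sequence above can be sketched as a small decision function. Everything here is an assumption for illustration: `RecoveryInput`, `RecoveryAction`, `decide`, and the one-attempt limit are hypothetical, and how the runner actually detects session markers is not modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    /// Semantic progress is still being made; do nothing.
    KeepWaiting,
    /// Try resume/continue automatically.
    AttemptResume,
    /// Terminate or ask the user to intervene.
    Escalate,
}

/// Hypothetical snapshot of what the runner knows when a timeout fires.
pub struct RecoveryInput {
    /// Step 1: no semantic progress within the current budget.
    pub semantic_progress_stalled: bool,
    /// Step 2: a session marker or session files were found.
    pub session_found: bool,
    /// How many automatic resumes were already tried for this run.
    pub resume_attempts: u32,
}

/// Assumed limit: the proposal asks for one automatic attempt.
pub const MAX_AUTO_RESUMES: u32 = 1;

pub fn decide(input: &RecoveryInput) -> RecoveryAction {
    if !input.semantic_progress_stalled {
        return RecoveryAction::KeepWaiting;
    }
    // Step 3: resume only if there is something to resume and budget is left.
    if input.session_found && input.resume_attempts < MAX_AUTO_RESUMES {
        return RecoveryAction::AttemptResume;
    }
    // Step 4: only now terminate or hand off to the user.
    RecoveryAction::Escalate
}
```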

### 4. Make timeout policy state-aware

A single idle timeout is too coarse for long tasks. Timeout behavior should depend on what Claude was doing last:

- if a tool just started, allow longer silent windows
- if there was active thinking, allow different thresholds
- if the process exited but tool children may still be draining output, preserve a longer structured grace period
- if the run was in a resumed session with stale history, classify as `NeedsResume` instead of generic LLM failure
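A rough sketch of what a state-aware policy could look like. The `LastActivity` variants, `silence_budget`, and `classify_expiry` are hypothetical names, and the durations are placeholders chosen only to show relative ordering; they are not measured or recommended values.

```rust
use std::time::Duration;

/// Hypothetical summary of what Claude was doing when output stopped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LastActivity {
    NothingYet,
    TextStreaming,
    Thinking,
    ToolStarted,
    /// The CLI exited but tool children may still be draining output.
    ProcessExitedDraining,
}

/// Placeholder budgets: only the ordering matters in this sketch.
pub fn silence_budget(last: LastActivity) -> Duration {
    match last {
        LastActivity::NothingYet => Duration::from_secs(120),
        LastActivity::TextStreaming => Duration::from_secs(90),
        LastActivity::Thinking => Duration::from_secs(300),
        LastActivity::ToolStarted => Duration::from_secs(1800),
        LastActivity::ProcessExitedDraining => Duration::from_secs(30),
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeoutOutcome {
    /// Recoverable: hand off to the resume path.
    NeedsResume,
    /// No special context; treat as a stall.
    Stalled,
}

/// Returns `None` while the silence is still within budget.
pub fn classify_expiry(
    silent_for: Duration,
    last: LastActivity,
    resumed_stale_session: bool,
) -> Option<TimeoutOutcome> {
    if silent_for < silence_budget(last) {
        return None;
    }
    Some(if resumed_stale_session {
        TimeoutOutcome::NeedsResume
    } else {
        TimeoutOutcome::Stalled
    })
}
```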

### 5. Add explicit detection for intermediate-message false terminals

The user-visible failure here is "Claude stopped even though it should not actually stop; it only emitted an intermediate message."

That suggests a missing distinction between:

- partial content / thinking / tool progress
- final assistant result
- process transport ending unexpectedly

The wrapper should record the last semantically meaningful Claude event and refuse to mark completion unless a final-result condition is met.
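As an illustration of that guard, the sketch below tracks the last semantic event and only reports completion once a final result has been seen. `SemanticEvent`, `CompletionGuard`, and `Verdict` are hypothetical names; mapping real stream-json events onto `SemanticEvent` (for example, recognizing the final result event by its type) is assumed to happen elsewhere and should be checked against actual CLI output.

```rust
/// Hypothetical classification of parsed Claude events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SemanticEvent {
    PartialContent,
    Thinking,
    ToolProgress,
    FinalResult,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    InProgress,
    Complete,
    /// Transport ended before any final result: candidate for recovery.
    EndedWithoutResult,
}

#[derive(Debug, Default)]
pub struct CompletionGuard {
    last_event: Option<SemanticEvent>,
    final_seen: bool,
    process_exited: bool,
}

impl CompletionGuard {
    pub fn observe(&mut self, event: SemanticEvent) {
        self.last_event = Some(event);
        if event == SemanticEvent::FinalResult {
            // Sticky: later noise cannot undo a real completion.
            self.final_seen = true;
        }
    }

    pub fn mark_process_exited(&mut self) {
        self.process_exited = true;
    }

    /// Last semantically meaningful event, for diagnostics.
    pub fn last_event(&self) -> Option<SemanticEvent> {
        self.last_event
    }

    pub fn verdict(&self) -> Verdict {
        if self.final_seen {
            Verdict::Complete
        } else if self.process_exited {
            Verdict::EndedWithoutResult
        } else {
            Verdict::InProgress
        }
    }
}
```

With this shape, an intermediate assistant message followed by process exit yields `EndedWithoutResult` rather than `Complete`, which is the distinction the user-visible failure is missing.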

### 6. Strengthen observability around Claude missions

For each Claude mission, log and expose:

- exact CLI args used
- whether prompt was passed via argv or stdin
- session id and whether resume was attempted
- last parseable event type
- last raw PTY line timestamp
- timeout reason
- whether process exit preceded semantic completion

This will make real-world failures much easier to classify.
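One way to carry those fields is a single per-run record that is both logged and attached to mission diagnostics. The struct below is a hypothetical sketch: `ClaudeRunDiagnostics`, its field names, and the `summary` format are illustrative and do not exist in the codebase as far as this issue shows.

```rust
use std::time::SystemTime;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromptChannel {
    Argv,
    Stdin,
}

/// Hypothetical per-run diagnostics record covering the list above.
#[derive(Debug, Clone)]
pub struct ClaudeRunDiagnostics {
    pub cli_args: Vec<String>,
    pub prompt_channel: PromptChannel,
    pub session_id: Option<String>,
    pub resume_attempted: bool,
    pub last_event_type: Option<String>,
    pub last_raw_line_at: Option<SystemTime>,
    pub timeout_reason: Option<String>,
    pub process_exited: bool,
    pub semantically_complete: bool,
}

impl ClaudeRunDiagnostics {
    /// True when the process ended before any semantic completion.
    pub fn exit_preceded_completion(&self) -> bool {
        self.process_exited && !self.semantically_complete
    }

    /// Compact single-line form for logs; "-" marks a missing value.
    pub fn summary(&self) -> String {
        format!(
            "session={} resume_attempted={} last_event={} timeout={} exit_before_completion={}",
            self.session_id.as_deref().unwrap_or("-"),
            self.resume_attempted,
            self.last_event_type.as_deref().unwrap_or("-"),
            self.timeout_reason.as_deref().unwrap_or("-"),
            self.exit_preceded_completion(),
        )
    }
}
```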

### 7. Add replayable fixtures/tests from bad Claude sessions

The wrapper logic is now complex enough that it needs regression fixtures for actual broken transcripts:

- no JSON after init
- partial/thinking events followed by silence
- tool call followed by PTY exit
- non-JSON interleaving
- resumed stale session
- intermediate assistant text without final result

Without fixture-driven tests, this area will keep regressing as CLI behavior changes.
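A fixture harness for these cases could be table-driven, as in the sketch below. The transcripts here are short synthetic placeholders, not captured sessions, and `is_final_result` is a deliberately crude string check standing in for real JSON parsing; `Fixture`, `replay`, and `failing_fixtures` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Complete,
    EndedWithoutResult,
}

pub struct Fixture {
    pub name: &'static str,
    pub transcript: &'static str,
    pub expected: Outcome,
}

/// Crude stand-in for real parsing: a JSON-looking line tagged as a result.
fn is_final_result(line: &str) -> bool {
    let l = line.trim();
    l.starts_with('{') && l.contains("\"type\":\"result\"")
}

/// Replay a recorded transcript and report how the run should be classified.
pub fn replay(transcript: &str) -> Outcome {
    if transcript.lines().any(is_final_result) {
        Outcome::Complete
    } else {
        Outcome::EndedWithoutResult
    }
}

/// Synthetic placeholder fixtures; real ones would be captured transcripts.
pub fn fixtures() -> Vec<Fixture> {
    vec![
        Fixture {
            name: "no_json_after_init",
            transcript: r#"{"type":"system","subtype":"init"}
Agent is working...
"#,
            expected: Outcome::EndedWithoutResult,
        },
        Fixture {
            name: "intermediate_text_without_final_result",
            transcript: r#"{"type":"system","subtype":"init"}
{"type":"assistant","message":"partial"}
"#,
            expected: Outcome::EndedWithoutResult,
        },
        Fixture {
            name: "non_json_interleaving_then_result",
            transcript: r#"{"type":"system","subtype":"init"}
some raw terminal noise
{"type":"result","subtype":"success"}
"#,
            expected: Outcome::Complete,
        },
    ]
}

/// Names of fixtures whose replay disagrees with the expected outcome.
pub fn failing_fixtures() -> Vec<&'static str> {
    fixtures()
        .into_iter()
        .filter(|f| replay(f.transcript) != f.expected)
        .map(|f| f.name)
        .collect()
}
```

In a real test suite the transcripts would more likely live in files loaded with `include_str!`, so that newly captured bad sessions can be added without touching the harness.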

## Suggested implementation areas

- `src/api/mission_runner.rs`
- `src/backend/claudecode/client.rs`
- any mission event normalization layer that decides terminal vs resumable state

## Acceptance criteria

- Claude missions are not marked complete unless a semantic terminal event is observed.
- Transport failures/timeouts can move the mission into a resumable state instead of immediate hard failure where possible.
- The raw PTY transcript is available for diagnostics and recovery.
- Timeout behavior is state-aware rather than a single coarse idle threshold.
- There are regression tests/fixtures covering partial/intermediate Claude stream behavior.

## Why this matters

`sandboxed.sh` is doing more than `emdash`, so some extra complexity is unavoidable. But that makes wrapper correctness much more important. If Claude reliability remains weak, the stronger orchestration model becomes less valuable because the core long-task experience still feels untrustworthy.

