Codex task failure can lose previous thread context on retry #565

## Summary

When a remote Codex task fails before or around model execution, continuing from the HAPI UI can lose the previous conversation/task context. The next user message is treated as a fresh request, and the agent no longer has the earlier instructions or context.

Failure messages observed in this situation:

- `Task failed: Codex thread entered systemError`
- `Task failed: Selected model is at capacity. Please try a different model.`

Example observed flow:

1. User asks: `Now give me a complete table of the data on these methods`
2. Codex task fails with capacity/systemError.
3. User retries/sends again.
4. Agent answers as if it has no previous context: `Sure, but I don't yet know which methods "these methods" refers to.`

## Expected behavior

After a task fails due to a transient model/provider error, HAPI should preserve enough session/thread state so a retry or follow-up continues the same Codex thread/context, or at least offers an explicit “resume/retry with previous context” path.
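The expected behavior could be sketched roughly as below. This is only an illustration: `FailedTask`, `RetryPlan`, `isTransientFailure`, and `planRetry` are hypothetical names, not existing HAPI code, and the substring matching is an assumption based on the two failure messages quoted in this issue.

```typescript
// Hypothetical types for illustration; none of these names exist in HAPI.
type FailedTask = {
  errorMessage: string;
  // Provider thread/resume id, if one was recorded before the failure.
  codexThreadId: string | null;
};

type RetryPlan =
  | { kind: "resume"; resumeSessionId: string; offerAutoRetry: boolean }
  | { kind: "fresh"; reason: string };

// Substrings taken from the two failure messages quoted in this issue.
const TRANSIENT_MARKERS = ["entered systemError", "model is at capacity"];

function isTransientFailure(message: string): boolean {
  return TRANSIENT_MARKERS.some((marker) => message.includes(marker));
}

// Prefer resuming the same thread whenever its id is still known.
// Only fall back to a fresh session when there is nothing to resume.
function planRetry(task: FailedTask): RetryPlan {
  if (task.codexThreadId === null) {
    return { kind: "fresh", reason: "no thread id was recorded before the failure" };
  }
  return {
    kind: "resume",
    resumeSessionId: task.codexThreadId,
    // Transient provider errors are good candidates for a one-click retry.
    offerAutoRetry: isTransientFailure(task.errorMessage),
  };
}
```

The point of the sketch is that a failure alone never discards the thread id; a fresh session is the fallback only when no id exists.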

## Actual behavior

The retry/follow-up appears to lose the prior task context and behaves like a new empty interaction.

## Relevant code/data points

I traced the current runner/session path and noticed most of the active session tracking is in runner memory only:

| Area | File | Data | Persistence | Risk |
|---|---|---|---|---|
| runner local state | `cli/src/persistence.ts` | `pid`, `httpPort`, version, `startedWithApiUrl`, `startedWithMachineId`, token hash, heartbeat | `~/.hapi/runner.state.json` | only runner process metadata, not full session/thread state |
| active sessions | `cli/src/runner/run.ts` | `pidToTrackedSession` | memory only | lost on runner restart/crash |
| spawn awaiters | `cli/src/runner/run.ts` | `pidToAwaiter`, `pidToErrorAwaiter` | memory only | lost on failure/restart |
| session webhook | `cli/src/runner/run.ts` | `happySessionId`, metadata, PID | in-memory map update only | mapping is lost unless it is also persisted |
| resume path | `cli/src/runner/run.ts` | `resumeSessionId` passed to agent command | request only | depends on caller preserving/providing resume id |
| Codex command | `cli/src/commands/codex.ts` | parses `hapi codex resume <id>` | none | ok as CLI path, but HAPI retry may not re-use it correctly |
| Codex session | `cli/src/codex/loop.ts`, `cli/src/codex/session.ts` | `resumeSessionId ?? null` as `sessionId` | none | session can resume only if the id is still known |

The runner README also documents that `runner.state.json` only stores runner process state, not session mapping.

## Possible fix direction

A robust fix may be one or more of:

1. Persist session/thread mapping (`happySessionId`, provider-specific session/thread id such as Codex resume id, PID, cwd, agent, model settings) beyond in-memory `pidToTrackedSession`.
2. On task failure (`systemError`, capacity, provider transient errors), keep the same Codex thread/session id and expose retry against that same id.
3. Make the HAPI UI/backend send `resumeSessionId` when retrying a failed Codex task if a prior session/thread id exists.
4. Add diagnostics/logging around failed Codex tasks to show whether a resume id was available and whether it was used.
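As a minimal sketch of items 1 and 2, the session mapping could be written to a small JSON file so the resume id survives both a task failure and a runner restart. Everything here is an assumption for illustration: `SessionRecord`, `SessionStore`, the field name `codexThreadId`, and the file location are hypothetical and do not reflect the current HAPI code.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical record shape, loosely following the fields listed in item 1.
interface SessionRecord {
  happySessionId: string;
  codexThreadId: string | null; // provider resume id, if known
  pid: number;
  cwd: string;
  agent: string;
  lastError?: string;
}

class SessionStore {
  private readonly file: string;

  constructor(file: string) {
    this.file = file;
  }

  load(): Record<string, SessionRecord> {
    try {
      return JSON.parse(fs.readFileSync(this.file, "utf8"));
    } catch {
      return {}; // missing or unreadable file: start with an empty mapping
    }
  }

  upsert(record: SessionRecord): void {
    const all = this.load();
    all[record.happySessionId] = record;
    // Write to a temp file, then rename, so a crash mid-write
    // does not leave a truncated mapping behind.
    const tmp = `${this.file}.tmp`;
    fs.writeFileSync(tmp, JSON.stringify(all, null, 2));
    fs.renameSync(tmp, this.file);
  }

  // Record the failure but keep the thread id, so a retry can resume it.
  markFailed(happySessionId: string, error: string): void {
    const record = this.load()[happySessionId];
    if (!record) return;
    this.upsert({ ...record, lastError: error });
  }

  resumeIdFor(happySessionId: string): string | null {
    return this.load()[happySessionId]?.codexThreadId ?? null;
  }
}
```

A retry handler could then call `resumeIdFor(happySessionId)` and pass the result as `resumeSessionId`; logging whether that value was `null` would also cover item 4.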

## Environment

- HAPI CLI version observed: `0.17.1`
- Agent: Codex
- Start path observed in logs: `hapi codex resume <resumeSessionId> --hapi-starting-mode remote --started-by runner ...`
- Runner keeps process metadata in `~/.hapi/runner.state.json`; active session maps are in-memory.

