Codex task failure can lose previous thread context on retry #565

@Shujakuinkuraudo

Summary

When a remote Codex task fails before/around model execution, e.g.

  • Task failed: Codex thread entered systemError
  • Task failed: Selected model is at capacity. Please try a different model.

continuing from the HAPI UI can lose the previous conversation/task context. The next user message is treated like a fresh request and the agent no longer has the earlier instruction/context.

Example observed flow:

  1. User asks: "Now give me a complete table with the data for these methods."
  2. Codex task fails with capacity/systemError.
  3. User retries/sends again.
  4. Agent answers as if it has no previous context: "Sure, but I don't yet know which methods 'these methods' refers to."

Expected behavior

After a task fails due to a transient model/provider error, HAPI should preserve enough session/thread state so a retry or follow-up continues the same Codex thread/context, or at least offers an explicit “resume/retry with previous context” path.
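The expected behavior could be sketched roughly as follows. This is a minimal illustration, not HAPI's actual code: `FailedTask`, `SpawnRequest`, `isTransientFailure`, and `buildRetryRequest` are hypothetical names, and the transient-error check simply matches substrings of the two failure messages quoted in the summary.

```typescript
// Hypothetical shapes for illustration; none of these names come from the HAPI codebase.
type FailedTask = {
  happySessionId: string;
  codexThreadId: string | null; // provider thread id, if one was ever reported
  errorMessage: string;
};

type SpawnRequest = {
  agent: "codex";
  prompt: string;
  resumeSessionId?: string;
};

// Substrings taken from the two failure messages quoted in this issue.
const TRANSIENT_MARKERS = ["systemError", "at capacity"];

function isTransientFailure(message: string): boolean {
  return TRANSIENT_MARKERS.some((marker) => message.includes(marker));
}

// Build the follow-up request. The thread id is attached only when the
// previous failure looks transient and an id is actually known.
function buildRetryRequest(failed: FailedTask, prompt: string): SpawnRequest {
  const request: SpawnRequest = { agent: "codex", prompt };
  if (isTransientFailure(failed.errorMessage) && failed.codexThreadId !== null) {
    request.resumeSessionId = failed.codexThreadId;
  }
  return request;
}
```

Whether non-transient failures should also resume the same thread is a separate design question; the sketch only covers the transient case described here.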

Actual behavior

The retry/follow-up appears to lose the prior task context and behaves like a new empty interaction.

Relevant code/data points

I traced the current runner/session path and noticed most of the active session tracking is in runner memory only:

| Area | File | Data | Persistence | Risk |
| --- | --- | --- | --- | --- |
| runner local state | cli/src/persistence.ts | pid, httpPort, version, startedWithApiUrl, startedWithMachineId, token hash, heartbeat | ~/.hapi/runner.state.json | only runner process metadata, not full session/thread state |
| active sessions | cli/src/runner/run.ts | pidToTrackedSession | memory only | lost on runner restart/crash |
| spawn awaiters | cli/src/runner/run.ts | pidToAwaiter, pidToErrorAwaiter | memory only | lost on failure/restart |
| session webhook | cli/src/runner/run.ts | happySessionId, metadata, PID | memory map update | not enough if not persisted |
| resume path | cli/src/runner/run.ts | resumeSessionId passed to agent command | request only | depends on caller preserving/providing resume id |
| Codex command | cli/src/commands/codex.ts | parses hapi codex resume <id> | no | OK as a CLI path, but HAPI retry may not reuse it correctly |
| Codex session | cli/src/codex/loop.ts, cli/src/codex/session.ts | resumeSessionId ?? null as sessionId | no | session can resume only if the id is still known |

The runner README also documents that runner.state.json only stores runner process state, not session mapping.

Possible fix direction

A robust fix may be one or more of:

  1. Persist session/thread mapping (happySessionId, provider-specific session/thread id such as Codex resume id, PID, cwd, agent, model settings) beyond in-memory pidToTrackedSession.
  2. On task failure (systemError, capacity, provider transient errors), keep the same Codex thread/session id and expose retry against that same id.
  3. Make the HAPI UI/backend send resumeSessionId when retrying a failed Codex task if a prior session/thread id exists.
  4. Add diagnostics/logging around failed Codex tasks to show whether a resume id was available and whether it was used.
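As a rough sketch of direction 1, the mapping could be written to a small JSON file next to the runner state so that it survives a runner restart. Everything here is an assumption for illustration: `SessionRecord`, `SessionStore`, and the `sessions.json` file name are hypothetical and do not exist in the HAPI source, and the demo writes to a temporary directory rather than `~/.hapi`.

```typescript
import { existsSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical record; field names follow the list above, not the HAPI source.
type SessionRecord = {
  happySessionId: string;
  resumeSessionId: string | null; // e.g. the Codex resume id
  pid: number;
  cwd: string;
  agent: string;
};

class SessionStore {
  private readonly filePath: string;

  constructor(filePath: string) {
    this.filePath = filePath;
  }

  // Read the whole mapping; a missing file means no sessions yet.
  load(): Record<string, SessionRecord> {
    if (!existsSync(this.filePath)) return {};
    return JSON.parse(readFileSync(this.filePath, "utf8")) as Record<string, SessionRecord>;
  }

  // Write to a temp file, then rename, so a crash mid-write
  // does not leave a truncated mapping behind.
  save(record: SessionRecord): void {
    const all = this.load();
    all[record.happySessionId] = record;
    const tmp = `${this.filePath}.tmp`;
    writeFileSync(tmp, JSON.stringify(all, null, 2));
    renameSync(tmp, this.filePath);
  }

  resumeIdFor(happySessionId: string): string | null {
    return this.load()[happySessionId]?.resumeSessionId ?? null;
  }
}

// Demo: a second store instance stands in for a restarted runner.
const dir = mkdtempSync(join(tmpdir(), "hapi-sketch-"));
const storePath = join(dir, "sessions.json");
new SessionStore(storePath).save({
  happySessionId: "happy-1",
  resumeSessionId: "thread-abc",
  pid: 1234,
  cwd: "/tmp/project",
  agent: "codex",
});
const reloaded = new SessionStore(storePath);
console.log(reloaded.resumeIdFor("happy-1")); // prints "thread-abc"
```

A retry path (direction 3) could then look up `resumeIdFor(happySessionId)` and log whether an id was found and used, which would also cover the diagnostics in direction 4.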

Environment

  • HAPI CLI version observed: 0.17.1
  • Agent: Codex
  • Start path observed in logs: hapi codex resume <resumeSessionId> --hapi-starting-mode remote --started-by runner ...
  • Runner keeps process metadata in ~/.hapi/runner.state.json; active session maps are in-memory.
