Add OpenAI Codex runtime support #154

Open
louismorgner wants to merge 6 commits into main from louismorgner/add-openai-codex


@louismorgner
Owner

Summary

  • add a new codex runtime alongside claude-code and toc-native
  • wire Codex session launch, detached execution, resume, log parsing, and token accounting into toc's runtime plumbing
  • expose Codex in agent creation and update docs/tests for the new runtime

Testing

  • go test ./...

Comment thread internal/runtime/codex.go
Owner Author

The detached-resume path is building codex exec resume -m ... -o ... --dangerously-bypass-approvals-and-sandbox ..., but on the Codex CLI installed here (codex-cli 0.46.0) those flags are rejected once they appear after resume (unexpected argument -m, -o, and --dangerously-bypass-approvals-and-sandbox). That means toc can spawn a Codex sub-agent, but ResumeSubSession cannot actually restart it. We either need a different invocation shape or an explicit minimum-supported Codex CLI version check before claiming resume support.

Owner Author

Fixed in 22a283e: flags are now placed before resume in BuildCodexDetachedScript and buildCodexExecArgs, matching the Codex CLI 0.46.0 invocation shape.
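For reference, the corrected ordering could look roughly like the sketch below. The function name `buildExecArgs`, the model string, and passing the session ID as a positional argument after `resume` are assumptions for illustration; only the three flags and the "flags before `resume`" rule come from this thread.

```go
package main

import "fmt"

// buildExecArgs is a hypothetical stand-in for buildCodexExecArgs.
// All flags are emitted before the `resume` subcommand. Passing resumeID as a
// positional argument after `resume` is an assumption made for this sketch.
func buildExecArgs(model, outputPath, resumeID string) []string {
	args := []string{"exec"}
	if model != "" {
		args = append(args, "-m", model)
	}
	if outputPath != "" {
		args = append(args, "-o", outputPath)
	}
	args = append(args, "--dangerously-bypass-approvals-and-sandbox")
	if resumeID != "" {
		// The subcommand comes last so the flags above are parsed by `exec`.
		args = append(args, "resume", resumeID)
	}
	return args
}

func main() {
	// Placeholder values; none of these are real session data.
	fmt.Println(buildExecArgs("gpt-5-codex", "/tmp/last-message.txt", "session-id"))
}
```

A unit test that pins this ordering is cheap and would catch a regression without needing the codex binary.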

Comment thread cmd/agent_create.go
Owner Author

The interactive flow still hardcodes runtimeinfo.DefaultRuntime in the final config/audit path, so selecting codex or toc-native in the new runtime picker still writes out a claude-code agent config. The non-interactive --runtime path is wired correctly, but the main UX path silently creates the wrong runtime. This needs to use the selected runtimeName end-to-end, plus a regression test for interactive config creation.

Owner Author

Fixed in 22a283e: the interactive flow now uses runtimeName (the form-selected value) throughout — including in AgentConfig.Runtime and the audit log — instead of the hardcoded runtimeinfo.DefaultRuntime.

Comment thread internal/runtime/codex.go
Owner Author

The rollout parser only recognizes the newer shell_command / exec_command schema and the textual Exit code: / Wall time: / Output: format. Local Codex rollout logs on this machine still emit response_item calls named shell with JSON-string outputs, so replay/watch will drop Bash steps entirely on the currently installed CLI. The tests only cover the newer schema, which is why this slips through. If the goal is broad Codex support, this parser needs a compatibility path; otherwise the docs should declare and enforce a minimum CLI version.

Owner Author

Fixed in 22a283e: codexRolloutCallOutputToStep now includes "shell" alongside "shell_command"/"exec_command". parseCodexCommandOutput tries the structured text format first, then falls back to JSON-unmarshal (for older CLI JSON-string outputs), then falls back to treating the whole string as output.

Fix resume args, interactive runtime, and shell parser compat

- Move flags before `resume` subcommand in codex exec/interactive
  args so they're accepted by codex-cli 0.46.0
- Use selected `runtimeName` in interactive agent create flow instead
  of hardcoded DefaultRuntime
- Add `shell` as compat alias in rollout parser for older Codex CLI
  versions that emit `response_item` with `shell` function calls
- Handle JSON-string output format from older `shell` calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner Author

louismorgner left a comment

Code review feedback

Shell injection surface in BuildCodexDetachedScript

codex.go:655-691 - The detached script is built with fmt.Sprintf and %q quoting, and the prompt is redirected via < %q from a file whose path includes the session directory. The command variable (lines 667-673) embeds opts.Model with %q. That yields a valid Go string literal, but the result is interpolated into a shell script, where double quotes do not stop $(...), backtick, or $VAR expansion. If opts.Model ever contains such metacharacters, this could be exploitable. Consider using shellescape or writing args to a file and reading them back, rather than interpolating into shell scripts.
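To make the difference concrete, here is a minimal sketch of POSIX single-quote escaping next to `%q`. The helper name `shQuote` and the sample path are illustrative assumptions, not the code under review.

```go
package main

import (
	"fmt"
	"strings"
)

// shQuote wraps s in single quotes for a POSIX shell. Each embedded single
// quote is rewritten as '\'' (close the quote, emit an escaped quote, reopen).
// Hypothetical helper for illustration.
func shQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	path := "/tmp/x$(touch /tmp/pwned)"
	// %q emits double quotes, inside which sh still performs $(...) expansion.
	fmt.Printf("with %%q:      cat < %q\n", path)
	// Single quotes make the shell treat every character literally.
	fmt.Printf("with shQuote: cat < %s\n", shQuote(path))
}
```

Single-quote escaping has only one special case (the quote itself), which makes it easier to audit than trying to escape every character that is live inside double quotes.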

codexProvider.ValidateModel accepts any non-empty string

runtimeinfo.go:85-88 — Unlike claude-code which validates against a known set (default, sonnet, opus, haiku), the codex runtime accepts literally any string as a model. This is probably intentional for forward-compat with new Codex models, but it means typos like gpt-5-codx will silently proceed. Consider at least warning if the model isn't in the known set.

findCodexLogForWorkspace walks the entire ~/.codex/sessions tree

codex.go:823-871 — This does a full filepath.WalkDir across all Codex session logs on disk to find one matching the workspace path. For users with many Codex sessions, this could be slow. The createdAt time filter helps skip old files, but only by mod time, not by directory structure. Consider using the date-based directory structure (sessions/YYYY/MM/DD/) to narrow the walk.
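One way to narrow the walk is to enumerate only the day directories that can contain a matching log. The sketch below assumes the sessions/YYYY/MM/DD layout mentioned above and that directory dates follow the time zone of `since`; `sessionDayDirs` and `utcDate` are hypothetical names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// utcDate builds a midnight-UTC timestamp; a small helper for this sketch.
func utcDate(year int, month time.Month, day int) time.Time {
	return time.Date(year, month, day, 0, 0, 0, 0, time.UTC)
}

// sessionDayDirs lists root/YYYY/MM/DD for every calendar day from since
// through until, inclusive. Callers would scan only these directories
// instead of walking the whole tree.
func sessionDayDirs(root string, since, until time.Time) []string {
	var dirs []string
	day := time.Date(since.Year(), since.Month(), since.Day(), 0, 0, 0, 0, since.Location())
	for !day.After(until) {
		dirs = append(dirs, filepath.Join(root, day.Format("2006"), day.Format("01"), day.Format("02")))
		day = day.AddDate(0, 0, 1)
	}
	return dirs
}

func main() {
	for _, dir := range sessionDayDirs("sessions", utcDate(2024, 1, 30), utcDate(2024, 2, 1)) {
		fmt.Println(dir)
	}
}
```

For a session created minutes ago this reduces the scan to one or two directories, regardless of how much history is on disk.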

Token accounting fallback logic is fragile

usage.go:798-803 - parseCodexJSONL accumulates turn.completed tokens, then falls back to the latest event_msg token_count if the turn-based total is zero. This means if the log has any turn.completed events with tokens, rollout-style token_count events are ignored entirely, even if they represent a different/later session phase. This could undercount if a session mixes formats (e.g., resumed session that switched CLI versions).

readCodexSessionMeta has a dead code path

codex.go:919-927 — When line.Type == "thread.started", it re-unmarshals the same scanner.Bytes() into execLine and returns (threadID, "", time.Time{}). But the function already parsed line.Payload.ID which would be empty for thread.started events. The re-parse is redundant and the early return means the cwd from a thread.started payload is never extracted. The branch is confusing and should be cleaned up.

No integration or smoke test for actual Codex CLI interaction

The tests are all unit-level with synthetic JSONL. There's no test that validates the actual codex CLI arg shapes work. Given that the first round of review caught exactly this kind of issue (wrong arg ordering for codex-cli 0.46.0), a simple smoke test that at least validates the constructed args are accepted would prevent regressions.

agent.md is deleted in PrepareSession

codex.go:421 - PrepareSession removes agent.md after writing AGENTS.md. The Claude Code provider doesn't delete agent.md; it writes CLAUDE.md alongside it. Consider keeping both for debuggability, so users inspecting the session workspace can still see the original instruction file.

Co-Authored-By: Codex <noreply@openai.com>
Owner Author

louismorgner left a comment

Code Review Feedback

Good work — the PR follows the existing runtime provider pattern cleanly and has solid test coverage for log parsing. A few things to address before merging:

Robustness

  • apply_patch success heuristic is too loose (internal/runtime/codex.go ~L288): !strings.Contains(strings.ToLower(output), "error") will misclassify output that legitimately contains the word "error" (e.g. a comment about fixing an error, or a filename like error_handler.go). Check for a more specific pattern like a known Codex error prefix instead.

  • Dual log format dispatch is fragile (internal/runtime/codex.go ~L556): parseCodexSessionLine routes exec events (item.completed) vs rollout events (response_item) purely by type string. If a future Codex version adds a type that collides between formats, the wrong parser runs. Consider checking for a format marker (e.g. presence of payload key) rather than relying on type strings being globally unique.

  • maxTokenUsage could produce misleading hybrid totals (internal/usage/usage.go ~L936): Taking the per-field max independently across different measurement sources means the returned TokenUsage could mix values from rollout totals and summed turn.completed events. If this is intentional, add a comment explaining the rationale.
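As a rough illustration of the format-marker idea from the dispatch bullet above, the sketch below classifies a line by the presence of a top-level `payload` key before looking at `type`. It assumes exec-format events never carry a top-level `payload`; the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type logFormat int

const (
	formatUnknown logFormat = iota
	formatExec              // exec-style events such as item.completed
	formatRollout           // rollout-style lines such as response_item
)

// detectFormat decodes only the top-level keys and uses the presence of
// "payload" as the marker for rollout lines (an assumption of this sketch).
func detectFormat(line []byte) logFormat {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(line, &fields); err != nil {
		return formatUnknown
	}
	if _, ok := fields["payload"]; ok {
		return formatRollout
	}
	if _, ok := fields["type"]; ok {
		return formatExec
	}
	return formatUnknown
}

func main() {
	fmt.Println(detectFormat([]byte(`{"type":"response_item","payload":{}}`)) == formatRollout)
	fmt.Println(detectFormat([]byte(`{"type":"item.completed","item":{}}`)) == formatExec)
}
```

With the format decided up front, each parser only has to reason about type strings from its own schema.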

Maintainability

  • Shell script template is hard to audit (internal/runtime/codex.go ~L674): BuildCodexDetachedScript has 18+ positional shQuote args in a single Sprintf. Easy to misorder. Consider text/template or strings.Builder with named variables, or at minimum inline comments mapping each %s to its meaning.

  • Codex model validation accepts any non-empty string (internal/runtimeinfo/runtimeinfo.go ~L87): Unlike claude-code which validates against a known set, typos in model names won't be caught until runtime. Consider at least a warning if the model isn't in the known ModelOptions list.

  • Unnecessary defensive copy (internal/runtime/codex.go ~L605): append([]string{}, codexModelArgs(model)...) is redundant because codexModelArgs already returns a new slice, so you can just do args := codexModelArgs(model).

Minor / UX

  • loadRuntimeStateSummary behavior change could use a comment (cmd/runtime_state_helpers.go ~L35): The special-case that still returns nil,nil for toc-native when state doesn't exist suggests native sessions have a different lifecycle. A brief comment explaining why would help future readers.

  • Two-form interactive flow (cmd/agent_create.go ~L116): Splitting runtime selection from details is good UX. Minor note: if the user cancels during the second form, they can't go back to change the runtime selection.

  • TestCodexCLIHelpAcceptsCurrentExecShapes requires codex binary (internal/runtime/codex_test.go ~L706): This will be skipped in most CI environments. Worth documenting whether CI is expected to have codex installed or if this is purely a local smoke test.

  • Filesystem scan performance (internal/runtime/codex.go ~L829): findCodexLogForWorkspace does a broad glob across ~/.codex/sessions. On machines with many Codex sessions this could get slow. Worth a note about expected performance or a cap on the scan.

- apply_patch: check for "Error:" prefix instead of any occurrence of "error"
- parseCodexSessionLine: use payload key as format marker before falling back to type-string dispatch
- maxTokenUsage: add comment explaining per-field max rationale
- BuildCodexDetachedScript: extract named shQuote variables to eliminate positional arg ambiguity
- ValidateModelSelection: validate Codex model against known ModelOptions list
- buildCodexInteractiveArgs: remove unnecessary defensive copy of codexModelArgs result
- loadRuntimeStateSummary: add comment explaining nil,nil return for native sessions
- TestCodexCLIHelpAcceptsCurrentExecShapes: document local-only smoke test expectation
- findCodexLogForWorkspace: add performance note in doc comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@louismorgner
Owner Author

Code Review

Solid design and good test coverage for the happy path. The three-layer separation (proxy → report → CLI) is clean. A few items to address before merge:

Blocking Issues

1. codexWorkspaceCandidates only handles one direction of the /private symlink (internal/runtime/codex.go:584-593)

The guard adds /private/tmp/foo when toc's workspacePath is /tmp/foo. But the reverse is unhandled: if toc's workspace is /private/tmp/foo (resolved through os.Stat, filepath.Abs, etc. on macOS), and Codex stored cwd as /tmp/foo, matchesCodexWorkspace returns false and the session log is never found. This is the common macOS case for sessions in /tmp. Fix: also strip /private when the input starts with it.
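A minimal sketch of the symmetric version, assuming it is acceptable to generate the twin path unconditionally (a stricter variant might only do this for paths under /tmp or /var). `workspaceCandidates` is a hypothetical stand-in for codexWorkspaceCandidates.

```go
package main

import (
	"fmt"
	"strings"
)

// workspaceCandidates returns the path itself plus its macOS twin: the
// /private-stripped form when the prefix is present, otherwise the
// /private-prefixed form.
func workspaceCandidates(path string) []string {
	const prefix = "/private"
	candidates := []string{path}
	if strings.HasPrefix(path, prefix+"/") {
		candidates = append(candidates, strings.TrimPrefix(path, prefix))
	} else if strings.HasPrefix(path, "/") {
		candidates = append(candidates, prefix+path)
	}
	return candidates
}

func main() {
	fmt.Println(workspaceCandidates("/tmp/foo"))
	fmt.Println(workspaceCandidates("/private/tmp/foo"))
}
```

Matching the recorded cwd against either candidate then works whichever side resolved the symlink.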

2. ParseSessionLogLineEvents silently drops all tool calls for rollout-format logs (internal/runtime/codex.go:233-235)

When pending == nil, function_call events return nil, and function_call_output events have no matching pending entry. Any streaming consumer using ParseSessionLogLineEvents on rollout-format logs sees zero tool steps. If the live-tail display calls this method, all tool calls disappear. Either document this clearly or return a best-effort event from the output line even without the paired call.
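One possible shape for a streaming parser that keeps state between lines is sketched below. The JSON field names (`call_id`, `name`, `arguments`, `output`) and the assumption that `output` is a plain string are guesses about the rollout schema; all type names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rolloutLine models only the fields this sketch needs (assumed names).
type rolloutLine struct {
	Type    string `json:"type"`
	Payload struct {
		Type      string `json:"type"`
		CallID    string `json:"call_id"`
		Name      string `json:"name"`
		Arguments string `json:"arguments"`
		Output    string `json:"output"`
	} `json:"payload"`
}

type toolStep struct {
	Name, Arguments, Output string
}

// streamParser remembers function_call lines until their output arrives.
type streamParser struct {
	pending map[string]toolStep
}

func newStreamParser() *streamParser {
	return &streamParser{pending: map[string]toolStep{}}
}

// ParseLine returns a step once a function_call_output line is seen. If the
// matching call was never observed, it still returns a best-effort step that
// carries only the output.
func (p *streamParser) ParseLine(line []byte) (toolStep, bool) {
	var l rolloutLine
	if err := json.Unmarshal(line, &l); err != nil || l.Type != "response_item" {
		return toolStep{}, false
	}
	switch l.Payload.Type {
	case "function_call":
		p.pending[l.Payload.CallID] = toolStep{Name: l.Payload.Name, Arguments: l.Payload.Arguments}
	case "function_call_output":
		step := p.pending[l.Payload.CallID] // zero value when the call is unknown
		delete(p.pending, l.Payload.CallID)
		step.Output = l.Payload.Output
		return step, true
	}
	return toolStep{}, false
}

func main() {
	p := newStreamParser()
	p.ParseLine([]byte(`{"type":"response_item","payload":{"type":"function_call","call_id":"c1","name":"shell","arguments":"{}"}}`))
	step, ok := p.ParseLine([]byte(`{"type":"response_item","payload":{"type":"function_call_output","call_id":"c1","output":"done"}}`))
	fmt.Println(step.Name, step.Output, ok)
}
```

Because the map lives on the parser, the caller must reuse one parser instance per log stream; a value receiver that rebuilds state on each call would lose the pairing.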

3. parseCodexCommandOutput fallback defaults to exit code 0 on parse failure (internal/runtime/codex.go:986-989)

If a command failed and Codex emitted output in an unrecognized format, the step is recorded as success=true with ExitCode=0. Misleading for session timelines. The fallback should use a sentinel exit code (e.g., -1) and Success: boolPtr(false).
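A sketch of the fallback chain with a failing sentinel at the end. The exact layout of the text format (an "Exit code:" first line and an "Output:" marker line) and the JSON shape for older outputs (`output` plus `metadata.exit_code`) are assumptions; `parseCommandOutput` and `commandResult` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

type commandResult struct {
	ExitCode int
	Output   string
	Success  bool
}

// parseCommandOutput tries the structured text format, then a JSON-string
// format, and finally reports an unknown result as a failure (exit code -1).
func parseCommandOutput(raw string) commandResult {
	// 1. Structured text: "Exit code: N", more header lines, then "Output:".
	if strings.HasPrefix(raw, "Exit code: ") {
		firstLine, rest, _ := strings.Cut(raw, "\n")
		code, err := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(firstLine, "Exit code: ")))
		if err == nil {
			_, body, _ := strings.Cut(rest, "Output:\n")
			return commandResult{ExitCode: code, Output: body, Success: code == 0}
		}
	}
	// 2. JSON object with an explicit exit code (assumed legacy shape).
	var legacy struct {
		Output   string `json:"output"`
		Metadata struct {
			ExitCode *int `json:"exit_code"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal([]byte(raw), &legacy); err == nil && legacy.Metadata.ExitCode != nil {
		code := *legacy.Metadata.ExitCode
		return commandResult{ExitCode: code, Output: legacy.Output, Success: code == 0}
	}
	// 3. Unrecognized: keep the text but do not claim success.
	return commandResult{ExitCode: -1, Output: raw, Success: false}
}

func main() {
	fmt.Printf("%+v\n", parseCommandOutput("Exit code: 1\nWall time: 0.1 seconds\nOutput:\nboom"))
	fmt.Printf("%+v\n", parseCommandOutput("unrecognized"))
}
```

Using a pointer for the exit code distinguishes "field absent" from "exit code 0", so a JSON blob without metadata falls through to the sentinel instead of reading as success.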

4. ensureCodexGitRepo creates nested git repos for workspace subdirectories (internal/runtime/codex.go:237-246)

Only checks for .git directly in workDir. If the workspace is a subdirectory of an existing git repo, git init creates a nested repo — breaking git operations in the outer repo. Use git rev-parse --git-dir or a parent-traversal loop instead.
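A sketch of the `git rev-parse --git-dir` approach. The command runner is injected so the decision logic can be tested without a git binary; `ensureGitRepo`, `gitRunner`, `runGit`, and `errNotARepo` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// gitRunner runs git with args inside dir and returns an error on failure.
type gitRunner func(dir string, args ...string) error

// runGit is the real runner; it shells out to the git binary on PATH.
func runGit(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	return cmd.Run()
}

// errNotARepo is a stand-in error that fake runners can return.
var errNotARepo = errors.New("not a git repository")

// ensureGitRepo runs `git init` only when dir is not already inside a
// repository. `git rev-parse --git-dir` succeeds for any directory below a
// repository root, so ancestor repos are detected too.
func ensureGitRepo(dir string, run gitRunner) (initialized bool, err error) {
	if run(dir, "rev-parse", "--git-dir") == nil {
		return false, nil
	}
	if err := run(dir, "init"); err != nil {
		return false, fmt.Errorf("git init in %s: %w", dir, err)
	}
	return true, nil
}

func main() {
	// A fake runner that pretends the directory is outside any repository.
	fake := func(dir string, args ...string) error {
		if args[0] == "rev-parse" {
			return errNotARepo
		}
		return nil
	}
	initialized, err := ensureGitRepo("/tmp/workspace", fake)
	fmt.Println(initialized, err)
}
```

Production code would pass `runGit`; the nested-repo regression test can pass a fake that reports success for `rev-parse` and fails the test if `init` is ever called.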

Security

5. BuildClaudeDetachedScript still uses %q (Go quoting) for shell args (internal/runtime/claude.go:153-155)

Pre-existing issue exposed by this PR: the codex provider correctly introduces shQuote (single-quote escaping), but BuildClaudeDetachedScript still uses Go's %q which doesn't escape shell metacharacters like $() or backticks. A workspace path like /tmp/x$(touch /tmp/pwned) would execute the subshell. Should apply shQuote to the claude provider too.

6. toc-prompt.txt written world-readable (0644) (codex.go:119, claude.go:94)

Prompt files may contain sensitive instructions. Both providers do this — consider 0600 for agent workspace files with sensitive content.

Architecture

7. ExpectedSessionLogPath triggers full disk scan for non-sub-agent sessions (internal/runtime/codex.go:192-200)

For sub-agents: O(1). For top-level interactive sessions: full glob+file-read scan of ~/.codex/sessions/YYYY/MM/DD/*.jsonl. The Expected prefix implies a cheap path derivation. If called in a hot path (polling for log availability), this is repeatedly expensive.

8. maxTokenUsage takes per-field maximum across semantically incompatible sources (internal/usage/usage.go:218-225)

Taking per-field max could combine input tokens from one source with output tokens from another, producing a total that reflects neither accurately. The comment should state whether turn.completed values are per-turn or cumulative.
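The effect is easy to reproduce with two made-up readings. The struct below is a simplified stand-in with hypothetical field names, not the real TokenUsage, and `pickLarger` is just one possible alternative.

```go
package main

import "fmt"

// TokenUsage is a simplified stand-in for illustration.
type TokenUsage struct {
	Input, Output int
}

func (u TokenUsage) Total() int { return u.Input + u.Output }

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// maxPerField mirrors the per-field strategy under discussion.
func maxPerField(a, b TokenUsage) TokenUsage {
	return TokenUsage{Input: maxInt(a.Input, b.Input), Output: maxInt(a.Output, b.Output)}
}

// pickLarger keeps one source intact: whichever reports the larger total.
func pickLarger(a, b TokenUsage) TokenUsage {
	if b.Total() > a.Total() {
		return b
	}
	return a
}

func main() {
	turns := TokenUsage{Input: 100, Output: 5}   // made-up summed turn values
	rollout := TokenUsage{Input: 10, Output: 50} // made-up rollout total
	fmt.Println(maxPerField(turns, rollout).Total(), pickLarger(turns, rollout).Total())
}
```

With these inputs the per-field result totals 150 while the sources report 105 and 60, so the hybrid matches neither measurement. Whether that is acceptable depends on whether the two sources are expected to describe the same tokens.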

Testing Gaps

  • No test for nested git repo scenario in ensureCodexGitRepo
  • parseCodexCommandOutput fallback path (exit code 0) is untested
  • No unit tests for buildCodexInteractiveArgs / buildCodexExecArgs
  • No test for ParseSessionLogLineEvents rollout format behavior (tool call drops)
  • maxTokenUsage field-independence not tested adversarially

Nits

  • codexProvider{} constructed inside its own method receiver — use named receiver instead
  • shQuote should be a package-level utility shared with claude.go
  • Wall time parsing only handles "seconds" suffix — "150ms" would silently fail
  • codexModelArgs returns nil not []string{} — subtly surprising for a *Args function
  • ValidateModelSelection for Codex rejects empty model with unhelpful "missing model" — should list valid values
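For the wall-time nit, a tolerant parser could accept both spellings. This sketch assumes the value arrives either as "<float> seconds" or as a Go-style duration such as "150ms"; whether Codex ever emits the latter is not established here, and `parseWallTime` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseWallTime accepts "<float> seconds" (or "second") and otherwise falls
// back to time.ParseDuration, which understands suffixes like "ms" and "s".
func parseWallTime(s string) (time.Duration, error) {
	s = strings.TrimSpace(s)
	for _, suffix := range []string{" seconds", " second"} {
		if strings.HasSuffix(s, suffix) {
			secs, err := strconv.ParseFloat(strings.TrimSuffix(s, suffix), 64)
			if err != nil {
				return 0, fmt.Errorf("parse wall time %q: %w", s, err)
			}
			return time.Duration(secs * float64(time.Second)), nil
		}
	}
	return time.ParseDuration(s)
}

func main() {
	for _, in := range []string{"1.5 seconds", "150ms", "soon"} {
		d, err := parseWallTime(in)
		fmt.Println(in, d, err)
	}
}
```

Returning the error lets the caller decide whether an unparseable wall time should be logged or ignored, rather than silently recording zero.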

Items 1–5 are the ones I'd address before merge. The rest are fine as follow-ups.

- codexWorkspaceCandidates: strip /private prefix when present, not
  only add it, so macOS session discovery works in both directions

- ParseSessionLogLineEvents: convert codexProvider to pointer receiver
  with a persistent pending map so rollout-format function_call /
  function_call_output pairs are matched across streaming lines

- parseCodexCommandOutput: return -1 on parse failure instead of 0 so
  callers see Success=false rather than a silent false-positive

- ensureCodexGitRepo: use git rev-parse --git-dir instead of a .git
  stat check so ancestor repos are detected and re-init is skipped

- BuildClaudeDetachedScript: apply shQuote (POSIX single-quote escaping)
  to all user-controlled interpolations, matching the codex provider

- codexModelArgs: return []string{} instead of nil on empty model

- ValidateModelSelection: include valid model names in the empty-model
  error, consistent with the non-empty error path

- ExpectedSessionLogPath: use named receiver p.SessionLogPath(sess)
  instead of allocating a throwaway codexProvider{}

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>