Add OpenAI Codex runtime support#154
Conversation
There was a problem hiding this comment.
The detached-resume path is building codex exec resume -m ... -o ... --dangerously-bypass-approvals-and-sandbox ..., but on the Codex CLI installed here (codex-cli 0.46.0) those flags are rejected once they appear after resume (unexpected argument -m, -o, and --dangerously-bypass-approvals-and-sandbox). That means toc can spawn a Codex sub-agent, but ResumeSubSession cannot actually restart it. We either need a different invocation shape or an explicit minimum-supported Codex CLI version check before claiming resume support.
There was a problem hiding this comment.
Fixed in 22a283e: flags are now placed before resume in BuildCodexDetachedScript and buildCodexExecArgs, matching the Codex CLI 0.46.0 invocation shape.
There was a problem hiding this comment.
The interactive flow still hardcodes runtimeinfo.DefaultRuntime in the final config/audit path, so selecting codex or toc-native in the new runtime picker still writes out a claude-code agent config. The non-interactive --runtime path is wired correctly, but the main UX path silently creates the wrong runtime. This needs to use the selected runtimeName end-to-end, plus a regression test for interactive config creation.
There was a problem hiding this comment.
Fixed in 22a283e: the interactive flow now uses runtimeName (the form-selected value) throughout — including in AgentConfig.Runtime and the audit log — instead of the hardcoded runtimeinfo.DefaultRuntime.
There was a problem hiding this comment.
The rollout parser only recognizes the newer shell_command / exec_command schema and the textual Exit code: / Wall time: / Output: format. Local Codex rollout logs on this machine still emit response_item calls named shell with JSON-string outputs, so replay/watch will drop Bash steps entirely on the currently installed CLI. The tests only cover the newer schema, which is why this slips through. If the goal is broad Codex support, this parser needs a compatibility path; otherwise the docs should declare and enforce a minimum CLI version.
There was a problem hiding this comment.
Fixed in 22a283e: codexRolloutCallOutputToStep now includes "shell" alongside "shell_command"/"exec_command". parseCodexCommandOutput tries the structured text format first, then falls back to JSON-unmarshal (for older CLI JSON-string outputs), then falls back to treating the whole string as output.
- Move flags before `resume` subcommand in codex exec/interactive args so they're accepted by codex-cli 0.46.0 - Use selected `runtimeName` in interactive agent create flow instead of hardcoded DefaultRuntime - Add `shell` as compat alias in rollout parser for older Codex CLI versions that emit `response_item` with `shell` function calls - Handle JSON-string output format from older `shell` calls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix resume args, interactive runtime, and shell parser compat
louismorgner
left a comment
There was a problem hiding this comment.
Code review feedback
Shell injection surface in BuildCodexDetachedScript
codex.go:655-691 — The detached script is built with fmt.Sprintf and %q quoting, but the prompt is piped via < %q from a file whose path includes the session directory. The command variable (line 667-673) embeds opts.Model with %q, which is fine for Go-level quoting, but this is being interpolated into a shell script. If opts.Model ever contains shell metacharacters that survive %q quoting in a shell context, this could be exploitable. Consider using shellescape or writing args to a file and reading them back, rather than interpolating into shell scripts.
codexProvider.ValidateModel accepts any non-empty string
runtimeinfo.go:85-88 — Unlike claude-code which validates against a known set (default, sonnet, opus, haiku), the codex runtime accepts literally any string as a model. This is probably intentional for forward-compat with new Codex models, but it means typos like gpt-5-codx will silently proceed. Consider at least warning if the model isn't in the known set.
findCodexLogForWorkspace walks the entire ~/.codex/sessions tree
codex.go:823-871 — This does a full filepath.WalkDir across all Codex session logs on disk to find one matching the workspace path. For users with many Codex sessions, this could be slow. The createdAt time filter helps skip old files, but only by mod time, not by directory structure. Consider using the date-based directory structure (sessions/YYYY/MM/DD/) to narrow the walk.
Token accounting fallback logic is fragile
usage.go:798-803 — parseCodexJSONL accumulates turn.completed tokens, then falls back to the latest event_msg token_count if the turn-based total is zero. This means if the log has any turn.completed events with tokens, rollout-style token_count events are ignored entirely — even if they represent a different/later session phase. This could undercount if a session mixes formats (e.g., resumed session that switched CLI versions).
readCodexSessionMeta has a dead code path
codex.go:919-927 — When line.Type == "thread.started", it re-unmarshals the same scanner.Bytes() into execLine and returns (threadID, "", time.Time{}). But the function already parsed line.Payload.ID which would be empty for thread.started events. The re-parse is redundant and the early return means the cwd from a thread.started payload is never extracted. The branch is confusing and should be cleaned up.
No integration or smoke test for actual Codex CLI interaction
The tests are all unit-level with synthetic JSONL. There's no test that validates the actual codex CLI arg shapes work. Given that the first round of review caught exactly this kind of issue (wrong arg ordering for codex-cli 0.46.0), a simple smoke test that at least validates the constructed args are accepted would prevent regressions.
agent.md is deleted in PrepareSession
codex.go:421 — PrepareSession removes agent.md after writing AGENTS.md. The Claude Code provider doesn't delete agent.md — it writes CLAUDE.md alongside it. Consider keeping both for debuggability, so users inspecting the session workspace can still see the original instruction file.
Co-Authored-By: Codex <noreply@openai.com>
louismorgner
left a comment
There was a problem hiding this comment.
Code Review Feedback
Good work — the PR follows the existing runtime provider pattern cleanly and has solid test coverage for log parsing. A few things to address before merging:
Robustness
-
apply_patchsuccess heuristic is too loose (internal/runtime/codex.go~L288):!strings.Contains(strings.ToLower(output), "error")will misclassify output that legitimately contains the word "error" (e.g. a comment about fixing an error, or a filename likeerror_handler.go). Check for a more specific pattern like a known Codex error prefix instead. -
Dual log format dispatch is fragile (
internal/runtime/codex.go~L556):parseCodexSessionLineroutes exec events (item.completed) vs rollout events (response_item) purely by type string. If a future Codex version adds a type that collides between formats, the wrong parser runs. Consider checking for a format marker (e.g. presence ofpayloadkey) rather than relying on type strings being globally unique. -
maxTokenUsagecould produce misleading hybrid totals (internal/usage/usage.go~L936): Taking the per-field max independently across different measurement sources means the returnedTokenUsagecould mix values from rollout totals and summedturn.completedevents. If this is intentional, add a comment explaining the rationale.
Maintainability
-
Shell script template is hard to audit (
internal/runtime/codex.go~L674):BuildCodexDetachedScripthas 18+ positionalshQuoteargs in a singleSprintf. Easy to misorder. Considertext/templateorstrings.Builderwith named variables, or at minimum inline comments mapping each%sto its meaning. -
Codex model validation accepts any non-empty string (
internal/runtimeinfo/runtimeinfo.go~L87): Unlike claude-code which validates against a known set, typos in model names won't be caught until runtime. Consider at least a warning if the model isn't in the knownModelOptionslist. -
Unnecessary defensive copy (
internal/runtime/codex.go~L605):append([]string{}, codexModelArgs(model)...)—codexModelArgsalready returns a new slice, so you can just doargs := codexModelArgs(model).
Minor / UX
-
loadRuntimeStateSummarybehavior change could use a comment (cmd/runtime_state_helpers.go~L35): The special-case that still returnsnil,nilfortoc-nativewhen state doesn't exist suggests native sessions have a different lifecycle. A brief comment explaining why would help future readers. -
Two-form interactive flow (
cmd/agent_create.go~L116): Splitting runtime selection from details is good UX. Minor note: if the user cancels during the second form, they can't go back to change the runtime selection. -
TestCodexCLIHelpAcceptsCurrentExecShapesrequirescodexbinary (internal/runtime/codex_test.go~L706): This will be skipped in most CI environments. Worth documenting whether CI is expected to have codex installed or if this is purely a local smoke test. -
Filesystem scan performance (
internal/runtime/codex.go~L829):findCodexLogForWorkspacedoes a broad glob across~/.codex/sessions. On machines with many Codex sessions this could get slow. Worth a note about expected performance or a cap on the scan.
- apply_patch: check for "Error:" prefix instead of any occurrence of "error" - parseCodexSessionLine: use payload key as format marker before falling back to type-string dispatch - maxTokenUsage: add comment explaining per-field max rationale - BuildCodexDetachedScript: extract named shQuote variables to eliminate positional arg ambiguity - ValidateModelSelection: validate Codex model against known ModelOptions list - buildCodexInteractiveArgs: remove unnecessary defensive copy of codexModelArgs result - loadRuntimeStateSummary: add comment explaining nil,nil return for native sessions - TestCodexCLIHelpAcceptsCurrentExecShapes: document local-only smoke test expectation - findCodexLogForWorkspace: add performance note in doc comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code ReviewSolid design and good test coverage for the happy path. The three-layer separation (proxy → report → CLI) is clean. A few items to address before merge: Blocking Issues1. The guard adds 2. When 3. If a command failed and Codex emitted output in an unrecognized format, the step is recorded as 4. Only checks for Security5. Pre-existing issue exposed by this PR: the codex provider correctly introduces 6. Prompt files may contain sensitive instructions. Both providers do this — consider Architecture7. For sub-agents: O(1). For top-level interactive sessions: full glob+file-read scan of 8. Taking per-field max could combine input tokens from one source with output tokens from another, producing a total that reflects neither accurately. The comment should state whether Testing Gaps
Nits
Items 1–5 are the ones I'd address before merge. The rest are fine as follow-ups. |
- codexWorkspaceCandidates: strip /private prefix when present, not
only add it, so macOS session discovery works in both directions
- ParseSessionLogLineEvents: convert codexProvider to pointer receiver
with a persistent pending map so rollout-format function_call /
function_call_output pairs are matched across streaming lines
- parseCodexCommandOutput: return -1 on parse failure instead of 0 so
callers see Success=false rather than a silent false-positive
- ensureCodexGitRepo: use git rev-parse --git-dir instead of a .git
stat check so ancestor repos are detected and re-init is skipped
- BuildClaudeDetachedScript: apply shQuote (POSIX single-quote escaping)
to all user-controlled interpolations, matching the codex provider
- codexModelArgs: return []string{} instead of nil on empty model
- ValidateModelSelection: include valid model names in the empty-model
error, consistent with the non-empty error path
- ExpectedSessionLogPath: use named receiver p.SessionLogPath(sess)
instead of allocating a throwaway codexProvider{}
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
codexruntime alongsideclaude-codeandtoc-nativeTesting