ADR-OA03.3: Codex CLI Runner

Codex execution path for tnh-conductor via CLI: headless codex exec mode with JSONL output capture, superseding the API-based approach in ADR-OA03.2.

  • Status: Implemented
  • Type: Implementation ADR (De-risking Spike)
  • Date: 2026-02-07
  • Owner: Aaron Solomon
  • Author: Aaron Solomon, Claude Opus 4.5
  • Parent ADR: ADR-OA03
  • Supersedes: ADR-OA03.2 (API-based approach)

ADR Editing Policy

IMPORTANT: How you edit this ADR depends on its status.

  • proposed status: ADR is in the design loop. We may rewrite or edit the document as needed to refine the design.
  • accepted, wip, implemented status: Implementation has begun or completed. NEVER edit the original Context/Decision/Consequences sections. Only append addendums.

Context

Background

ADR-OA03.2 attempted to drive Codex via the OpenAI Responses API. The spike revealed fundamental constraints:

  1. No final text output: The model treats tool calls as completion, so it emits no structured JSON response
  2. API parameter limitations: response_format, temperature, and tool_choice: none are not supported
  3. VS Code extension architecture: Uses proprietary "app server" with client-side orchestration

These constraints led to suspending OA03.2 on 2026-01-25.

New Development: Codex CLI

The Codex CLI is now available with headless execution capabilities. Reference: https://developers.openai.com/codex/cli/reference

Key capabilities:

| Capability | CLI Flag/Command | Description |
| --- | --- | --- |
| Headless execution | codex exec | Run Codex non-interactively |
| JSONL output | --json | Newline-delimited JSON events |
| Final response capture | --output-last-message <path> | Write final response to file |
| Approval bypass | --full-auto | Low-friction automation preset |
| Approval control | --ask-for-approval | untrusted, on-failure, on-request, never |
| Sandbox policy | --sandbox | read-only, workspace-write, danger-full-access |
| Session resume | codex exec resume | Continue prior sessions |
| MCP server mode | codex mcp-server | Agent-to-agent consumption |

This provides a CLI-based execution model analogous to Claude Code's --print mode, which is the foundation of ADR-OA03.1.

Why CLI Over API

| Concern | API Approach (OA03.2) | CLI Approach (OA03.3) |
| --- | --- | --- |
| Final output capture | Model doesn't emit text after tools | --output-last-message captures response |
| Structured events | Limited API response format | --json provides JSONL event stream |
| Tool execution | Must implement tool providers | Native filesystem access |
| Approval/safety | Manual implementation | Built-in --sandbox and --ask-for-approval |
| Architecture alignment | Different from Claude Code | Same pattern as OA03.1 |

Decision

Spike Objectives

This ADR defines a de-risking spike to validate Codex CLI as a viable execution surface for tnh-conductor. The spike must prove:

  1. Headless execution works: codex exec "task" completes without interaction
  2. Output is capturable: --json + --output-last-message provide parseable results
  3. Automation controls work: either --full-auto or --ask-for-approval never suppresses approval prompts
  4. Sandbox policy is enforceable: --sandbox workspace-write restricts access
  5. Pattern matches OA03.1: Same capture/parse patterns as Claude Code runner

Evaluation Questions

The spike should also answer these higher-level questions:

  1. Can Codex CLI reliably perform multi-step code edits in a repo?
  2. Where and how are run transcripts and summaries stored?
  3. Can Codex be run deterministically with project-local configuration?
  4. What artifacts can be harvested post-run without UI integration?
  5. Is the CLI behavior functionally equivalent to Codex in VS Code?

Execution Model

Codex CLI is invoked headlessly via subprocess, matching the OA03.1 pattern:

TaskSpec + WorkingDirectory
        |
   codex exec --json --output-last-message <path> --full-auto -m gpt-5.2-codex "task"
        |
 Captured Outputs (JSONL events + final response + workspace diff)

Execution constraints:

| Constraint | Value | Rationale |
| --- | --- | --- |
| Single-shot execution | Per step | No multi-turn within a step |
| Approval mode | --full-auto or configured | Automation preset |
| Sandbox | workspace-write | Match work branch isolation |
| Output capture | JSONL + final message file | Dual-channel capture |
| Model | -m gpt-5.2-codex | Consistent model selection |
| Session resume | Not used in spike | Keep parity with OA03.1 |
| MCP | Out of scope | ADR-OA01.1 CLI-first, no MCP in spike |
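
The invocation shown in the execution model can be sketched as a small argv builder. This is an illustrative sketch, not the spike's actual code: build_codex_command and its parameters are hypothetical names, and the only flags used are the ones listed in this ADR. Whether --sandbox and --ask-for-approval are accepted in this position on codex exec is an assumption to confirm against the CLI reference.

```python
from pathlib import Path

# Values copied from the capability table in this ADR.
SANDBOX_POLICIES = {"read-only", "workspace-write", "danger-full-access"}
APPROVAL_POLICIES = {"untrusted", "on-failure", "on-request", "never"}


def build_codex_command(
    task: str,
    last_message_path: Path,
    model: str = "gpt-5.2-codex",
    full_auto: bool = True,
    sandbox: str = "workspace-write",
    approval: str = "never",
) -> list[str]:
    """Assemble argv for one single-shot `codex exec` step (hypothetical helper)."""
    if sandbox not in SANDBOX_POLICIES:
        raise ValueError(f"unknown sandbox policy: {sandbox!r}")
    if approval not in APPROVAL_POLICIES:
        raise ValueError(f"unknown approval policy: {approval!r}")

    cmd = ["codex", "exec", "--json", "--output-last-message", str(last_message_path)]
    if full_auto:
        # Mirrors the invocation in the execution model: the preset alone.
        cmd.append("--full-auto")
    else:
        # Assumption: explicit flags are accepted here; verify against the
        # CLI reference before relying on this branch.
        cmd += ["--sandbox", sandbox, "--ask-for-approval", approval]
    cmd += ["-m", model, task]
    return cmd
```

Returning a list rather than a shell string keeps the task text as a single argument, so quoting in the task cannot alter the command line when it is passed to subprocess without a shell.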

Claude Parity (OA03.1 Alignment)

The Codex CLI spike should mirror the Claude Code runner capture contract:

  • Same artifact set (transcript, final response, git diff, run metadata)
  • Same workflow expectations (single-shot execution, headless CLI only)
  • Same PTY posture (fallback only; primary capture via stdout/stderr)

Core Principle (from OA03)

Normalize outputs, not control surfaces.

The Codex CLI adapter handles CLI-specific mechanics; the kernel receives normalized AgentRunResult regardless of which runner produced it.


Spike Deliverables

Pass/Fail Criteria

| Test Case | Expected Result | Status |
| --- | --- | --- |
| Basic execution | codex exec "list files in src/" completes | [x] |
| JSONL capture | --json produces parseable event stream | [x] |
| Final response | --output-last-message contains summary | [x] |
| Full-auto mode | --full-auto bypasses approval prompts | [x] |
| Sandbox enforcement | --sandbox read-only prevents writes | [x] |
| Workspace write | --sandbox workspace-write allows project writes | [x] |
| Error handling | Non-zero exit code on failure | [x] |
| Timeout handling | Process can be killed on timeout | [x] |

Artifacts

| Artifact | Description |
| --- | --- |
| codex_cli_spike.py | Spike implementation |
| events.ndjson | Captured JSONL event stream |
| response.txt | Final response from --output-last-message |
| run.json | Run metadata |
| SPIKE_REPORT.md | Findings, gotchas, recommendations |

Decision Point

If spike passes → proceed to full Codex CLI adapter implementation under OA03.

If spike fails → document constraints, evaluate alternatives (e.g., wait for CLI improvements).


Output Contract

Codex CLI runner must produce the same AgentRunResult structure as Claude Code runner (OA03.1):

from dataclasses import dataclass
from typing import Literal


@dataclass
class AgentRunResult:
    """Normalized output from any agent runner."""
    status: Literal["success", "partial", "blocked", "timeout", "error"]
    transcript: str          # Full JSONL event stream or formatted output
    final_response: str      # Content from --output-last-message
    workspace_diff: str      # Git diff of changes
    exit_code: int
    duration_seconds: float
    metadata: dict           # Agent-specific metadata (model, tokens, etc.)

Dual-Channel Output

Consistent with OA01/OA03:

| Channel | Content | Source |
| --- | --- | --- |
| Transcript | JSONL event stream | --json stdout |
| Semantic | Final response | --output-last-message file |
| Workspace | Git diff | Post-run git diff |
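
A minimal sketch of harvesting these channels after a run, assuming the transcript was already captured from stdout. The helper names (read_final_response, git_diff, collect_channels) are hypothetical and not taken from codex_cli_spike.py; the git invocation uses standard git options only.

```python
import subprocess
from pathlib import Path


def read_final_response(path: Path) -> str:
    """Semantic channel: contents of the --output-last-message file, or "" if absent."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def git_diff(workdir: Path) -> str:
    """Workspace channel: unified diff of tracked changes relative to HEAD."""
    # Note: `git diff` does not show untracked files. Newly created files
    # only appear after `git add -N <path>` (intent-to-add) or staging.
    done = subprocess.run(
        ["git", "-C", str(workdir), "diff", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout


def collect_channels(stdout: str, response_path: Path, workdir: Path) -> dict[str, str]:
    """Bundle the three rows of the channel table into one mapping."""
    return {
        "transcript": stdout,
        "final_response": read_final_response(response_path),
        "workspace_diff": git_diff(workdir),
    }
```

Treating a missing final-message file as an empty string, rather than raising, lets a timed-out or failed run still yield a transcript and diff; the caller can decide how an empty final response maps onto AgentRunResult.status.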

Control Surface Mapping

Required by OA03 ADR Gate.

Authoritative Documentation

  • Codex CLI Reference: https://developers.openai.com/codex/cli/reference
  • OpenAI Codex Overview: https://openai.com/codex/

Invocation Modes

| Mode | Command | Use Case |
| --- | --- | --- |
| Headless exec | codex exec | Primary execution mode |
| Interactive | codex | Not used for automation |
| MCP server | codex mcp-server | Out of scope for spike |

IO Model

  • Stateless per invocation: Each codex exec is independent
  • Session resume available: codex exec resume (not used in spike)
  • Dual output: JSONL to stdout, final response to file

Permission / Safety Controls

| Control | Mechanism |
| --- | --- |
| Approval control | --ask-for-approval flag |
| Sandbox policy | --sandbox flag |
| Directory access | --add-dir for additional paths |
| Model selection | -m / --model flag |

Experiment Matrix

| Test Case | Expected Result |
| --- | --- |
| Headless execution completes | Exit code 0, output captured |
| JSONL events parseable | Valid JSON per line |
| Final response captured | Non-empty file content |
| Sandbox read-only enforced | Write attempts blocked |
| Sandbox workspace-write works | Project files writable |
| Timeout kills process | Process terminates, partial output captured |
| Model pinning | -m gpt-5.2-codex honored in run metadata |
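
The timeout case can be exercised without Codex installed, since it only concerns subprocess handling. The sketch below is one way to do it under that assumption: run_with_timeout is a hypothetical helper, and STAND_IN_CMD is a stand-in child process (a Python one-liner) used in place of a real codex exec invocation.

```python
import subprocess
import sys

# Stand-in for a long-running `codex exec`: prints one line, then hangs.
STAND_IN_CMD = [
    sys.executable,
    "-c",
    "import time; print('partial', flush=True); time.sleep(30)",
]


def run_with_timeout(cmd: list[str], timeout_s: float) -> tuple[str, str, int, bool]:
    """Run cmd and return (stdout, stderr, exit_code, timed_out)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return out, err, proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.kill()
        # A second communicate() drains output written before the kill,
        # so the partial transcript is not lost.
        out, err = proc.communicate()
        return out, err, proc.returncode, True
```

Keeping stdout and stderr on separate pipes matters for this runner because stdout carries the JSONL stream; merging the two could interleave diagnostic text with events and break per-line parsing.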

Relationship to OA03.2 (API Runner)

This ADR supersedes ADR-OA03.2. The relationship:

| OA03.2 Role | Description |
| --- | --- |
| Historical context | Documents API constraints and why that approach failed |
| Spike evidence | Learnings inform what to avoid in CLI approach |
| Preserved artifacts | Harness code may inform CLI adapter design |

OA03.2's status is superseded. This ADR (OA03.3) provides the viable Codex execution path.


Consequences

Positive

  • Proven pattern: Matches Claude Code runner (OA03.1) architecture
  • Native capabilities: CLI handles tool execution, sandbox, approvals
  • Simpler adapter: No custom tool providers needed (unlike API approach)
  • Dual-channel capture: JSONL events + final response + git diff
  • Client-driven agent: Outputs are harvested, not synthesized, which aligns with OA01.1's principle that orchestration owns provenance and lifecycle, not the agent loop itself

Negative / Tradeoffs

  • CLI dependency: Requires Codex CLI installed and configured
  • Less control: Cannot customize tool surface (CLI provides fixed tools)
  • Output parsing: Must handle JSONL event stream format
  • External process: Subprocess management vs API call

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| CLI behavior differs from docs | Spike validates actual behavior |
| Output format changes | Version-pin CLI, document format |
| Approval prompts leak through | Test --full-auto thoroughly |
| Sandbox insufficient | Combine with git worktree isolation |

Open Questions

1. JSONL Event Schema

Question: What is the exact schema of --json output events?

Action: Capture and document during spike.

Decision needed by: Spike completion

2. Model Selection

Question: Which Codex model variant for implementation tasks?

Options:

  • Default (let CLI choose)
  • Explicit -m gpt-5.2-codex for consistency
  • Task-based selection

Decision needed by: Implementation


Archived Versions


As-Built Notes & Addendums

Reserved for post-implementation updates. Never edit the original Context/Decision/Consequences sections; always append addendums here.

Addendum 2026-02-08: Spike Passed, Implementation Complete

Status changed: proposed → implemented

The Codex CLI spike passed all criteria with a successful 7-minute implementation run:

Run Details:

  • Run ID: 20260208-155213
  • Duration: 6m 47s (15:52:13 → 15:59:00 UTC)
  • Exit code: 0 (completed)
  • Task: Implement ADR-CF02 (Prompt Catalog Discovery)

Artifact Capture:

  • 272 events in NDJSON stream
  • 108 command executions captured with exit codes
  • 70 reasoning items (chain of thought)
  • 13KB unified diff
  • Structured final report in response.txt

Token Usage:

  • Input: 4,239,965 tokens
  • Cached: 4,089,472 (96% cache hit)
  • Output: 17,936 tokens

Event Types Captured:

  • item.completed (137), command_execution (108), reasoning (70)
  • item.started (55), file_change (11), todo_list (4)
  • Full lifecycle: thread.started → turn.started → items → turn.completed

Codex JSON Schema (observed):

{"type": "item.completed", "item": {"id": "item_N", "type": "command_execution", "command": "...", "aggregated_output": "...", "exit_code": 0, "status": "completed"}}
{"type": "item.completed", "item": {"id": "item_N", "type": "reasoning", "text": "..."}}
{"type": "item.completed", "item": {"id": "item_N", "type": "agent_message", "text": "..."}}
{"type": "turn.completed", "usage": {"input_tokens": N, "cached_input_tokens": N, "output_tokens": N}}
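
Given the observed event shapes above, a transcript summary can be sketched as follows. This assumes only the fields shown in the observed schema (type, item.type, item.text, usage); summarize_events and cache_hit_ratio are hypothetical helper names, and events of other shapes are counted by type but otherwise ignored.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_events(lines: Iterable[str]) -> dict:
    """Tally event and item types, and keep the last agent message and usage."""
    event_types: Counter = Counter()
    item_types: Counter = Counter()
    usage: dict = {}
    last_message = ""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between events
        event = json.loads(raw)  # raises on a malformed line, by design
        kind = event.get("type", "unknown")
        event_types[kind] += 1
        if kind == "item.completed":
            item = event.get("item", {})
            item_types[item.get("type", "unknown")] += 1
            if item.get("type") == "agent_message":
                last_message = item.get("text", "")
        elif kind == "turn.completed":
            usage = event.get("usage", {})
    return {
        "event_types": event_types,
        "item_types": item_types,
        "usage": usage,
        "last_message": last_message,
    }


def cache_hit_ratio(usage: dict) -> float:
    """Share of input tokens served from cache; 0.0 when no input was counted."""
    total = usage.get("input_tokens", 0)
    return usage.get("cached_input_tokens", 0) / total if total else 0.0
```

Applied to the token counts reported in this addendum, cache_hit_ratio gives 4,089,472 / 4,239,965, which rounds to the 96% figure quoted under Token Usage.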

Open Questions Resolved:

  1. JSONL Event Schema: Documented above. Events are typed with item.started/completed/updated wrappers.
  2. Model Selection: -m gpt-5.2-codex confirmed working and recommended for consistency.

Decision: Proceed with full Codex CLI adapter implementation under OA03. Pattern validated as equivalent to Claude Code runner (OA03.1).