ADR-OA03.3: Codex CLI Runner¶
Codex execution path for tnh-conductor via the CLI: headless codex exec mode with JSONL output capture, superseding the API-based approach in ADR-OA03.2.
- Status: Implemented
- Type: Implementation ADR (De-risking Spike)
- Date: 2026-02-07
- Owner: Aaron Solomon
- Author: Aaron Solomon, Claude Opus 4.5
- Parent ADR: ADR-OA03
- Supersedes: ADR-OA03.2 (API-based approach)
ADR Editing Policy¶
IMPORTANT: How you edit this ADR depends on its status.
- proposed status: ADR is in the design loop. We may rewrite or edit the document as needed to refine the design.
- accepted, wip, implemented status: Implementation has begun or completed. NEVER edit the original Context/Decision/Consequences sections. Only append addendums.
Context¶
Background¶
ADR-OA03.2 attempted to drive Codex via the OpenAI Responses API. The spike revealed fundamental constraints:
- No final text output: Model treats tool calls as completion, so no structured JSON response is emitted
- API parameter limitations: response_format, temperature, and tool_choice: none are not supported
- VS Code extension architecture: Uses a proprietary "app server" with client-side orchestration
These constraints led to suspending OA03.2 on 2026-01-25.
New Development: Codex CLI¶
The Codex CLI is now available with headless execution capabilities. Reference: https://developers.openai.com/codex/cli/reference
Key capabilities:
| Capability | CLI Flag/Command | Description |
|---|---|---|
| Headless execution | codex exec | Run Codex non-interactively |
| JSONL output | --json | Newline-delimited JSON events |
| Final response capture | --output-last-message <path> | Write final response to file |
| Approval bypass | --full-auto | Low-friction automation preset |
| Approval control | --ask-for-approval | untrusted, on-failure, on-request, never |
| Sandbox policy | --sandbox | read-only, workspace-write, danger-full-access |
| Session resume | codex exec resume | Continue prior sessions |
| MCP server mode | codex mcp-server | Agent-to-agent consumption |
This provides a CLI-based execution model analogous to Claude Code's --print mode, which is the foundation of ADR-OA03.1.
Why CLI Over API¶
| Concern | API Approach (OA03.2) | CLI Approach (OA03.3) |
|---|---|---|
| Final output capture | Model doesn't emit text after tools | --output-last-message captures response |
| Structured events | Limited API response format | --json provides JSONL event stream |
| Tool execution | Must implement tool providers | Native filesystem access |
| Approval/safety | Manual implementation | Built-in --sandbox and --ask-for-approval |
| Architecture alignment | Different from Claude Code | Same pattern as OA03.1 |
Decision¶
Spike Objectives¶
This ADR defines a de-risking spike to validate Codex CLI as a viable execution surface for tnh-conductor. The spike must prove:
- Headless execution works: codex exec "task" completes without interaction
- Output is capturable: --json + --output-last-message provide parseable results
- Automation controls work: --full-auto or --ask-for-approval never bypass prompts
- Sandbox policy is enforceable: --sandbox workspace-write restricts access
- Pattern matches OA03.1: Same capture/parse patterns as Claude Code runner
Evaluation Questions¶
The spike should also answer these higher-level questions:
- Can Codex CLI reliably perform multi-step code edits in a repo?
- Where and how are run transcripts and summaries stored?
- Can Codex be run deterministically with project-local configuration?
- What artifacts can be harvested post-run without UI integration?
- Is the CLI behavior functionally equivalent to Codex in VS Code?
Execution Model¶
Codex CLI is invoked headlessly via subprocess, matching the OA03.1 pattern:
TaskSpec + WorkingDirectory
|
codex exec --json --output-last-message <path> --full-auto -m gpt-5.2-codex "task"
|
Captured Outputs (JSONL events + final response + workspace diff)
Execution constraints:
| Constraint | Value | Rationale |
|---|---|---|
| Single-shot execution | Per step | No multi-turn within a step |
| Approval mode | --full-auto or configured | Automation preset |
| Sandbox | workspace-write | Match work branch isolation |
| Output capture | JSONL + final message file | Dual-channel capture |
| Model | -m gpt-5.2-codex | Consistent model selection |
| Session resume | Not used in spike | Keep parity with OA03.1 |
| MCP | Out of scope | ADR-OA01.1 CLI-first, no MCP in spike |
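The invocation and constraints above can be expressed as an argument list for a subprocess call. The following is a minimal sketch, not the spike's actual code: the helper name build_codex_command and its defaults are hypothetical, and only flags named in this ADR are used.

```python
from pathlib import Path


def build_codex_command(
    task: str,
    response_path: Path,
    model: str = "gpt-5.2-codex",
    full_auto: bool = True,
) -> list[str]:
    """Build the argv for one single-shot headless run (hypothetical helper)."""
    # JSONL events go to stdout; the final response is written to response_path.
    cmd = ["codex", "exec", "--json", "--output-last-message", str(response_path)]
    if full_auto:
        # Automation preset listed in the capability table.
        cmd.append("--full-auto")
    cmd += ["-m", model]
    # The task prompt is passed as the final positional argument.
    cmd.append(task)
    return cmd


argv = build_codex_command("list files in src/", Path("response.txt"))
print(argv)
```

Building a list rather than a shell string avoids quoting problems when the task text contains spaces or quotes.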
Claude Parity (OA03.1 Alignment)¶
The Codex CLI spike should mirror the Claude Code runner capture contract:
- Same artifact set (transcript, final response, git diff, run metadata)
- Same workflow expectations (single-shot execution, headless CLI only)
- Same PTY posture (fallback only; primary capture via stdout/stderr)
Core Principle (from OA03)¶
Normalize outputs, not control surfaces.
The Codex CLI adapter handles CLI-specific mechanics; the kernel receives normalized AgentRunResult regardless of which runner produced it.
Spike Deliverables¶
Pass/Fail Criteria¶
| Test Case | Expected Result | Status |
|---|---|---|
| Basic execution | codex exec "list files in src/" completes | [x] |
| JSONL capture | --json produces parseable event stream | [x] |
| Final response | --output-last-message contains summary | [x] |
| Full-auto mode | --full-auto bypasses approval prompts | [x] |
| Sandbox enforcement | --sandbox read-only prevents writes | [x] |
| Workspace write | --sandbox workspace-write allows project writes | [x] |
| Error handling | Non-zero exit code on failure | [x] |
| Timeout handling | Process can be killed on timeout | [x] |
Artifacts¶
| Artifact | Description |
|---|---|
| codex_cli_spike.py | Spike implementation |
| events.ndjson | Captured JSONL event stream |
| response.txt | Final response from --output-last-message |
| run.json | Run metadata |
| SPIKE_REPORT.md | Findings, gotchas, recommendations |
Decision Point¶
If spike passes → proceed to full Codex CLI adapter implementation under OA03.
If spike fails → document constraints, evaluate alternatives (e.g., wait for CLI improvements).
Output Contract¶
Codex CLI runner must produce the same AgentRunResult structure as Claude Code runner (OA03.1):
from dataclasses import dataclass
from typing import Literal


@dataclass
class AgentRunResult:
    """Normalized output from any agent runner."""

    status: Literal["success", "partial", "blocked", "timeout", "error"]
    transcript: str  # Full JSONL event stream or formatted output
    final_response: str  # Content from --output-last-message
    workspace_diff: str  # Git diff of changes
    exit_code: int
    duration_seconds: float
    metadata: dict  # Agent-specific metadata (model, tokens, etc.)
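To illustrate how a runner might fill this structure, here is a hedged sketch. The classify_status function and its mapping rules are assumptions for illustration only; this ADR does not define how exit codes map to statuses, and the "blocked" status is not covered. The sample values (exit code 0, 6m 47s, gpt-5.2-codex) mirror the run reported in the addendum.

```python
from dataclasses import dataclass, field
from typing import Literal

Status = Literal["success", "partial", "blocked", "timeout", "error"]


@dataclass
class AgentRunResult:
    """Normalized output from any agent runner (fields as in the contract above)."""

    status: Status
    transcript: str
    final_response: str
    workspace_diff: str
    exit_code: int
    duration_seconds: float
    metadata: dict = field(default_factory=dict)


def classify_status(exit_code: int, timed_out: bool, final_response: str) -> Status:
    """Hypothetical mapping from process outcome to a normalized status."""
    if timed_out:
        return "timeout"
    if exit_code != 0:
        return "error"
    # Assumption: a clean exit with no final message is treated as partial.
    return "success" if final_response.strip() else "partial"


result = AgentRunResult(
    status=classify_status(exit_code=0, timed_out=False, final_response="Done."),
    transcript="",  # would hold the captured JSONL stream
    final_response="Done.",
    workspace_diff="",  # would hold the post-run git diff
    exit_code=0,
    duration_seconds=407.0,  # 6m 47s
    metadata={"model": "gpt-5.2-codex"},
)
print(result.status)
```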
Dual-Channel Output¶
Consistent with OA01/OA03:
| Channel | Content | Source |
|---|---|---|
| Transcript | JSONL event stream | --json stdout |
| Semantic | Final response | --output-last-message file |
| Workspace | Git diff | Post-run git diff |
Control Surface Mapping¶
Required by OA03 ADR Gate.
Authoritative Documentation¶
- Codex CLI Reference: https://developers.openai.com/codex/cli/reference
- OpenAI Codex Overview: https://openai.com/codex/
Invocation Modes¶
| Mode | Command | Use Case |
|---|---|---|
| Headless exec | codex exec | Primary execution mode |
| Interactive | codex | Not used for automation |
| MCP server | codex mcp-server | Out of scope for spike |
IO Model¶
- Stateless per invocation: Each codex exec is independent
- Session resume available: codex exec resume (not used in spike)
- Dual output: JSONL to stdout, final response to file
Permission / Safety Controls¶
| Control | Mechanism |
|---|---|
| Approval control | --ask-for-approval flag |
| Sandbox policy | --sandbox flag |
| Directory access | --add-dir for additional paths |
| Model selection | -m / --model flag |
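The controls in this table could be validated before a run is launched. The sketch below assumes a hypothetical helper, safety_flags, that accepts only the values this ADR lists for --ask-for-approval and --sandbox. Where these flags must appear relative to the exec subcommand is not stated here and should be checked against the installed CLI version.

```python
from typing import Iterable

# Allowed values as listed in this ADR's capability table.
APPROVAL_POLICIES = {"untrusted", "on-failure", "on-request", "never"}
SANDBOX_MODES = {"read-only", "workspace-write", "danger-full-access"}


def safety_flags(
    approval: str = "never",
    sandbox: str = "workspace-write",
    extra_dirs: Iterable[str] = (),
) -> list[str]:
    """Return validated permission flags (hypothetical helper)."""
    if approval not in APPROVAL_POLICIES:
        raise ValueError(f"unknown approval policy: {approval!r}")
    if sandbox not in SANDBOX_MODES:
        raise ValueError(f"unknown sandbox mode: {sandbox!r}")
    flags = ["--ask-for-approval", approval, "--sandbox", sandbox]
    for directory in extra_dirs:
        # One --add-dir per additional path.
        flags += ["--add-dir", directory]
    return flags


print(safety_flags(extra_dirs=["/tmp/scratch"]))
```

Rejecting unknown values up front means a typo in configuration fails before any process is spawned, rather than surfacing as a CLI usage error mid-workflow.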
Experiment Matrix¶
| Test Case | Expected Result |
|---|---|
| Headless execution completes | Exit code 0, output captured |
| JSONL events parseable | Valid JSON per line |
| Final response captured | Non-empty file content |
| Sandbox read-only enforced | Write attempts blocked |
| Sandbox workspace-write works | Project files writable |
| Timeout kills process | Process terminates, partial output captured |
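The timeout row can be exercised without Codex installed. This is a minimal sketch using Python's standard subprocess module; run_with_timeout is a hypothetical helper, and a short Python child process stands in for codex exec. It kills only the direct child, so descendants spawned by a real agent run may need process-group handling.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(
    argv: List[str], timeout_seconds: float
) -> Tuple[Optional[int], str, bool]:
    """Run argv and return (exit_code, output, timed_out) (hypothetical helper)."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        output, _ = proc.communicate(timeout=timeout_seconds)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        proc.kill()
        # A second communicate() collects whatever was written before the kill.
        output, _ = proc.communicate()
        return proc.returncode, output, True


# Stand-in for a long-running agent: prints one line, then sleeps.
child = [sys.executable, "-u", "-c", "print('started'); import time; time.sleep(30)"]
code, output, timed_out = run_with_timeout(child, timeout_seconds=3.0)
print(timed_out, repr(output))
```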
| Model pinning | -m gpt-5.2-codex honored in run metadata |
Relationship to OA03.2 (API Runner)¶
This ADR supersedes ADR-OA03.2. The relationship:
| OA03.2 Role | Description |
|---|---|
| Historical context | Documents API constraints and why that approach failed |
| Spike evidence | Learnings inform what to avoid in CLI approach |
| Preserved artifacts | Harness code may inform CLI adapter design |
OA03.2's status is superseded. This ADR (OA03.3) provides the viable Codex execution path.
Consequences¶
Positive¶
- Proven pattern: Matches Claude Code runner (OA03.1) architecture
- Native capabilities: CLI handles tool execution, sandbox, approvals
- Simpler adapter: No custom tool providers needed (unlike API approach)
- Dual-channel capture: JSONL events + final response + git diff
- Client-driven agent: Outputs are harvested, not synthesized, which aligns with OA01.1's principle that orchestration owns provenance and lifecycle, not the agent loop itself
Negative / Tradeoffs¶
- CLI dependency: Requires Codex CLI installed and configured
- Less control: Cannot customize tool surface (CLI provides fixed tools)
- Output parsing: Must handle JSONL event stream format
- External process: Subprocess management vs API call
Risks and Mitigations¶
| Risk | Mitigation |
|---|---|
| CLI behavior differs from docs | Spike validates actual behavior |
| Output format changes | Version-pin CLI, document format |
| Approval prompts leak through | Test --full-auto thoroughly |
| Sandbox insufficient | Combine with git worktree isolation |
Open Questions¶
1. JSONL Event Schema¶
Question: What is the exact schema of --json output events?
Action: Capture and document during spike.
Decision needed by: Spike completion
2. Model Selection¶
Question: Which Codex model variant for implementation tasks?
Options:
- Default (let CLI choose)
- Explicit -m gpt-5.2-codex for consistency
- Task-based selection
Decision needed by: Implementation
Related ADRs¶
- ADR-OA03: Agent Runner Architecture - Parent architecture
- ADR-OA03.1: Claude Code Runner - Sibling runner (pattern reference)
- ADR-OA03.2: Codex Runner (API) - Superseded API approach
- ADR-OA01.1: Conductor Strategy v2 - Parent strategy
Archived Versions¶
- adr-oa03.3-codex-cli-spike-2026-01-29.md - Earlier draft from branch (preserved for reference)
As-Built Notes & Addendums¶
Reserved for post-implementation updates. Never edit the original Context/Decision/Consequences sections; always append addendums here.
Addendum 2026-02-08: Spike Passed, Implementation Complete¶
Status changed: proposed → implemented
The Codex CLI spike passed all criteria with a successful 7-minute implementation run:
Run Details:
- Run ID: 20260208-155213
- Duration: 6m 47s (15:52:13 → 15:59:00 UTC)
- Exit code: 0 (completed)
- Task: Implement ADR-CF02 (Prompt Catalog Discovery)
Artifact Capture:
- 272 events in NDJSON stream
- 108 command executions captured with exit codes
- 70 reasoning items (chain of thought)
- 13KB unified diff
- Structured final report in response.txt
Token Usage:
- Input: 4,239,965 tokens
- Cached: 4,089,472 (96% cache hit)
- Output: 17,936 tokens
Event Types Captured:
- item.completed (137), command_execution (108), reasoning (70)
- item.started (55), file_change (11), todo_list (4)
- Full lifecycle: thread.started → turn.started → items → turn.completed
Codex JSON Schema (observed):
{"type": "item.completed", "item": {"id": "item_N", "type": "command_execution", "command": "...", "aggregated_output": "...", "exit_code": 0, "status": "completed"}}
{"type": "item.completed", "item": {"id": "item_N", "type": "reasoning", "text": "..."}}
{"type": "item.completed", "item": {"id": "item_N", "type": "agent_message", "text": "..."}}
{"type": "turn.completed", "usage": {"input_tokens": N, "cached_input_tokens": N, "output_tokens": N}}
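Given the observed event shapes above, a transcript could be summarized along these lines. This is a sketch under stated assumptions: summarize_events is a hypothetical helper, only the fields shown in the observed schema are read, and the sample stream is hand-built for illustration (its usage numbers are the totals reported in this addendum).

```python
import json
from collections import Counter


def summarize_events(stream: str) -> dict:
    """Count completed item types, keep the last agent message and usage."""
    item_types: Counter = Counter()
    usage: dict = {}
    last_message = ""
    for line_no, line in enumerate(stream.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON") from exc
        if event.get("type") == "item.completed":
            item = event.get("item", {})
            item_types[item.get("type", "unknown")] += 1
            if item.get("type") == "agent_message":
                last_message = item.get("text", "")
        elif event.get("type") == "turn.completed":
            usage = event.get("usage", {})
    input_tokens = usage.get("input_tokens", 0)
    cached = usage.get("cached_input_tokens", 0)
    return {
        "item_types": dict(item_types),
        "last_message": last_message,
        "usage": usage,
        # Share of input tokens served from cache (0.0 when no usage was seen).
        "cache_hit_ratio": cached / input_tokens if input_tokens else 0.0,
    }


# Illustrative stream following the observed schema.
sample_events = [
    {"type": "item.completed", "item": {"id": "item_0", "type": "command_execution",
     "command": "ls src/", "aggregated_output": "", "exit_code": 0, "status": "completed"}},
    {"type": "item.completed", "item": {"id": "item_1", "type": "reasoning", "text": "..."}},
    {"type": "item.completed", "item": {"id": "item_2", "type": "agent_message", "text": "Done."}},
    {"type": "turn.completed", "usage": {"input_tokens": 4239965,
     "cached_input_tokens": 4089472, "output_tokens": 17936}},
]
stream = "\n".join(json.dumps(event) for event in sample_events)
summary = summarize_events(stream)
print(summary["item_types"], f"{summary['cache_hit_ratio']:.1%}")
```

Raising on the first malformed line matches the "valid JSON per line" expectation in the experiment matrix. A production adapter might instead record the bad line and continue.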
Open Questions Resolved:
1. JSONL Event Schema: Documented above. Events are typed with item.started/completed/updated wrappers.
2. Model Selection: -m gpt-5.2-codex confirmed working and recommended for consistency.
Decision: Proceed with full Codex CLI adapter implementation under OA03. Pattern validated as equivalent to Claude Code runner (OA03.1).