ADR-OA03.2: Codex Runner
Codex execution path for tnh-conductor: an API-first, tool-driven runner optimized for implementation and mechanical coding tasks.
- Status: Paused
- Date: 2026-01-22
- Owner: Aaron Solomon
- Author: Aaron Solomon, GPT 5.2, Claude Opus 4.5
- Parent ADR: ADR-OA03
ADR Editing Policy
IMPORTANT: How you edit this ADR depends on its status.
- `proposed` status: ADR is in the design loop. We may rewrite or edit the document as needed to refine the design.
- `accepted`, `wip`, `implemented` status: Implementation has begun or completed. NEVER edit the original Context/Decision/Consequences sections. Only append addendums.
Context
OA03 establishes the runner architecture: a shared kernel handling cross-agent invariants, with per-agent adapters handling surface-specific mechanics. This ADR specifies the Codex runner adapter.
What is Codex?
Codex refers to OpenAI's GPT-Codex model family: specialized versions of GPT models optimized for agentic coding tasks. These are accessed via the standard OpenAI Responses API.
Available GPT-Codex models:
- `gpt-5.2-codex`: Most advanced agentic coding model
- `gpt-5.1-codex` / `gpt-5.1-codex-max`: Project-scale work with context compaction
- `gpt-5-codex` / `gpt-5-codex-mini`: Standard and cost-effective variants
- `codex-mini-latest`: Low-latency, optimized for code Q&A and editing
These are first-class models in the OpenAI API, not a separate product. The runner invokes them via the Responses API with tool definitions.
Codex Positioning
Codex is positioned as the primary engine for code implementation tasks:
| Task Type | Preferred Agent | Rationale |
|---|---|---|
| Implementation (new features, ADR execution) | Codex | Strong reasoning, large context, structured output |
| Exploratory / refactoring | Claude Code | Filesystem-native, interactive exploration |
| Code review / analysis | Claude Code | Conversational, nuanced assessment |
| Documentation | Either | Task-dependent |
This division reflects each agent's strengths, not a hard constraint. The Planner may route tasks based on capability metadata.
Why Not the VS Code Extension?
Codex offers a VS Code extension with rich IDE integration. We explicitly do not attempt to drive or replicate this:
- Rebuilding IDE semantics is out of scope and unnecessary
- Codex's strength comes from reasoning + large context, not UI interactivity
- Tool-driven Codex recovers most IDE affordances with far lower complexity
- Aligns with prompt-program runtime model: behavior defined in prompts, execution via kernel + tools
Decision
Execution Model
Codex is invoked headlessly via the API, not via the VS Code extension and not via a PTY.
TaskSpec + RepoSnapshot + Policy
↓
Codex (Responses API)
↓
Structured Output (patch + reasoning)
Execution constraints:
| Constraint | Value | Rationale |
|---|---|---|
| Single-shot execution | Per step | No multi-turn within a step |
| Clarifying questions | Disallowed | Must emit output or block |
| Conversational memory | None | Stateless per invocation |
| Parameters | Deterministic | Set by workflow, not runtime negotiation |
Core Principle
Expose the minimum viable IDE as explicit tools. No implicit workspace access.
Codex receives workspace context only through explicit tool calls. This makes all workspace interaction auditable and policy-enforceable.
Tool Surface Contract
Required Tools (v1)
| Tool | Signature | Purpose |
|---|---|---|
| `read_file` | `read_file(path: str) -> str` | Read file contents |
| `search_repo` | `search_repo(query: str) -> list[Match]` | Search codebase (grep-like) |
| `list_files` | `list_files(path: str) -> list[str]` | List directory contents |
| `apply_patch` | `apply_patch(diff: str) -> PatchResult` | Apply unified diff |
| `run_tests` | `run_tests(command: str) -> TestResult` | Execute test command |
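The v1 tool surface could be typed roughly as follows. This is a minimal sketch, not the harness's actual interface: the ADR names `Match`, `PatchResult`, and `TestResult` without defining them, so every field shown here is an assumption, as are the `ToolSurface` and `REQUIRED_TOOLS` names.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Match:
    # Assumed fields: the ADR only names this type.
    path: str
    line: int
    text: str


@dataclass(frozen=True)
class PatchResult:
    # Assumed fields: whether the diff applied, plus a diagnostic message.
    applied: bool
    message: str = ""


@dataclass(frozen=True)
class TestResult:
    # Assumed fields: pass/fail flag plus captured test output.
    passed: bool
    output: str = ""


class ToolSurface(Protocol):
    """The five v1 tools, with the signatures listed in the table above."""

    def read_file(self, path: str) -> str: ...
    def search_repo(self, query: str) -> list[Match]: ...
    def list_files(self, path: str) -> list[str]: ...
    def apply_patch(self, diff: str) -> PatchResult: ...
    def run_tests(self, command: str) -> TestResult: ...


# Names the adapter would declare to the model; nothing else is reachable.
REQUIRED_TOOLS = ("read_file", "search_repo", "list_files", "apply_patch", "run_tests")
```

Typing the surface as a `Protocol` keeps the kernel free to supply any implementation (real worktree, in-memory fake for tests) without inheritance.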
Non-Goals (Explicit Exclusions)
| Exclusion | Rationale |
|---|---|
| Terminal emulation | Not needed; tools provide structured access |
| Interactive retries | Single-shot model; retry is Planner's decision |
| Hidden filesystem state | All access via explicit tools |
| UI events | No VS Code extension integration |
Tool Policy Enforcement
Tools are subject to RunPolicy constraints:
- `read_file` / `list_files`: restricted to allowed paths
- `apply_patch`: diff checked against forbidden paths before application
- `run_tests`: command validated against allowed test commands
Policy violations result in tool call rejection, not step failure. Codex may retry with different parameters.
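The three checks above might look like the following sketch. The `RunPolicy` shape is hypothetical (the real one is defined by the kernel under OA03), and the helper names are illustrative. Paths are treated as repo-relative POSIX paths, and anything absolute or containing `..` is refused outright.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RunPolicy:
    # Hypothetical shape; field names are assumptions for illustration.
    allowed_paths: tuple[str, ...]
    forbidden_paths: tuple[str, ...]
    allowed_test_commands: tuple[str, ...]


def _safe_parts(path: str) -> tuple[str, ...] | None:
    """Return path components, or None for absolute paths and '..' escapes."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return None
    return p.parts


def _is_under(parts: tuple[str, ...], root: str) -> bool:
    root_parts = PurePosixPath(root).parts
    return parts[: len(root_parts)] == root_parts


def check_read(policy: RunPolicy, path: str) -> bool:
    """read_file / list_files: path must sit under an allowed root."""
    parts = _safe_parts(path)
    return parts is not None and any(_is_under(parts, r) for r in policy.allowed_paths)


def patch_targets(diff: str) -> list[str]:
    """Collect paths from '--- a/...' and '+++ b/...' unified-diff headers.

    Hunk lines that happen to start with '--- ' or '+++ ' are also picked up,
    so this errs on the side of rejecting too much rather than too little.
    """
    targets = []
    for line in diff.splitlines():
        m = re.match(r"^(?:---|\+\+\+) (?:[ab]/)?(\S+)", line)
        if m and m.group(1) != "/dev/null":
            targets.append(m.group(1))
    return targets


def check_patch(policy: RunPolicy, diff: str) -> bool:
    """apply_patch: reject if any target is unsafe or under a forbidden root."""
    for target in patch_targets(diff):
        parts = _safe_parts(target)
        if parts is None or any(_is_under(parts, r) for r in policy.forbidden_paths):
            return False
    return True


def check_test_command(policy: RunPolicy, command: str) -> bool:
    """run_tests: exact match against the allowlist, no shell interpretation."""
    return command.strip() in policy.allowed_test_commands
```

A failed check would be reported back to the model as a rejected tool call, leaving the step itself alive, which matches the rejection semantics described above.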
Output Contract
Codex must emit schema-validated structured output:
# Codex step output schema
output:
  patch: str | null             # Unified diff or null if no changes
  file_operations: list         # Create/delete/rename operations
    - operation: "create" | "delete" | "rename"
      path: str
      content: str | null       # For create only
      target: str | null        # For rename only
  rationale: str                # Why this change addresses the task
  risk_flags: list[str]         # Concerns requiring attention
  open_questions: list[str]     # If blocked, what needs clarification
  status: "complete" | "partial" | "blocked"
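One way to check a payload against this schema, sketched with plain dataclasses. The names `CodexOutput`, `FileOperation`, `SchemaError`, and `parse_output` are hypothetical, and treating every field as required is an assumption; the ADR does not say which fields may be omitted.

```python
from __future__ import annotations

from dataclasses import dataclass

VALID_STATUS = ("complete", "partial", "blocked")
VALID_OPERATIONS = ("create", "delete", "rename")


class SchemaError(ValueError):
    """Raised when a payload does not match the step output schema."""


@dataclass(frozen=True)
class FileOperation:
    operation: str
    path: str
    content: str | None = None  # for create only
    target: str | None = None   # for rename only


@dataclass(frozen=True)
class CodexOutput:
    patch: str | None
    file_operations: list[FileOperation]
    rationale: str
    risk_flags: list[str]
    open_questions: list[str]
    status: str


def _str_list(out: dict, key: str) -> list[str]:
    value = out.get(key)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise SchemaError(f"{key} must be a list of strings")
    return value


def parse_output(raw: object) -> CodexOutput:
    """Validate a decoded JSON payload and return a typed CodexOutput."""
    out = raw.get("output") if isinstance(raw, dict) else None
    if not isinstance(out, dict):
        raise SchemaError("missing 'output' object")
    patch = out.get("patch")
    if patch is not None and not isinstance(patch, str):
        raise SchemaError("patch must be a string or null")
    if not isinstance(out.get("rationale"), str):
        raise SchemaError("rationale must be a string")
    if out.get("status") not in VALID_STATUS:
        raise SchemaError("status must be complete, partial, or blocked")
    raw_ops = out.get("file_operations")
    if not isinstance(raw_ops, list):
        raise SchemaError("file_operations must be a list")
    ops = []
    for item in raw_ops:
        if (not isinstance(item, dict)
                or item.get("operation") not in VALID_OPERATIONS
                or not isinstance(item.get("path"), str)):
            raise SchemaError("invalid file operation")
        ops.append(FileOperation(item["operation"], item["path"],
                                 item.get("content"), item.get("target")))
    return CodexOutput(patch, ops, out["rationale"], _str_list(out, "risk_flags"),
                       _str_list(out, "open_questions"), out["status"])
```

A `SchemaError` here would correspond to the "invalid schema" case, in which no changes are applied.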
Output Validation Rules
| Condition | Result |
|---|---|
| Valid schema, `status: complete` | Kernel applies patch, marks step complete |
| Valid schema, `status: partial` | Kernel applies patch, Planner decides next step |
| Valid schema, `status: blocked` | No patch applied, Planner evaluates blockers |
| Invalid schema | Step marked unsafe, no changes applied |
| No output (timeout) | Step marked blocked, reason timeout |
Failure to emit valid structure → step marked unsafe.
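The validation rules reduce to a small decision function. The sketch below is illustrative only: `StepState`, `StepOutcome`, and `classify_step` are hypothetical names, and the `reason` strings are placeholders rather than kernel vocabulary.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class StepState(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    BLOCKED = "blocked"
    UNSAFE = "unsafe"


@dataclass(frozen=True)
class StepOutcome:
    state: StepState
    apply_patch: bool
    reason: str = ""


def classify_step(*, timed_out: bool, schema_valid: bool, status: str | None) -> StepOutcome:
    """Map one invocation result onto the rows of the validation table."""
    if timed_out:
        # No output at all: blocked, with the timeout recorded as the reason.
        return StepOutcome(StepState.BLOCKED, False, "timeout")
    if not schema_valid:
        # Anything that fails schema validation is unsafe; nothing is applied.
        return StepOutcome(StepState.UNSAFE, False, "invalid schema")
    if status == "complete":
        return StepOutcome(StepState.COMPLETE, True)
    if status == "partial":
        return StepOutcome(StepState.PARTIAL, True, "planner decides next step")
    if status == "blocked":
        return StepOutcome(StepState.BLOCKED, False, "planner evaluates blockers")
    # A validated payload should never reach this branch; fail closed anyway.
    return StepOutcome(StepState.UNSAFE, False, "unknown status")
```

The patch is applied only in the `complete` and `partial` branches; every other path leaves the worktree untouched.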
Kernel Integration
Adapter Responsibilities
The Codex runner adapter handles:
| Responsibility | Implementation |
|---|---|
| API invocation | Codex Responses API with tool definitions |
| Tool execution | Dispatch tool calls to kernel utilities |
| Output parsing | Validate against schema, extract artifacts |
| Event emission | Translate API events to kernel event stream |
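The tool-execution responsibility could be sketched as a registry lookup plus argument decoding. This assumes each tool call arrives as a tool name and a JSON-encoded argument string, in the style of function calling; `dispatch_tool_call`, `ToolRejected`, and the result envelope are hypothetical names, not part of the harness.

```python
import json
from typing import Any, Callable

ToolFn = Callable[..., Any]


class ToolRejected(Exception):
    """Raised by a tool implementation when RunPolicy forbids the call."""


def dispatch_tool_call(registry: dict[str, ToolFn], name: str, arguments_json: str) -> dict[str, Any]:
    """Run one tool call and return a result envelope instead of raising.

    Returning errors as data lets the adapter hand a rejection back to the
    model as a tool result, so a policy violation does not fail the step.
    """
    tool = registry.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        kwargs = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid arguments: {exc}"}
    if not isinstance(kwargs, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except ToolRejected as exc:
        return {"ok": False, "error": f"rejected by policy: {exc}"}
    except TypeError as exc:
        # Usually a missing or unexpected argument name. Note this also
        # catches TypeErrors raised inside the tool itself.
        return {"ok": False, "error": f"bad arguments: {exc}"}
```

Because only names present in `registry` can be invoked, the registry doubles as the "only declared tools available" control.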
Kernel Responsibilities (Unchanged from OA03)
| Responsibility | Description |
|---|---|
| Workspace isolation | Git worktree setup before Codex invocation |
| Pre/post capture | Git status + diff before and after |
| Patch application | Apply validated diff from Codex output |
| Policy enforcement | Check diff against RunPolicy |
| Provenance recording | Write run record with all artifacts |
Dual-Channel Output
Consistent with OA01/OA03, Codex produces dual-channel output:
| Channel | Content | Source |
|---|---|---|
| Workspace | Proposed diff, file operations | patch, file_operations fields |
| Semantic | Reasoning, risk assessment | rationale, risk_flags, open_questions |
The Planner consumes both channels for evaluation. Contradictions (e.g., status: complete but empty patch for implementation task) trigger partial or unsafe classification.
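A cross-channel consistency check might be sketched as below. Only the first rule (complete status with no changes on an implementation task) comes from this ADR; the other two rules, the `expects_changes` flag, and the function name are assumptions added for illustration.

```python
from __future__ import annotations


def find_contradictions(
    *,
    status: str,
    patch: str | None,
    file_operations: list,
    open_questions: list[str],
    expects_changes: bool,
) -> list[str]:
    """Return human-readable contradictions between the two channels."""
    issues = []
    has_changes = bool(patch and patch.strip()) or bool(file_operations)

    # From the ADR: "complete" with an empty patch on an implementation task.
    if status == "complete" and expects_changes and not has_changes:
        issues.append("status is complete but no changes were proposed")

    # Assumed rule: a blocked step should explain what it is blocked on.
    if status == "blocked" and not open_questions:
        issues.append("status is blocked but no open questions were given")

    # Assumed rule: a blocked step should not also propose changes.
    if status == "blocked" and has_changes:
        issues.append("status is blocked but changes were proposed")

    return issues
```

An empty list means the channels agree; a non-empty list is what the Planner would weigh when downgrading a step to partial or unsafe.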
Control Surface Mapping
Required by OA03 ADR Gate.
Authoritative Documentation
- Codex Models: OpenAI GPT-5-Codex
- Responses API: OpenAI Responses API
- Codex Overview: OpenAI Codex
Invocation Modes
| Mode | Description | Use Case |
|---|---|---|
| Responses API | Stateless, tool-enabled | Primary execution mode |
| Chat Completions | Multi-turn, no native tools | Not used |
IO Model
- Stateless: Each invocation is independent
- No session persistence: Context provided per-call via system prompt + tools
- Structured output: JSON schema enforcement available
Permission / Safety Controls
| Control | Mechanism |
|---|---|
| Tool restriction | Only declared tools available |
| Output schema | Enforced via API response_format |
| Token limits | Set per invocation |
| No network access | Tools don't expose network operations |
Output Formats
- JSON (structured output mode)
- Tool calls (function calling format)
Experiment Matrix
| Test Case | Expected Result |
|---|---|
| Simple file read | Returns file content |
| Search repo | Returns matching locations |
| Apply valid patch | Patch applied successfully |
| Apply patch to forbidden path | Tool call rejected |
| Invalid output schema | Step marked unsafe |
| Timeout | Step marked blocked |
Validation requirement: All test cases must pass before moving to accepted status. See Open Questions for validation tracking.
Relationship to Claude Code Runner (OA03.1)
| Aspect | Claude Code (OA03.1) | Codex (OA03.2) |
|---|---|---|
| Invocation | CLI (`--print` mode) | API (Responses API) |
| Workspace access | Filesystem-native | Tool-mediated |
| Output format | stdout/JSON stream | Structured JSON |
| Best for | Exploration, refactoring, review | Implementation, mechanical coding |
| Session model | Single-shot CLI | Single-shot API |
Both are first-class sub-agents under the same conductor, planner, and policy system. No special-case kernel logic beyond agent capability metadata.
Consequences
Positive
- Clean API-first model avoids VS Code extension complexity
- Explicit tool surface makes all workspace access auditable
- Structured output enables reliable Planner evaluation
- Aligns with prompt-program runtime: tools are the "IDE", prompts define behavior
Negative / Tradeoffs
- Limited to capabilities expressible via tool surface
- No real-time feedback during execution (single-shot)
- Requires maintaining tool implementations in kernel
- API costs per invocation
Alternatives Considered
1. Drive VS Code Extension
Approach: Automate the Codex VS Code extension UI.
Rejected because:
- Rebuilding IDE semantics is massive scope
- Brittle (UI changes break automation)
- Unnecessary: the tool surface provides the required capabilities
2. Multi-Turn Conversational Mode
Approach: Allow Codex to ask clarifying questions mid-step.
Rejected because:
- Violates single-shot step model
- Complicates Planner evaluation
- Clarification is Planner's job, not sub-agent's
3. Shared Tool Surface with Claude Code
Approach: Same tool definitions for both agents.
Rejected because:
- Claude Code is filesystem-native; forcing tool indirection adds complexity
- Different agents have different optimal interfaces
- OA03 principle: normalize outputs, not control surfaces
Open Questions
1. Experiment Matrix Validation
Question: Have all control surface mapping tests been validated?
Tests required:
- Simple file read returns content
- Search repo returns matching locations
- Apply valid patch succeeds
- Apply patch to forbidden path is rejected
- Invalid output schema marks step unsafe
- Timeout marks step blocked
Decision needed by: Before moving to accepted status
2. Codex Model Selection
Question: Which Codex model variant for implementation tasks?
Options:
- `gpt-5.2-codex`: Most capable, higher cost
- `gpt-5.1-codex-max`: Good for project-scale work
- `codex-mini-latest`: Cost-effective for simpler tasks
- Model routing based on task complexity
Decision needed by: Implementation start
3. Tool Expansion
Question: What additional tools are needed beyond the v1 set?
Candidates:
- `get_symbols(path)`: LSP-like symbol extraction
- `get_references(symbol)`: Find usages
- `run_lint(path)`: Lint specific file
Decision needed by: After v1 validation
4. Context Window Management
Question: How should the runner handle tasks that require more context than fits in a single invocation?
Options:
- Chunked execution (multiple steps)
- Summarization of large files
- Selective context based on task analysis
Decision needed by: Implementation
Related ADRs
- ADR-OA03: Agent Runner Architecture (parent architecture)
- ADR-OA03.1: Claude Code Runner (sibling runner specification)
- ADR-OA01: Agent Orchestration Strategy (parent strategy)
As-Built Notes & Addendums
Reserved for post-implementation updates. Never edit the original Context/Decision/Consequences sections; always append addendums here.
Addendum 2026-01-25: Spike Findings, Development Suspended
Author: Claude Opus 4.5
Status Change: proposed → suspended
Summary
A proof-of-concept harness was built and tested against the Responses API. The spike revealed that the core assumption of this ADR is invalid: GPT-Codex models do not reliably produce structured output after tool-calling rounds.
Key Findings
- No Final Text Output: When tools are available, the model treats tool calls as the "output" and stops. The API returns `status: completed` with only `function_call` blocks and no text/JSON response.
- API Parameter Limitations: The Responses API does not support `response_format` for schema enforcement, rejects `temperature` settings, and has no `tool_choice: none` equivalent to force text output.
- VS Code Extension Architecture: The Codex VS Code extension does not rely on model-generated summaries. It tracks tool calls client-side and synthesizes summaries from observed behavior. This is a fundamentally different architecture than the harness approach.
- Cost Implications: Without VS Code's interactive feedback loop, API costs accumulate without proportional productivity gain.
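The first finding can be detected mechanically. The sketch below assumes the response's output items have been converted to plain dicts, with `message` items carrying `output_text` content parts and tool calls appearing as `function_call` items; that shape and the helper names are assumptions for illustration, not code from the spike harness.

```python
from __future__ import annotations


def final_text(output_items: list[dict]) -> str | None:
    """Concatenate text from message items; None if the model emitted none."""
    chunks = []
    for item in output_items:
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                chunks.append(part.get("text", ""))
    text = "".join(chunks).strip()
    return text or None


def is_tool_only(output_items: list[dict]) -> bool:
    """True when the response has tool calls but no final text at all."""
    has_calls = any(item.get("type") == "function_call" for item in output_items)
    return has_calls and final_text(output_items) is None
```

A response for which `is_tool_only` holds has nothing to validate against the output schema, which is the failure mode the spike observed.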
Invalid Assumptions
| ADR Assumption | Reality |
|---|---|
| "Codex must emit schema-validated structured output" | Model emits tool calls only, no text response |
| "Output Validation Rules" would classify responses | Responses have no text to validate |
| "Structured output enables reliable Planner evaluation" | No structured output is produced |
Recommendation
Suspend development until one of the following conditions is met:
- OpenAI adds a forced-text-output mode to the Responses API
- A business need justifies building VS Code-style client orchestration
- The Chat Completions API proves more suitable for this use case
Preserved Artifacts
The spike produced working code that can be resumed:
- Harness package: `src/tnh_scholar/agent_orchestration/codex_harness/`
- CLI entrypoint: `src/tnh_scholar/cli_tools/tnh_codex_harness/`
- Tool tests: `tests/agent_orchestration/test_codex_harness_tools.py`
- Detailed findings: `/architecture/agent-orchestration/notes/codex-harness-spike-findings.md`
Impact on Parent ADRs
- ADR-OA03 (Agent Runner Architecture): Codex runner is not viable as specified. Claude Code runner (OA03.1) should be prioritized.
- ADR-OA01 (Orchestration Strategy): Multi-agent approach remains valid, but Codex cannot serve as the primary implementation engine until API constraints are resolved.