Codex Harness End-to-End Test Report
Context
We stood up a minimal Codex harness to validate a full tool-calling loop (read/search/apply/run tests) against the repo. This report captures what worked, what blocked, and the practical gotchas to help other agents reproduce and extend the test.
Current State
- Harness runs via `tnh-codex-harness` (Typer CLI) and writes artifacts under `.tnh-codex/runs/<timestamp>/`.
- Tool calls are executed successfully (read/search), but final structured output is often missing.
- The run ends "blocked" when no final JSON response is produced after tool rounds.
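The "blocked" outcome described above can be sketched as a small classification step. This is a minimal illustration, not the harness's actual code: `RunResult` and `finalize_run` are hypothetical names, and the assumption is that the harness has the model's final text (or `None`) in hand once the tool rounds finish.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RunResult:
    """Hypothetical result record; the real harness may store more fields."""
    status: str  # "completed" or "blocked"
    payload: Optional[dict] = None
    reason: str = ""


def finalize_run(final_text: Optional[str]) -> RunResult:
    """Classify a run from the model's final text after the tool rounds."""
    # Tool-call-only responses leave no final text, so the run is blocked.
    if not final_text or not final_text.strip():
        return RunResult(status="blocked", reason="no final output text")
    try:
        payload: Any = json.loads(final_text)
    except json.JSONDecodeError as exc:
        return RunResult(status="blocked", reason=f"invalid JSON: {exc.msg}")
    # Only a JSON object counts as a structured final response in this sketch.
    if not isinstance(payload, dict):
        return RunResult(status="blocked", reason="final output is not a JSON object")
    return RunResult(status="completed", payload=payload)
```

Recording a `reason` alongside the status makes it easier to tell a missing final message apart from a malformed one when reading the run artifacts.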
What Worked
- Tools executed with real outputs and filesystem access.
- Sandbox patch sync updates code without committing or switching branches.
- Codex runs can be executed from the sandbox with a copied `.env`.
Key Blockers
- No final output text: The model often emits tool calls only. Without a final JSON response, parsing fails.
- Responses API limitations: `response_format` and `temperature` are rejected, so schema enforcement must use the `text` config.
- Sandbox hygiene: Sync resets remove `.env`, so credentials must be copied after sync.
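One way to express schema enforcement through the `text` config is sketched below. The schema fields (`status`, `summary`), the format name, and the `build_request` helper are all hypothetical. The `text.format` shape follows our understanding of the Responses API and should be checked against the current API reference before use.

```python
from typing import Any

# Hypothetical schema for the harness's final answer; field names are illustrative.
FINAL_OUTPUT_SCHEMA: dict = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["completed", "blocked"]},
        "summary": {"type": "string"},
    },
    "required": ["status", "summary"],
    "additionalProperties": False,
}


def build_request(model: str, task: str) -> dict:
    """Build keyword arguments for a Responses API call.

    Deliberately omits response_format and temperature, which this
    report found to be rejected.
    """
    request: dict[str, Any] = {
        "model": model,
        "input": task,
        "text": {
            "format": {
                "type": "json_schema",
                "name": "harness_final_output",  # hypothetical name
                "schema": FINAL_OUTPUT_SCHEMA,
                "strict": True,
            }
        },
    }
    return request
```

With the OpenAI Python SDK, the result could be passed along the lines of `client.responses.create(**build_request(model_name, task))`, assuming the installed SDK version accepts these parameters.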
Operational Gotchas
- Run `make sync-sandbox` from the source repo root, not inside the sandbox.
- After sync, copy `.env` to sandbox before running the harness.
- CLI is `tnh-codex-harness` with `--task`; there is no `run` subcommand.
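The ordering of the first two gotchas can be captured in a short helper. This is a sketch under stated assumptions: `sync_and_restore_env` is a hypothetical name, the `.env` file is assumed to live in the source repo root, and the sandbox directory is assumed to exist after the sync. The `run` parameter exists only so the command runner can be swapped out in tests.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def sync_and_restore_env(
    source_root: Path,
    sandbox_root: Path,
    run: Callable[..., object] = subprocess.run,
) -> Path:
    """Sync the sandbox from the source repo root, then copy .env back in."""
    # The make target must run from the source repo root, not the sandbox.
    run(["make", "sync-sandbox"], cwd=source_root, check=True)
    # The sync reset removes .env, so copy it only after the sync finishes.
    target = sandbox_root / ".env"
    shutil.copy2(source_root / ".env", target)
    return target
```

After this step the harness can be started from the sandbox with `tnh-codex-harness --task` followed by the task text.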
Recommended Next Steps
- Enforce structured output with `text` JSON schema in the Responses API.
- Persist tool results into artifacts for easier debugging.
- Add a termination rule when tool calls repeat without final output.
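The last two steps could look roughly like the following. Everything here is illustrative: `RepeatGuard`, `persist_tool_result`, the `tool_results.jsonl` file name, and the default threshold of three repeats are assumptions, not part of the existing harness.

```python
import json
from pathlib import Path
from typing import Any


def call_signature(name: str, arguments: dict) -> str:
    """Stable key: the same tool with the same arguments gives the same key."""
    return f"{name}:{json.dumps(arguments, sort_keys=True)}"


class RepeatGuard:
    """Signals a stop once an identical tool call has been seen max_repeats times."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._counts: dict = {}

    def should_stop(self, name: str, arguments: dict) -> bool:
        key = call_signature(name, arguments)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] >= self.max_repeats


def persist_tool_result(
    run_dir: Path, round_index: int, name: str, arguments: dict, output: Any
) -> Path:
    """Append one tool result as a JSON line inside the run directory."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "tool_results.jsonl"  # hypothetical artifact name
    record = {
        "round": round_index,
        "tool": name,
        "arguments": arguments,
        "output": output,
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return path
```

In a tool loop, the harness would call `persist_tool_result` after each tool execution and end the run as "blocked" as soon as `should_stop` returns `True`, rather than waiting for a final message that never arrives.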
References
- /architecture/agent-orchestration/adr/adr-oa03-agent-runner-architecture.md
- /architecture/agent-orchestration/adr/adr-oa03.1-claude-code-runner.md
- SPIKE_REPORT.md