SPIKE-09 Prompt Dir Three-Arm Comparison¶
This experiment note compares direct Codex, supervisory Codex, and kernel-mediated orchestration on the same bounded implementation task.
Experiment ID¶
SPIKE-09
Question¶
On the same bounded implementation task, how do three arms compare in practical usefulness and behavior?
- direct single-agent Codex
- supervisory Codex with native subagents
- kernel-mediated orchestration through tnh-conductor
Setup¶
Task brief:
/docs/architecture/agent-orchestration/supervisory-shell-trial/prompt-dir-task-brief.md
Arms:
- direct Codex in isolated worktree: tmp/prompt-dir-arms/worktrees/direct
- supervisory Codex in isolated worktree: tmp/prompt-dir-arms/worktrees/supervisory
- kernel-mediated orchestration in managed worktree: .tnh-conductor/worktrees/20260416T151658Z
All Codex launches used the sanitized user-like environment established in SPIKE-08.
Primary artifacts:
- tmp/prompt-dir-arms/direct.stdout.jsonl
- tmp/prompt-dir-arms/direct.stderr.log
- tmp/prompt-dir-arms/supervisory.stdout.jsonl
- tmp/prompt-dir-arms/supervisory.stderr.log
- .tnh-conductor/runs/20260416T151658Z/events.ndjson
- .tnh-conductor/runs/20260416T151658Z/metadata.json
- .tnh-conductor/runs/20260416T151658Z/artifacts/implement/policy_summary.json
Result¶
All three arms produced useful implementation work, which is the main positive finding of this experiment.
This was a small bounded task, so the direct arm naturally had the least coordination overhead and therefore looked cleaner. That should not be read as the main conclusion of the experiment.
The more important finding is that both alternative orchestration shapes showed credible viability:
- supervisory Codex demonstrated real native subagent invocation, recovery, and synthesis behavior
- kernel-mediated orchestration produced a real implementation path in its managed worktree rather than only scaffolding or trace output
The main weaknesses observed in the supervisory and kernel arms were operational:
- runtime overhead
- weak or delayed stopping behavior
- opaque execution-state visibility
- some fragility in delegation and artifact integration
Those look more like improvable runtime issues than evidence that the orchestration approaches are fundamentally unworkable.
Observed at cutoff:
- direct arm: completed cleanly with final report and targeted tests passing
- supervisory arm: worktree reached a complete-looking implementation, but the parent supervisor process was still running when observation was cut off
- kernel arm: managed worktree contained a substantial implementation diff, but the conductor run had emitted only step_started provenance and had not written a final summary at cutoff
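The kernel arm's "started but never finished" state can be checked mechanically from an event log. The following is a minimal sketch, assuming each line of events.ndjson is a JSON object with an "event" field naming the event type; that field name and the terminal event names are hypothetical and not confirmed by the conductor's actual schema. The sample record is illustrative, not copied from the run.

```python
import json
from collections import Counter

# Hypothetical terminal event names; the real conductor schema may differ.
TERMINAL_EVENTS = {"run_completed", "run_failed"}


def summarize_events(ndjson_text: str) -> dict:
    """Count event types and report whether any terminal event was seen."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "event" is an assumed field name for the event type.
        counts[record.get("event", "unknown")] += 1
    return {
        "counts": dict(counts),
        "finished": any(name in TERMINAL_EVENTS for name in counts),
    }


# Illustrative record only; not taken from the actual run log.
sample = '{"event": "step_started", "step": "implement"}\n'
print(summarize_events(sample))
# → {'counts': {'step_started': 1}, 'finished': False}
```

A log that contains only step_started records would report finished as False, which matches the state observed at cutoff.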
Arm A: Direct Codex¶
Outcome:
- implemented the feature directly
- touched 8 files
- targeted validation passed
Changed files:
- /docs/cli-reference/tnh-gen.md
- /src/tnh_scholar/cli_tools/tnh_gen/commands/config.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/list.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/run.py
- /src/tnh_scholar/cli_tools/tnh_gen/state.py
- /src/tnh_scholar/cli_tools/tnh_gen/tnh_gen.py
- /tests/cli_tools/test_tnh_gen.py
- /tests/cli_tools/test_tnh_gen_coverage.py
Validation:
- poetry run pytest tests/cli_tools/test_tnh_gen.py tests/cli_tools/test_tnh_gen_coverage.py
- result: 65 passed, 2 warnings
Practical read:
- on a simple bounded task, this arm had the least overhead and therefore the cleanest execution path
- it remains the right baseline for judging what the more complex orchestration arms add or cost
Arm B: Supervisory Codex¶
Outcome:
- attempted explicit implementation and test/docs delegation to native subagents
- first two spawn_agent attempts failed because the parent thread rollout was unavailable for fork
- recovered by rephrasing the subagent launches without forked thread context
- implementation subagent succeeded
- test/docs subagent initially failed on patch application, but the worktree still ended up with the full expected file set
Changed files:
- /docs/cli-reference/tnh-gen.md
- /src/tnh_scholar/cli_tools/tnh_gen/commands/config.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/list.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/run.py
- /src/tnh_scholar/cli_tools/tnh_gen/state.py
- /src/tnh_scholar/cli_tools/tnh_gen/tnh_gen.py
- /tests/cli_tools/test_tnh_gen.py
- /tests/cli_tools/test_tnh_gen_coverage.py
Validation:
- direct inspection of the worktree passed
- poetry run pytest tests/cli_tools/test_tnh_gen.py tests/cli_tools/test_tnh_gen_coverage.py
- result: 63 passed, 2 warnings
Important evidence:
- real delegation attempts are visible in tmp/prompt-dir-arms/supervisory.stdout.jsonl
- failure evidence is visible in tmp/prompt-dir-arms/supervisory.stderr.log
Practical read:
- this arm did prove native subagent invocation is viable in a headless run
- the supervisor respected the contract better than earlier experiments and did not immediately collapse into single-agent execution
- the main weaknesses were coordination fragility and stopping/visibility issues, not lack of useful output
- on a more decomposable task, this shape could plausibly outperform a single-agent pass if the current runtime issues are improved
Arm C: Kernel-Mediated Orchestration¶
Outcome:
- managed worktree contained a substantial implementation diff before the run was cut off
- controller-level provenance remained sparse
- the run did not emit a final summary during observation
Changed files:
- /docs/cli-reference/tnh-gen.md
- /src/tnh_scholar/cli_tools/tnh_gen/commands/config.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/list.py
- /src/tnh_scholar/cli_tools/tnh_gen/commands/run.py
- /src/tnh_scholar/cli_tools/tnh_gen/config_loader.py
- /src/tnh_scholar/cli_tools/tnh_gen/state.py
- /src/tnh_scholar/cli_tools/tnh_gen/tnh_gen.py
- /tests/cli_tools/test_tnh_gen.py
- /tests/cli_tools/test_tnh_gen_coverage.py
Structural distinction:
- unlike the other two arms, this one introduced an explicit helper in config_loader.py for non-null override merging
- that is the main unique design direction across the three arms
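For readers unfamiliar with the pattern, non-null override merging can be sketched as follows. This is an illustration of the general technique only, not the helper the kernel arm actually wrote; the function name and the config keys (prompt_dir, model) are hypothetical.

```python
from typing import Any, Mapping


def merge_non_null_overrides(
    base: Mapping[str, Any], overrides: Mapping[str, Any]
) -> dict[str, Any]:
    """Return a copy of base in which only non-None override values win."""
    merged = dict(base)
    for key, value in overrides.items():
        if value is not None:  # None means "flag not supplied", so keep base
            merged[key] = value
    return merged


# Hypothetical keys and values, chosen only for illustration.
config = {"prompt_dir": "/default/prompts", "model": "base-model"}
cli_overrides = {"prompt_dir": "/tmp/custom-prompts", "model": None}
print(merge_non_null_overrides(config, cli_overrides))
# → {'prompt_dir': '/tmp/custom-prompts', 'model': 'base-model'}
```

Treating None as "not supplied" lets an unset CLI option fall through to the resolved config instead of clobbering it, which is consistent with routing the override through config loading rather than around it.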
Validation:
- local targeted pytest reproduction from the managed worktree did not complete cleanly because the fresh managed worktree virtualenv did not have pytest available yet
- result at manual check time: pyenv: pytest: command not found
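A validation handoff could detect this condition before attempting a test run. The sketch below is one possible preflight check, using only the Python standard library; the function name is hypothetical, and it reports on whichever interpreter executes it, so it would need to be run with the managed worktree's own environment to be meaningful.

```python
import importlib.util
import shutil


def validation_tool_status(module_name: str = "pytest") -> dict:
    """Report whether a test runner is importable and resolvable on PATH."""
    return {
        # True if the current interpreter can import the module.
        "importable": importlib.util.find_spec(module_name) is not None,
        # True if an executable with this name is found on PATH.
        "on_path": shutil.which(module_name) is not None,
    }


print(validation_tool_status())
```

Recording this status alongside the run artifacts would distinguish "tests failed" from "tests could not be run", which the manual check above had to establish by hand.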
Practical read:
- the kernel path did produce meaningful code, not just scaffolding
- this is positive evidence that the existing conductor codepath is viable enough to keep in the spike
- the main current limitations are weak observability, unclear completion state, and higher inspection cost
- those limitations may be tractable enough that the kernel path could become much more competitive on larger tasks
Overlap And Uniqueness¶
Overlap:
- all three arms converged on the same user-facing feature shape
- all three touched the same core files for CLI callback, shared state, list/run/config plumbing, tests, and CLI docs
- all three routed the override through config loading rather than bypassing config resolution entirely
Uniqueness:
- direct arm: most complete and best validated
- supervisory arm: only arm with explicit native subagent evidence and coordination recovery behavior
- kernel arm: only arm to push the override-handling logic further down into config_loader.py
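The overlap and uniqueness claims can be reproduced with plain set arithmetic over the changed-file lists reported for each arm. The paths below are taken from those lists; the variable names are illustrative.

```python
# Changed files reported for the direct arm (identical for the supervisory arm).
DIRECT = {
    "/docs/cli-reference/tnh-gen.md",
    "/src/tnh_scholar/cli_tools/tnh_gen/commands/config.py",
    "/src/tnh_scholar/cli_tools/tnh_gen/commands/list.py",
    "/src/tnh_scholar/cli_tools/tnh_gen/commands/run.py",
    "/src/tnh_scholar/cli_tools/tnh_gen/state.py",
    "/src/tnh_scholar/cli_tools/tnh_gen/tnh_gen.py",
    "/tests/cli_tools/test_tnh_gen.py",
    "/tests/cli_tools/test_tnh_gen_coverage.py",
}
SUPERVISORY = set(DIRECT)
# The kernel arm touched the same files plus config_loader.py.
KERNEL = DIRECT | {"/src/tnh_scholar/cli_tools/tnh_gen/config_loader.py"}

shared = DIRECT & SUPERVISORY & KERNEL
kernel_only = KERNEL - (DIRECT | SUPERVISORY)

print(len(shared))
# → 8
print(sorted(kernel_only))
# → ['/src/tnh_scholar/cli_tools/tnh_gen/config_loader.py']
```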
Practical Conclusion¶
The main outcome is that supervisory Codex and tnh-conductor both look viable enough to keep pursuing in the OA01.x spike.
This experiment should be read as:
- direct execution is the least burdened path on a simple task
- supervisory execution is viable and now has explicit evidence of native subagent behavior
- kernel-mediated execution is viable and can produce real code in the managed worktree
- the main barriers are runtime overhead, stopping-condition behavior, and execution-state opacity
That is a positive result for the broader spike question, because those barriers appear operational and potentially improvable rather than architectural dead ends.
So the comparison does not show that single-agent execution is categorically better. It shows that the orchestration paths are promising, but they currently pay coordination and observability costs that are easy to see on a small task.
Useful Artifacts¶
Most useful:
- tmp/prompt-dir-arms/direct.stdout.jsonl
- tmp/prompt-dir-arms/supervisory.stdout.jsonl
- tmp/prompt-dir-arms/supervisory.stderr.log
- .tnh-conductor/runs/20260416T151658Z/metadata.json
- .tnh-conductor/runs/20260416T151658Z/events.ndjson
Most useful worktrees:
- tmp/prompt-dir-arms/worktrees/direct
- tmp/prompt-dir-arms/worktrees/supervisory
- .tnh-conductor/worktrees/20260416T151658Z
Next Action¶
Continue the spike with tasks that are large enough, or naturally decomposable enough, to give the supervisory and kernel arms a fairer opportunity to justify their overhead.
Focus the next round on improving and measuring:
- subagent spawn reliability
- return-path and merge reliability
- stopping-condition behavior
- validation handoff
- run-state visibility in the kernel path
Keep the direct sanitized Codex arm as the baseline, but treat the supervisory and kernel arms as active candidates for larger-task advantage rather than as secondary curiosities.
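One way to make stopping-condition behavior measurable in the next round is to run every arm under an explicit wall-clock cutoff and record whether it finished on its own. The sketch below uses the standard library's subprocess.run with a timeout; the function name and the status labels are hypothetical, and the command shown is a stand-in rather than an actual arm launch.

```python
import subprocess
import sys


def run_with_cutoff(cmd: list[str], timeout_s: float) -> dict:
    """Run cmd and report whether it finished before timeout_s seconds."""
    try:
        completed = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return {"status": "cut_off", "returncode": None}
    return {"status": "completed", "returncode": completed.returncode}


# Stand-in command; a real harness would launch an arm here instead.
quick = [sys.executable, "-c", "print('done')"]
print(run_with_cutoff(quick, timeout_s=30))
# → {'status': 'completed', 'returncode': 0}
```

Logging the cut_off versus completed status per arm would turn "still running when observation was cut off" into a comparable data point rather than an anecdote.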