SPIKE-10 Agent Coordination Comparison Result¶
This note records the observed outcomes, artifact evidence, and practical recommendation from the five-arm SPIKE-10 coordination run.
Experiment ID¶
SPIKE-10
Question¶
On the same bounded implementation task, which coordination shape currently looks most viable for TNH Scholar?
Compared arms:
- direct single-agent Codex
- Codex with native subagents
- explicit external Codex workers through
codex-assistant - explicit external Claude workers through
claude-assistant - existing agent-orch through
tnh-conductor
This result intentionally excludes a live tnh-gen evaluator arm because maintained agent-orch code does not currently wire tnh-gen into supervisory interpretation or workflow routing.
Setup¶
Task brief:
/docs/architecture/agent-orchestration/notes/experiments/spike-10-conductor-watch-task-brief.md
Maintained workflow:
/docs/architecture/agent-orchestration/notes/experiments/spike-10-conductor-watch.workflow.yaml
Committed base for all arms:
6909fccaonmain
Manual-arm experiment root:
tmp/spike-10/20260420T215652Z
Conductor run:
- run id
20260420T215817Z - run directory
.tnh-conductor/runs/20260420T215817Z - managed worktree
.tnh-conductor/worktrees/20260420T215817Z
Task:
- implement
tnh-conductor status --watch - add a configurable polling interval
- emit machine-readable JSON snapshots until a terminal lifecycle state
- preserve existing one-shot behavior
Result¶
All five arms produced viable implementation work on the same feature.
The strongest practical result is not that one arm alone dominated. It is that the comparison now cleanly separates three different kinds of value:
- direct Codex is still the cleanest baseline for bounded coding work
- existing agent-orch has the strongest control surface and artifact model
- explicit external worker CLIs are promising as a repo-native interface, but current environment and auth friction make them less reliable than the other two paths
Native subagents remain credible, but this run did not show a clean native-subagent win because the intended forked-collaboration path failed and the supervisor had to recover locally.
So the current recommendation is:
- keep direct Codex as the baseline for simple bounded work
- keep
tnh-conductoras the main path to invest in for controlled orchestration - keep the external assistant CLIs as experimental worker interfaces, not primary coordination defaults
- treat future
tnh-genwork as a separate review or evaluator layer rather than conflating it with the current orchestrator
Arm A: Direct Codex¶
Outcome:
- completed cleanly
- changed 2 files
- targeted validation passed
Worktree:
tmp/spike-10/20260420T215652Z/worktrees/direct
Artifacts:
tmp/spike-10/20260420T215652Z/captures/direct/supervisor.stdout.jsonltmp/spike-10/20260420T215652Z/captures/direct/supervisor.stderr.log
Validation:
poetry install --with localpoetry run pytest tests/cli_tools/test_tnh_conductor.py- result observed in run trace:
6 passed in 2.37s
Practical read:
- this was the lowest-overhead path
- failures were ordinary local implementation/test issues rather than orchestration issues
- it remains the right baseline any coordination layer needs to justify beating
Arm B: Native Codex Subagents¶
Outcome:
- completed with a useful implementation
- changed 2 files
- targeted validation passed
Worktree:
tmp/spike-10/20260420T215652Z/worktrees/native-subagent
Artifacts:
tmp/spike-10/20260420T215652Z/captures/native-subagent/supervisor.stdout.jsonltmp/spike-10/20260420T215652Z/captures/native-subagent/supervisor.stderr.log
Validation:
poetry install --with localpoetry run pytest tests/cli_tools/test_tnh_conductor.py- result observed in run trace:
6 passed in 2.02s
Observed weakness:
- intended native collaboration failed twice with
parent thread rollout unavailable for fork
Practical read:
- this arm still supports the view that native subagents are viable
- but this run does not support treating them as the default coordination path yet
- the biggest issue was not output quality, it was delegation reliability
Arm C: Explicit External Codex Worker¶
Outcome:
- completed with a useful implementation
- changed 2 files
- targeted validation passed
- explicit
codex-assistantpath was exercised, but the worker did not complete successfully
Worktree:
tmp/spike-10/20260420T215652Z/worktrees/external-codex
Artifacts:
tmp/spike-10/20260420T215652Z/captures/external-codex/supervisor.stdout.jsonltmp/spike-10/20260420T215652Z/captures/external-codex/supervisor.stderr.logtmp/codex-assistant/worker-status-watch.stdout.jsonltmp/codex-assistant/worker-status-watch.stderr.log
Validation:
poetry install --with localpoetry run pytest tests/cli_tools/test_tnh_conductor.py- final worktree validation result:
7 passed
Observed weaknesses:
- worker startup depended on local package install
- Codex worker then hit
~/.codexsession-write restrictions - the supervisor still had to recover locally to finish the task
Practical read:
- the checked-in
codex-assistantwrapper is useful as an explicit delegation seam - the process boundary did improve failure visibility
- but right now the external Codex path is operationally more fragile than direct execution or
tnh-conductor
Arm D: Explicit External Claude Worker¶
Outcome:
- completed with a useful implementation
- changed 2 files
- targeted validation passed
- explicit
claude-assistantpath was exercised, but the worker did not complete successfully
Worktree:
tmp/spike-10/20260420T215652Z/worktrees/external-claude
Artifacts:
tmp/spike-10/20260420T215652Z/captures/external-claude/supervisor.stdout.jsonltmp/spike-10/20260420T215652Z/captures/external-claude/supervisor.stderr.logtmp/spike-10/20260420T215652Z/worktrees/external-claude/tmp/claude-assistant/20260420T220039Z.stdout.jsonltmp/spike-10/20260420T215652Z/worktrees/external-claude/tmp/claude-assistant/20260420T220039Z.stderr.log
Validation:
poetry install --with localpoetry run pytest tests/cli_tools/test_tnh_conductor.py- result observed in run trace:
6 passed in 2.24s
Observed weaknesses:
- first launch failed before the worktree environment was installed
- launch shape needed clarification because
claude-assistantis effectively a single-command CLI - worker then hit
~/.claudewrite restrictions - after redirecting
HOME, the worker still failed on local Claude authentication
Practical read:
- this arm was valuable mainly because it exposed real wrapper and auth constraints
- the CLI is a valid repo-native seam
- but external Claude is not yet a dependable unattended worker path in this environment
Arm E: Existing Agent-Orch¶
Outcome:
- completed successfully through the maintained workflow
- changed 2 files
- implementation and validation artifacts were persisted canonically
Managed worktree:
.tnh-conductor/worktrees/20260420T215817Z
Run artifacts:
.tnh-conductor/runs/20260420T215817Z/status.json.tnh-conductor/runs/20260420T215817Z/metadata.json.tnh-conductor/runs/20260420T215817Z/events.ndjson.tnh-conductor/runs/20260420T215817Z/artifacts/implement/final_response.txt.tnh-conductor/runs/20260420T215817Z/artifacts/validate/validation_stdout.txt
Validation:
- implementation step reported
poetry run pytest tests/cli_tools/test_tnh_conductor.py - observed result in final response:
6 passed - maintained validation step then ran the bootstrap
testsvalidator - observed result in validation artifact:
535 passed, 2 skipped, 10 warnings
Practical read:
- this arm had the best control surface and the clearest artifact model
- status, events, route transitions, workspace diff, final response, and validation output all landed in predictable canonical locations
- the main weakness was coarser live visibility and higher operator inspection cost than the direct arm
Most important distinction:
- unlike the other arms, this path already gives TNH Scholar a repo-owned orchestration boundary with managed worktree, workflow routing, validation, and canonical artifacts
- that is why it still looks like the strongest long-term coordination substrate
Comparison Summary¶
Direct Codex was best on simplicity.
- least overhead
- cleanest path to a passing result
- easiest to reason about during execution
Native subagents were best on potential upside, but not on demonstrated reliability.
- real delegation intent was present
- the supervisor recovered sensibly
- native fork reliability is still not good enough to lean on without caveats
Explicit external worker CLIs were best on process-boundary clarity, but weakest on operational readiness.
- failures were easier to capture and inspect
codex-assistantandclaude-assistantnow give us stable repo-native seams- both paths were blocked by real environment or auth issues before the delegated worker could carry the task end to end
Existing agent-orch was best on control surface.
- canonical run directory
- managed worktree
- event stream
- artifact persistence
- validation routing
That is the main reason to keep investing in it even though the direct arm remains cleaner on a small task.
Follow-up code comparison:¶
- the direct arm and the
tnh-conductorimplementation arm both used the same Codex CLI execution family (codex exec --ephemeral) - the code quality of the direct arm had minor but noticeable quality edges in implementation.
- the small code-quality edge for the direct arm therefore looks more like a prompt and entrypoint-context difference than a runner-surface difference
- in practice, that points to workflow prompt packaging and worker context as the next hardening opportunity for
tnh-conductor
What This Says About tnh-gen¶
This comparison should not be read as evidence against using tnh-gen.
It says something narrower:
- current maintained orchestrator value comes from workflow, artifacts, and controlled execution
tnh-genis still untested in the live evaluator role because that seam is not yet wired- when
tnh-genis introduced, it should be compared as a reviewer or process evaluator, not as a synonym for the current conductor arm
Practical Conclusion¶
If TNH Scholar needs one primary forward path for agent coordination right now, keep the existing agent-orch code and continue hardening tnh-conductor.
That path already owns the most important surfaces:
- workflow definition
- managed workspace lifecycle
- route control
- validation execution
- canonical artifacts
Direct Codex should remain the baseline and fallback for bounded work because it is still the cleanest way to get code written quickly.
Native subagents are worth continuing to test, but as an augmentation inside the Codex baseline rather than as a replacement for the orchestrator.
Explicit external codex-assistant and claude-assistant workers are worth keeping because they create a useful explicit worker interface. But they should still be treated as experimental until environment bootstrapping, home-state isolation, and model authentication are made dependable.
Next Actions¶
- Keep
tnh-conductoras the main coordination path under active development. - Add a targeted maintained validator route for conductor comparison tasks so the validation scope matches the task brief instead of defaulting to the whole test suite.
- Harden native subagent launch reliability before treating native delegation as a dependable default.
- Harden assistant-worker runtime setup so
codex-assistantandclaude-assistantcan run in fresh worktrees without manual environment or auth recovery. - Add a separate follow-on spike for
tnh-genas a review or evaluator artifact producer rather than folding it into the current orchestrator comparison.