SPIKE-10 Agent Coordination Comparison Result¶

This note records the observed outcomes, artifact evidence, and practical recommendation from the five-arm SPIKE-10 coordination run.

Experiment ID¶

SPIKE-10

Question¶

On the same bounded implementation task, which coordination shape currently looks most viable for TNH Scholar?

Compared arms:

direct single-agent Codex
Codex with native subagents
explicit external Codex workers through codex-assistant
explicit external Claude workers through claude-assistant
existing agent-orch through tnh-conductor

This result intentionally excludes a live tnh-gen evaluator arm because maintained agent-orch code does not currently wire tnh-gen into supervisory interpretation or workflow routing.

Setup¶

Task brief:

/docs/architecture/agent-orchestration/notes/experiments/spike-10-conductor-watch-task-brief.md

Maintained workflow:

/docs/architecture/agent-orchestration/notes/experiments/spike-10-conductor-watch.workflow.yaml

Committed base for all arms:

6909fcca on main

Manual-arm experiment root:

tmp/spike-10/20260420T215652Z

Conductor run:

run id 20260420T215817Z
run directory .tnh-conductor/runs/20260420T215817Z
managed worktree .tnh-conductor/worktrees/20260420T215817Z

Task:

implement tnh-conductor status --watch
add a configurable polling interval
emit machine-readable JSON snapshots until a terminal lifecycle state
preserve existing one-shot behavior
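
As a rough illustration of the task shape (not the implementation any arm produced), a watch loop could poll a status source at a configurable interval and emit one JSON snapshot per poll until a terminal state appears. The `read_status` callable, the `lifecycle_state` key, and the terminal state names below are hypothetical placeholders; the real tnh-conductor status schema may differ.

```python
import json
import time
from typing import Callable, Dict, Iterator, List

# Hypothetical terminal lifecycle states; the real names may differ.
TERMINAL_STATES = frozenset({"completed", "failed", "cancelled"})


def watch_status(
    read_status: Callable[[], Dict],
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield one JSON snapshot per poll until a terminal state is seen."""
    while True:
        snapshot = read_status()
        # One compact JSON document per snapshot keeps the output machine-readable.
        yield json.dumps(snapshot, sort_keys=True)
        if snapshot.get("lifecycle_state") in TERMINAL_STATES:
            return
        sleep(interval_seconds)


def demo() -> List[str]:
    """Drive the loop with a fake status source and a no-op sleep."""
    states = iter(["running", "running", "completed"])
    return list(
        watch_status(
            lambda: {"lifecycle_state": next(states)},
            interval_seconds=0.0,
            sleep=lambda _seconds: None,
        )
    )
```

Injecting `sleep` keeps the polling interval configurable while letting tests run without real delays. One-shot behavior would correspond to reading a single snapshot without entering the loop.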

Result¶

All five arms produced viable implementation work on the same feature.

The strongest practical result is not that one arm alone dominated. It is that the comparison now cleanly separates three different kinds of value:

direct Codex is still the cleanest baseline for bounded coding work
existing agent-orch has the strongest control surface and artifact model
explicit external worker CLIs are promising as a repo-native interface, but current environment and auth friction make them less reliable than the other two paths

Native subagents remain credible, but this run did not show a clean native-subagent win because the intended forked-collaboration path failed and the supervisor had to recover locally.

So the current recommendation is:

keep direct Codex as the baseline for simple bounded work
keep tnh-conductor as the main path to invest in for controlled orchestration
keep the external assistant CLIs as experimental worker interfaces, not primary coordination defaults
treat future tnh-gen work as a separate review or evaluator layer rather than conflating it with the current orchestrator

Arm A: Direct Codex¶

Outcome:

completed cleanly
changed 2 files
targeted validation passed

Worktree:

tmp/spike-10/20260420T215652Z/worktrees/direct

Artifacts:

tmp/spike-10/20260420T215652Z/captures/direct/supervisor.stdout.jsonl
tmp/spike-10/20260420T215652Z/captures/direct/supervisor.stderr.log

Validation:

poetry install --with local
poetry run pytest tests/cli_tools/test_tnh_conductor.py
result observed in run trace: 6 passed in 2.37s

Practical read:

this was the lowest-overhead path
failures were ordinary local implementation/test issues rather than orchestration issues
it remains the baseline that any coordination layer must justify improving on

Arm B: Native Codex Subagents¶

Outcome:

completed with a useful implementation
changed 2 files
targeted validation passed

Worktree:

tmp/spike-10/20260420T215652Z/worktrees/native-subagent

Artifacts:

tmp/spike-10/20260420T215652Z/captures/native-subagent/supervisor.stdout.jsonl
tmp/spike-10/20260420T215652Z/captures/native-subagent/supervisor.stderr.log

Validation:

poetry install --with local
poetry run pytest tests/cli_tools/test_tnh_conductor.py
result observed in run trace: 6 passed in 2.02s

Observed weakness:

intended native collaboration failed twice with "parent thread rollout unavailable for fork"

Practical read:

this arm still supports the view that native subagents are viable
but this run does not support treating them as the default coordination path yet
the biggest issue was delegation reliability, not output quality

Arm C: Explicit External Codex Worker¶

Outcome:

completed with a useful implementation
changed 2 files
targeted validation passed
explicit codex-assistant path was exercised, but the delegated worker did not complete successfully; the supervisor finished the task locally

Worktree:

tmp/spike-10/20260420T215652Z/worktrees/external-codex

Artifacts:

tmp/spike-10/20260420T215652Z/captures/external-codex/supervisor.stdout.jsonl
tmp/spike-10/20260420T215652Z/captures/external-codex/supervisor.stderr.log
tmp/codex-assistant/worker-status-watch.stdout.jsonl
tmp/codex-assistant/worker-status-watch.stderr.log

Validation:

poetry install --with local
poetry run pytest tests/cli_tools/test_tnh_conductor.py
final worktree validation result: 7 passed

Observed weaknesses:

worker startup depended on local package install
Codex worker then hit ~/.codex session-write restrictions
the supervisor still had to recover locally to finish the task

Practical read:

the checked-in codex-assistant wrapper is useful as an explicit delegation seam
the process boundary did improve failure visibility
but right now the external Codex path is operationally more fragile than direct execution or tnh-conductor

Arm D: Explicit External Claude Worker¶

Outcome:

completed with a useful implementation
changed 2 files
targeted validation passed
explicit claude-assistant path was exercised, but the worker did not complete successfully

Worktree:

tmp/spike-10/20260420T215652Z/worktrees/external-claude

Artifacts:

tmp/spike-10/20260420T215652Z/captures/external-claude/supervisor.stdout.jsonl
tmp/spike-10/20260420T215652Z/captures/external-claude/supervisor.stderr.log
tmp/spike-10/20260420T215652Z/worktrees/external-claude/tmp/claude-assistant/20260420T220039Z.stdout.jsonl
tmp/spike-10/20260420T215652Z/worktrees/external-claude/tmp/claude-assistant/20260420T220039Z.stderr.log

Validation:

poetry install --with local
poetry run pytest tests/cli_tools/test_tnh_conductor.py
result observed in run trace: 6 passed in 2.24s

Observed weaknesses:

first launch failed before the worktree environment was installed
launch shape needed clarification because claude-assistant is effectively a single-command CLI
worker then hit ~/.claude write restrictions
after redirecting HOME, the worker still failed on local Claude authentication
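
The HOME redirection mentioned above can be sketched as follows. This is a minimal illustration, assuming the worker is launched as a subprocess; the `.worker-home` directory name and both helper functions are hypothetical, not part of claude-assistant or the repo. As the run showed, isolating home state only addresses write restrictions and does not by itself resolve authentication.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import Dict, Mapping, Optional, Sequence


def isolated_home_env(
    worktree: Path, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with HOME pointed inside the worktree."""
    env = dict(os.environ if base_env is None else base_env)
    home = worktree / ".worker-home"  # hypothetical location, not a repo convention
    home.mkdir(parents=True, exist_ok=True)
    env["HOME"] = str(home)
    return env


def run_with_isolated_home(
    command: Sequence[str], worktree: Path
) -> subprocess.CompletedProcess:
    """Run a command in the worktree with a writable, per-worktree HOME."""
    return subprocess.run(
        list(command),
        cwd=worktree,
        env=isolated_home_env(worktree),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect returncode and stderr
    )
```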

Practical read:

this arm was valuable mainly because it exposed real wrapper and auth constraints
the CLI is a valid repo-native seam
but external Claude is not yet a dependable unattended worker path in this environment

Arm E: Existing Agent-Orch¶

Outcome:

completed successfully through the maintained workflow
changed 2 files
implementation and validation artifacts were persisted canonically

Managed worktree:

.tnh-conductor/worktrees/20260420T215817Z

Run artifacts:

.tnh-conductor/runs/20260420T215817Z/status.json
.tnh-conductor/runs/20260420T215817Z/metadata.json
.tnh-conductor/runs/20260420T215817Z/events.ndjson
.tnh-conductor/runs/20260420T215817Z/artifacts/implement/final_response.txt
.tnh-conductor/runs/20260420T215817Z/artifacts/validate/validation_stdout.txt
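
Because these artifacts land in predictable locations, a run can be inspected with very little code. The sketch below assumes only that status.json holds a single JSON document and that events.ndjson holds one JSON document per line; the `load_run_artifacts` helper is hypothetical, and no particular field names inside either file are assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_run_artifacts(run_dir: Path) -> Dict[str, Any]:
    """Load status.json and events.ndjson from a run directory, if present."""
    status_path = run_dir / "status.json"
    events_path = run_dir / "events.ndjson"

    status = None
    if status_path.exists():
        status = json.loads(status_path.read_text(encoding="utf-8"))

    events: List[Any] = []
    if events_path.exists():
        for line in events_path.read_text(encoding="utf-8").splitlines():
            if line.strip():  # NDJSON: one JSON document per non-empty line
                events.append(json.loads(line))

    return {"status": status, "events": events}
```

For this run, the directory to pass would be `.tnh-conductor/runs/20260420T215817Z`.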

Validation:

implementation step reported poetry run pytest tests/cli_tools/test_tnh_conductor.py
observed result in final response: 6 passed
maintained validation step then ran the bootstrap tests validator
observed result in validation artifact: 535 passed, 2 skipped, 10 warnings

Practical read:

this arm had the best control surface and the clearest artifact model
status, events, route transitions, workspace diff, final response, and validation output all landed in predictable canonical locations
the main weakness was coarser live visibility and higher operator inspection cost than the direct arm

Most important distinction:

unlike the other arms, this path already gives TNH Scholar a repo-owned orchestration boundary with managed worktree, workflow routing, validation, and canonical artifacts
that is why it still looks like the strongest long-term coordination substrate

Comparison Summary¶

Direct Codex was best on simplicity.

least overhead
cleanest path to a passing result
easiest to reason about during execution

Native subagents were best on potential upside, but not on demonstrated reliability.

real delegation intent was present
the supervisor recovered sensibly
native fork reliability is still not good enough to lean on without caveats

Explicit external worker CLIs were best on process-boundary clarity, but weakest on operational readiness.

failures were easier to capture and inspect
codex-assistant and claude-assistant now give us stable repo-native seams
both paths were blocked by real environment or auth issues before the delegated worker could carry the task end to end

Existing agent-orch was best on control surface.

canonical run directory
managed worktree
event stream
artifact persistence
validation routing

That is the main reason to keep investing in it even though the direct arm remains cleaner on a small task.

Follow-Up Code Comparison¶

the direct arm and the tnh-conductor implementation arm both used the same Codex CLI execution family (codex exec --ephemeral)
the direct arm's implementation showed a minor but noticeable code-quality edge
the small code-quality edge for the direct arm therefore looks more like a prompt and entrypoint-context difference than a runner-surface difference
in practice, that points to workflow prompt packaging and worker context as the next hardening opportunity for tnh-conductor

What This Says About `tnh-gen`¶

This comparison should not be read as evidence against using tnh-gen.

It says something narrower:

current maintained orchestrator value comes from workflow, artifacts, and controlled execution
tnh-gen is still untested in the live evaluator role because that seam is not yet wired
when tnh-gen is introduced, it should be compared as a reviewer or process evaluator, not as a synonym for the current conductor arm

Practical Conclusion¶

If TNH Scholar needs one primary forward path for agent coordination right now, keep the existing agent-orch code and continue hardening tnh-conductor.

That path already owns the most important surfaces:

workflow definition
managed workspace lifecycle
route control
validation execution
canonical artifacts

Direct Codex should remain the baseline and fallback for bounded work because it is still the cleanest way to get code written quickly.

Native subagents are worth continuing to test, but as an augmentation inside the Codex baseline rather than as a replacement for the orchestrator.

Explicit external codex-assistant and claude-assistant workers are worth keeping because they create a useful explicit worker interface. But they should still be treated as experimental until environment bootstrapping, home-state isolation, and model authentication are made dependable.

Next Actions¶

Keep tnh-conductor as the main coordination path under active development.
Add a targeted maintained validator route for conductor comparison tasks so the validation scope matches the task brief instead of defaulting to the whole test suite.
Harden native subagent launch reliability before treating native delegation as a dependable default.
Harden assistant-worker runtime setup so codex-assistant and claude-assistant can run in fresh worktrees without manual environment or auth recovery.
Add a separate follow-on spike for tnh-gen as a review or evaluator artifact producer rather than folding it into the current orchestrator comparison.
