Skip to content

SPIKE-04 Narrow Supervisory Comparison

SPIKE-04 compares a direct single-agent pass against the existing supervisory shell run on the same bounded OA01.x design-review task.

Experiment ID

SPIKE-04

Question

Does a tightly bounded supervisor using native subagents produce a meaningfully better result than a direct single-agent pass on the same task?

Setup

Task basis:

  • docs/architecture/agent-orchestration/supervisory-shell-trial/current-supervisory-task-brief.md

Arm A:

  • direct single-agent run
  • artifact: tmp/spike-04-direct-stdout.jsonl
  • stderr: tmp/spike-04-direct-stderr.log

Arm B:

  • existing shell-launched supervisory run
  • artifact: tmp/supervisory-shell-trial-stdout.jsonl
  • stderr: tmp/supervisory-shell-trial-stderr.log

Comparison lens:

  • usefulness of findings
  • distinctness of decomposition
  • added value versus overhead
  • operational reliability

Result

The supervisory arm produced the more useful result, but not cleanly enough yet to count as an endorsed default workflow.

Why the supervisory arm was stronger:

  • it produced three distinct critique workstreams
  • the workstreams converged on a tighter set of structural issues
  • the final synthesis was sharper about pruning and next experiment shape
  • it surfaced one important capability constraint directly: inherited fork-context spawning was not dependable in that launch mode

Why the direct arm still matters:

  • it was simpler
  • it completed cleanly
  • it produced a competent high-level critique without coordination overhead
  • it did not depend on any subagent behavior

Practical judgment:

  • direct single-agent use is the cleaner baseline
  • supervisory use has higher upside
  • current supervisory value depends on explicit-context delegation rather than trusting inherited context

Useful Artifacts

  • tmp/spike-04-direct-stdout.jsonl shows a clean direct critique with no collab_tool_call events
  • tmp/supervisory-shell-trial-stdout.jsonl shows real delegated workstreams and a stronger synthesis
  • tmp/supervisory-shell-trial-stderr.log captures the main supervisory reliability issue: parent-rollout fork-context failure during the first spawn strategy

Comparison Summary

Direct arm strengths:

  • simpler and more reliable
  • no subagent failure mode
  • adequate output quality for a bounded design review

Direct arm weaknesses:

  • less evidence of differentiated exploration
  • less pressure-testing from multiple angles
  • weaker signal about whether supervision itself adds value

Supervisory arm strengths:

  • better decomposition
  • stronger convergence on pruning moves
  • more informative next-experiment recommendation
  • direct evidence about native subagent constraints

Supervisory arm weaknesses:

  • higher overhead
  • failed initial spawn strategy
  • dependence on explicit repo/task context in delegated prompts

Conclusion

The current evidence favors continuing supervisory experimentation, but only in a very bounded form.

The supervisory path has enough upside to keep testing because it produced a better result than the direct pass on this task. But it is not yet clean or robust enough to replace direct-agent workflow as the default.

Next Action

Proceed to SPIKE-05 and formalize the minimum artifact bundle needed for a useful comparison run.

For future supervisory runs:

  • use the direct path as baseline
  • keep the supervisor to at most two subagent calls
  • prefer explicit-context subagent prompts
  • and judge value against one narrow task rather than broad exploratory work