Codex Headless Communication Experiment Plan¶

This plan lays out the next bounded experiment sequence for understanding reliable headless Codex communication.

Purpose¶

Define the next rounds of Codex headless communication experiments in a way that is practical, narrow, sequential, and grounded in the results already observed.

This plan takes useful inputs from:

The research memo is treated as an input, not as a decision document.

The cloud-generated direction docs are preserved as experimental artifacts only. They are not maintained planning authority for the current spike.

Current Baseline¶

The current best machine-oriented baseline is:

use codex exec as the headless surface,
use the authenticated default ~/.codex home,
use repo-local profile collab,
use --ephemeral,
separate stdout from stderr,
and disable plugins plus shell_snapshot when the goal is a cleaner event channel.

This path is viable enough for further experiments, but it is not yet endorsed as the long-term operating mode.

Current Scope¶

The current spike scope is narrower than a general orchestration program.

Active focus:

headless execution patterns,
native subagent invocation and evidence capture,
simple supervisor vs direct-agent comparison,
and minimal operator-facing artifacts.

Explicitly not active right now:

economics,
broad topology exploration,
large benchmark programs,
cross-provider portfolio design,
and broad overnight scale architecture.

Adopted Directions¶

1. Build a Documented-vs-Observed Matrix¶

Adopt the memo's suggestion to explicitly track:

officially documented behavior,
locally observed behavior,
confidence,
and follow-up experiments.

Reason:

This is useful discipline and keeps research from collapsing into opinion.

2. Continue User-Shell Mimic Experiments¶

Adopt direct comparison across execution contexts such as:

direct user shell,
agent-launched zsh -lc,
repo-local wrapper script,
and targeted working-directory or config changes.

Reason:

This is now clearly one of the highest-signal questions.

3. Keep Capability Checking in Any Automation Path¶

Adopt lightweight capability verification before assuming CLI flags or behavior.

Reason:

This is ordinary hygiene for an actively changing CLI surface.

4. Keep Experiments Narrow and Collaboration-Focused¶

Adopt the memo's bias toward narrow collaboration patterns and away from early broad autonomy experiments.

Reason:

This aligns with OA01.4 and with the current stage of uncertainty.

5. Keep the Spike Decisive Rather Than Expansive¶

Adopt a narrow success question:

is headless subagent work viable enough for TNH Scholar,
or are existing direct agent workflows the better practical path for now?

Reason:

This is the real decision boundary for the current spike.

6. Prefer More Meaningful User-Shell Experiments Over More Noise Tuning¶

Adopt a bias toward user-shell supervisory experiments before doing much more local wrapper optimization.

Reason:

The main unknown is no longer whether Codex can be called. The main unknown is whether native collaborative behavior is genuinely useful for the direction now being explored.

On Hold¶

1. Isolated-but-Authenticated Execution¶

This remains interesting, but it is on hold rather than the immediate next move.

Reason:

it may matter later for reproducibility,
but the current path is already viable,
and the project does not yet need a cleaner isolated home badly enough to make this the top priority.

2. Broader Collaboration-Mode Taxonomy¶

The memo's breakdown of guidance-layer, process-layer, and supervisory collaboration is useful conceptually, but it is on hold for planning purposes.

Reason:

It adds conceptual structure faster than current experiments require.

3. Safety-Model Redesign¶

Questions about dangerous mode, custom safety gating, or stronger external permission models are on hold.

Reason:

The project still needs better understanding of the normal execution path before reasoning about stronger or more permissive modes.

Rejected for Now¶

1. Broad New Orchestration or Collaboration Infrastructure¶

Rejected for this phase.

Reason:

The communication path is already viable enough to support further learning without more system buildout.

2. Treating the Research Memo's Recommendations as Directional Decisions¶

Rejected.

Reason:

The memo contains useful research, but its recommendations are not sufficiently grounded in the repo's full context to serve as planning authority.

3. Requiring Clean Isolated State Before Continuing¶

Rejected.

Reason:

This would block useful experimentation unnecessarily. The current authenticated path is noisy but workable.

Next Experiments¶

The next rounds should stay sequential and small.

Experiment 1. Baseline Establishment¶

Status:

completed

What it covered:

headless reachability
stream separation
real review usefulness
user-shell cleanliness
wrapper usefulness
repo-local config viability
native subagent availability
first noise-reduction levers

Primary output:

Codex Headless Communication Report

Experiment 2. Surface-Cost Comparison¶

Goal:

Understand the likely cost of suppressing noisy surfaces before treating the lower-noise path as a design win.

Sub-experiments:

2.1 Plugins Cost¶

Compare one small but meaningful task under:

baseline path
--disable plugins

Question:

what useful behavior, if any, disappears when plugins are disabled?

2.2 Shell Snapshot Cost¶

Compare one small but meaningful task under:

baseline path
--disable shell_snapshot

Question:

does disabling shell snapshots reduce useful environment/context behavior, or only suppress noise?

2.3 Persistence / State Cost¶

Observe whether the persistent state warning appears correlated with any degraded useful behavior during a slightly longer task.

Question:

is the state DB issue merely noisy, or operationally relevant?

Experiment 3. User-Shell Supervisory Trial¶

Goal:

Run a more meaningful Codex session from the user's shell, with explicit permission to use native subagents, and see how far the native collaboration surface goes before custom orchestration is introduced.

Shape:

launch from the user's shell
provide a clear supervisory objective
allow native subagent use
include stop conditions and output expectations

Questions:

does the supervisory mode feel meaningfully stronger than thin wrapper-led task execution?
what native coordination behavior appears without additional structure?
what artifacts are actually useful for human review afterward?

Status:

first round completed

What was learned:

the supervisor did produce useful delegated work and a stronger synthesis than a likely straight-line pass
the initial inherited subagent spawn mode failed with a parent-rollout fork error
explicit-context retry worked
the result suggests promise, but not a clean workflow yet

Immediate implication:

the next comparison should not assume inherited subagent context is dependable
and the next design-review comparison should be narrower and more controlled

Experiment 4. Paired Design-Review Comparison¶

Goal:

Test the real question more cleanly: does supervised native collaboration beat a direct single-agent pass on the same review task?

Shape:

same day
same file set
same brief
same time budget
same output scorecard

Arms:

Arm A: one direct codex exec design-review pass
Arm B: one shell-launched supervisor with at most 2 subagent calls

Constraint:

use one orientation only: Design Review
do not widen into broader runner or architecture work during the trial
use explicit subagent context if inherited context remains unreliable

Goal:

Only after the paired comparison, decide whether the wrapper should become a slightly more intentional launcher surface for future orchestration agents.

Possible additions:

--prompt-file
stable output-path conventions
optional event filtering

Constraint:

keep this thin
do not build a larger orchestration substrate yet

Working Principle¶

The project should continue to optimize for rapid learning through small experiments, not for premature system design.

The next meaningful question is not "can Codex be launched?" That has already been answered. The next meaningful question is whether a user-launched supervisory Codex session, using native collaboration features, is good enough to inform the emerging collaboration-oriented direction before more local tooling is built.

Codex Headless Communication Experiment Plan¶

Purpose¶

Current Baseline¶

Current Scope¶

Adopted Directions¶

1. Build a Documented-vs-Observed Matrix¶

2. Continue User-Shell Mimic Experiments¶

3. Keep Capability Checking in Any Automation Path¶

4. Keep Experiments Narrow and Collaboration-Focused¶

5. Keep the Spike Decisive Rather Than Expansive¶

6. Prefer More Meaningful User-Shell Experiments Over More Noise Tuning¶

On Hold¶

1. Isolated-but-Authenticated Execution¶

2. Broader Collaboration-Mode Taxonomy¶

3. Safety-Model Redesign¶

Rejected for Now¶

1. Broad New Orchestration or Collaboration Infrastructure¶

2. Treating the Research Memo's Recommendations as Directional Decisions¶

3. Requiring Clean Isolated State Before Continuing¶

Next Experiments¶

Experiment 1. Baseline Establishment¶

Experiment 2. Surface-Cost Comparison¶

2.1 Plugins Cost¶

2.2 Shell Snapshot Cost¶

2.3 Persistence / State Cost¶

Experiment 3. User-Shell Supervisory Trial¶

Experiment 4. Paired Design-Review Comparison¶

Experiment 5. Thin Launcher Refinement¶

Working Principle¶