OA01.x Experimental Directions Atlas¶

Experimental artifact from an independent cloud run. Preserve for reference only. Do not treat this document as maintained workflow authority for the current OA01.x spike.

Purpose¶

Define a broad, testable, and implementation-oriented experimental direction set for OA01.x, with a concrete experiment catalog focused on this target:

large-scale headless overnight runs,
complex iterative task loops,
explicit supervision and feedback,
team-of-agents collaboration,
and multi-provider participation (Codex, Claude Code, and non-coding/general AI reviewers).

This atlas is intentionally design-forward and experiment-heavy, so it can drive long-running execution without prematurely freezing one architecture.

Scope and boundaries¶

This document defines experimental directions, specific experiments, and acceptance signals.
It does not lock final production architecture; ADR decisions should be promoted only after repeated evidence.
Experiments are meant to run incrementally and can be split into PR-sized slices.

External surface scan used for this atlas (2026-04-14)¶

The following publicly documented capabilities informed prioritization:

Codex supports subagents, non-interactive execution (codex exec), JSONL event streaming, output schemas, and CI-oriented auth/automation patterns: https://developers.openai.com/codex/subagents, https://developers.openai.com/codex/noninteractive, https://developers.openai.com/codex/agent-approvals-security.
Claude Code docs describe multi-agent workflows, scheduling/automation patterns, and CI action integrations (anthropics/claude-code-action@v1): https://code.claude.com/docs/en/overview, https://code.claude.com/docs/en/github-actions.
Gemini docs provide large-scale asynchronous batch execution and parallel/compositional function-calling patterns: https://ai.google.dev/gemini-api/docs/batch-api, https://ai.google.dev/gemini-api/docs/function-calling.
MCP specification confirms standardized host/client/server tool and resource integration patterns plus safety expectations: https://modelcontextprotocol.io/specification/2025-06-18.
GitHub Actions matrix/concurrency mechanics provide practical substrate for overnight distributed runs: https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/run-job-variations.

Direction portfolio (broad set)¶

Direction D1 — Supervisor Loop Kernel Reliability¶

Hypothesis: Overnight viability depends first on deterministic supervisor loop behavior and failure visibility.

Experiments:

D1.E1 Run-state determinism stress test
Replay the same workflow inputs 25 times; compare end states, events, and manifests for drift.
D1.E2 Crash-restart recovery drill
Force process kill at multiple lifecycle points; validate resume semantics and idempotency.
D1.E3 Timeout/heartbeat watchdog trials
Inject hung-step scenarios; ensure the supervisor terminates the step, records the cause, and continues the queue.

Success signal: ≥95% deterministic terminal classifications; zero silent hangs.
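A minimal sketch of how the D1.E1 drift comparison might be scored, assuming each replay can be exported as a JSON-serializable record. The record shape and the names fingerprint and determinism_rate are hypothetical, not part of any existing OA01.x tooling. Hashing the whole record is stricter than comparing terminal classifications alone, so this rate is a conservative lower bound on the success signal.

```python
import hashlib
import json
from collections import Counter


def fingerprint(run: dict) -> str:
    """Stable hash of one replay record (hypothetical shape)."""
    canonical = json.dumps(run, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def determinism_rate(runs: list[dict]) -> float:
    """Fraction of replays whose fingerprint equals the most common one."""
    if not runs:
        raise ValueError("need at least one replay")
    counts = Counter(fingerprint(r) for r in runs)
    return counts.most_common(1)[0][1] / len(runs)


# Illustrative data only: 24 identical replays plus one that drifted.
replays = [{"end_state": "succeeded", "events": ["start", "step", "done"]}] * 24
replays.append({"end_state": "failed", "events": ["start", "timeout"]})

rate = determinism_rate(replays)
print(f"{rate:.2f}", rate >= 0.95)
```

Volatile fields such as timestamps or run IDs would need to be stripped before hashing, otherwise every replay looks like drift.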

Direction D2 — Headless Execution at Scale (Nightly)¶

Hypothesis: Large unattended runs fail mostly from orchestration hygiene rather than model quality.

Experiments:

D2.E1 Overnight burn-in (8h)
Queue mixed tasks continuously for 8 hours with bounded resource quotas.
D2.E2 Multi-runner parallelism sweep
Sweep concurrency levels (1/2/4/8/16 workers) and map failure-rate inflection points.
D2.E3 Queue backpressure validation
Overfeed workload intentionally; verify graceful shed/defer behavior.

Success signal: stable completion rate, no resource runaway, no orphaned worktrees.
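One way the D2.E3 shed/defer behavior could look, sketched with a bounded standard-library queue. The submit helper, the deferred list, and the capacity of 3 are illustrative assumptions; a real supervisor would likely persist deferred tasks instead of holding them in memory.

```python
import queue


def submit(work: queue.Queue, task: str, deferred: list[str]) -> bool:
    """Enqueue a task, or defer it when the bounded queue is full."""
    try:
        work.put_nowait(task)
        return True
    except queue.Full:
        # Backpressure: record the task for later instead of blocking or dropping it.
        deferred.append(task)
        return False


work: queue.Queue = queue.Queue(maxsize=3)  # hypothetical capacity
deferred: list[str] = []

# Overfeed the queue on purpose: five tasks against three slots.
accepted = [t for t in (f"task-{i}" for i in range(5)) if submit(work, t, deferred)]

print(accepted)
print(deferred)
```

The point of the experiment is that overload shows up as an explicit deferred list, not as silent loss or an unbounded backlog.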

Direction D3 — Cross-Agent Team Topologies¶

Hypothesis: Task quality improves when role-specialized teams are explicit (Implementer, Tester, Critic, Integrator).

Experiments:

D3.E1 Topology A/B/C comparison
Compare single-agent vs 2-agent pair vs 4-role team on same benchmark tasks.
D3.E2 Lead-agent delegation protocol
Evaluate quality/cycle-time when one lead coordinates subagents.
D3.E3 Team memory contract
Add shared artifact ledger; measure context-loss reduction across handoffs.

Success signal: higher merge-ready output per wall-clock hour without elevated policy violations.

Direction D4 — Multi-Provider Agent Fabric¶

Hypothesis: Best outcomes come from provider diversity, not single-model monoculture.

Experiments:

D4.E1 Codex implementer + Claude reviewer pipeline
Primary code changes from Codex; structured critique from Claude.
D4.E2 Claude implementer + Codex verifier pipeline
Reverse pairing to identify directional asymmetry.
D4.E3 General-model “out-of-box critic” stage
Add a GPT/Gemini/Claude general review pass focused on risk blind spots and alternative designs.

Success signal: measurable defect-prevention lift and architecture-option diversity.

Direction D5 — Feedback Loops and Evaluator Quality¶

Hypothesis: Iterative loops only work if evaluators are strict, legible, and difficult to game.

Experiments:

D5.E1 Mechanical vs semantic evaluator stack
Compare simple pass/fail checks to rubric-based semantic scoring.
D5.E2 Anti-gaming adversarial suite
Introduce intentionally superficial fixes to test evaluator robustness.
D5.E3 Loop budget tuning
Sweep max-iteration budgets and stopping criteria to optimize ROI.

Success signal: fewer false passes and better correlation with human review outcomes.
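The "fewer false passes" signal needs a definition before it can be tracked. The sketch below assumes one: among outputs the evaluator passed, the share that human review rejected. Both the definition and the function name are assumptions made for illustration.

```python
def false_pass_rate(evaluator_pass: list[bool], human_accept: list[bool]) -> float:
    """Share of evaluator passes that human review rejected."""
    if len(evaluator_pass) != len(human_accept):
        raise ValueError("verdict lists must be the same length")
    # Keep the human verdict for every output the evaluator passed.
    human_on_passes = [h for e, h in zip(evaluator_pass, human_accept) if e]
    if not human_on_passes:
        return 0.0
    return sum(1 for h in human_on_passes if not h) / len(human_on_passes)


# Illustrative verdicts for five outputs.
evaluator = [True, True, True, False, True]
human = [True, False, True, False, True]

print(false_pass_rate(evaluator, human))
```

Running the same calculation on the D5.E2 adversarial suite, where every superficial fix should be rejected, gives a direct robustness number for each evaluator stack.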

Direction D6 — Prompt/Brief Contract Robustness¶

Hypothesis: Most multi-agent instability is prompt-contract drift, not runtime defects.

Experiments:

D6.E1 Structured collaborator-brief schema trial
Enforce shared brief fields (intent, boundaries, evidence expected, stop conditions).
D6.E2 Ambiguity injection benchmark
Add controlled ambiguity and test whether clarification loops resolve it safely.
D6.E3 Instruction inheritance policy
Validate propagation precedence across system/team/step-level instructions.

Success signal: reduced divergence across agents given equivalent objectives.
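The shared brief fields named in D6.E1 could be enforced with a small schema. This is a sketch under assumptions: the class name, field types, and the rule that empty values count as missing are all hypothetical choices.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CollaboratorBrief:
    """Hypothetical brief schema using the four fields listed in D6.E1."""
    intent: str
    boundaries: list[str]
    evidence_expected: list[str]
    stop_conditions: list[str]


def missing_fields(raw: dict) -> list[str]:
    """Names of brief fields that are absent or empty in a raw payload."""
    return [f.name for f in fields(CollaboratorBrief) if not raw.get(f.name)]


draft = {"intent": "fix flaky test", "boundaries": ["tests/ only"]}
print(missing_fields(draft))  # ['evidence_expected', 'stop_conditions']
```

Rejecting a brief before dispatch is cheaper than discovering mid-run that an agent had no stop condition.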

Direction D7 — Artifact Contract and Provenance Depth¶

Hypothesis: Overnight supervision depends on first-class artifacts more than on chat transcripts.

Experiments:

D7.E1 Canonical artifact minimum set validation
Enforce transcript/final_response/policy_summary/manifest/event stream completeness.
D7.E2 Evidence traceability audit
Randomly sample runs; verify each decision can be traced to reproducible evidence.
D7.E3 Compression/retention policy sweep
Tune long-run storage to preserve forensic usefulness at manageable cost.

Success signal: post-run audits can reconstruct agent behavior without ambiguity.
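A possible completeness check for D7.E1, assuming each run writes its artifacts as files in one directory and that the file stem matches the artifact name. The directory layout, the file extensions, and the rule that empty files count as missing are assumptions, not a documented OA01.x contract.

```python
import tempfile
from pathlib import Path

# Artifact names taken from D7.E1; the on-disk naming is hypothetical.
REQUIRED_ARTIFACTS = (
    "transcript", "final_response", "policy_summary", "manifest", "event_stream",
)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Required artifacts that are absent or empty in a run directory."""
    present = {
        p.stem for p in run_dir.iterdir() if p.is_file() and p.stat().st_size > 0
    }
    return [name for name in REQUIRED_ARTIFACTS if name not in present]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "transcript.md").write_text("agent transcript")
    (run_dir / "manifest.json").write_text("{}")
    (run_dir / "event_stream.jsonl").write_text("")  # empty, so treated as missing
    missing = missing_artifacts(run_dir)

print(missing)
```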

Direction D8 — Safety Rails for Long-Running Autonomy¶

Hypothesis: Productive autonomy requires fail-closed controls and granular escalation pathways.

Experiments:

D8.E1 Approval-mode matrix
Evaluate read-only/workspace-write/full-access profiles against productivity and risk.
D8.E2 Destructive-command trap suite
Simulate unsafe operations; verify hard blocks + legible violation records.
D8.E3 Sandbox-escape resilience checks
Validate boundary behavior for shell/network/file-scope policy combinations.

Success signal: zero high-severity safety breaches in unattended windows.
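Fail-closed control, as used in the D8 hypothesis, means anything not explicitly permitted is denied. The sketch below illustrates that with a hypothetical allowlist and returns a reason string that can serve as the violation record. It assumes commands are executed as an argument vector without a shell, so operators such as && are never interpreted; it is not a substitute for a real sandbox.

```python
import shlex

# Hypothetical policy: program -> allowed subcommands (None means any arguments).
ALLOWED = {"git": {"status", "diff", "log"}, "ls": None, "cat": None}


def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason); every unrecognized case is denied."""
    try:
        argv = shlex.split(command)
    except ValueError as exc:
        return False, f"unparseable command: {exc}"
    if not argv:
        return False, "empty command"
    program = argv[0]
    if program not in ALLOWED:
        return False, f"program not allowlisted: {program}"
    subcommands = ALLOWED[program]
    if subcommands is not None and (len(argv) < 2 or argv[1] not in subcommands):
        return False, f"subcommand not allowlisted for {program}"
    return True, "allowed"


for cmd in ("git status", "git push --force", "rm -rf /"):
    print(cmd, "->", check_command(cmd))
```

A trap suite for D8.E2 would then be a list of unsafe commands, each of which must come back denied with a legible reason.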

Direction D9 — Human-in-the-Loop Supervision UX¶

Hypothesis: Effective supervision needs concise, interruption-minimizing review surfaces.

Experiments:

D9.E1 Morning digest quality test
Produce nightly summary bundles that a human can triage in under 15 minutes.
D9.E2 Escalation-threshold experiments
Tune conditions that wake/notify humans during overnight runs.
D9.E3 Decision checkpoint templates
Standardize “approve/revise/reject/defer” packets to speed supervisor actions.

Success signal: lower cognitive load per review and more consistent decisions across supervisors.

Direction D10 — Worktree and Diff Lifecycle Economics¶

Hypothesis: Worktree isolation and diff hygiene determine whether large run volumes stay operational.

Experiments:

D10.E1 Worktree churn endurance test
High-volume create/run/discard cycles to uncover filesystem/git bottlenecks.
D10.E2 Branch naming + garbage-collection policy test
Validate recoverability and cleanup under heavy overnight load.
D10.E3 Diff-size governance experiment
Enforce size caps and auto-splitting to maintain reviewability.

Success signal: clean rollback/recovery with no stale workspace accumulation.
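The auto-splitting in D10.E3 could start from something as simple as greedy grouping by changed-line count. This sketch assumes per-file change counts are already known (for example from a diff stat) and ignores dependencies between files; the function name and the 300-line cap are illustrative.

```python
def split_diff(file_changes: dict[str, int], max_lines: int) -> list[list[str]]:
    """Group files into slices whose total changed lines stay within max_lines.

    A single file larger than the cap still gets its own slice, so the
    caller must flag that slice for manual handling.
    """
    slices: list[list[str]] = []
    current: list[str] = []
    size = 0
    for path, lines in sorted(file_changes.items()):
        # Close the current slice when adding this file would exceed the cap.
        if current and size + lines > max_lines:
            slices.append(current)
            current, size = [], 0
        current.append(path)
        size += lines
    if current:
        slices.append(current)
    return slices


changes = {"a.py": 120, "b.py": 200, "c.py": 90, "d.py": 400}  # illustrative counts
print(split_diff(changes, max_lines=300))
```

Sorting by path keeps the split reproducible across runs, which matters more for review than achieving the tightest packing.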

Direction D11 — Task Routing and Portfolio Optimization¶

Hypothesis: Headless throughput increases when tasks are auto-routed by complexity/risk profile.

Experiments:

D11.E1 Task classifier prototype
Route tasks into lanes: trivial/mechanical/semantic/high-risk.
D11.E2 Provider-role routing policy
Map task lanes to best agent/provider role combinations.
D11.E3 Cost-aware routing loop
Include budget pressure in planner decisions while maintaining quality SLOs.

Success signal: improved quality-adjusted throughput per dollar/hour.
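A rule-based starting point for the D11.E1 classifier, using the four lanes named above. The Task features and the rule order are assumptions chosen for illustration; a real prototype would derive them from benchmark evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task features available before routing."""
    files_touched: int
    touches_sensitive_path: bool
    needs_design_judgment: bool


def route(task: Task) -> str:
    """Assign a lane; risk is checked first so it always wins."""
    if task.touches_sensitive_path:
        return "high-risk"
    if task.needs_design_judgment:
        return "semantic"
    if task.files_touched <= 1:
        return "trivial"
    return "mechanical"


print(route(Task(files_touched=1, touches_sensitive_path=False, needs_design_judgment=False)))
print(route(Task(files_touched=12, touches_sensitive_path=True, needs_design_judgment=True)))
```

Checking risk before complexity means a one-line change to a sensitive path never lands in the trivial lane.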

Direction D12 — Evaluation Harness and Benchmark Corpus¶

Hypothesis: Long-term progress stalls without a stable benchmark corpus and scorecard.

Experiments:

D12.E1 OA01.x benchmark pack v0
Curate 30–50 representative repository tasks with gold review expectations.
D12.E2 Replayable seed-run framework
Re-run benchmark suites across architecture/prompt/policy changes.
D12.E3 Regression gate for orchestration changes
Block merges that degrade key run-quality metrics beyond thresholds.

Success signal: trendable, reproducible evidence of platform improvement.
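The D12.E3 gate could be expressed as a comparison of benchmark scores against a baseline. The sketch assumes higher-is-better metrics and per-metric allowed drops; the metric names and thresholds shown are illustrative, not measured values.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> list[str]:
    """Metrics whose score fell by more than the allowed drop."""
    failed = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            failed.append(metric)
    return failed


# Illustrative numbers only.
baseline = {"run_completion": 0.92, "review_acceptance": 0.70}
candidate = {"run_completion": 0.90, "review_acceptance": 0.62}
max_drop = {"run_completion": 0.03, "review_acceptance": 0.05}

failed = regressions(baseline, candidate, max_drop)
print("block merge" if failed else "allow merge", failed)
```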

Prioritized execution waves¶

Wave 1 (Immediate)¶

D1, D2, D6, D8
Goal: make overnight runs safe, restartable, and interpretable.

Wave 2 (Near-term)¶

D3, D4, D5, D7
Goal: maximize multi-agent effectiveness and feedback-loop quality.

Wave 3 (Scale-up)¶

D9, D10, D11, D12
Goal: operationalize at high run volume with measurable governance.

Core metrics for all experiments¶

Run completion: completed / started.
Intervention rate: human interrupts per 100 runs.
Policy breach rate: violations by severity tier.
Loop efficiency: iterations to accepted output.
Review acceptance: share of outputs accepted with minimal edits.
Cost efficiency: accepted-output per unit compute budget.
Time-to-digest: median morning supervisor triage time.

Minimum reporting template per experiment¶

For consistency, each executed experiment should produce a note with:

hypothesis,
setup and inputs,
run counts and timestamps,
outcomes against core metrics,
failure taxonomy,
recommendation: iterate / promote / discard,
follow-up experiment IDs.
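The template above could be captured as a typed record so that notes stay comparable across experiments. The class name, field names, and types are hypothetical, and the example values are placeholders, not results.

```python
import json
from dataclasses import asdict, dataclass, field

RECOMMENDATIONS = ("iterate", "promote", "discard")


@dataclass
class ExperimentNote:
    """Hypothetical record mirroring the reporting template."""
    experiment_id: str
    hypothesis: str
    setup_and_inputs: str
    run_count: int
    started_at: str
    finished_at: str
    outcomes: dict[str, float]
    failure_taxonomy: dict[str, int]
    recommendation: str
    follow_ups: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the three allowed recommendations.
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError(f"recommendation must be one of {RECOMMENDATIONS}")


note = ExperimentNote(
    experiment_id="D1.E1",
    hypothesis="Supervisor loop end states are deterministic under replay.",
    setup_and_inputs="placeholder",
    run_count=25,
    started_at="placeholder",
    finished_at="placeholder",
    outcomes={},
    failure_taxonomy={},
    recommendation="iterate",
    follow_ups=["D1.E2"],
)
print(json.dumps(asdict(note), indent=2))
```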

Current recommendation¶

Proceed with Wave 1 as the next long-running execution track, while reserving a fixed percentage of capacity for D4 cross-provider experiments so that provider diversity is validated early instead of waiting for Wave 2.