
API Reference

tnh_scholar

TNH Scholar: Text Processing and Analysis Tools

TNH Scholar is an AI-driven project designed to explore, query, process and translate the teachings of Thich Nhat Hanh and other Plum Village Dharma Teachers. The project aims to create a resource for practitioners and scholars to deeply engage with mindfulness and spiritual wisdom through natural language processing and machine learning models.

Core Features
  • Audio transcription and processing
  • Multi-lingual text processing and translation
  • Prompt-based text analysis
  • OCR processing for historical documents
  • CLI tools for batch processing
Package Structure
  • tnh_scholar/
      • cli_tools/ - Command line interface tools
      • audio_processing/ - Audio file handling and transcription
      • journal_processing/ - Journal and publication processing
      • ocr_processing/ - Optical character recognition tools
      • text_processing/ - Core text processing utilities
      • video_processing/ - Video file handling and transcription
      • utils/ - Shared utility functions
      • xml_processing/ - XML parsing and generation
Environment Configuration
The package uses environment variables for configuration, including:
  • TNH_PROMPT_DIR - Directory for text processing prompts
  • OPENAI_API_KEY - OpenAI API authentication
  • GOOGLE_VISION_KEY - Google Cloud Vision API key for OCR
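The variables above could be read at startup roughly as follows. This is a minimal sketch, not the package's own loader: EnvConfig and load_env_config are hypothetical names, and only the three variable names come from the list above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical container for the documented environment variables."""

    prompt_dir: Optional[Path]
    openai_api_key: Optional[str]
    google_vision_key: Optional[str]


def load_env_config(env: Mapping[str, str] = os.environ) -> EnvConfig:
    """Read the documented variables, leaving unset ones as None."""
    prompt_dir = env.get("TNH_PROMPT_DIR")
    return EnvConfig(
        prompt_dir=Path(prompt_dir) if prompt_dir else None,
        openai_api_key=env.get("OPENAI_API_KEY"),
        google_vision_key=env.get("GOOGLE_VISION_KEY"),
    )
```

Passing the mapping as a parameter keeps the helper easy to exercise in tests without touching the real process environment.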
CLI Tools
  • audio-transcribe - Audio file transcription utility
  • tnh-gen - GenAI CLI for text processing and analysis

For more information, see:
  • Documentation: https://aaronksolomon.github.io/tnh-scholar/
  • Source: https://github.com/aaronksolomon/tnh-scholar
  • Issues: https://github.com/aaronksolomon/tnh-scholar/issues

Dependencies
  • Core: click, pydantic, openai, yt-dlp
  • Optional: streamlit (GUI), spacy (NLP), google-cloud-vision (OCR)

TNH_CLI_TOOLS_DIR = TNH_ROOT_SRC_DIR / 'cli_tools' module-attribute

TNH_METADATA_PROCESS_FIELD = 'tnh_processing' module-attribute

TNH_PROJECT_ROOT_DIR = TNH_ROOT_SRC_DIR.resolve().parent.parent module-attribute

TNH_ROOT_SRC_DIR = Path(__file__).resolve().parent module-attribute

__version__ = '0.4.2' module-attribute
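The path constants above are all derived from the location of the package's __init__.py. The sketch below replays the same arithmetic on a made-up checkout path (the /home/user/tnh-scholar/src prefix is an assumption for illustration) so the relationship between the three directories is visible without importing the package.

```python
from pathlib import PurePosixPath

# Hypothetical location of the package's __init__.py in a checkout.
init_file = PurePosixPath("/home/user/tnh-scholar/src/tnh_scholar/__init__.py")

root_src_dir = init_file.parent                # mirrors TNH_ROOT_SRC_DIR
cli_tools_dir = root_src_dir / "cli_tools"     # mirrors TNH_CLI_TOOLS_DIR
project_root_dir = root_src_dir.parent.parent  # mirrors TNH_PROJECT_ROOT_DIR

print(root_src_dir)
print(cli_tools_dir)
print(project_root_dir)
```

Under this assumed layout, the project root sits two levels above the package directory, which is what the parent.parent chain expresses.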

agent_orchestration

Agent orchestration package.

app

Maintained application-layer bootstrap surface for agent orchestration.

__all__ = ['BootstrapRuntimeProfile', 'HeadlessBootstrapConfig', 'HeadlessBootstrapParams', 'HeadlessBootstrapResult', 'HeadlessBootstrapService', 'HeadlessPolicyConfig', 'HeadlessRunnerConfig', 'HeadlessStorageConfig', 'HeadlessValidationConfig', 'build_bootstrap_runtime_profile'] module-attribute
BootstrapRuntimeProfile dataclass

Explicit temporary bootstrap profile for headless maintained runs.

policy instance-attribute
validation instance-attribute
__init__(validation, policy)
HeadlessBootstrapConfig

Bases: BaseModel

Construction-time config for the maintained headless bootstrap path.

base_ref = 'HEAD' class-attribute instance-attribute
branch_prefix = 'tnh/run-' class-attribute instance-attribute
policy instance-attribute
repo_root instance-attribute
runner = Field(default_factory=HeadlessRunnerConfig) class-attribute instance-attribute
storage instance-attribute
validation instance-attribute
HeadlessBootstrapParams

Bases: BaseModel

Per-run parameters for the maintained headless bootstrap path.

workflow_path instance-attribute
HeadlessBootstrapResult

Bases: BaseModel

Stable summary returned by the maintained headless bootstrap path.

final_state_path instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
run_id instance-attribute
status instance-attribute
status_path instance-attribute
workflow_id instance-attribute
workspace_context = None class-attribute instance-attribute
HeadlessBootstrapService dataclass

Load one workflow and run the maintained kernel end to end.

config instance-attribute
kernel_factory = None class-attribute instance-attribute
workflow_loader = field(default_factory=YamlWorkflowLoader) class-attribute instance-attribute
__init__(config, workflow_loader=YamlWorkflowLoader(), kernel_factory=None)
run(params)

Execute one maintained headless bootstrap run.

HeadlessPolicyConfig

Bases: BaseModel

Construction-time execution policy config.

execution_policy_settings instance-attribute
HeadlessRunnerConfig

Bases: BaseModel

Construction-time runner executable config.

claude_executable = None class-attribute instance-attribute
codex_executable = None class-attribute instance-attribute
HeadlessStorageConfig

Bases: BaseModel

Construction-time storage roots for headless bootstrap runs.

runs_root instance-attribute
workspace_root instance-attribute
for_repo_root(repo_root) classmethod

Return the default storage layout rooted under one repository.

HeadlessValidationConfig

Bases: BaseModel

Construction-time builtin validator mapping config.

builtin_commands = Field(default_factory=tuple) class-attribute instance-attribute
build_bootstrap_runtime_profile()

Return the explicit bootstrap profile used by the maintained CLI.

factory

Composition helpers for the maintained headless bootstrap app layer.

BootstrapKernelBundle dataclass

Maintained collaborators required for one headless bootstrap run.

kernel_service instance-attribute
workspace_service instance-attribute
__init__(kernel_service, workspace_service)
BootstrapKernelFactory dataclass

Build the maintained kernel bundle for one headless bootstrap run.

config instance-attribute
__init__(config)
build()

Return the fully assembled maintained kernel bundle.

BootstrapKernelFactoryProtocol

Bases: Protocol

Build one maintained bootstrap kernel bundle.

build()

Return the fully assembled maintained bootstrap bundle.

SystemClock dataclass

System UTC clock for maintained headless bootstrap runs.

__init__()
now()

Return the current UTC timestamp.

TimestampRunIdGenerator dataclass

Generate compact timestamp run IDs for bootstrap runs.

__init__()
next_id(now)

Return a compact timestamp-based run ID.

models

Typed models for the maintained headless bootstrap app layer.

HeadlessBootstrapConfig

Bases: BaseModel

Construction-time config for the maintained headless bootstrap path.

base_ref = 'HEAD' class-attribute instance-attribute
branch_prefix = 'tnh/run-' class-attribute instance-attribute
policy instance-attribute
repo_root instance-attribute
runner = Field(default_factory=HeadlessRunnerConfig) class-attribute instance-attribute
storage instance-attribute
validation instance-attribute
HeadlessBootstrapParams

Bases: BaseModel

Per-run parameters for the maintained headless bootstrap path.

workflow_path instance-attribute
HeadlessBootstrapResult

Bases: BaseModel

Stable summary returned by the maintained headless bootstrap path.

final_state_path instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
run_id instance-attribute
status instance-attribute
status_path instance-attribute
workflow_id instance-attribute
workspace_context = None class-attribute instance-attribute
HeadlessPolicyConfig

Bases: BaseModel

Construction-time execution policy config.

execution_policy_settings instance-attribute
HeadlessRunnerConfig

Bases: BaseModel

Construction-time runner executable config.

claude_executable = None class-attribute instance-attribute
codex_executable = None class-attribute instance-attribute
HeadlessStorageConfig

Bases: BaseModel

Construction-time storage roots for headless bootstrap runs.

runs_root instance-attribute
workspace_root instance-attribute
for_repo_root(repo_root) classmethod

Return the default storage layout rooted under one repository.

HeadlessValidationConfig

Bases: BaseModel

Construction-time builtin validator mapping config.

builtin_commands = Field(default_factory=tuple) class-attribute instance-attribute
profile

Explicit bootstrap profile assembly for the maintained headless app layer.

BootstrapRuntimeProfile dataclass

Explicit temporary bootstrap profile for headless maintained runs.

policy instance-attribute
validation instance-attribute
__init__(validation, policy)
build_bootstrap_runtime_profile()

Return the explicit bootstrap profile used by the maintained CLI.

service

Maintained application-layer bootstrap service for headless orchestration.

HeadlessBootstrapService dataclass

Load one workflow and run the maintained kernel end to end.

config instance-attribute
kernel_factory = None class-attribute instance-attribute
workflow_loader = field(default_factory=YamlWorkflowLoader) class-attribute instance-attribute
__init__(config, workflow_loader=YamlWorkflowLoader(), kernel_factory=None)
run(params)

Execute one maintained headless bootstrap run.

codex_harness

Suspended Codex harness spike preserved as reference-only code.

adapters

Adapters for codex harness boundaries.

output_parser

Parse structured output from Codex.

CodexOutputParser dataclass

Parse JSON output into structured domain models.

__init__()
parse(text)

Parse the JSON response text into a CodexStructuredOutput.
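A rough sketch of what parsing into a structured output can look like. The field names follow CodexStructuredOutput and the key list in the default system prompt of CodexDefaults; everything else is an assumption: the dataclass is a standard-library stand-in for the real pydantic model, and rejecting extra keys is one possible reading of "No extra keys", not confirmed behavior of CodexOutputParser.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OutputStatus(str, Enum):
    complete = "complete"
    partial = "partial"
    blocked = "blocked"


@dataclass
class StructuredOutput:
    """Stand-in for CodexStructuredOutput (the real model is a pydantic BaseModel)."""

    rationale: str
    status: OutputStatus
    patch: Optional[str] = None
    risk_flags: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)


ALLOWED_KEYS = {"patch", "rationale", "risk_flags", "open_questions", "status"}


def parse_output(text: str) -> StructuredOutput:
    """Parse JSON text and map it onto the stand-in model."""
    payload = json.loads(text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    extra = set(payload) - ALLOWED_KEYS
    if extra:
        # Assumption: unknown keys are treated as an error.
        raise ValueError(f"unexpected keys: {sorted(extra)}")
    return StructuredOutput(
        rationale=str(payload["rationale"]),
        status=OutputStatus(payload["status"]),  # ValueError on unknown status
        patch=payload.get("patch"),
        risk_flags=list(payload.get("risk_flags", [])),
        open_questions=list(payload.get("open_questions", [])),
    )
```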

response_schema

Response schema builder for Codex structured output.

ResponseSchema

Bases: BaseModel

Response schema wrapper for OpenAI response_format.

json_schema instance-attribute
type = 'json_schema' class-attribute instance-attribute
to_openai()

Return payload for OpenAI response_format.

ResponseSchemaBuilder dataclass

Build response schema payloads.

name = 'codex_harness_output' class-attribute instance-attribute
__init__(name='codex_harness_output')
build()

Build the schema for structured output.
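To make the wrapper concrete, here is a hedged sketch of a schema payload in the shape OpenAI's json_schema response format accepts. ResponseSchemaSketch and build_output_schema are hypothetical stand-ins; the exact schema the package emits is not shown in this reference, so the properties below are inferred from the keys listed in the default system prompt.

```python
from dataclasses import dataclass
from typing import Any, Dict


def build_output_schema() -> Dict[str, Any]:
    """JSON Schema for the five keys named in the default system prompt."""
    return {
        "type": "object",
        "properties": {
            "patch": {"type": ["string", "null"]},
            "rationale": {"type": "string"},
            "risk_flags": {"type": "array", "items": {"type": "string"}},
            "open_questions": {"type": "array", "items": {"type": "string"}},
            "status": {"type": "string", "enum": ["complete", "partial", "blocked"]},
        },
        "required": ["patch", "rationale", "risk_flags", "open_questions", "status"],
        "additionalProperties": False,
    }


@dataclass
class ResponseSchemaSketch:
    """Stand-in for ResponseSchema with the same two documented fields."""

    json_schema: Dict[str, Any]
    type: str = "json_schema"

    def to_openai(self) -> Dict[str, Any]:
        return {"type": self.type, "json_schema": self.json_schema}


schema = ResponseSchemaSketch(
    json_schema={
        "name": "codex_harness_output",  # default name of ResponseSchemaBuilder
        "schema": build_output_schema(),
        "strict": True,
    }
)
payload = schema.to_openai()
```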

models

Domain models for the Codex harness.

CodexDefaults dataclass

Default values for harness settings and parameters.

default_system_prompt = 'Use the provided tools to inspect the repo. Use repo-relative paths. Return ONLY JSON with keys: patch (string or null), rationale (string), risk_flags (list of strings), open_questions (list of strings), status (complete|partial|blocked). No extra keys.' class-attribute instance-attribute
max_output_tokens = 2000 class-attribute instance-attribute
max_tool_rounds = 12 class-attribute instance-attribute
model = 'gpt-5.2-codex' class-attribute instance-attribute
runs_root = Path('.tnh-codex/runs') class-attribute instance-attribute
temperature = None class-attribute instance-attribute
timeout_seconds = 900 class-attribute instance-attribute
__init__(runs_root=Path('.tnh-codex/runs'), model='gpt-5.2-codex', timeout_seconds=900, max_output_tokens=2000, temperature=None, max_tool_rounds=12, default_system_prompt='Use the provided tools to inspect the repo. Use repo-relative paths. Return ONLY JSON with keys: patch (string or null), rationale (string), risk_flags (list of strings), open_questions (list of strings), status (complete|partial|blocked). No extra keys.')
CodexMessage

Bases: BaseModel

Message entry for a Codex request.

content instance-attribute
role instance-attribute
CodexOutputStatus

Bases: str, Enum

blocked = 'blocked' class-attribute instance-attribute
complete = 'complete' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
CodexRequest

Bases: BaseModel

Codex request payload for the Responses API.

max_output_tokens instance-attribute
max_tool_rounds instance-attribute
messages instance-attribute
model instance-attribute
temperature instance-attribute
CodexResponseText

Bases: BaseModel

Raw response text captured from the API.

raw_payload instance-attribute
text instance-attribute
CodexRunArtifacts

Bases: BaseModel

Paths to files generated by a run.

output_json instance-attribute
patch_diff instance-attribute
request_json instance-attribute
response_json instance-attribute
response_text instance-attribute
run_metadata instance-attribute
stderr_log instance-attribute
stdout_log instance-attribute
CodexRunConfig

Bases: BaseModel

Construction-time configuration for the harness.

model instance-attribute
runs_root instance-attribute
CodexRunMetadata

Bases: BaseModel

Metadata for a Codex harness run.

artifacts instance-attribute
ended_at instance-attribute
error_message = None class-attribute instance-attribute
model instance-attribute
output_status = None class-attribute instance-attribute
patch_applied = False class-attribute instance-attribute
run_id instance-attribute
started_at instance-attribute
status instance-attribute
test_exit_code = None class-attribute instance-attribute
CodexRunParams

Bases: BaseModel

Per-run parameters for the harness.

apply_patch = True class-attribute instance-attribute
max_output_tokens = Field(default_factory=(lambda: CodexDefaults().max_output_tokens)) class-attribute instance-attribute
max_tool_rounds = Field(default_factory=(lambda: CodexDefaults().max_tool_rounds)) class-attribute instance-attribute
run_tests_command = None class-attribute instance-attribute
system_prompt = None class-attribute instance-attribute
task instance-attribute
temperature = Field(default_factory=(lambda: CodexDefaults().temperature)) class-attribute instance-attribute
timeout_seconds = Field(default_factory=(lambda: CodexDefaults().timeout_seconds)) class-attribute instance-attribute
CodexRunStatus

Bases: str, Enum

blocked = 'blocked' class-attribute instance-attribute
completed = 'completed' class-attribute instance-attribute
failed = 'failed' class-attribute instance-attribute
CodexSettings

Bases: BaseSettings

Environment-driven settings for the Codex harness.

model = Field(default_factory=(lambda: CodexDefaults().model)) class-attribute instance-attribute
model_config = SettingsConfigDict(extra='ignore') class-attribute instance-attribute
openai_api_key = None class-attribute instance-attribute
runs_root = Field(default_factory=(lambda: CodexDefaults().runs_root)) class-attribute instance-attribute
from_env() classmethod

Create settings from environment.

CodexStructuredOutput

Bases: BaseModel

Structured output expected from Codex.

open_questions = Field(default_factory=list) class-attribute instance-attribute
patch = None class-attribute instance-attribute
rationale instance-attribute
risk_flags = Field(default_factory=list) class-attribute instance-attribute
status instance-attribute
PatchApplyResult

Bases: BaseModel

Result of applying a patch.

applied instance-attribute
stderr instance-attribute
stdout instance-attribute
TestRunResult

Bases: BaseModel

Result of running a test command.

exit_code instance-attribute
stderr instance-attribute
stdout instance-attribute
protocols

Protocol definitions for the Codex harness.

ArtifactWriterProtocol

Bases: Protocol

Persist run artifacts to disk.

ensure_run_dir(run_id)

Ensure the run directory exists and return it.

write_json(path, payload)

Write JSON content to a file.

write_text(path, content)

Write text content to a file.

ClockProtocol

Bases: Protocol

Abstraction for time sourcing.

now()

Return the current timestamp.

PatchApplierProtocol

Bases: Protocol

Apply unified diff patches.

apply(patch)

Apply the patch to the workspace.

ResponsesClientProtocol

Bases: Protocol

Call the OpenAI Responses API.

run(request, tool_registry)

Execute the request and return response text.

RunIdGeneratorProtocol

Bases: Protocol

Generate run identifiers.

next_id(*, now)

Return a new run id.
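Because these collaborators are declared as Protocol classes, any object with matching methods can be passed in, which makes it easy to swap in a fixed clock for tests. The sketch below illustrates the idea for ClockProtocol and RunIdGeneratorProtocol; the return types, the FixedClock and TimestampIds classes, and the format string are assumptions, since the reference lists only method names and parameters.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


class ClockProtocol(Protocol):
    def now(self) -> datetime: ...


class RunIdGeneratorProtocol(Protocol):
    def next_id(self, *, now: datetime) -> str: ...


@dataclass(frozen=True)
class FixedClock:
    """Test double: always returns the same instant."""

    instant: datetime

    def now(self) -> datetime:
        return self.instant


@dataclass(frozen=True)
class TimestampIds:
    """Hypothetical generator; the format string is an assumption."""

    def next_id(self, *, now: datetime) -> str:
        return now.strftime("%Y%m%d-%H%M%S")


def start_run(clock: ClockProtocol, ids: RunIdGeneratorProtocol) -> str:
    """Depend on the protocols only, never on concrete classes."""
    return ids.next_id(now=clock.now())


run_id = start_run(
    FixedClock(datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)),
    TimestampIds(),
)
print(run_id)  # 20250102-030405
```

Neither FixedClock nor TimestampIds inherits from the protocols; structural typing is enough for a type checker to accept them.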

SearcherProtocol

Bases: Protocol

Search for text in the repository.

search(query, root)

Return matching lines for the query.

TestRunnerProtocol

Bases: Protocol

Run test commands.

run(command, timeout_seconds)

Execute a test command and capture results.

ToolExecutorProtocol

Bases: Protocol

Execute tool calls for the Codex harness.

execute(call)

Execute the tool call and return output.

ToolRegistryProtocol

Bases: Protocol

Registry for tool definitions and execution.

definitions()

Return tool definitions.

execute(call)

Execute tool call.

WorkspaceLocatorProtocol

Bases: Protocol

Locate the repository root for tools.

repo_root()

Return the repository root.

providers

Providers for codex harness infrastructure.

artifact_writer

Artifact writer for Codex harness.

FileArtifactWriter dataclass

Bases: ArtifactWriterProtocol

Write artifacts to disk.

runs_root instance-attribute
__init__(runs_root)
ensure_run_dir(run_id)

Ensure the run directory exists and return it.

write_json(path, payload)

Write JSON content to a file.

write_text(path, content)

Write text content to a file.
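A minimal sketch of a writer with the same three methods, using only the standard library. FileArtifactWriterSketch is a stand-in, and the request.json file name and the JSON formatting options are assumptions; the reference does not document what FileArtifactWriter actually writes.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class FileArtifactWriterSketch:
    runs_root: Path

    def ensure_run_dir(self, run_id: str) -> Path:
        """Create runs_root/run_id if needed and return it."""
        run_dir = self.runs_root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def write_text(self, path: Path, content: str) -> None:
        path.write_text(content, encoding="utf-8")

    def write_json(self, path: Path, payload: Any) -> None:
        self.write_text(path, json.dumps(payload, indent=2, sort_keys=True))


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    writer = FileArtifactWriterSketch(runs_root=Path(tmp) / "runs")
    run_dir = writer.ensure_run_dir("20250102-030405")
    writer.write_json(run_dir / "request.json", {"model": "example"})
    saved = json.loads((run_dir / "request.json").read_text(encoding="utf-8"))
```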

chat_completions_client

OpenAI Chat Completions API client for Codex harness.

ChatCompletionsClient dataclass

Bases: ResponsesClientProtocol

Chat Completions API client for Codex harness.

api_key instance-attribute
__init__(api_key)
run(request, tool_registry)

Execute the request and return response text.

clock

Clock provider for the Codex harness.

SystemClock dataclass

Bases: ClockProtocol

System clock implementation.

__init__()
now()

Return current time.

openai_responses_client

OpenAI Responses API client for Codex harness.

OpenAIResponsesClient dataclass

Bases: ResponsesClientProtocol

Responses API client for Codex harness.

api_key instance-attribute
schema_builder = None class-attribute instance-attribute
__init__(api_key, schema_builder=None)
run(request, tool_registry)

Execute the request and return response text.

patch_applier

Patch application provider for Codex harness.

GitPatchApplier dataclass

Bases: PatchApplierProtocol

Apply unified diff patches using git.

__init__()
apply(patch)

Apply the patch to the workspace.

run_id

Run id generator for Codex harness.

TimestampRunIdGenerator dataclass

Bases: RunIdGeneratorProtocol

Timestamp-based run id generator.

__init__()
next_id(*, now)

Return a timestamp-based run id.

searcher

Repository search provider for Codex harness tools.

RipgrepSearcher dataclass

Use ripgrep to search the repository.

__init__()
search(query, root)
test_runner

Test runner provider for Codex harness.

ShellTestRunner dataclass

Bases: TestRunnerProtocol

Run test commands via the shell.

__init__()
run(command, timeout_seconds)

Execute a test command and capture results.
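One way a shell-based runner with this signature could be written is sketched below. CommandResult mirrors the fields of TestRunResult; the timeout handling, including the 124 exit code, is an assumption and not taken from ShellTestRunner.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandResult:
    """Stand-in with the same fields as TestRunResult."""

    exit_code: int
    stdout: str
    stderr: str


def run_test_command(command: str, timeout_seconds: float) -> CommandResult:
    """Run a command through the shell and capture its output."""
    try:
        completed = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # 124 follows the GNU timeout convention; chosen here for illustration.
        return CommandResult(exit_code=124, stdout="", stderr="timed out")
    return CommandResult(completed.returncode, completed.stdout, completed.stderr)


# Usage: run the current interpreter so the example needs no test suite.
result = run_test_command(
    f"{shlex.quote(sys.executable)} -c \"print('ok')\"",
    timeout_seconds=30,
)
```

Since the command string is handed to the shell unmodified, a runner like this should only ever receive trusted commands.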

tool_executor

Tool execution provider for Codex harness.

CodexToolExecutor dataclass

Bases: ToolExecutorProtocol

Execute Codex tool calls against the repo.

patch_applier instance-attribute
searcher instance-attribute
test_runner instance-attribute
test_timeout_seconds instance-attribute
workspace instance-attribute
__init__(workspace, patch_applier, test_runner, searcher, test_timeout_seconds)
execute(call)
tool_registry

Tool registry for Codex harness.

CodexToolRegistry dataclass

Register tool definitions and execute tool calls.

executor instance-attribute
schema_factory instance-attribute
__init__(schema_factory, executor)
definitions()
execute(call)
workspace_locator

Workspace root locator for Codex harness tools.

GitWorkspaceLocator dataclass

Locate repo root using git.

__init__()
repo_root()
service

Service orchestrator for the Codex harness.

CodexHarnessService dataclass

Coordinate Codex harness execution and artifacts.

artifact_writer instance-attribute
clock instance-attribute
output_parser instance-attribute
patch_applier instance-attribute
responses_client instance-attribute
run_id_generator instance-attribute
test_runner instance-attribute
tool_registry instance-attribute
__init__(clock, run_id_generator, responses_client, artifact_writer, output_parser, patch_applier, test_runner, tool_registry)
run(params, config)

Execute a Codex harness run.

tools

Tooling definitions for Codex harness.

ApplyPatchArgs

Bases: BaseModel

diff instance-attribute
ApplyPatchResult

Bases: BaseModel

applied instance-attribute
stderr instance-attribute
stdout instance-attribute
ListFilesArgs

Bases: BaseModel

path instance-attribute
ListFilesResult

Bases: BaseModel

entries = Field(default_factory=list) class-attribute instance-attribute
error = None class-attribute instance-attribute
path instance-attribute
ReadFileArgs

Bases: BaseModel

path instance-attribute
ReadFileResult

Bases: BaseModel

content instance-attribute
error = None class-attribute instance-attribute
path instance-attribute
RunTestsArgs

Bases: BaseModel

command instance-attribute
RunTestsResult

Bases: BaseModel

exit_code instance-attribute
stderr instance-attribute
stdout instance-attribute
SearchRepoArgs

Bases: BaseModel

query instance-attribute
SearchRepoResult

Bases: BaseModel

error = None class-attribute instance-attribute
matches = Field(default_factory=list) class-attribute instance-attribute
query instance-attribute
ToolCall

Bases: BaseModel

Parsed tool call from Codex.

arguments_json instance-attribute
call_id instance-attribute
name instance-attribute
ToolDefinition

Bases: BaseModel

Definition for a callable tool.

description instance-attribute
name instance-attribute
parameters_schema instance-attribute
ToolName

Bases: str, Enum

apply_patch = 'apply_patch' class-attribute instance-attribute
list_files = 'list_files' class-attribute instance-attribute
read_file = 'read_file' class-attribute instance-attribute
run_tests = 'run_tests' class-attribute instance-attribute
search_repo = 'search_repo' class-attribute instance-attribute
ToolResult

Bases: BaseModel

Tool execution result.

call_id instance-attribute
name instance-attribute
output_json instance-attribute
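The ToolCall and ToolResult models carry arguments and output as JSON strings, which suggests a decode, dispatch, encode cycle. The sketch below shows that cycle with dataclass stand-ins; the handler table, the in-memory file map, and the error message are hypothetical, and the real CodexToolExecutor may be structured differently.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class ToolName(str, Enum):
    list_files = "list_files"
    read_file = "read_file"


@dataclass(frozen=True)
class ToolCall:
    call_id: str
    name: str
    arguments_json: str


@dataclass(frozen=True)
class ToolResult:
    call_id: str
    name: str
    output_json: str


Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def read_file_handler(args: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical in-memory "repo" so the sketch needs no real files.
    files = {"README.md": "# demo"}
    path = args["path"]
    if path not in files:
        return {"path": path, "content": "", "error": "not found"}
    return {"path": path, "content": files[path], "error": None}


HANDLERS: Dict[ToolName, Handler] = {ToolName.read_file: read_file_handler}


def execute(call: ToolCall) -> ToolResult:
    """Decode arguments, dispatch by tool name, and encode the output."""
    handler = HANDLERS[ToolName(call.name)]  # ValueError if the name is unknown
    output = handler(json.loads(call.arguments_json))
    return ToolResult(call.call_id, call.name, json.dumps(output))


result = execute(ToolCall("call-1", "read_file", '{"path": "README.md"}'))
```

Echoing call_id back on the result lets the caller match each output to the tool call that produced it.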
ToolSchemaFactory dataclass

Build JSON schema definitions for tools.

__init__()
all_definitions()
apply_patch()
list_files()
read_file()
run_tests()
search_repo()

common

Shared primitives for agent orchestration subsystems.

__all__ = ['local_now', 'strftime_run_id', 'utc_now'] module-attribute
local_now()

Return current local timestamp with timezone information.

strftime_run_id(now, format_string)

Return run id generated from timestamp and format.

utc_now()

Return current UTC timestamp.
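The three helpers are small enough that their likely behavior can be sketched with the standard library alone. The bodies below are assumptions written to match the one-line descriptions above, not copies of the package source, and the format string in the usage line is only an example.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware current time in UTC."""
    return datetime.now(timezone.utc)


def local_now() -> datetime:
    """Current local time; astimezone() attaches the system timezone."""
    return datetime.now().astimezone()


def strftime_run_id(now: datetime, format_string: str) -> str:
    """Format the given timestamp into a run id."""
    return now.strftime(format_string)


fixed = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(strftime_run_id(fixed, "%Y%m%dT%H%M%SZ"))  # 20250102T030405Z
```

Taking the timestamp as an argument, rather than reading the clock inside strftime_run_id, keeps run ids reproducible in tests.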

run_id

Run id helpers shared by orchestration components.

strftime_run_id(now, format_string)

Return run id generated from timestamp and format.

time

Time helpers shared by orchestration components.

local_now()

Return current local timestamp with timezone information.

utc_now()

Return current UTC timestamp.

conductor_mvp

MVP conductor kernel for workflow execution.

__all__ = ['ArtifactPaths', 'BuiltinValidatorSpec', 'ConductorKernelService', 'EvaluateStep', 'GateStep', 'HarnessValidatorSpec', 'KernelRunResult', 'MechanicalOutcome', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RunAgentStep', 'RunValidationStep', 'StopStep', 'ValidatorExecutionSpec', 'WorkflowDefinition', 'WorkflowValidationError'] module-attribute
ArtifactPaths

Bases: BaseModel

MVP artifact outputs per run.

final_state instance-attribute
run_dir instance-attribute
run_log instance-attribute
BuiltinValidatorSpec

Bases: BaseModel

Builtin validator reference resolved by a provider.

kind = 'builtin' class-attribute instance-attribute
name instance-attribute
ConductorKernelService dataclass

Execute a validated workflow deterministically.

agent_runner instance-attribute
artifact_store instance-attribute
clock instance-attribute
gate_approver instance-attribute
planner_evaluator instance-attribute
run_id_generator instance-attribute
validation_runner instance-attribute
workflow_validator instance-attribute
workspace instance-attribute
__init__(clock, run_id_generator, artifact_store, workspace, agent_runner, validation_runner, planner_evaluator, gate_approver, workflow_validator)
run(workflow, run_root)

Execute workflow and return run summary.

EvaluateStep

Bases: BaseStep

Step running planner evaluation.

allowed_next_steps = Field(default_factory=list) class-attribute instance-attribute
opcode = Opcode.evaluate class-attribute instance-attribute
prompt instance-attribute
GateStep

Bases: BaseStep

Human gate step.

gate instance-attribute
opcode = Opcode.gate class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
HarnessValidatorSpec

Bases: BaseModel

Generated harness validator reference resolved by a provider.

artifacts = Field(default_factory=list) class-attribute instance-attribute
kind = 'harness' class-attribute instance-attribute
may_propose_goldens = False class-attribute instance-attribute
name instance-attribute
timeout_seconds = None class-attribute instance-attribute
KernelRunResult

Bases: BaseModel

Kernel execution summary.

artifact_paths instance-attribute
ended_at instance-attribute
last_step_id instance-attribute
run_id instance-attribute
started_at instance-attribute
status instance-attribute
workflow_id instance-attribute
MechanicalOutcome

Bases: str, Enum

Mechanical execution outcomes.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
PlannerDecision

Bases: BaseModel

Structured planner output consumed by EVALUATE.

blockers = Field(default_factory=list) class-attribute instance-attribute
fix_instructions = None class-attribute instance-attribute
next_step = None class-attribute instance-attribute
risk_flags = Field(default_factory=list) class-attribute instance-attribute
status instance-attribute
PlannerStatus

Bases: str, Enum

Semantic planner statuses.

blocked = 'blocked' class-attribute instance-attribute
needs_human = 'needs_human' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
success = 'success' class-attribute instance-attribute
unsafe = 'unsafe' class-attribute instance-attribute
RollbackStep

Bases: BaseStep

Deterministic rollback step.

opcode = Opcode.rollback class-attribute instance-attribute
target instance-attribute
RunAgentStep

Bases: BaseStep

Step invoking an external agent runner.

agent instance-attribute
inputs = Field(default_factory=list) class-attribute instance-attribute
opcode = Opcode.run_agent class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
RunValidationStep

Bases: BaseStep

Step running deterministic validators.

opcode = Opcode.run_validation class-attribute instance-attribute
run instance-attribute
StopStep

Bases: BaseStep

Terminal step.

opcode = Opcode.stop class-attribute instance-attribute
reason = None class-attribute instance-attribute
routes = Field(default_factory=list) class-attribute instance-attribute
ValidatorExecutionSpec

Bases: BaseModel

Trusted validator execution resolved by provider code.

artifacts = () class-attribute instance-attribute
command instance-attribute
cwd instance-attribute
timeout_seconds = None class-attribute instance-attribute
WorkflowDefinition

Bases: BaseModel

Workflow bytecode source document.

defaults = None class-attribute instance-attribute
description instance-attribute
entry_step instance-attribute
steps instance-attribute
version instance-attribute
workflow_id instance-attribute
WorkflowValidationError

Bases: Exception

Raised when a workflow fails static or runtime kernel validation.

adapters

Adapters for conductor MVP.

__all__ = ['YamlWorkflowLoader'] module-attribute
YamlWorkflowLoader dataclass

Load workflow documents from YAML files.

__init__()
load(path)

Parse and normalize a workflow definition.

workflow_loader

Adapter for loading workflow YAML into typed models.

YamlWorkflowLoader dataclass

Load workflow documents from YAML files.

__init__()
load(path)

Parse and normalize a workflow definition.

models

Typed domain models for conductor MVP workflow execution.

StepDefinition = Annotated[RunAgentStep | RunValidationStep | EvaluateStep | GateStep | RollbackStep | StopStep, Field(discriminator='opcode')] module-attribute
ValidatorSpec = Annotated[BuiltinValidatorSpec | HarnessValidatorSpec, Field(discriminator='kind')] module-attribute
AgentRunResult

Bases: BaseModel

Agent step result returned by runner implementations.

outcome instance-attribute
ArtifactPaths

Bases: BaseModel

MVP artifact outputs per run.

final_state instance-attribute
run_dir instance-attribute
run_log instance-attribute
BaseStep

Bases: BaseModel

Common step shape.

id instance-attribute
opcode instance-attribute
routes = Field(default_factory=list) class-attribute instance-attribute
BuiltinValidatorName

Bases: str, Enum

Trusted builtin validator identifiers.

lint = 'lint' class-attribute instance-attribute
tests = 'tests' class-attribute instance-attribute
typecheck = 'typecheck' class-attribute instance-attribute
BuiltinValidatorSpec

Bases: BaseModel

Builtin validator reference resolved by a provider.

kind = 'builtin' class-attribute instance-attribute
name instance-attribute
EvaluateStep

Bases: BaseStep

Step running planner evaluation.

allowed_next_steps = Field(default_factory=list) class-attribute instance-attribute
opcode = Opcode.evaluate class-attribute instance-attribute
prompt instance-attribute
GateOutcome

Bases: str, Enum

Human gate outcomes as provenance events.

gate_approved = 'gate_approved' class-attribute instance-attribute
gate_rejected = 'gate_rejected' class-attribute instance-attribute
gate_timed_out = 'gate_timed_out' class-attribute instance-attribute
GateStep

Bases: BaseStep

Human gate step.

gate instance-attribute
opcode = Opcode.gate class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
HarnessReport

Bases: BaseModel

Minimal harness report fields used by kernel runtime checks.

proposed_goldens = Field(default_factory=list) class-attribute instance-attribute
HarnessValidatorName

Bases: str, Enum

Trusted generated harness validator identifiers.

generated_harness = 'generated_harness' class-attribute instance-attribute
HarnessValidatorSpec

Bases: BaseModel

Generated harness validator reference resolved by a provider.

artifacts = Field(default_factory=list) class-attribute instance-attribute
kind = 'harness' class-attribute instance-attribute
may_propose_goldens = False class-attribute instance-attribute
name instance-attribute
timeout_seconds = None class-attribute instance-attribute
KernelRunResult

Bases: BaseModel

Kernel execution summary.

artifact_paths instance-attribute
ended_at instance-attribute
last_step_id instance-attribute
run_id instance-attribute
started_at instance-attribute
status instance-attribute
workflow_id instance-attribute
MechanicalOutcome

Bases: str, Enum

Mechanical execution outcomes.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
Opcode

Bases: str, Enum

Kernel opcode names.

evaluate = 'EVALUATE' class-attribute instance-attribute
gate = 'GATE' class-attribute instance-attribute
rollback = 'ROLLBACK' class-attribute instance-attribute
run_agent = 'RUN_AGENT' class-attribute instance-attribute
run_validation = 'RUN_VALIDATION' class-attribute instance-attribute
stop = 'STOP' class-attribute instance-attribute
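Note that the member names are lowercase while the values are uppercase. A short sketch of what that means in practice, using a local copy of the enum with the members listed above: lookup by value uses the uppercase string, and because the enum also derives from str, members compare equal to that string.

```python
from enum import Enum


class Opcode(str, Enum):
    evaluate = "EVALUATE"
    gate = "GATE"
    rollback = "ROLLBACK"
    run_agent = "RUN_AGENT"
    run_validation = "RUN_VALIDATION"
    stop = "STOP"


op = Opcode("RUN_AGENT")  # value lookup, e.g. from a parsed workflow document
print(op is Opcode.run_agent)
print(op.name, op.value)
print(op == "RUN_AGENT")
```

Looking up the lowercase name as a value, as in Opcode("run_agent"), raises ValueError.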
PlannerDecision

Bases: BaseModel

Structured planner output consumed by EVALUATE.

blockers = Field(default_factory=list) class-attribute instance-attribute
fix_instructions = None class-attribute instance-attribute
next_step = None class-attribute instance-attribute
risk_flags = Field(default_factory=list) class-attribute instance-attribute
status instance-attribute
PlannerStatus

Bases: str, Enum

Semantic planner statuses.

blocked = 'blocked' class-attribute instance-attribute
needs_human = 'needs_human' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
success = 'success' class-attribute instance-attribute
unsafe = 'unsafe' class-attribute instance-attribute
RollbackStep

Bases: BaseStep

Deterministic rollback step.

opcode = Opcode.rollback class-attribute instance-attribute
target instance-attribute
RouteRule

Bases: BaseModel

Mapping from an outcome key to a next step target.

outcome instance-attribute
target instance-attribute
RunAgentStep

Bases: BaseStep

Step invoking an external agent runner.

agent instance-attribute
inputs = Field(default_factory=list) class-attribute instance-attribute
opcode = Opcode.run_agent class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
RunValidationStep

Bases: BaseStep

Step running deterministic validators.

opcode = Opcode.run_validation class-attribute instance-attribute
run instance-attribute
StopStep

Bases: BaseStep

Terminal step.

opcode = Opcode.stop class-attribute instance-attribute
reason = None class-attribute instance-attribute
routes = Field(default_factory=list) class-attribute instance-attribute
ValidationRunResult

Bases: BaseModel

Deterministic validator execution result.

harness_report = None class-attribute instance-attribute
outcome instance-attribute
ValidatorExecutionSpec

Bases: BaseModel

Trusted validator execution resolved by provider code.

artifacts = () class-attribute instance-attribute
command instance-attribute
cwd instance-attribute
timeout_seconds = None class-attribute instance-attribute
WorkflowDefaults

Bases: BaseModel

Workflow-level optional defaults.

artifacts_dir = None class-attribute instance-attribute
component_kind = None class-attribute instance-attribute
eval_profile = None class-attribute instance-attribute
WorkflowDefinition

Bases: BaseModel

Workflow bytecode source document.

defaults = None class-attribute instance-attribute
description instance-attribute
entry_step instance-attribute
steps instance-attribute
version instance-attribute
workflow_id instance-attribute
protocols

Protocols for conductor MVP collaborators.

AgentRunnerProtocol

Bases: Protocol

Execute RUN_AGENT steps.

run(step, run_dir)

Execute an agent step.

ArtifactStoreProtocol

Bases: Protocol

Persist run artifacts.

ensure_run_dir(run_id, root_dir)

Create and return run directory.

write_text(path, content)

Write text artifact.

ClockProtocol

Bases: Protocol

Abstraction for current time.

now()

Return current timestamp.
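
Since the collaborators here are `typing.Protocol` classes, any object with a compatible method should satisfy them structurally, without inheriting from the protocol. A minimal sketch, assuming `now()` returns a `datetime` (the return type is not shown above); `FixedClock` and `stamp` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


class ClockProtocol(Protocol):
    """Local stand-in for the documented protocol (return type assumed)."""

    def now(self) -> datetime: ...


@dataclass(frozen=True)
class FixedClock:
    """Hypothetical test double: always returns the same instant."""

    instant: datetime

    def now(self) -> datetime:
        return self.instant


def stamp(clock: ClockProtocol) -> str:
    """Format the clock's current time; accepts any structural match."""
    return clock.now().isoformat()


fixed = FixedClock(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```

A fixed clock like this is one way to keep kernel runs reproducible in tests.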

GateApproverProtocol

Bases: Protocol

Resolve human gate outcomes.

decide(step, run_dir)

Return gate decision outcome.

PlannerEvaluatorProtocol

Bases: Protocol

Execute EVALUATE steps.

evaluate(step, run_dir)

Return structured planner decision.

RunIdGeneratorProtocol

Bases: Protocol

Abstraction for generating run identifiers.

next_id(now)

Generate next run identifier.

ValidationRunnerProtocol

Bases: Protocol

Execute RUN_VALIDATION steps.

run(step, run_dir)

Execute validation step.

ValidatorResolverProtocol

Bases: Protocol

Resolve trusted validator refs to executable specs.

resolve(validator, run_dir)

Resolve validator into a trusted execution spec.

WorkspaceProtocol

Bases: Protocol

Worktree safety and rollback operations.

capture_pre_run(run_id)

Capture pre-run checkpoint.

rollback(step)

Roll back to the requested checkpoint.

providers

Providers for conductor MVP.

__all__ = ['FileArtifactStore', 'LocalValidationRunner', 'StaticValidatorResolver', 'SystemClock', 'TimestampRunIdGenerator'] module-attribute
FileArtifactStore dataclass

Bases: ArtifactStoreProtocol

Write run artifacts to local filesystem.

__init__()
ensure_run_dir(run_id, root_dir)

Create and return run directory.

write_text(path, content)

Write UTF-8 text content.
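
The two operations can be sketched with `pathlib` alone. This is not the package's implementation: the free functions below only mirror the documented method names, and creating missing parent directories in `write_text` is an assumption.

```python
import tempfile
from pathlib import Path


def ensure_run_dir(run_id: str, root_dir: Path) -> Path:
    """Create (if needed) and return the directory for one run."""
    run_dir = root_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def write_text(path: Path, content: str) -> None:
    """Write UTF-8 text, creating parent directories (assumed behavior)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")


# Usage: write one artifact under a temporary root and read it back.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = ensure_run_dir("run-001", Path(tmp))
    write_text(run_dir / "logs" / "trace.txt", "héllo\n")
    round_trip = (run_dir / "logs" / "trace.txt").read_text(encoding="utf-8")
```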

LocalValidationRunner dataclass

Bases: ValidationRunnerProtocol

Run validators via subprocess in the local worktree.

artifacts_subdir = 'validation_artifacts' class-attribute instance-attribute
validator_resolver instance-attribute
__init__(validator_resolver, artifacts_subdir='validation_artifacts')
run(step, run_dir)

Execute all validators and aggregate outcome/report.

StaticValidatorResolver dataclass

Bases: ValidatorResolverProtocol

Resolve trusted validator refs from code-owned mappings.

entries instance-attribute
harness_report_name = 'harness_report.json' class-attribute instance-attribute
harness_script_name = 'generated_harness.py' class-attribute instance-attribute
__init__(entries, harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
resolve(validator, run_dir)

Resolve validator into a trusted execution spec.

SystemClock dataclass

Bases: ClockProtocol

System UTC clock.

__init__()
now()

Return the current UTC timestamp.

TimestampRunIdGenerator dataclass

Bases: RunIdGeneratorProtocol

Generate run IDs from UTC timestamps.

__init__()
next_id(now)

Return a compact timestamp run ID.
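
One way to derive a compact, sortable run ID from a UTC timestamp is shown below. The exact layout used by `TimestampRunIdGenerator` is not documented here, so the `%Y%m%dT%H%M%SZ` format is an assumption for illustration.

```python
from datetime import datetime, timezone


def next_id(now: datetime) -> str:
    """Render a timestamp as a compact identifier in UTC.

    The layout is an assumed example, not the package's confirmed format.
    """
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


run_id = next_id(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
later_id = next_id(datetime(2024, 1, 2, 3, 4, 6, tzinfo=timezone.utc))
```

With a fixed-width layout like this, lexicographic order matches chronological order, which keeps run directories sorted by start time.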

artifact_store

Artifact store provider for conductor MVP.

FileArtifactStore dataclass

Bases: ArtifactStoreProtocol

Write run artifacts to local filesystem.

__init__()
ensure_run_dir(run_id, root_dir)

Create and return run directory.

write_text(path, content)

Write UTF-8 text content.

clock

Clock provider for conductor MVP.

SystemClock dataclass

Bases: ClockProtocol

System UTC clock.

__init__()
now()

Return the current UTC timestamp.

run_id

Run ID generator provider for conductor MVP.

TimestampRunIdGenerator dataclass

Bases: RunIdGeneratorProtocol

Generate run IDs from UTC timestamps.

__init__()
next_id(now)

Return a compact timestamp run ID.

validation_runner

Local RUN_VALIDATION executor with artifact capture.

TODO(agent-orch, high-priority): This provider still renders trusted validator execution specs into argv tuples for subprocess.run. That is an acceptable short-term infrastructure boundary for PR #35, but it is not the intended end-state architecture for conductor MVP.

Required follow-up:
  • replace ValidatorExecutionSpec.command: tuple[str, ...] with typed command objects per validator family
  • move argv rendering into a final executor-only translation layer
  • eliminate naked command vectors from provider contracts entirely

Do not treat the current implementation as the final security/architecture fix.
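
The follow-up described in the TODO could take roughly the following shape: provider contracts carry typed command objects, and a single executor-side function turns them into argv. This is a sketch under stated assumptions, not the planned design; `PytestCommand`, `LintCommand`, and `render_argv` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PytestCommand:
    """Hypothetical typed command for a pytest-style validator."""

    targets: tuple[str, ...] = ()
    quiet: bool = True


@dataclass(frozen=True)
class LintCommand:
    """Hypothetical typed command for a linter-style validator."""

    paths: tuple[str, ...] = (".",)


ValidatorCommand = Union[PytestCommand, LintCommand]


def render_argv(command: ValidatorCommand) -> tuple[str, ...]:
    """Executor-only translation from a typed command to an argv tuple."""
    if isinstance(command, PytestCommand):
        flags = ("-q",) if command.quiet else ()
        return ("pytest", *flags, *command.targets)
    if isinstance(command, LintCommand):
        return ("ruff", "check", *command.paths)
    raise TypeError(f"unsupported command type: {type(command).__name__}")


argv = render_argv(PytestCommand(targets=("tests/unit",)))
```

Keeping `render_argv` as the only place that produces raw strings means resolvers and runners never pass around unstructured command vectors.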

BuiltinCommandEntry

Bases: BaseModel

Builtin validator command mapping entry.

command = Field(default_factory=tuple) class-attribute instance-attribute
name instance-attribute
LocalValidationRunner dataclass

Bases: ValidationRunnerProtocol

Run validators via subprocess in the local worktree.

artifacts_subdir = 'validation_artifacts' class-attribute instance-attribute
validator_resolver instance-attribute
__init__(validator_resolver, artifacts_subdir='validation_artifacts')
run(step, run_dir)

Execute all validators and aggregate outcome/report.

StaticValidatorResolver dataclass

Bases: ValidatorResolverProtocol

Resolve trusted validator refs from code-owned mappings.

entries instance-attribute
harness_report_name = 'harness_report.json' class-attribute instance-attribute
harness_script_name = 'generated_harness.py' class-attribute instance-attribute
__init__(entries, harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
resolve(validator, run_dir)

Resolve validator into a trusted execution spec.

service

Deterministic conductor kernel service for MVP workflows.

ConductorKernelService dataclass

Execute a validated workflow deterministically.

agent_runner instance-attribute
artifact_store instance-attribute
clock instance-attribute
gate_approver instance-attribute
planner_evaluator instance-attribute
run_id_generator instance-attribute
validation_runner instance-attribute
workflow_validator instance-attribute
workspace instance-attribute
__init__(clock, run_id_generator, artifact_store, workspace, agent_runner, validation_runner, planner_evaluator, gate_approver, workflow_validator)
run(workflow, run_root)

Execute workflow and return run summary.

KernelState dataclass

Execution state that changes over a run, represented as immutable snapshots.

current_step_id instance-attribute
pending_golden_gate = False class-attribute instance-attribute
trace = field(default_factory=list) class-attribute instance-attribute
__init__(current_step_id, pending_golden_gate=False, trace=list())
advance(step_id, next_step_id, pending_gate=None)

Advance state with trace update.

log_text()

Render trace log text.
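
The "advance returns a new state" idea can be sketched with a frozen dataclass. The field and method names follow the listing above; the trace entry format is an assumption, and the sketch stores the trace as a tuple where the documented field defaults to a list.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class KernelState:
    """Illustrative immutable state; not the package's implementation."""

    current_step_id: str
    pending_golden_gate: bool = False
    trace: tuple[str, ...] = field(default_factory=tuple)

    def advance(
        self,
        step_id: str,
        next_step_id: str,
        pending_gate: Optional[bool] = None,
    ) -> "KernelState":
        """Return a new state; the original instance is left untouched."""
        gate = self.pending_golden_gate if pending_gate is None else pending_gate
        entry = f"{step_id} -> {next_step_id}"  # assumed trace format
        return replace(
            self,
            current_step_id=next_step_id,
            pending_golden_gate=gate,
            trace=(*self.trace, entry),
        )

    def log_text(self) -> str:
        """Join trace entries into one newline-separated string."""
        return "\n".join(self.trace)


start = KernelState(current_step_id="plan")
after = start.advance("plan", "validate", pending_gate=True)
```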

WorkflowCatalog dataclass

Indexed workflow helper for step lookups.

workflow instance-attribute
__init__(workflow)
find_step(step_id)

Find a step by id or raise.

has_step_id(step_id)

Return True if the workflow contains the provided step id.

has_step_type(opcode)

Return True if the workflow contains at least one step with the given opcode.

route_target(step, outcome_key, *, context)

Return route target for an outcome or raise with context.

transition_targets(step)

Return all declared transition targets for a step.
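
A rough sketch of indexed step lookup and route resolution follows. The `Step` and `Route` shapes, including the `id` attribute, are simplified stand-ins rather than the package's models, and the exception types raised on a miss are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    outcome: str
    target: str


@dataclass(frozen=True)
class Step:
    id: str
    routes: tuple[Route, ...] = ()


@dataclass
class Catalog:
    """Hypothetical catalog that indexes steps by id once, up front."""

    steps: list[Step]
    _index: dict[str, Step] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self._index = {step.id: step for step in self.steps}

    def find_step(self, step_id: str) -> Step:
        """Return the step with this id or raise KeyError."""
        try:
            return self._index[step_id]
        except KeyError:
            raise KeyError(f"unknown step id: {step_id!r}") from None

    def route_target(self, step: Step, outcome_key: str, *, context: str) -> str:
        """Return the target for an outcome, or raise with caller context."""
        for route in step.routes:
            if route.outcome == outcome_key:
                return route.target
        raise LookupError(
            f"{context}: step {step.id!r} has no route for {outcome_key!r}"
        )


catalog = Catalog([Step("run", (Route("completed", "done"),)), Step("done")])
target = catalog.route_target(catalog.find_step("run"), "completed", context="demo")
```

Passing `context` as a keyword-only argument lets the error message say which phase of the run asked for the route.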

WorkflowValidationError

Bases: Exception

Raised when a workflow fails static or runtime kernel validation.

WorkflowValidator dataclass

Static validation for MVP workflow definitions.

__init__()
validate(workflow)

Validate schema-level and graph-level invariants.
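
Graph-level checks of this kind typically confirm that the entry step exists and that every transition target resolves to a defined step. The specific invariants enforced by `WorkflowValidator` are not listed here, so the two checks below are assumptions; the sketch also returns a list of problems rather than raising `WorkflowValidationError`.

```python
def validate_graph(entry_step: str, transitions: dict[str, list[str]]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed.

    `transitions` maps each defined step id to its declared targets.
    """
    problems: list[str] = []
    if entry_step not in transitions:
        problems.append(f"entry step {entry_step!r} is not defined")
    for step_id, targets in transitions.items():
        for target in targets:
            if target not in transitions:
                problems.append(
                    f"step {step_id!r} routes to unknown step {target!r}"
                )
    return problems


clean = validate_graph("a", {"a": ["b"], "b": []})
broken = validate_graph("x", {"a": ["z"]})
```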

execution

Maintained execution subsystem for agent orchestration.

__all__ = ['CliExecutableInvocation', 'ExecutionOutputCapturePolicy', 'ExecutionRequest', 'ExecutionResult', 'ExecutionServiceProtocol', 'ExecutionTermination', 'ExplicitEnvironmentPolicy', 'InheritParentEnvironmentPolicy', 'IsolatedEnvironmentPolicy', 'PythonScriptInvocation', 'SubprocessExecutionService', 'TimeoutPolicy'] module-attribute
CliExecutableInvocation

Bases: BaseModel

Invoke a concrete executable with typed arguments.

arguments = () class-attribute instance-attribute
executable instance-attribute
family = 'cli_executable' class-attribute instance-attribute
ExecutionOutputCapturePolicy

Bases: BaseModel

How process output should be captured.

capture_stderr = True class-attribute instance-attribute
capture_stdout = True class-attribute instance-attribute
encoding = OutputEncoding.text class-attribute instance-attribute
ExecutionRequest

Bases: BaseModel

Trusted execution request.

environment_policy instance-attribute
invocation instance-attribute
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy) class-attribute instance-attribute
timeout_policy = Field(default_factory=TimeoutPolicy) class-attribute instance-attribute
working_directory instance-attribute
ExecutionResult

Bases: BaseModel

Low-level subprocess result.

exit_code = None class-attribute instance-attribute
failure_message = None class-attribute instance-attribute
stderr_text = '' class-attribute instance-attribute
stdout_text = '' class-attribute instance-attribute
termination instance-attribute
timed_out = False class-attribute instance-attribute
ExecutionServiceProtocol

Bases: Protocol

Execute one trusted execution request.

run(request)

Execute one trusted request and return the normalized result.

ExecutionTermination

Bases: str, Enum

Mechanical subprocess outcomes.

completed = 'completed' class-attribute instance-attribute
idle_timeout = 'idle_timeout' class-attribute instance-attribute
non_zero_exit = 'non_zero_exit' class-attribute instance-attribute
policy_kill = 'policy_kill' class-attribute instance-attribute
startup_failure = 'startup_failure' class-attribute instance-attribute
wall_clock_timeout = 'wall_clock_timeout' class-attribute instance-attribute
ExplicitEnvironmentPolicy

Bases: BaseModel

Use an explicit allowlisted environment.

kind = 'explicit' class-attribute instance-attribute
values = Field(default_factory=dict) class-attribute instance-attribute
empty() classmethod

Return an explicit empty environment policy.

InheritParentEnvironmentPolicy

Bases: BaseModel

Inherit parent environment with optional overrides.

kind = 'inherit_parent' class-attribute instance-attribute
overrides = Field(default_factory=dict) class-attribute instance-attribute
IsolatedEnvironmentPolicy

Bases: BaseModel

Start from an empty environment and allowlist values.

allowlist = Field(default_factory=dict) class-attribute instance-attribute
kind = 'isolated' class-attribute instance-attribute
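
The three environment policies can be read as three ways of building the child process environment. The sketch below uses plain dataclasses with shortened, hypothetical class names; how the package actually distinguishes the explicit and isolated variants is not documented here, so treating both as "only the listed values" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Mapping, Union


@dataclass(frozen=True)
class InheritParent:
    overrides: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Explicit:
    values: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Isolated:
    allowlist: dict[str, str] = field(default_factory=dict)


Policy = Union[InheritParent, Explicit, Isolated]


def resolve_env(policy: Policy, parent: Mapping[str, str]) -> dict[str, str]:
    """Build the child environment for one policy (assumed semantics)."""
    if isinstance(policy, InheritParent):
        # Start from the parent environment; overrides win on conflict.
        return {**parent, **policy.overrides}
    if isinstance(policy, Explicit):
        return dict(policy.values)
    if isinstance(policy, Isolated):
        return dict(policy.allowlist)
    raise TypeError(f"unsupported policy: {type(policy).__name__}")


parent_env = {"PATH": "/usr/bin", "HOME": "/root"}
inherited = resolve_env(InheritParent({"HOME": "/tmp"}), parent_env)
```

In real use the `parent` mapping would usually be `os.environ`; it is passed in explicitly here so the result stays deterministic.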
PythonScriptInvocation

Bases: BaseModel

Invoke a Python script using a specific interpreter.

arguments = () class-attribute instance-attribute
family = 'python_script' class-attribute instance-attribute
interpreter instance-attribute
script_path instance-attribute
SubprocessExecutionService dataclass

Execute trusted subprocess requests.

This boundary accepts only typed invocation models resolved by trusted orchestration code. It never uses shell=True and only renders argv from validated path-bearing invocation families.

__init__()
run(request)

Execute one typed request.
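
A minimal sketch of the kind of boundary described above: run an argv list without a shell, enforce a wall-clock timeout, and map the result onto termination names from `ExecutionTermination`. `Result` and `run_argv` are hypothetical names, and the mapping from exit codes and exceptions to terminations is an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Result:
    """Simplified stand-in for a low-level execution result."""

    termination: str
    exit_code: Optional[int] = None
    stdout_text: str = ""
    stderr_text: str = ""
    timed_out: bool = False


def run_argv(
    argv: Sequence[str],
    *,
    cwd: Optional[str] = None,
    wall_clock_seconds: Optional[float] = None,
) -> Result:
    """Run one command without a shell and classify how it ended."""
    try:
        done = subprocess.run(
            list(argv),
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=wall_clock_seconds,
            shell=False,  # argv is passed as a list, never through a shell
            check=False,
        )
    except subprocess.TimeoutExpired:
        return Result(termination="wall_clock_timeout", timed_out=True)
    except OSError:
        # For example, the executable does not exist or is not runnable.
        return Result(termination="startup_failure")
    termination = "completed" if done.returncode == 0 else "non_zero_exit"
    return Result(termination, done.returncode, done.stdout, done.stderr)


ok = run_argv([sys.executable, "-c", "print('ok')"], wall_clock_seconds=30)
failed = run_argv([sys.executable, "-c", "raise SystemExit(3)"], wall_clock_seconds=30)
```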

TimeoutPolicy

Bases: BaseModel

Execution timeout settings.

idle_seconds is reserved for future streaming/idleness enforcement. The current subprocess service only enforces wall-clock timeouts.

idle_seconds = Field(default=None, description='Maximum allowed idle time in seconds before termination. Reserved for future enforcement in the subprocess service.') class-attribute instance-attribute
wall_clock_seconds = Field(default=None, description='Maximum wall-clock runtime in seconds before termination.') class-attribute instance-attribute
models

Typed models for the execution subsystem.

EnvironmentPolicy = Annotated[InheritParentEnvironmentPolicy | ExplicitEnvironmentPolicy | IsolatedEnvironmentPolicy, Field(discriminator='kind')] module-attribute
Invocation = Annotated[CliExecutableInvocation | PythonScriptInvocation, Field(discriminator='family')] module-attribute
CliExecutableInvocation

Bases: BaseModel

Invoke a concrete executable with typed arguments.

arguments = () class-attribute instance-attribute
executable instance-attribute
family = 'cli_executable' class-attribute instance-attribute
ExecutionOutputCapturePolicy

Bases: BaseModel

How process output should be captured.

capture_stderr = True class-attribute instance-attribute
capture_stdout = True class-attribute instance-attribute
encoding = OutputEncoding.text class-attribute instance-attribute
ExecutionRequest

Bases: BaseModel

Trusted execution request.

environment_policy instance-attribute
invocation instance-attribute
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy) class-attribute instance-attribute
timeout_policy = Field(default_factory=TimeoutPolicy) class-attribute instance-attribute
working_directory instance-attribute
ExecutionResult

Bases: BaseModel

Low-level subprocess result.

exit_code = None class-attribute instance-attribute
failure_message = None class-attribute instance-attribute
stderr_text = '' class-attribute instance-attribute
stdout_text = '' class-attribute instance-attribute
termination instance-attribute
timed_out = False class-attribute instance-attribute
ExecutionTermination

Bases: str, Enum

Mechanical subprocess outcomes.

completed = 'completed' class-attribute instance-attribute
idle_timeout = 'idle_timeout' class-attribute instance-attribute
non_zero_exit = 'non_zero_exit' class-attribute instance-attribute
policy_kill = 'policy_kill' class-attribute instance-attribute
startup_failure = 'startup_failure' class-attribute instance-attribute
wall_clock_timeout = 'wall_clock_timeout' class-attribute instance-attribute
ExplicitEnvironmentPolicy

Bases: BaseModel

Use an explicit allowlisted environment.

kind = 'explicit' class-attribute instance-attribute
values = Field(default_factory=dict) class-attribute instance-attribute
empty() classmethod

Return an explicit empty environment policy.

InheritParentEnvironmentPolicy

Bases: BaseModel

Inherit parent environment with optional overrides.

kind = 'inherit_parent' class-attribute instance-attribute
overrides = Field(default_factory=dict) class-attribute instance-attribute
IsolatedEnvironmentPolicy

Bases: BaseModel

Start from an empty environment and allowlist values.

allowlist = Field(default_factory=dict) class-attribute instance-attribute
kind = 'isolated' class-attribute instance-attribute
OutputEncoding

Bases: str, Enum

Output decoding policy.

text = 'text' class-attribute instance-attribute
PythonScriptInvocation

Bases: BaseModel

Invoke a Python script using a specific interpreter.

arguments = () class-attribute instance-attribute
family = 'python_script' class-attribute instance-attribute
interpreter instance-attribute
script_path instance-attribute
TimeoutPolicy

Bases: BaseModel

Execution timeout settings.

idle_seconds is reserved for future streaming/idleness enforcement. The current subprocess service only enforces wall-clock timeouts.

idle_seconds = Field(default=None, description='Maximum allowed idle time in seconds before termination. Reserved for future enforcement in the subprocess service.') class-attribute instance-attribute
wall_clock_seconds = Field(default=None, description='Maximum wall-clock runtime in seconds before termination.') class-attribute instance-attribute
protocols

Protocols for the execution subsystem.

ExecutionServiceProtocol

Bases: Protocol

Execute one trusted execution request.

run(request)

Execute one trusted request and return the normalized result.

service

Trusted subprocess execution service.

SubprocessExecutionService dataclass

Execute trusted subprocess requests.

This boundary accepts only typed invocation models resolved by trusted orchestration code. It never uses shell=True and only renders argv from validated path-bearing invocation families.

__init__()
run(request)

Execute one typed request.

execution_policy

Maintained execution-policy subsystem for agent orchestration.

__all__ = ['ApprovalPosture', 'EffectiveExecutionPolicy', 'ExecutionPolicyAssembler', 'ExecutionPolicyAssemblerProtocol', 'ExecutionPolicyAssemblyError', 'ExecutionPolicySettings', 'ExecutionPosture', 'NetworkPosture', 'PolicySummary', 'PolicyViolation', 'PolicyViolationClass', 'RequestedExecutionPolicy'] module-attribute
ApprovalPosture

Bases: str, Enum

Interactive approval posture.

bounded_auto_approve = 'bounded_auto_approve' class-attribute instance-attribute
deny_interactive = 'deny_interactive' class-attribute instance-attribute
fail_on_prompt = 'fail_on_prompt' class-attribute instance-attribute
EffectiveExecutionPolicy

Bases: BaseModel

Concrete enforced policy after derivation.

allowed_paths = Field(default_factory=tuple) class-attribute instance-attribute
approval_posture = ApprovalPosture.fail_on_prompt class-attribute instance-attribute
execution_posture = ExecutionPosture.read_only class-attribute instance-attribute
forbidden_operations = Field(default_factory=tuple) class-attribute instance-attribute
forbidden_paths = Field(default_factory=tuple) class-attribute instance-attribute
network_posture = NetworkPosture.deny class-attribute instance-attribute
policy_reference = None class-attribute instance-attribute
ExecutionPolicyAssembler dataclass

Derive requested and effective policy records.

__init__()
assemble(*, settings, workflow_policy_ref=None, step_policy_ref=None, runtime_overrides=None)
ExecutionPolicyAssemblerProtocol

Bases: Protocol

Assemble requested and effective execution policy records.

assemble(*, settings, workflow_policy_ref=None, step_policy_ref=None, runtime_overrides=None)

Assemble one canonical policy summary.

ExecutionPolicyAssemblyError

Bases: ValueError

Raised when execution policy references cannot be assembled.

ExecutionPolicySettings

Bases: BaseModel

System-level execution policy defaults and named references.

default_policy = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
named_policies = Field(default_factory=dict) class-attribute instance-attribute
runtime_overrides = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
ExecutionPosture

Bases: str, Enum

Filesystem execution posture.

read_only = 'read_only' class-attribute instance-attribute
workspace_write = 'workspace_write' class-attribute instance-attribute
NetworkPosture

Bases: str, Enum

Network posture.

allow = 'allow' class-attribute instance-attribute
deny = 'deny' class-attribute instance-attribute
PolicySummary

Bases: BaseModel

Canonical persisted policy record for one executed step.

capability_notes = Field(default_factory=tuple) class-attribute instance-attribute
effective_policy instance-attribute
enforcement_notes = Field(default_factory=tuple) class-attribute instance-attribute
requested_policy instance-attribute
runtime_overrides = None class-attribute instance-attribute
violations = Field(default_factory=tuple) class-attribute instance-attribute
PolicyViolation

Bases: BaseModel

One concrete policy violation.

hard_violation = True class-attribute instance-attribute
message instance-attribute
operation = None class-attribute instance-attribute
path = None class-attribute instance-attribute
violation_class instance-attribute
PolicyViolationClass

Bases: str, Enum

Stable policy violation classes.

forbidden_operation = 'forbidden_operation' class-attribute instance-attribute
forbidden_path = 'forbidden_path' class-attribute instance-attribute
interactive_prompt_violation = 'interactive_prompt_violation' class-attribute instance-attribute
native_policy_block = 'native_policy_block' class-attribute instance-attribute
network_violation = 'network_violation' class-attribute instance-attribute
protected_branch_violation = 'protected_branch_violation' class-attribute instance-attribute
RequestedExecutionPolicy

Bases: BaseModel

Policy intent requested by the control plane.

allowed_paths = None class-attribute instance-attribute
approval_posture = None class-attribute instance-attribute
execution_posture = None class-attribute instance-attribute
forbidden_operations = Field(default_factory=tuple) class-attribute instance-attribute
forbidden_paths = Field(default_factory=tuple) class-attribute instance-attribute
network_posture = None class-attribute instance-attribute
policy_reference = None class-attribute instance-attribute
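
The split between a requested policy (fields default to `None`) and an effective policy (fields default to the restrictive postures) suggests a layered merge. The sketch below is one possible reading, not the assembler's actual algorithm: the "later layers win" precedence and the reduced field set are assumptions, and `Requested`, `Effective`, and this `assemble` signature are illustrative.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Requested:
    """Policy intent; None means the layer has no opinion on that field."""

    execution_posture: Optional[str] = None
    network_posture: Optional[str] = None
    approval_posture: Optional[str] = None


@dataclass(frozen=True)
class Effective:
    """Concrete policy; defaults mirror the documented restrictive values."""

    execution_posture: str = "read_only"
    network_posture: str = "deny"
    approval_posture: str = "fail_on_prompt"


def assemble(*layers: Requested) -> Effective:
    """Merge layers in order; later non-None values replace earlier ones."""
    chosen: dict[str, str] = {}
    for layer in layers:
        for spec in fields(layer):
            value = getattr(layer, spec.name)
            if value is not None:
                chosen[spec.name] = value
    # Any field no layer set falls back to the restrictive default.
    return Effective(**chosen)


system_default = Requested(execution_posture="workspace_write")
step_policy = Requested(network_posture="allow")
runtime_override = Requested(execution_posture="read_only")
effective = assemble(system_default, step_policy, runtime_override)
```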
assembly

Execution policy assembly helpers.

ExecutionPolicyAssembler dataclass

Derive requested and effective policy records.

__init__()
assemble(*, settings, workflow_policy_ref=None, step_policy_ref=None, runtime_overrides=None)
ExecutionPolicyAssemblyError

Bases: ValueError

Raised when execution policy references cannot be assembled.

models

Typed models for maintained execution policy contracts.

ApprovalPosture

Bases: str, Enum

Interactive approval posture.

bounded_auto_approve = 'bounded_auto_approve' class-attribute instance-attribute
deny_interactive = 'deny_interactive' class-attribute instance-attribute
fail_on_prompt = 'fail_on_prompt' class-attribute instance-attribute
EffectiveExecutionPolicy

Bases: BaseModel

Concrete enforced policy after derivation.

allowed_paths = Field(default_factory=tuple) class-attribute instance-attribute
approval_posture = ApprovalPosture.fail_on_prompt class-attribute instance-attribute
execution_posture = ExecutionPosture.read_only class-attribute instance-attribute
forbidden_operations = Field(default_factory=tuple) class-attribute instance-attribute
forbidden_paths = Field(default_factory=tuple) class-attribute instance-attribute
network_posture = NetworkPosture.deny class-attribute instance-attribute
policy_reference = None class-attribute instance-attribute
ExecutionPolicySettings

Bases: BaseModel

System-level execution policy defaults and named references.

default_policy = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
named_policies = Field(default_factory=dict) class-attribute instance-attribute
runtime_overrides = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
ExecutionPosture

Bases: str, Enum

Filesystem execution posture.

read_only = 'read_only' class-attribute instance-attribute
workspace_write = 'workspace_write' class-attribute instance-attribute
NetworkPosture

Bases: str, Enum

Network posture.

allow = 'allow' class-attribute instance-attribute
deny = 'deny' class-attribute instance-attribute
PolicySummary

Bases: BaseModel

Canonical persisted policy record for one executed step.

capability_notes = Field(default_factory=tuple) class-attribute instance-attribute
effective_policy instance-attribute
enforcement_notes = Field(default_factory=tuple) class-attribute instance-attribute
requested_policy instance-attribute
runtime_overrides = None class-attribute instance-attribute
violations = Field(default_factory=tuple) class-attribute instance-attribute
PolicyViolation

Bases: BaseModel

One concrete policy violation.

hard_violation = True class-attribute instance-attribute
message instance-attribute
operation = None class-attribute instance-attribute
path = None class-attribute instance-attribute
violation_class instance-attribute
PolicyViolationClass

Bases: str, Enum

Stable policy violation classes.

forbidden_operation = 'forbidden_operation' class-attribute instance-attribute
forbidden_path = 'forbidden_path' class-attribute instance-attribute
interactive_prompt_violation = 'interactive_prompt_violation' class-attribute instance-attribute
native_policy_block = 'native_policy_block' class-attribute instance-attribute
network_violation = 'network_violation' class-attribute instance-attribute
protected_branch_violation = 'protected_branch_violation' class-attribute instance-attribute
RequestedExecutionPolicy

Bases: BaseModel

Policy intent requested by the control plane.

allowed_paths = None class-attribute instance-attribute
approval_posture = None class-attribute instance-attribute
execution_posture = None class-attribute instance-attribute
forbidden_operations = Field(default_factory=tuple) class-attribute instance-attribute
forbidden_paths = Field(default_factory=tuple) class-attribute instance-attribute
network_posture = None class-attribute instance-attribute
policy_reference = None class-attribute instance-attribute
protocols

Protocols for maintained execution policy assembly.

ExecutionPolicyAssemblerProtocol

Bases: Protocol

Assemble requested and effective execution policy records.

assemble(*, settings, workflow_policy_ref=None, step_policy_ref=None, runtime_overrides=None)

Assemble one canonical policy summary.

kernel

Maintained kernel subsystem for agent orchestration.

__all__ = ['EvaluateStep', 'GateOutcome', 'GateStep', 'KernelRunResult', 'KernelRunService', 'MechanicalOutcome', 'Opcode', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RouteRule', 'RunAgentStep', 'RunValidationStep', 'StopStep', 'WorkflowDefinition', 'WorkflowValidationError', 'WorkflowValidator'] module-attribute
EvaluateStep

Bases: BaseStep

EVALUATE step.

allowed_next_steps = Field(default_factory=list) class-attribute instance-attribute
opcode = 'EVALUATE' class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
GateOutcome

Bases: str, Enum

Human gate outcomes.

gate_approved = 'gate_approved' class-attribute instance-attribute
gate_rejected = 'gate_rejected' class-attribute instance-attribute
gate_timed_out = 'gate_timed_out' class-attribute instance-attribute
GateStep

Bases: BaseStep

GATE step.

gate instance-attribute
opcode = 'GATE' class-attribute instance-attribute
policy = None class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
KernelRunResult

Bases: BaseModel

Kernel run summary.

ended_at instance-attribute
final_state_path instance-attribute
last_step_id instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
run_id instance-attribute
started_at instance-attribute
status instance-attribute
status_path instance-attribute
workflow_id instance-attribute
KernelRunService dataclass

Execute a workflow deterministically.

artifact_store instance-attribute
clock instance-attribute
execution_policy_assembler = field(default_factory=ExecutionPolicyAssembler) class-attribute instance-attribute
execution_policy_settings = field(default_factory=ExecutionPolicySettings) class-attribute instance-attribute
gate_approver instance-attribute
heartbeat_executor = field(default_factory=(lambda: ThreadPoolExecutor(max_workers=1)), repr=False, compare=False) class-attribute instance-attribute
heartbeat_interval_seconds = 30.0 class-attribute instance-attribute
planner_evaluator instance-attribute
run_id_generator instance-attribute
runner_service instance-attribute
validation_service instance-attribute
workflow_validator instance-attribute
workspace instance-attribute
__init__(clock, run_id_generator, artifact_store, workspace, runner_service, validation_service, planner_evaluator, gate_approver, workflow_validator, execution_policy_settings=ExecutionPolicySettings(), execution_policy_assembler=ExecutionPolicyAssembler(), heartbeat_interval_seconds=30.0, heartbeat_executor=(lambda: ThreadPoolExecutor(max_workers=1))())
run(workflow, run_root)

Execute a workflow and return summary.

MechanicalOutcome

Bases: str, Enum

Mechanical outcomes used for kernel routing.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
Opcode

Bases: str, Enum

Kernel opcode names.

Values intentionally mirror the accepted OA04 workflow schema tokens.

evaluate = 'EVALUATE' class-attribute instance-attribute
gate = 'GATE' class-attribute instance-attribute
rollback = 'ROLLBACK' class-attribute instance-attribute
run_agent = 'RUN_AGENT' class-attribute instance-attribute
run_validation = 'RUN_VALIDATION' class-attribute instance-attribute
stop = 'STOP' class-attribute instance-attribute
PlannerDecision

Bases: BaseModel

Structured planner output.

blockers = Field(default_factory=list) class-attribute instance-attribute
fix_instructions = None class-attribute instance-attribute
next_step = None class-attribute instance-attribute
risk_flags = Field(default_factory=list) class-attribute instance-attribute
status instance-attribute
PlannerStatus

Bases: str, Enum

Semantic planner statuses.

blocked = 'blocked' class-attribute instance-attribute
needs_human = 'needs_human' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
success = 'success' class-attribute instance-attribute
unsafe = 'unsafe' class-attribute instance-attribute
RollbackStep

Bases: BaseStep

ROLLBACK step.

opcode = 'ROLLBACK' class-attribute instance-attribute
policy = None class-attribute instance-attribute
target instance-attribute
RouteRule

Bases: BaseModel

Outcome to target mapping.

outcome instance-attribute
target instance-attribute
RunAgentStep

Bases: BaseStep

RUN_AGENT step.

agent instance-attribute
inputs = Field(default_factory=list) class-attribute instance-attribute
opcode = 'RUN_AGENT' class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
RunValidationStep

Bases: BaseStep

RUN_VALIDATION step.

opcode = 'RUN_VALIDATION' class-attribute instance-attribute
policy = None class-attribute instance-attribute
run instance-attribute
StopStep

Bases: BaseStep

STOP step.

opcode = 'STOP' class-attribute instance-attribute
reason = None class-attribute instance-attribute
WorkflowDefinition

Bases: BaseModel

Workflow document.

defaults = None class-attribute instance-attribute
description instance-attribute
entry_step instance-attribute
steps instance-attribute
version instance-attribute
workflow_id instance-attribute
WorkflowValidationError

Bases: Exception

Raised when workflow invariants are violated.

WorkflowValidator dataclass

Validate static workflow invariants.

__init__()
validate(workflow)
adapters

Adapters for the maintained kernel subsystem.

workflow_loader

YAML workflow loader for the maintained kernel.

YamlWorkflowLoader dataclass

Load workflow documents from YAML.

__init__()
load(path)

Load and normalize one workflow document.

catalog

Workflow graph helpers.

WorkflowCatalog dataclass

Indexed workflow helper.

step_index = None class-attribute instance-attribute
workflow instance-attribute
__init__(workflow, step_index=None)
__post_init__()
find_step(step_id)

Return the step with the given id, or raise if no such step exists.

has_step_id(step_id)

Return whether the workflow contains the given step id.

has_step_type(opcode)

Return whether the workflow contains a step with the given opcode.

path_contains_gate(start_id)

Return whether any reachable path contains a gate step.

reachable_step_ids(start_id)

Return the set of reachable step ids from one step.

route_target(step, outcome_key, *, context)

Return one transition target.

transition_targets(step)

Return declared transition targets.
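
The reachability helpers above can be pictured as a breadth-first walk over declared transition targets. The following is a minimal sketch, not the `WorkflowCatalog` implementation: the plain-dict graph, the step names, and the choice to include the start step in the result are all assumptions made for illustration.

```python
from collections import deque


def reachable_step_ids(transitions: dict, start_id: str) -> set:
    """Return every step id reachable from start_id (start_id included, by assumption)."""
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for target in transitions.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical workflow graph: step id -> declared transition targets.
graph = {
    "implement": ["validate"],
    "validate": ["review", "implement"],
    "review": ["done"],
    "done": [],
    "orphan": ["done"],
}
# Hypothetical opcode per step, used for the gate check below.
opcodes = {
    "implement": "RUN_AGENT",
    "validate": "RUN_VALIDATION",
    "review": "GATE",
    "done": "STOP",
    "orphan": "RUN_AGENT",
}

reachable = reachable_step_ids(graph, "implement")
# A gate check in the spirit of path_contains_gate: is any reachable step a GATE?
has_gate = any(opcodes[step_id] == "GATE" for step_id in reachable)
```

The `seen` set also guards against cycles such as `validate -> implement`, so the walk terminates on workflows that loop back for retries.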

enums

Shared kernel enums used across orchestration contracts.

__all__ = ['AgentFamily', 'GateOutcome', 'MechanicalOutcome', 'Opcode', 'PlannerStatus', 'RunnerTermination'] module-attribute
AgentFamily

Bases: str, Enum

Supported maintained runner families.

claude_cli = 'claude_cli' class-attribute instance-attribute
codex_cli = 'codex_cli' class-attribute instance-attribute
GateOutcome

Bases: str, Enum

Human gate outcomes.

gate_approved = 'gate_approved' class-attribute instance-attribute
gate_rejected = 'gate_rejected' class-attribute instance-attribute
gate_timed_out = 'gate_timed_out' class-attribute instance-attribute
MechanicalOutcome

Bases: str, Enum

Mechanical outcomes used for kernel routing.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
Opcode

Bases: str, Enum

Kernel opcode names.

Values intentionally mirror the accepted OA04 workflow schema tokens.

evaluate = 'EVALUATE' class-attribute instance-attribute
gate = 'GATE' class-attribute instance-attribute
rollback = 'ROLLBACK' class-attribute instance-attribute
run_agent = 'RUN_AGENT' class-attribute instance-attribute
run_validation = 'RUN_VALIDATION' class-attribute instance-attribute
stop = 'STOP' class-attribute instance-attribute
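
Because `Opcode` derives from both `str` and `Enum`, its members can be built from raw schema tokens and compared against them directly. The sketch below redeclares the documented members locally so that it runs on its own; it illustrates standard `enum` behaviour rather than code from the package.

```python
from enum import Enum


class Opcode(str, Enum):
    """Local copy of the documented members, for illustration only."""

    evaluate = "EVALUATE"
    gate = "GATE"
    rollback = "ROLLBACK"
    run_agent = "RUN_AGENT"
    run_validation = "RUN_VALIDATION"
    stop = "STOP"


# Lookup by value turns a schema token into a member.
parsed = Opcode("RUN_AGENT")

# Thanks to the str mixin, a member compares equal to its raw token.
matches_token = parsed == "RUN_AGENT"

# Unknown tokens raise ValueError instead of passing through silently.
try:
    Opcode("DEPLOY")
    rejected = False
except ValueError:
    rejected = True
```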
PlannerStatus

Bases: str, Enum

Semantic planner statuses.

blocked = 'blocked' class-attribute instance-attribute
needs_human = 'needs_human' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
success = 'success' class-attribute instance-attribute
unsafe = 'unsafe' class-attribute instance-attribute
RunnerTermination

Bases: str, Enum

Mechanical outcomes exposed to the kernel by maintained runners.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
errors

Kernel-specific exceptions.

WorkflowValidationError

Bases: Exception

Raised when workflow invariants are violated.

models

Typed models for the maintained kernel subsystem.

StepDefinition = Annotated[RunAgentStep | RunValidationStep | EvaluateStep | GateStep | RollbackStep | StopStep, Field(discriminator='opcode')] module-attribute
__all__ = ['BaseStep', 'EvaluateStep', 'GateOutcome', 'GateStep', 'KernelRunResult', 'MechanicalOutcome', 'Opcode', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RouteRule', 'RunAgentStep', 'RunValidationStep', 'StepDefinition', 'StopStep', 'WorkflowDefaults', 'WorkflowDefinition'] module-attribute
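
The `StepDefinition` alias above is a union tagged by `opcode`: the value of that field decides which step model a raw document is parsed into. The sketch below imitates that selection with standard-library dataclasses instead of pydantic, so it is only an analogy for the mechanism. `parse_step`, `STEP_TYPES`, and the reduced two-class union are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class StopStep:
    id: str
    reason: Optional[str] = None
    opcode: str = "STOP"


@dataclass(frozen=True)
class RollbackStep:
    id: str
    target: str
    opcode: str = "ROLLBACK"


# Hypothetical registry: discriminator value -> step class.
STEP_TYPES = {"STOP": StopStep, "ROLLBACK": RollbackStep}


def parse_step(raw: dict) -> Union[StopStep, RollbackStep]:
    """Pick the step class from the 'opcode' tag, then build it from the raw keys."""
    try:
        step_type = STEP_TYPES[raw["opcode"]]
    except KeyError as exc:
        raise ValueError(f"unknown or missing opcode in step: {raw!r}") from exc
    return step_type(**raw)


halt = parse_step({"id": "halt", "opcode": "STOP"})
undo = parse_step({"id": "undo", "opcode": "ROLLBACK", "target": "implement"})
```

Selecting the class from a single tag field means a malformed step fails against one model's rules rather than against every member of the union in turn.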
BaseStep

Bases: BaseModel

Common step shape.

id instance-attribute
opcode instance-attribute
routes = Field(default_factory=list) class-attribute instance-attribute
EvaluateStep

Bases: BaseStep

EVALUATE step.

allowed_next_steps = Field(default_factory=list) class-attribute instance-attribute
opcode = 'EVALUATE' class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
GateOutcome

Bases: str, Enum

Human gate outcomes.

gate_approved = 'gate_approved' class-attribute instance-attribute
gate_rejected = 'gate_rejected' class-attribute instance-attribute
gate_timed_out = 'gate_timed_out' class-attribute instance-attribute
GateStep

Bases: BaseStep

GATE step.

gate instance-attribute
opcode = 'GATE' class-attribute instance-attribute
policy = None class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
KernelRunResult

Bases: BaseModel

Kernel run summary.

ended_at instance-attribute
final_state_path instance-attribute
last_step_id instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
run_id instance-attribute
started_at instance-attribute
status instance-attribute
status_path instance-attribute
workflow_id instance-attribute
MechanicalOutcome

Bases: str, Enum

Mechanical outcomes used for kernel routing.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
Opcode

Bases: str, Enum

Kernel opcode names.

Values intentionally mirror the accepted OA04 workflow schema tokens.

evaluate = 'EVALUATE' class-attribute instance-attribute
gate = 'GATE' class-attribute instance-attribute
rollback = 'ROLLBACK' class-attribute instance-attribute
run_agent = 'RUN_AGENT' class-attribute instance-attribute
run_validation = 'RUN_VALIDATION' class-attribute instance-attribute
stop = 'STOP' class-attribute instance-attribute
PlannerDecision

Bases: BaseModel

Structured planner output.

blockers = Field(default_factory=list) class-attribute instance-attribute
fix_instructions = None class-attribute instance-attribute
next_step = None class-attribute instance-attribute
risk_flags = Field(default_factory=list) class-attribute instance-attribute
status instance-attribute
PlannerStatus

Bases: str, Enum

Semantic planner statuses.

blocked = 'blocked' class-attribute instance-attribute
needs_human = 'needs_human' class-attribute instance-attribute
partial = 'partial' class-attribute instance-attribute
success = 'success' class-attribute instance-attribute
unsafe = 'unsafe' class-attribute instance-attribute
RollbackStep

Bases: BaseStep

ROLLBACK step.

opcode = 'ROLLBACK' class-attribute instance-attribute
policy = None class-attribute instance-attribute
target instance-attribute
RouteRule

Bases: BaseModel

Outcome-to-target mapping.

outcome instance-attribute
target instance-attribute
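
A step's `routes` list pairs each outcome with the id of the step to run next. The sketch below shows one plausible way to resolve a target from such a list. The first-match-wins rule, the `None` result for an unmatched outcome, and the `resolve_target` helper are assumptions, not documented kernel behaviour.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RouteRule:
    """Standalone stand-in for the documented model."""

    outcome: str
    target: str


def resolve_target(routes: Sequence[RouteRule], outcome: str) -> Optional[str]:
    """Return the target of the first rule matching outcome, or None if none matches."""
    for rule in routes:
        if rule.outcome == outcome:
            return rule.target
    return None


# Hypothetical routes for a RUN_AGENT step.
routes = [
    RouteRule(outcome="completed", target="validate"),
    RouteRule(outcome="error", target="rollback"),
    RouteRule(outcome="killed_timeout", target="stop"),
]

next_step = resolve_target(routes, "error")
unrouted = resolve_target(routes, "killed_idle")
```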
RunAgentStep

Bases: BaseStep

RUN_AGENT step.

agent instance-attribute
inputs = Field(default_factory=list) class-attribute instance-attribute
opcode = 'RUN_AGENT' class-attribute instance-attribute
policy = None class-attribute instance-attribute
prompt instance-attribute
RunValidationStep

Bases: BaseStep

RUN_VALIDATION step.

opcode = 'RUN_VALIDATION' class-attribute instance-attribute
policy = None class-attribute instance-attribute
run instance-attribute
StopStep

Bases: BaseStep

STOP step.

opcode = 'STOP' class-attribute instance-attribute
reason = None class-attribute instance-attribute
WorkflowDefaults

Bases: BaseModel

Workflow defaults block.

artifacts_dir = None class-attribute instance-attribute
component_kind = None class-attribute instance-attribute
eval_profile = None class-attribute instance-attribute
policy = None class-attribute instance-attribute
WorkflowDefinition

Bases: BaseModel

Workflow document.

defaults = None class-attribute instance-attribute
description instance-attribute
entry_step instance-attribute
steps instance-attribute
version instance-attribute
workflow_id instance-attribute
protocols

Protocols required by the maintained kernel.

__all__ = ['ClockProtocol', 'GateApproverProtocol', 'PlannerEvaluatorProtocol', 'RunArtifactPaths', 'RunArtifactStoreProtocol', 'RunIdGeneratorProtocol', 'RunnerResult', 'RunnerServiceProtocol', 'RunnerTaskRequest', 'RollbackStep', 'ValidationResult', 'ValidationServiceProtocol', 'ValidationStepRequest', 'WorkspaceServiceProtocol'] module-attribute
ClockProtocol

Bases: Protocol

Clock abstraction.

now()

Return current timestamp.

GateApproverProtocol

Bases: Protocol

Gate approver contract.

decide(step, run_directory)

Resolve gate outcome.

PlannerEvaluatorProtocol

Bases: Protocol

Planner evaluator contract.

evaluate(step, run_directory)

Evaluate one step.

RollbackStep

Bases: BaseStep

ROLLBACK step.

opcode = 'ROLLBACK' class-attribute instance-attribute
policy = None class-attribute instance-attribute
target instance-attribute
RunArtifactPaths

Bases: BaseModel

Canonical run-scoped filesystem paths.

artifacts_directory instance-attribute
artifacts_root instance-attribute
event_log_path instance-attribute
final_state_path instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
status_path instance-attribute
RunArtifactStoreProtocol

Bases: Protocol

Persist run-scoped artifacts.

append_event(event, paths)

Append one event record.

artifact_step_dir(step_id, paths)

Return the canonical artifact directory for one step.

copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)

Copy one existing file into canonical artifact storage.

create_run(run_id, root_directory)

Create and return canonical run paths.

read_status(run_id, root_directory)

Read and validate live run status for one run id.

status_path_for_run(run_id, root_directory)

Return the canonical status path for one run id.

write_final_state(final_state, paths)

Persist the terminal workflow state summary.

write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)

Write and register one JSON artifact.

write_metadata(metadata, paths)

Persist run metadata.

write_status(status, paths)

Persist live run status.

write_step_manifest(manifest, paths)

Persist one step manifest and return its path.

write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)

Write and register one text artifact.
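
To make the append-only contract of `append_event` concrete, here is a minimal sketch that writes each event as one JSON line. The JSON Lines format, the `events.jsonl` filename, the `read_events` helper, and the plain-dict events are assumptions for illustration. The actual on-disk layout is defined by the store implementation.

```python
import json
import tempfile
from pathlib import Path


def append_event(event: dict, event_log_path: Path) -> None:
    """Append one event as a single JSON line; earlier lines are never rewritten."""
    with event_log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(event_log_path: Path) -> list:
    """Read the log back as a list of dicts, skipping blank lines."""
    with event_log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as run_directory:
    log_path = Path(run_directory) / "events.jsonl"  # hypothetical filename
    append_event(
        {"run_id": "run-1", "step_id": "implement", "event_type": "step_started"},
        log_path,
    )
    append_event(
        {"run_id": "run-1", "step_id": "implement", "event_type": "step_completed"},
        log_path,
    )
    events = read_events(log_path)
```

Opening the file in append mode keeps earlier records intact, which matches the "append one event record" wording of the protocol.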

RunIdGeneratorProtocol

Bases: Protocol

Run id abstraction.

next_id(now)

Generate next run id.
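
Since `ClockProtocol` and `RunIdGeneratorProtocol` are `Protocol` classes, any object with matching methods satisfies them, which makes deterministic test doubles easy to write. The sketch below redeclares both protocols locally. The `datetime` and `str` return types, the run id format, and the `FixedClock`, `TimestampRunIdGenerator`, and `start_run` names are assumptions not taken from the package.

```python
from datetime import datetime, timezone
from typing import Protocol


class ClockProtocol(Protocol):
    def now(self) -> datetime: ...


class RunIdGeneratorProtocol(Protocol):
    def next_id(self, now: datetime) -> str: ...


class FixedClock:
    """Test double that always returns the same instant."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant


class TimestampRunIdGenerator:
    """Derives a run id from the timestamp; the format is made up for this sketch."""

    def next_id(self, now: datetime) -> str:
        return now.strftime("run-%Y%m%dT%H%M%SZ")


def start_run(clock: ClockProtocol, ids: RunIdGeneratorProtocol) -> str:
    """Accept anything that structurally matches the two protocols."""
    return ids.next_id(clock.now())


run_id = start_run(
    FixedClock(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)),
    TimestampRunIdGenerator(),
)
```

Neither double inherits from the protocol classes. Structural typing is enough for a static type checker to accept them.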

RunnerResult

Bases: BaseModel

Kernel-facing runner result.

final_response = None class-attribute instance-attribute
metadata = None class-attribute instance-attribute
termination instance-attribute
transcript = None class-attribute instance-attribute
RunnerServiceProtocol

Bases: Protocol

Execute a maintained runner request.

run(request)

Execute a runner request.

RunnerTaskRequest

Bases: BaseModel

Kernel-facing runner task request.

agent_family instance-attribute
prompt_reference = None class-attribute instance-attribute
rendered_task_text instance-attribute
requested_policy = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
working_directory instance-attribute
ValidationResult

Bases: BaseModel

Validation result exposed to the kernel.

captured_artifacts = Field(default_factory=list) class-attribute instance-attribute
harness_report = None class-attribute instance-attribute
stderr_artifact = None class-attribute instance-attribute
stdout_artifact = None class-attribute instance-attribute
termination instance-attribute
ValidationServiceProtocol

Bases: Protocol

Execute a validation step.

run(request)

Execute validators and return normalized result.

ValidationStepRequest

Bases: BaseModel

Kernel-facing validation step request.

validators instance-attribute
working_directory instance-attribute
WorkspaceServiceProtocol

Bases: Protocol

Workspace safety and rollback operations.

current_context()

Return the current managed workspace context, if any.

diff_summary()

Return normalized diff summary.

planned_worktree_path(run_id)

Return the planned worktree path for one run, if managed.

prepare_pre_run(run_id)

Create and persist the pre-run workspace context.

rollback_pre_run()

Roll back to the pre-run state and return the updated context.

snapshot()

Return current semantic snapshot.

provenance

Kernel-side provenance recording helpers.

KernelProvenanceRecorder dataclass

Persist kernel-side provenance artifacts, manifests, and events.

artifact_store instance-attribute
clock instance-attribute
workspace instance-attribute
__init__(artifact_store, workspace, clock)
record_failed_step(*, run_id, step_id, opcode, started_at, ended_at, paths, extra_artifacts=(), notes)

Persist the canonical failure manifest and event for one step.

record_gate_requested(*, run_id, step_id, paths)

Append the canonical gate-requested event.

record_gate_resolved(*, run_id, step_id, paths)

Append the canonical gate-resolved event.

record_rollback_completed(*, run_id, step_id, paths)

Append the canonical rollback-completed event.

record_route_selected(*, run_id, step_id, next_step_id, opcode, paths)

Append the canonical route-selected event.

record_runner_completed(*, run_id, step_id, runner_family, paths)

Append the canonical runner-completed event.

record_runner_started(*, run_id, step_id, runner_family, paths)

Append the canonical runner-started event.

record_status_updated(*, run_id, step_id, lifecycle_state, paths)

Append the canonical status-updated event.

record_step_blocked(*, run_id, step_id, paths)

Append the canonical step-blocked event.

record_step_manifest(*, run_id, step_id, opcode, termination, started_at, ended_at, paths, extra_artifacts=(), notes=(), next_step_id=None)

Persist the canonical manifest and completion events for one step.

record_step_started(*, run_id, step_id, paths)

Append the canonical step-started event.

record_step_waiting(*, run_id, step_id, paths)

Append the canonical step-waiting event.

service

Top-level maintained kernel service.

T = TypeVar('T') module-attribute
KernelRunService dataclass

Execute a workflow deterministically.

artifact_store instance-attribute
clock instance-attribute
execution_policy_assembler = field(default_factory=ExecutionPolicyAssembler) class-attribute instance-attribute
execution_policy_settings = field(default_factory=ExecutionPolicySettings) class-attribute instance-attribute
gate_approver instance-attribute
heartbeat_executor = field(default_factory=(lambda: ThreadPoolExecutor(max_workers=1)), repr=False, compare=False) class-attribute instance-attribute
heartbeat_interval_seconds = 30.0 class-attribute instance-attribute
planner_evaluator instance-attribute
run_id_generator instance-attribute
runner_service instance-attribute
validation_service instance-attribute
workflow_validator instance-attribute
workspace instance-attribute
__init__(clock, run_id_generator, artifact_store, workspace, runner_service, validation_service, planner_evaluator, gate_approver, workflow_validator, execution_policy_settings=ExecutionPolicySettings(), execution_policy_assembler=ExecutionPolicyAssembler(), heartbeat_interval_seconds=30.0, heartbeat_executor=(lambda: ThreadPoolExecutor(max_workers=1))())
run(workflow, run_root)

Execute a workflow and return summary.

StepContext dataclass

Per-run step execution context.

paths instance-attribute
provenance instance-attribute
run_directory instance-attribute
run_id instance-attribute
started_at instance-attribute
workflow_id instance-attribute
workflow_policy_ref = None class-attribute instance-attribute
workspace_context instance-attribute
__init__(run_id, workflow_id, paths, run_directory, started_at, workspace_context, provenance, workflow_policy_ref=None)
StepPolicyRecord dataclass

Per-step persisted policy evidence.

artifact instance-attribute
note instance-attribute
summary instance-attribute
__init__(summary, artifact, note)
state

Kernel state models.

KernelState dataclass

Immutable runtime state.

current_step_id instance-attribute
pending_golden_gate = False class-attribute instance-attribute
trace = field(default_factory=list) class-attribute instance-attribute
__init__(current_step_id, pending_golden_gate=False, trace=list())
advance(step_id, next_step_id, pending_gate=None)

Advance state immutably.

log_text()

Render trace text.

pending_gate_after_outcome(outcome)

Return pending gate state after one gate decision.

with_pending_gate()

Return a state with pending golden-gate approval.
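
"Advance state immutably" means each transition returns a new state object and leaves the previous one unchanged. A minimal sketch of that pattern with a frozen dataclass follows. The trace line format and the rule that `pending_gate=None` keeps the current flag are assumptions, and only two of the documented methods are imitated.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class KernelState:
    """Standalone stand-in for the documented dataclass."""

    current_step_id: str
    pending_golden_gate: bool = False
    trace: list = field(default_factory=list)

    def advance(
        self, step_id: str, next_step_id: str, pending_gate: Optional[bool] = None
    ) -> "KernelState":
        """Return a new state; the receiver is left untouched."""
        keep_flag = self.pending_golden_gate if pending_gate is None else pending_gate
        return replace(
            self,
            current_step_id=next_step_id,
            pending_golden_gate=keep_flag,
            # Build a fresh list so the old state's trace is never mutated.
            trace=[*self.trace, f"{step_id} -> {next_step_id}"],
        )

    def log_text(self) -> str:
        """Render the trace, one transition per line."""
        return "\n".join(self.trace)


start = KernelState(current_step_id="implement")
after_first = start.advance("implement", "validate")
after_second = after_first.advance("validate", "review", pending_gate=True)
```

Keeping earlier states intact lets a caller hold on to them for logging or comparison without defensive copying.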

validator

Static workflow validation.

WorkflowValidator dataclass

Validate static workflow invariants.

__init__()
validate(workflow)
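
The reference does not list which invariants `validate` enforces. As an illustration of what static workflow validation can look like, the sketch below checks three plausible ones on a plain-dict workflow: unique step ids, a declared `entry_step`, and route targets that exist. The invariants, the dict layout, and the error messages are assumptions.

```python
class WorkflowValidationError(Exception):
    """Raised when workflow invariants are violated."""


def validate(workflow: dict) -> None:
    """Check example invariants; return None on success, raise on the first violation."""
    steps = workflow["steps"]
    step_ids = [step["id"] for step in steps]
    if len(step_ids) != len(set(step_ids)):
        raise WorkflowValidationError("step ids must be unique")
    if workflow["entry_step"] not in step_ids:
        raise WorkflowValidationError(
            f"entry_step {workflow['entry_step']!r} is not a declared step"
        )
    for step in steps:
        for route in step.get("routes", []):
            if route["target"] not in step_ids:
                raise WorkflowValidationError(
                    f"step {step['id']!r} routes to unknown step {route['target']!r}"
                )


# Hypothetical two-step workflow that satisfies all three checks.
workflow = {
    "entry_step": "implement",
    "steps": [
        {
            "id": "implement",
            "opcode": "RUN_AGENT",
            "routes": [{"outcome": "completed", "target": "done"}],
        },
        {"id": "done", "opcode": "STOP"},
    ],
}
validate(workflow)  # passes silently
```

Checks like these run before any step executes, so a broken workflow document is rejected without touching the workspace.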

run_artifacts

Run artifact persistence subsystem.

__all__ = ['ArtifactRole', 'EvidenceReference', 'EvidenceSummary', 'FilesystemRunArtifactStore', 'RunArtifactPaths', 'RunArtifactStoreProtocol', 'RunEventRecord', 'RunEventType', 'RunLifecycleState', 'RunMetadata', 'RunStatus', 'SchemaVersionRecord', 'StepArtifactEntry', 'StepManifest'] module-attribute
ArtifactRole

Bases: str, Enum

Canonical artifact roles reserved for maintained consumers.

gate_outcome = 'gate_outcome' class-attribute instance-attribute
gate_request = 'gate_request' class-attribute instance-attribute
harness_fixture = 'harness_fixture' class-attribute instance-attribute
planner_decision = 'planner_decision' class-attribute instance-attribute
policy_summary = 'policy_summary' class-attribute instance-attribute
runner_final_response = 'runner_final_response' class-attribute instance-attribute
runner_metadata = 'runner_metadata' class-attribute instance-attribute
runner_transcript = 'runner_transcript' class-attribute instance-attribute
validation_report = 'validation_report' class-attribute instance-attribute
validation_stderr = 'validation_stderr' class-attribute instance-attribute
validation_stdout = 'validation_stdout' class-attribute instance-attribute
workspace_diff = 'workspace_diff' class-attribute instance-attribute
workspace_status = 'workspace_status' class-attribute instance-attribute
EvidenceReference

Bases: BaseModel

Compact evaluator-facing evidence reference.

path instance-attribute
role instance-attribute
summary = None class-attribute instance-attribute
EvidenceSummary

Bases: BaseModel

Thin summary of canonical evidence for one step.

notes = Field(default_factory=tuple) class-attribute instance-attribute
references = Field(default_factory=tuple) class-attribute instance-attribute
FilesystemRunArtifactStore dataclass

Bases: RunArtifactStoreProtocol

Persist run artifacts to the local filesystem.

__init__()
append_event(event, paths)
artifact_step_dir(step_id, paths)
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
create_run(run_id, root_directory)
read_status(run_id, root_directory)
status_path_for_run(run_id, root_directory)
write_final_state(final_state, paths)
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
write_metadata(metadata, paths)
write_status(status, paths)
write_step_manifest(manifest, paths)
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
RunArtifactPaths

Bases: BaseModel

Canonical run-scoped filesystem paths.

artifacts_directory instance-attribute
artifacts_root instance-attribute
event_log_path instance-attribute
final_state_path instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
status_path instance-attribute
RunArtifactStoreProtocol

Bases: Protocol

Persist run-scoped artifacts.

append_event(event, paths)

Append one event record.

artifact_step_dir(step_id, paths)

Return the canonical artifact directory for one step.

copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)

Copy one existing file into canonical artifact storage.

create_run(run_id, root_directory)

Create and return canonical run paths.

read_status(run_id, root_directory)

Read and validate live run status for one run id.

status_path_for_run(run_id, root_directory)

Return the canonical status path for one run id.

write_final_state(final_state, paths)

Persist the terminal workflow state summary.

write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)

Write and register one JSON artifact.

write_metadata(metadata, paths)

Persist run metadata.

write_status(status, paths)

Persist live run status.

write_step_manifest(manifest, paths)

Persist one step manifest and return its path.

write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)

Write and register one text artifact.

RunEventRecord

Bases: BaseModel

Appendable runtime event record.

artifact_path = None class-attribute instance-attribute
artifact_role = None class-attribute instance-attribute
event_type instance-attribute
lifecycle_state = None class-attribute instance-attribute
next_step_id = None class-attribute instance-attribute
opcode = None class-attribute instance-attribute
run_id instance-attribute
runner_family = None class-attribute instance-attribute
step_id instance-attribute
timestamp instance-attribute
RunEventType

Bases: str, Enum

Canonical event types for one workflow run.

artifact_recorded = 'artifact_recorded' class-attribute instance-attribute
gate_requested = 'gate_requested' class-attribute instance-attribute
gate_resolved = 'gate_resolved' class-attribute instance-attribute
rollback_completed = 'rollback_completed' class-attribute instance-attribute
route_selected = 'route_selected' class-attribute instance-attribute
runner_completed = 'runner_completed' class-attribute instance-attribute
runner_started = 'runner_started' class-attribute instance-attribute
status_updated = 'status_updated' class-attribute instance-attribute
step_blocked = 'step_blocked' class-attribute instance-attribute
step_completed = 'step_completed' class-attribute instance-attribute
step_failed = 'step_failed' class-attribute instance-attribute
step_started = 'step_started' class-attribute instance-attribute
step_waiting = 'step_waiting' class-attribute instance-attribute
RunLifecycleState

Bases: str, Enum

Bounded operator-facing lifecycle state for one run.

blocked = 'blocked' class-attribute instance-attribute
completed = 'completed' class-attribute instance-attribute
failed = 'failed' class-attribute instance-attribute
running = 'running' class-attribute instance-attribute
waiting = 'waiting' class-attribute instance-attribute
RunMetadata

Bases: BaseModel

Run-level metadata persisted for one workflow execution.

artifacts_root instance-attribute
ended_at = None class-attribute instance-attribute
entry_step instance-attribute
last_step_id = None class-attribute instance-attribute
run_id instance-attribute
schema_versions = Field(default_factory=tuple) class-attribute instance-attribute
started_at instance-attribute
termination = None class-attribute instance-attribute
workflow_id instance-attribute
workflow_version instance-attribute
workspace_context = None class-attribute instance-attribute
RunStatus

Bases: BaseModel

Live operator-facing status for one workflow run.

active_attempt = None class-attribute instance-attribute
active_opcode = None class-attribute instance-attribute
active_runner_family = None class-attribute instance-attribute
blocking_reason = None class-attribute instance-attribute
current_step_id = None class-attribute instance-attribute
elapsed_seconds = None class-attribute instance-attribute
last_artifact_write = None class-attribute instance-attribute
last_completed_step_id = None class-attribute instance-attribute
last_route_target = None class-attribute instance-attribute
lifecycle_state instance-attribute
operator_note = None class-attribute instance-attribute
run_id instance-attribute
started_at instance-attribute
termination = None class-attribute instance-attribute
updated_at instance-attribute
workflow_id instance-attribute
worktree_path = None class-attribute instance-attribute
SchemaVersionRecord

Bases: BaseModel

Version record for one cross-boundary schema.

name instance-attribute
version instance-attribute
StepArtifactEntry

Bases: BaseModel

Canonical manifest entry for one persisted artifact.

important = False class-attribute instance-attribute
media_type instance-attribute
path instance-attribute
required instance-attribute
role instance-attribute
StepManifest

Bases: BaseModel

Canonical manifest for one executed step.

artifacts = Field(default_factory=tuple) class-attribute instance-attribute
ended_at instance-attribute
evidence_summary instance-attribute
opcode instance-attribute
started_at instance-attribute
step_id instance-attribute
termination instance-attribute
artifact_for_role(role)

Return the first artifact matching the requested role.
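
A role lookup such as `artifact_for_role` can be expressed as a first-match scan over the manifest's artifacts. The sketch below uses simplified dataclass stand-ins for `StepArtifactEntry` and `StepManifest`. Returning `None` when no entry matches is an assumption, as are the example paths.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepArtifactEntry:
    role: str
    path: str
    media_type: str
    required: bool
    important: bool = False


@dataclass(frozen=True)
class StepManifest:
    step_id: str
    artifacts: tuple = ()

    def artifact_for_role(self, role: str) -> Optional[StepArtifactEntry]:
        """Return the first artifact with the requested role, or None (assumed)."""
        return next((entry for entry in self.artifacts if entry.role == role), None)


# Hypothetical manifest with two artifacts sharing one role.
manifest = StepManifest(
    step_id="validate",
    artifacts=(
        StepArtifactEntry("validation_stdout", "validate/stdout.txt", "text/plain", True),
        StepArtifactEntry("validation_report", "validate/report.json", "application/json", True),
        StepArtifactEntry("validation_report", "validate/report-2.json", "application/json", False),
    ),
)

report = manifest.artifact_for_role("validation_report")
missing = manifest.artifact_for_role("workspace_diff")
```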

filesystem_store

Filesystem-backed run artifact store.

FilesystemRunArtifactStore dataclass

Bases: RunArtifactStoreProtocol

Persist run artifacts to the local filesystem.

__init__()
append_event(event, paths)
artifact_step_dir(step_id, paths)
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
create_run(run_id, root_directory)
read_status(run_id, root_directory)
status_path_for_run(run_id, root_directory)
write_final_state(final_state, paths)
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
write_metadata(metadata, paths)
write_status(status, paths)
write_step_manifest(manifest, paths)
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
models

Typed models for run artifact persistence.

ArtifactRole

Bases: str, Enum

Canonical artifact roles reserved for maintained consumers.

gate_outcome = 'gate_outcome' class-attribute instance-attribute
gate_request = 'gate_request' class-attribute instance-attribute
harness_fixture = 'harness_fixture' class-attribute instance-attribute
planner_decision = 'planner_decision' class-attribute instance-attribute
policy_summary = 'policy_summary' class-attribute instance-attribute
runner_final_response = 'runner_final_response' class-attribute instance-attribute
runner_metadata = 'runner_metadata' class-attribute instance-attribute
runner_transcript = 'runner_transcript' class-attribute instance-attribute
validation_report = 'validation_report' class-attribute instance-attribute
validation_stderr = 'validation_stderr' class-attribute instance-attribute
validation_stdout = 'validation_stdout' class-attribute instance-attribute
workspace_diff = 'workspace_diff' class-attribute instance-attribute
workspace_status = 'workspace_status' class-attribute instance-attribute
EvidenceReference

Bases: BaseModel

Compact evaluator-facing evidence reference.

path instance-attribute
role instance-attribute
summary = None class-attribute instance-attribute
EvidenceSummary

Bases: BaseModel

Thin summary of canonical evidence for one step.

notes = Field(default_factory=tuple) class-attribute instance-attribute
references = Field(default_factory=tuple) class-attribute instance-attribute
GateOutcomeArtifact

Bases: BaseModel

Canonical gate outcome artifact.

outcome instance-attribute
GateRequestArtifact

Bases: BaseModel

Canonical gate request artifact.

gate instance-attribute
timeout_seconds = None class-attribute instance-attribute
RunArtifactPaths

Bases: BaseModel

Canonical run-scoped filesystem paths.

artifacts_directory instance-attribute
artifacts_root instance-attribute
event_log_path instance-attribute
final_state_path instance-attribute
metadata_path instance-attribute
run_directory instance-attribute
status_path instance-attribute
RunEventRecord

Bases: BaseModel

Appendable runtime event record.

artifact_path = None class-attribute instance-attribute
artifact_role = None class-attribute instance-attribute
event_type instance-attribute
lifecycle_state = None class-attribute instance-attribute
next_step_id = None class-attribute instance-attribute
opcode = None class-attribute instance-attribute
run_id instance-attribute
runner_family = None class-attribute instance-attribute
step_id instance-attribute
timestamp instance-attribute
RunEventType

Bases: str, Enum

Canonical event types for one workflow run.

artifact_recorded = 'artifact_recorded' class-attribute instance-attribute
gate_requested = 'gate_requested' class-attribute instance-attribute
gate_resolved = 'gate_resolved' class-attribute instance-attribute
rollback_completed = 'rollback_completed' class-attribute instance-attribute
route_selected = 'route_selected' class-attribute instance-attribute
runner_completed = 'runner_completed' class-attribute instance-attribute
runner_started = 'runner_started' class-attribute instance-attribute
status_updated = 'status_updated' class-attribute instance-attribute
step_blocked = 'step_blocked' class-attribute instance-attribute
step_completed = 'step_completed' class-attribute instance-attribute
step_failed = 'step_failed' class-attribute instance-attribute
step_started = 'step_started' class-attribute instance-attribute
step_waiting = 'step_waiting' class-attribute instance-attribute
RunLifecycleState

Bases: str, Enum

Bounded operator-facing lifecycle state for one run.

blocked = 'blocked' class-attribute instance-attribute
completed = 'completed' class-attribute instance-attribute
failed = 'failed' class-attribute instance-attribute
running = 'running' class-attribute instance-attribute
waiting = 'waiting' class-attribute instance-attribute
RunMetadata

Bases: BaseModel

Run-level metadata persisted for one workflow execution.

artifacts_root instance-attribute
ended_at = None class-attribute instance-attribute
entry_step instance-attribute
last_step_id = None class-attribute instance-attribute
run_id instance-attribute
schema_versions = Field(default_factory=tuple) class-attribute instance-attribute
started_at instance-attribute
termination = None class-attribute instance-attribute
workflow_id instance-attribute
workflow_version instance-attribute
workspace_context = None class-attribute instance-attribute
RunStatus

Bases: BaseModel

Live operator-facing status for one workflow run.

active_attempt = None class-attribute instance-attribute
active_opcode = None class-attribute instance-attribute
active_runner_family = None class-attribute instance-attribute
blocking_reason = None class-attribute instance-attribute
current_step_id = None class-attribute instance-attribute
elapsed_seconds = None class-attribute instance-attribute
last_artifact_write = None class-attribute instance-attribute
last_completed_step_id = None class-attribute instance-attribute
last_route_target = None class-attribute instance-attribute
lifecycle_state instance-attribute
operator_note = None class-attribute instance-attribute
run_id instance-attribute
started_at instance-attribute
termination = None class-attribute instance-attribute
updated_at instance-attribute
workflow_id instance-attribute
worktree_path = None class-attribute instance-attribute
RunnerMetadataArtifact

Bases: BaseModel

Canonical maintained runner metadata artifact.

agent_family instance-attribute
capture_format = None class-attribute instance-attribute
command = Field(default_factory=tuple) class-attribute instance-attribute
ended_at = None class-attribute instance-attribute
exit_code = None class-attribute instance-attribute
invocation_mode = None class-attribute instance-attribute
prompt_reference = None class-attribute instance-attribute
started_at = None class-attribute instance-attribute
termination instance-attribute
working_directory = None class-attribute instance-attribute
from_normalized_metadata(payload) classmethod

Build the canonical artifact from one normalized runner metadata record.

SchemaVersionRecord

Bases: BaseModel

Version record for one cross-boundary schema.

name instance-attribute
version instance-attribute
StepArtifactEntry

Bases: BaseModel

Canonical manifest entry for one persisted artifact.

important = False class-attribute instance-attribute
media_type instance-attribute
path instance-attribute
required instance-attribute
role instance-attribute
StepManifest

Bases: BaseModel

Canonical manifest for one executed step.

artifacts = Field(default_factory=tuple) class-attribute instance-attribute
ended_at instance-attribute
evidence_summary instance-attribute
opcode instance-attribute
started_at instance-attribute
step_id instance-attribute
termination instance-attribute
artifact_for_role(role)

Return the first artifact matching the requested role.
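
The first-match lookup described above can be sketched as follows. This is a minimal illustration that uses stand-in dataclasses (`ArtifactEntrySketch`, `StepManifestSketch`) rather than the package's Pydantic models. Returning `None` when no artifact matches is an assumption, since the reference does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ArtifactEntrySketch:
    """Hypothetical stand-in for StepArtifactEntry (subset of fields)."""

    role: str
    path: str


@dataclass(frozen=True)
class StepManifestSketch:
    """Hypothetical stand-in for StepManifest (artifacts only)."""

    artifacts: Tuple[ArtifactEntrySketch, ...] = ()

    def artifact_for_role(self, role: str) -> Optional[ArtifactEntrySketch]:
        # First match wins; None when nothing matches (assumed behavior).
        return next((a for a in self.artifacts if a.role == role), None)


manifest = StepManifestSketch(
    artifacts=(
        ArtifactEntrySketch(role="transcript", path="step-1/transcript.txt"),
        ArtifactEntrySketch(role="transcript", path="step-1/transcript-2.txt"),
    )
)
first = manifest.artifact_for_role("transcript")
missing = manifest.artifact_for_role("diff")
```

When several artifacts share a role, only the earliest entry in `artifacts` is returned, so manifest ordering matters to callers.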

protocols

Protocols for run artifact persistence.

RunArtifactStoreProtocol

Bases: Protocol

Persist run-scoped artifacts.

append_event(event, paths)

Append one event record.

artifact_step_dir(step_id, paths)

Return the canonical artifact directory for one step.

copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)

Copy one existing file into canonical artifact storage.

create_run(run_id, root_directory)

Create and return canonical run paths.

read_status(run_id, root_directory)

Read and validate live run status for one run id.

status_path_for_run(run_id, root_directory)

Return the canonical status path for one run id.

write_final_state(final_state, paths)

Persist the terminal workflow state summary.

write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)

Write and register one JSON artifact.

write_metadata(metadata, paths)

Persist run metadata.

write_status(status, paths)

Persist live run status.

write_step_manifest(manifest, paths)

Persist one step manifest and return its path.

write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)

Write and register one text artifact.

runners

Maintained runner subsystem for agent orchestration.

__all__ = ['AdapterCapabilities', 'DelegatingRunnerService', 'RunnerAdapterProtocol', 'RunnerCaptureFormat', 'RunnerInvocationMetadata', 'RunnerInvocationMode', 'RunnerResult', 'RunnerServiceProtocol', 'RunnerTaskRequest', 'RunnerTermination', 'RunnerTextArtifact'] module-attribute
AdapterCapabilities

Bases: BaseModel

Native controls that one runner adapter can honor.

agent_family instance-attribute
supports_final_response_file instance-attribute
supports_native_approval_controls instance-attribute
supports_network_controls = False class-attribute instance-attribute
supports_path_constraints = False class-attribute instance-attribute
supports_read_only instance-attribute
supports_structured_event_stream instance-attribute
supports_workspace_write instance-attribute
DelegatingRunnerService dataclass

Bases: RunnerServiceProtocol

Dispatch runner requests to the matching maintained adapter.

adapters instance-attribute
__init__(adapters)
run(request)

Execute one runner request using the matching adapter.
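
A rough sketch of the dispatch idea, assuming the service selects an adapter by comparing `adapter.agent_family()` with `request.agent_family`. All `*Sketch` names and `EchoAdapter` are hypothetical stand-ins, and raising `LookupError` for an unserved family is an assumption; the real service may report that case differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol, Tuple


class AgentFamilySketch(str, Enum):
    """Mirrors the two documented AgentFamily values."""

    claude_cli = "claude_cli"
    codex_cli = "codex_cli"


@dataclass(frozen=True)
class RequestSketch:
    """Hypothetical stand-in for RunnerTaskRequest (two fields only)."""

    agent_family: AgentFamilySketch
    rendered_task_text: str


class AdapterSketch(Protocol):
    def agent_family(self) -> AgentFamilySketch: ...

    def run(self, request: RequestSketch) -> str: ...


@dataclass(frozen=True)
class EchoAdapter:
    """Toy adapter that echoes the task instead of launching a CLI."""

    family: AgentFamilySketch

    def agent_family(self) -> AgentFamilySketch:
        return self.family

    def run(self, request: RequestSketch) -> str:
        return f"{self.family.value}: {request.rendered_task_text}"


@dataclass(frozen=True)
class DelegatingServiceSketch:
    adapters: Tuple[AdapterSketch, ...]

    def run(self, request: RequestSketch) -> str:
        # Pick the first adapter whose family matches the request.
        for adapter in self.adapters:
            if adapter.agent_family() == request.agent_family:
                return adapter.run(request)
        # Assumed failure mode for an unserved family.
        raise LookupError(f"no adapter for {request.agent_family.value}")


service = DelegatingServiceSketch(
    adapters=(
        EchoAdapter(AgentFamilySketch.claude_cli),
        EchoAdapter(AgentFamilySketch.codex_cli),
    )
)
result = service.run(RequestSketch(AgentFamilySketch.codex_cli, "fix tests"))
```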

RunnerAdapterProtocol

Bases: Protocol

Execute requests for one concrete maintained runner family.

agent_family()

Return the family served by this adapter.

capabilities()

Declare the native controls supported by this adapter.

run(request)

Execute a request for this family.

RunnerCaptureFormat

Bases: str, Enum

Supported normalized capture formats.

ndjson = 'ndjson' class-attribute instance-attribute
text = 'text' class-attribute instance-attribute
RunnerInvocationMetadata

Bases: BaseModel

Canonical normalized invocation metadata returned by adapters.

agent_family instance-attribute
capture_format instance-attribute
command instance-attribute
ended_at instance-attribute
exit_code = None class-attribute instance-attribute
invocation_mode instance-attribute
prompt_reference = None class-attribute instance-attribute
started_at instance-attribute
termination instance-attribute
working_directory instance-attribute
RunnerInvocationMode

Bases: str, Enum

Supported runner invocation modes.

claude_print = 'claude_print' class-attribute instance-attribute
codex_exec = 'codex_exec' class-attribute instance-attribute
RunnerResult

Bases: BaseModel

Kernel-facing runner result.

final_response = None class-attribute instance-attribute
metadata = None class-attribute instance-attribute
termination instance-attribute
transcript = None class-attribute instance-attribute
RunnerServiceProtocol

Bases: Protocol

Execute a maintained runner request.

run(request)

Execute a runner request.

RunnerTaskRequest

Bases: BaseModel

Kernel-facing runner task request.

agent_family instance-attribute
prompt_reference = None class-attribute instance-attribute
rendered_task_text instance-attribute
requested_policy = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
working_directory instance-attribute
RunnerTermination

Bases: str, Enum

Mechanical outcomes exposed to the kernel by maintained runners.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
RunnerTextArtifact

Bases: BaseModel

One normalized text artifact returned by a runner adapter.

content instance-attribute
filename instance-attribute
media_type instance-attribute
adapters

Maintained runner adapters.

__all__ = ['ClaudeCliInvocationMapper', 'ClaudeCliOutputNormalizer', 'ClaudeCliRunnerAdapter', 'CodexCliInvocationMapper', 'CodexCliOutputNormalizer', 'CodexCliRunnerAdapter'] module-attribute
ClaudeCliInvocationMapper dataclass

Build execution requests for maintained Claude CLI runs.

executable instance-attribute
__init__(executable)
map(request)

Map one runner request into a trusted execution request.

ClaudeCliOutputNormalizer dataclass

Normalize Claude CLI execution results into maintained runner results.

__init__()
normalize(*, request, execution_result, command, started_at, ended_at)

Normalize one execution result.

ClaudeCliRunnerAdapter dataclass

Bases: RunnerAdapterProtocol

Execute maintained headless Claude CLI runs.

executable = None class-attribute instance-attribute
execution_service instance-attribute
__init__(execution_service, executable=None)
agent_family()

Return the maintained family served by this adapter.

capabilities()

Return the native capabilities for Claude CLI.

run(request)

Execute one Claude CLI request.

CodexCliInvocationMapper dataclass

Build execution requests for maintained Codex CLI runs.

default_lang = 'en_US.UTF-8' class-attribute instance-attribute
default_shell = '/bin/zsh' class-attribute instance-attribute
default_term = 'xterm-256color' class-attribute instance-attribute
executable instance-attribute
model_name = None class-attribute instance-attribute
response_filename = 'codex-last-message.txt' class-attribute instance-attribute
__init__(executable, model_name=None, response_filename='codex-last-message.txt', default_shell='/bin/zsh', default_term='xterm-256color', default_lang='en_US.UTF-8')
map(request, response_path)

Map one runner request into a trusted execution request.

CodexCliOutputNormalizer dataclass

Normalize Codex CLI execution results into maintained runner results.

__init__()
normalize(*, request, execution_result, response_path, command, started_at, ended_at)

Normalize one execution result.

CodexCliRunnerAdapter dataclass

Bases: RunnerAdapterProtocol

Execute maintained headless Codex CLI runs.

executable = None class-attribute instance-attribute
execution_service instance-attribute
model_name = None class-attribute instance-attribute
output_normalizer = CodexCliOutputNormalizer() class-attribute instance-attribute
__init__(execution_service, executable=None, model_name=None, output_normalizer=CodexCliOutputNormalizer())
agent_family()

Return the maintained family served by this adapter.

capabilities()

Return the native capabilities for Codex CLI.

run(request)

Execute one Codex CLI request.

claude_cli

Claude CLI maintained runner adapter.

ClaudeCliInvocationMapper dataclass

Build execution requests for maintained Claude CLI runs.

executable instance-attribute
__init__(executable)
map(request)

Map one runner request into a trusted execution request.

ClaudeCliOutputNormalizer dataclass

Normalize Claude CLI execution results into maintained runner results.

__init__()
normalize(*, request, execution_result, command, started_at, ended_at)

Normalize one execution result.

ClaudeCliRunnerAdapter dataclass

Bases: RunnerAdapterProtocol

Execute maintained headless Claude CLI runs.

executable = None class-attribute instance-attribute
execution_service instance-attribute
__init__(execution_service, executable=None)
agent_family()

Return the maintained family served by this adapter.

capabilities()

Return the native capabilities for Claude CLI.

run(request)

Execute one Claude CLI request.

codex_cli

Codex CLI maintained runner adapter.

CodexCliInvocationMapper dataclass

Build execution requests for maintained Codex CLI runs.

default_lang = 'en_US.UTF-8' class-attribute instance-attribute
default_shell = '/bin/zsh' class-attribute instance-attribute
default_term = 'xterm-256color' class-attribute instance-attribute
executable instance-attribute
model_name = None class-attribute instance-attribute
response_filename = 'codex-last-message.txt' class-attribute instance-attribute
__init__(executable, model_name=None, response_filename='codex-last-message.txt', default_shell='/bin/zsh', default_term='xterm-256color', default_lang='en_US.UTF-8')
map(request, response_path)

Map one runner request into a trusted execution request.

CodexCliOutputNormalizer dataclass

Normalize Codex CLI execution results into maintained runner results.

__init__()
normalize(*, request, execution_result, response_path, command, started_at, ended_at)

Normalize one execution result.

CodexCliRunnerAdapter dataclass

Bases: RunnerAdapterProtocol

Execute maintained headless Codex CLI runs.

executable = None class-attribute instance-attribute
execution_service instance-attribute
model_name = None class-attribute instance-attribute
output_normalizer = CodexCliOutputNormalizer() class-attribute instance-attribute
__init__(execution_service, executable=None, model_name=None, output_normalizer=CodexCliOutputNormalizer())
agent_family()

Return the maintained family served by this adapter.

capabilities()

Return the native capabilities for Codex CLI.

run(request)

Execute one Codex CLI request.

models

Typed models for maintained runners.

AdapterCapabilities

Bases: BaseModel

Native controls that one runner adapter can honor.

agent_family instance-attribute
supports_final_response_file instance-attribute
supports_native_approval_controls instance-attribute
supports_network_controls = False class-attribute instance-attribute
supports_path_constraints = False class-attribute instance-attribute
supports_read_only instance-attribute
supports_structured_event_stream instance-attribute
supports_workspace_write instance-attribute
RunnerCaptureFormat

Bases: str, Enum

Supported normalized capture formats.

ndjson = 'ndjson' class-attribute instance-attribute
text = 'text' class-attribute instance-attribute
RunnerInvocationMetadata

Bases: BaseModel

Canonical normalized invocation metadata returned by adapters.

agent_family instance-attribute
capture_format instance-attribute
command instance-attribute
ended_at instance-attribute
exit_code = None class-attribute instance-attribute
invocation_mode instance-attribute
prompt_reference = None class-attribute instance-attribute
started_at instance-attribute
termination instance-attribute
working_directory instance-attribute
RunnerInvocationMode

Bases: str, Enum

Supported runner invocation modes.

claude_print = 'claude_print' class-attribute instance-attribute
codex_exec = 'codex_exec' class-attribute instance-attribute
RunnerResult

Bases: BaseModel

Kernel-facing runner result.

final_response = None class-attribute instance-attribute
metadata = None class-attribute instance-attribute
termination instance-attribute
transcript = None class-attribute instance-attribute
RunnerTaskRequest

Bases: BaseModel

Kernel-facing runner task request.

agent_family instance-attribute
prompt_reference = None class-attribute instance-attribute
rendered_task_text instance-attribute
requested_policy = Field(default_factory=RequestedExecutionPolicy) class-attribute instance-attribute
working_directory instance-attribute
RunnerTextArtifact

Bases: BaseModel

One normalized text artifact returned by a runner adapter.

content instance-attribute
filename instance-attribute
media_type instance-attribute
protocols

Protocols for maintained runners.

RunnerAdapterProtocol

Bases: Protocol

Execute requests for one concrete maintained runner family.

agent_family()

Return the family served by this adapter.

capabilities()

Declare the native controls supported by this adapter.

run(request)

Execute a request for this family.

RunnerServiceProtocol

Bases: Protocol

Execute a maintained runner request.

run(request)

Execute a runner request.

service

Delegating maintained runner service.

DelegatingRunnerService dataclass

Bases: RunnerServiceProtocol

Dispatch runner requests to the matching maintained adapter.

adapters instance-attribute
__init__(adapters)
run(request)

Execute one runner request using the matching adapter.

shared_enums

Shared orchestration enums that must not depend on package initializers.

AgentFamily

Bases: str, Enum

Supported maintained runner families.

claude_cli = 'claude_cli' class-attribute instance-attribute
codex_cli = 'codex_cli' class-attribute instance-attribute
RunnerTermination

Bases: str, Enum

Mechanical outcomes exposed to the kernel by maintained runners.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute

spike

Phase 0 protocol layer spike for agent orchestration.

adapters

Adapters for the protocol layer spike.

command_filter

Command filter adapter for the spike.

RegexCommandFilter dataclass

Bases: CommandFilterProtocol

Regex-based command filter.

patterns instance-attribute
__init__(patterns)
evaluate(command)
noop_prompt_handler

No-op prompt handler for headless CLI execution.

NoopPromptHandler dataclass

Bases: PromptHandlerProtocol

Ignore all output as non-interactive in headless runs.

__init__()
handle_output(text)
prompt_handler

Handle confirmation prompts in agent output.

RegexPromptHandler dataclass

Bases: PromptHandlerProtocol

Handle command confirmation prompts using regex parsing.

allow_response instance-attribute
block_response instance-attribute
command_filter instance-attribute
interactive_patterns instance-attribute
parser instance-attribute
__init__(parser, command_filter, interactive_patterns, allow_response, block_response)
handle_output(text)
prompt_parser

Parse command confirmation prompts from agent output.

RegexCommandPromptParser dataclass

Bases: CommandPromptParserProtocol

Regex-based prompt parser.

patterns instance-attribute
__init__(patterns)
parse(text)
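
As an illustration of regex-based prompt parsing, the sketch below applies the default `command_capture_patterns` listed under `SpikePolicyDefaults` and returns the named `command` group from the first pattern that matches. The function name `parse_command`, the use of `re.search`, the `re.IGNORECASE` flag, and the `None` return for non-matching text are all assumptions; the real parser returns a `CommandPromptMatch` and may compile its patterns differently.

```python
import re
from typing import Optional, Sequence

# Default capture patterns as documented for SpikePolicyDefaults.
CAPTURE_PATTERNS = (
    r"command:\s*(?P<command>.+)",
    r"run\s+command:\s*(?P<command>.+)",
    r"execute:\s*(?P<command>.+)",
)


def parse_command(
    text: str, patterns: Sequence[str] = CAPTURE_PATTERNS
) -> Optional[str]:
    """Return the captured command from the first matching pattern."""
    for pattern in patterns:
        # IGNORECASE is an assumption, not confirmed by the reference.
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group("command").strip()
    return None


captured = parse_command("Run command: git status\n")
executed = parse_command("execute: ls -la")
nothing = parse_command("all done")
```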
run_id

Run ID generation for the spike.

TimestampRunIdGenerator dataclass

Bases: RunIdGeneratorProtocol

Timestamp-based run id generator.

format_str = '%Y%m%d-%H%M%S' class-attribute instance-attribute
__init__(format_str='%Y%m%d-%H%M%S')
next_id(*, now)
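
A minimal sketch of a timestamp-based run id, assuming `next_id` simply formats the supplied `now` value with `format_str`. `TimestampRunIdSketch` is a hypothetical stand-in; the real generator may add a suffix or handle collisions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimestampRunIdSketch:
    """Hypothetical stand-in for TimestampRunIdGenerator."""

    format_str: str = "%Y%m%d-%H%M%S"

    def next_id(self, *, now: datetime) -> str:
        # The caller supplies the time, which keeps the id deterministic in tests.
        return now.strftime(self.format_str)


run_id = TimestampRunIdSketch().next_id(now=datetime(2024, 1, 2, 3, 4, 5))
# run_id == "20240102-030405"
```

Passing `now` as a keyword argument, instead of reading the system clock inside the generator, lets callers pair it with any clock implementation.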
models

Domain models for the Phase 0 protocol layer spike.

AgentRunResult

Bases: BaseModel

Outcome of an agent execution.

command_decision = None class-attribute instance-attribute
exit_code = None class-attribute instance-attribute
stderr_text = None class-attribute instance-attribute
stdout_text = None class-attribute instance-attribute
termination_reason instance-attribute
transcript_raw instance-attribute
transcript_text instance-attribute
CommandFilterDecision

Bases: BaseModel

Result of applying the command filter.

blocked instance-attribute
command instance-attribute
matched_pattern = None class-attribute instance-attribute
CommandPromptMatch

Bases: BaseModel

Parsed command confirmation prompt.

command instance-attribute
prompt_text instance-attribute
GitStatusSnapshot

Bases: BaseModel

Snapshot of git status for a workspace.

branch instance-attribute
is_clean instance-attribute
lines = Field(default_factory=list) class-attribute instance-attribute
staged instance-attribute
unstaged instance-attribute
PromptAction

Bases: str, Enum

allow = 'allow' class-attribute instance-attribute
block = 'block' class-attribute instance-attribute
ignore = 'ignore' class-attribute instance-attribute
PromptHandlingOutcome

Bases: BaseModel

Decision for a command confirmation prompt.

action instance-attribute
decision = None class-attribute instance-attribute
response_text = None class-attribute instance-attribute
RunArtifactPaths

Bases: BaseModel

Filesystem paths for run artifacts.

diff_patch instance-attribute
events instance-attribute
git_post instance-attribute
git_pre instance-attribute
response_path instance-attribute
run_metadata instance-attribute
stderr_log instance-attribute
stdout_log instance-attribute
transcript_normalized instance-attribute
transcript_raw instance-attribute
RunEvent

Bases: BaseModel

Single provenance event entry for the spike.

agent = None class-attribute instance-attribute
artifact_paths = Field(default_factory=list) class-attribute instance-attribute
event_type instance-attribute
exit_code = None class-attribute instance-attribute
message = None class-attribute instance-attribute
reason = None class-attribute instance-attribute
run_id instance-attribute
timestamp instance-attribute
work_branch = None class-attribute instance-attribute
RunEventType

Bases: str, Enum

agent_output = 'AGENT_OUTPUT' class-attribute instance-attribute
agent_started = 'AGENT_STARTED' class-attribute instance-attribute
diff_emitted = 'DIFF_EMITTED' class-attribute instance-attribute
heartbeat = 'HEARTBEAT' class-attribute instance-attribute
run_blocked = 'RUN_BLOCKED' class-attribute instance-attribute
run_completed = 'RUN_COMPLETED' class-attribute instance-attribute
run_started = 'RUN_STARTED' class-attribute instance-attribute
workspace_captured_post = 'WORKSPACE_CAPTURED_POST' class-attribute instance-attribute
workspace_captured_pre = 'WORKSPACE_CAPTURED_PRE' class-attribute instance-attribute
RunMetadata

Bases: BaseModel

Metadata for a spike run.

agent instance-attribute
artifact_paths instance-attribute
ended_at instance-attribute
exit_code = None class-attribute instance-attribute
git_post_summary instance-attribute
git_pre_summary instance-attribute
prompt_id = None class-attribute instance-attribute
run_id instance-attribute
started_at instance-attribute
task = None class-attribute instance-attribute
termination_reason instance-attribute
work_branch instance-attribute
SpikeConfig

Bases: BaseModel

Construction-time configuration for the spike service.

runs_root instance-attribute
sandbox_root = None class-attribute instance-attribute
work_branch_prefix instance-attribute
SpikeDefaults dataclass

Default values for spike settings and policy.

allow_response = 'y\n' class-attribute instance-attribute
block_response = 'n\n' class-attribute instance-attribute
default_heartbeat_interval_seconds = 10 class-attribute instance-attribute
default_idle_timeout_seconds = 600 class-attribute instance-attribute
default_output_event_max_chars = 2000 class-attribute instance-attribute
default_timeout_seconds = 600 class-attribute instance-attribute
default_transcript_tail_lines = 200 class-attribute instance-attribute
runs_root = Path('.tnh-gen/runs') class-attribute instance-attribute
work_branch_prefix = 'work' class-attribute instance-attribute
__init__(runs_root=Path('.tnh-gen/runs'), work_branch_prefix='work', default_timeout_seconds=600, default_idle_timeout_seconds=600, default_transcript_tail_lines=200, default_heartbeat_interval_seconds=10, default_output_event_max_chars=2000, allow_response='y\n', block_response='n\n')
SpikeParams

Bases: BaseModel

Per-run parameters for the spike.

agent instance-attribute
heartbeat_interval_seconds = Field(default_factory=(lambda: SpikeDefaults().default_heartbeat_interval_seconds)) class-attribute instance-attribute
idle_timeout_seconds = Field(default_factory=(lambda: SpikeDefaults().default_idle_timeout_seconds)) class-attribute instance-attribute
prompt_id = None class-attribute instance-attribute
response_path = None class-attribute instance-attribute
task = None class-attribute instance-attribute
timeout_seconds = Field(default_factory=(lambda: SpikeDefaults().default_timeout_seconds)) class-attribute instance-attribute
transcript_tail_lines = Field(default_factory=(lambda: SpikeDefaults().default_transcript_tail_lines)) class-attribute instance-attribute
work_branch = None class-attribute instance-attribute
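
The `default_factory` fields above defer their defaults to a fresh `SpikeDefaults()` each time a parameter object is built. The sketch below shows the same pattern with standard-library dataclasses instead of Pydantic. `DefaultsSketch` and `ParamsSketch` are hypothetical stand-ins, and typing `agent` as `str` is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DefaultsSketch:
    """Subset of the documented SpikeDefaults values."""

    default_timeout_seconds: int = 600
    default_heartbeat_interval_seconds: int = 10


@dataclass
class ParamsSketch:
    """Hypothetical stand-in for SpikeParams (three fields only)."""

    agent: str  # assumed type
    # Each factory runs at instantiation time, not at class definition time.
    timeout_seconds: int = field(
        default_factory=lambda: DefaultsSketch().default_timeout_seconds
    )
    heartbeat_interval_seconds: int = field(
        default_factory=lambda: DefaultsSketch().default_heartbeat_interval_seconds
    )


defaults = ParamsSketch(agent="codex")
custom = ParamsSketch(agent="claude", timeout_seconds=30)
```

Explicit arguments override the factory, so `custom.timeout_seconds` is 30 while `defaults.timeout_seconds` falls back to 600.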
SpikePolicy

Bases: BaseModel

Behavioral policies for the spike.

allow_response = Field(default_factory=(lambda: SpikeDefaults().allow_response)) class-attribute instance-attribute
block_response = Field(default_factory=(lambda: SpikeDefaults().block_response)) class-attribute instance-attribute
blocked_command_patterns = Field(default_factory=list) class-attribute instance-attribute
cleanup_on_failure = True class-attribute instance-attribute
command_capture_patterns = Field(default_factory=list) class-attribute instance-attribute
interactive_prompt_patterns = Field(default_factory=list) class-attribute instance-attribute
output_event_max_chars = Field(default_factory=(lambda: SpikeDefaults().default_output_event_max_chars)) class-attribute instance-attribute
SpikePreflightError

Bases: Exception

Raised when preflight checks fail.

SpikeSettings

Bases: BaseSettings

Environment-driven settings for the spike.

model_config = SettingsConfigDict(extra='ignore') class-attribute instance-attribute
runs_root = Field(default_factory=(lambda: SpikeDefaults().runs_root)) class-attribute instance-attribute
sandbox_root = None class-attribute instance-attribute
work_branch_prefix = Field(default_factory=(lambda: SpikeDefaults().work_branch_prefix)) class-attribute instance-attribute
from_env() classmethod

Create settings from environment.

TerminationReason

Bases: str, Enum

command_blocked = 'command_blocked' class-attribute instance-attribute
completed = 'completed' class-attribute instance-attribute
idle_timeout = 'idle_timeout' class-attribute instance-attribute
interactive_prompt_detected = 'interactive_prompt_detected' class-attribute instance-attribute
killed = 'killed' class-attribute instance-attribute
nonzero_exit = 'nonzero_exit' class-attribute instance-attribute
wall_clock_timeout = 'wall_clock_timeout' class-attribute instance-attribute
policy

Policy defaults for the spike.

SpikePolicyDefaults dataclass

Default policy values for the spike.

blocked_command_patterns = ('\\brm\\s+-r(f)?\\b', '\\bgit\\s+reset\\s+--hard\\b', '\\bgit\\s+clean\\s+-fdx?\\b', '\\bgit\\s+checkout\\s+--(\\s|$)', '\\bgit\\s+restore\\s+--(worktree|staged)\\b', '\\bgit\\s+branch\\s+-D\\b', '\\bgit\\s+rebase\\b', '\\bgit\\s+merge\\b', '\\bgit\\s+push\\s+--force(-with-lease)?\\b', '\\bgit\\s+commit\\b', '\\bgit\\s+push\\b', '\\bmv\\b.*(\\s|/)\\.git(/|\\s|$)', '\\bcp\\b.*(\\s|/)\\.git(/|\\s|$)', '\\b(curl|wget|ssh|scp|rsync)\\b', '\\b(pip|poetry|npm|brew)\\b') class-attribute instance-attribute
command_capture_patterns = ('command:\\s*(?P<command>.+)', 'run\\s+command:\\s*(?P<command>.+)', 'execute:\\s*(?P<command>.+)') class-attribute instance-attribute
interactive_prompt_patterns = ('\\bconfirm\\b', '\\bpassword\\b', '\\bpress\\s+enter\\b', '\\b2fa\\b', '\\botp\\b', '\\by\\/n\\b', '\\byes\\/no\\b') class-attribute instance-attribute
__init__(blocked_command_patterns=('\\brm\\s+-r(f)?\\b', '\\bgit\\s+reset\\s+--hard\\b', '\\bgit\\s+clean\\s+-fdx?\\b', '\\bgit\\s+checkout\\s+--(\\s|$)', '\\bgit\\s+restore\\s+--(worktree|staged)\\b', '\\bgit\\s+branch\\s+-D\\b', '\\bgit\\s+rebase\\b', '\\bgit\\s+merge\\b', '\\bgit\\s+push\\s+--force(-with-lease)?\\b', '\\bgit\\s+commit\\b', '\\bgit\\s+push\\b', '\\bmv\\b.*(\\s|/)\\.git(/|\\s|$)', '\\bcp\\b.*(\\s|/)\\.git(/|\\s|$)', '\\b(curl|wget|ssh|scp|rsync)\\b', '\\b(pip|poetry|npm|brew)\\b'), interactive_prompt_patterns=('\\bconfirm\\b', '\\bpassword\\b', '\\bpress\\s+enter\\b', '\\b2fa\\b', '\\botp\\b', '\\by\\/n\\b', '\\byes\\/no\\b'), command_capture_patterns=('command:\\s*(?P<command>.+)', 'run\\s+command:\\s*(?P<command>.+)', 'execute:\\s*(?P<command>.+)'))
default_spike_policy()

Build the default spike policy.
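
To make the effect of the blocked-command patterns concrete, the sketch below checks a few commands against a subset of the defaults listed above. It assumes the filter uses `re.search` with no extra flags and stops at the first matching pattern. `DecisionSketch` mirrors the documented `CommandFilterDecision` fields, and the `evaluate` function is a hypothetical stand-in for `RegexCommandFilter.evaluate`.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# A subset of the documented default blocked_command_patterns.
BLOCKED_PATTERNS = (
    r"\brm\s+-r(f)?\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+push\b",
    r"\b(curl|wget|ssh|scp|rsync)\b",
)


@dataclass(frozen=True)
class DecisionSketch:
    """Mirrors the fields of CommandFilterDecision."""

    command: str
    blocked: bool
    matched_pattern: Optional[str] = None


def evaluate(
    command: str, patterns: Sequence[str] = BLOCKED_PATTERNS
) -> DecisionSketch:
    """Block the command if any pattern matches anywhere in it."""
    for pattern in patterns:
        if re.search(pattern, command):
            return DecisionSketch(
                command=command, blocked=True, matched_pattern=pattern
            )
    return DecisionSketch(command=command, blocked=False)


destructive = evaluate("rm -rf build")
network = evaluate("curl https://example.com")
harmless = evaluate("git status")
```

Because the patterns are searched rather than anchored, a blocked token anywhere in a compound command (for example after `&&`) would also trigger a block under these assumptions.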

protocols

Protocol definitions for the Phase 0 spike.

AgentCommandBuilderProtocol

Bases: Protocol

Build the agent command-line invocation.

build(params)

Build a command for the agent.

AgentRunnerProtocol

Bases: Protocol

Run an agent command and capture output.

run(*, command, timeout_seconds, idle_timeout_seconds, heartbeat_interval_seconds, prompt_handler, on_heartbeat, on_output)

Execute the agent command.

ArtifactWriterProtocol

Bases: Protocol

Persist run artifacts to disk.

ensure_run_dir(run_id)

Ensure the run directory exists and return it.

write_bytes(path, content)

Write bytes content to a file.

write_json(path, payload)

Write JSON content to a file.

write_text(path, content)

Write text content to a file.

ClockProtocol

Bases: Protocol

Abstraction for time sourcing.

now()

Return the current timestamp.
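
One reason to abstract time behind a protocol is that tests can substitute a fixed clock. The sketch below shows that idea with `typing.Protocol`. `ClockSketch`, `SystemClockSketch`, `FixedClockSketch`, and `elapsed_seconds` are hypothetical names, and returning timezone-aware UTC datetimes is an assumption about the real `SystemClock`.

```python
from datetime import datetime, timezone
from typing import Protocol


class ClockSketch(Protocol):
    """Structural type: anything with a matching now() satisfies it."""

    def now(self) -> datetime: ...


class SystemClockSketch:
    def now(self) -> datetime:
        # UTC-aware timestamps are an assumption, not confirmed by the reference.
        return datetime.now(timezone.utc)


class FixedClockSketch:
    """Test double that always reports the same instant."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant


def elapsed_seconds(clock: ClockSketch, started_at: datetime) -> float:
    """Compute elapsed time without touching the real system clock."""
    return (clock.now() - started_at).total_seconds()


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
fixed = FixedClockSketch(datetime(2024, 1, 1, 0, 0, 30, tzinfo=timezone.utc))
elapsed = elapsed_seconds(fixed, start)
```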

CommandFilterProtocol

Bases: Protocol

Evaluate whether a command should be blocked.

evaluate(command)

Return a decision for the provided command.

CommandPromptParserProtocol

Bases: Protocol

Parse command confirmation prompts.

parse(text)

Parse a prompt from text, if present.

EventWriterFactoryProtocol

Bases: Protocol

Create event writers for runs.

create(events_path)

Create an event writer for the given path.

EventWriterProtocol

Bases: Protocol

Write NDJSON event streams.

write_event(event)

Write a single event.

PromptHandlerProtocol

Bases: Protocol

Handle confirmation prompts from agent output.

handle_output(text)

Process output text and return handling instructions.

RunIdGeneratorProtocol

Bases: Protocol

Generate run identifiers.

next_id(*, now)

Return a new run id.

WorkspaceCaptureProtocol

Bases: Protocol

Capture git workspace details.

capture_diff()

Capture a unified diff for the worktree.

capture_status()

Capture a git status snapshot.

checkout_branch(branch_name)

Checkout the specified branch.

create_work_branch(branch_name)

Create and checkout a work branch.

current_branch()

Return the current branch name.

delete_branch(branch_name)

Delete a branch.

repo_root()

Return the repo root path.

reset_hard()

Reset the current worktree to HEAD.

providers

Providers for the protocol layer spike.

artifact_writer

Artifact writer for spike runs.

FileArtifactWriter dataclass

Bases: ArtifactWriterProtocol

Write run artifacts to disk.

runs_root instance-attribute
__init__(runs_root)
ensure_run_dir(run_id)
write_bytes(path, content)
write_json(path, payload)
write_text(path, content)
clock

Clock provider for the spike.

SystemClock dataclass

Bases: ClockProtocol

System clock implementation.

__init__()
now()
command_builder

Command builder for agent invocation.

AgentCommandBuilder dataclass

Bases: AgentCommandBuilderProtocol

Build commands for supported agents.

__init__()
build(params)
event_writer

Event stream writer for the spike.

NdjsonEventWriter dataclass

Bases: EventWriterProtocol

Append events to an NDJSON file.

events_path instance-attribute
__init__(events_path)
write_event(event)
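
NDJSON stores one JSON document per line, which makes appending cheap and lets readers process the stream line by line. The sketch below shows the append step with plain dictionaries. `NdjsonWriterSketch` is a hypothetical stand-in, and the real writer serializes `RunEvent` models, so field ordering and encoding details may differ.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class NdjsonWriterSketch:
    """Hypothetical stand-in for NdjsonEventWriter."""

    events_path: Path

    def write_event(self, event: Mapping[str, Any]) -> None:
        # Ensure the parent directory exists, then append exactly one line.
        self.events_path.parent.mkdir(parents=True, exist_ok=True)
        with self.events_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(dict(event), sort_keys=True) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    events_file = Path(tmp) / "events.ndjson"
    writer = NdjsonWriterSketch(events_file)
    writer.write_event({"event_type": "RUN_STARTED", "run_id": "20240102-030405"})
    writer.write_event({"event_type": "RUN_COMPLETED", "run_id": "20240102-030405"})
    lines = events_file.read_text(encoding="utf-8").splitlines()
    parsed = [json.loads(line) for line in lines]
```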
event_writer_factory

Factory for event writers.

NdjsonEventWriterFactory dataclass

Bases: EventWriterFactoryProtocol

Create NDJSON event writers.

__init__()
create(events_path)
git_workspace

Git workspace capture provider for the spike.

GitWorkspaceCapture dataclass

Bases: WorkspaceCaptureProtocol

Capture git workspace state and manage work branches.

__init__()
capture_diff()
capture_status()
checkout_branch(branch_name)
create_work_branch(branch_name)
current_branch()
delete_branch(branch_name)
repo_root()
reset_hard()
pty_agent_runner

PTY-based agent runner for the spike.

PtyAgentRunner dataclass

Bases: AgentRunnerProtocol

Run agents in a PTY and capture output.

__init__()
run(*, command, timeout_seconds, idle_timeout_seconds, heartbeat_interval_seconds, prompt_handler, on_heartbeat, on_output)
RunnerState dataclass

Mutable state for PTY collection.

decision instance-attribute
last_heartbeat instance-attribute
last_output instance-attribute
output instance-attribute
termination instance-attribute
__init__(output, last_output, last_heartbeat, decision, termination)
subprocess_agent_runner

Subprocess-based agent runner for the spike.

RunnerState dataclass

Mutable state for subprocess collection.

decision instance-attribute
last_heartbeat instance-attribute
last_output instance-attribute
output instance-attribute
stderr instance-attribute
stdout instance-attribute
termination instance-attribute
__init__(output, stdout, stderr, last_output, last_heartbeat, decision, termination)
SubprocessAgentRunner dataclass

Bases: AgentRunnerProtocol

Run agents via subprocess pipes and capture output.

allowed_executables = ('claude', 'codex') class-attribute instance-attribute
__init__(allowed_executables=('claude', 'codex'))
run(*, command, timeout_seconds, idle_timeout_seconds, heartbeat_interval_seconds, prompt_handler, on_heartbeat, on_output)
service

Spike run orchestration service.

RunContext dataclass

Context for a single spike run.

artifact_paths instance-attribute
base_branch instance-attribute
event_writer instance-attribute
run_id instance-attribute
started_at instance-attribute
work_branch instance-attribute
__init__(run_id, started_at, artifact_paths, event_writer, base_branch, work_branch)
SpikeRunService dataclass

Orchestrate a single spike run.

agent_runner instance-attribute
artifact_writer instance-attribute
clock instance-attribute
command_builder instance-attribute
event_writer_factory instance-attribute
prompt_handler instance-attribute
run_id_generator instance-attribute
workspace instance-attribute
__init__(clock, run_id_generator, agent_runner, workspace, artifact_writer, event_writer_factory, command_builder, prompt_handler)
run(params, *, config, policy)

validation

Maintained validation subsystem for agent orchestration.

ValidationSpec = Annotated[BuiltinValidationSpec | HarnessValidationSpec, Field(discriminator='kind')] module-attribute
__all__ = ['BackendFamily', 'BuiltinCommandEntry', 'BuiltinValidationSpec', 'BuiltinValidatorId', 'GeneratedHarnessValidatorId', 'HarnessBackendRegistry', 'HarnessBackendRequest', 'HarnessBackendResult', 'HarnessReport', 'ValidationArtifactMergeError', 'ScriptHarnessBackend', 'HarnessValidationSpec', 'HarnessReportLoader', 'StaticHarnessBackendResolver', 'StaticValidatorResolver', 'ValidationCapturedArtifact', 'ValidationTextArtifact', 'ValidationResult', 'ValidationService', 'ValidationSpec', 'ValidationStepRequest', 'ValidationTermination'] module-attribute
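
`ValidationSpec` is a discriminated union: the `kind` field selects which model a payload is validated against. The standard-library sketch below imitates that selection by hand to show the idea. It does not use Pydantic, the `*Sketch` classes carry only a subset of the real fields, and `parse_spec` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class BuiltinSpecSketch:
    """Stand-in for BuiltinValidationSpec (name plus discriminator)."""

    name: str
    kind: str = "builtin"


@dataclass(frozen=True)
class HarnessSpecSketch:
    """Stand-in for HarnessValidationSpec (discriminator only)."""

    kind: str = "harness"


SpecSketch = Union[BuiltinSpecSketch, HarnessSpecSketch]

# The discriminator value decides which class handles the payload.
_BY_KIND = {"builtin": BuiltinSpecSketch, "harness": HarnessSpecSketch}


def parse_spec(payload: Mapping[str, Any]) -> SpecSketch:
    kind = payload.get("kind")
    if kind not in _BY_KIND:
        raise ValueError(f"unknown validation kind: {kind!r}")
    fields = {key: value for key, value in payload.items() if key != "kind"}
    return _BY_KIND[kind](**fields)


builtin = parse_spec({"kind": "builtin", "name": "lint"})
harness = parse_spec({"kind": "harness"})
```

With a real discriminated union the validation library performs this lookup itself and reports an error that names the unexpected `kind` value.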
BackendFamily

Bases: str, Enum

Maintained harness backend families.

cli = 'cli' class-attribute instance-attribute
script = 'script' class-attribute instance-attribute
web = 'web' class-attribute instance-attribute
BuiltinCommandEntry

Bases: BaseModel

Trusted builtin command mapping.

arguments = Field(default_factory=tuple) class-attribute instance-attribute
environment_policy = Field(default_factory=(ExplicitEnvironmentPolicy.empty)) class-attribute instance-attribute
executable instance-attribute
name instance-attribute
BuiltinValidationSpec

Bases: BaseModel

Kernel-facing builtin validator spec.

kind = 'builtin' class-attribute instance-attribute
name instance-attribute
BuiltinValidatorId

Bases: str, Enum

Trusted builtin validator identifiers.

lint = 'lint' class-attribute instance-attribute
tests = 'tests' class-attribute instance-attribute
typecheck = 'typecheck' class-attribute instance-attribute
GeneratedHarnessValidatorId

Bases: str, Enum

Trusted generated harness validator identifiers.

generated_harness = 'generated_harness' class-attribute instance-attribute
HarnessBackendRegistry dataclass

Resolve maintained backend implementations by family.

script_backend instance-attribute
__init__(script_backend)
resolve(family)

Resolve one backend implementation.

HarnessBackendRequest

Bases: BaseModel

Backend-neutral harness execution request.

arguments = Field(default_factory=tuple) class-attribute instance-attribute
artifact_patterns = Field(default_factory=tuple) class-attribute instance-attribute
backend_family instance-attribute
entrypoint = None class-attribute instance-attribute
environment_policy instance-attribute
executable instance-attribute
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy) class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
working_directory instance-attribute
HarnessBackendResult

Bases: BaseModel

Normalized harness backend result.

captured_artifacts = Field(default_factory=list) class-attribute instance-attribute
harness_report = None class-attribute instance-attribute
stderr_artifact = None class-attribute instance-attribute
stdout_artifact = None class-attribute instance-attribute
termination instance-attribute
HarnessReport

Bases: BaseModel

Minimal harness report needed by the kernel.

proposed_goldens = Field(default_factory=list) class-attribute instance-attribute
HarnessReportLoader dataclass

Load and normalize script harness reports.

report_name = 'harness_report.json' class-attribute instance-attribute
__init__(report_name='harness_report.json')
load(run_directory)

Load a harness report if present.

HarnessValidationSpec

Bases: BaseModel

Kernel-facing generated harness validator spec.

artifacts = Field(default_factory=list) class-attribute instance-attribute
kind = 'harness' class-attribute instance-attribute
may_propose_goldens = False class-attribute instance-attribute
name instance-attribute
timeout_seconds = None class-attribute instance-attribute
ScriptHarnessBackend dataclass

Bases: HarnessBackendProtocol

Execute generated script harnesses via the execution subsystem.

execution_service instance-attribute
report_loader instance-attribute
__init__(execution_service, report_loader)
run(request)

Execute one script harness request.

StaticHarnessBackendResolver dataclass

Bases: HarnessBackendResolverProtocol

Resolve trusted harness validators into backend requests.

harness_report_name = 'harness_report.json' class-attribute instance-attribute
harness_script_name = 'generated_harness.py' class-attribute instance-attribute
__init__(harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
resolve(spec, working_directory)

Resolve one harness validation spec.

StaticValidatorResolver dataclass

Bases: ValidatorResolverProtocol

Resolve trusted builtin validators into execution requests.

entries instance-attribute
__init__(entries)
resolve(spec, working_directory)

Resolve one builtin validation spec.

ValidationArtifactMergeError

Bases: TnhScholarError

Raised when validation artifacts cannot be merged safely.

ValidationCapturedArtifact

Bases: BaseModel

Captured harness artifact awaiting canonical persistence.

media_type = 'application/octet-stream' class-attribute instance-attribute
relative_path instance-attribute
source_path instance-attribute
ValidationResult

Bases: BaseModel

Validation result exposed to the kernel.

captured_artifacts = Field(default_factory=list) class-attribute instance-attribute
harness_report = None class-attribute instance-attribute
stderr_artifact = None class-attribute instance-attribute
stdout_artifact = None class-attribute instance-attribute
termination instance-attribute
ValidationService dataclass

Bases: ValidationServiceProtocol

Execute validation steps using the execution subsystem.

backend_registry instance-attribute
execution_service instance-attribute
harness_resolver instance-attribute
resolver instance-attribute
__init__(resolver, execution_service, harness_resolver, backend_registry)
run(request)

Execute all validators in a step.

ValidationStepRequest

Bases: BaseModel

Kernel-facing validation step request.

validators instance-attribute
working_directory instance-attribute
ValidationTermination

Bases: str, Enum

Validation outcomes exposed to the kernel.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
ValidationTextArtifact

Bases: BaseModel

Normalized text artifact returned by validation execution.

content instance-attribute
filename instance-attribute
media_type instance-attribute
backends

Maintained harness backends for validation.

__all__ = ['HarnessReportLoader', 'ScriptHarnessBackend'] module-attribute
HarnessReportLoader dataclass

Load and normalize script harness reports.

report_name = 'harness_report.json' class-attribute instance-attribute
__init__(report_name='harness_report.json')
load(run_directory)

Load a harness report if present.

ScriptHarnessBackend dataclass

Bases: HarnessBackendProtocol

Execute generated script harnesses via the execution subsystem.

execution_service instance-attribute
report_loader instance-attribute
__init__(execution_service, report_loader)
run(request)

Execute one script harness request.

script

Script-based harness backend.

HarnessReportLoader dataclass

Load and normalize script harness reports.

report_name = 'harness_report.json' class-attribute instance-attribute
__init__(report_name='harness_report.json')
load(run_directory)

Load a harness report if present.
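
The "load if present" behaviour can be pictured with a minimal sketch. It assumes the report is a plain JSON file in the run directory; the function name `load_report_if_present` and the raw-dictionary return value are hypothetical, and the real loader is documented as normalizing the report rather than returning it as-is.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_report_if_present(
    run_directory: Path,
    report_name: str = "harness_report.json",
) -> Optional[Dict[str, Any]]:
    """Return the parsed report, or None when the file does not exist.

    ASSUMPTION: the report is JSON and is returned without normalization.
    """
    report_path = run_directory / report_name
    if not report_path.is_file():
        # A missing report is not treated as an error in this sketch.
        return None
    with report_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Returning None for a missing file keeps "no report was written" distinct from "a report was written but is malformed", which would still raise here.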

ScriptHarnessBackend dataclass

Bases: HarnessBackendProtocol

Execute generated script harnesses via the execution subsystem.

execution_service instance-attribute
report_loader instance-attribute
__init__(execution_service, report_loader)
run(request)

Execute one script harness request.

errors

Domain errors for the validation subsystem.

ValidationArtifactMergeError

Bases: TnhScholarError

Raised when validation artifacts cannot be merged safely.

models

Typed models for the validation subsystem.

ValidationSpec = Annotated[BuiltinValidationSpec | HarnessValidationSpec, Field(discriminator='kind')] module-attribute
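
As an illustration of what the `kind` discriminator does, the following sketch dispatches on that tag using plain dataclasses. It is not the package's implementation (which relies on Pydantic); `BuiltinSpecSketch`, `HarnessSpecSketch`, and `parse_spec` are hypothetical stand-ins that copy only a few of the documented fields.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class BuiltinSpecSketch:
    """Hypothetical stand-in for BuiltinValidationSpec."""
    name: str
    kind: str = "builtin"


@dataclass(frozen=True)
class HarnessSpecSketch:
    """Hypothetical stand-in for HarnessValidationSpec."""
    name: str
    kind: str = "harness"
    may_propose_goldens: bool = False


SpecSketch = Union[BuiltinSpecSketch, HarnessSpecSketch]

# The discriminator value selects exactly one concrete class.
_SPEC_BY_KIND = {"builtin": BuiltinSpecSketch, "harness": HarnessSpecSketch}


def parse_spec(payload: Mapping[str, Any]) -> SpecSketch:
    """Pick the concrete spec class from the 'kind' tag, then build it."""
    kind = payload.get("kind")
    try:
        spec_cls = _SPEC_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"unknown validation spec kind: {kind!r}") from None
    fields = {key: value for key, value in payload.items() if key != "kind"}
    return spec_cls(**fields)
```

Because the tag is inspected first, a payload is validated against one class only, so an unknown `kind` fails immediately with a single clear error.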
BackendFamily

Bases: str, Enum

Maintained harness backend families.

cli = 'cli' class-attribute instance-attribute
script = 'script' class-attribute instance-attribute
web = 'web' class-attribute instance-attribute
BuiltinValidationSpec

Bases: BaseModel

Kernel-facing builtin validator spec.

kind = 'builtin' class-attribute instance-attribute
name instance-attribute
BuiltinValidatorId

Bases: str, Enum

Trusted builtin validator identifiers.

lint = 'lint' class-attribute instance-attribute
tests = 'tests' class-attribute instance-attribute
typecheck = 'typecheck' class-attribute instance-attribute
GeneratedHarnessValidatorId

Bases: str, Enum

Trusted generated harness validator identifiers.

generated_harness = 'generated_harness' class-attribute instance-attribute
HarnessBackendRequest

Bases: BaseModel

Backend-neutral harness execution request.

arguments = Field(default_factory=tuple) class-attribute instance-attribute
artifact_patterns = Field(default_factory=tuple) class-attribute instance-attribute
backend_family instance-attribute
entrypoint = None class-attribute instance-attribute
environment_policy instance-attribute
executable instance-attribute
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy) class-attribute instance-attribute
timeout_seconds = None class-attribute instance-attribute
working_directory instance-attribute
HarnessBackendResult

Bases: BaseModel

Normalized harness backend result.

captured_artifacts = Field(default_factory=list) class-attribute instance-attribute
harness_report = None class-attribute instance-attribute
stderr_artifact = None class-attribute instance-attribute
stdout_artifact = None class-attribute instance-attribute
termination instance-attribute
HarnessReport

Bases: BaseModel

Minimal harness report needed by the kernel.

proposed_goldens = Field(default_factory=list) class-attribute instance-attribute
HarnessValidationSpec

Bases: BaseModel

Kernel-facing generated harness validator spec.

artifacts = Field(default_factory=list) class-attribute instance-attribute
kind = 'harness' class-attribute instance-attribute
may_propose_goldens = False class-attribute instance-attribute
name instance-attribute
timeout_seconds = None class-attribute instance-attribute
ValidationCapturedArtifact

Bases: BaseModel

Captured harness artifact awaiting canonical persistence.

media_type = 'application/octet-stream' class-attribute instance-attribute
relative_path instance-attribute
source_path instance-attribute
ValidationResult

Bases: BaseModel

Validation result exposed to the kernel.

captured_artifacts = Field(default_factory=list) class-attribute instance-attribute
harness_report = None class-attribute instance-attribute
stderr_artifact = None class-attribute instance-attribute
stdout_artifact = None class-attribute instance-attribute
termination instance-attribute
ValidationStepRequest

Bases: BaseModel

Kernel-facing validation step request.

validators instance-attribute
working_directory instance-attribute
ValidationTermination

Bases: str, Enum

Validation outcomes exposed to the kernel.

completed = 'completed' class-attribute instance-attribute
error = 'error' class-attribute instance-attribute
killed_idle = 'killed_idle' class-attribute instance-attribute
killed_policy = 'killed_policy' class-attribute instance-attribute
killed_timeout = 'killed_timeout' class-attribute instance-attribute
ValidationTextArtifact

Bases: BaseModel

Normalized text artifact returned by validation execution.

content instance-attribute
filename instance-attribute
media_type instance-attribute
protocols

Protocols for the validation subsystem.

HarnessBackendProtocol

Bases: Protocol

Execute one normalized harness backend request.

run(request)

Execute one harness request and normalize outputs.

HarnessBackendRegistryProtocol

Bases: Protocol

Resolve one backend implementation for a harness family.

resolve(family)

Return the backend implementation for one harness family.

HarnessBackendResolverProtocol

Bases: Protocol

Resolve harness validators into backend requests.

resolve(spec, working_directory)

Resolve one harness validator into a trusted backend request.

ValidationServiceProtocol

Bases: Protocol

Execute a validation step.

run(request)

Execute validators and return normalized result.

ValidatorResolverProtocol

Bases: Protocol

Resolve builtin validators into execution requests.

resolve(spec, working_directory)

Resolve one builtin validator into a trusted execution request.

service

Validation service built on the execution subsystem.

BuiltinCommandEntry

Bases: BaseModel

Trusted builtin command mapping.

arguments = Field(default_factory=tuple) class-attribute instance-attribute
environment_policy = Field(default_factory=(ExplicitEnvironmentPolicy.empty)) class-attribute instance-attribute
executable instance-attribute
name instance-attribute
HarnessBackendRegistry dataclass

Resolve maintained backend implementations by family.

script_backend instance-attribute
__init__(script_backend)
resolve(family)

Resolve one backend implementation.

StaticHarnessBackendResolver dataclass

Bases: HarnessBackendResolverProtocol

Resolve trusted harness validators into backend requests.

harness_report_name = 'harness_report.json' class-attribute instance-attribute
harness_script_name = 'generated_harness.py' class-attribute instance-attribute
__init__(harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
resolve(spec, working_directory)

Resolve one harness validation spec.

StaticValidatorResolver dataclass

Bases: ValidatorResolverProtocol

Resolve trusted builtin validators into execution requests.

entries instance-attribute
__init__(entries)
resolve(spec, working_directory)

Resolve one builtin validation spec.

ValidationService dataclass

Bases: ValidationServiceProtocol

Execute validation steps using the execution subsystem.

backend_registry instance-attribute
execution_service instance-attribute
harness_resolver instance-attribute
resolver instance-attribute
__init__(resolver, execution_service, harness_resolver, backend_registry)
run(request)

Execute all validators in a step.

termination

Shared validation termination helpers.

merge_validation_termination(current, new_value)

Keep the more severe validation termination.

to_validation_termination(termination)

Map subprocess termination into validation termination.

validation_termination_rank(value)

Rank validation terminations by severity.
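
One way to read these three helpers together is as a rank lookup plus a "keep the worse outcome" merge. The sketch below mirrors the documented ValidationTermination values, but the severity order in `_RANK` is an assumption made for illustration, and `TerminationSketch`, `rank`, and `merge` are hypothetical names.

```python
from enum import Enum


class TerminationSketch(str, Enum):
    """Mirrors the documented ValidationTermination values."""
    completed = "completed"
    error = "error"
    killed_idle = "killed_idle"
    killed_policy = "killed_policy"
    killed_timeout = "killed_timeout"


# ASSUMPTION: this severity order is illustrative, not the package's ranking.
_RANK = {
    TerminationSketch.completed: 0,
    TerminationSketch.error: 1,
    TerminationSketch.killed_idle: 2,
    TerminationSketch.killed_timeout: 3,
    TerminationSketch.killed_policy: 4,
}


def rank(value: TerminationSketch) -> int:
    """Return the severity rank; a higher number is more severe."""
    return _RANK[value]


def merge(current: TerminationSketch, new_value: TerminationSketch) -> TerminationSketch:
    """Keep whichever termination is more severe; ties keep the current one."""
    return new_value if rank(new_value) > rank(current) else current
```

Folding `merge` over the results of several validators yields the single worst outcome for the whole step.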

workspace

Workspace subsystem for agent orchestration.

__all__ = ['GitWorktreeWorkspaceService', 'NullWorkspaceService', 'RollbackTarget', 'WorkspaceContext', 'WorkspaceSnapshot'] module-attribute
GitWorktreeWorkspaceService dataclass

Bases: WorkspaceServiceProtocol

Manage one conductor-owned git worktree for a workflow run.

base_ref = 'HEAD' class-attribute instance-attribute
branch_prefix = 'tnh/run-' class-attribute instance-attribute
current_context_value = field(default=None, init=False) class-attribute instance-attribute
repo_root instance-attribute
workspace_root instance-attribute
__init__(repo_root, workspace_root, base_ref='HEAD', branch_prefix='tnh/run-')
current_context()

Return the active managed workspace context.

diff_summary()

Return the normalized diff for the active worktree.

planned_worktree_path(run_id)

Return the managed worktree path for one run.

prepare_pre_run(run_id)

Create the managed worktree and record its base state.

rollback_pre_run()

Discard and recreate the managed worktree at the recorded base state.

snapshot()

Return the current semantic snapshot for the active worktree.
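
For orientation, a worktree-per-run lifecycle could map onto git commands roughly as follows. This is a sketch under stated assumptions, not the service's code: the layout `workspace_root / run_id` and the helper names are hypothetical, and only the standard `git worktree add -b` and `git worktree remove --force` invocations are used.

```python
from pathlib import Path
from typing import List


def planned_path(workspace_root: Path, run_id: str) -> Path:
    # ASSUMPTION: one directory per run directly under workspace_root.
    return workspace_root / run_id


def worktree_add_command(
    repo_root: Path,
    workspace_root: Path,
    run_id: str,
    base_ref: str = "HEAD",
    branch_prefix: str = "tnh/run-",
) -> List[str]:
    """Build the command that creates a worktree on a fresh run branch."""
    return [
        "git", "-C", str(repo_root),
        "worktree", "add",
        "-b", f"{branch_prefix}{run_id}",
        str(planned_path(workspace_root, run_id)),
        base_ref,
    ]


def worktree_remove_command(repo_root: Path, workspace_root: Path, run_id: str) -> List[str]:
    """Build the command that discards the worktree, including local changes."""
    return [
        "git", "-C", str(repo_root),
        "worktree", "remove", "--force",
        str(planned_path(workspace_root, run_id)),
    ]
```

The functions only build argument lists, so they can be inspected or logged before anything is passed to `subprocess.run`. A rollback in this picture would be a remove followed by an add at the recorded base commit; the branch created by the first add would also need handling, which this sketch leaves out.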

NullWorkspaceService dataclass

Bases: WorkspaceServiceProtocol

Workspace service for tests and explicit non-operational contexts.

repo_root instance-attribute
__init__(repo_root)
current_context()

Return no managed workspace context.

diff_summary()

Return a stable empty diff summary.

planned_worktree_path(run_id)

Return no managed worktree path.

prepare_pre_run(run_id)

Return a stable no-op workspace context.

rollback_pre_run()

Return the stable no-op workspace context.

snapshot()

Return an empty semantic snapshot.

RollbackTarget

Bases: str, Enum

Supported rollback targets.

pre_run = 'pre_run' class-attribute instance-attribute
WorkspaceContext

Bases: BaseModel

Managed workspace identity for one mutable run.

base_ref instance-attribute
base_sha instance-attribute
branch_name instance-attribute
created_at = None class-attribute instance-attribute
head_sha = None class-attribute instance-attribute
repo_root instance-attribute
run_id = None class-attribute instance-attribute
worktree_path instance-attribute
WorkspaceSnapshot

Bases: BaseModel

Semantic snapshot of workspace state.

base_ref = None class-attribute instance-attribute
base_sha = None class-attribute instance-attribute
branch_name = None class-attribute instance-attribute
diff_summary = None class-attribute instance-attribute
head_sha = None class-attribute instance-attribute
is_dirty = False class-attribute instance-attribute
repo_root instance-attribute
staged_count = 0 class-attribute instance-attribute
unstaged_count = 0 class-attribute instance-attribute
worktree_path = None class-attribute instance-attribute
models

Typed models for workspace operations.

RollbackTarget

Bases: str, Enum

Supported rollback targets.

pre_run = 'pre_run' class-attribute instance-attribute
WorkspaceContext

Bases: BaseModel

Managed workspace identity for one mutable run.

base_ref instance-attribute
base_sha instance-attribute
branch_name instance-attribute
created_at = None class-attribute instance-attribute
head_sha = None class-attribute instance-attribute
repo_root instance-attribute
run_id = None class-attribute instance-attribute
worktree_path instance-attribute
WorkspaceSnapshot

Bases: BaseModel

Semantic snapshot of workspace state.

base_ref = None class-attribute instance-attribute
base_sha = None class-attribute instance-attribute
branch_name = None class-attribute instance-attribute
diff_summary = None class-attribute instance-attribute
head_sha = None class-attribute instance-attribute
is_dirty = False class-attribute instance-attribute
repo_root instance-attribute
staged_count = 0 class-attribute instance-attribute
unstaged_count = 0 class-attribute instance-attribute
worktree_path = None class-attribute instance-attribute
protocols

Protocols for workspace operations.

WorkspaceServiceProtocol

Bases: Protocol

Workspace safety and rollback operations.

current_context()

Return the current managed workspace context, if any.

diff_summary()

Return normalized diff summary.

planned_worktree_path(run_id)

Return the planned worktree path for one run, if managed.

prepare_pre_run(run_id)

Create and persist the pre-run workspace context.

rollback_pre_run()

Rollback to the pre-run state and return updated context.

snapshot()

Return current semantic snapshot.

service

Workspace services.

GitWorktreeWorkspaceService dataclass

Bases: WorkspaceServiceProtocol

Manage one conductor-owned git worktree for a workflow run.

base_ref = 'HEAD' class-attribute instance-attribute
branch_prefix = 'tnh/run-' class-attribute instance-attribute
current_context_value = field(default=None, init=False) class-attribute instance-attribute
repo_root instance-attribute
workspace_root instance-attribute
__init__(repo_root, workspace_root, base_ref='HEAD', branch_prefix='tnh/run-')
current_context()

Return the active managed workspace context.

diff_summary()

Return the normalized diff for the active worktree.

planned_worktree_path(run_id)

Return the managed worktree path for one run.

prepare_pre_run(run_id)

Create the managed worktree and record its base state.

rollback_pre_run()

Discard and recreate the managed worktree at the recorded base state.

snapshot()

Return the current semantic snapshot for the active worktree.

NullWorkspaceService dataclass

Bases: WorkspaceServiceProtocol

Workspace service for tests and explicit non-operational contexts.

repo_root instance-attribute
__init__(repo_root)
current_context()

Return no managed workspace context.

diff_summary()

Return a stable empty diff summary.

planned_worktree_path(run_id)

Return no managed worktree path.

prepare_pre_run(run_id)

Return a stable no-op workspace context.

rollback_pre_run()

Return the stable no-op workspace context.

snapshot()

Return an empty semantic snapshot.

ai_text_processing

Public surface for tnh_scholar.ai_text_processing.

Historically this module eagerly imported multiple submodules with heavy dependencies (e.g., audio codecs, ML toolkits) which made importing lightweight components such as Prompt surprisingly expensive and brittle in test environments. We now lazily import the concrete implementations on demand so that callers can depend on just the pieces they need.

__all__ = ['OpenAIProcessor', 'SectionParser', 'SectionProcessor', 'find_sections', 'process_text', 'process_text_by_paragraphs', 'process_text_by_sections', 'get_pattern', 'translate_text_by_lines', 'openai_process_text', 'GitBackedRepository', 'LocalPromptManager', 'Prompt', 'PromptCatalog', 'AIResponse', 'LogicalSection', 'SectionEntry', 'TextObject', 'TextObjectInfo'] module-attribute

AIResponse

Bases: BaseModel

Class for dividing large texts into AI-processable segments while maintaining broader document context.

document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format') class-attribute instance-attribute
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose") class-attribute instance-attribute
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text') class-attribute instance-attribute
language = Field(..., description='ISO 639-1 language code') class-attribute instance-attribute
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole') class-attribute instance-attribute
sections instance-attribute

GitBackedRepository

Manages versioned storage of prompts using Git.

Provides basic Git operations while hiding complexity:
  • Automatic versioning of changes
  • Basic conflict resolution
  • History tracking

repo = Repo(repo_path) instance-attribute
repo_path = repo_path instance-attribute
__init__(repo_path)

Initialize or connect to Git repository.

Parameters:

Name Type Description Default
repo_path Path

Path to repository directory

required

Raises:

Type Description
GitCommandError

If Git operations fail

display_history(file_path, max_versions=0)

Display history of changes for a file with diffs between versions.

Shows most recent changes first, limited to max_versions entries. For each change shows:
  • Commit info and date
  • Stats summary of changes
  • Detailed color diff with 2 lines of context

Parameters:

Name Type Description Default
file_path Path

Path to file in repository

required
max_versions int

Maximum number of versions to show; zero shows all revisions.

0
Example

>>> repo.display_history(Path("prompts/format_dharma_talk.yaml"))
Commit abc123def (2024-12-28 14:30:22):
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/prompts/format_dharma_talk.yaml ...
...

update_file(file_path)

Stage and commit changes to a file in the Git repository.

Parameters:

Name Type Description Default
file_path Path

Absolute or relative path to the file.

required

Returns:

Name Type Description
str str

Commit hash if changes were made.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the file is outside the repository.

GitCommandError

If Git operations fail.
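
The two path-related errors listed above can be illustrated with a small pre-commit check. This is a sketch, not the method's actual code: `repo_relative_path` is a hypothetical helper, and resolving relative paths against the repository root (rather than the current working directory) is an assumption.

```python
from pathlib import Path


def repo_relative_path(repo_path: Path, file_path: Path) -> Path:
    """Validate file_path and return it relative to the repository root.

    Raises FileNotFoundError if the file is missing and ValueError if it
    lies outside the repository.
    """
    root = repo_path.resolve()
    # ASSUMPTION: relative paths are interpreted relative to the repo root.
    candidate = file_path if file_path.is_absolute() else root / file_path
    candidate = candidate.resolve()
    if not candidate.exists():
        raise FileNotFoundError(f"File does not exist: {candidate}")
    try:
        return candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"File {candidate} is outside repository {root}") from None
```

Resolving both paths first means symlinks and `..` segments cannot make an outside file look as if it were inside the repository.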

LocalPromptManager

A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.

This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.

Attributes:

Name Type Description
_instance Optional[SingletonPromptManager]

The singleton instance

_prompt_manager Optional[PromptManager]

The wrapped PromptManager instance

prompt_manager property

Lazy initialization of the PromptManager instance.

Returns:

Name Type Description
PromptManager PromptCatalog

The wrapped PromptManager instance

Raises:

Type Description
RuntimeError

If PATTERN_REPO is not properly configured

__new__()

Create or return the singleton instance.

Returns:

Name Type Description
SingletonPromptManager LocalPromptManager

The singleton instance

get_prompt(name)

Get a prompt by name.
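
The combination of a singleton `__new__` and a lazily initialized property can be sketched as follows. The class names `CatalogSketch` and `SingletonManagerSketch` and the `"prompts"` path are hypothetical; the real class wraps the package's own catalog and reads its location from configuration.

```python
from typing import Optional


class CatalogSketch:
    """Hypothetical stand-in for the wrapped prompt catalog."""

    def __init__(self, base_path: str) -> None:
        self.base_path = base_path


class SingletonManagerSketch:
    """One shared instance; the wrapped catalog is built on first use."""

    _instance: Optional["SingletonManagerSketch"] = None
    _catalog: Optional[CatalogSketch] = None

    def __new__(cls) -> "SingletonManagerSketch":
        # Create the instance once, then hand back the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    @property
    def catalog(self) -> CatalogSketch:
        # Lazy initialization: nothing is constructed until first access.
        if self._catalog is None:
            self._catalog = CatalogSketch(base_path="prompts")  # hypothetical path
        return self._catalog
```

Every call to `SingletonManagerSketch()` returns the same object, so the potentially expensive catalog setup happens at most once per process.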

LogicalSection

Bases: BaseModel

Represents a contextually meaningful segment of a larger text.

Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.

start_line = Field(..., description='Starting line number that begins this logical segment') class-attribute instance-attribute
title = Field(..., description="Descriptive title of section's key content") class-attribute instance-attribute

OpenAIProcessor

Bases: TextProcessor

OpenAI-based text processor implementation.

max_tokens = max_tokens instance-attribute
model = model instance-attribute
__init__(model=None, max_tokens=0)
process_text(input_str, instructions, response_format=None, max_tokens=0, **kwargs)

Process text using OpenAI API with optional structured output.

Prompt

Base Prompt class for version-controlled template prompts.

Prompts contain:
  • Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file.
  • Template fields: Default values for template variables
  • Metadata: Name and identifier information

Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.

Attributes:

Name Type Description
name str

The name of the prompt

instructions str

The Jinja2 template string for this prompt

default_template_fields Dict[str, str]

Default values for template variables

_allow_empty_vars bool

Whether to allow undefined template variables

_env Environment

Configured Jinja2 environment instance

default_template_fields = default_template_fields or {} instance-attribute
instructions = instructions instance-attribute
name = name instance-attribute
path = path instance-attribute
__eq__(other)

Compare prompts based on their content.

__hash__()

Hash based on content hash for container operations.

__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)

Initialize a new Prompt instance.

Parameters:

Name Type Description Default
name str

Unique name identifying the prompt

required
instructions MarkdownStr

Jinja2 template string containing the prompt

required
default_template_fields Optional[Dict[str, str]]

Optional default values for template variables

None
allow_empty_vars bool

Whether to allow undefined template variables

False

Raises:

Type Description
ValueError

If name or instructions are empty

TemplateError

If template syntax is invalid

apply_template(field_values=None)

Apply template values to prompt instructions using Jinja2.

Values precedence (highest to lowest):
  1. field_values (explicitly passed)
  2. frontmatter values (from prompt file)
  3. default_template_fields (prompt defaults)

Parameters:

Name Type Description Default
field_values Optional[Dict[str, str]]

Values to substitute into the template. If None, uses frontmatter/defaults.

None

Returns:

Name Type Description
str str

Rendered instructions with template values applied.

Raises:

Type Description
TemplateError

If template rendering fails

ValueError

If required template variables are missing
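
The precedence order documented for apply_template can be pictured as a layered dictionary merge in which later layers win. The helper `merge_template_values` is hypothetical and leaves out the Jinja2 rendering step entirely; it only shows how the three sources might be combined.

```python
from typing import Dict, Optional


def merge_template_values(
    defaults: Dict[str, str],
    frontmatter: Dict[str, str],
    field_values: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Combine value sources from lowest to highest precedence."""
    merged = dict(defaults)            # 3. prompt defaults (lowest)
    merged.update(frontmatter)         # 2. frontmatter overrides defaults
    merged.update(field_values or {})  # 1. explicit values override everything
    return merged
```

For example, with a default `lang` of `"en"`, a frontmatter `lang` of `"vi"`, and an explicit `lang` of `"fr"`, the merged value is `"fr"`; dropping the explicit value falls back to `"vi"`.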

content_hash()

Generate a SHA-256 hash of the prompt content.

Useful for quick content comparison and change detection.

Returns:

Name Type Description
str str

Hexadecimal string of the SHA-256 hash
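
A content hash of this kind takes only a few lines with `hashlib`. Which fields the real method feeds into the hash is not stated here, so hashing the UTF-8 encoded instructions alone is an assumption, and `content_hash_sketch` is a hypothetical name.

```python
import hashlib


def content_hash_sketch(instructions: str) -> str:
    """Return the SHA-256 hex digest of the prompt instructions.

    ASSUMPTION: only the instructions text contributes to the hash.
    """
    return hashlib.sha256(instructions.encode("utf-8")).hexdigest()
```

Two prompts with identical text produce identical digests, so change detection reduces to comparing two 64-character strings.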

extract_frontmatter()

Extract and validate YAML frontmatter from markdown instructions.

Returns:

Type Description
Optional[Dict[str, Any]]

Optional[Dict]: Frontmatter data if found and valid, None otherwise

Note

Frontmatter must be at the very start of the file and properly formatted.

from_dict(data) classmethod

Create prompt instance from dictionary data.

Parameters:

Name Type Description Default
data Dict[str, Any]

Dictionary containing prompt data

required

Returns:

Name Type Description
Prompt Prompt

New prompt instance

Raises:

Type Description
ValueError

If required fields are missing

get_content_without_frontmatter()

Get markdown content with frontmatter removed.

Returns:

Name Type Description
str str

Markdown content without frontmatter
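
The requirement that frontmatter sit at the very start of the file can be illustrated by a splitter that separates the raw frontmatter block from the markdown body. This is a sketch with a hypothetical `split_frontmatter` helper; it does not parse or validate the YAML, which the real methods are documented as doing.

```python
import re
from typing import Optional, Tuple

# Frontmatter: a '---' line at the very start, content, then a closing '---' line.
_FRONTMATTER = re.compile(
    r"\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def split_frontmatter(markdown: str) -> Tuple[Optional[str], str]:
    """Return (raw frontmatter or None, remaining markdown body)."""
    match = _FRONTMATTER.match(markdown)
    if match is None:
        # No block at the very start: treat the whole text as body.
        return None, markdown
    return match.group(1), markdown[match.end():]
```

Anchoring the pattern with `\A` is what enforces the "very start" rule: a `---` block further down the file is left untouched as ordinary markdown.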

source_bytes()

Best-effort raw bytes for prompt hashing.

Prefers hashing exact on-disk bytes including front-matter. We therefore first try to read from prompt_path. If that fails, we fall back to hashing the concatenation of known templates. In V1, only the instructions (system template) are used for rendering.

to_dict()

Convert prompt to dictionary for serialization.

Returns:

Type Description
Dict[str, Any]

Dict containing all prompt data in serializable format

update_frontmatter(new_data)

Update or add frontmatter to the markdown content.

Parameters:

Name Type Description Default
new_data Dict[str, Any]

Dictionary of frontmatter fields to update

required

PromptCatalog

Main interface for prompt management system.

Provides high-level operations:
  • Prompt creation and loading
  • Automatic versioning
  • Safe concurrent access
  • Basic history tracking
  • Case-insensitive prompt names (stored as lowercase)

access_manager = ConcurrentAccessManager(self.base_path / '.locks') instance-attribute
base_path = Path(base_path).resolve() instance-attribute
repo = GitBackedRepository(self.base_path) instance-attribute
__init__(base_path)

Initialize prompt management system.

Parameters:

Name Type Description Default
base_path Path

Base directory for prompt storage

required
get_path(prompt_name)

Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.

Parameters:

Name Type Description Default
prompt_name str

prompt name (without extension) to search for

required

Returns:

Type Description
Optional[Path]

Optional[Path]: Full path to the found prompt file, or None if not found
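
A recursive, case-insensitive lookup like this can be sketched with `Path.rglob`. The `.md` extension follows the description of `load`; returning the first match in sorted order is an assumption, and `find_prompt_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_prompt_path(base_path: Path, prompt_name: str) -> Optional[Path]:
    """Search base_path and all subdirectories for '<prompt_name>.md'."""
    wanted = prompt_name.lower()
    # Sorting makes the result deterministic if several files match.
    for candidate in sorted(base_path.rglob("*.md")):
        if candidate.is_file() and candidate.stem.lower() == wanted:
            return candidate
    return None
```

Comparing lowercased stems means `Format_Talk.md` is found for the name `format_talk`, regardless of how the file was capitalized on disk.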

load(prompt_name)

Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.

Parameters:

Name Type Description Default
prompt_name str

Name of the prompt (without .md extension).

required

Returns:

Type Description
Prompt

A new Prompt object whose 'instructions' is the file's text and whose 'template_fields' are inferred from placeholders in those instructions.

save(prompt, subdir=None)
show_history(prompt_name)
verify_repository(base_path) classmethod

Verify repository integrity and uniqueness of prompt names.

Performs the following checks:
  1. Validates Git repository structure.
  2. Ensures no duplicate prompt names exist.

Parameters:

Name Type Description Default
base_path Path

Repository path to verify.

required

Returns:

Name Type Description
bool bool

True if the repository is valid and contains no duplicate prompt files.

SectionEntry

Bases: NamedTuple

Represents a section with its content during iteration.

content instance-attribute
number instance-attribute
range instance-attribute
title instance-attribute

SectionParser

Generates structured section breakdowns of text content.

review_count = review_count instance-attribute
section_pattern = section_pattern instance-attribute
section_scanner = section_scanner instance-attribute
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)

Initialize section generator.

Parameters:

Name Type Description Default
section_scanner TextProcessor

Text processor used to extract sections

required
section_pattern Prompt

Pattern object containing section generation instructions

required
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)

Generate section breakdown of input text. The text must be split up by newlines.

Parameters:

Name Type Description Default
text TextObject

Input TextObject to process

required
section_count_target Optional[int]

the target for the number of sections to find

None
segment_size_target Optional[int]

the target number of lines per section (if section_count_target is specified, this value is set from it so that the text divides into that many segments)

None
template_dict Optional[Dict[str, str]]

Optional additional template variables

None

Returns:

Type Description
TextObject

TextObject containing section breakdown

SectionProcessor

Handles section-based XML text processing with configurable output handling.

pattern = pattern instance-attribute
processor = processor instance-attribute
template_dict = template_dict instance-attribute
wrap_in_document = wrap_in_document instance-attribute
__init__(processor, pattern, template_dict, wrap_in_document=True)

Initialize the XML section processor.

Parameters:

Name Type Description Default
processor TextProcessor

Implementation of TextProcessor to use

required
pattern Prompt

Pattern object containing processing instructions

required
template_dict Dict

Dictionary for template substitution

required
wrap_in_document bool

Whether to wrap output in tags

True
process_paragraphs(text)

Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be separated by newlines.

Parameters:

Name Type Description Default
text TextObject

TextObject to process

required

Yields:

Name Type Description
ProcessedSection ProcessedSection

One processed paragraph at a time, containing:
  • title: Paragraph number (e.g., 'Paragraph 1')
  • original_str: Raw paragraph text
  • processed_str: Processed paragraph text
  • metadata: Optional metadata dict
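The paragraph-by-paragraph flow described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `ParagraphResult`, `iter_paragraphs`, and the `process` callable are hypothetical stand-ins, and skipping blank lines is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator


@dataclass
class ParagraphResult:
    """Hypothetical stand-in mirroring the documented ProcessedSection fields."""
    title: str
    original_str: str
    processed_str: str
    metadata: Dict = field(default_factory=dict)


def iter_paragraphs(text: str, process: Callable[[str], str]) -> Iterator[ParagraphResult]:
    """Yield one result per non-empty, newline-separated paragraph."""
    number = 0
    for paragraph in text.split("\n"):
        if not paragraph.strip():
            continue  # assumption: blank lines are skipped
        number += 1
        yield ParagraphResult(
            title=f"Paragraph {number}",
            original_str=paragraph,
            processed_str=process(paragraph),
        )


# str.upper stands in for a real text processor.
results = list(iter_paragraphs("first line\n\nsecond line", str.upper))
```

Because the function is a generator, a caller can write each processed paragraph to disk as it arrives instead of waiting for the whole text.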

process_sections(text_object)

Process transcript sections and yield results one section at a time.

Parameters:

Name Type Description Default
text_object TextObject

Object containing section definitions

required

Yields:

Name Type Description
ProcessedSection ProcessedSection

One processed section at a time, containing:
  • title: Section title (English or original language)
  • original_text: Raw text segment
  • processed_text: Processed text content
  • start_line: Starting line number

TextObject

Manages text content with section organization and metadata tracking.

TextObject serves as the core container for text processing, providing:
  • Line-numbered text content management
  • Language identification
  • Section organization and access
  • Metadata tracking, including incorporated processing stages

Section boundaries are defined by line number: each section records only its start line, with no explicit end line, and implicitly ends where the next section begins. SectionObjects represent the sections.
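The implicit-end rule can be illustrated with a small sketch. The helper below is hypothetical (it is not part of TextObject) and assumes 1-based, inclusive line numbers, with the final section running to the last line of the text.

```python
from typing import List, Tuple


def section_ranges(start_lines: List[int], last_line_num: int) -> List[Tuple[int, int]]:
    """Turn 1-based section start lines into inclusive (start, end) ranges."""
    ranges = []
    for i, start in enumerate(start_lines):
        # A section ends on the line before the next section starts;
        # the final section runs to the last line of the text.
        if i + 1 < len(start_lines):
            end = start_lines[i + 1] - 1
        else:
            end = last_line_num
        ranges.append((start, end))
    return ranges


print(section_ranges([1, 4, 9], last_line_num=12))
# → [(1, 3), (4, 8), (9, 12)]
```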

Attributes:

Name Type Description
num_text

Line-numbered text content manager

language

ISO 639-1 language code for the text content

sections

List of text sections with boundaries

metadata

Processing and content metadata container

Example

content = NumberedText("Line 1\nLine 2\nLine 3")
obj = TextObject(content, language="en")

content property

Get the raw text content without line numbers.

Returns:

Type Description
str

Plain text content as string

language = language or get_language_code_from_text(num_text.content) instance-attribute
last_line_num property

Get the last line number in the text.

Returns:

Type Description
int

Last line number (1-based indexing)

metadata = metadata or Metadata() instance-attribute
metadata_str property

Get metadata as YAML-formatted string.

Returns:

Type Description
str

YAML representation of metadata

Example

print(obj.metadata_str)
author: Thich Nhat Hanh
language: en

num_text = num_text instance-attribute
numbered_content property

Get text content with line numbers prefixed.

Returns:

Type Description
str

Text with line numbers in format " 1 | line content"

Example

print(obj.numbered_content)
1 | First line
2 | Second line

section_count property

Get the total number of sections.

Returns:

Type Description
int

Number of sections, or 0 if no sections defined

sections = sections or [] instance-attribute
__init__(num_text, language=None, sections=None, metadata=None, validate_on_init=True)
__iter__()

Iterate through sections, yielding full section information.

Note: Pydantic BaseModel defines __iter__ for dict-like iteration over fields. We override it here for domain-specific section iteration. The type: ignore is intentional as we're providing a different iteration interface.

__str__()
export_info(source_file=None)

Export serializable state for persistence.

Parameters:

Name Type Description Default
source_file Optional[Path]

Optional path to source file to record in metadata

None

Returns:

Type Description
TextObjectInfo

TextObjectInfo instance containing serializable state

Note

If source_file is provided, it will be resolved to an absolute path.

from_info(info, metadata, num_text) classmethod

Create TextObject from serialized info and content.

Parameters:

Name Type Description Default
info TextObjectInfo

Serialized TextObjectInfo with section and language data

required
metadata Metadata

Base metadata to merge into the object

required
num_text NumberedText

NumberedText instance with the actual content

required

Returns:

Type Description
TextObject

TextObject instance with combined info and metadata

Example

info = TextObjectInfo.model_validate_json(json_str)
text = read_str_from_file(info.source_file)
obj = TextObject.from_info(info, Metadata(), NumberedText(text))

from_response(response, existing_metadata, num_text) classmethod

Create TextObject from AI response with section boundaries and metadata.

Extracts sections, language, and metadata from an AI-generated response (e.g., from sectioning or translation processing).

Parameters:

Name Type Description Default
response AIResponse

AIResponse model containing sections and metadata

required
existing_metadata Metadata

Base metadata to start with

required
num_text NumberedText

NumberedText instance with the text content

required

Returns:

Type Description
TextObject

TextObject with sections and merged metadata from AI response

Note

Merges metadata in order: existing → ai_summary/concepts/context → document_metadata

from_section_file(section_file, source=None) classmethod

Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.

Parameters:

Name Type Description Default
section_file Path

Path to JSON file containing TextObjectInfo

required
source Optional[str]

Optional source string in case no source file is found.

None

Returns:

Type Description
TextObject

TextObject instance

Raises:

Type Description
ValueError

If source_file is missing from section info

FileNotFoundError

If either section_file or source_file not found

from_str(text, language=None, sections=None, metadata=None) classmethod

Create a TextObject from a string, extracting any frontmatter.

Parameters:

Name Type Description Default
text str

Input text string, potentially containing frontmatter

required
language Optional[str]

ISO language code

None
sections Optional[List[SectionObject]]

List of section objects

None
metadata Optional[Metadata]

Optional base metadata to merge with frontmatter

None

Returns:

Type Description
TextObject

TextObject instance with combined metadata

from_text_file(file) classmethod

Create TextObject from a text file.

Reads the file and extracts any frontmatter metadata.

Parameters:

Name Type Description Default
file Path

Path to text file

required

Returns:

Type Description
TextObject

TextObject instance with extracted content and metadata

Example

obj = TextObject.from_text_file(Path("document.txt"))

get_section_content(index)

Get content for a section by index.

Parameters:

Name Type Description Default
index int

Zero-based section index

required

Returns:

Type Description
str

Section content as string

Raises:

Type Description
ValueError

If no sections are available

IndexError

If index is out of range

Example

obj = TextObject(num_text, sections=[...])
content = obj.get_section_content(0)  # First section

load(path, config=None) classmethod

Load TextObject from file with optional configuration.

Parameters:

Name Type Description Default
path Path

Input file path

required
config Optional[LoadConfig]

Optional loading configuration. If not provided, loads directly from text file.

None

Returns:

Type Description
TextObject

TextObject instance

Usage

Load from text file with frontmatter:

obj = TextObject.load(Path("content.txt"))

Load state from JSON with source content string:

config = LoadConfig(
    format=StorageFormat.JSON,
    source_str="Text content..."
)
obj = TextObject.load(Path("state.json"), config)

Load state from JSON with source content file:

config = LoadConfig(
    format=StorageFormat.JSON,
    source_file=Path("content.txt")
)
obj = TextObject.load(Path("state.json"), config)

merge_metadata(new_metadata, strategy=MergeStrategy.PRESERVE, source=None)

Merge metadata with explicit strategy and optional provenance tracking.

merge_metadata_legacy(new_metadata, override=False)

Deprecated legacy merge interface that maps to MergeStrategy.

save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)

Save TextObject to file in specified format.

Parameters:

Name Type Description Default
path Path

Output file path

required
output_format StorageFormatType

"text" for full content+metadata or "json" for serialized state

TEXT
source_file Optional[Path]

Optional source file to record in metadata

None
pretty bool

For JSON output, whether to pretty print

True
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)

Return a new TextObject with requested changes; does not mutate the original.

Parameters:

Name Type Description Default
data_str Optional[str]

Optional new text content

None
language Optional[str]

Optional new language code

None
metadata Optional[Metadata]

Metadata to merge into the new object

None
process_metadata Optional[ProcessMetadata]

Identifier/details for the process performed

None
sections Optional[List[SectionObject]]

Optional replacement list of sections

None
update_metadata(**kwargs)

Update metadata with new key-value pairs using PRESERVE strategy.

Convenience method for adding metadata without overriding existing keys.

Parameters:

Name Type Description Default
**kwargs Any

Key-value pairs to add to metadata

{}
Example

obj.update_metadata(author="Thich Nhat Hanh", year=2020)
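The "add without overriding" behaviour can be sketched with plain dictionaries. This is only an illustration of the idea behind a preserve-style merge; `merge_preserve` is a hypothetical helper, not the package's Metadata or MergeStrategy implementation.

```python
from typing import Any, Dict


def merge_preserve(existing: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return a merged copy in which keys already present in `existing` win."""
    merged = dict(new)
    merged.update(existing)  # existing values overwrite incoming ones
    return merged


meta = {"author": "Thich Nhat Hanh"}
merged = merge_preserve(meta, {"author": "Unknown", "year": 2020})
print(merged)
# → {'author': 'Thich Nhat Hanh', 'year': 2020}
```

The incoming `author` value is discarded because the key already exists, while the new `year` key is added.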

validate_sections(raise_on_error=True)

Validate section integrity using NumberedText boundary checks.

TextObjectInfo

Bases: BaseModel

Serializable information about a text and its sections.

language instance-attribute
metadata instance-attribute
sections instance-attribute
source_file = None class-attribute instance-attribute
model_post_init(__context)

Ensure metadata is always a Metadata instance after initialization.

__dir__()

__getattr__(name)

find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)

High-level function for generating text sections.

Parameters:

Name Type Description Default
text TextObject

Input text

required
source_language Optional[str]

ISO 639-1 language code

None
section_pattern Optional[Prompt]

Optional custom pattern (uses default if None)

None
section_model Optional[str]

Optional model identifier

None
max_tokens int

Maximum tokens for response

DEFAULT_SECTION_RESULT_MAX_SIZE
section_count Optional[int]

Target number of sections

None
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
template_dict Optional[Dict[str, str]]

Optional additional template variables

None

Returns:

Type Description
TextObject

TextObject containing section breakdown

get_pattern(name)

Get a pattern by name using the singleton PatternManager.

This is a more efficient version that reuses a single PatternManager instance.

Parameters:

Name Type Description Default
name str

Name of the pattern to load

required

Returns:

Type Description
Prompt

The loaded pattern

Raises:

Type Description
ValueError

If pattern name is invalid

FileNotFoundError

If pattern file doesn't exist

openai_process_text(text_input, process_instructions, model=None, response_format=None, batch=False, max_tokens=0)

Post-process a transcription.

process_text(text, pattern, source_language=None, model=None, template_dict=None)

process_text_by_paragraphs(text, template_dict, pattern=None, model=None)

High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.

Parameters:

Name Type Description Default
text TextObject

TextObject to process

required
template_dict Dict[str, str]

Dictionary for template substitution

required
pattern Optional[Prompt]

Pattern object containing processing instructions

None
model Optional[str]

Optional model identifier for processor

None

Returns:

Type Description
None

Generator for ProcessedSection objects (one per paragraph)

process_text_by_sections(text_object, template_dict, pattern, model=None)

High-level function for processing text sections with configurable output handling.

Parameters:

Name Type Description Default
text_object TextObject

Object containing section definitions

required
pattern Prompt

Pattern object containing processing instructions

required
template_dict Dict

Dictionary for template substitution

required
model Optional[str]

Optional model identifier for processor

None

Returns:

Type Description
None

Generator for ProcessedSections

translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)

ai_text_processing

DEFAULT_MIN_SECTION_COUNT = 3 module-attribute
DEFAULT_OPENAI_MODEL = 'gpt-4o' module-attribute
DEFAULT_PARAGRAPH_FORMAT_PATTERN = 'default_xml_paragraph_format' module-attribute
DEFAULT_PUNCTUATE_MODEL = 'gpt-4o' module-attribute
DEFAULT_PUNCTUATE_PATTERN = 'default_punctuate' module-attribute
DEFAULT_PUNCTUATE_STYLE = 'APA' module-attribute
DEFAULT_REVIEW_COUNT = 5 module-attribute
DEFAULT_SECTION_PATTERN = 'default_section' module-attribute
DEFAULT_SECTION_RANGE_VAR = 2 module-attribute
DEFAULT_SECTION_RESULT_MAX_SIZE = 4000 module-attribute
DEFAULT_SECTION_TOKEN_SIZE = 650 module-attribute
DEFAULT_XML_FORMAT_PATTERN = 'default_xml_format' module-attribute
SECTION_SEGMENT_SIZE_WARNING_LIMIT = 5 module-attribute
logger = get_child_logger(__name__) module-attribute
GeneralProcessor
pattern = pattern instance-attribute
processor = processor instance-attribute
review_count = review_count instance-attribute
source_language = source_language instance-attribute
__init__(processor, pattern, source_language=None, review_count=DEFAULT_REVIEW_COUNT)

Initialize general processor.

Parameters:

Name Type Description Default
processor TextProcessor

Implementation of TextProcessor

required
pattern Prompt

Pattern object containing processing instructions

required
source_language Optional[str]

ISO code for the source language

None
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
process_text(text, template_dict=None)

Process a text based on a pattern and source language.

OpenAIProcessor

Bases: TextProcessor

OpenAI-based text processor implementation.

max_tokens = max_tokens instance-attribute
model = model instance-attribute
__init__(model=None, max_tokens=0)
process_text(input_str, instructions, response_format=None, max_tokens=0, **kwargs)

Process text using OpenAI API with optional structured output.

ProcessedSection dataclass

Represents a processed section of text with its metadata.

metadata = field(default_factory=dict) class-attribute instance-attribute
original_str instance-attribute
processed_str instance-attribute
title instance-attribute
__init__(title, original_str, processed_str, metadata=dict())
SectionParser

Generates structured section breakdowns of text content.

review_count = review_count instance-attribute
section_pattern = section_pattern instance-attribute
section_scanner = section_scanner instance-attribute
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)

Initialize section generator.

Parameters:

Name Type Description Default
section_scanner TextProcessor

Text processor used to extract sections

required
section_pattern Prompt

Pattern object containing section generation instructions

required
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)

Generate section breakdown of input text. The text must be split up by newlines.

Parameters:

Name Type Description Default
text TextObject

Input TextObject to process

required
section_count_target Optional[int]

Target number of sections to find

None
segment_size_target Optional[int]

Target number of lines per section (if section_count_target is specified, this value is set automatically so that the segments match that count)

None
template_dict Optional[Dict[str, str]]

Optional additional template variables

None

Returns:

Type Description
TextObject

TextObject containing section breakdown

SectionProcessor

Handles section-based XML text processing with configurable output handling.

pattern = pattern instance-attribute
processor = processor instance-attribute
template_dict = template_dict instance-attribute
wrap_in_document = wrap_in_document instance-attribute
__init__(processor, pattern, template_dict, wrap_in_document=True)

Initialize the XML section processor.

Parameters:

Name Type Description Default
processor TextProcessor

Implementation of TextProcessor to use

required
pattern Prompt

Pattern object containing processing instructions

required
template_dict Dict

Dictionary for template substitution

required
wrap_in_document bool

Whether to wrap output in tags

True
process_paragraphs(text)

Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be separated by newlines.

Parameters:

Name Type Description Default
text TextObject

TextObject to process

required

Yields:

Name Type Description
ProcessedSection ProcessedSection

One processed paragraph at a time, containing:
  • title: Paragraph number (e.g., 'Paragraph 1')
  • original_str: Raw paragraph text
  • processed_str: Processed paragraph text
  • metadata: Optional metadata dict

process_sections(text_object)

Process transcript sections and yield results one section at a time.

Parameters:

Name Type Description Default
text_object TextObject

Object containing section definitions

required

Yields:

Name Type Description
ProcessedSection ProcessedSection

One processed section at a time, containing:
  • title: Section title (English or original language)
  • original_text: Raw text segment
  • processed_text: Processed text content
  • start_line: Starting line number

TextProcessor

Bases: ABC

Abstract base class for text processors that can return Pydantic objects.

process_text(input_str, instructions, response_format=None, **kwargs) abstractmethod

Process text according to instructions.

Parameters:

Name Type Description Default
input_str str

Input text to process

required
instructions str

Processing instructions

required
response_format Optional[Type[BaseModel]]

Optional Pydantic class for structured output

None
**kwargs Any

Additional processing parameters

{}

Returns:

Type Description
ProcessorResult

Either a string or a Pydantic model instance, depending on response_format

find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)

High-level function for generating text sections.

Parameters:

Name Type Description Default
text TextObject

Input text

required
source_language Optional[str]

ISO 639-1 language code

None
section_pattern Optional[Prompt]

Optional custom pattern (uses default if None)

None
section_model Optional[str]

Optional model identifier

None
max_tokens int

Maximum tokens for response

DEFAULT_SECTION_RESULT_MAX_SIZE
section_count Optional[int]

Target number of sections

None
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
template_dict Optional[Dict[str, str]]

Optional additional template variables

None

Returns:

Type Description
TextObject

TextObject containing section breakdown

get_pattern(name)

Get a pattern by name using the singleton PatternManager.

This is a more efficient version that reuses a single PatternManager instance.

Parameters:

Name Type Description Default
name str

Name of the pattern to load

required

Returns:

Type Description
Prompt

The loaded pattern

Raises:

Type Description
ValueError

If pattern name is invalid

FileNotFoundError

If pattern file doesn't exist

process_text(text, pattern, source_language=None, model=None, template_dict=None)
process_text_by_paragraphs(text, template_dict, pattern=None, model=None)

High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.

Parameters:

Name Type Description Default
text TextObject

TextObject to process

required
template_dict Dict[str, str]

Dictionary for template substitution

required
pattern Optional[Prompt]

Pattern object containing processing instructions

None
model Optional[str]

Optional model identifier for processor

None

Returns:

Type Description
None

Generator for ProcessedSection objects (one per paragraph)

process_text_by_sections(text_object, template_dict, pattern, model=None)

High-level function for processing text sections with configurable output handling.

Parameters:

Name Type Description Default
text_object TextObject

Object containing section definitions

required
pattern Prompt

Pattern object containing processing instructions

required
template_dict Dict

Dictionary for template substitution

required
model Optional[str]

Optional model identifier for processor

None

Returns:

Type Description
None

Generator for ProcessedSections

general_processor

line_translator

DEFAULT_TARGET_LANGUAGE = 'en' module-attribute
DEFAULT_TRANSLATE_CONTEXT_LINES = 3 module-attribute
DEFAULT_TRANSLATE_STYLE = "'American Dharma Teaching'" module-attribute
DEFAULT_TRANSLATION_PATTERN = 'default_line_translate' module-attribute
DEFAULT_TRANSLATION_TARGET_TOKENS = 300 module-attribute
FOLLOWING_CONTEXT_MARKER = 'FOLLOWING_CONTEXT' module-attribute
MAX_RETRIES = 6 module-attribute
MIN_SEGMENT_SIZE = 4 module-attribute
PRECEDING_CONTEXT_MARKER = 'PRECEDING_CONTEXT' module-attribute
TRANSCRIPT_SEGMENT_MARKER = 'TRANSCRIPT_SEGMENT' module-attribute
logger = get_child_logger(__name__) module-attribute
LineTranslator

Translates text line by line while maintaining line numbers and context.

context_lines = context_lines instance-attribute
pattern = pattern instance-attribute
processor = processor instance-attribute
review_count = review_count instance-attribute
style = style instance-attribute
__init__(processor, pattern, review_count=DEFAULT_REVIEW_COUNT, style=DEFAULT_TRANSLATE_STYLE, context_lines=DEFAULT_TRANSLATE_CONTEXT_LINES)

Initialize line translator.

Parameters:

Name Type Description Default
processor TextProcessor

Implementation of TextProcessor

required
pattern Prompt

Pattern object containing translation instructions

required
review_count int

Number of review passes

DEFAULT_REVIEW_COUNT
style str

Translation style to apply

DEFAULT_TRANSLATE_STYLE
context_lines int

Number of context lines to include before/after

DEFAULT_TRANSLATE_CONTEXT_LINES
translate_segment(num_text, start_line, end_line, metadata, target_language, source_language, template_dict=None)

Translate a segment of text with context.

Parameters:

Name Type Description Default
num_text NumberedText

Numbered text to extract segment from

required
start_line int

Starting line number of segment

required
end_line int

Ending line number of segment

required
metadata Metadata

Metadata for the text

required
source_language str

Source language code

required
target_language str

Target language code (e.g., en for English)

required
template_dict Optional[Dict]

Optional additional template values

None

Returns:

Type Description
str

Translated text segment with line numbers preserved
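To make the role of context lines concrete, the sketch below slices a segment together with its surrounding lines. It is an assumption-laden illustration: `segment_with_context` is a hypothetical helper, line numbers are taken to be 1-based and inclusive, and how the translator actually marks or packages the context is not shown here.

```python
from typing import List, Tuple


def segment_with_context(
    lines: List[str], start_line: int, end_line: int, context_lines: int
) -> Tuple[List[str], List[str], List[str]]:
    """Return (preceding, segment, following) for 1-based inclusive line numbers."""
    start = start_line - 1  # convert to 0-based index
    preceding = lines[max(0, start - context_lines):start]
    segment = lines[start:end_line]
    following = lines[end_line:end_line + context_lines]
    return preceding, segment, following


text_lines = [f"line {n}" for n in range(1, 11)]
before, seg, after = segment_with_context(
    text_lines, start_line=4, end_line=6, context_lines=3
)
```

Near the start or end of the text, the context lists simply come back shorter, so no special-casing is needed by the caller.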

translate_text(text, source_language, segment_size=None, target_language=DEFAULT_TARGET_LANGUAGE, template_dict=None)

Translate entire text in segments while maintaining line continuity.

Parameters:

Name Type Description Default
text TextObject

Text to translate

required
segment_size Optional[int]

Number of lines per translation segment

None
source_language str

Source language code

required
target_language str

Target language code (default: en for English)

DEFAULT_TARGET_LANGUAGE
template_dict Optional[Dict]

Optional additional template values

None

Returns:

Type Description
TextObject

Complete translated text with line numbers preserved

translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)

openai_process_interface

TOKEN_BUFFER = 500 module-attribute
logger = get_child_logger(__name__) module-attribute
openai_process_text(text_input, process_instructions, model=None, response_format=None, batch=False, max_tokens=0)

Post-process a transcription.

prompts

MANAGER_UPDATE_MESSAGE = 'PromptManager Update:' module-attribute
MarkdownStr = NewType('MarkdownStr', str) module-attribute
logger = get_child_logger(__name__) module-attribute
ConcurrentAccessManager

Manages concurrent access to prompt files.

Provides:
  • File-level locking
  • Safe concurrent access to prompts
  • Lock cleanup

lock_dir = Path(lock_dir) instance-attribute
__init__(lock_dir)

Initialize access manager.

Parameters:

Name Type Description Default
lock_dir Path

Directory for lock files

required
file_lock(file_path)

Context manager for safely accessing files.

Parameters:

Name Type Description Default
file_path Path

Path to file to lock

required

Yields:

Type Description
None

None when lock is acquired

Raises:

Type Description
RuntimeError

If file is already locked

OSError

If lock file operations fail
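A lock-file context manager of this kind can be sketched as follows. This is a generic pattern under stated assumptions, not the ConcurrentAccessManager implementation: `simple_file_lock` and the `<name>.lock` naming scheme are hypothetical, and the sketch relies on `os.open` with `O_CREAT | O_EXCL` to fail when the lock file already exists.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def simple_file_lock(lock_dir: Path, file_path: Path) -> Iterator[None]:
    """Hold an exclusive lock file for `file_path` for the duration of the block."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{file_path.name}.lock"  # hypothetical naming scheme
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise RuntimeError(f"{file_path} is already locked") from exc
    try:
        yield
    finally:
        os.close(fd)
        lock_path.unlink()  # lock cleanup


with tempfile.TemporaryDirectory() as tmp:
    locks = Path(tmp) / ".locks"
    target = Path(tmp) / "prompt.md"
    with simple_file_lock(locks, target):
        try:
            # A second attempt while the lock is held should fail.
            with simple_file_lock(locks, target):
                pass
        except RuntimeError as err:
            second_attempt = str(err)
    lock_released = not (locks / "prompt.md.lock").exists()
```

Removing the lock file in the `finally` clause means the lock is released even when the guarded block raises.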

is_locked(file_path)

Check if a file is currently locked.

Parameters:

Name Type Description Default
file_path Path

Path to file to check

required

Returns:

Name Type Description
bool bool

True if file is locked

GitBackedRepository

Manages versioned storage of prompts using Git.

Provides basic Git operations while hiding complexity:
  • Automatic versioning of changes
  • Basic conflict resolution
  • History tracking

repo = Repo(repo_path) instance-attribute
repo_path = repo_path instance-attribute
__init__(repo_path)

Initialize or connect to Git repository.

Parameters:

Name Type Description Default
repo_path Path

Path to repository directory

required

Raises:

Type Description
GitCommandError

If Git operations fail

display_history(file_path, max_versions=0)

Display history of changes for a file with diffs between versions.

Shows most recent changes first, limited to max_versions entries. For each change shows:
  • Commit info and date
  • Stats summary of changes
  • Detailed color diff with 2 lines of context

Parameters:

Name Type Description Default
file_path Path

Path to file in repository

required
max_versions int

Maximum number of versions to show; zero shows all revisions.

0
Example

repo.display_history(Path("prompts/format_dharma_talk.yaml"))
Commit abc123def (2024-12-28 14:30:22):
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/prompts/format_dharma_talk.yaml ...
...

update_file(file_path)

Stage and commit changes to a file in the Git repository.

Parameters:

Name Type Description Default
file_path Path

Absolute or relative path to the file.

required

Returns:

Name Type Description
str str

Commit hash if changes were made.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the file is outside the repository.

GitCommandError

If Git operations fail.

LocalPromptManager

A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.

This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.

Attributes:

Name Type Description
_instance Optional[SingletonPromptManager]

The singleton instance

_prompt_manager Optional[PromptManager]

The wrapped PromptManager instance

prompt_manager property

Lazy initialization of the PromptManager instance.

Returns:

Name Type Description
PromptManager PromptCatalog

The wrapped PromptManager instance

Raises:

Type Description
RuntimeError

If PATTERN_REPO is not properly configured

__new__()

Create or return the singleton instance.

Returns:

Name Type Description
SingletonPromptManager LocalPromptManager

The singleton instance
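The singleton-plus-lazy-property pattern described here can be sketched generically. The class below is a hypothetical illustration and does not reproduce LocalPromptManager; the dictionary stands in for whatever expensive object the real class wraps.

```python
from typing import Optional


class LazySingleton:
    """Hypothetical illustration of a singleton with a lazily created resource."""

    _instance: Optional["LazySingleton"] = None

    def __new__(cls) -> "LazySingleton":
        # Create the instance once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._resource = None
        return cls._instance

    @property
    def resource(self) -> dict:
        # Build the wrapped object on first access only.
        if self._resource is None:
            self._resource = {"loaded": True}  # stand-in for an expensive manager
        return self._resource


a = LazySingleton()
b = LazySingleton()
```

Both names refer to the same object, and the wrapped resource is built only once, on the first access to `resource`.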

get_prompt(name)

Get a prompt by name.

Prompt

Base Prompt class for version-controlled template prompts.

Prompts contain:
  • Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file.
  • Template fields: Default values for template variables
  • Metadata: Name and identifier information

Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.

Attributes:

Name Type Description
name str

The name of the prompt

instructions str

The Jinja2 template string for this prompt

default_template_fields Dict[str, str]

Default values for template variables

_allow_empty_vars bool

Whether to allow undefined template variables

_env Environment

Configured Jinja2 environment instance

default_template_fields = default_template_fields or {} instance-attribute
instructions = instructions instance-attribute
name = name instance-attribute
path = path instance-attribute
__eq__(other)

Compare prompts based on their content.

__hash__()

Hash based on content hash for container operations.

__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)

Initialize a new Prompt instance.

Parameters:

Name Type Description Default
name str

Unique name identifying the prompt

required
instructions MarkdownStr

Jinja2 template string containing the prompt

required
default_template_fields Optional[Dict[str, str]]

Optional default values for template variables

None
allow_empty_vars bool

Whether to allow undefined template variables

False

Raises:

Type Description
ValueError

If name or instructions are empty

TemplateError

If template syntax is invalid

apply_template(field_values=None)

Apply template values to prompt instructions using Jinja2.

Values precedence (highest to lowest):
  1. field_values (explicitly passed)
  2. frontmatter values (from prompt file)
  3. default_template_fields (prompt defaults)

Parameters:

Name Type Description Default
field_values Optional[Dict[str, str]]

Values to substitute into the template. If None, uses frontmatter/defaults.

None

Returns:

Name Type Description
str str

Rendered instructions with template values applied.

Raises:

Type Description
TemplateError

If template rendering fails

ValueError

If required template variables are missing
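The documented precedence order can be expressed as a layered dictionary merge. This sketch covers only the value-resolution step, not Jinja2 rendering; `resolve_template_values` is a hypothetical helper and the field names are illustrative.

```python
from typing import Dict, Optional


def resolve_template_values(
    defaults: Dict[str, str],
    frontmatter: Dict[str, str],
    field_values: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge template values so that later sources override earlier ones."""
    resolved = dict(defaults)            # lowest precedence
    resolved.update(frontmatter)         # overrides defaults
    resolved.update(field_values or {})  # highest precedence
    return resolved


values = resolve_template_values(
    defaults={"style": "plain", "language": "en"},
    frontmatter={"style": "formal"},
    field_values={"language": "vi"},
)
print(values)
# → {'style': 'formal', 'language': 'vi'}
```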

content_hash()

Generate a SHA-256 hash of the prompt content.

Useful for quick content comparison and change detection.

Returns:

Name Type Description
str str

Hexadecimal string of the SHA-256 hash
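Content hashing for change detection can be sketched with `hashlib`. Which fields the prompt actually feeds into its hash is not stated here, so this example assumes a single UTF-8 string; `sha256_hex` is a hypothetical helper.

```python
import hashlib


def sha256_hex(content: str) -> str:
    """Return the hexadecimal SHA-256 digest of UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Identical content hashes identically; any edit changes the digest.
same = sha256_hex("Translate {{ text }}") == sha256_hex("Translate {{ text }}")
changed = sha256_hex("Translate {{ text }}") != sha256_hex("Translate {{ text }}!")
```

Comparing two 64-character digests is cheaper than comparing full prompt bodies, which is why a hash suits equality checks and use as a container key.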

extract_frontmatter()

Extract and validate YAML frontmatter from markdown instructions.

Returns:

Type Description
Optional[Dict[str, Any]]

Optional[Dict]: Frontmatter data if found and valid, None otherwise

Note

Frontmatter must be at the very start of the file and properly formatted.
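The "very start of the file" requirement can be illustrated with a small splitter. This sketch assumes the common convention of a block delimited by `---` lines and returns the raw frontmatter text without parsing the YAML; `split_frontmatter` is a hypothetical helper, not the method documented above.

```python
from typing import Optional, Tuple


def split_frontmatter(markdown: str) -> Tuple[Optional[str], str]:
    """Split a document into (raw frontmatter, body).

    Returns (None, markdown) when no frontmatter block opens the document.
    """
    lines = markdown.split("\n")
    if not lines or lines[0].strip() != "---":
        return None, markdown  # frontmatter must be at the very start
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            raw = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return raw, body
    return None, markdown  # no closing delimiter: treat as plain content


doc = "---\nname: example\n---\nTranslate the text."
raw, body = split_frontmatter(doc)
```

A document whose first line is not the delimiter, or whose block is never closed, is returned unchanged as body text.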

from_dict(data) classmethod

Create prompt instance from dictionary data.

Parameters:

Name Type Description Default
data Dict[str, Any]

Dictionary containing prompt data

required

Returns:

Name Type Description
Prompt Prompt

New prompt instance

Raises:

Type Description
ValueError

If required fields are missing

get_content_without_frontmatter()

Get markdown content with frontmatter removed.

Returns:

Name Type Description
str str

Markdown content without frontmatter

source_bytes()

Best-effort raw bytes for prompt hashing.

Prefers hashing exact on-disk bytes including front-matter. We therefore first try to read from prompt_path. If that fails, we fall back to hashing the concatenation of known templates. In V1, only the instructions (system template) are used for rendering.
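The read-then-fall-back behaviour described for source_bytes might look like the following. This is only a sketch: `best_effort_source_bytes` is a hypothetical function, and the fallback here encodes a single instructions string rather than concatenating multiple templates.

```python
from pathlib import Path
from typing import Optional


def best_effort_source_bytes(prompt_path: Optional[Path], instructions: str) -> bytes:
    """Prefer exact on-disk bytes; fall back to the in-memory instructions."""
    if prompt_path is not None:
        try:
            return prompt_path.read_bytes()
        except OSError:
            # Missing or unreadable file: fall through to the fallback below.
            pass
    # UTF-8 is an assumed encoding for the fallback.
    return instructions.encode("utf-8")
```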

to_dict()

Convert prompt to dictionary for serialization.

Returns:

Type Description
Dict[str, Any]

Dict containing all prompt data in serializable format

update_frontmatter(new_data)

Update or add frontmatter to the markdown content.

Parameters:

Name Type Description Default
new_data Dict[str, Any]

Dictionary of frontmatter fields to update

required
PromptCatalog

Main interface for prompt management system.

Provides high-level operations:
  • Prompt creation and loading
  • Automatic versioning
  • Safe concurrent access
  • Basic history tracking
  • Case-insensitive prompt names (stored as lowercase)

access_manager = ConcurrentAccessManager(self.base_path / '.locks') instance-attribute
base_path = Path(base_path).resolve() instance-attribute
repo = GitBackedRepository(self.base_path) instance-attribute
__init__(base_path)

Initialize prompt management system.

Parameters:

Name Type Description Default
base_path Path

Base directory for prompt storage

required
get_path(prompt_name)

Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.

Parameters:

Name Type Description Default
prompt_name str

prompt name (without extension) to search for

required

Returns:

Type Description
Optional[Path]

Optional[Path]: Full path to the found prompt file, or None if not found
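A recursive, case-insensitive lookup of this kind can be sketched with `pathlib`. The function name `find_prompt_path` is hypothetical, and the `.md` extension and first-match-in-sorted-order behaviour are assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_prompt_path(base_path: Path, prompt_name: str) -> Optional[Path]:
    """Return the first .md file under base_path whose stem matches, ignoring case."""
    wanted = prompt_name.lower()
    for candidate in sorted(base_path.rglob("*.md")):
        if candidate.is_file() and candidate.stem.lower() == wanted:
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "nested" / "Summarize.md").write_text("Summarize: {{ text }}", encoding="utf-8")
    found = find_prompt_path(root, "SUMMARIZE")
    found_name = found.name if found else None
    missing = find_prompt_path(root, "translate")
```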

load(prompt_name)

Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.

Parameters:

Name Type Description Default
prompt_name str

Name of the prompt (without .md extension).

required

Returns:

Type Description
Prompt

A new Prompt object whose 'instructions' is the file's text and whose 'template_fields' are inferred from placeholders in those instructions.

save(prompt, subdir=None)
show_history(prompt_name)
verify_repository(base_path) classmethod

Verify repository integrity and uniqueness of prompt names.

Performs the following checks:

1. Validates Git repository structure.
2. Ensures no duplicate prompt names exist.

Parameters:

Name Type Description Default
base_path Path

Repository path to verify.

required

Returns:

Name Type Description
bool bool

True if the repository is valid and contains no duplicate prompt files.

response_format

TEXT_SECTIONS_DESCRIPTION = 'Ordered list of logical sections for the text. The sequence of line ranges for the sections must cover every line from start to finish without any overlaps or gaps.' module-attribute
LogicalSection

Bases: BaseModel

A logically coherent section of text.

end_line = Field(..., description='Ending line number of the section (inclusive).') class-attribute instance-attribute
start_line = Field(..., description='Starting line number of the section (inclusive).') class-attribute instance-attribute
title = Field(..., description='Meaningful title for the section in the original language of the section.') class-attribute instance-attribute
TextObject

Bases: BaseModel

Represents a text in any language broken into coherent logical sections.

language = Field(..., description='ISO 639-1 language code of the text.') class-attribute instance-attribute
sections = Field(..., description=TEXT_SECTIONS_DESCRIPTION) class-attribute instance-attribute
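The coverage rule in TEXT_SECTIONS_DESCRIPTION (every line covered, no overlaps, no gaps) can be checked with a short loop over inclusive ranges. This is an illustrative sketch: `covers_all_lines` is a hypothetical helper, and it assumes line numbers start at 1 and sections arrive in order.

```python
from typing import List, Tuple


def covers_all_lines(ranges: List[Tuple[int, int]], last_line: int) -> bool:
    """Check that inclusive (start_line, end_line) ranges tile lines 1..last_line."""
    expected_start = 1
    for start, end in ranges:
        if start != expected_start or end < start:
            return False  # gap, overlap, or inverted range
        expected_start = end + 1
    return expected_start == last_line + 1
```

For a ten-line text, `[(1, 3), (4, 10)]` passes, while `[(1, 3), (5, 10)]` leaves line 4 uncovered and `[(1, 4), (4, 10)]` counts line 4 twice.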

section_processor

text_object

StorageFormatType = Union[StorageFormat, Literal['text', 'json']] module-attribute
logger = get_child_logger(__name__) module-attribute
AIResponse

Bases: BaseModel

Class for dividing large texts into AI-processable segments while maintaining broader document context.

document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format') class-attribute instance-attribute
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose") class-attribute instance-attribute
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text') class-attribute instance-attribute
language = Field(..., description='ISO 639-1 language code') class-attribute instance-attribute
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole') class-attribute instance-attribute
sections instance-attribute
LoadConfig dataclass

Configuration for loading a TextObject.

Attributes:

Name Type Description
format StorageFormat

Storage format of the input file

source_str Optional[str]

Optional source content as string

source_file Optional[Path]

Optional path to source content file

Note

For JSON format, exactly one of source_str or source_file must be provided. Both fields are ignored for TEXT format.

format = StorageFormat.TEXT class-attribute instance-attribute
source_file = None class-attribute instance-attribute
source_str = None class-attribute instance-attribute
__init__(format=StorageFormat.TEXT, source_str=None, source_file=None)
__post_init__()

Validate LoadConfig constraints.

Ensures exactly one source is provided for JSON format using XOR logic.

Raises:

Type Description
ValueError

If JSON format specified without exactly one source
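The XOR check can be sketched with a plain dataclass. `Format` and `LoadConfigSketch` are stand-ins written for this example, not the library's classes, and the error message is invented.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class Format(Enum):  # stand-in for StorageFormat
    TEXT = "text"
    JSON = "json"


@dataclass
class LoadConfigSketch:
    format: Format = Format.TEXT
    source_str: Optional[str] = None
    source_file: Optional[Path] = None

    def __post_init__(self) -> None:
        if self.format is Format.JSON:
            has_str = self.source_str is not None
            has_file = self.source_file is not None
            # XOR is true only when exactly one of the two sources is set.
            if not (has_str ^ has_file):
                raise ValueError("JSON format requires exactly one of source_str or source_file")


def is_valid(**kwargs) -> bool:
    """Return True if the configuration passes validation."""
    try:
        LoadConfigSketch(**kwargs)
    except ValueError:
        return False
    return True
```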

get_source_text()

Get source content as text.

Reads from source_file if provided, otherwise returns source_str.

Returns:

Type Description
Optional[str]

Source text content, or None if neither source is set

Note

This method is primarily used internally by TextObject.load() for JSON format loading.

LogicalSection

Bases: BaseModel

Represents a contextually meaningful segment of a larger text.

Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.

start_line = Field(..., description='Starting line number that begins this logical segment') class-attribute instance-attribute
title = Field(..., description="Descriptive title of section's key content") class-attribute instance-attribute
MergeStrategy

Bases: Enum

Strategy for merging metadata.

DEEP_MERGE = 'deep' class-attribute instance-attribute
FAIL_ON_CONFLICT = 'fail' class-attribute instance-attribute
PRESERVE = 'preserve' class-attribute instance-attribute
UPDATE = 'update' class-attribute instance-attribute
SectionBoundaryError

Bases: ValidationError

Raised when section boundaries have gaps, overlaps, or out-of-bounds errors.

Attributes:

Name Type Description
errors

List of SectionValidationError instances from NumberedText

coverage_report

Coverage statistics (coverage_pct, gaps, overlaps)

coverage_report = coverage_report instance-attribute
errors = errors instance-attribute
__init__(errors, coverage_report)
SectionEntry

Bases: NamedTuple

Represents a section with its content during iteration.

content instance-attribute
number instance-attribute
range instance-attribute
title instance-attribute
SectionObject dataclass

Represents a section of text with computed boundaries and optional metadata.

SectionObject is used internally by TextObject to track section ranges. Unlike LogicalSection (which only has start_line), SectionObject includes the computed end boundary.

Attributes:

Name Type Description
title str

Descriptive title of the section

section_range SectionRange

Line range (start inclusive, end exclusive)

metadata Optional[Metadata]

Optional section-specific metadata

metadata instance-attribute
section_range instance-attribute
title instance-attribute
__init__(title, section_range, metadata)
from_logical_section(logical_section, end_line, metadata=None) classmethod

Create a SectionObject from a LogicalSection with computed end boundary.

Parameters:

Name Type Description Default
logical_section LogicalSection

AI-generated section with start_line and title

required
end_line int

Computed end boundary (exclusive)

required
metadata Optional[Metadata]

Optional metadata for this section

None

Returns:

Type Description
SectionObject

SectionObject with complete range information
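Since a LogicalSection carries only a start line, each end boundary has to be derived from the next section's start. A minimal sketch of that computation follows. `compute_section_ranges` is a hypothetical helper, and treating the final section's exclusive end as `last_line + 1` is an assumption.

```python
from typing import List, Tuple


def compute_section_ranges(start_lines: List[int], last_line: int) -> List[Tuple[int, int]]:
    """Turn ordered start lines into (start, end) pairs with exclusive ends."""
    ranges = []
    for i, start in enumerate(start_lines):
        # A section ends where the next begins; the last one runs to the end of the text.
        end = start_lines[i + 1] if i + 1 < len(start_lines) else last_line + 1
        ranges.append((start, end))
    return ranges


ranges = compute_section_ranges([1, 4, 8], last_line=10)
```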

SectionRange

Bases: NamedTuple

Represents the line range of a section.

end instance-attribute
start instance-attribute
StorageFormat

Bases: Enum

JSON = 'json' class-attribute instance-attribute
TEXT = 'text' class-attribute instance-attribute
TextObject

Manages text content with section organization and metadata tracking.

TextObject serves as the core container for text processing, providing:
  • Line-numbered text content management
  • Language identification
  • Section organization and access
  • Metadata tracking including incorporated processing stages

Section boundaries are expressed through line numbers: a section is defined by its start line, with no explicit end line. Each section implicitly ends where the next section begins. SectionObjects are used to represent sections.

Attributes:

Name Type Description
num_text

Line-numbered text content manager

language

ISO 639-1 language code for the text content

sections

List of text sections with boundaries

metadata

Processing and content metadata container

Example

content = NumberedText("Line 1\nLine 2\nLine 3")
obj = TextObject(content, language="en")

content property

Get the raw text content without line numbers.

Returns:

Type Description
str

Plain text content as string

language = language or get_language_code_from_text(num_text.content) instance-attribute
last_line_num property

Get the last line number in the text.

Returns:

Type Description
int

Last line number (1-based indexing)

metadata = metadata or Metadata() instance-attribute
metadata_str property

Get metadata as YAML-formatted string.

Returns:

Type Description
str

YAML representation of metadata

Example

print(obj.metadata_str)
author: Thich Nhat Hanh
language: en

num_text = num_text instance-attribute
numbered_content property

Get text content with line numbers prefixed.

Returns:

Type Description
str

Text with line numbers in format " 1 | line content"

Example

print(obj.numbered_content)
1 | First line
2 | Second line

section_count property

Get the total number of sections.

Returns:

Type Description
int

Number of sections, or 0 if no sections defined

sections = sections or [] instance-attribute
__init__(num_text, language=None, sections=None, metadata=None, validate_on_init=True)
__iter__()

Iterate through sections, yielding full section information.

Note: Pydantic BaseModel defines __iter__ for dict-like iteration over fields. It is overridden here for domain-specific section iteration. The type: ignore is intentional because this provides a different iteration interface.

__str__()
export_info(source_file=None)

Export serializable state for persistence.

Parameters:

Name Type Description Default
source_file Optional[Path]

Optional path to source file to record in metadata

None

Returns:

Type Description
TextObjectInfo

TextObjectInfo instance containing serializable state

Note

If source_file is provided, it will be resolved to an absolute path.

from_info(info, metadata, num_text) classmethod

Create TextObject from serialized info and content.

Parameters:

Name Type Description Default
info TextObjectInfo

Serialized TextObjectInfo with section and language data

required
metadata Metadata

Base metadata to merge into the object

required
num_text NumberedText

NumberedText instance with the actual content

required

Returns:

Type Description
TextObject

TextObject instance with combined info and metadata

Example

info = TextObjectInfo.model_validate_json(json_str)
text = read_str_from_file(info.source_file)
obj = TextObject.from_info(info, Metadata(), NumberedText(text))

from_response(response, existing_metadata, num_text) classmethod

Create TextObject from AI response with section boundaries and metadata.

Extracts sections, language, and metadata from an AI-generated response (e.g., from sectioning or translation processing).

Parameters:

Name Type Description Default
response AIResponse

AIResponse model containing sections and metadata

required
existing_metadata Metadata

Base metadata to start with

required
num_text NumberedText

NumberedText instance with the text content

required

Returns:

Type Description
TextObject

TextObject with sections and merged metadata from AI response

Note

Merges metadata in order: existing → ai_summary/concepts/context → document_metadata

from_section_file(section_file, source=None) classmethod

Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.

Parameters:

Name Type Description Default
section_file Path

Path to JSON file containing TextObjectInfo

required
source Optional[str]

Optional source string in case no source file is found.

None

Returns:

Type Description
TextObject

TextObject instance

Raises:

Type Description
ValueError

If source_file is missing from section info

FileNotFoundError

If either section_file or source_file not found

from_str(text, language=None, sections=None, metadata=None) classmethod

Create a TextObject from a string, extracting any frontmatter.

Parameters:

Name Type Description Default
text str

Input text string, potentially containing frontmatter

required
language Optional[str]

ISO language code

None
sections Optional[List[SectionObject]]

List of section objects

None
metadata Optional[Metadata]

Optional base metadata to merge with frontmatter

None

Returns:

Type Description
TextObject

TextObject instance with combined metadata

from_text_file(file) classmethod

Create TextObject from a text file.

Reads the file and extracts any frontmatter metadata.

Parameters:

Name Type Description Default
file Path

Path to text file

required

Returns:

Type Description
TextObject

TextObject instance with extracted content and metadata

Example

obj = TextObject.from_text_file(Path("document.txt"))

get_section_content(index)

Get content for a section by index.

Parameters:

Name Type Description Default
index int

Zero-based section index

required

Returns:

Type Description
str

Section content as string

Raises:

Type Description
ValueError

If no sections are available

IndexError

If index is out of range

Example

obj = TextObject(num_text, sections=[...])
content = obj.get_section_content(0)  # First section

load(path, config=None) classmethod

Load TextObject from file with optional configuration.

Parameters:

Name Type Description Default
path Path

Input file path

required
config Optional[LoadConfig]

Optional loading configuration. If not provided, loads directly from text file.

None

Returns:

Type Description
TextObject

TextObject instance

Usage

Load from text file with frontmatter:

obj = TextObject.load(Path("content.txt"))

Load state from JSON with source content string:

config = LoadConfig(format=StorageFormat.JSON, source_str="Text content...")
obj = TextObject.load(Path("state.json"), config)

Load state from JSON with source content file:

config = LoadConfig(format=StorageFormat.JSON, source_file=Path("content.txt"))
obj = TextObject.load(Path("state.json"), config)

merge_metadata(new_metadata, strategy=MergeStrategy.PRESERVE, source=None)

Merge metadata with explicit strategy and optional provenance tracking.
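The four MergeStrategy values suggest the behaviours sketched below on plain dictionaries. Only PRESERVE is described in this reference (existing keys are not overridden). The semantics shown for UPDATE, DEEP_MERGE, and FAIL_ON_CONFLICT are assumptions inferred from their names, and `Strategy` and `merge` are stand-ins rather than the library's code.

```python
from enum import Enum
from typing import Any, Dict


class Strategy(Enum):  # stand-in mirroring the MergeStrategy values
    PRESERVE = "preserve"
    UPDATE = "update"
    DEEP_MERGE = "deep"
    FAIL_ON_CONFLICT = "fail"


def merge(existing: Dict[str, Any], new: Dict[str, Any], strategy: Strategy) -> Dict[str, Any]:
    """Return a merged copy of existing; neither input is mutated."""
    result = dict(existing)
    for key, value in new.items():
        if key not in result:
            result[key] = value
        elif strategy is Strategy.UPDATE:
            result[key] = value  # assumed: new value wins
        elif strategy is Strategy.DEEP_MERGE:
            if isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = merge(result[key], value, strategy)  # assumed: recurse
            else:
                result[key] = value
        elif strategy is Strategy.FAIL_ON_CONFLICT and result[key] != value:
            raise ValueError(f"Conflicting values for key {key!r}")
        # PRESERVE: keep the existing value untouched.
    return result


base = {"author": "Thich Nhat Hanh", "tags": {"topic": "mindfulness"}}
preserved = merge(base, {"author": "Unknown", "year": 2020}, Strategy.PRESERVE)
```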

merge_metadata_legacy(new_metadata, override=False)

Deprecated legacy merge interface that maps to MergeStrategy.

save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)

Save TextObject to file in specified format.

Parameters:

Name Type Description Default
path Path

Output file path

required
output_format StorageFormatType

"text" for full content+metadata or "json" for serialized state

TEXT
source_file Optional[Path]

Optional source file to record in metadata

None
pretty bool

For JSON output, whether to pretty print

True
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)

Return a new TextObject with requested changes; does not mutate the original.

Parameters:

Name Type Description Default
data_str Optional[str]

Optional new text content

None
language Optional[str]

Optional new language code

None
metadata Optional[Metadata]

Metadata to merge into the new object

None
process_metadata Optional[ProcessMetadata]

Identifier/details for the process performed

None
sections Optional[List[SectionObject]]

Optional replacement list of sections

None
update_metadata(**kwargs)

Update metadata with new key-value pairs using PRESERVE strategy.

Convenience method for adding metadata without overriding existing keys.

Parameters:

Name Type Description Default
**kwargs Any

Key-value pairs to add to metadata

{}
Example

obj.update_metadata(author="Thich Nhat Hanh", year=2020)

validate_sections(raise_on_error=True)

Validate section integrity using NumberedText boundary checks.

TextObjectInfo

Bases: BaseModel

Serializable information about a text and its sections.

language instance-attribute
metadata instance-attribute
sections instance-attribute
source_file = None class-attribute instance-attribute
model_post_init(__context)

Ensure metadata is always a Metadata instance after initialization.

typing

ProcessorResult = Union[str, ResponseFormat] module-attribute
ResponseFormat = TypeVar('ResponseFormat', bound=BaseModel) module-attribute

audio_processing

__all__ = ['ArtifactRetention', 'DiarizationConfig', 'MultilingualTranscriptionRequest', 'MultilingualTranscriptionService', 'TranscriptionProvider'] module-attribute

__getattr__(name)

Lazily expose audio processing exports to avoid heavy import side effects.

audio_slice_utils

resolve_audio_format(audio_file)

Resolve the export format from the source audio suffix.

slice_audio_bytes(base_audio, start_ms, end_ms, audio_file)

Export a byte stream for a bounded audio slice.

diarization

__all__ = ['DiarizationProcessor', 'diarize', 'diarize_to_file', 'DiarizationParams', 'PyannoteClient', 'PyannoteConfig'] module-attribute
DiarizationParams

Bases: BaseModel

Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.

confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.') class-attribute instance-attribute
model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid') class-attribute instance-attribute
num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.") class-attribute instance-attribute
webhook = Field(default=None, description='Webhook URL for job status callbacks.') class-attribute instance-attribute
to_api_dict()

Return payload dict using API field names (camelCase) and excluding Nones.
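The shape of such a payload can be illustrated without Pydantic. This sketch uses a standard-library dataclass. `ParamsSketch` and `to_camel` are hypothetical, and the generic snake_case to camelCase conversion stands in for the explicit `numSpeakers` alias the real model declares.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Union


def to_camel(name: str) -> str:
    """Convert snake_case to camelCase (num_speakers becomes numSpeakers)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass(frozen=True)
class ParamsSketch:
    num_speakers: Optional[Union[int, str]] = None
    confidence: Optional[float] = None
    webhook: Optional[str] = None

    def to_api_dict(self) -> Dict[str, Any]:
        # Emit API field names and drop every field left as None.
        return {
            to_camel(f.name): getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


payload = ParamsSketch(num_speakers=2).to_api_dict()
```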

DiarizationProcessor

Orchestrator over a DiarizationService.

This layer delegates to the service for generation and handles persistence.

audio_file_path = audio_file_path.resolve() instance-attribute
output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json' instance-attribute
params = params instance-attribute
service = service or PyannoteService(default_client) instance-attribute
writer = writer or FileResultWriter() instance-attribute
__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
export(response=None)

Write the provided or last response to self.output_path.

generate(*, wait_until_complete=True)

One-shot convenience: delegate to the service and cache the response.

get_response(job=None, *, wait_until_complete=False)

Fetch current/final response for a job, caching the last response.

start()

Start a job and cache its job_id.

PyannoteClient

Client for interacting with the pyannote.ai speaker diarization API.

api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN') instance-attribute
config = config or PyannoteConfig() instance-attribute
headers = {'Authorization': f'Bearer {self.api_key}'} instance-attribute
network_timeout = self.config.network_timeout instance-attribute
polling_config = self.config.polling_config instance-attribute
upload_max_retries = self.config.upload_max_retries instance-attribute
upload_timeout = self.config.upload_timeout instance-attribute
JobPoller

Generic job polling helper for long-running async jobs.

job_id = job_id instance-attribute
last_status = None instance-attribute
poll_count = 0 instance-attribute
polling_config = polling_config instance-attribute
start_time = time.time() instance-attribute
status_fn = status_fn instance-attribute
__init__(status_fn, job_id, polling_config)
run()
__init__(api_key=None, config=None)

Initialize with API key.

Parameters:

Name Type Description Default
api_key Optional[str]

Pyannote.ai API key (defaults to environment variable)

None
check_job_status(job_id)

Check the status of a diarization job.

Returns a typed transport model (JobStatusResponse) or None on failure.

poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)

Poll until the job reaches a terminal state or a client-side stop condition, and return a unified JobStatusResponse (JSR) that includes both the server payload and polling context via outcome, polls, and elapsed_s.

Parameters:

Name Type Description Default
job_id str

Remote job identifier to poll.

required
estimated_duration Optional[float]

Optional hint; currently unused (reserved for adaptive backoff).

None
timeout Optional[float]

Optional hard timeout in seconds for this poll call. If provided, it overrides the client's default polling timeout. Ignored if wait_until_complete is True.

None
wait_until_complete Optional[bool]

If True, ignore timeout and poll indefinitely (subject to process lifetime).

False

Returns:

Name Type Description
JobStatusResponse JobStatusResponse

unified transport + polling-context result.

start_diarization(media_id, params=None)

Start diarization job with pyannote.ai API.

Parameters:

Name Type Description Default
media_id str

The media ID from upload_audio

required
params Optional[DiarizationParams]

Optional parameters for diarization

None

Returns:

Type Description
Optional[str]

Optional[str]: The job ID if started successfully, None otherwise

upload_audio(file_path)

Upload audio file with retry logic for network robustness.

Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
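Retrying transient failures with exponential backoff while failing fast on permanent ones can be sketched as follows. `retry_with_backoff` is a hypothetical helper. Which exception types count as transient, the base delay, and the doubling factor are all assumptions, not values taken from the client.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry on ConnectionError/TimeoutError; let any other exception propagate at once."""
    attempt = 0
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # delay doubles after each failure
            attempt += 1


calls = {"count": 0}
delays: List[float] = []


def flaky_upload() -> str:
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return "media-id"


# Recording delays instead of sleeping keeps the example instantaneous.
result = retry_with_backoff(flaky_upload, sleep=delays.append)
```

Passing the sleep function as a parameter is a design choice for testability; a permanent error such as FileNotFoundError is not caught and surfaces on the first attempt.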

PyannoteConfig

Bases: BaseSettings

Configuration constants for Pyannote API.

base_url = 'https://api.pyannote.ai/v1' class-attribute instance-attribute
diarize_endpoint property
job_status_endpoint property
media_content_type = 'audio/mpeg' class-attribute instance-attribute
media_input_endpoint property
media_prefix = 'media://diarization-' class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore') class-attribute instance-attribute
network_timeout = 3 class-attribute instance-attribute
polling_config = PollingConfig() class-attribute instance-attribute
upload_max_retries = 3 class-attribute instance-attribute
upload_timeout = 300 class-attribute instance-attribute
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)

One-shot convenience to generate a result and (optionally) write it.

This returns the DiarizationResponse. Writing is left to callers or diarize_to_file below.

diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)

Convenience helper: generate, then export to JSON if successful; returns the response.

audio
__all__ = ['AudioHandler', 'AudioHandlerConfig'] module-attribute
AudioHandler

Isolates audio operations and external dependencies (pydub, ffmpeg).

base_audio instance-attribute
config = config instance-attribute
input_format = None instance-attribute
output_format = config.output_format instance-attribute
__init__(config=AudioHandlerConfig())
build_audio_chunk(chunk, audio_file)

Build a new AudioChunk and assign it to chunk.audio.

export_audio_bytes(audio_segment, format_str=None)

Export AudioSegment to BytesIO for services/modules that require file-like objects.

AudioHandlerConfig

Bases: BaseSettings

Configuration settings for the AudioHandler. All audio time units are milliseconds (int)

SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'}) class-attribute instance-attribute
max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).') class-attribute instance-attribute
output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').") class-attribute instance-attribute
silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.') class-attribute instance-attribute
temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).') class-attribute instance-attribute
Config
env_prefix = 'AUDIO_HANDLER_' class-attribute instance-attribute
config
AudioHandlerConfig

Bases: BaseSettings

Configuration settings for the AudioHandler. All audio time units are milliseconds (int)

SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'}) class-attribute instance-attribute
max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).') class-attribute instance-attribute
output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').") class-attribute instance-attribute
silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.') class-attribute instance-attribute
temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).') class-attribute instance-attribute
Config
env_prefix = 'AUDIO_HANDLER_' class-attribute instance-attribute
handler

Audio handler utilities for slicing and assembling audio around diarization chunks. Designed for pipeline-friendly, single-responsibility methods so that higher-level services can remain agnostic of the underlying audio library.

This implementation purposely keeps logic minimal for testing.

logger = get_child_logger(__name__) module-attribute
AudioHandler

Isolates audio operations and external dependencies (pydub, ffmpeg).

base_audio instance-attribute
config = config instance-attribute
input_format = None instance-attribute
output_format = config.output_format instance-attribute
__init__(config=AudioHandlerConfig())
build_audio_chunk(chunk, audio_file)

Build a new AudioChunk and assign it to chunk.audio.

export_audio_bytes(audio_segment, format_str=None)

Export AudioSegment to BytesIO for services/modules that require file-like objects.

chunker
logger = get_child_logger(__name__) module-attribute
DiarizationChunker

Class for chunking diarization results into processing units based on configurable duration targets.

config = ChunkConfig() instance-attribute
__init__(**config_options)

Initialize chunker with additional config_options.

extract_contiguous_chunks(segments)

Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.

Parameters:

Name Type Description Default
segments List[DiarizedSegment]

List of speaker segments from diarization

required

Returns:

Type Description
List[DiarizationChunk]

List[DiarizationChunk]: Flat list of contiguous chunks
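Accumulating segments into chunks of roughly a target duration, regardless of speaker changes, might look like this. It is a simplified sketch: `Segment` and `chunk_segments` are hypothetical, and the gap handling and minimum-duration settings from ChunkConfig are not modelled.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str
    start_ms: int
    end_ms: int


def chunk_segments(segments: List[Segment], target_duration_ms: int) -> List[List[Segment]]:
    """Group consecutive segments until each group spans at least the target duration."""
    chunks: List[List[Segment]] = []
    current: List[Segment] = []
    for segment in segments:
        current.append(segment)
        # Close the chunk once it spans the target; speaker identity is ignored.
        if current[-1].end_ms - current[0].start_ms >= target_duration_ms:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # trailing remainder, possibly shorter than the target
    return chunks


segments = [
    Segment("SPEAKER_00", 0, 100),
    Segment("SPEAKER_01", 100, 250),
    Segment("SPEAKER_00", 250, 300),
    Segment("SPEAKER_01", 300, 600),
]
chunks = chunk_segments(segments, target_duration_ms=200)
```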

config
ChunkConfig

Bases: BaseSettings

Configuration for chunking

gap_spacing_time = 1000 class-attribute instance-attribute
gap_threshold = 4000 class-attribute instance-attribute
min_duration = 30000 class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='CHUNK_', extra='ignore') class-attribute instance-attribute
target_duration = 300000 class-attribute instance-attribute
DiarizationConfig

Bases: BaseSettings

chunk = ChunkConfig() class-attribute instance-attribute
language = LanguageConfig() class-attribute instance-attribute
mapping = MappingPolicy() class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='DIARIZATION_', extra='ignore') class-attribute instance-attribute
speaker = SpeakerConfig() class-attribute instance-attribute
LanguageConfig

Bases: BaseSettings

default_language = 'en' class-attribute instance-attribute
export_format = 'wav' class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='LANGUAGE_', extra='ignore') class-attribute instance-attribute
probe_time = 10000 class-attribute instance-attribute
MappingPolicy

Bases: BaseSettings

Mapping policy for transport→domain shaping.

TODO (future parameters to consider):
  • min_segment_ms: int # drop micro-segments below threshold
  • merge_gap_ms: int # merge adjacent same-speaker if gap ≤ this
  • round_ms_to: int # quantize boundaries (e.g., 10ms)
  • confidence_floor: float | None # filter out low-confidence segments
  • suppress_unlabeled: bool # drop segments missing speaker id
  • attach_raw_payload: bool # persist raw API payload in metadata
  • version: int # policy versioning for reproducibility

default_speaker_label = 'SPEAKER_00' class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='MAPPING_', extra='ignore') class-attribute instance-attribute
single_speaker = False class-attribute instance-attribute
PollingConfig

Bases: BaseSettings

Configuration constants for a generic polling class used for Pyannote API polling.

exp_base = 2 class-attribute instance-attribute
initial_poll_time = 7 class-attribute instance-attribute
max_interval = 30 class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_POLL_', extra='ignore') class-attribute instance-attribute
polling_interval = 15 class-attribute instance-attribute
polling_timeout = 300.0 class-attribute instance-attribute
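One plausible way these constants could combine is a capped exponential schedule, sketched below. This is an assumption, not the documented behaviour: the reference does not say how exp_base, initial_poll_time, and max_interval interact, and polling_interval is ignored here. `poll_intervals` is a hypothetical generator.

```python
import itertools
from typing import Iterator


def poll_intervals(
    initial_poll_time: float = 7,
    exp_base: float = 2,
    max_interval: float = 30,
) -> Iterator[float]:
    """Yield wait times that grow by exp_base and are capped at max_interval."""
    delay = float(initial_poll_time)
    while True:
        yield min(delay, max_interval)
        delay = min(delay * exp_base, max_interval)


first_five = list(itertools.islice(poll_intervals(), 5))
```

With the default values shown above, the first waits are 7, 14, and 28 seconds, after which the cap of 30 seconds applies.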
PyannoteConfig

Bases: BaseSettings

Configuration constants for Pyannote API.

base_url = 'https://api.pyannote.ai/v1' class-attribute instance-attribute
diarize_endpoint property
job_status_endpoint property
media_content_type = 'audio/mpeg' class-attribute instance-attribute
media_input_endpoint property
media_prefix = 'media://diarization-' class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore') class-attribute instance-attribute
network_timeout = 3 class-attribute instance-attribute
polling_config = PollingConfig() class-attribute instance-attribute
upload_max_retries = 3 class-attribute instance-attribute
upload_timeout = 300 class-attribute instance-attribute
SpeakerConfig

Bases: BaseSettings

Configuration settings for speaker block generation.

default_speaker_label = 'SPEAKER_00' class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='SPEAKER_', extra='ignore') class-attribute instance-attribute
same_speaker_gap_threshold = TimeMs.from_seconds(2) class-attribute instance-attribute
single_speaker = False class-attribute instance-attribute
models
logger = get_child_logger(__name__) module-attribute
AudioChunk

Bases: BaseModel

channels = None class-attribute instance-attribute
data instance-attribute
end_ms instance-attribute
format = None class-attribute instance-attribute
sample_rate = None class-attribute instance-attribute
start_ms instance-attribute
Config
arbitrary_types_allowed = True class-attribute instance-attribute
AugDiarizedSegment

Bases: DiarizedSegment

DiarizedSegment with additional chunking/processing metadata.

This class extends DiarizedSegment and adds fields that are only set during chunk accumulation or downstream processing.

Attributes:

Name Type Description
gap_before bool

Indicates if there is a gap greater than the configured threshold before this segment. Set only during chunk accumulation.

spacing_time TimeMs

The spacing (in ms) between this and the previous segment, possibly adjusted if there is a gap before. Set only during chunk accumulation.

audio TNHAudioSegment

The audio data for this segment, sliced from the original audio.

Notes
  • The audio field is a slice of the original audio corresponding to this segment.
  • All time values (start, end, duration) are relative to the original audio.
  • When slicing or probing the audio field, use times relative to 0 (i.e., 0 to duration).
  • For language probing or any operation on audio, always use 0 as the start and duration as the end.
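
The distinction between original-audio times and segment-relative times in the notes above can be illustrated with a small stand-in. `SegmentTimes` and `probe_window` are hypothetical names used only for this sketch; the real class also carries audio data and other fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentTimes:
    """Hypothetical stand-in holding only the timing fields of a segment."""

    start: int  # ms, relative to the original audio
    end: int    # ms, relative to the original audio

    @property
    def duration(self) -> int:
        return self.end - self.start

    @property
    def relative_start(self) -> int:
        # The sliced audio always begins at 0.
        return 0

    @property
    def relative_end(self) -> int:
        # The sliced audio ends at the segment duration.
        return self.duration


def probe_window(seg: SegmentTimes, probe_ms: int) -> tuple[int, int]:
    """Return a window inside the segment's own audio slice, clamped to its duration."""
    return seg.relative_start, min(probe_ms, seg.relative_end)


seg = SegmentTimes(start=65_000, end=72_500)
window = probe_window(seg, probe_ms=10_000)
```

Here the segment sits at 65.0 to 72.5 s in the original audio, but the probe window is expressed against the slice, so it runs from 0 to 7500 ms.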
audio instance-attribute
gap_before_new instance-attribute
relative_end property

End time relative to the segment audio (duration of segment).

relative_start property

Start time relative to the segment audio (always 0).

spacing_time_new instance-attribute
Config
arbitrary_types_allowed = True class-attribute instance-attribute
from_segment(segment, gap_before=None, spacing_time_new=None, audio=None, **kwargs) classmethod

Create an AugDiarizedSegment from a DiarizedSegment, with optional new fields.

Args:
  • segment (DiarizedSegment): The base segment to copy fields from.
  • gap_before_new (bool, optional): Value for gap_before_new. Defaults to False.
  • spacing_time_new (TimeMs, optional): Value for spacing_time_new. Defaults to None.
  • audio (AudioSegment, optional): Audio data for this segment. Defaults to None.
  • **kwargs: Any additional fields to override.

Returns:
  • AugDiarizedSegment: The new augmented segment.

DiarizationChunk

Bases: BaseModel

Represents a chunk of segments to be processed together.

accumulated_time = 0 class-attribute instance-attribute
audio = None class-attribute instance-attribute
end_time instance-attribute
segments instance-attribute
start_time instance-attribute
total_duration property

Get chunk duration in milliseconds.

total_duration_sec property
total_duration_time property
Config
arbitrary_types_allowed = True class-attribute instance-attribute
DiarizedSegment

Bases: BaseModel

Represents a diarized audio segment for a single speaker.

Attributes:

Name Type Description
speaker str

The speaker label for this segment.

start TimeMs

Start time in milliseconds.

end TimeMs

End time in milliseconds.

audio_map_start Optional[int]

Location in the audio output file, if mapped.

gap_before Optional[bool]

Indicates if there is a gap greater than the configured threshold before this segment. This attribute is set exclusively by ChunkAccumulator.add_segment() and should be None until that point.

spacing_time Optional[int]

The spacing (in ms) between this and the previous segment, possibly adjusted if there is a gap before. This attribute is also set exclusively by ChunkAccumulator.add_segment() and should be None until that point.

Notes
  • gap_before and spacing_time are not set during initial diarization, but are assigned only when the segment is accumulated into a chunk for downstream audio handling.
  • These fields should be considered write-once and must not be mutated elsewhere.
audio_map_start instance-attribute
duration property

Get segment duration in milliseconds.

duration_sec property
end instance-attribute
end_time property
gap_before instance-attribute
mapped_end property
mapped_start property

Downstream registry field set by the audio handler.

spacing_time instance-attribute
speaker instance-attribute
start instance-attribute
start_time property
normalize()

Normalize the duration of the segment to be nonzero and validate start/end values.

SpeakerBlock

Bases: BaseModel

A block of contiguous or near-contiguous segments spoken by the same speaker.

Used as a higher-level abstraction over diarization segments to simplify chunking strategies (e.g., language-aware sampling, re-segmentation).

duration property
duration_sec property
end property
segment_count property
segments instance-attribute
speaker instance-attribute
start property
Config
arbitrary_types_allowed = True class-attribute instance-attribute
from_dict(data) classmethod

Create a SpeakerBlock from a dictionary (output of to_dict).

Args:
  • data (dict): Dictionary with keys matching SpeakerBlock fields.

Returns:
  • SpeakerBlock: Deserialized SpeakerBlock instance.

Raises:
  • ValueError, TypeError: If validation fails.

to_dict()

Custom serializer for SpeakerBlock with validation.

protocols

Interfaces shared by diarization strategy classes.

AudioFetcher

Bases: Protocol

Abstract audio provider for probing a segment.

extract_audio(start_ms, end_ms)
ChunkingStrategy

Bases: Protocol

Protocol every chunking strategy must satisfy.

extract(segments)
DiarizationService

Bases: Protocol

Protocol for any diarization service.

generate(audio_path, params=None, *, wait_until_complete=True)

One-shot convenience: start + (optionally) wait + fetch + map.

Implementations may optimize this path; default behavior can be start() followed by get_response().

get_response(job_id, *, wait_until_complete=False)

Return the current state or final result as a DiarizationResponse.

When wait_until_complete is True, the service blocks until a terminal state (succeeded/failed/timeout) and returns that envelope.

start(audio_path, params=None)

Start a diarization job and return an opaque job_id.
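
A minimal sketch of the default behavior described for this protocol, where generate() is start() followed by get_response(). `FakeResponse` and `InMemoryService` are hypothetical stand-ins written for illustration; they do not call any real diarization backend.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in for DiarizationResponse."""

    job_id: str
    status: str


class InMemoryService:
    """Toy service with the same three methods as the protocol."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}

    def start(self, audio_path: Path, params: Optional[dict] = None) -> str:
        # Return an opaque job id; the job starts out as running.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "running"
        return job_id

    def get_response(self, job_id: str, *, wait_until_complete: bool = False) -> FakeResponse:
        if wait_until_complete:
            # Pretend we blocked until the job reached a terminal state.
            self._jobs[job_id] = "succeeded"
        return FakeResponse(job_id, self._jobs[job_id])

    def generate(
        self,
        audio_path: Path,
        params: Optional[dict] = None,
        *,
        wait_until_complete: bool = True,
    ) -> FakeResponse:
        # Default path: start the job, then fetch its response.
        job_id = self.start(audio_path, params)
        return self.get_response(job_id, wait_until_complete=wait_until_complete)


service = InMemoryService()
final = service.generate(Path("talk.mp3"))
snapshot = service.generate(Path("talk.mp3"), wait_until_complete=False)
```

Because the protocol is structural, any class with matching method signatures can be swapped in, which makes a fake like this useful for tests.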

LanguageDetector

Bases: Protocol

Abstract language detector (e.g., fastText, Whisper-lang).

detect(audio, format_str)
ResultWriter

Bases: Protocol

Port for persisting diarization results.

write(path, response)
SegmentAdapter

Bases: Protocol

to_segments(data)
pyannote_adapter
logger = get_child_logger(__name__) module-attribute
PyannoteAdapter

Bases: SegmentAdapter

config = config instance-attribute
__init__(config=DiarizationConfig())
failed_start()
to_response(jsr)

Convert a JobStatusResponse to a DiarizationResponse (domain layer).

to_segments(data)

Convert a pyannote.ai diarization result dict to a list of DiarizedSegment objects.

pyannote_client

pyannote_client.py

Client interface for interacting with the pyannote.ai speaker diarization API.

This module provides a robust, object-oriented client for uploading audio files, starting diarization jobs, polling for job completion, and retrieving results from the pyannote.ai API. It includes retry logic, configurable timeouts, and support for advanced diarization parameters.

Typical usage

    client = PyannoteClient(api_key="your_api_key")
    media_id = client.upload_audio(Path("audio.mp3"))
    job_id = client.start_diarization(media_id)
    result = client.poll_job_until_complete(job_id)

JOB_ID_FIELD = 'jobId' module-attribute
logger = get_child_logger(__name__) module-attribute
APIKeyError

Bases: Exception

Raised when API key is missing or invalid.

PyannoteClient

Client for interacting with the pyannote.ai speaker diarization API.

api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN') instance-attribute
config = config or PyannoteConfig() instance-attribute
headers = {'Authorization': f'Bearer {self.api_key}'} instance-attribute
network_timeout = self.config.network_timeout instance-attribute
polling_config = self.config.polling_config instance-attribute
upload_max_retries = self.config.upload_max_retries instance-attribute
upload_timeout = self.config.upload_timeout instance-attribute
JobPoller

Generic job polling helper for long-running async jobs.

job_id = job_id instance-attribute
last_status = None instance-attribute
poll_count = 0 instance-attribute
polling_config = polling_config instance-attribute
start_time = time.time() instance-attribute
status_fn = status_fn instance-attribute
__init__(status_fn, job_id, polling_config)
run()
__init__(api_key=None, config=None)

Initialize with API key.

Parameters:

Name Type Description Default
api_key Optional[str]

Pyannote.ai API key (defaults to environment variable)

None
check_job_status(job_id)

Check the status of a diarization job.

Returns a typed transport model (JobStatusResponse) or None on failure.

poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)

Poll until the job reaches a terminal state or a client-side stop condition, and return a unified JobStatusResponse (JSR) that includes both the server payload and polling context via outcome, polls, and elapsed_s.

Parameters:

Name Type Description Default
job_id str

Remote job identifier to poll.

required
estimated_duration Optional[float]

Optional hint; currently unused (reserved for adaptive backoff).

None
timeout Optional[float]

Optional hard timeout in seconds for this poll call. If provided, it overrides the client's default polling timeout. Ignored if wait_until_complete is True.

None
wait_until_complete Optional[bool]

If True, ignore timeout and poll indefinitely (subject to process lifetime).

False

Returns:

Name Type Description
JobStatusResponse JobStatusResponse

unified transport + polling-context result.
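
The polling contract above (terminal state or client-side stop, with outcome, polls, and elapsed_s reported) can be sketched as follows. This is an illustrative loop with a fixed interval, not the client's actual implementation; `PollResult`, `poll_until_complete`, and `FakeClock` are hypothetical names, and the clock is injected so the demo does not really sleep.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

TERMINAL = {"succeeded", "failed"}


@dataclass(frozen=True)
class PollResult:
    outcome: str            # how polling concluded
    status: Optional[str]   # last known server status
    polls: int
    elapsed_s: float


def poll_until_complete(
    status_fn: Callable[[], Optional[str]],
    *,
    timeout: Optional[float] = 300.0,
    interval: float = 15.0,
    wait_until_complete: bool = False,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> PollResult:
    start = clock()
    polls = 0
    while True:
        status = status_fn()
        polls += 1
        elapsed = clock() - start
        if status in TERMINAL:
            return PollResult(status, status, polls, elapsed)
        # The timeout is ignored when wait_until_complete is True.
        if not wait_until_complete and timeout is not None and elapsed >= timeout:
            return PollResult("timeout", status, polls, elapsed)
        sleep(interval)


class FakeClock:
    """Deterministic clock so the demo advances time without sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
statuses = iter(["pending", "running", "succeeded"])
result = poll_until_complete(lambda: next(statuses), clock=fake.time, sleep=fake.sleep)
```

Keeping the polling metrics in the returned object means callers can tell a server-side failure from a client-side timeout without inspecting logs.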

start_diarization(media_id, params=None)

Start diarization job with pyannote.ai API.

Parameters:

Name Type Description Default
media_id str

The media ID from upload_audio

required
params Optional[DiarizationParams]

Optional parameters for diarization

None

Returns:

Type Description
Optional[str]

The job ID if started successfully, None otherwise.

upload_audio(file_path)

Upload audio file with retry logic for network robustness.

Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
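
The retry policy described above (back off on network errors, fail fast on permanent ones) might look roughly like the sketch below. `PermanentError` and `with_retries` are hypothetical names, the delay schedule is an assumption, and `max_retries` is read here as the number of retries after the first attempt, which may differ from how `upload_max_retries` is interpreted in the client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PermanentError(Exception):
    """Hypothetical marker for errors that retrying cannot fix (auth, missing file)."""


def with_retries(
    action: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(max_retries + 1):
        try:
            return action()
        except PermanentError:
            # Fail fast: retrying will not help.
            raise
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


# Demo: an upload that fails twice with a network error, then succeeds.
delays: list[float] = []
attempts = {"n": 0}


def flaky_upload() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("network blip")
    return "media://diarization-demo"


media_id = with_retries(flaky_upload, sleep=delays.append)
```

The demo records the delays instead of sleeping, so the backoff schedule can be inspected directly.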

pyannote_diarize
PYANNOTE_FILE_STR = '_pyannote_diarization' module-attribute
logger = get_child_logger(__name__) module-attribute
DiarizationProcessor

Orchestrator over a DiarizationService.

This layer delegates to the service for generation and handles persistence.

audio_file_path = audio_file_path.resolve() instance-attribute
output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json' instance-attribute
params = params instance-attribute
service = service or PyannoteService(default_client) instance-attribute
writer = writer or FileResultWriter() instance-attribute
__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
export(response=None)

Write the provided or last response to self.output_path.

generate(*, wait_until_complete=True)

One-shot convenience: delegate to the service and cache the response.

get_response(job=None, *, wait_until_complete=False)

Fetch current/final response for a job, caching the last response.

start()

Start a job and cache its job_id.
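
The default for output_path shown in the attribute list places the JSON next to the audio file, using the audio stem plus PYANNOTE_FILE_STR. A standalone sketch of that rule follows; the helper name `default_output_path` is hypothetical and is not part of the package.

```python
from pathlib import Path
from typing import Optional

PYANNOTE_FILE_STR = "_pyannote_diarization"  # value listed for this module


def default_output_path(audio_file_path: Path, output_path: Optional[Path] = None) -> Path:
    """Resolve an explicit output path, or derive one beside the audio file."""
    audio = audio_file_path.resolve()
    if output_path is not None:
        return output_path.resolve()
    return audio.parent / f"{audio.stem}{PYANNOTE_FILE_STR}.json"


derived = default_output_path(Path("talks/dharma_talk.mp3"))
explicit = default_output_path(Path("talks/dharma_talk.mp3"), Path("out/result.json"))
```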

FileResultWriter

Default file-system writer to JSON.

write(path, response)
PyannoteService

Bases: DiarizationService

Concrete implementation of DiarizationService for pyannote.ai.

Bridges transport (PyannoteClient) and mapping (PyannoteAdapter) while exposing a clean domain-facing API.

adapter = adapter or PyannoteAdapter() instance-attribute
client = client or PyannoteClient() instance-attribute
__init__(client=None, adapter=None)
generate(audio_path, params=None, *, wait_until_complete=True)
get_response(job_id, *, wait_until_complete=False)
start(audio_path, params=None)
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)

One-shot convenience to generate a result and (optionally) write it.

This returns the DiarizationResponse. Writing is left to callers or diarize_to_file below.

diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)

Convenience helper: generate, then export to JSON if successful. Returns the response.

schemas
DiarizationResponse = Annotated[Union[DiarizationSucceeded, DiarizationFailed, DiarizationPending, DiarizationRunning], Field(discriminator='status')] module-attribute
__all__ = ['PollOutcome', 'DiarizationParams', 'StartDiarizationResponse', 'JobStatus', 'JobStatusResponse', 'ErrorCode', 'ErrorInfo', 'DiarizationResult', 'DiarizationSucceeded', 'DiarizationFailed', 'DiarizationPending', 'DiarizationRunning', 'DiarizationResponse'] module-attribute
DiarizationFailed

Bases: _BaseResponse

error instance-attribute
status instance-attribute
DiarizationParams

Bases: BaseModel

Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.

confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.') class-attribute instance-attribute
model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid') class-attribute instance-attribute
num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.") class-attribute instance-attribute
webhook = Field(default=None, description='Webhook URL for job status callbacks.') class-attribute instance-attribute
to_api_dict()

Return payload dict using API field names (camelCase) and excluding Nones.
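
As an illustration of the behavior described for to_api_dict() (camelCase keys, None values dropped), here is a dependency-free sketch. `Params` and `to_camel` are hypothetical stand-ins built on dataclasses; the real model is a Pydantic class that uses field aliases instead.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


def to_camel(name: str) -> str:
    """Convert snake_case to camelCase, e.g. num_speakers -> numSpeakers."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for per-request options."""

    num_speakers: Optional[int] = None
    confidence: Optional[float] = None
    webhook: Optional[str] = None

    def to_api_dict(self) -> dict[str, Any]:
        # Emit API field names and skip anything left unset.
        return {
            to_camel(f.name): getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


payload = Params(num_speakers=2).to_api_dict()
```

Dropping unset values keeps the request body minimal, so server-side defaults apply to every option the caller did not specify.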

DiarizationPending

Bases: _BaseResponse

status instance-attribute
DiarizationResult

Bases: BaseModel

Domain-level diarization payload used by the rest of the system. NOTE: segments is intentionally typed as list[Any] so that it can hold your project’s DiarizedSegment instances from models.py without creating an import cycle. You can tighten this typing later to list[DiarizedSegment] and import under TYPE_CHECKING if desired.

metadata = None class-attribute instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore') class-attribute instance-attribute
num_speakers = None class-attribute instance-attribute
segments instance-attribute
DiarizationRunning

Bases: _BaseResponse

status instance-attribute
DiarizationSucceeded

Bases: _BaseResponse

result instance-attribute
status instance-attribute
ErrorCode

Bases: str, Enum

Client- and adapter-level error taxonomy (not server statuses).

API_ERROR = 'api_error' class-attribute instance-attribute
BAD_REQUEST = 'bad_request' class-attribute instance-attribute
CANCELLED = 'cancelled' class-attribute instance-attribute
PARSE_ERROR = 'parse_error' class-attribute instance-attribute
TIMEOUT = 'timeout' class-attribute instance-attribute
TRANSIENT = 'transient' class-attribute instance-attribute
UNKNOWN = 'unknown' class-attribute instance-attribute
ErrorInfo

Bases: BaseModel

code instance-attribute
details = None class-attribute instance-attribute
message instance-attribute
model_config = ConfigDict(frozen=True, extra='allow') class-attribute instance-attribute
JobHandle dataclass
backend = 'pyannote' class-attribute instance-attribute
job_id instance-attribute
__init__(job_id, backend='pyannote')
JobStatus

Bases: str, Enum

CREATED = 'created' class-attribute instance-attribute
FAILED = 'failed' class-attribute instance-attribute
PENDING = 'pending' class-attribute instance-attribute
RUNNING = 'running' class-attribute instance-attribute
SUCCEEDED = 'succeeded' class-attribute instance-attribute
JobStatusResponse

Bases: BaseModel

Job Status Result (JSR): unified transport payload + client polling context. Combines transport-level fields with client-side polling metadata.

Semantics:
  • outcome describes how polling concluded (terminal success/failure, timeout, network error, etc.).
  • status is the last known server job status (SUCCEEDED, FAILED, RUNNING, PENDING).
  • server_error_msg and payload mirror the remote payload when present.
  • polls and elapsed_s report client polling metrics.

elapsed_s = 0.0 class-attribute instance-attribute
job_id = Field(alias='jobId') class-attribute instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore', populate_by_name=True) class-attribute instance-attribute
outcome = PollOutcome.ERROR class-attribute instance-attribute
payload = Field(default=None, alias='output') class-attribute instance-attribute
polls = 0 class-attribute instance-attribute
server_error_msg = Field(default=None, alias='error') class-attribute instance-attribute
status = None class-attribute instance-attribute
normalize_created_status(value) classmethod

Normalize pyannote pre-running status to the existing domain contract.

PollOutcome

Bases: str, Enum

ERROR = 'error' class-attribute instance-attribute
FAILED = 'failed' class-attribute instance-attribute
INTERRUPTED = 'interrupted' class-attribute instance-attribute
NETWORK_ERROR = 'network_error' class-attribute instance-attribute
SUCCEEDED = 'succeeded' class-attribute instance-attribute
TIMEOUT = 'timeout' class-attribute instance-attribute
StartDiarizationResponse

Bases: BaseModel

Minimal typed view of the start-diarization response.

job_id = Field(alias='jobId') class-attribute instance-attribute
model_config = ConfigDict(frozen=True, extra='ignore') class-attribute instance-attribute
strategies
__all__ = ['LanguageDetector', 'LanguageProbe', 'WhisperLanguageDetector', 'group_speaker_blocks', 'TimeGapChunker'] module-attribute
LanguageDetector

Bases: Protocol

Abstract language detector (e.g., fastText, Whisper-lang).

detect(audio, format_str)
LanguageProbe
detector = detector instance-attribute
export_format = config.language.export_format instance-attribute
probe_time = config.language.probe_time instance-attribute
__init__(config, detector)
segment_language(aug_segment)

Get the ISO-639 language code for an augmented diarized segment (AugDiarizedSegment), which contains audio.

The probe window is always relative to the segment audio (0=start, duration=end).

TimeGapChunker

Bases: ChunkingStrategy

Chunker that ignores speaker/language and uses only time-gap logic.

cfg = config instance-attribute
__init__(config=DiarizationConfig())
extract(segments)

Extract time-based chunks from diarization segments.

WhisperLanguageDetector

Language detector using Whisper service.

audio_handler = audio_handler or AudioHandler() instance-attribute
model = model instance-attribute
__init__(model='whisper-1', audio_handler=None)
detect(audio, format_str)
group_speaker_blocks(segments, config=DiarizationConfig())

Group contiguous or near-contiguous segments by speaker identity.

Segments are grouped into SpeakerBlocks when the speaker remains the same and the gap between consecutive segments is less than the configured threshold.

Parameters:

Name Type Description Default
segments List[DiarizedSegment]

A list of diarization segments (must be sorted by start time).

required
config DiarizationConfig

Configuration containing the allowed gap between segments.

DiarizationConfig()

Returns:

Type Description
List[SpeakerBlock]

A list of SpeakerBlock objects representing grouped speaker runs.

language_based

LanguageChunker – chunking informed by speaker blocks + language probing.

logger = get_child_logger(__name__) module-attribute
LanguageChunker

Bases: ChunkingStrategy

Strategy:

  1. Group contiguous segments into SpeakerBlock objects.
  2. For each block longer than language_probe_threshold, probe the language at configurable offsets; if the probes disagree, split the block where the language changes.
  3. Build chunks respecting target_time similar to TimeGapChunker.
cfg = cfg instance-attribute
detector = detector instance-attribute
fetcher = fetcher instance-attribute
lang_thresh = language_probe_threshold instance-attribute
__init__(cfg=ChunkConfig(), fetcher=None, detector=None, language_probe_threshold=TimeMs(90000))
extract(segments)
language_probe

Lightweight language-detection helpers pluggable into chunkers.

logger = get_child_logger(__name__) module-attribute
LanguageProbe
detector = detector instance-attribute
export_format = config.language.export_format instance-attribute
probe_time = config.language.probe_time instance-attribute
__init__(config, detector)
segment_language(aug_segment)

Get the ISO-639 language code for an augmented diarized segment (AugDiarizedSegment), which contains audio.

The probe window is always relative to the segment audio (0=start, duration=end).

WhisperLanguageDetector

Language detector using Whisper service.

audio_handler = audio_handler or AudioHandler() instance-attribute
model = model instance-attribute
__init__(model='whisper-1', audio_handler=None)
detect(audio, format_str)
speaker_blocker
group_speaker_blocks(segments, config=DiarizationConfig())

Group contiguous or near-contiguous segments by speaker identity.

Segments are grouped into SpeakerBlocks when the speaker remains the same and the gap between consecutive segments is less than the configured threshold.

Parameters:

Name Type Description Default
segments List[DiarizedSegment]

A list of diarization segments (must be sorted by start time).

required
config DiarizationConfig

Configuration containing the allowed gap between segments.

DiarizationConfig()

Returns:

Type Description
List[SpeakerBlock]

A list of SpeakerBlock objects representing grouped speaker runs.
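
The grouping rule above (same speaker, gap below the threshold) can be sketched with plain dataclasses. `Seg`, `Block`, and `group_blocks` are hypothetical stand-ins for illustration; the 2000 ms default mirrors the same_speaker_gap_threshold value listed under SpeakerConfig, and the input is assumed to be sorted by start time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Seg:
    speaker: str
    start: int  # ms
    end: int    # ms


@dataclass
class Block:
    speaker: str
    segments: list[Seg] = field(default_factory=list)

    @property
    def start(self) -> int:
        return self.segments[0].start

    @property
    def end(self) -> int:
        return self.segments[-1].end


def group_blocks(segments: list[Seg], gap_threshold_ms: int = 2_000) -> list[Block]:
    """Group sorted segments into runs of one speaker with small gaps."""
    blocks: list[Block] = []
    for seg in segments:
        current = blocks[-1] if blocks else None
        same_speaker = current is not None and current.speaker == seg.speaker
        if same_speaker and seg.start - current.end < gap_threshold_ms:
            current.segments.append(seg)
        else:
            # Speaker changed or the pause is too long: start a new block.
            blocks.append(Block(seg.speaker, [seg]))
    return blocks


segments = [
    Seg("A", 0, 1_000),
    Seg("A", 1_500, 3_000),
    Seg("B", 3_100, 4_000),
    Seg("A", 4_200, 5_000),
    Seg("A", 9_000, 9_500),  # 4 s pause, so this opens a new block
]
blocks = group_blocks(segments)
```

In this example the first two segments merge, the speaker change creates a second block, and the long pause splits the final two segments even though the speaker is the same.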

time_gap

TimeGapChunker – baseline strategy: split purely on accumulated time.

logger = get_child_logger(__name__) module-attribute
TimeGapChunker

Bases: ChunkingStrategy

Chunker that ignores speaker/language and uses only time-gap logic.

cfg = config instance-attribute
__init__(config=DiarizationConfig())
extract(segments)

Extract time-based chunks from diarization segments.
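
A rough sketch of splitting on accumulated time is given below. It is an assumption about the general idea, not the library's algorithm: `chunk_by_time` and `target_ms` are hypothetical names, segments are plain (start, end) pairs in milliseconds, and gap handling between segments is ignored.

```python
Segment = tuple[int, int]  # (start_ms, end_ms)


def chunk_by_time(segments: list[Segment], target_ms: int) -> list[list[Segment]]:
    """Close a chunk once adding the next segment would exceed target_ms."""
    chunks: list[list[Segment]] = []
    current: list[Segment] = []
    accumulated = 0
    for start, end in segments:
        duration = end - start
        if current and accumulated + duration > target_ms:
            chunks.append(current)
            current, accumulated = [], 0
        current.append((start, end))
        accumulated += duration
    if current:
        chunks.append(current)
    return chunks


segments = [(0, 4_000), (5_000, 9_000), (9_500, 12_000), (20_000, 21_000)]
chunks = chunk_by_time(segments, target_ms=8_000)
```

A segment longer than the target still forms its own chunk here, since segments are never split.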

timeline_mapper

Timeline mapping utilities for transforming timestamps from chunk-relative coordinates to original audio coordinates.

This module enables mapping transcript segments back to their original positions in the source audio after processing chunked audio.

logger = get_child_logger(__name__) module-attribute
TimelineMapper

Maps timestamps from chunk-relative coordinates to original audio coordinates.

config = config or TimelineMapperConfig() instance-attribute
__init__(config=None)

Initialize with optional configuration.

remap(timed_text, chunk)

Remap all timestamps in a TimedText object from chunk-relative to original audio coordinates.

Parameters:

Name Type Description Default
timed_text TimedText

TimedText with chunk-relative timestamps

required
chunk DiarizationChunk

DiarizationChunk containing mapping information

required

Returns:

Type Description
TimedText

New TimedText object with remapped timestamps
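
To make the coordinate change concrete, the sketch below maps a chunk-relative timestamp back to the original audio by locating the segment whose mapped position covers it. This is a simplified assumption about the approach; `MappedSegment` and `to_original` are hypothetical names, and the real mapper may treat overlaps, gaps, and speaker assignment differently.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedSegment:
    start: int            # ms in the original audio
    end: int              # ms in the original audio
    audio_map_start: int  # ms where this segment begins in the chunk audio


def to_original(t_ms: int, segments: list[MappedSegment]) -> int:
    """Map a chunk-relative time to original-audio time.

    Assumes segments are sorted by audio_map_start. Times that fall in the
    spacing after a segment are clamped to that segment's end.
    """
    starts = [s.audio_map_start for s in segments]
    idx = max(bisect_right(starts, t_ms) - 1, 0)
    seg = segments[idx]
    offset = min(max(t_ms - seg.audio_map_start, 0), seg.end - seg.start)
    return seg.start + offset


segments = [
    MappedSegment(start=10_000, end=15_000, audio_map_start=0),
    MappedSegment(start=60_000, end=63_000, audio_map_start=5_500),
]
inside_first = to_original(2_000, segments)
inside_second = to_original(6_000, segments)
in_spacing = to_original(5_200, segments)
```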

TimelineMapperConfig

Bases: BaseModel

Configuration options for timeline mapping.

debug_logging = Field(default=False, description='Enable detailed logging of mapping decisions') class-attribute instance-attribute
map_speakers = Field(default=True, description='Assign speaker to mapped timings using diarization segment speaker.') class-attribute instance-attribute
types
PyannoteEntry

Bases: TypedDict

end instance-attribute
speaker instance-attribute
start instance-attribute
viewer
close_segment_viewer(pid)

Terminate the Streamlit viewer process by PID.

launch_segment_viewer(segments, master_audio_file)

Export segment data to a temporary JSON file and launch Streamlit viewer.

Args:
  • segments: List of dicts with diarization info (start, end, speaker).
  • master_audio_file: Path to the master audio file.

load_segments_from_file(path)
main()

language_utils

normalize_language_code(language)

Normalize common language labels to compact language codes.

multilingual_models

ArtifactRetention

Bases: str, Enum

Artifact retention policy for generated subtitles.

DEBUG = 'debug' class-attribute instance-attribute
MINIMAL = 'minimal' class-attribute instance-attribute
LanguageDetectionResult

Bases: BaseModel

Language detection metadata for a routed segment or block.

confidence = Field(default=None, ge=0.0, le=1.0) class-attribute instance-attribute
detector_source instance-attribute
is_reliable = True class-attribute instance-attribute
language_code = Field(default=None) class-attribute instance-attribute
MergedSubtitleArtifact

Bases: BaseModel

Final user-facing subtitle artifact for the current MVP path.

artifact_retention = ArtifactRetention.MINIMAL class-attribute instance-attribute
final_english_srt instance-attribute
provider instance-attribute
source_language = None class-attribute instance-attribute
source_srt instance-attribute
target_language = 'en' class-attribute instance-attribute
MultilingualTranscriptionRequest

Bases: BaseModel

Top-level request for the multilingual transcription service.

artifact_retention = ArtifactRetention.MINIMAL class-attribute instance-attribute
audio_file instance-attribute
chars_per_caption = Field(default=42, ge=1) class-attribute instance-attribute
diarization_segments = None class-attribute instance-attribute
metadata_file = None class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True) class-attribute instance-attribute
provider = TranscriptionProvider.WHISPER class-attribute instance-attribute
skip_translation = False class-attribute instance-attribute
source_language = None class-attribute instance-attribute
target_language = 'en' class-attribute instance-attribute
transcription_model = None class-attribute instance-attribute
translation_model = None class-attribute instance-attribute
translation_pattern = None class-attribute instance-attribute
use_speaker_blocks = False class-attribute instance-attribute
SegmentTranscriptionRequest

Bases: BaseModel

Segment-level transcription request for provider-neutral orchestration.

audio_file instance-attribute
audio_file_extension = None class-attribute instance-attribute
chars_per_caption = Field(default=42, ge=1) class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True) class-attribute instance-attribute
provider instance-attribute
source_language = None class-attribute instance-attribute
target_language = 'en' class-attribute instance-attribute
transcription_model = None class-attribute instance-attribute
SegmentTranscriptionResult

Bases: BaseModel

Segment-level subtitle generation result.

error_message = None class-attribute instance-attribute
provider instance-attribute
segment_start_ms = Field(default=0, ge=0) class-attribute instance-attribute
source_language = None class-attribute instance-attribute
source_srt instance-attribute
target_language = 'en' class-attribute instance-attribute
translated_srt = None class-attribute instance-attribute
translation_skipped = False class-attribute instance-attribute
SpeakerLanguageBlock

Bases: BaseModel

A speaker-contiguous block with language metadata.

detection instance-attribute
end_ms = Field(ge=0) class-attribute instance-attribute
is_uncertain = False class-attribute instance-attribute
speaker_label instance-attribute
start_ms = Field(ge=0) class-attribute instance-attribute
TranscriptionProvider

Bases: str, Enum

Supported transcription providers for the multilingual workflow.

ASSEMBLYAI = 'assemblyai' class-attribute instance-attribute
WHISPER = 'whisper' class-attribute instance-attribute

multilingual_protocols

LanguageSegmentationServiceProtocol

Bases: Protocol

Build language-tagged speaker blocks for downstream routing.

build_blocks(request)

Return speaker blocks for multilingual processing.

SegmentTranscriptionServiceProtocol

Bases: Protocol

Provider-neutral segment transcription contract.

transcribe_segment(request)

Generate source-language subtitles for a segment.

SegmentTranslationServiceProtocol

Bases: Protocol

Selective translation contract for non-English segments.

translate_segment(result)

Return a transcription result with translated subtitles when needed.

SubtitleMergeServiceProtocol

Bases: Protocol

Merge segment results into a final user-facing subtitle artifact.

merge(results)

Merge segment subtitle results into one artifact.

multilingual_service

logger = get_logger(__name__) module-attribute
MultilingualTranscriptionService

Provider-neutral entry point for the current multilingual MVP path.

__init__(transcription_service=None, segmentation_service=None, translation_service_factory=None, merge_service_factory=None)
generate_subtitles(request)
PassThroughSubtitleMergeService

Bases: SubtitleMergeServiceProtocol

Merge segment SRTs into a single artifact.

__init__(artifact_retention)
merge(results)
ProviderBackedSegmentTranscriptionService

Bases: SegmentTranscriptionServiceProtocol

Bridge to the existing provider transcription services.

transcribe_segment(request)
SpeakerBlockLanguageSegmentationService

Bases: LanguageSegmentationServiceProtocol

Build language-tagged speaker blocks from diarized segments.

__init__(diarization_config=None, detector=None)
build_blocks(request)
SrtSegmentTranslationService

Bases: SegmentTranslationServiceProtocol

Translate generated SRT content using the existing SRT translator.

__init__(target_language, skip_translation=False, model=None, pattern_name=None, metadata_file=None)
translate_segment(result)

timed_object

__all__ = ['Granularity', 'TimedText', 'TimedTextUnit'] module-attribute
Granularity

Bases: str, Enum

SEGMENT = 'segment' class-attribute instance-attribute
WORD = 'word' class-attribute instance-attribute
TimedText

Bases: BaseModel

Represents a collection of timed text units of a single granularity.

Only one of segments or words is populated, determined by granularity. All units must match the declared granularity.

Notes
  • Start times must be non-decreasing (overlaps allowed for multiple speakers).
  • Negative start_ms or end_ms values are not allowed.
  • Durations must be strictly positive (>0 ms).
  • Mixed granularity is strictly prohibited.
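
The timing rules in the notes above can be expressed as a small checker. This sketch only reports violations and works on plain (start_ms, end_ms) pairs; `check_units` is a hypothetical helper, the granularity rule is not modelled, and the real class may repair some conditions (for example by sorting) rather than rejecting them.

```python
Unit = tuple[int, int]  # (start_ms, end_ms)


def check_units(units: list[Unit]) -> list[str]:
    """Return a list of rule violations; an empty list means the units pass."""
    problems: list[str] = []
    previous_start = None
    for i, (start_ms, end_ms) in enumerate(units):
        if start_ms < 0 or end_ms < 0:
            problems.append(f"unit {i}: negative timestamp")
        if end_ms - start_ms <= 0:
            problems.append(f"unit {i}: duration must be > 0 ms")
        if previous_start is not None and start_ms < previous_start:
            problems.append(f"unit {i}: start time decreases")
        previous_start = start_ms
    return problems


overlapping_ok = check_units([(0, 1_000), (500, 1_500)])   # overlaps are allowed
zero_length = check_units([(0, 0)])
out_of_order = check_units([(1_000, 2_000), (500, 900)])
```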
duration property

Get the total duration in milliseconds.

end_ms property

Get the end time of the latest unit.

granularity = Field(..., description='Granularity type for all units.') class-attribute instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units') class-attribute instance-attribute
start_ms property

Get the start time of the earliest unit.

units property

Return the list of units matching the granularity.

words = Field(default_factory=list, description='Word-level timed units') class-attribute instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)

Custom initializer for TimedText. If units is provided, granularity is inferred from the first unit unless explicitly set. If only segments or words is provided, granularity is set accordingly. If all are empty, granularity must be provided.

__len__()

Return the number of units.

append(unit)

Add a unit to the end.

clear()

Remove all units.

export_text(separator='\n', skip_empty=True, show_speaker=True)

Export the text content of all units as a single string.

Parameters:

Name Type Description Default
separator str

String used to separate units (default: newline).

'\n'
skip_empty bool

If True, skip units with empty or whitespace-only text.

True
show_speaker bool

If True, add speaker info.

True

Returns:

Type Description
str

Concatenated text of all units, separated by separator.
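
How the three parameters interact can be sketched with plain tuples. This is an illustrative approximation only: the (speaker, text) tuple shape and the "[speaker] text" layout are assumptions, and the real method may format speaker information differently.

```python
from typing import List, Optional, Tuple

# Each unit is reduced to a (speaker, text) pair; a hypothetical simplification.
Unit = Tuple[Optional[str], str]


def export_text(units: List[Unit], separator: str = "\n",
                skip_empty: bool = True, show_speaker: bool = True) -> str:
    lines = []
    for speaker, text in units:
        # Drop empty or whitespace-only units when requested.
        if skip_empty and not text.strip():
            continue
        # The "[speaker] text" layout is an assumption made for this sketch.
        if show_speaker and speaker:
            lines.append(f"[{speaker}] {text}")
        else:
            lines.append(text)
    return separator.join(lines)
```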

extend(units)

Add multiple units to the end.

filter_by_min_duration(min_duration_ms)

Return a new TimedText object containing only units with a minimum duration.

is_segment_granularity()

Return True if granularity is SEGMENT.

is_word_granularity()

Return True if granularity is WORD.

iter()

Unified iterator over the units of the correct granularity.

iter_segments()

Iterate over segment-level units.

Raises:

Type Description
ValueError

If granularity is not SEGMENT.

iter_words()

Iterate over word-level units.

Raises:

Type Description
ValueError

If granularity is not WORD.

merge(items) classmethod

Merge a list of TimedText objects of the same granularity into a single TimedText object.

model_post_init(__context)

After initialization, sort units by start time and normalize durations.

set_all_speakers(speaker)

Set the same speaker for all units.

set_speaker(index, speaker)

Set speaker for a specific unit by index.

shift(offset_ms)

Shift all units by a given offset in milliseconds.

slice(start_ms, end_ms)

Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
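
The overlap rule can be sketched as a simple interval test. The code below works on bare (start_ms, end_ms) tuples with a hypothetical slice_spans helper; whether the library treats a unit that merely touches a boundary as overlapping is not stated here, so this sketch assumes it does not.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms), hypothetical stand-in for a unit


def slice_spans(spans: List[Span], start_ms: int, end_ms: int) -> List[Span]:
    """Keep spans that overlap the window; touching endpoints do not count here."""
    return [(s, e) for s, e in spans if s < end_ms and e > start_ms]
```

A unit that starts before the window but ends inside it is kept whole rather than trimmed in this sketch.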

sort_by_start()

Sort units by start time.

TimedTextUnit

Bases: BaseModel

Represents a timed unit with timestamps.

A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.

confidence = Field(None, description='Optional confidence score') class-attribute instance-attribute
duration_ms property

Get duration in milliseconds.

duration_sec property

Get duration in seconds.

end_ms = Field(..., description='End time in milliseconds') class-attribute instance-attribute
end_sec property

Get end time in seconds.

granularity instance-attribute
index = Field(None, description='Entry index or sequence number') class-attribute instance-attribute
speaker = Field(None, description='Speaker identifier if available') class-attribute instance-attribute
start_ms = Field(..., description='Start time in milliseconds') class-attribute instance-attribute
start_sec property

Get start time in seconds.

text = Field(..., description='The text content') class-attribute instance-attribute
normalize()

Normalize the duration of the segment to be nonzero.

overlaps_with(other)

Check if this unit overlaps with another.

set_speaker(speaker)

Set the speaker label.

shift_time(offset_ms)

Create a new TimedTextUnit with timestamps shifted by offset.

timed_text

Module for handling timed text objects, for example subtitle formats such as VTT and SRT.

This module provides classes and utilities for parsing, manipulating, and generating timed text objects useful in subtitle and transcript processing. It uses Pydantic for robust data validation and type safety.

Granularity

Bases: str, Enum

SEGMENT = 'segment' class-attribute instance-attribute
WORD = 'word' class-attribute instance-attribute
TimedText

Bases: BaseModel

Represents a collection of timed text units of a single granularity.

Only one of segments or words is populated, determined by granularity. All units must match the declared granularity.

Notes
  • Start times must be non-decreasing (overlaps allowed for multiple speakers).
  • Negative start_ms or end_ms values are not allowed.
  • Durations must be strictly positive (>0 ms).
  • Mixed granularity is strictly prohibited.
duration property

Get the total duration in milliseconds.

end_ms property

Get the end time of the latest unit.

granularity = Field(..., description='Granularity type for all units.') class-attribute instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units') class-attribute instance-attribute
start_ms property

Get the start time of the earliest unit.

units property

Return the list of units matching the granularity.

words = Field(default_factory=list, description='Word-level timed units') class-attribute instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)

Custom initializer for TimedText. If units is provided, granularity is inferred from the first unit unless explicitly set. If only segments or words is provided, granularity is set accordingly. If all are empty, granularity must be provided.

__len__()

Return the number of units.

append(unit)

Add a unit to the end.

clear()

Remove all units.

export_text(separator='\n', skip_empty=True, show_speaker=True)

Export the text content of all units as a single string.

Parameters:

Name Type Description Default
separator str

String used to separate units (default: newline).

'\n'
skip_empty bool

If True, skip units with empty or whitespace-only text.

True
show_speaker bool

If True, add speaker info.

True

Returns:

Type Description
str

Concatenated text of all units, separated by separator.

extend(units)

Add multiple units to the end.

filter_by_min_duration(min_duration_ms)

Return a new TimedText object containing only units with a minimum duration.
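
As a rough sketch of the filter, again on bare (start_ms, end_ms) tuples: the helper name below is hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms), hypothetical stand-in for a unit


def keep_long_enough(spans: List[Span], min_duration_ms: int) -> List[Span]:
    """Return a new list; the input list is left untouched."""
    return [(s, e) for s, e in spans if e - s >= min_duration_ms]
```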

is_segment_granularity()

Return True if granularity is SEGMENT.

is_word_granularity()

Return True if granularity is WORD.

iter()

Unified iterator over the units of the correct granularity.

iter_segments()

Iterate over segment-level units.

Raises:

Type Description
ValueError

If granularity is not SEGMENT.

iter_words()

Iterate over word-level units.

Raises:

Type Description
ValueError

If granularity is not WORD.

merge(items) classmethod

Merge a list of TimedText objects of the same granularity into a single TimedText object.
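
A minimal sketch of such a merge, using plain dictionaries instead of TimedText objects. The dictionary layout, the merge_timed name, and the choice to raise on an empty list or on mixed granularities are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def merge_timed(items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Each item has "granularity" and "units" (tuples of start_ms, end_ms, text)."""
    if not items:
        raise ValueError("nothing to merge")
    granularities = {item["granularity"] for item in items}
    if len(granularities) != 1:
        raise ValueError(f"mixed granularities: {sorted(granularities)}")
    # Flatten all units, then restore non-decreasing start order.
    units = [unit for item in items for unit in item["units"]]
    units.sort(key=lambda unit: unit[0])
    return {"granularity": granularities.pop(), "units": units}
```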

model_post_init(__context)

After initialization, sort units by start time and normalize durations.

set_all_speakers(speaker)

Set the same speaker for all units.

set_speaker(index, speaker)

Set speaker for a specific unit by index.

shift(offset_ms)

Shift all units by a given offset in milliseconds.

slice(start_ms, end_ms)

Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.

sort_by_start()

Sort units by start time.

TimedTextUnit

Bases: BaseModel

Represents a timed unit with timestamps.

A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.

confidence = Field(None, description='Optional confidence score') class-attribute instance-attribute
duration_ms property

Get duration in milliseconds.

duration_sec property

Get duration in seconds.

end_ms = Field(..., description='End time in milliseconds') class-attribute instance-attribute
end_sec property

Get end time in seconds.

granularity instance-attribute
index = Field(None, description='Entry index or sequence number') class-attribute instance-attribute
speaker = Field(None, description='Speaker identifier if available') class-attribute instance-attribute
start_ms = Field(..., description='Start time in milliseconds') class-attribute instance-attribute
start_sec property

Get start time in seconds.

text = Field(..., description='The text content') class-attribute instance-attribute
normalize()

Normalize the duration of the segment to be nonzero.

overlaps_with(other)

Check if this unit overlaps with another.

set_speaker(speaker)

Set the speaker label.

shift_time(offset_ms)

Create a new TimedTextUnit with timestamps shifted by offset.

transcription

__all__ = ['patch_whisper_options', 'DiarizationChunker', 'TimedText', 'TextSegmentBuilder', 'TimedTextUnit', 'Granularity', 'TranscriptionService', 'TranscriptionServiceFactory'] module-attribute
DiarizationChunker

Class for chunking diarization results into processing units based on configurable duration targets.

config = ChunkConfig() instance-attribute
__init__(**config_options)

Initialize chunker with additional config_options.

extract_contiguous_chunks(segments)

Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.

Parameters:

Name Type Description Default
segments List[DiarizedSegment]

List of speaker segments from diarization

required

Returns:

Type Description
List[DiarizationChunk]

Flat list of contiguous chunks
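
One plausible way to form contiguous chunks of roughly a target duration is a greedy pass over the segments. The sketch below is an assumption about the general approach, not the chunker's actual algorithm; segments are reduced to (start_ms, end_ms) tuples and chunk_segments is a hypothetical name.

```python
from typing import List, Tuple

# (start_ms, end_ms); the speaker is omitted because it does not drive the split.
Segment = Tuple[int, int]


def chunk_segments(segments: List[Segment], target_ms: int) -> List[List[Segment]]:
    chunks: List[List[Segment]] = []
    current: List[Segment] = []
    for segment in segments:
        current.append(segment)
        # Close the chunk once it spans at least the target duration.
        if current[-1][1] - current[0][0] >= target_ms:
            chunks.append(current)
            current = []
    if current:
        # Whatever is left forms a final, possibly shorter, chunk.
        chunks.append(current)
    return chunks
```

Chunks can overshoot the target because a segment is never cut in half; that is the sense of "approximately" in this sketch.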

Granularity

Bases: str, Enum

SEGMENT = 'segment' class-attribute instance-attribute
WORD = 'word' class-attribute instance-attribute
TextSegmentBuilder
avoid_orphans = avoid_orphans instance-attribute
current_characters = 0 instance-attribute
current_words = [] instance-attribute
ignore_speaker = ignore_speaker instance-attribute
max_duration = max_duration_ms instance-attribute
max_gap_duration = max_gap_duration_ms instance-attribute
segments = [] instance-attribute
target_characters = target_characters instance-attribute
__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)

Build or rebuild segments from the contents of words.

Parameters:

Name Type Description Default
target_duration Optional[int]

Maximum desired segment duration in milliseconds.

None
target_characters Optional[int]

Maximum desired character length of a segment.

None
avoid_orphans Optional[bool]

If True, prevent extremely short trailing segments.

True
Note

This is a stub. Concrete algorithms will be implemented later.

Raises:

Type Description
NotImplementedError

Always, until implemented.

create_segments(timed_text)
TimedText

Bases: BaseModel

Represents a collection of timed text units of a single granularity.

Only one of segments or words is populated, determined by granularity. All units must match the declared granularity.

Notes
  • Start times must be non-decreasing (overlaps allowed for multiple speakers).
  • Negative start_ms or end_ms values are not allowed.
  • Durations must be strictly positive (>0 ms).
  • Mixed granularity is strictly prohibited.
duration property

Get the total duration in milliseconds.

end_ms property

Get the end time of the latest unit.

granularity = Field(..., description='Granularity type for all units.') class-attribute instance-attribute
segments = Field(default_factory=list, description='Phrase-level timed units') class-attribute instance-attribute
start_ms property

Get the start time of the earliest unit.

units property

Return the list of units matching the granularity.

words = Field(default_factory=list, description='Word-level timed units') class-attribute instance-attribute
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)

Custom initializer for TimedText. If units is provided, granularity is inferred from the first unit unless explicitly set. If only segments or words is provided, granularity is set accordingly. If all are empty, granularity must be provided.

__len__()

Return the number of units.

append(unit)

Add a unit to the end.

clear()

Remove all units.

export_text(separator='\n', skip_empty=True, show_speaker=True)

Export the text content of all units as a single string.

Parameters:

Name Type Description Default
separator str

String used to separate units (default: newline).

'\n'
skip_empty bool

If True, skip units with empty or whitespace-only text.

True
show_speaker bool

If True, add speaker info.

True

Returns:

Type Description
str

Concatenated text of all units, separated by separator.

extend(units)

Add multiple units to the end.

filter_by_min_duration(min_duration_ms)

Return a new TimedText object containing only units with a minimum duration.

is_segment_granularity()

Return True if granularity is SEGMENT.

is_word_granularity()

Return True if granularity is WORD.

iter()

Unified iterator over the units of the correct granularity.

iter_segments()

Iterate over segment-level units.

Raises:

Type Description
ValueError

If granularity is not SEGMENT.

iter_words()

Iterate over word-level units.

Raises:

Type Description
ValueError

If granularity is not WORD.

merge(items) classmethod

Merge a list of TimedText objects of the same granularity into a single TimedText object.

model_post_init(__context)

After initialization, sort units by start time and normalize durations.

set_all_speakers(speaker)

Set the same speaker for all units.

set_speaker(index, speaker)

Set speaker for a specific unit by index.

shift(offset_ms)

Shift all units by a given offset in milliseconds.

slice(start_ms, end_ms)

Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.

sort_by_start()

Sort units by start time.

TimedTextUnit

Bases: BaseModel

Represents a timed unit with timestamps.

A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.

confidence = Field(None, description='Optional confidence score') class-attribute instance-attribute
duration_ms property

Get duration in milliseconds.

duration_sec property

Get duration in seconds.

end_ms = Field(..., description='End time in milliseconds') class-attribute instance-attribute
end_sec property

Get end time in seconds.

granularity instance-attribute
index = Field(None, description='Entry index or sequence number') class-attribute instance-attribute
speaker = Field(None, description='Speaker identifier if available') class-attribute instance-attribute
start_ms = Field(..., description='Start time in milliseconds') class-attribute instance-attribute
start_sec property

Get start time in seconds.

text = Field(..., description='The text content') class-attribute instance-attribute
normalize()

Normalize the duration of the segment to be nonzero.

overlaps_with(other)

Check if this unit overlaps with another.

set_speaker(speaker)

Set the speaker label.

shift_time(offset_ms)

Create a new TimedTextUnit with timestamps shifted by offset.

TranscriptionService

Bases: ABC

Abstract base class defining the interface for transcription services.

This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).

get_result(job_id) abstractmethod

Get results for an existing transcription job.

Parameters:

Name Type Description Default
job_id str

ID of the transcription job

required

Returns:

Type Description
TranscriptionResult

Dictionary containing transcription results in the same standardized format as transcribe()

transcribe(audio_file, options=None) abstractmethod

Transcribe audio file to text.

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path to audio file or file-like object

required
options Optional[Dict[str, Any]]

Provider-specific options for transcription

None

Returns:

Type Description
TranscriptionResult

TranscriptionResult

transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None) abstractmethod

Transcribe audio and return result in specified format.

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path, file-like object, or URL of audio file

required
format_type str

Format type (e.g., "srt", "vtt", "text")

'srt'
transcription_options Optional[Dict[str, Any]]

Options for transcription

None
format_options Optional[Dict[str, Any]]

Format-specific options

None

Returns:

Type Description
str

String representation in the requested format

TranscriptionServiceFactory

Factory for creating transcription service instances.

This factory provides a standard way to create transcription service instances based on the provider name and configuration.

create_service(provider='assemblyai', api_key=None, **kwargs) classmethod

Create a transcription service instance.

Parameters:

Name Type Description Default
provider str

Service provider name (e.g., "whisper", "assemblyai")

'assemblyai'
api_key Optional[str]

API key for the service

None
**kwargs Any

Additional provider-specific configuration

{}

Returns:

Type Description
TranscriptionService

TranscriptionService instance

Raises:

Type Description
ValueError

If the provider is not supported

ImportError

If the provider module cannot be imported

register_provider(name, provider_class) classmethod

Register a provider implementation with the factory.

Parameters:

Name Type Description Default
name str

Provider name (lowercase)

required
provider_class Callable[..., TranscriptionService]

Provider implementation class or factory function

required
Example

from my_module import MyTranscriptionService
TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)
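
The register-then-create pattern described above can be sketched without the library. ServiceFactory and EchoService below are hypothetical stand-ins; the real factory's internals, including any lazy importing of provider modules, are not shown in this reference.

```python
from typing import Callable, Dict


class ServiceFactory:
    """Hypothetical stand-in; not TranscriptionServiceFactory itself."""

    _providers: Dict[str, Callable[..., object]] = {}

    @classmethod
    def register_provider(cls, name: str, provider_class: Callable[..., object]) -> None:
        # Names are stored in lowercase, matching the "Provider name (lowercase)" note.
        cls._providers[name.lower()] = provider_class

    @classmethod
    def create_service(cls, provider: str, **kwargs) -> object:
        try:
            factory = cls._providers[provider.lower()]
        except KeyError:
            raise ValueError(f"Unsupported provider: {provider}") from None
        return factory(**kwargs)


class EchoService:
    """Toy provider used only for this example."""

    def __init__(self, api_key=None):
        self.api_key = api_key


ServiceFactory.register_provider("echo", EchoService)
service = ServiceFactory.create_service("echo", api_key="demo-key")
```

Because the registry accepts any callable, a factory function that builds a configured service would work as well as a class.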

patch_whisper_options(options, file_extension)

Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.

Parameters:

Name Type Description Default
options Optional[Dict[str, Any]]

Transcription options dictionary (will not be mutated)

required
file_extension str

File extension string (with or without leading dot)

required

Returns:

Type Description
Dict[str, Any]

New options dictionary with 'file_extension' set appropriately

Raises:

Type Description
ValueError

If file_extension is not in the allowed list
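
The documented contract (copy the options, validate the extension, accept an optional leading dot) can be sketched as follows. The allow-list contents, the patch_options name, and the choice to store the extension without its dot are assumptions; the real function may differ on each point.

```python
from typing import Any, Dict, Optional

# Assumed allow-list for illustration; the real list is not shown in this reference.
ALLOWED_EXTENSIONS = {"mp3", "wav", "m4a", "flac", "ogg", "webm"}


def patch_options(options: Optional[Dict[str, Any]], file_extension: str) -> Dict[str, Any]:
    # Accept ".mp3" and "mp3" alike.
    ext = file_extension.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio extension: {file_extension!r}")
    # Copy so the caller's dictionary is never mutated.
    patched = dict(options or {})
    patched["file_extension"] = ext
    return patched
```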

assemblyai_service

AssemblyAI implementation of the TranscriptionService interface.

This module provides a complete implementation of the TranscriptionService interface using the AssemblyAI Python SDK, with support for all major features including:

  • Transcription with configurable options
  • Speaker diarization
  • Automatic language detection
  • Audio intelligence features
  • Subtitle generation
  • Regional endpoint support
  • Webhook callbacks

The implementation follows a modular design with single-action methods and supports both synchronous and asynchronous usage patterns.

logger = get_child_logger(__name__) module-attribute
AAIConfig dataclass

Comprehensive configuration for AssemblyAI transcription service.

This class contains all configurable options for the AssemblyAI API, organized by feature category.

api_key = None class-attribute instance-attribute
auto_chapters = False class-attribute instance-attribute
auto_highlights = False class-attribute instance-attribute
chars_per_caption = 60 class-attribute instance-attribute
content_safety = False class-attribute instance-attribute
custom_spelling = field(default_factory=dict) class-attribute instance-attribute
disfluencies = False class-attribute instance-attribute
dual_channel = False class-attribute instance-attribute
entity_detection = False class-attribute instance-attribute
filter_profanity = False class-attribute instance-attribute
format_text = True class-attribute instance-attribute
iab_categories = False class-attribute instance-attribute
language_code = None class-attribute instance-attribute
language_detection = True class-attribute instance-attribute
polling_interval = 4 class-attribute instance-attribute
punctuate = True class-attribute instance-attribute
sentiment_analysis = False class-attribute instance-attribute
speaker_labels = True class-attribute instance-attribute
speakers_expected = None class-attribute instance-attribute
speech_model = SpeechModel.BEST class-attribute instance-attribute
summarization = False class-attribute instance-attribute
use_eu_endpoint = False class-attribute instance-attribute
webhook_auth_header_name = None class-attribute instance-attribute
webhook_auth_header_value = None class-attribute instance-attribute
webhook_url = None class-attribute instance-attribute
word_boost = field(default_factory=list) class-attribute instance-attribute
__init__(api_key=None, use_eu_endpoint=False, polling_interval=4, speech_model=SpeechModel.BEST, language_code=None, language_detection=True, dual_channel=False, format_text=True, punctuate=True, disfluencies=False, filter_profanity=False, chars_per_caption=60, speaker_labels=True, speakers_expected=None, custom_spelling=dict(), word_boost=list(), auto_chapters=False, auto_highlights=False, entity_detection=False, iab_categories=False, sentiment_analysis=False, summarization=False, content_safety=False, webhook_url=None, webhook_auth_header_name=None, webhook_auth_header_value=None)
AAITranscriptionService

Bases: TranscriptionService

AssemblyAI implementation of the TranscriptionService interface.

Provides comprehensive access to AssemblyAI's transcription services with support for all major features through the official Python SDK.

config = AAIConfig() instance-attribute
format_converter = FormatConverter() instance-attribute
transcriber = aai.Transcriber(config=(self._create_transcription_config(options))) instance-attribute
__init__(api_key=None, options=None)

Initialize the AssemblyAI transcription service.

Parameters:

Name Type Description Default
api_key Optional[str]

AssemblyAI API key (defaults to ASSEMBLYAI_API_KEY env var)

None
options Optional[Dict[str, Any]]

Additional transcription configuration overrides

None
get_result(job_id)

Get results for an existing transcription job.

This method blocks until the transcript is retrieved.

Parameters:

Name Type Description Default
job_id str

ID of the transcription job

required

Returns:

Type Description
TranscriptionResult

Dictionary containing transcription results

get_subtitles(transcript_id, format_type='srt')

Get subtitles directly from AssemblyAI.

Parameters:

Name Type Description Default
transcript_id str

ID of the transcription job

required
format_type str

Format type ("srt" or "vtt")

'srt'

Returns:

Type Description
str

String representation in the requested format

Raises:

Type Description
ValueError

If the format type is not supported

standardize_result(transcript)

Standardize AssemblyAI transcript to match common format.

Parameters:

Name Type Description Default
transcript Transcript

AssemblyAI transcript object

required

Returns:

Type Description
TranscriptionResult

Standardized result dictionary

transcribe(audio_file, options=None)

Transcribe audio file to text using AssemblyAI's synchronous SDK approach.

This method handles:

  • File paths
  • File-like objects
  • URLs

Parameters:

Name Type Description Default
audio_file Union[Path, BinaryIO, str]

Path, file-like object, or URL of audio file

required
options Optional[Dict[str, Any]]

Provider-specific options for transcription

None

Returns:

Type Description
TranscriptionResult

Dictionary containing standardized transcription results

transcribe_async(audio_file, options=None)

Submit an asynchronous transcription job using AssemblyAI's SDK.

This method submits a transcription job and returns immediately with a transcript ID that can be used to retrieve results later.

Parameters:

Name Type Description Default
audio_file Union[Path, BinaryIO, str]

Path, file-like object, or URL of audio file

required
options Optional[Dict[str, Any]]

Provider-specific options for transcription

None

Returns:

Type Description
Future[Any]

String containing the transcript ID for later retrieval

Notes

The SDK's submit method returns a Future object, but this method extracts just the transcript ID for simpler handling.

transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)

Transcribe audio and return result in specified format.

Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.

Parameters:

Name Type Description Default
audio_file Union[Path, BinaryIO, str]

Path, file-like object, or URL of audio file

required
format_type str

Format type (e.g., "srt", "vtt", "text")

'srt'
transcription_options Optional[Dict[str, Any]]

Options for transcription

None
format_options Optional[Dict[str, Any]]

Format-specific options

None

Returns:

Type Description
str

String representation in the requested format

SpeechModel

Bases: str, Enum

Supported AssemblyAI speech models.

BEST = 'best' class-attribute instance-attribute
NANO = 'nano' class-attribute instance-attribute
format_converter
tnh_scholar.audio_processing.transcription.format_converter

Thin facade that turns raw transcription-service output dictionaries into the formats requested by callers (plain text and SRT; VTT coming later).

Core heavy lifting now lives in:

  • TimedText / TimedTextUnit - canonical internal representation
  • SegmentBuilder - word-level -> sentence/segment chunking
  • SRTProcessor - rendering to .srt

Only one public method remains: FormatConverter.convert.

logger = get_child_logger(__name__) module-attribute
FormatConverter

Convert a raw transcription result to text, SRT, or (placeholder) VTT.

The raw result must follow the loose schema:

  • {"utterances": [...]} -> already speaker-segmented
  • {"words": [...]} -> word-level; we chunk via SegmentBuilder
  • {"text": "...", "audio_duration_ms": 12345} -> single blob fallback
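
The three-way fallback in the loose schema can be sketched as a small dispatch function. The inner keys "start", "end", and "text" on utterances and words are assumptions, as is the to_spans name; the real converter also chunks words into segments, which this sketch skips.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start_ms, end_ms, text)


def to_spans(result: Dict[str, Any]) -> List[Span]:
    if result.get("utterances"):
        # Already speaker-segmented: one span per utterance.
        return [(u["start"], u["end"], u["text"]) for u in result["utterances"]]
    if result.get("words"):
        # Word-level input: kept one span per word here, with no chunking.
        return [(w["start"], w["end"], w["text"]) for w in result["words"]]
    # Single-blob fallback spanning the whole audio.
    return [(0, result.get("audio_duration_ms", 0), result.get("text", ""))]
```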

config = config or FormatConverterConfig() instance-attribute
__init__(config=None)
convert(result, format_type='srt', format_options=None)

Convert result to the given format_type.

Parameters

result : dict
    Raw transcription output.
format_type : {"srt", "text", "vtt"}
format_options : dict | None
    Currently only {"include_speaker": bool} recognized for srt.

FormatConverterConfig

Bases: BaseModel

User-tunable knobs for FormatConverter.

Only a handful remain now that the heavy logic moved to SegmentBuilder.

characters_per_entry = 42 class-attribute instance-attribute
include_segment_index = True class-attribute instance-attribute
include_speaker = True class-attribute instance-attribute
max_entry_duration_ms = 6000 class-attribute instance-attribute
max_gap_duration_ms = 2000 class-attribute instance-attribute
patches
patch_file_with_name(file_obj, extension)

Ensures the file-like object has a .name attribute with the correct extension.

patch_whisper_options(options, file_extension)

Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.

Parameters:

Name Type Description Default
options Optional[Dict[str, Any]]

Transcription options dictionary (will not be mutated)

required
file_extension str

File extension string (with or without leading dot)

required

Returns:

Type Description
Dict[str, Any]

New options dictionary with 'file_extension' set appropriately

Raises:

Type Description
ValueError

If file_extension is not in the allowed list

srt_processor
SRTConfig

Configuration options for SRT processing.

include_speaker = include_speaker instance-attribute
max_chars_per_line = max_chars_per_line instance-attribute
reindex_entries = reindex_entries instance-attribute
speaker_format = speaker_format instance-attribute
timestamp_format = timestamp_format instance-attribute
use_pysrt = use_pysrt instance-attribute
__init__(include_speaker=False, speaker_format='[{speaker}] {text}', reindex_entries=True, timestamp_format='{:02d}:{:02d}:{:02d},{:03d}', max_chars_per_line=42, use_pysrt=False)

Initialize with default settings.

Parameters:

Name Type Description Default
include_speaker bool

Whether to include speaker labels in output

False
speaker_format str

Format string for speaker attribution

'[{speaker}] {text}'
reindex_entries bool

Whether to reindex entries sequentially

True
timestamp_format str

Format string for timestamp formatting

'{:02d}:{:02d}:{:02d},{:03d}'
max_chars_per_line int

Maximum characters per line before splitting

42
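
The default timestamp_format takes four integers: hours, minutes, seconds, and milliseconds. A minimal sketch of converting a millisecond offset into that layout follows; format_timestamp is a hypothetical helper, not a documented method of SRTConfig or SRTProcessor.

```python
TIMESTAMP_FORMAT = "{:02d}:{:02d}:{:02d},{:03d}"  # default value listed above


def format_timestamp(ms: int, fmt: str = TIMESTAMP_FORMAT) -> str:
    """Split a millisecond offset into h, m, s, ms and apply the format string."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return fmt.format(hours, minutes, seconds, millis)


stamp = format_timestamp(3_723_004)  # "01:02:03,004"
```

SRT uses a comma before the milliseconds; swapping it for a period in the format string gives the WebVTT style.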
SRTProcessor

Handles parsing and generating SRT format.

Provides functionality to convert between SRT text format and TimedText objects, with various formatting options. Supports both native parsing/generation and pysrt backend.

config = config or SRTConfig() instance-attribute
__init__(config=None)

Initialize with optional configuration overrides.

Parameters:

Name Type Description Default
config Optional[SRTConfig]

Configuration options for SRT processing

None
add_speaker_labels(srt_content, *, speaker=None, speaker_labels=None)

Unified entry point for adding speaker labels. (Not implemented yet.)

assign_single_speaker(srt_content, speaker)

Assign the same speaker to all segments in the SRT content.

assign_speaker_by_mapping(srt_content, speaker_labels)

Assign speakers to segments based on a mapping of speaker to segment indices. (Not implemented yet.)

combine(timed_texts)

Combine multiple lists of TimedText into one, with proper indexing.

Parameters:

Name Type Description Default
timed_texts List[TimedText]

List of TimedText to combine

required

Returns:

Type Description
TimedText

Combined TimedText object

generate(timed_text, include_speaker=None)

Generate SRT content from a TimedText object. Uses internal generator or pysrt depending on configuration.

merge_srts(srt_list)

Merge multiple SRT files into a single SRT string.

parse(srt_content)

Parse SRT content into a new TimedText object. Uses internal parser or pysrt depending on configuration.

shift_timestamps(timed_text, offset_ms)

Shift all timestamps by the given offset.

Parameters:

Name Type Description Default
timed_text TimedText

TimedText to shift

required
offset_ms int

Offset in milliseconds to apply

required

Returns:

Type Description
TimedText

New TimedText object with adjusted timestamps
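
A sketch of a non-mutating shift, on bare (start_ms, end_ms) tuples. Raising when a negative offset would push a start time below zero is an assumption, chosen because negative timestamps are disallowed elsewhere in this reference; the library may handle that case differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms), hypothetical stand-in for a unit


def shift_spans(spans: List[Span], offset_ms: int) -> List[Span]:
    """Return shifted copies; the input list is left untouched."""
    shifted = [(s + offset_ms, e + offset_ms) for s, e in spans]
    if any(s < 0 for s, _ in shifted):
        raise ValueError("offset would produce a negative start time")
    return shifted
```

A typical use is re-basing the subtitles of a later audio chunk onto the timeline of the full recording before combining them.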

SubtitleFormat

Bases: str, Enum

Supported subtitle formats.

SRT = 'srt' class-attribute instance-attribute
TEXT = 'text' class-attribute instance-attribute
VTT = 'vtt' class-attribute instance-attribute
text_segment_builder

SegmentBuilder for creating phrase-level segments from word-level TimedText.

This module builds higher-level segments from a TimedText object containing word-level units, based on configurable criteria like duration, character count, punctuation, pauses, and speaker changes.
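
To make the idea concrete, here is a deliberately reduced sketch that groups words using only two of the criteria named above: character count and pause length. It ignores duration, punctuation, and speaker changes, and it is not the TextSegmentBuilder algorithm. The function names are hypothetical, and the default limits simply reuse the values 42 and 2000 listed for FormatConverterConfig.

```python
from typing import List, Tuple

Word = Tuple[int, int, str]  # (start_ms, end_ms, text)


def build_segments(words: List[Word], target_characters: int = 42,
                   max_gap_ms: int = 2000) -> List[Word]:
    segments: List[Word] = []
    current: List[Word] = []
    for word in words:
        if current:
            # Length of the segment text if this word were appended.
            text_len = len(" ".join(w[2] for w in current)) + 1 + len(word[2])
            gap = word[0] - current[-1][1]
            if text_len > target_characters or gap > max_gap_ms:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments


def _close(words: List[Word]) -> Word:
    """Collapse a run of words into one (start_ms, end_ms, text) segment."""
    return (words[0][0], words[-1][1], " ".join(w[2] for w in words))
```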

COMMON_ABBREVIATIONS = frozenset({'adj.', 'adm.', 'adv.', 'al.', 'anon.', 'apr.', 'arc.', 'aug.', 'ave.', 'brig.', 'bros.', 'capt.', 'cmdr.', 'col.', 'comdr.', 'con.', 'corp.', 'cpl.', 'dr.', 'drs.', 'ed.', 'enc.', 'etc.', 'ex.', 'feb.', 'gen.', 'gov.', 'hon.', 'hosp.', 'hr.', 'inc.', 'jan.', 'jr.', 'maj.', 'mar.', 'messrs.', 'mlle.', 'mm.', 'mme.', 'mr.', 'mrs.', 'ms.', 'msgr.', 'nov.', 'oct.', 'op.', 'ord.', 'ph.d.', 'prof.', 'pvt.', 'rep.', 'reps.', 'res.', 'rev.', 'rt.', 'sen.', 'sens.', 'sep.', 'sfc.', 'sgt.', 'sr.', 'st.', 'supt.', 'surg.', 'u.s.', 'v.p.', 'vs.'}) module-attribute
TextSegmentBuilder
avoid_orphans = avoid_orphans instance-attribute
current_characters = 0 instance-attribute
current_words = [] instance-attribute
ignore_speaker = ignore_speaker instance-attribute
max_duration = max_duration_ms instance-attribute
max_gap_duration = max_gap_duration_ms instance-attribute
segments = [] instance-attribute
target_characters = target_characters instance-attribute
__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)

Build or rebuild segments from the contents of words.

Parameters:

Name Type Description Default
target_duration Optional[int]

Maximum desired segment duration in milliseconds.

None
target_characters Optional[int]

Maximum desired character length of a segment.

None
avoid_orphans Optional[bool]

If True, prevent extremely short trailing segments.

True
Note

This is a stub. Concrete algorithms will be implemented later.

Raises:

Type Description
NotImplementedError

Always, until implemented.

create_segments(timed_text)
transcription_service
TranscriptionResult

Bases: BaseModel

audio_duration_ms = None class-attribute instance-attribute
confidence = None class-attribute instance-attribute
language instance-attribute
raw_result = None class-attribute instance-attribute
status = None class-attribute instance-attribute
text instance-attribute
transcript_id = None class-attribute instance-attribute
utterance_timing = None class-attribute instance-attribute
word_timing = None class-attribute instance-attribute
TranscriptionService

Bases: ABC

Abstract base class defining the interface for transcription services.

This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).

get_result(job_id) abstractmethod

Get results for an existing transcription job.

Parameters:

Name Type Description Default
job_id str

ID of the transcription job

required

Returns:

Type Description
TranscriptionResult

Transcription results in the same standardized format as transcribe()

transcribe(audio_file, options=None) abstractmethod

Transcribe audio file to text.

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path to audio file or file-like object

required
options Optional[Dict[str, Any]]

Provider-specific options for transcription

None

Returns:

Type Description
TranscriptionResult

TranscriptionResult

transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None) abstractmethod

Transcribe audio and return result in specified format.

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path, file-like object, or URL of audio file

required
format_type str

Format type (e.g., "srt", "vtt", "text")

'srt'
transcription_options Optional[Dict[str, Any]]

Options for transcription

None
format_options Optional[Dict[str, Any]]

Format-specific options

None

Returns:

Type Description
str

String representation in the requested format

TranscriptionServiceFactory

Factory for creating transcription service instances.

This factory provides a standard way to create transcription service instances based on the provider name and configuration.

create_service(provider='assemblyai', api_key=None, **kwargs) classmethod

Create a transcription service instance.

Parameters:

Name Type Description Default
provider str

Service provider name (e.g., "whisper", "assemblyai")

'assemblyai'
api_key Optional[str]

API key for the service

None
**kwargs Any

Additional provider-specific configuration

{}

Returns:

Type Description
TranscriptionService

TranscriptionService instance

Raises:

Type Description
ValueError

If the provider is not supported

ImportError

If the provider module cannot be imported

register_provider(name, provider_class) classmethod

Register a provider implementation with the factory.

Parameters:

Name Type Description Default
name str

Provider name (lowercase)

required
provider_class Callable[..., TranscriptionService]

Provider implementation class or factory function

required
Example

from my_module import MyTranscriptionService
TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)
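The registration and lookup behavior documented for the factory can be approximated as a plain registry. The sketch below is an assumption about the general pattern, not the package's code: `ServiceFactory` and `EchoService` are hypothetical names, and the interface is cut down to a single method.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional


class TranscriptionService(ABC):
    """Trimmed-down stand-in for the abstract interface."""

    @abstractmethod
    def transcribe(self, audio_file: str) -> str:
        ...


class EchoService(TranscriptionService):
    """Hypothetical provider used only for this illustration."""

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        self.api_key = api_key
        self.options = kwargs

    def transcribe(self, audio_file: str) -> str:
        return f"transcript of {audio_file}"


class ServiceFactory:
    """Hypothetical registry keyed by lowercase provider name."""

    _providers: Dict[str, Callable[..., TranscriptionService]] = {}

    @classmethod
    def register_provider(cls, name, provider_class):
        cls._providers[name.lower()] = provider_class

    @classmethod
    def create_service(cls, provider="assemblyai", api_key=None, **kwargs):
        try:
            factory = cls._providers[provider.lower()]
        except KeyError:
            # Mirrors the documented ValueError for unsupported providers.
            raise ValueError(f"Unsupported provider: {provider!r}") from None
        return factory(api_key=api_key, **kwargs)


ServiceFactory.register_provider("echo", EchoService)
service = ServiceFactory.create_service("echo", api_key="dummy-key")
```

Because the registry stores any callable that returns a service, a factory function works as well as a class, which matches the `Callable[..., TranscriptionService]` type documented for `provider_class`.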

Utterance

Bases: BaseModel

confidence instance-attribute
end_ms instance-attribute
speaker instance-attribute
start_ms instance-attribute
text instance-attribute
WordTiming

Bases: BaseModel

confidence instance-attribute
end_ms instance-attribute
start_ms instance-attribute
word instance-attribute
vtt_processor
VTTConfig

Configuration options for WebVTT processing.

include_speaker = include_speaker instance-attribute
max_chars_per_line = max_chars_per_line instance-attribute
reindex_entries = reindex_entries instance-attribute
speaker_format = speaker_format instance-attribute
timestamp_format = timestamp_format instance-attribute
__init__(include_speaker=False, speaker_format='<v {speaker}>{text}', reindex_entries=False, timestamp_format='{:02d}:{:02d}:{:02d}.{:03d}', max_chars_per_line=42)

Initialize with default settings.

Parameters:

Name Type Description Default
include_speaker bool

Whether to include speaker labels in output

False
speaker_format str

Format string for speaker attribution

'<v {speaker}>{text}'
reindex_entries bool

Whether to reindex entries sequentially

False
timestamp_format str

Format string for timestamp formatting

'{:02d}:{:02d}:{:02d}.{:03d}'
max_chars_per_line int

Maximum characters per line before splitting

42
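To make the two format strings concrete, the following sketch applies the documented defaults to a millisecond timestamp and a speaker label. The helper names `format_timestamp` and `format_cue` are hypothetical; only the format strings come from the defaults listed above.

```python
TIMESTAMP_FORMAT = "{:02d}:{:02d}:{:02d}.{:03d}"
SPEAKER_FORMAT = "<v {speaker}>{text}"


def format_timestamp(total_ms: int, fmt: str = TIMESTAMP_FORMAT) -> str:
    """Split milliseconds into hours, minutes, seconds, and milliseconds."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return fmt.format(hours, minutes, seconds, millis)


def format_cue(start_ms: int, end_ms: int, text: str, speaker=None) -> str:
    """Render one cue: a timing line followed by the (optionally attributed) text."""
    if speaker is not None:
        text = SPEAKER_FORMAT.format(speaker=speaker, text=text)
    return f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}\n{text}"


cue = format_cue(3_723_004, 3_725_500, "Present moment.", speaker="A")
```

The default speaker format produces a WebVTT voice span, so a cue for speaker "A" carries the text as `<v A>Present moment.`.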
VTTProcessor

Handles parsing and generating WebVTT format.

config = config or VTTConfig() instance-attribute
__init__(config=None)

Initialize with optional configuration.

Parameters:

Name Type Description Default
config Optional[VTTConfig]

Configuration options for VTT processing

None
generate(timed_texts)

Generate VTT content from a list of TimedTextUnit objects.

Parameters:

Name Type Description Default
timed_texts List[TimedTextUnit]

List of TimedTextUnit objects

required

Returns:

Type Description
str

String containing VTT formatted content

parse(vtt_content)

Parse VTT content into a list of TimedTextUnit objects.

Parameters:

Name Type Description Default
vtt_content str

String containing VTT formatted content

required

Returns:

Type Description
List[TimedTextUnit]

List of TimedTextUnit objects

whisper_service
TODO: MAJOR REFACTOR PLANNED

This module currently mixes persistent service configuration (WhisperConfig) with per-call runtime options, leading to complex validation and logic. Plan is to:

  • Refactor so each WhisperTranscriptionService instance is configured once at construction, with all relevant settings (including file-like/path-like mode, file extension, etc).
  • Use Pydantic BaseSettings for configuration to normalize configuration and validation according to TNH Scholar style.
  • Remove ad-hoc runtime options from the transcribe() entrypoint; all config should be set at init.
  • If a different configuration is needed, instantiate a new service object.
  • This will simplify validation, error handling, and code logic, and make the contract clear and robust.
  • NOTE: This will change the TranscriptionService contract and will require similar changes in other transcription system implementations.
  • Update all dependent code and tests accordingly.
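The "configure once at construction" direction in the plan above might look roughly like the sketch below. It is an illustration of the idea only: `ServiceSettings` and `ConfiguredService` are hypothetical names, a frozen dataclass stands in for the Pydantic BaseSettings class the plan calls for, and no API call is made.

```python
from dataclasses import dataclass, replace
from typing import Optional

SUPPORTED_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}


@dataclass(frozen=True)
class ServiceSettings:
    """Hypothetical immutable settings, validated once at construction."""
    model: str = "whisper-1"
    response_format: str = "verbose_json"
    file_extension: Optional[str] = None  # only relevant for file-like input

    def __post_init__(self):
        if self.response_format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported response_format: {self.response_format}")


class ConfiguredService:
    """Hypothetical service whose transcribe() takes no per-call options."""

    def __init__(self, settings: ServiceSettings):
        self.settings = settings

    def transcribe(self, audio_file) -> dict:
        # A real service would call the API; this returns the request parameters.
        return {
            "model": self.settings.model,
            "response_format": self.settings.response_format,
        }


default_service = ConfiguredService(ServiceSettings())
# A different configuration means a new service object, not a runtime override.
srt_service = ConfiguredService(replace(default_service.settings, response_format="srt"))
```

With this shape, validation runs in one place, and an invalid combination fails when the service is built rather than in the middle of a transcription call.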

logger = get_child_logger(__name__) module-attribute
WhisperBase

Bases: TypedDict

duration instance-attribute
language instance-attribute
text instance-attribute
WhisperConfig dataclass

Configuration for the Whisper transcription service.

BASE_PARAMS = ['model', 'language', 'temperature', 'prompt', 'response_format'] class-attribute instance-attribute
FORMAT_PARAMS = {'verbose_json': ['timestamp_granularities'], 'json': [], 'text': [], 'srt': [], 'vtt': []} class-attribute instance-attribute
SUPPORTED_FORMATS = ['json', 'text', 'srt', 'vtt', 'verbose_json'] class-attribute instance-attribute
chunking_strategy = 'auto' class-attribute instance-attribute
language = None class-attribute instance-attribute
model = 'whisper-1' class-attribute instance-attribute
prompt = None class-attribute instance-attribute
response_format = 'verbose_json' class-attribute instance-attribute
temperature = None class-attribute instance-attribute
timestamp_granularities = field(default_factory=(lambda: ['word'])) class-attribute instance-attribute
__init__(model='whisper-1', response_format='verbose_json', timestamp_granularities=(lambda: ['word'])(), chunking_strategy='auto', language=None, temperature=None, prompt=None)
to_dict()

Convert configuration to dictionary for API call.

validate()

Validate configuration values.
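One plausible reading of how BASE_PARAMS and FORMAT_PARAMS feed to_dict() is that they act as an allow-list for the API call. The sketch below shows that reading under stated assumptions: the `build_api_params` helper is hypothetical, it works on a plain dict, and dropping unset (None) values is a guess rather than documented behavior.

```python
BASE_PARAMS = ["model", "language", "temperature", "prompt", "response_format"]
FORMAT_PARAMS = {
    "verbose_json": ["timestamp_granularities"],
    "json": [],
    "text": [],
    "srt": [],
    "vtt": [],
}


def build_api_params(settings: dict) -> dict:
    """Keep only the keys that apply to the chosen response format."""
    fmt = settings.get("response_format", "verbose_json")
    if fmt not in FORMAT_PARAMS:
        raise ValueError(f"Unsupported response format: {fmt!r}")
    allowed = BASE_PARAMS + FORMAT_PARAMS[fmt]
    # Assumption: unset (None) values are omitted rather than sent.
    return {key: settings[key] for key in allowed if settings.get(key) is not None}


settings = {
    "model": "whisper-1",
    "response_format": "srt",
    "timestamp_granularities": ["word"],
    "chunking_strategy": "auto",
    "language": None,
}
params = build_api_params(settings)
```

For the "srt" format the word-level `timestamp_granularities` setting is filtered out, since FORMAT_PARAMS lists it only under "verbose_json".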

WhisperResponse

Bases: WhisperBase

segments instance-attribute
words instance-attribute
WhisperSegment

Bases: TypedDict

avg_logprob instance-attribute
compression_ratio instance-attribute
end instance-attribute
id instance-attribute
no_speech_prob instance-attribute
start instance-attribute
temperature instance-attribute
text instance-attribute
WhisperTranscriptionService

Bases: TranscriptionService

OpenAI Whisper implementation of the TranscriptionService interface.

Provides transcription services using the OpenAI Whisper API.

config = WhisperConfig() instance-attribute
format_converter = FormatConverter() instance-attribute
__init__(api_key=None, **config_options)

Initialize the Whisper transcription service.

Parameters:

Name Type Description Default
api_key Optional[str]

OpenAI API key (defaults to OPENAI_API_KEY env var)

None
**config_options Any

Additional configuration options

{}
get_result(job_id)

Get results for an existing transcription job.

Whisper API operates synchronously and doesn't use job IDs, so this method is not implemented.

Parameters:

Name Type Description Default
job_id str

ID of the transcription job

required

Returns:

Type Description
TranscriptionResult

Dictionary containing transcription results

Raises:

Type Description
NotImplementedError

This method is not supported for Whisper

set_api_key(api_key=None)

Set or update the API key.

This method allows refreshing the API key without re-instantiating the class.

Parameters:

Name Type Description Default
api_key Optional[str]

OpenAI API key (defaults to OPENAI_API_KEY env var)

None

Raises:

Type Description
ValueError

If no API key is provided or found in environment

transcribe(audio_file, options=None)

Transcribe audio file to text using OpenAI Whisper API.

PATCH: If audio_file is a file-like object, options['file_extension'] must be provided (OpenAI API quirk).

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path to audio file or file-like object

required
options Optional[Dict[str, Any]]

Provider-specific options for transcription. If audio_file is file-like, must include 'file_extension'.

None

Returns:

Type Description
TranscriptionResult

Dictionary containing transcription results with standardized keys

Raises:

Type Description
ValueError

If file-like object is provided without 'file_extension' in options
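The file_extension requirement can be illustrated with a small guard. This is a sketch of the documented contract, not the service's code: `resolve_upload_name` is a hypothetical helper, and the assumption that the extension is used to give an in-memory upload a recognizable filename is an inference from the "OpenAI API quirk" note.

```python
from io import BytesIO
from pathlib import Path
from typing import Any, Dict, Optional, Union


def resolve_upload_name(
    audio_file: Union[Path, BytesIO],
    options: Optional[Dict[str, Any]] = None,
) -> str:
    """Return a filename for the upload, enforcing the file_extension rule."""
    if isinstance(audio_file, Path):
        return audio_file.name
    extension = (options or {}).get("file_extension")
    if not extension:
        raise ValueError("file-like audio requires options['file_extension']")
    # Accept both "wav" and ".wav".
    return f"audio.{extension.lstrip('.')}"


path_name = resolve_upload_name(Path("talks/day1.mp3"))
buffer_name = resolve_upload_name(BytesIO(b"\x00"), {"file_extension": ".wav"})
```

A Path already carries its extension, so only the in-memory case needs the extra option.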

transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)

Transcribe audio and return result in specified format.

PATCH: If audio_file is a file-like object, transcription_options['file_extension'] must be provided (OpenAI API quirk).

Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.

Parameters:

Name Type Description Default
audio_file Union[Path, BytesIO]

Path, file-like object, or URL of audio file

required
format_type str

Format type (e.g., "srt", "vtt", "text")

'srt'
transcription_options Optional[Dict[str, Any]]

Options for transcription. If audio_file is file-like, must include 'file_extension'.

None
format_options Optional[Dict[str, Any]]

Format-specific options

None

Returns:

Type Description
str

String representation in the requested format

Raises:

Type Description
ValueError

If file-like object is provided without 'file_extension' in transcription_options

WordEntry

Bases: TypedDict

end instance-attribute
start instance-attribute
word instance-attribute

utils

__all__ = ['AudioEnhancer', 'get_segment_audio', 'play_audio_segment', 'play_bytes', 'play_from_file', 'play_diarization_segment', 'get_audio_from_file'] module-attribute
AudioEnhancer
compression_settings = compression_settings instance-attribute
config = config instance-attribute
__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())

Initialize with enhancement configuration and compression settings.

enhance(input_path, output_path=None)

Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.

extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)

Extract a sample segment from the audio file.

Parameters

input_path : Path
    Path to the input audio file.
start : float
    Start time in seconds.
duration : float
    Duration in seconds.
output_path : Path, optional
    Output file path. If None, auto-generated from input.
output_format : str, default="flac"
    Output audio format/extension.
codec : str, optional
    Audio codec to use (default: "flac" if output_format is "flac", else None).
compression_level : int, default=8
    Compression level for supported codecs.

Returns

Path
    Path to the extracted audio sample.

get_audio_info(file_path)

Get detailed audio information using ffprobe.

play_audio(file_path)

Play audio in notebook for quality assessment.

get_audio_from_file(audio_file)
get_segment_audio(segment, audio)
play_audio_segment(audio)
play_bytes(data, format='wav')
play_diarization_segment(segment, audio)
play_from_file(path)
audio_enhance

Module review and recommendations:

Big Picture Approach:

  • Modular, Configurable, and Extensible: Your use of Pydantic models for settings and configs is excellent. It makes the pipeline flexible and easy to tune for different ASR or enhancement needs.
  • Tooling: Leveraging SoX and FFmpeg is a pragmatic choice for robust, high-quality audio processing.
  • Pipeline Structure: The AudioEnhancer class is well-structured, with clear separation of concerns for each processing step (remix, rate, gain, EQ, compand, etc.).
  • Notebook Integration: The play_audio method and use of IPython display is great for interactive, iterative work.

Details & Points You Might Be Missing:

Error Handling & Logging:

  • You print errors but could benefit from more structured logging (e.g., using Python’s logging module).
  • Consider more granular exception handling, especially for subprocess calls.

Testing & Validation:

  • No unit tests or validation of output audio quality/format are present.
  • Consider adding automated tests (even if just for file existence, format, and basic properties).
  • You could add a method to compare pre/post enhancement SNR, loudness, or other metrics.

Documentation & Examples:

  • While docstrings are good, a usage example (in code or markdown) would help new users.
  • Consider a README or notebook cell that demonstrates a full workflow.

Performance:

  • For large-scale or batch processing, consider parallelization or async processing.
  • Temporary files (e.g., intermediate FLACs) could be managed/cleaned up more robustly.

Extensibility:

  • The pipeline is modular, but adding a “custom steps” hook (e.g., user-defined SoX/FFmpeg args) would make it even more flexible.
  • You might want to support other codecs or output formats for downstream ASR models.

Feature Gaps:

  • The extract_sample method is a TODO. Implementing this would be useful for quick QA or dataset creation.
  • Consider adding Voice Activity Detection (VAD) or silence trimming as optional steps.
  • You could add a “dry run” mode to print the SoX/FFmpeg commands without executing, for debugging.

ASR-Specific Enhancements:

  • You might want to add preset configs for different ASR models (e.g., Whisper, Wav2Vec2, etc.), as they may have different optimal preprocessing.
  • Consider integrating with open-source ASR evaluation tools to close the loop on enhancement effectiveness.

General Strategic Recommendations:

  • Automate QA: Add methods to check output audio quality, duration, and format, and optionally compare to input.
  • Batch Processing: Add a method to process a directory or list of files.
  • Config Export/Import: Allow saving/loading configs as JSON/YAML for reproducibility.
  • CLI/Script Interface: Consider a command-line interface for use outside notebooks.
  • Unit Tests: Add basic tests for each method, especially for error cases.

Summary Table:

| Modularity | Good | Add custom step hooks |
| Configurability | Excellent | Presets for more ASR models |
| Error Handling | Basic | Use logging, more granular exceptions |
| Testing | Missing | Add unit tests, output validation |
| Documentation | Good | Add usage examples, README |
| Extensibility | Good | Support more codecs, batch processing |
| ASR Optimization | Good start | Add VAD, silence trim, model-specific configs |

logger = get_child_logger(__name__) module-attribute
AudioEnhancer
compression_settings = compression_settings instance-attribute
config = config instance-attribute
__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())

Initialize with enhancement configuration and compression settings.

enhance(input_path, output_path=None)

Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.

extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)

Extract a sample segment from the audio file.

Parameters

input_path : Path
    Path to the input audio file.
start : float
    Start time in seconds.
duration : float
    Duration in seconds.
output_path : Path, optional
    Output file path. If None, auto-generated from input.
output_format : str, default="flac"
    Output audio format/extension.
codec : str, optional
    Audio codec to use (default: "flac" if output_format is "flac", else None).
compression_level : int, default=8
    Compression level for supported codecs.

Returns

Path
    Path to the extracted audio sample.
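As a rough guide to what sample extraction involves, the sketch below assembles an ffmpeg command from the documented parameters without running it. It is an assumption about one reasonable command line, not the method's implementation: `build_extract_command` and the `_sample` output naming are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def build_extract_command(
    input_path,
    start: float,
    duration: float,
    output_path: Optional[Path] = None,
    output_format: str = "flac",
    codec: Optional[str] = None,
    compression_level: int = 8,
) -> List[str]:
    """Build (but do not run) an ffmpeg command that cuts out a sample."""
    input_path = Path(input_path)
    if output_path is None:
        # Hypothetical naming scheme for the auto-generated output path.
        output_path = input_path.with_name(f"{input_path.stem}_sample.{output_format}")
    if codec is None and output_format == "flac":
        codec = "flac"
    # -ss and -t placed before -i seek and limit the input itself.
    cmd = ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration), "-i", str(input_path)]
    if codec:
        cmd += ["-c:a", codec]
    if codec == "flac":
        cmd += ["-compression_level", str(compression_level)]
    cmd.append(str(output_path))
    return cmd


cmd = build_extract_command("talk.wav", start=30.0, duration=10.0)
```

Returning the argument list rather than executing it keeps the sketch testable and doubles as the kind of "dry run" output that is useful when debugging command construction.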

get_audio_info(file_path)

Get detailed audio information using ffprobe.

play_audio(file_path)

Play audio in notebook for quality assessment.

CompressionSettings

Bases: BaseSettings

Compression settings for audio enhancement routines.

Attributes:

Name Type Description
minimal list[str]

List of compand arguments for minimal compression.

light list[str]

List of compand arguments for light compression.

moderate list[str]

List of compand arguments for moderate compression.

aggressive list[str]

List of compand arguments for aggressive compression.

whisper_optimized list[str]

List of compand arguments for Whisper-optimized compression.

whisper_aggressive list[str]

List of compand arguments for aggressive Whisper compression.

primary_speech_only list[str]

List of compand arguments for primary speech only.

aggressive = ['0.02,0.1', '8:-70,-55,-45,-35,-25,-15', '-5', '-90', '0.05'] class-attribute instance-attribute
light = ['0.05,0.2', '6:-60,-50,-40,-30,-20,-10', '-3', '-85', '0.1'] class-attribute instance-attribute
minimal = ['0.1,0.3', '3:-50,-40,-30,-20', '-3', '-80', '0.2'] class-attribute instance-attribute
moderate = ['0.03,0.15', '6:-65,-50,-40,-30,-20,-10', '-4', '-85', '0.1'] class-attribute instance-attribute
primary_speech_only = ['0.005,0.06', '12:-60,-45,-55,-30,-35,-18,-15,-8', '-8', '-60', '0.03'] class-attribute instance-attribute
whisper_aggressive = ['0.005,0.06', '12:-75,-45,-55,-30,-35,-18,-15,-8', '-8', '-95', '0.03'] class-attribute instance-attribute
whisper_optimized = ['0.005,0.06', '12:-75,-65,-55,-45,-35,-25,-15,-8', '-8', '-95', '0.03'] class-attribute instance-attribute
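Each preset above is a five-element list that lines up with the positional arguments of the SoX compand effect: attack and decay times, the transfer function (with an optional soft-knee prefix), gain, initial volume, and delay. The sketch below decodes one preset and shows how it could be spliced into a command line. The helper names are hypothetical, and how AudioEnhancer actually passes these values to SoX is not shown in this reference.

```python
from typing import Dict, List

LIGHT = ["0.05,0.2", "6:-60,-50,-40,-30,-20,-10", "-3", "-85", "0.1"]


def describe_compand(compand_args: List[str]) -> Dict[str, object]:
    """Name the positional compand arguments for readability."""
    attack_decay, transfer, gain, initial_volume, delay = compand_args
    attack, decay = attack_decay.split(",")
    knee, points = transfer.split(":") if ":" in transfer else (None, transfer)
    return {
        "attack_s": float(attack),
        "decay_s": float(decay),
        "soft_knee_db": float(knee) if knee is not None else None,
        "transfer_points_db": [float(p) for p in points.split(",")],
        "gain_db": float(gain),
        "initial_volume_db": float(initial_volume),
        "delay_s": float(delay),
    }


def build_compand_command(input_path: str, output_path: str, compand_args: List[str]) -> List[str]:
    """Build (but do not run) a SoX command applying the compand effect."""
    return ["sox", input_path, output_path, "compand", *compand_args]


info = describe_compand(LIGHT)
cmd = build_compand_command("in.flac", "out.flac", LIGHT)
```

Read this way, the presets differ mainly in attack/decay speed and the lowest transfer point: the faster 0.005 s attack and the -75 dB floor distinguish the Whisper-oriented presets from `light` and `minimal`.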
EQSettings

Bases: BaseSettings

bass = (-5, 200) class-attribute instance-attribute
contrast = 75 class-attribute instance-attribute
eq_bands = [(100, 0.9, -20), (1500, 1, 4), (4000, 0.6, 15), (10000, 1, -10)] class-attribute instance-attribute
highpass_freq = 175 class-attribute instance-attribute
lowpass_freq = 15000 class-attribute instance-attribute
treble = (3, 3000) class-attribute instance-attribute
EnhancementConfig

Bases: BaseModel

channels = 2 class-attribute instance-attribute
codec = 'flac' class-attribute instance-attribute
compression_level = 'aggressive' class-attribute instance-attribute
eq = EQSettings() class-attribute instance-attribute
force_mono = False class-attribute instance-attribute
gate = GateSettings() class-attribute instance-attribute
include_eq = True class-attribute instance-attribute
include_gate = True class-attribute instance-attribute
norm = NormalizationSettings() class-attribute instance-attribute
rate = RateSettings() class-attribute instance-attribute
remix = RemixSettings() class-attribute instance-attribute
sample_rate = 48000 class-attribute instance-attribute
target_rate = None class-attribute instance-attribute
GateSettings

Bases: BaseSettings

gate_params = ['0.1', '0.05', '-inf', '0.1', '-90', '0.1'] class-attribute instance-attribute
NormalizationSettings

Bases: BaseSettings

norm_level = -1 class-attribute instance-attribute
RateSettings

Bases: BaseSettings

rate_args = ['-v'] class-attribute instance-attribute
RemixSettings

Bases: BaseSettings

remix_channels = '1,2' class-attribute instance-attribute
compress_wav_to_mp4_vbr(input_wav, output_path=None, quality=8)

Compress WAV to M4A (AAC VBR) using ffmpeg.

Parameters:

input_wav : str or Path
    Path to the input .wav file.
output_path : str or Path, optional
    Output .mp4 file path. If None, auto-generated from input.
quality : int, default=8
    VBR quality level: 1 = good (~96kbps), 2 = very good (~128kbps), 3+ = higher bitrate.

Returns:

Path
    Path to the compressed .m4a file.

get_sox_info(file_path)

Get audio info using SoX

playback
get_audio_from_file(audio_file)
get_segment_audio(segment, audio)
play_audio_segment(audio)
play_bytes(data, format='wav')
play_diarization_segment(segment, audio)
play_from_file(path)

cli_tools

TNH Scholar CLI Tools

Command-line interface tools for the TNH Scholar project:

audio-transcribe:
    Audio processing pipeline that handles downloading, segmentation,
    and transcription of Buddhist teachings.

tnh-gen:
    Unified GenAI CLI replacing legacy tooling, including tnh-fab.
    See https://aaronksolomon.github.io/tnh-scholar/architecture/tnh-gen/

See individual tool documentation for usage details and examples.

audio_transcribe

__all__ = ['audio_transcribe', 'main', 'YTDVersionChecker'] module-attribute
YTDVersionChecker

Simple version checker for yt-dlp with robust version comparison.

This is a prototype implementation that may need expansion in these areas:

  • Caching to prevent frequent PyPI calls
  • More comprehensive error handling for:
      • Missing/uninstalled packages
      • Network timeouts
      • JSON parsing errors
      • Invalid version strings
  • Environment detection (virtualenv, conda, system Python)
  • Configuration options for version pinning
  • Proxy support for network requests

NETWORK_TIMEOUT = 5 class-attribute instance-attribute
PYPI_URL = 'https://pypi.org/pypi/yt-dlp/json' class-attribute instance-attribute
check_version()

Check if yt-dlp needs updating.

Returns:

Type Description
Tuple[bool, Version, Version]

Tuple of (needs_update, installed_version, latest_version)

Raises:

Type Description
ImportError

If yt-dlp is not installed

RequestException

For network-related errors

InvalidVersion

If version strings are invalid

main()
audio_transcribe

CLI tool for obtaining audio (a YouTube download or a local file) and transcribing it to text.

Usage

audio-transcribe [OPTIONS]

e.g. audio-transcribe --yt_url https://www.youtube.com/watch?v=EXAMPLE --output_dir ./processed --service whisper --model whisper-1

DEFAULT_CHUNK_DURATION = 120 module-attribute
DEFAULT_MIN_CHUNK = 10 module-attribute
DEFAULT_MODEL = 'whisper-1' module-attribute
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt' module-attribute
DEFAULT_RESPONSE_FORMAT = 'text' module-attribute
DEFAULT_SERVICE = 'whisper' module-attribute
DEFAULT_TEMP_DIR = tempfile.gettempdir() module-attribute
VIDEO_EXTENSIONS = {'.mp4', '.avi', '.mov', '.mkv', '.wmv'} module-attribute
logger = get_child_logger(__name__) module-attribute
AudioTranscribeApp

Main application class for audio transcription CLI. Organizes configuration, source resolution, and pipeline execution. All runtime options are supplied via a validated AudioTranscribeConfig.

audio_file = self._resolve_audio_source() instance-attribute
chunk_duration = TimeMs.from_seconds(config.chunk_duration) instance-attribute
config = config instance-attribute
diarization_config = self._build_diarization_config() instance-attribute
end_time = config.end_time instance-attribute
file_ = config.file_ instance-attribute
keep_artifacts = config.keep_artifacts instance-attribute
language = config.language instance-attribute
min_chunk = TimeMs.from_seconds(config.min_chunk) instance-attribute
model = config.model instance-attribute
output_path = Path(config.output) instance-attribute
prompt = config.prompt instance-attribute
response_format = config.response_format instance-attribute
service = config.service instance-attribute
start_time = config.start_time instance-attribute
temp_dir = self.output_path.parent instance-attribute
transcription_options = self._build_transcription_options() instance-attribute
yt_url = config.yt_url instance-attribute
yt_url_csv = config.yt_url_csv instance-attribute
__init__(config)

Parameters:

Name Type Description Default
config AudioTranscribeConfig

Validated AudioTranscribeConfig instance.

required
run()

Run the transcription pipeline and print results, or just download audio if no_transcribe is set.

audio_transcribe(**kwargs)

CLI entry point for audio transcription.

main()
config
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt' module-attribute
DEFAULT_SERVICE = 'whisper' module-attribute
DEFAULT_TEMP_DIR = './audio_transcriptions/tmp' module-attribute
AudioTranscribeConfig

Bases: BaseSettings

Validated runtime configuration for the audio-transcribe CLI.

chunk_duration = Field(default=120, description='Target chunk duration in seconds') class-attribute instance-attribute
end_time = Field(default=None, description='End time offset') class-attribute instance-attribute
file_ = Field(default=None, description='Path to local audio file') class-attribute instance-attribute
keep_artifacts = Field(default=False, description='Keep all intermediate artifacts in the output directory instead of using a system temp directory.') class-attribute instance-attribute
language = Field(default='en', description='Language code') class-attribute instance-attribute
min_chunk = Field(default=10, ge=10, description='Minimum chunk duration in seconds') class-attribute instance-attribute
model = Field(default='whisper-1', description='Transcription model name') class-attribute instance-attribute
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', extra='ignore') class-attribute instance-attribute
no_transcribe = Field(default=False, description='If True, only download YouTube audio to mp3, no transcription.') class-attribute instance-attribute
output = Field(default=DEFAULT_OUTPUT_PATH, description='Path to output transcript file') class-attribute instance-attribute
prompt = Field(default='', description='Prompt or keywords') class-attribute instance-attribute
response_format = Field(default='text', description='Response format') class-attribute instance-attribute
service = Field(default=DEFAULT_SERVICE, pattern='^(whisper|assemblyai)$', description='Transcription service') class-attribute instance-attribute
start_time = Field(default=None, description='Start time offset') class-attribute instance-attribute
temp_dir = Field(default=None, description='Directory for temporary processing files') class-attribute instance-attribute
yt_url = Field(default=None, description='YouTube URL') class-attribute instance-attribute
yt_url_csv = Field(default=None, description='CSV file with YouTube URLs') class-attribute instance-attribute
validate_sources()

Enforce coherent source selection for CLI execution.

MultipleAudioSourceError

Bases: ValueError

Raised when more than one audio source is selected.

NoAudioSourceError

Bases: ValueError

Raised when no audio source is provided.
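The two exceptions suggest a rule along the lines of "exactly one source must be chosen". The sketch below shows that rule under stated assumptions: a plain dataclass replaces the Pydantic settings class, `SourceSelection` is a hypothetical name, and the real validate_sources() may apply additional or different checks.

```python
from dataclasses import dataclass
from typing import Optional


class NoAudioSourceError(ValueError):
    """Raised when no audio source is provided."""


class MultipleAudioSourceError(ValueError):
    """Raised when more than one audio source is selected."""


@dataclass
class SourceSelection:
    """Hypothetical subset of the config: just the three source fields."""
    yt_url: Optional[str] = None
    yt_url_csv: Optional[str] = None
    file_: Optional[str] = None

    def validate_sources(self) -> str:
        """Return the name of the single chosen source or raise."""
        chosen = [name for name in ("yt_url", "yt_url_csv", "file_") if getattr(self, name)]
        if not chosen:
            raise NoAudioSourceError("Provide a YouTube URL, a URL CSV, or a local file.")
        if len(chosen) > 1:
            raise MultipleAudioSourceError(f"Choose one source, got: {', '.join(chosen)}")
        return chosen[0]


selected = SourceSelection(file_="talk.mp3").validate_sources()
```

Because both exceptions derive from ValueError, a caller can catch them together or distinguish the two cases to print a more specific CLI message.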

convert_video
FFMPEG_VIDEO_CONV_DEFAULT_CONFIG = {'audio_codec': 'libmp3lame', 'audio_bitrate': '192k', 'audio_samplerate': '44100'} module-attribute
logger = get_child_logger(__name__) module-attribute
convert_video_to_audio(video_file, output_dir, conversion_params=None)

Convert a video file to an audio file using ffmpeg.

Parameters:

Name Type Description Default
video_file Path

Path to the video file

required
output_dir Path

Directory to save the converted audio file

required
conversion_params Optional[Dict[str, str]]

Optional dictionary to override default conversion parameters

None

Returns:

Type Description
Path

Path to the converted audio file
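A minimal sketch of how the default configuration could map onto ffmpeg arguments follows. The mapping to `-c:a`, `-b:a`, and `-ar` and the `.mp3` output extension are assumptions based on the default values, and `build_conversion_command` is a hypothetical helper that only builds the command.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple

DEFAULT_CONFIG = {
    "audio_codec": "libmp3lame",
    "audio_bitrate": "192k",
    "audio_samplerate": "44100",
}


def build_conversion_command(
    video_file: Path,
    output_dir: Path,
    conversion_params: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Path]:
    """Build (but do not run) an ffmpeg command that extracts the audio track."""
    # Caller-supplied values override the defaults key by key.
    params = {**DEFAULT_CONFIG, **(conversion_params or {})}
    output_file = Path(output_dir) / f"{Path(video_file).stem}.mp3"
    cmd = [
        "ffmpeg", "-i", str(video_file),
        "-vn",  # drop the video stream
        "-c:a", params["audio_codec"],
        "-b:a", params["audio_bitrate"],
        "-ar", params["audio_samplerate"],
        str(output_file),
    ]
    return cmd, output_file


cmd, output_file = build_conversion_command(
    Path("talk.mp4"), Path("out"), {"audio_bitrate": "128k"}
)
```

Merging dictionaries this way lets a caller override a single setting, such as the bitrate, while the remaining defaults stay in effect.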

environment
env
logger = get_child_logger(__name__) module-attribute
check_env()

Check the environment for necessary conditions:

  1. Check OpenAI key is available.
  2. Check that all requirements from requirements.txt are importable.

check_requirements(requirements_file)

Check that all requirements listed in requirements.txt can be imported. If any cannot be imported, print a warning.

This is a heuristic check. Some packages may not share the same name as their importable module. Adjust the name mappings below as needed.

Example

check_requirements(Path("./requirements.txt"))

Prints warnings if imports fail, otherwise silent.
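The heuristic described above, including the package-to-module name mapping, can be sketched as follows. This is an illustration rather than the function's code: `missing_imports` and the `IMPORT_NAMES` table are hypothetical, the helper takes requirement lines instead of a file path, and it returns the missing names instead of printing warnings.

```python
import importlib.util
import re
from typing import Iterable, List

# Illustrative mappings for packages whose import name differs from the package name.
IMPORT_NAMES = {"pyyaml": "yaml", "python-dotenv": "dotenv", "yt-dlp": "yt_dlp"}


def missing_imports(requirement_lines: Iterable[str]) -> List[str]:
    """Return the package names whose modules cannot be found."""
    missing = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r or -e
        # Strip version specifiers, extras, and environment markers.
        package = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        module = IMPORT_NAMES.get(package.lower(), package.replace("-", "_"))
        try:
            spec = importlib.util.find_spec(module)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(package)
    return missing


result = missing_imports([
    "json>=1.0",  # standard-library module, used here so the check succeeds
    "# a comment line",
    "definitely-not-installed-pkg-xyz==1.0",
])
```

Using `find_spec` checks that a module can be located without importing it, which avoids running package initialization code during an environment check.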
transcription_pipeline
TranscriptionPipeline
audio_file = audio_file instance-attribute
audio_file_extension = audio_file.suffix instance-attribute
diarization_config = diarization_config or DiarizationConfig() instance-attribute
diarization_dir = self.output_dir / f'{self.audio_file.stem}_diarization' instance-attribute
diarization_kwargs = diarization_kwargs or {} instance-attribute
diarization_results_path = self.diarization_dir / 'raw_diarization_results.json' instance-attribute
logger = logger or logging.getLogger(__name__) instance-attribute
output_dir = output_dir instance-attribute
save_diarization = save_diarization instance-attribute
transcriber = transcriber instance-attribute
transcription_options = patch_whisper_options(transcription_options, file_extension=(audio_file.suffix)) instance-attribute
__init__(audio_file, output_dir, diarization_config=None, transcriber='whisper', transcription_options=None, diarization_kwargs=None, save_diarization=True, logger=None)

Initialize the TranscriptionPipeline.

Parameters:

Name Type Description Default
audio_file Path

Path to the audio file to process.

required
output_dir Path

Directory to store output files.

required
diarization_config Optional[DiarizationConfig]

Diarization configuration.

None
transcriber str

Transcription service provider.

'whisper'
transcription_options Optional[Dict[str, Any]]

Options for transcription.

None
diarization_kwargs Optional[Dict[str, Any]]

Additional diarization arguments.

None
save_diarization bool

Whether to save raw diarization JSON results.

True
logger Optional[Logger]

Logger for pipeline events.

None
run()

Execute the full transcription pipeline with robust error handling.

Returns:

Type Description
Optional[List[Dict[str, Any]]]

List[Dict[str, Any]]: List of transcript dicts with chunk metadata, or None on failure

Raises:

Type Description
RuntimeError

If any pipeline step fails.

validate
validate_inputs(is_download, yt_url, yt_url_list, audio_file, split, transcribe, chunk_dir, no_chunks, silence_boundaries, whisper_boundaries)

Validate the CLI inputs for coherent download, split, and transcribe flows.

version_check
logger = get_child_logger(__name__) module-attribute
YTDVersionChecker

Simple version checker for yt-dlp with robust version comparison.

This is a prototype implementation that may need expansion in these areas:

  • Caching to prevent frequent PyPI calls
  • More comprehensive error handling for:
      • Missing/uninstalled packages
      • Network timeouts
      • JSON parsing errors
      • Invalid version strings
  • Environment detection (virtualenv, conda, system Python)
  • Configuration options for version pinning
  • Proxy support for network requests

NETWORK_TIMEOUT = 5 class-attribute instance-attribute
PYPI_URL = 'https://pypi.org/pypi/yt-dlp/json' class-attribute instance-attribute
check_version()

Check if yt-dlp needs updating.

Returns:

Type Description
Tuple[bool, Version, Version]

Tuple of (needs_update, installed_version, latest_version)

Raises:

Type Description
ImportError

If yt-dlp is not installed

RequestException

For network-related errors

InvalidVersion

If version strings are invalid
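The comparison step can be illustrated with a minimal, dependency-free sketch. It assumes purely numeric, dot-separated versions (the date-style scheme yt-dlp uses) and compares integer tuples. The documented return type suggests the real implementation works with Version objects, so the names parse_version and needs_update below are hypothetical stand-ins, not the module's API.

```python
from typing import Tuple

VersionTuple = Tuple[int, ...]


def parse_version(text: str) -> VersionTuple:
    """Parse a dotted numeric version such as '2024.08.06' into a tuple of ints."""
    try:
        return tuple(int(part) for part in text.strip().split("."))
    except ValueError as exc:
        # Hypothetical stand-in for the InvalidVersion error documented above.
        raise ValueError(f"invalid version string: {text!r}") from exc


def needs_update(installed: str, latest: str) -> Tuple[bool, VersionTuple, VersionTuple]:
    """Return (needs_update, installed_version, latest_version)."""
    installed_v = parse_version(installed)
    latest_v = parse_version(latest)
    # Tuple comparison is numeric per component, so 2024.9.1 < 2024.10.22,
    # whereas a plain string comparison would order them the other way.
    return installed_v < latest_v, installed_v, latest_v


if __name__ == "__main__":
    print(needs_update("2024.9.1", "2024.10.22"))
```

Comparing parsed components rather than raw strings is the point of the sketch: zero-padded and unpadded forms of the same release compare equal, and month 10 sorts after month 9.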

check_ytd_version()

Check if yt-dlp is up to date and available.

This function checks the installed version of yt-dlp against the latest version on PyPI. Because changes on YouTube's side frequently break older yt-dlp versions, this check is strict and requires the latest version.

Returns:

Name Type Description
bool bool

True if yt-dlp is installed and up to date, False otherwise.

Note

This is a strict check. Outdated versions return False to prevent wasting time on long-running jobs that will likely fail due to YouTube API changes.

claude_assistant

Claude assistant CLI package.

claude_assistant

Typer entrypoint for a minimal local Claude worker wrapper.

claude-assistant is a thin convenience CLI for launching claude --print from a predictable environment. It is intended as a pragmatic bridge for delegated local worker invocation while the broader orchestration surfaces are still evolving.

app = typer.Typer(name='claude-assistant', help='Minimal wrapper around `claude --print` for delegated local worker runs.', add_completion=False, no_args_is_help=True) module-attribute
ClaudeAssistantPaths dataclass

Resolved output paths for one invocation.

stderr_path instance-attribute
stdout_path instance-attribute
__init__(stdout_path, stderr_path)
ClaudeAssistantResult dataclass

Serializable summary of one wrapper invocation.

command instance-attribute
cwd instance-attribute
exit_code instance-attribute
final_message instance-attribute
stderr_path instance-attribute
stdout_path instance-attribute
__init__(command, cwd, exit_code, stdout_path, stderr_path, final_message)
to_json()

Render one JSON summary suitable for scripted callers.
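As a rough illustration of a serializable result summary, the sketch below mirrors the documented field names on a hypothetical WorkerResult dataclass. The field types and the exact JSON layout are assumptions; the source only lists the attribute names.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class WorkerResult:
    """Hypothetical stand-in for a wrapper result; field types are assumed."""

    command: List[str]
    cwd: Path
    exit_code: int
    stdout_path: Path
    stderr_path: Path
    final_message: Optional[str]

    def to_json(self) -> str:
        # asdict() keeps Path objects as-is; default=str converts them to strings.
        return json.dumps(asdict(self), default=str, sort_keys=True)


if __name__ == "__main__":
    result = WorkerResult(
        command=["claude", "--print", "hello"],
        cwd=Path("/tmp/run"),
        exit_code=0,
        stdout_path=Path("/tmp/run/stdout.jsonl"),
        stderr_path=Path("/tmp/run/stderr.log"),
        final_message="done",
    )
    print(result.to_json())
```

Emitting a single JSON object per invocation lets a scripted caller parse the outcome without scraping human-oriented output.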

main()

Dispatch to the Typer app.

run_command(prompt=typer.Option(..., '--prompt', help='Prompt text to pass to `claude --print`.'), cwd=typer.Option(Path.cwd(), '--cwd', file_okay=False, dir_okay=True, resolve_path=True, help='Working directory for the Claude run.'), claude_executable=typer.Option(None, '--claude-executable', file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Claude executable.'), stdout_path=typer.Option(None, '--stdout-path', resolve_path=True, help='Optional path for captured stdout.'), stderr_path=typer.Option(None, '--stderr-path', resolve_path=True, help='Optional path for captured stderr.'), permission_mode=typer.Option('dontAsk', '--permission-mode', help='Claude permission mode, for example `dontAsk` or `acceptEdits`.'), json_output=typer.Option(True, '--json/--no-json', help='Request Claude stream-json stdout for machine-readable capture.'), verbose=typer.Option(True, '--verbose/--no-verbose', help='Include Claude verbose event output.'), inherit_env=typer.Option(False, '--inherit-env/--sanitize-env', help='Inherit the current environment instead of the sanitized env.'))

Run one local Claude worker invocation and emit a JSON summary.

codex_assistant

Codex assistant CLI package.

codex_assistant

Typer entrypoint for a minimal local Codex worker wrapper.

codex-assistant is a thin convenience CLI for launching codex exec from a predictable, sanitized user-like environment. It is intended as a pragmatic bridge for delegated local worker invocation while the broader orchestration surfaces are still evolving.

app = typer.Typer(name='codex-assistant', help='Minimal sanitized wrapper around `codex exec` for delegated local worker runs.', add_completion=False, no_args_is_help=True) module-attribute
CodexAssistantPaths dataclass

Resolved output paths for one invocation.

stderr_path instance-attribute
stdout_path instance-attribute
__init__(stdout_path, stderr_path)
CodexAssistantResult dataclass

Serializable summary of one wrapper invocation.

command instance-attribute
cwd instance-attribute
exit_code instance-attribute
final_message instance-attribute
stderr_path instance-attribute
stdout_path instance-attribute
__init__(command, cwd, exit_code, stdout_path, stderr_path, final_message)
to_json()

Render one JSON summary suitable for scripted callers.

main()

Dispatch to the Typer app.

run_command(prompt=typer.Option(..., '--prompt', help='Prompt text to pass to `codex exec`.'), cwd=typer.Option(Path.cwd(), '--cwd', file_okay=False, dir_okay=True, resolve_path=True, help='Working directory for the Codex run.'), codex_executable=typer.Option(None, '--codex-executable', file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Codex executable.'), profile=typer.Option('collab', '--profile', help='Codex profile to use.'), model=typer.Option(None, '--model', help='Optional model override.'), stdout_path=typer.Option(None, '--stdout-path', resolve_path=True, help='Optional path for captured stdout.'), stderr_path=typer.Option(None, '--stderr-path', resolve_path=True, help='Optional path for captured stderr.'), output_last_message_path=typer.Option(None, '--output-last-message-path', resolve_path=True, help='Optional path for Codex `--output-last-message` capture.'), json_output=typer.Option(True, '--json/--no-json', help='Request Codex JSONL stdout for machine-readable capture.'), ephemeral=typer.Option(True, '--ephemeral/--no-ephemeral', help='Use Codex ephemeral mode.'), inherit_env=typer.Option(False, '--inherit-env/--sanitize-env', help='Inherit the current environment instead of the sanitized user-like env.'), enable_feature=typer.Option([], '--enable-feature', help='Repeatable Codex feature enable flag.'), disable_feature=typer.Option([], '--disable-feature', help='Repeatable Codex feature disable flag.'))

Run one local Codex worker invocation and emit a JSON summary.
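The --inherit-env/--sanitize-env switch can be pictured as choosing between the full environment and a filtered copy. The sketch below is only an assumption about what "sanitized" might mean: the allowlist contents and the helper name build_worker_env are hypothetical and are not taken from the package.

```python
import os
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical allowlist; the variables the real wrapper keeps are not documented here.
DEFAULT_ALLOWLIST = ("HOME", "PATH", "LANG", "USER")


def build_worker_env(
    inherit_env: bool,
    base: Optional[Mapping[str, str]] = None,
    allowlist: Sequence[str] = DEFAULT_ALLOWLIST,
) -> Dict[str, str]:
    """Return the environment mapping to hand to a worker subprocess."""
    source = dict(os.environ if base is None else base)
    if inherit_env:
        # Pass everything through unchanged.
        return source
    # Keep only allowlisted variables that are actually set.
    return {name: source[name] for name in allowlist if name in source}


if __name__ == "__main__":
    demo = {"PATH": "/usr/bin", "HOME": "/home/me", "SECRET_TOKEN": "x"}
    print(build_worker_env(False, demo))
```

The resulting mapping could then be passed as the env argument of subprocess.run so the worker sees a predictable set of variables.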

json_to_srt

__all__ = ['main', 'json_to_srt'] module-attribute
main()

Entry point for the jsonl-to-srt CLI tool.

json_to_srt

Simple CLI tool for converting JSONL transcription files to SRT format.

This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format. Handles chunked transcriptions with proper timestamp accumulation.

JsonDict = dict[str, Any] module-attribute
logger = get_child_logger(__name__) module-attribute
JsonlToSrtConverter

Converts JSONL transcription files from audio-transcribe to SRT format.

accumulated_time = 0.0 instance-attribute
entry_index = 1 instance-attribute
__init__()

Initialize converter state.

build_srt_entry(index, start, end, text)

Format a single SRT entry.

convert(input_file, output_file=None)

Convert a JSONL transcription file to SRT format.

Parameters:

Name Type Description Default
input_file TextIO

JSONL transcription file to parse

required
output_file Optional[Path]

Optional output file path

None

Returns:

Name Type Description
str str

SRT formatted content

extract_segment_data(segment)

Extract timestamp and text data from a segment.

format_timestamp(seconds)

Convert seconds to SRT timestamp format (HH:MM:SS,mmm).
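A minimal sketch of the seconds-to-timestamp conversion follows. It works in integer milliseconds so that rounding carries correctly into seconds and minutes; whether the package's own function rounds or truncates is not stated, so that detail is an assumption.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    # Round once to whole milliseconds, then split with integer arithmetic.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


if __name__ == "__main__":
    print(format_timestamp(3661.5))
```

Rounding before splitting avoids outputs such as 00:00:59,1000, which can appear when the fractional part is rounded separately from the whole seconds.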

get_segments_from_data(data)

Extract segments from a data object.

handle_output(srt_content, output_file)

Write SRT content to file or stdout.

parse_jsonl_line(line)

Parse a single JSONL line into a dictionary.

process_jsonl_content(lines)

Process all JSONL content into SRT format.

process_jsonl_line(line)

Process a single JSONL line into SRT entries.
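The timestamp accumulation across chunks can be sketched as follows. The input layout is an assumption: each JSONL line is taken to be one chunk with a "segments" list whose "start" and "end" values restart at zero, and the offset for the next chunk is taken from the end of the last segment. The real converter may derive the offset differently, and the function name accumulate_segments is hypothetical.

```python
import json
from typing import Iterable, List, Tuple

Entry = Tuple[int, float, float, str]  # (index, start, end, text)


def accumulate_segments(lines: Iterable[str]) -> List[Entry]:
    """Turn per-chunk segment times into one continuous, numbered timeline."""
    entries: List[Entry] = []
    accumulated_time = 0.0
    entry_index = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        chunk = json.loads(line)
        chunk_end = 0.0
        for segment in chunk.get("segments", []):
            start = float(segment["start"])
            end = float(segment["end"])
            text = segment["text"].strip()
            entries.append((entry_index, accumulated_time + start, accumulated_time + end, text))
            entry_index += 1
            chunk_end = max(chunk_end, end)
        # Shift the next chunk by the length of this one (assumed rule).
        accumulated_time += chunk_end
    return entries


if __name__ == "__main__":
    sample = [
        json.dumps({"segments": [{"start": 0, "end": 2, "text": "a"}]}),
        json.dumps({"segments": [{"start": 0, "end": 1.5, "text": "b"}]}),
    ]
    print(accumulate_segments(sample))
```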

process_segment(segment)

Process a single segment into SRT format.

process_segments_list(segments_list)

Process a list of segments into SRT entries.

read_input_lines(input_file)

Read and filter input lines from file.

json_to_srt(input_file, output=None)

Convert JSONL transcription files to SRT subtitle format.

Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.

main()

Entry point for the jsonl-to-srt CLI tool.

json_to_srt1

Simple CLI tool for converting JSONL transcription files to SRT format.

This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format.

JsonDict = dict[str, Any] module-attribute
logger = get_child_logger(__name__) module-attribute
convert_to_srt(input_file, output_file=None)

Convert a JSONL transcription file to SRT format.

Parameters:

Name Type Description Default
input_file TextIO

JSONL transcription file to parse

required
output_file Optional[Path]

Optional output file path

None

Returns:

Name Type Description
str str

SRT formatted content

extract_segment_data(segment)

Extract timestamp and text data from a segment.

format_srt_entry(index, start, end, text)

Format a single SRT entry.

format_timestamp(seconds)

Convert seconds to SRT timestamp format (HH:MM:SS,mmm).

get_segments_from_data(data)

Extract segments from a data object.

handle_output(srt_content, output_file)

Write SRT content to file or stdout.

json_to_srt(input_file, output=None)

Convert JSONL transcription files to SRT subtitle format.

Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.

main()

Entry point for the jsonl-to-srt CLI tool.

parse_jsonl_line(line)

Parse a single JSONL line into a dictionary.

process_jsonl_content(lines)

Process all JSONL content into SRT format.

process_jsonl_line(line, entry_index, accumulated_time)

Process a single JSONL line into SRT entries.

process_segment(segment, entry_index)

Process a single segment into SRT format.

process_segments_list(segments_list, entry_index)

Process a list of segments into SRT entries.

read_input_lines(input_file)

Read and filter input lines from file.

nfmt

__all__ = ['main', 'nfmt'] module-attribute
main()

Entry point for the nfmt CLI tool.

nfmt
main()

Entry point for the nfmt CLI tool.

nfmt(input_file, output, spacing)

Normalize the number of newlines in a text file.
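One plausible reading of newline normalization is sketched below: every run of one or more newlines is replaced by exactly spacing newlines. This interpretation, the default of 2, and the function name normalize_newlines are assumptions, since the reference does not define the rule.

```python
import re


def normalize_newlines(text: str, spacing: int = 2) -> str:
    """Replace each run of newlines with exactly `spacing` newlines."""
    if spacing < 1:
        raise ValueError("spacing must be at least 1")
    # Trim the ends first so the result has no leading or trailing blank lines.
    return re.sub(r"\n+", "\n" * spacing, text.strip())


if __name__ == "__main__":
    print(repr(normalize_newlines("a\n\n\nb\nc\n", spacing=2)))
```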

sent_split

__all__ = ['main', 'sent_split'] module-attribute
main()
sent_split

Simple CLI tool for sentence splitting.

This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.
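The split-then-join flow can be sketched without NLTK. The example below swaps in a naive regular-expression splitter purely to stay dependency-free; it is not the punkt tokenizer and will mis-split abbreviations. The separator values mirror the 'newline' default shown for SplitConfig, with 'space' assumed as the alternative.

```python
import re
from typing import List

# Naive boundary: whitespace that follows '.', '!' or '?'. A stand-in for punkt.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences using the naive boundary rule above."""
    return [part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part]


def join_sentences(sentences: List[str], separator: str = "newline") -> str:
    """Join sentences with a newline (default) or a single space."""
    joiner = "\n" if separator == "newline" else " "
    return joiner.join(sentences)


if __name__ == "__main__":
    print(join_sentences(split_sentences("Breathe in. Breathe out! Smile?")))
```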

SplitConfig

Bases: BaseModel

nltk_tokenizer = 'punkt' class-attribute instance-attribute
separator = 'newline' class-attribute instance-attribute
SplitIOData

Bases: BaseModel

content = None class-attribute instance-attribute
input_path = None class-attribute instance-attribute
output_path = None class-attribute instance-attribute
from_io(input_file, output) classmethod
get_input_content()
write_output(result)
SplitResult
stats = {} class-attribute instance-attribute
text_object instance-attribute
ensure_nltk_data(config)
main()
sent_split(input_file, output, space)
split_text(text, config, io_data)
sent_split_bak

Simple CLI tool for sentence splitting.

This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.

ensure_nltk_data()

Ensure NLTK punkt tokenizer is available.

main()
process_text(text, newline=True)

Split text into sentences using NLTK.

sent_split(input_file, output, space)

Split text into sentences using NLTK's sentence tokenizer.

Reads from stdin if no input file is specified. Writes to stdout if no output file is specified.

srt_translate

__all__ = ['main', 'srt_translate'] module-attribute
main()

Entry point for the srt-translate CLI tool.

srt_translate

CLI tool for translating SRT subtitle files using tnh-scholar line translation.

This module provides a command line interface for translating SRT subtitle files from one language to another while preserving timecodes and subtitle structure. Uses the same translation engine as the prompt-driven line translator.

logger = get_child_logger(__name__) module-attribute
SrtEntry

Represents a single subtitle entry from an SRT file.

end_time = end_time instance-attribute
index = index instance-attribute
line_key property

Generate a unique line key for this entry.

start_time = start_time instance-attribute
text = text.strip() instance-attribute
__init__(index, start_time, end_time, text)

Initialize subtitle entry with timing and text.

__str__()

Format entry as SRT text.

SrtTranslator

Translates SRT files while preserving timecodes.

metadata = metadata instance-attribute
model = model instance-attribute
pattern = pattern instance-attribute
source_language = source_language instance-attribute
target_language = target_language instance-attribute
__init__(source_language=None, target_language='en', pattern=None, model=None, metadata=None)

Initialize translator with language, model settings, and metadata.

create_text_object(text)

Create a TextObject from the extracted SRT text with metadata.

entries_to_numbered_text(entries)

Convert SRT entries to numbered text for TextObject.

extract_translated_lines(translated_object)

Extract translated lines from TextObject with line keys.

format_srt(entries)

Format entries back to SRT content.

parse_srt(content)

Parse SRT content into structured entries.
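A minimal sketch of SRT parsing is shown below, assuming the common layout of an index line, a timecode line, and one or more text lines, with entries separated by blank lines. The Entry dataclass is a hypothetical stand-in for SrtEntry, and timecodes are kept as strings so they survive a round trip unchanged.

```python
import re
from dataclasses import dataclass
from typing import List

_TIMECODE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")


@dataclass
class Entry:
    """Hypothetical stand-in for one subtitle entry."""

    index: int
    start_time: str
    end_time: str
    text: str

    def __str__(self) -> str:
        return f"{self.index}\n{self.start_time} --> {self.end_time}\n{self.text}\n"


def parse_srt(content: str) -> List[Entry]:
    """Parse SRT text into entries, skipping malformed blocks."""
    entries: List[Entry] = []
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue  # not an index line followed by a timecode line
        match = _TIMECODE.search(lines[1])
        if match is None:
            continue
        text = "\n".join(lines[2:]).strip()
        entries.append(Entry(int(lines[0]), match.group(1), match.group(2), text))
    return entries


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:02,500\nHello\n"
    print(parse_srt(sample))
```

Because the timecodes are never reinterpreted, joining str(entry) for each entry with blank lines reproduces the original timing exactly, which is the property a translation pass needs.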

translate_and_save(input_file, output_path)

Handles file reading, translation, and saving.

translate_srt(content)

Process SRT content through complete translation pipeline.

translate_text_object(text_object)

Translate the TextObject using line translation.

update_entries_with_translations(entries, translations)

Apply translations to original entries.

load_metadata_from_file(metadata_file)

Load metadata from a file if provided.

main()

Entry point for the srt-translate CLI tool.

set_output_path(input_file, output, target_language)
set_pattern(pattern)
srt_translate(input_file, output=None, source_language=None, target_language='en', model=None, pattern=None, debug=False, metadata=None)

Translate SRT subtitle files from one language to another.

INPUT_FILE is the path to the SRT file to translate.

tnh_codex_harness

Suspended CLI package for the reference-only Codex harness spike.

tnh_codex_harness

Typer entrypoint for the Codex harness CLI.

app = typer.Typer(name='tnh-codex-harness', help='Standalone Codex API harness.', add_completion=False, no_args_is_help=True) module-attribute
main()

Dispatch to Typer app.

run_command(task=typer.Option(..., '--task', help='Task for Codex.'), system_prompt=typer.Option(None, '--system-prompt', help='Optional system prompt.'), apply_patch=typer.Option(True, '--apply-patch/--no-apply-patch', help='Apply patch output.'), run_tests_command=typer.Option(None, '--run-tests', help='Test command to run after applying patch.'), model=typer.Option(None, '--model', help='Override the Codex model.'), runs_root=typer.Option(None, '--runs-root', help='Override runs root directory.'), timeout_seconds=typer.Option(None, '--timeout-seconds', help='Timeout for tests.'), max_output_tokens=typer.Option(None, '--max-output-tokens', help='Max output tokens.'), temperature=typer.Option(None, '--temperature', help='Sampling temperature.'), max_tool_rounds=typer.Option(None, '--max-tool-rounds', help='Maximum tool-call rounds to allow.'), use_chat_completions=typer.Option(False, '--use-chat-completions', help='Use Chat Completions API instead of Responses API.'))

Run a single Codex harness execution.

tnh_conductor

CLI package for the maintained tnh-conductor entry point.

__all__ = ['app', 'main'] module-attribute
app = typer.Typer(name='tnh-conductor', help='Maintained local/headless workflow bootstrap runner.', add_completion=False, no_args_is_help=True) module-attribute
main()

Dispatch to the Typer app.

tnh_conductor

Typer entrypoint for the maintained tnh-conductor CLI.

STATUS_STORE = FilesystemRunArtifactStore() module-attribute
app = typer.Typer(name='tnh-conductor', help='Maintained local/headless workflow bootstrap runner.', add_completion=False, no_args_is_help=True) module-attribute
conductor_app()

Expose tnh-conductor as a command group.

main()

Dispatch to the Typer app.

run_command(workflow=typer.Option(..., '--workflow', exists=True, file_okay=True, dir_okay=False, readable=True, resolve_path=True, help='Workflow YAML file to execute.'), repo_root=typer.Option(Path.cwd(), '--repo-root', file_okay=False, dir_okay=True, resolve_path=True, help='Repository root for the managed worktree run.'), runs_root=typer.Option(None, '--runs-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the canonical runs root.'), workspace_root=typer.Option(None, '--workspace-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the managed worktree root.'), base_ref=typer.Option('HEAD', '--base-ref', help='Committed git base ref for the run.'), codex_executable=typer.Option(None, '--codex-executable', exists=True, file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Codex executable.'), claude_executable=typer.Option(None, '--claude-executable', exists=True, file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Claude executable.'))

Execute one maintained local/headless bootstrap run.

status_command(run_id=typer.Argument(..., help='Run id to inspect.'), repo_root=typer.Option(Path.cwd(), '--repo-root', file_okay=False, dir_okay=True, resolve_path=True, help='Repository root used to resolve default storage roots.'), runs_root=typer.Option(None, '--runs-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the canonical runs root.'), watch=typer.Option(False, '--watch', help='Poll and print status snapshots until the run reaches a terminal state.'), poll_interval_seconds=typer.Option(1.0, '--poll-interval-seconds', help='Polling interval in seconds when --watch is enabled.'))

Read the maintained live status artifact for one run.
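The --watch behavior described in the option help (poll and print snapshots until a terminal state) can be sketched as a simple loop. The status strings, the terminal-state set, and the function name watch_status are hypothetical; the real command reads a status artifact whose schema is not shown here.

```python
import time
from typing import Callable, List

# Hypothetical terminal states; the real run states are not documented in this section.
TERMINAL_STATES = frozenset({"completed", "failed"})


def watch_status(
    read_status: Callable[[], str],
    poll_interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll read_status until a terminal state is seen; return all snapshots."""
    snapshots: List[str] = []
    while True:
        status = read_status()
        snapshots.append(status)
        print(status)
        if status in TERMINAL_STATES:
            return snapshots
        sleep(poll_interval_seconds)


if __name__ == "__main__":
    states = iter(["running", "running", "completed"])
    watch_status(lambda: next(states), poll_interval_seconds=0.0)
```

Injecting the sleep function keeps the loop testable without real delays.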

tnh_conductor_spike

CLI entrypoint package for tnh-conductor-spike.

tnh_conductor_spike

Typer entrypoint for the tnh-conductor-spike CLI.

app = typer.Typer(name='tnh-conductor-spike', help='Phase 0 protocol layer spike runner.', add_completion=False, no_args_is_help=True) module-attribute
main()

Dispatch to the Typer app.

run_command(agent=typer.Option(..., '--agent', help='Agent identifier (claude-code, codex).'), task=typer.Option(None, '--task', help='Task text for the agent.'), prompt_id=typer.Option(None, '--prompt-id', help='Prompt id for the task.'), timeout_seconds=typer.Option(SpikeDefaults().default_timeout_seconds, '--timeout-seconds', help='Wall-clock timeout.'), idle_timeout_seconds=typer.Option(SpikeDefaults().default_idle_timeout_seconds, '--idle-timeout-seconds', help='Idle timeout.'), heartbeat_interval_seconds=typer.Option(SpikeDefaults().default_heartbeat_interval_seconds, '--heartbeat-interval-seconds', help='Heartbeat interval for progress events.'), work_branch=typer.Option(None, '--work-branch', help='Explicit work branch name.'))

Run a single Phase 0 spike execution.

tnh_gen

tnh-gen CLI package.

__all__ = ['app', 'main'] module-attribute
app = typer.Typer(name='tnh-gen', help='TNH-Gen: Unified CLI for TNH Scholar GenAI operations.', add_completion=False, no_args_is_help=True) module-attribute
main()

Dispatch execution to the Typer application.

commands

tnh-gen command modules.

config
ConfigValue = str | Path | float | int | None module-attribute
app = typer.Typer(help='Inspect and edit tnh-gen configuration.') module-attribute
get_config_value(key)

Retrieve a single config value by key.

Parameters:

Name Type Description Default
key str

Configuration key to fetch.

required
list_config_keys()

List available configuration keys supported by the CLI.

set_config_value(key=typer.Argument(..., help=f'Config key. Supported: {', '.join(available_keys())}'), value=typer.Argument(..., help='New value for the config key.'), workspace=typer.Option(False, '--workspace', help='Persist to workspace config (.vscode/tnh-scholar.json or .tnh-gen.json).'))

Persist a config value to user or workspace scope.

Parameters:

Name Type Description Default
key str

Configuration key to update.

Argument(..., help=f'Config key. Supported: {', '.join(available_keys())}')
value str

New value to store.

Argument(..., help='New value for the config key.')
workspace bool

Whether to persist to workspace scope.

Option(False, '--workspace', help='Persist to workspace config (.vscode/tnh-scholar.json or .tnh-gen.json).')
show_config(catalog_health=typer.Option(False, '--catalog-health', help='Include aggregated prompt catalog health in the response.'), format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False))

Show the effective configuration and its source precedence.

Parameters:

Name Type Description Default
format OutputFormat | None

Optional output format override (json or yaml).

Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False)
list
app = typer.Typer(help='List available prompts with metadata.', invoke_without_command=True) module-attribute
list_prompts(tag=typer.Option([], '--tag', help='Filter by tag (repeatable).'), search=typer.Option(None, '--search', help='Search prompt name/description.'), keys_only=typer.Option(False, '--keys-only', help='Output only prompt keys.'), format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, text/table (human-only).', case_sensitive=False))

List prompts with optional filters and output formats.

Parameters:

Name Type Description Default
tag list[str]

Filter prompts by tag (repeatable).

Option([], '--tag', help='Filter by tag (repeatable).')
search str | None

Case-insensitive search across name/description.

Option(None, '--search', help='Search prompt name/description.')
keys_only bool

Whether to output only prompt keys.

Option(False, '--keys-only', help='Output only prompt keys.')
format ListOutputFormat | None

Desired output format (defaults to global setting).

Option(None, '--format', help='Output format: json (requires --api), yaml, text/table (human-only).', case_sensitive=False)
run
app = typer.Typer(help='Execute a prompt with variable substitution.', invoke_without_command=True) module-attribute
logger = logging.getLogger(__name__) module-attribute
RunContext dataclass

Encapsulates all context needed for prompt execution.

config instance-attribute
config_meta instance-attribute
include_provenance instance-attribute
input_metadata instance-attribute
intent instance-attribute
metadata instance-attribute
model_override instance-attribute
output_file instance-attribute
output_format instance-attribute
prompt_key instance-attribute
quiet instance-attribute
service instance-attribute
trace_id instance-attribute
variables instance-attribute
__init__(prompt_key, config, config_meta, service, metadata, input_metadata, variables, trace_id, model_override, intent, quiet, output_format, output_file, include_provenance)
TnhGenCLIOptions

Encapsulates all CLI option definitions for the run command.

API = typer.Option(False, '--api', help='Machine-readable API contract output (JSON by default).') class-attribute instance-attribute
CONFIG = typer.Option(None, '--config', help='Path to config file that overrides user/workspace config.') class-attribute instance-attribute
FORMAT = typer.Option(None, '--format', help='Output format: json or yaml (API mode only).', case_sensitive=False) class-attribute instance-attribute
INPUT_FILE = typer.Option(..., '--input-file', help='Input file containing user content.') class-attribute instance-attribute
INTENT = typer.Option(None, '--intent', help='Intent hint for routing.') class-attribute instance-attribute
MAX_TOKENS = typer.Option(None, '--max-tokens', help='Maximum output tokens.') class-attribute instance-attribute
MODEL = typer.Option(None, '--model', help='Model override.') class-attribute instance-attribute
NO_PROVENANCE = typer.Option(False, '--no-provenance', help='Omit provenance block in files.') class-attribute instance-attribute
OUTPUT_FILE = typer.Option(None, '--output-file', help='Write result text to file.') class-attribute instance-attribute
PROMPT = typer.Option(..., '--prompt', help='Prompt key to execute.') class-attribute instance-attribute
PROMPT_DIR = typer.Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.') class-attribute instance-attribute
STREAMING = typer.Option(False, '--streaming', help='Enable streaming output (not implemented).') class-attribute instance-attribute
TEMPERATURE = typer.Option(None, '--temperature', help='Model temperature.') class-attribute instance-attribute
TOP_P = typer.Option(None, '--top-p', help='Top-p sampling (not yet supported).') class-attribute instance-attribute
VAR = typer.Option([], '--var', help='Inline variable assignment (repeatable).') class-attribute instance-attribute
VARS_FILE = typer.Option(None, '--vars', help='JSON file with variable definitions.') class-attribute instance-attribute
run_prompt(config=TnhGenCLIOptions.CONFIG, api=TnhGenCLIOptions.API, prompt_dir=TnhGenCLIOptions.PROMPT_DIR, prompt=TnhGenCLIOptions.PROMPT, input_file=TnhGenCLIOptions.INPUT_FILE, vars_file=TnhGenCLIOptions.VARS_FILE, var=TnhGenCLIOptions.VAR, model=TnhGenCLIOptions.MODEL, intent=TnhGenCLIOptions.INTENT, max_tokens=TnhGenCLIOptions.MAX_TOKENS, temperature=TnhGenCLIOptions.TEMPERATURE, top_p=TnhGenCLIOptions.TOP_P, output_file=TnhGenCLIOptions.OUTPUT_FILE, format=TnhGenCLIOptions.FORMAT, no_provenance=TnhGenCLIOptions.NO_PROVENANCE, streaming=TnhGenCLIOptions.STREAMING)

Execute a prompt with variable substitution and AI processing.

Parameters:

Name Type Description Default
config Path | None

Optional path to an explicit config file.

CONFIG
api bool

Whether to emit machine-readable API contract output.

API
prompt_dir Path | None

Optional prompt catalog directory override.

PROMPT_DIR
prompt str

Key of the prompt to execute.

PROMPT
input_file Path

File containing the main user input text.

INPUT_FILE
vars_file Path | None

Optional JSON file with additional variables.

VARS_FILE
var list[str]

Inline variable assignments (--var key=value).

VAR
model str | None

Optional model override for this run.

MODEL
intent str | None

Optional routing intent to pass to the service.

INTENT
max_tokens int | None

Max output tokens override.

MAX_TOKENS
temperature float | None

Temperature override.

TEMPERATURE
top_p float | None

Top-p sampling override (accepted but not applied).

TOP_P
output_file Path | None

Optional file to write the rendered text to.

OUTPUT_FILE
format OutputFormat | None

Output format for stdout.

FORMAT
no_provenance bool

Whether to omit provenance header in written files.

NO_PROVENANCE
streaming bool

Whether to request streaming (not yet implemented).

STREAMING
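The repeatable --var key=value form can be parsed with a small helper like the sketch below. The helper name parse_inline_vars and its error handling are assumptions; the reference only states the key=value shape.

```python
from typing import Dict, List


def parse_inline_vars(assignments: List[str]) -> Dict[str, str]:
    """Parse ['key=value', ...] into a dict; later assignments win."""
    variables: Dict[str, str] = {}
    for item in assignments:
        # Split on the first '=' only, so values may themselves contain '='.
        key, separator, value = item.partition("=")
        key = key.strip()
        if not separator or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        variables[key] = value
    return variables


if __name__ == "__main__":
    print(parse_inline_vars(["source_lang=vi", "note=a=b"]))
```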
version
version(format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False))

Display version information for tnh-gen and dependencies.

Parameters:

Name Type Description Default
format OutputFormat | None

Optional output format override (json or yaml).

Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False)
config_loader
CLIConfig

Bases: BaseModel

CLI configuration modeled with Pydantic for consistency with OS blueprint.

api_key = None class-attribute instance-attribute
cli_path = None class-attribute instance-attribute
default_model = None class-attribute instance-attribute
default_temperature = None class-attribute instance-attribute
max_dollars = None class-attribute instance-attribute
max_input_chars = None class-attribute instance-attribute
prompt_catalog_dir = Field(default=None) class-attribute instance-attribute
with_overrides(overrides)

Return a new config with non-null override values applied.

Parameters:

Name Type Description Default
overrides ConfigData

Mapping of override keys to values.

required

Returns:

Type Description
'CLIConfig'

New CLIConfig instance with overrides applied.

available_keys()

Return the list of supported config keys.

Returns:

Type Description
list[str]

List of available configuration keys.

load_config(config_path=None, *, cwd=None, overrides=None, prompt_dir=None)

Load CLI configuration with clear precedence and metadata.

The effective config is built in this order: defaults/env → user config → workspace config → explicit config_path → CLI overrides → explicit prompt_dir override. Overrides that are None are ignored to avoid clobbering previous values.

Parameters:

Name Type Description Default
config_path Path | None

Optional explicit config file to load.

None
cwd Path | None

Working directory for resolving workspace config paths.

None
overrides ConfigData | None

In-memory override values (e.g., CLI flags).

None
prompt_dir Path | None

Optional prompt catalog directory override.

None

Returns:

Type Description
tuple[CLIConfig, ConfigMeta]

Tuple of validated CLIConfig and metadata containing the source list.

Raises:

Type Description
ValueError

If any referenced config file contains invalid JSON.
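The precedence rule described for load_config (later layers win, None values are ignored) can be sketched with plain dictionaries. The helper name merge_layers, the layer labels, and the model names are hypothetical; the real function returns a validated CLIConfig and a ConfigMeta rather than raw dicts.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Layer = Tuple[str, Optional[Dict[str, Any]]]


def merge_layers(layers: Iterable[Layer]) -> Tuple[Dict[str, Any], List[str]]:
    """Merge config layers in order; return (effective config, contributing sources)."""
    effective: Dict[str, Any] = {}
    sources: List[str] = []
    for name, data in layers:
        if not data:
            continue  # missing or empty layer contributes nothing
        # Drop None values so they cannot clobber earlier layers.
        applied = {key: value for key, value in data.items() if value is not None}
        if applied:
            effective.update(applied)
            sources.append(name)
    return effective, sources


if __name__ == "__main__":
    config, used = merge_layers(
        [
            ("defaults", {"default_model": "model-a", "default_temperature": 0.2}),
            ("user", {"default_model": "model-b"}),
            ("workspace", None),
            ("cli", {"default_model": None, "default_temperature": 0.7}),
        ]
    )
    print(config, used)
```

Filtering None before merging is what allows an unset CLI flag to fall through to the value from the user or workspace file.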

load_config_overrides(config_path=None, *, cwd=None)

Load only user/workspace/explicit config overrides (no defaults).

persist_config_value(key, value, *, workspace=False, cwd=None)

Persist a single config value to the user or workspace config file.

Parameters:

Name Type Description Default
key ConfigKey

Configuration key to update.

required
value Any

Value to persist.

required
workspace bool

Whether to target workspace scope instead of user scope.

False
cwd Path | None

Working directory for resolving workspace path.

None

Returns:

Type Description
Path

Path to the file that was written.

Raises:

Type Description
KeyError

If the key is not supported.

errors
ExitCode

Bases: IntEnum

CLI exit codes mapped to error classes.

FORMAT_ERROR = 4 class-attribute instance-attribute
INPUT_ERROR = 5 class-attribute instance-attribute
POLICY_ERROR = 1 class-attribute instance-attribute
PROVIDER_ERROR = 3 class-attribute instance-attribute
SUCCESS = 0 class-attribute instance-attribute
TRANSPORT_ERROR = 2 class-attribute instance-attribute
emit_trace_id(trace_id, error_code)

Emit a trace identifier to stderr for diagnostics.

error_response(exc, *, error_code=None, suggestion=None, trace_id)

Construct a serialized error response and matching exit code.

Parameters:

Name Type Description Default
exc Exception

The caught exception.

required
error_code str | None

Optional explicit error code to surface in diagnostics.

None
suggestion str | None

Optional user-facing recovery suggestion.

None
trace_id str

Unique trace identifier for tracking this CLI request.

required

Returns:

Type Description
Tuple[ErrorPayload, ExitCode]

A tuple containing the response payload and associated exit code.

exit_with_error(exc, *, trace_id, format_override=None)

Render error output, emit trace, and exit with mapped status.

map_exception(exc)

Map a raised exception to a stable CLI exit code.

Parameters:

Name Type Description Default
exc Exception

Exception raised during CLI execution.

required

Returns:

Type Description
ExitCode

ExitCode representing the failure category.
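The idea of mapping exceptions to stable exit codes can be sketched as below. The enum values are copied from the ExitCode listing above, but the exception-to-code mapping is entirely hypothetical: it uses built-in exception types as stand-ins because the package's own error classes are not shown in this section.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes, with values as listed in the reference above."""

    SUCCESS = 0
    POLICY_ERROR = 1
    TRANSPORT_ERROR = 2
    PROVIDER_ERROR = 3
    FORMAT_ERROR = 4
    INPUT_ERROR = 5


def map_exception_sketch(exc: Exception) -> ExitCode:
    """Hypothetical mapping from built-in exceptions to exit codes."""
    if isinstance(exc, (FileNotFoundError, ValueError)):
        return ExitCode.INPUT_ERROR
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ExitCode.TRANSPORT_ERROR
    if isinstance(exc, PermissionError):
        return ExitCode.POLICY_ERROR
    # Anything unrecognized is treated as a provider-side failure (assumed default).
    return ExitCode.PROVIDER_ERROR


if __name__ == "__main__":
    code = map_exception_sketch(TimeoutError("slow network"))
    print(code.name, int(code))
```

Because ExitCode is an IntEnum, a member can be passed directly to sys.exit or typer.Exit(code=...) and scripts can branch on the numeric status.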

render_error(exc, *, trace_id, format_override=None, suggestion=None)

Render error output based on API vs human mode.

factory
DefaultServiceFactory

Default factory bridging CLI config to GenAIService.

create_genai_service(cli_config, overrides)

Create a fully configured GenAI service instance.

Parameters:

Name Type Description Default
cli_config CLIConfig

Effective CLI configuration.

required
overrides ServiceOverrides

Execution-time overrides for model and token behavior.

required

Returns:

Type Description
GenAIServiceProtocol

GenAIServiceProtocol implementation bound to current settings.

ServiceFactory

Bases: Protocol

Factory protocol for constructing GenAI services.

create_genai_service(cli_config, overrides)

Create a GenAI service given CLI config and overrides.

ServiceOverrides dataclass

Typed overrides passed from CLI flags into Settings.

max_tokens = None class-attribute instance-attribute
model = None class-attribute instance-attribute
temperature = None class-attribute instance-attribute
__init__(model=None, max_tokens=None, temperature=None)
cli_config_to_settings_kwargs(cli_config, overrides)

Translate CLI configuration into kwargs for GenAI service settings.

output

Output helpers for tnh-gen, including formatting policy utilities.

formatter
format_table(headers, rows)

Render a simple fixed-width table for CLI display.

Parameters:

Name Type Description Default
headers list[str]

Column headers.

required
rows Iterable[list[str]]

Row data to render.

required

Returns:

Type Description
str

Rendered table string.
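
A fixed-width table of the kind format_table returns could be built roughly as follows. This is a sketch under assumptions: the column gap, the dashed separator row, and left alignment are illustrative choices, and every row is assumed to have the same number of cells as the header.

```python
from typing import Iterable


def format_table_sketch(headers: list[str], rows: Iterable[list[str]]) -> str:
    """Render headers and rows as left-aligned, fixed-width columns."""
    materialized = [list(row) for row in rows]

    # Each column is as wide as its longest cell (header included).
    widths = [len(header) for header in headers]
    for row in materialized:
        for index, cell in enumerate(row):
            widths[index] = max(widths[index], len(cell))

    def render(cells: list[str]) -> str:
        padded = (cell.ljust(widths[index]) for index, cell in enumerate(cells))
        return "  ".join(padded).rstrip()

    lines = [render(headers), render(["-" * width for width in widths])]
    lines.extend(render(row) for row in materialized)
    return "\n".join(lines)
```

Materializing the rows first matters because the signature accepts any iterable, and the data has to be walked twice: once to measure widths and once to render.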

render_output(payload, fmt)

Serialize payload to the requested output format.

Parameters:

Name Type Description Default
payload Any

Data to serialize.

required
fmt OutputFormat | ListOutputFormat

Output format enum selection.

required

Returns:

Type Description
str

Serialized string representation for CLI display.

Raises:

Type Description
ValueError

If the requested format is unsupported.

human_formatter
LABELS = HumanOutputLabels() module-attribute
HumanOutputLabels dataclass

Display labels for human-friendly CLI output.

error_prefix = 'Error: ' class-attribute instance-attribute
header_template = 'Available Prompts ({count})' class-attribute instance-attribute
metadata_separator = ' | ' class-attribute instance-attribute
no_default_model = '(no default)' class-attribute instance-attribute
no_tags = '(no tags)' class-attribute instance-attribute
no_variables = '(none)' class-attribute instance-attribute
suggestion_prefix = 'Suggestion: ' class-attribute instance-attribute
variable_prefix = ' Variables: ' class-attribute instance-attribute
__init__(no_variables='(none)', no_default_model='(no default)', no_tags='(no tags)', header_template='Available Prompts ({count})', variable_prefix=' Variables: ', metadata_separator=' | ', error_prefix='Error: ', suggestion_prefix='Suggestion: ')
OutputColor

Bases: str, Enum

Color names used for human-friendly CLI output (the values are color names, not raw ANSI escape sequences).

ERROR = 'red' class-attribute instance-attribute
MODEL = 'green' class-attribute instance-attribute
SUGGESTION = 'yellow' class-attribute instance-attribute
TAGS = 'yellow' class-attribute instance-attribute
TITLE = 'bright_blue' class-attribute instance-attribute
VARIABLES = 'cyan' class-attribute instance-attribute
format_human_friendly_error(error, suggestion=None)

Format errors for human-readable CLI output.

format_human_friendly_list(prompts)

Format prompt metadata for human-readable CLI output.

policy
resolve_list_format(*, api, format_override, ctx_format)

Resolve list output format with API-aware defaults.

resolve_output_format(*, api, format_override, default_format)

Resolve output format with API-aware defaults.

validate_global_format(api, format_override)

Validate global format flags shared across commands.

validate_list_format(api, format_override)

Validate list format combinations.

validate_run_format(api, format_override)

Validate run format combinations.
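
The help text for --format elsewhere in this reference states that json/yaml are the API formats, text/yaml are the human formats, and API output is JSON by default. A minimal sketch of resolution and validation consistent with that description follows; the function names, the use of ValueError, and the error messages are assumptions, not the module's actual behavior.

```python
from enum import Enum
from typing import Optional


class OutputFormat(str, Enum):
    json = "json"
    text = "text"
    yaml = "yaml"


def validate_format_sketch(api: bool, format_override: Optional[OutputFormat]) -> None:
    """Reject combinations that the --format help text rules out."""
    if format_override is None:
        return
    if api and format_override is OutputFormat.text:
        raise ValueError("--api supports json or yaml output, not text")
    if not api and format_override is OutputFormat.json:
        raise ValueError("json output requires --api")


def resolve_output_format_sketch(
    *,
    api: bool,
    format_override: Optional[OutputFormat],
    default_format: OutputFormat,
) -> OutputFormat:
    """Pick the explicit override if given, else an API-aware default."""
    validate_format_sketch(api, format_override)
    if format_override is not None:
        return format_override
    return OutputFormat.json if api else default_format
```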

provenance
provenance_block(envelope, *, source_metadata=None, trace_id, prompt_version)

Build a YAML frontmatter block capturing provenance for saved files.

provenance_metadata(envelope, *, source_metadata=None, trace_id, prompt_version)

Build merged provenance metadata for persisted sidecars or headers.

sidecar_path(path)

Return the provenance sidecar path for a structured output artifact.

write_output_file(path, *, result_text, envelope, source_metadata=None, trace_id, prompt_version, include_provenance, structured_output=False)

Write result text to disk, optionally prefixing provenance metadata.
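
To make the idea of a YAML frontmatter provenance block concrete, here is a standard-library-only sketch. The exact keys the real provenance_block emits are not documented here, so the metadata passed in the usage note (trace_id, prompt_version, model) is illustrative, and the helper name is hypothetical.

```python
import json
from typing import Mapping


def provenance_block_sketch(metadata: Mapping[str, object]) -> str:
    """Build a '---' delimited frontmatter block from flat metadata."""
    lines = ["---"]
    for key, value in metadata.items():
        # JSON scalars are valid YAML flow scalars, so json.dumps gives safe quoting.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def prefix_with_provenance(result_text: str, metadata: Mapping[str, object]) -> str:
    """Return the result text with the provenance block prepended."""
    return provenance_block_sketch(metadata) + "\n" + result_text
```

For example, `prefix_with_provenance("body", {"trace_id": "abc123", "prompt_version": "1.0"})` yields a document whose first lines are the frontmatter block, followed by a blank line and the body.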

state
ctx = CLIContext() module-attribute
CLIContext dataclass

Holds shared CLI state populated by the Typer callback.

api = False class-attribute instance-attribute
config_path = None class-attribute instance-attribute
no_color = False class-attribute instance-attribute
output_format = None class-attribute instance-attribute
quiet = False class-attribute instance-attribute
service_factory = None class-attribute instance-attribute
__init__(config_path=None, output_format=None, api=False, quiet=False, no_color=False, service_factory=None)
ListOutputFormat

Bases: str, Enum

Output formats available for prompt listing.

json = 'json' class-attribute instance-attribute
table = 'table' class-attribute instance-attribute
text = 'text' class-attribute instance-attribute
yaml = 'yaml' class-attribute instance-attribute
OutputFormat

Bases: str, Enum

Supported output formats for primary CLI commands.

json = 'json' class-attribute instance-attribute
text = 'text' class-attribute instance-attribute
yaml = 'yaml' class-attribute instance-attribute
tnh_gen

Typer entrypoint for the tnh-gen CLI.

app = typer.Typer(name='tnh-gen', help='TNH-Gen: Unified CLI for TNH Scholar GenAI operations.', add_completion=False, no_args_is_help=True) module-attribute
cli_callback(click_ctx, config=typer.Option(None, '--config', help='Path to config file that overrides user/workspace config.'), prompt_dir=typer.Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.'), format=typer.Option(None, '--format', help='Output format for commands (json/yaml for API output; text/yaml for human output).', case_sensitive=False), api=typer.Option(False, '--api', help='Machine-readable API contract output (JSON by default).'), quiet=typer.Option(False, '--quiet', '-q', help='Suppress non-error output.'), no_color=typer.Option(False, '--no-color', help='Disable colored output.'))

Apply global options and initialize shared context.

Default behavior: human-friendly output optimized for interactive CLI use. Use --api for machine-readable JSON contract output.

Examples:

    tnh-gen list
    tnh-gen --api list
    tnh-gen --prompt-dir ./my-prompts list
    tnh-gen run --prompt daily --input-file notes.md
    tnh-gen --api run --prompt daily --input-file notes.md

Parameters:

Name Type Description Default
config Optional[Path]

Optional path to an explicit config file.

Option(None, '--config', help='Path to config file that overrides user/workspace config.')
prompt_dir Path | None

Optional prompt catalog directory override.

Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.')
format OutputFormat | None

Output format override for commands.

Option(None, '--format', help='Output format for commands (json/yaml for API output; text/yaml for human output).', case_sensitive=False)
api bool

Whether to emit machine-readable API contract output.

Option(False, '--api', help='Machine-readable API contract output (JSON by default).')
quiet bool

Whether to suppress non-error output.

Option(False, '--quiet', '-q', help='Suppress non-error output.')
no_color bool

Whether to disable colored terminal output.

Option(False, '--no-color', help='Disable colored output.')
main()

Dispatch execution to the Typer application.

types
ConfigKey = Literal['prompt_catalog_dir', 'default_model', 'max_dollars', 'max_input_chars', 'default_temperature', 'api_key', 'cli_path'] module-attribute
DefaultVariables = Mapping[str, Any] module-attribute
PolicyApplied = Mapping[str, Any] module-attribute
RunOutcomePayload = RunSuccessPayload | RunIncompletePayload | RunFailedPayload module-attribute
VariableMap = MutableMapping[str, Any] module-attribute
ConfigData

Bases: TypedDict

api_key instance-attribute
cli_path instance-attribute
default_model instance-attribute
default_temperature instance-attribute
max_dollars instance-attribute
max_input_chars instance-attribute
prompt_catalog_dir instance-attribute
ConfigKeysHumanPayload

Bases: TypedDict

keys instance-attribute
ConfigKeysPayload

Bases: TypedDict

keys instance-attribute
trace_id instance-attribute
ConfigMeta

Bases: TypedDict

config_files instance-attribute
sources instance-attribute
ConfigShowPayload

Bases: TypedDict

catalog_errors instance-attribute
catalog_health instance-attribute
config instance-attribute
config_files instance-attribute
sources instance-attribute
trace_id instance-attribute
ConfigUpdateApiPayload

Bases: TypedDict

status instance-attribute
target instance-attribute
trace_id instance-attribute
updated instance-attribute
ConfigUpdatePayload

Bases: TypedDict

target instance-attribute
updated instance-attribute
ConfigValuePayload

Bases: TypedDict

api_key instance-attribute
cli_path instance-attribute
default_model instance-attribute
default_temperature instance-attribute
max_dollars instance-attribute
max_input_chars instance-attribute
prompt_catalog_dir instance-attribute
trace_id instance-attribute
ErrorDiagnostics

Bases: TypedDict

error_code instance-attribute
error_type instance-attribute
suggestion instance-attribute
ErrorPayload

Bases: TypedDict

diagnostics instance-attribute
error instance-attribute
status instance-attribute
trace_id instance-attribute
HumanEntry

Bases: TypedDict

default_model instance-attribute
description instance-attribute
key instance-attribute
name instance-attribute
tags instance-attribute
variables instance-attribute
HumanVariables

Bases: TypedDict

optional instance-attribute
required instance-attribute
ListApiEntry

Bases: TypedDict

default_model instance-attribute
default_variables instance-attribute
description instance-attribute
key instance-attribute
name instance-attribute
optional_variables instance-attribute
output_mode instance-attribute
required_variables instance-attribute
tags instance-attribute
version instance-attribute
warnings instance-attribute
ListApiPayload

Bases: TypedDict

catalog_errors instance-attribute
count instance-attribute
prompts instance-attribute
sources instance-attribute
ListHumanPayload

Bases: TypedDict

count instance-attribute
prompts instance-attribute
RunAdapterDiagnosticsPayload

Bases: TypedDict

content_part_count instance-attribute
content_source instance-attribute
extraction_notes instance-attribute
raw_finish_reason instance-attribute
RunBasePayload

Bases: TypedDict

policy_applied instance-attribute
prompt_warnings instance-attribute
provenance instance-attribute
sources instance-attribute
trace_id instance-attribute
warnings instance-attribute
RunFailedPayload

Bases: RunBasePayload

failure instance-attribute
status instance-attribute
RunFailurePayload

Bases: TypedDict

adapter_diagnostics instance-attribute
message instance-attribute
reason instance-attribute
retryable instance-attribute
RunIncompletePayload

Bases: RunBasePayload

result instance-attribute
status instance-attribute
RunProvenancePayload

Bases: TypedDict

backend instance-attribute
completed_at instance-attribute
model instance-attribute
prompt_fingerprint instance-attribute
prompt_key instance-attribute
prompt_version instance-attribute
schema_version instance-attribute
started_at instance-attribute
RunResultPayload

Bases: TypedDict

finish_reason instance-attribute
json instance-attribute
model instance-attribute
provider instance-attribute
schema_ref instance-attribute
text instance-attribute
usage instance-attribute
RunSuccessPayload

Bases: RunBasePayload

result instance-attribute
status instance-attribute
RunUsagePayload

Bases: TypedDict

completion_tokens instance-attribute
prompt_tokens instance-attribute
total_tokens instance-attribute
SettingsKwargs

Bases: TypedDict

default_max_output_tokens instance-attribute
default_model instance-attribute
default_temperature instance-attribute
max_dollars instance-attribute
max_input_chars instance-attribute
openai_api_key instance-attribute
prompt_dir instance-attribute
VersionHumanPayload

Bases: TypedDict

genai_service_version instance-attribute
platform instance-attribute
prompt_system_version instance-attribute
python instance-attribute
tnh_gen instance-attribute
tnh_scholar instance-attribute
VersionPayload

Bases: TypedDict

genai_service_version instance-attribute
platform instance-attribute
prompt_system_version instance-attribute
python instance-attribute
tnh_gen instance-attribute
tnh_scholar instance-attribute
trace_id instance-attribute

tnh_lines

Typer CLI for line numbering helpers.

tnh_lines
app = typer.Typer(name='tnh-lines', help='Prepare numbered text for sectioning workflows and convert it back to plain text.', add_completion=False, no_args_is_help=True) module-attribute
main()

Dispatch execution to the Typer application.

number_command(input_file=typer.Argument(..., help='Plain text source file.'), output_file=typer.Argument(..., help='Numbered output path.'), start=typer.Option(1, '--start', help='Starting line number.'), separator=typer.Option(':', '--separator', help='Line-number separator.'), no_clobber=typer.Option(False, '--no-clobber', help='Fail if the output file already exists.'))

Write numbered text in N:LINE format.

unnumber_command(input_file=typer.Argument(..., help='Numbered source file.'), output_file=typer.Argument(..., help='Plain-text output path.'), no_clobber=typer.Option(False, '--no-clobber', help='Fail if the output file already exists.'))

Strip numbering and write plain text.
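
The N:LINE round trip performed by the number and unnumber commands can be sketched with two small functions. These helpers are hypothetical stand-ins written for illustration; the parameter names mirror the --start and --separator options, and the rule for recognizing a numeric prefix is an assumption.

```python
def number_lines(text: str, start: int = 1, separator: str = ":") -> str:
    """Prefix each line with its number and the separator (N:LINE)."""
    lines = text.splitlines()
    return "\n".join(f"{start + offset}{separator}{line}" for offset, line in enumerate(lines))


def unnumber_lines(numbered: str, separator: str = ":") -> str:
    """Strip a leading 'N<separator>' prefix from each line."""
    plain = []
    for line in numbered.splitlines():
        prefix, found, rest = line.partition(separator)
        # Only strip when the prefix really is a line number; otherwise keep the line.
        plain.append(rest if found and prefix.isdigit() else line)
    return "\n".join(plain)
```

Splitting on the first separator only means content that itself contains a colon (for example `key: value`) survives the round trip. A trailing newline in the input is not preserved by this sketch.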

tnh_setup

__all__ = ['main', 'tnh_setup'] module-attribute
main()
prompt_display

Shared user-facing prompt setup descriptions.

PromptSetupDisplay dataclass

Human-readable prompt setup descriptions for setup commands.

__init__()
bundled_prompts() classmethod
repo_workspace() classmethod
tnh_setup

Legacy Click-based tnh-setup entrypoint.

Prefer tnh_setup_typer.py for the maintained CLI implementation. This compatibility path is retained temporarily and should not diverge.

OPENAI_ENV_HELP_MSG = "\n>>>>>>>>>> OpenAI API key not found in environment. <<<<<<<<<\n\nFor AI processing with TNH-scholar:\n\n1. Get an API key from https://platform.openai.com/api-keys\n2. Set the OPENAI_API_KEY environment variable:\n\n export OPENAI_API_KEY='your-api-key-here' # Linux/Mac\n set OPENAI_API_KEY=your-api-key-here # Windows\n\nFor OpenAI API access help: https://platform.openai.com/\n\n>>>>>>>>>>>>>>>>>>>>>>>>>>> -- <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n" module-attribute
SetupPaths dataclass

Resolved filesystem paths used by setup.

config_dir instance-attribute
log_dir instance-attribute
prompt_dir instance-attribute
__init__(config_dir, log_dir, prompt_dir)
create_config_dirs(paths)

Create required configuration directories.

main()

Entry point for setup CLI tool.

maybe_check_environment(*, skip_env)

Load env and report missing OpenAI configuration.

maybe_setup_ytdlp_runtime(*, skip_ytdlp_runtime)

Prompt for and run yt-dlp runtime setup.

report_prompt_setup(paths, *, skip_prompts)

Report prompt directory setup without external downloads.

tnh_setup(skip_env, skip_prompts, skip_ytdlp_runtime)

Set up TNH Scholar configuration.

tnh_setup_typer
app = typer.Typer(add_completion=False, no_args_is_help=False) module-attribute
PromptDecision dataclass
assume_yes instance-attribute
no_input instance-attribute
skip_env instance-attribute
skip_prompts instance-attribute
skip_ytdlp_runtime instance-attribute
verify_only instance-attribute
__init__(skip_env, skip_prompts, skip_ytdlp_runtime, verify_only, assume_yes, no_input)
SetupConfig dataclass
config_dir instance-attribute
log_dir instance-attribute
prompt_dir instance-attribute
__init__(config_dir, log_dir, prompt_dir)
main()
tnh_setup(skip_env=typer.Option(False, help='Skip OpenAI API key check.'), skip_prompts=typer.Option(False, help='Skip prompt directory setup guidance.'), skip_ytdlp_runtime=typer.Option(False, help='Skip yt-dlp runtime setup.'), verify_only=typer.Option(False, help='Only run environment verification.'), assume_yes=typer.Option(False, '--yes', '-y', help='Assume yes for all prompts.'), no_input=typer.Option(False, help='Fail if a prompt would be required.'))

Set up TNH Scholar configuration.

ui
SetupSummaryItem dataclass
component instance-attribute
status instance-attribute
style instance-attribute
__init__(component, status, style)
SetupUI dataclass
console instance-attribute
use_rich instance-attribute
__init__(console, use_rich)
banner()
create() classmethod
section(step, title, total=3)
spinner(label, action)
status(label, status, style='info')
summary(items)

tnh_tree

Developer tool for the tnh-scholar project.

This legacy utility generates repository tree snapshots for manual developer reference. It is no longer part of routine CI or release validation.

main()

CLI entry point registered as tnh-tree.

token_count

__all__ = ['main', 'token_count_cli'] module-attribute
main()

Entry point for the token-count CLI tool.

token_count_cli(input_file)

Return the OpenAI API token count of a text file. Based on gpt-4o.

token_count
main()

Entry point for the token-count CLI tool.

token_count_cli(input_file)

Return the OpenAI API token count of a text file. Based on gpt-4o.

utils

T = TypeVar('T') module-attribute
logger = get_child_logger(__name__) module-attribute
handle_cli_exception(message, exc)

Convert unexpected errors to Click-friendly messages.

run_or_fail(message, operation)

Execute an operation and re-raise failures as Click exceptions to avoid stack traces.

ytt_fetch

__all__ = ['main', 'ytt_fetch'] module-attribute
main()
ytt_fetch

Simple CLI tool for retrieving video transcripts.

This module provides a command line interface for downloading video transcripts in specified languages. It uses yt-dlp for video info extraction.

logger = get_child_logger(__name__) module-attribute
cleanup_files(keep, filepath)
export_data(output_path, data)
export_ttml_data(metadata, ttml_path, no_embed, output_path, keep)
generate_metadata(service, url, keep, output_path)
generate_transcript(service, url, lang, keep, no_embed, output_path)
get_ttml_download(dl, url, lang, output_path)
main()
ytt_fetch(url, lang, keep, info, no_embed, output)

YouTube Transcript Fetch: Retrieve and save transcripts for a YouTube video using yt-dlp.

configuration

Configuration utilities for TNH Scholar.

context

Runtime context discovery and path resolution.

BuiltinRootLocator

Resolves the built-in runtime_assets root.

resolve()
ContextIdFactory

Generates correlation and session identifiers.

build(correlation_id, session_id)
PromptDirectoryNames dataclass

Canonical prompt-directory names used by runtime discovery.

__init__()
builtin() classmethod
legacy_workspace() classmethod
user() classmethod
workspace() classmethod
PromptPathBuilder

Builds prompt search paths for a context.

__init__(context)
build()
primary()
RegistryCategory

Bases: StrEnum

Registry category for path resolution.

OVERRIDES = 'overrides' class-attribute instance-attribute
PROVIDERS = 'providers' class-attribute instance-attribute
RegistryPathBuilder

Builds registry search paths for a context.

__init__(context)
build(category)
TNHContext dataclass

Resolved runtime context for TNH Scholar.

builtin_root instance-attribute
correlation_id instance-attribute
session_id instance-attribute
user_root instance-attribute
workspace_root instance-attribute
__init__(builtin_root, workspace_root, user_root, correlation_id, session_id)
discover(*, workspace_root=None, user_root=None, correlation_id=None, session_id=None, start_path=None) classmethod
get_primary_prompt_dir()

Return the highest-precedence prompt directory that exists.

get_prompt_search_paths()

Return valid prompt directories in precedence order.

get_registry_search_paths(registry_type)
UserRootLocator

Resolves the user configuration root.

resolve()
WorkspaceDiscoveryPolicy dataclass

Policy for workspace discovery.

markers instance-attribute
stop_dir instance-attribute
__init__(markers, stop_dir)
default() classmethod
WorkspaceLocator

Locates a workspace root by walking upward.

__init__(policy)
find(start_path)
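
Locating a workspace root "by walking upward" generally means checking each ancestor directory for a marker file or directory. The sketch below shows that pattern with pathlib. It is not the WorkspaceLocator implementation: the function name is hypothetical, and the markers argument simply mirrors the markers and stop_dir fields of WorkspaceDiscoveryPolicy without knowing their real defaults.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_workspace_root(
    start_path: Path,
    markers: Sequence[str],
    stop_dir: Optional[Path] = None,
) -> Optional[Path]:
    """Return the nearest ancestor of start_path containing any marker."""
    current = start_path.resolve()
    if current.is_file():
        current = current.parent
    stop = stop_dir.resolve() if stop_dir is not None else None

    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
        # Do not search above the stop directory (checked after inspecting it).
        if stop is not None and candidate == stop:
            break
    return None
```

A call such as `find_workspace_root(Path.cwd(), [".git", "pyproject.toml"])` uses marker names chosen only for this example.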

exceptions

__all__ = ['TnhScholarError', 'ConfigurationError', 'ValidationError', 'ExternalServiceError', 'RateLimitError', 'NotRetryable', 'MetadataConflictError', 'SectionBoundaryError'] module-attribute

ConfigurationError

Bases: TnhScholarError

Configuration-related errors (missing env vars, invalid settings, etc.).

ExternalServiceError

Bases: TnhScholarError

Upstream/provider errors (HTTP 5xx, transport, transient provider issues).

MetadataConflictError

Bases: ValidationError

Raised when metadata merge encounters key conflicts in FAIL_ON_CONFLICT mode.

NotRetryable

Bases: TnhScholarError

Marker for errors where retry is known to be pointless (e.g., bad auth).

RateLimitError

Bases: ExternalServiceError

Upstream rate limits; typically retryable after a backoff.

SectionBoundaryError

Bases: ValidationError

Raised when section boundaries have gaps, overlaps, or out-of-bounds errors.

Note: Implementation is in text_object.py to avoid circular imports. This entry exists for documentation and to reserve the error name.

TnhScholarError

Bases: Exception

Base exception for all tnh_scholar errors.

Attributes:

Name Type Description
message

Human-readable summary.

context

Optional structured context to aid logging/diagnostics. Keep this JSON-serializable.

cause

Optional underlying exception.

__cause__ = cause instance-attribute
context = dict(context) if context else {} instance-attribute
message = message instance-attribute
__init__(message='', *, context=None, cause=None)
__str__()

ValidationError

Bases: TnhScholarError

Input/data validation errors (precondition failures before calling providers).
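
The attribute assignments documented for TnhScholarError (message, a copied context dict, and `__cause__`) suggest a constructor along the following lines. This is a standalone sketch with a hypothetical class name, and the `__str__` format in particular is an assumption since its output is not documented here.

```python
from typing import Any, Mapping, Optional


class SketchError(Exception):
    """Standalone stand-in that mirrors the documented attributes."""

    def __init__(
        self,
        message: str = "",
        *,
        context: Optional[Mapping[str, Any]] = None,
        cause: Optional[BaseException] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        # Copy the mapping so later mutation by the caller does not leak in.
        self.context = dict(context) if context else {}
        self.__cause__ = cause

    def __str__(self) -> str:
        # Assumed format: append context only when there is some.
        if self.context:
            return f"{self.message} | context={self.context}"
        return self.message


class SketchValidationError(SketchError):
    """Mirrors how ValidationError derives from the base error."""
```

Callers can then catch the base class once and still inspect structured details, for example `except SketchError as err: log.error(err.message, extra=err.context)`.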

journal_processing

__all__ = ['batch_section', 'batch_translate', 'generate_clean_batch', 'save_cleaned_data', 'save_sectioning_data', 'save_translation_data', 'setup_logger'] module-attribute

batch_section(input_xml_path, batch_jsonl, system_message, journal_name)

Split journal content into sections using GPT, with retries for starting and completing the batch.

batch_translate(input_xml_path, batch_json_path, metadata_path, system_message, journal_name)

Translates the journal sections using the GPT model. Saves the translated content back to XML.

generate_clean_batch(input_xml_file, output_file, system_message, user_wrap_function)

Generate a batch file for the OpenAI (OA) API using a single input XML file.

save_cleaned_data(cleaned_xml_path, cleaned_wrapped_pages, journal_name)

save_sectioning_data(output_json_path, raw_output_path, serial_json, journal_name)

save_translation_data(xml_output_path, translation_data, journal_name)

setup_logger(log_file_path)

Configures the logger to write to a log file and the console. Adds a custom "PRIORITY_INFO" logging level for important messages.

journal_process

BATCH_RETRY_DELAY = 5 module-attribute
DEFAULT_JOURNAL_MODEL = 'gpt-4o' module-attribute
DEFAULT_MODEL_SETTINGS = {'gpt-4o': {'max_tokens': 16000, 'temperature': 1.0}, 'gpt-3.5-turbo': {'max_tokens': 4096, 'temperature': 1.0}, 'gpt-4o-mini': {'max_tokens': 16000, 'temperature': 1.0}} module-attribute
MAX_BATCH_RETRIES = 40 module-attribute
MAX_TOKEN_LIMIT = 60000 module-attribute
journal_schema = {'type': 'object', 'properties': {'journal_summary': {'type': 'string'}, 'sections': {'type': 'array', 'items': {'type': 'object', 'properties': {'title_vi': {'type': 'string'}, 'title_en': {'type': 'string'}, 'author': {'type': ['string', 'null']}, 'summary': {'type': 'string'}, 'keywords': {'type': 'array', 'items': {'type': 'string'}}, 'start_page': {'type': 'integer', 'minimum': 1}, 'end_page': {'type': 'integer', 'minimum': 1}}, 'required': ['title_vi', 'title_en', 'summary', 'keywords', 'start_page', 'end_page']}}}, 'required': ['journal_summary', 'sections']} module-attribute
logger = logging.getLogger('journal_process') module-attribute
ModelSettings

Bases: TypedDict

max_tokens instance-attribute
temperature instance-attribute
batch_section(input_xml_path, batch_jsonl, system_message, journal_name)

Split journal content into sections using GPT, with retries for starting and completing the batch.

batch_translate(input_xml_path, batch_json_path, metadata_path, system_message, journal_name)

Translates the journal sections using the GPT model. Saves the translated content back to XML.

create_jsonl_file_for_batch(messages, output_file_path=None, max_token_list=None, model=DEFAULT_JOURNAL_MODEL, tools=None, json_mode=False)

Write a JSONL batch file mirroring the legacy OpenAI format.

deserialize_json(serialized_data)

Converts a serialized JSON string into a Python dictionary.

Parameters:

Name Type Description Default
serialized_data str

The JSON string to deserialize.

required

Returns:

Name Type Description
dict dict

The deserialized Python dictionary.

extract_page_groups_from_metadata(metadata)

Extracts page groups from the section metadata for use with split_xml_pages.

Parameters:

Name Type Description Default
metadata dict

The section metadata containing sections with start and end pages.

required

Returns:

Type Description
list

List[Tuple[int, int]]: A list of tuples, each representing a page range (start_page, end_page).
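
Given metadata shaped like journal_schema (a "sections" array whose items carry start_page and end_page), the extraction can be sketched as below. The function name is hypothetical and the handling of a missing "sections" key is an assumption.

```python
from typing import Any, Dict, List, Tuple


def extract_page_groups_sketch(metadata: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Collect (start_page, end_page) for every section in the metadata."""
    groups: List[Tuple[int, int]] = []
    # Assumption: metadata without a "sections" key yields no page groups.
    for section in metadata.get("sections", []):
        groups.append((int(section["start_page"]), int(section["end_page"])))
    return groups
```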

generate_all_batches(processed_document_dir, system_message, user_wrap_function, file_regex='.*\\.xml')

Generate cleaning batches for all journals in the specified directory.

Parameters:

Name Type Description Default
processed_document_dir str

Path to the directory containing processed journal data.

required
system_message str

System message template for batch processing.

required
user_wrap_function callable

Function to wrap user input for processing pages.

required
file_regex str

Regex pattern to identify target files (default: ".*\.xml").

'.*\\.xml'
generate_clean_batch(input_xml_file, output_file, system_message, user_wrap_function)

Generate a batch file for the OpenAI (OA) API using a single input XML file.

generate_messages(system_message, user_message_wrapper, data_list_to_process, log_system_message=True)

Build OpenAI-style chat message payloads.

generate_single_oa_batch_from_pages(input_xml_file, output_file, system_message, user_wrap_function)

Deprecated. Generate a batch file for the OpenAI (OA) API using a single input XML file.

run_immediate_chat_process(messages, max_tokens=0, response_format=None, model=DEFAULT_JOURNAL_MODEL)

Legacy-compatible immediate completion powered by GenAI simple_completion.

save_cleaned_data(cleaned_xml_path, cleaned_wrapped_pages, journal_name)
save_sectioning_data(output_json_path, raw_output_path, serial_json, journal_name)
save_translation_data(xml_output_path, translation_data, journal_name)
send_data_for_tx_batch(batch_jsonl_path, section_data_to_send, system_message, max_token_list, journal_name, immediate=False)

Sends data for translation batch or immediate processing.

Parameters:

Name Type Description Default
batch_jsonl_path Path

Path for the JSONL file to save batch data.

required
section_data_to_send List

List of section data to translate.

required
system_message str

System message for the translation process.

required
max_token_list List

List of max tokens for each section.

required
journal_name str

Name of the journal being processed.

required
immediate bool

If True, run immediate chat processing instead of batch.

False

Returns:

Name Type Description
List list

Translated data from the batch or immediate process.

setup_logger(log_file_path)

Configures the logger to write to a log file and the console. Adds a custom "PRIORITY_INFO" logging level for important messages.

start_batch_with_retries(jsonl_file, description='', max_retries=MAX_BATCH_RETRIES, retry_delay=BATCH_RETRY_DELAY, poll_interval=10, timeout=3600)

Simulate the legacy batch runner using sequential simple_completion calls.

The parameters mirror the old interface so callers remain unchanged, but the implementation now iterates through the JSONL requests locally.
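
Iterating through JSONL requests locally, with a bounded number of retries per request, might look like the sketch below. It is an illustration of the pattern only: the function name is hypothetical, and the `complete` callable is a placeholder for whatever completion call the real implementation makes.

```python
import json
import time
from typing import Any, Callable, Dict, List


def run_requests_sequentially(
    jsonl_text: str,
    complete: Callable[[Dict[str, Any]], str],
    max_retries: int = 3,
    retry_delay: float = 0.0,
) -> List[str]:
    """Process one JSON request per line, retrying each on failure."""
    results: List[str] = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the batch file
        request = json.loads(line)
        for attempt in range(1, max_retries + 1):
            try:
                results.append(complete(request))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up and surface the last error
                time.sleep(retry_delay)
    return results
```

Because results are appended in input order, callers that relied on batch output ordering need no changes.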

translate_sections(batch_jsonl_path, system_message, section_contents, section_metadata, journal_name, immediate=False)

Build up sections in batches to translate.

unwrap_all_lines(pages)
unwrap_lines(text)
Removes angle brackets (< >) from encapsulated lines and merges them into
a newline-separated string.

Parameters:
    text (str): The input string with encapsulated lines.

Returns:
    str: A newline-separated string with the encapsulation removed.

Example:
    >>> unwrap_lines("<Line 1> <Line 2> <Line 3>")
    'Line 1\nLine 2\nLine 3'

validate_and_clean_data(data, schema)

Recursively validate and clean AI-generated data to fit the given schema. Any missing fields are filled with defaults, and extra fields are ignored.

Parameters:

Name Type Description Default
data dict

The AI-generated data to validate and clean.

required
schema dict

The schema defining the required structure.

required

Returns:

Name Type Description
dict dict

The cleaned data adhering to the schema.
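
Recursive cleaning against a JSON-Schema-like dictionary (fill missing fields, drop extras) can be sketched as follows. The default values chosen per type are assumptions, as is the decision to pass mismatched values through unchanged; the real function may behave differently.

```python
from typing import Any, Dict, List


def _types_of(schema: Dict[str, Any]) -> List[Any]:
    declared = schema.get("type")
    return declared if isinstance(declared, list) else [declared]


def _default_for(schema: Dict[str, Any]) -> Any:
    """Pick a placeholder for a missing field (assumed defaults)."""
    types = _types_of(schema)
    if "null" in types:
        return None
    if "object" in types:
        return clean_to_schema({}, schema)
    if "array" in types:
        return []
    return {"string": "", "integer": 0, "number": 0.0, "boolean": False}.get(types[0])


def clean_to_schema(data: Any, schema: Dict[str, Any]) -> Any:
    """Keep only schema-declared fields, filling in any that are missing."""
    types = _types_of(schema)
    if "object" in types and isinstance(data, dict):
        properties = schema.get("properties", {})
        return {
            key: clean_to_schema(data[key], sub) if key in data else _default_for(sub)
            for key, sub in properties.items()
        }
    if "array" in types and isinstance(data, list):
        item_schema = schema.get("items", {})
        return [clean_to_schema(item, item_schema) for item in data]
    # Scalars and mismatched values are returned unchanged in this sketch.
    return data
```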

validate_and_save_metadata(output_file_path, json_metadata_serial, schema)

Validates and cleans journal data against the schema, then writes it to a JSON file.

Returns:

Name Type Description
bool bool

True if successfully written to the file, False otherwise.

wrap_all_lines(pages)
wrap_lines(text)
Encloses each line of the input text with angle brackets.

Args:
    text (str): The input string containing lines separated by '\n'.

Returns:
    str: A string where each line is enclosed in angle brackets.

Example:
    >>> wrap_lines("This is a string with\n two lines.")
    '<This is a string with>\n< two lines.>'
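
The wrap and unwrap behavior described for wrap_lines and unwrap_lines can be approximated in a few lines. These are standalone sketches with hypothetical names, and the regular expression is an assumed approach that does not handle lines which themselves contain ">".

```python
import re


def wrap_lines_sketch(text: str) -> str:
    """Enclose each newline-separated line in angle brackets."""
    return "\n".join(f"<{line}>" for line in text.split("\n"))


def unwrap_lines_sketch(text: str) -> str:
    """Extract the content of each <...> group and join with newlines."""
    # Non-greedy match captures whatever sits between each bracket pair.
    return "\n".join(re.findall(r"<(.*?)>", text))
```

Wrapping gives each line an explicit boundary that survives whitespace changes, so the original line structure can still be recovered afterwards.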

logging_config

TNH-Scholar Logging Utilities

A production-ready, environment-driven logging system for the TNH-Scholar project. It provides JSON logs in production, color/plain text in development, optional non-blocking queue logging, file rotation, noise suppression for chatty deps, and optional routing of Python warnings into the logging pipeline.

This module is designed for application layer configuration and library layer usage:

  • Applications (CLI, Streamlit, FastAPI, notebooks) call :func:setup_logging.
  • Libraries / services (e.g., gen_ai_service, IssueHandler) only acquire a logger via :func:get_logger (or legacy :func:get_child_logger) and never configure global logging.

Quick start

Application entry point (recommended):

>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()  # reads env; see variables below
>>> log = get_logger(__name__)
>>> log.info("app started", extra={"service": "gen-ai"})

Jupyter / dev (force color in non-TTY):

>>> import os
>>> os.environ["APP_ENV"] = "dev"
>>> os.environ["LOG_JSON"] = "false"
>>> os.environ["LOG_COLOR"] = "true"  # Jupyter isn't a TTY; force color
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("hello, color")

Library / service modules (do NOT configure logging):

>>> from tnh_scholar.logging_config import get_logger
>>> log = get_logger(__name__)
>>> log.info("library message")

Behavior by environment
  • dev (default):
    • Plain or color text to stdout by default.
    • Queue logging disabled by default (synchronous).
    • Color auto-detects TTY and Jupyter/IPython (can be forced).
  • prod:
    • JSON logs to stderr by default (suitable for log shippers).
    • Queue logging enabled by default (can be disabled).

Environment variables

Most behavior is controlled by environment variables (read when setup_logging() instantiates :class:LogSettings). Truthy values accept true/1/yes/on (case-insensitive).

  • APP_ENV: dev | prod | test (default: dev)
  • LOG_LEVEL: Logging level for the base project logger (default: INFO)
  • LOG_STDOUT: Emit logs to stdout (default: true)
  • LOG_FILE_ENABLE: Emit logs to a file (default: false)
  • LOG_FILE_PATH: File path for logs (default: ./logs/main.log)
  • LOG_ROTATE_BYTES: Rotate at N bytes (e.g., 10485760) (default: unset)
  • LOG_ROTATE_WHEN: Timed rotation (e.g., midnight) (default: unset)
  • LOG_BACKUPS: Number of rotated file backups (default: 5)
  • LOG_JSON: Use JSON formatter (recommended in prod) (default: true)
  • LOG_COLOR: true | false | auto (default: auto)
  • LOG_STREAM: stdout | stderr (default: stderr; dev defaults to stdout)
  • LOG_USE_QUEUE: Use QueueHandler/QueueListener (default: true; dev defaults to false)
  • LOG_CAPTURE_WARNINGS: Route Python warnings via logging (default: false)
  • LOG_SUPPRESS: Comma-separated list of noisy module names to set to WARNING (default includes urllib3, httpx, openai, uvicorn.*, etc.)
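The read-at-instantiation behavior and the boolean parsing described above can be sketched with a small dataclass. This is a minimal illustration under stated assumptions, not the module's code: `env_bool` and `DemoSettings` are hypothetical names, and only two of the variables are modeled.

```python
import os
from dataclasses import dataclass, field

TRUTHY = {"true", "1", "yes", "on"}


def env_bool(name: str, default: str) -> bool:
    """Hypothetical helper: interpret an environment variable as a boolean."""
    return os.environ.get(name, default).strip().lower() in TRUTHY


@dataclass
class DemoSettings:
    """Hypothetical stand-in for LogSettings; values are read at instantiation."""

    environment: str = field(default_factory=lambda: os.environ.get("APP_ENV", "dev"))
    json_format: bool = field(default_factory=lambda: env_bool("LOG_JSON", "true"))


os.environ["LOG_JSON"] = "Yes"  # matching is case-insensitive
first = DemoSettings()
os.environ["LOG_JSON"] = "off"  # only instances created later see this change
second = DemoSettings()
print(first.json_format, second.json_format)
```

Because the defaults are factories rather than fixed values, environment variables need to be set before the settings object is created, which is why the examples in this section assign to os.environ ahead of the setup_logging() call.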

Backward compatibility
  • get_child_logger(name, console=False, separate_file=False) remains available and can attach ad-hoc console/file handlers without reconfiguring the project base logger. When custom handlers are attached, the child’s propagation is turned off to avoid duplicate messages.
  • setup_logging_legacy(...) forwards to :func:setup_logging and emits a DeprecationWarning to help locate legacy call sites.
  • Custom level PRIORITY_INFO (25) and :meth:logger.priority_info still exist but are deprecated. Prefer:

    log.info("message", extra={"priority": "high"})

This keeps level semantics standard and plays better with structured logging.
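As a rough illustration of how the `extra` field travels with a record, the sketch below uses only the standard library. The in-memory `ListHandler` and the format string are assumptions made for the demo; they are not part of this module.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted lines in memory so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
# `priority` is not a standard LogRecord attribute; it exists on the record
# only because the call below passes it through `extra`.
handler.setFormatter(logging.Formatter("%(levelname)s [%(priority)s] %(message)s"))

log = logging.getLogger("demo.priority")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("message", extra={"priority": "high"})
print(handler.lines[0])
```

The record keeps the standard INFO level, and the priority becomes an ordinary attribute that formatters or downstream processors can read. A format string that references a custom field expects every record to carry it, so a structured (JSON) formatter is usually a more forgiving place to surface such fields.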


Queue logging notes
  • When LOG_USE_QUEUE=true, the base logger uses a :class:QueueHandler. A :class:QueueListener is started with sinks mirroring your configured stdout/file handlers. This decouples log emission from I/O to minimize latency.
  • In notebooks or during debugging, you may prefer synchronous logs:

    os.environ["LOG_USE_QUEUE"] = "false"
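The queue arrangement described above can be sketched with the standard library alone. This is an illustration of the general QueueHandler/QueueListener pattern, not this module's wiring; the `ListHandler` sink stands in for the real stdout/file handlers.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in sink that records messages instead of writing to a stream."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


sink = ListHandler()
log_queue = queue.Queue(-1)  # unbounded queue between emitter and listener

# The logger only enqueues records; the listener thread performs the I/O.
queue_handler = logging.handlers.QueueHandler(log_queue)
listener = logging.handlers.QueueListener(log_queue, sink)

log = logging.getLogger("demo.queue")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(queue_handler)

listener.start()
log.info("queued message")
listener.stop()  # waits for the listener thread to drain the queue
print(sink.messages)
```

Since records are written by a background thread, output can lag behind the call that produced it, which is one reason synchronous logging is often more convenient in notebooks and debuggers.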


Python warnings routing
  • When LOG_CAPTURE_WARNINGS=true, Python warnings are captured and logged through py.warnings. This module attaches the base logger’s handlers to that logger and disables propagation to avoid duplicate output.
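A minimal standard-library sketch of the warnings routing described above follows. The `ListHandler` is a hypothetical stand-in for the base logger's handlers; the rest uses logging.captureWarnings and the py.warnings logger as provided by Python.

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Stand-in for the base logger's handlers; keeps messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
warnings_logger = logging.getLogger("py.warnings")
warnings_logger.addHandler(handler)
warnings_logger.propagate = False  # avoid a second copy via ancestor loggers

logging.captureWarnings(True)
try:
    with warnings.catch_warnings():
        warnings.simplefilter("always")  # make sure the demo warning is emitted
        warnings.warn("deprecated call", UserWarning)
finally:
    logging.captureWarnings(False)  # restore the default warning display

print(handler.messages[0])
```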

Mixing print() and logging
  • print() writes to stdout; the logger can write to stdout or stderr depending on LOG_STREAM and environment. Ordering is not guaranteed, especially with queue logging enabled. Prefer logging for consistent output.

Minimal examples

CLI / entrypoint:

>>> import os
>>> os.environ.setdefault("APP_ENV", "prod")
>>> os.environ.setdefault("LOG_JSON", "true")
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("ready")

File logging with rotation:

>>> import os
>>> os.environ.update({
...     "LOG_FILE_ENABLE": "true",
...     "LOG_FILE_PATH": "./logs/app.log",
...     "LOG_ROTATE_BYTES": "10485760",  # 10MB
...     "LOG_BACKUPS": "7",
... })
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger("smoke").info("to file")

Jupyter with color:

>>> import os
>>> os.environ.update({"APP_ENV": "dev", "LOG_JSON": "false", "LOG_COLOR": "true"})
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("color in notebook")

Notes
  • JSON formatting requires python-json-logger; without it, we fall back to plain/color format automatically.
  • This module never configures the root logger; it configures the project base logger (tnh) so your app can coexist with other libraries cleanly.

BASE_LOG_DIR = Path('./logs') module-attribute

BASE_LOG_NAME = 'tnh' module-attribute

DEFAULT_CONSOLE_FORMAT_STRING = LOG_FMT_COLOR module-attribute

DEFAULT_FILE_FORMAT_STRING = '%(asctime)s - %(name)s - %(levelname)s - %(message)s' module-attribute

DEFAULT_LOG_FILEPATH = Path('main.log') module-attribute

JsonFormatter = getattr(_pythonjsonlogger_json, 'JsonFormatter', None) module-attribute

LOG_COLORS = {'DEBUG': 'bold_green', 'INFO': 'cyan', 'PRIORITY_INFO': 'bold_cyan', 'WARNING': 'bold_yellow', 'ERROR': 'bold_red', 'CRITICAL': 'bold_red'} module-attribute

LOG_FMT_COLOR = '%(asctime)s | %(log_color)s%(levelname)-8s%(reset)s | %(name)s | %(message)s' module-attribute

LOG_FMT_JSON = '%(asctime)s %(levelname)s %(name)s %(message)s %(process)d %(thread)d %(module)s %(filename)s %(lineno)d' module-attribute

LOG_FMT_PLAIN = '%(asctime)s | %(levelname)-8s | %(name)s | %(message)s' module-attribute

MAX_FILE_SIZE = 10 * 1024 * 1024 module-attribute

PRIORITY_INFO_LEVEL = 25 module-attribute

__all__ = ['BASE_LOG_NAME', 'BASE_LOG_DIR', 'DEFAULT_LOG_FILEPATH', 'MAX_FILE_SIZE', 'OMPFilter', 'setup_logging', 'setup_logging_legacy', 'get_logger', 'get_child_logger'] module-attribute

LogSettings dataclass

Environment-driven logging settings with sensible defaults.

backups = field(default_factory=(lambda: _env_int('LOG_BACKUPS', 5))) class-attribute instance-attribute
base_name = field(default_factory=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME))) class-attribute instance-attribute
capture_warnings = field(default_factory=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false'))) class-attribute instance-attribute
colorize = field(default_factory=(lambda: _env_str('LOG_COLOR', 'auto'))) class-attribute instance-attribute
environment = field(default_factory=(lambda: _env_str('APP_ENV', 'dev'))) class-attribute instance-attribute
file_path = field(default_factory=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH))))) class-attribute instance-attribute
json_format = field(default_factory=(lambda: _env_bool('LOG_JSON', 'true'))) class-attribute instance-attribute
level = field(default_factory=(lambda: _env_str('LOG_LEVEL', 'INFO'))) class-attribute instance-attribute
log_stream = field(default_factory=(lambda: _env_str('LOG_STREAM', 'stderr'))) class-attribute instance-attribute
rotate_bytes = field(default_factory=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None)) class-attribute instance-attribute
rotate_when = field(default_factory=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None)) class-attribute instance-attribute
suppress_modules = field(default_factory=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access'))) class-attribute instance-attribute
to_file = field(default_factory=(lambda: _env_bool('LOG_FILE_ENABLE', 'false'))) class-attribute instance-attribute
to_stdout = field(default_factory=(lambda: _env_bool('LOG_STDOUT', 'true'))) class-attribute instance-attribute
use_queue = field(default_factory=(lambda: _env_bool('LOG_USE_QUEUE', 'true'))) class-attribute instance-attribute
__init__(environment=(lambda: _env_str('APP_ENV', 'dev'))(), base_name=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME))(), level=(lambda: _env_str('LOG_LEVEL', 'INFO'))(), to_stdout=(lambda: _env_bool('LOG_STDOUT', 'true'))(), to_file=(lambda: _env_bool('LOG_FILE_ENABLE', 'false'))(), file_path=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH))))(), rotate_when=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None)(), rotate_bytes=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None)(), backups=(lambda: _env_int('LOG_BACKUPS', 5))(), json_format=(lambda: _env_bool('LOG_JSON', 'true'))(), colorize=(lambda: _env_str('LOG_COLOR', 'auto'))(), capture_warnings=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false'))(), log_stream=(lambda: _env_str('LOG_STREAM', 'stderr'))(), use_queue=(lambda: _env_bool('LOG_USE_QUEUE', 'true'))(), suppress_modules=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access'))())
__post_init__()
is_dev()
selected_stream()

Return the Python stream object to emit logs to (stdout or stderr).

should_color()

LoggingConfigurator

settings = settings or LogSettings() instance-attribute
__init__(settings=None)
apply_config(config)
apply_legacy_args(*, log_level, log_filepath, max_log_file_size, backup_count, console)
build_config(*, filters, formatters, handlers)
build_filters()
build_formatters()
build_handlers(formatters)
configure(*, legacy_args, suppressed_modules)
select_base_handlers(handlers)
start_queue_listener(handlers)
suppress_noise(modules_override, force=False)

OMPFilter

Bases: Filter

filter(record)

UtcFormatter

Bases: Formatter

UTC ISO-8601 timestamps for plain text logging.

converter = time.gmtime class-attribute instance-attribute
formatTime(record, datefmt=None)
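One way to obtain UTC ISO-8601 timestamps from a standard formatter is sketched below. `DemoUtcFormatter` is a hypothetical class written for this example; the exact timestamp layout (millisecond precision, trailing `Z`) is an assumption and may differ from what UtcFormatter emits.

```python
import logging
import time


class DemoUtcFormatter(logging.Formatter):
    """Hypothetical sketch: render record times in UTC instead of local time."""

    converter = time.gmtime  # the base class would otherwise use time.localtime

    def formatTime(self, record, datefmt=None):
        # ISO-8601 date and time, millisecond precision, explicit UTC designator.
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S", self.converter(record.created))
        return f"{stamp}.{int(record.msecs):03d}Z"


# Build a record by hand and pin its timestamp to the Unix epoch for the demo.
record = logging.LogRecord("demo", logging.INFO, "demo.py", 1, "hello", None, None)
record.created = 0.0
record.msecs = 0.0

formatter = DemoUtcFormatter("%(asctime)s | %(message)s")
print(formatter.format(record))
```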

get_child_logger(name, console=False, separate_file=False)

Get a child logger that writes logs to a console or a specified file.

Parameters:

Name Type Description Default
name str

The name of the child logger (e.g., module name).

required
console bool

If True, log to the console. If False, do not log to the console. If None, inherit console behavior from the parent logger.

False

Returns:

Type Description
Logger

logging.Logger: Configured child logger.

get_logger(name)

Preferred helper: returns a namespaced logger under the base project name.

Backwards-compatible with existing call sites that used get_child_logger(name).

priority_info(self, message, *args, **kwargs)

Deprecated: use logger.info(msg, extra={"priority": "high"}) instead.

This custom level (25) was introduced for highlighting important informational events, but it complicates interoperability with external log shippers and structured log processing. The recommended migration path is to log at the standard INFO level with an added extra field indicating priority.

Example

logger.info("Important event", extra={"priority": "high"})

setup_logging(log_level=logging.INFO, log_filepath=DEFAULT_LOG_FILEPATH, max_log_file_size=MAX_FILE_SIZE, backup_count=5, console=True, suppressed_modules=None, *, settings=None)

Initialize project-wide logging using dictConfig, with JSON in prod and colorized/plain text in dev.

Backward compatible with previous signature. Prefer using env vars or pass a LogSettings via the keyword-only settings parameter.

setup_logging_legacy(*args, **kwargs)

Deprecated: use setup_logging().

This wrapper preserves old call sites during migration. It emits a DeprecationWarning (once per process) and forwards all arguments to the current setup_logging().
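The warn-once-and-forward behavior described above follows a common deprecation pattern, sketched here with hypothetical names (`setup_logging_demo`, `setup_logging_legacy_demo`). It is not the wrapper's actual source.

```python
import warnings

_state = {"warned": False}  # remembers whether the warning was already issued


def setup_logging_demo(*args, **kwargs):
    """Hypothetical stand-in for the current entry point."""
    return ("configured", args, kwargs)


def setup_logging_legacy_demo(*args, **kwargs):
    """Warn once per process, then forward every argument unchanged."""
    if not _state["warned"]:
        _state["warned"] = True
        warnings.warn(
            "setup_logging_legacy is deprecated; use setup_logging()",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the legacy call site
        )
    return setup_logging_demo(*args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = setup_logging_legacy_demo(10, console=False)
    second = setup_logging_legacy_demo(20)

print(len(caught), first, second)
```

Setting `stacklevel=2` makes the reported file and line point at the caller rather than at the wrapper, which is what helps locate legacy call sites.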

metadata

__all__ = ['Frontmatter', 'Metadata', 'ProcessMetadata'] module-attribute

Frontmatter

Handles YAML frontmatter embedding and extraction.

Note: extract is pure (no I/O). extract_from_file performs I/O and should be treated as adapter-level convenience, not domain-level parsing.

embed(metadata, content) classmethod

Embed metadata as YAML frontmatter.

Parameters:

Name Type Description Default
metadata Metadata

Dictionary of metadata

required
content str

Content text

required

Returns:

Type Description
str

Text with embedded frontmatter

extract(content) staticmethod

Extract frontmatter and content from text.

Parameters:

Name Type Description Default
content str

Text with optional YAML frontmatter

required

Returns:

Type Description
tuple[Metadata, str]

Tuple of (metadata object, remaining content)
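The pure, I/O-free nature of extract can be illustrated with a standard-library sketch that only separates the frontmatter block from the body. `split_frontmatter` is a hypothetical helper; it assumes `---` delimiter lines and returns the raw YAML text rather than a parsed Metadata object.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (raw_yaml, body); raw_yaml is '' when no frontmatter is found."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            raw_yaml = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return raw_yaml, body
    return "", text  # no closing delimiter: treat everything as content


document = "---\ntitle: Talk\nlang: vi\n---\nBody text\n"
raw_yaml, body = split_frontmatter(document)
print(repr(raw_yaml), repr(body))
```

Keeping the parsing step free of file access makes it easy to test with plain strings; reading from disk can then live in a thin wrapper such as extract_from_file.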

extract_from_file(file) classmethod

Adapter-level convenience wrapper that reads from disk then parses.

generate(metadata) staticmethod

Metadata

Bases: MutableMapping

Flexible metadata container that behaves like a dict while ensuring JSON serializability. Designed for AI processing pipelines where schema flexibility is prioritized over structure.
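The idea of a dict-like container that guards JSON serializability can be sketched with collections.abc.MutableMapping. `JsonSafeMapping` is a hypothetical class for illustration; in particular, rejecting unsupported values is an assumption, and the real class may convert them instead.

```python
import json
from collections.abc import MutableMapping


class JsonSafeMapping(MutableMapping):
    """Hypothetical sketch: behaves like a dict but refuses non-JSON values."""

    def __init__(self, data=None):
        self._data = {}
        if data:
            self.update(data)  # update() routes each item through __setitem__

    def __setitem__(self, key, value):
        try:
            json.dumps(value)  # probe: raises if the value cannot be serialized
        except (TypeError, ValueError) as exc:
            raise TypeError(f"value for {key!r} is not JSON serializable") from exc
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


meta = JsonSafeMapping({"title": "Talk", "tags": ["a", "b"]})
meta["page"] = 3
try:
    meta["bad"] = {1, 2}  # sets have no JSON representation
    rejected = False
except TypeError:
    rejected = True
print(dict(meta), rejected)
```

Implementing the five abstract methods is enough for MutableMapping to supply the rest of the dict interface (get, items, update, pop, and so on).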

process_history property

Access process history with proper typing.

__delitem__(key)
__get_pydantic_core_schema__(source_type, handler) classmethod

Defines the Pydantic core schema for the Metadata class.

This method allows Pydantic to validate Metadata objects as dictionaries. It handles both direct Metadata instances and dictionaries during validation, providing flexibility for data input.

Parameters:

Name Type Description Default
source_type Any

The source type being validated.

required
handler Callable[[Any], CoreSchema]

A callable to handle schema generation for other types.

required

Returns:

Type Description
CoreSchema

A Pydantic core schema that validates either a Metadata instance (by converting it to a dictionary) or a standard dictionary.

__getitem__(key)
__init__(data=None)
__ior__(other)
__iter__()
__len__()
__or__(other)
__repr__()
__ror__(other)
__setitem__(key, value)

Process and set value, ensuring JSON serializability.

__str__()
add_process_info(process_metadata)

Add process metadata to history.

copy()

Create a deep copy of the metadata object.

from_dict(data) classmethod

Create from a plain dict.

from_fields(data, fields) classmethod

Create a Metadata object by extracting specified fields from a dictionary.

Parameters:

Name Type Description Default
data dict

Source dictionary

required
fields list[str]

List of field names to extract

required

Returns:

Type Description
Metadata

New Metadata instance with only specified fields
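The field-selection step can be pictured as a dictionary comprehension. `pick_fields` is a hypothetical helper returning a plain dict; skipping names that are absent from the source is an assumption, since this reference does not say how missing fields are handled.

```python
def pick_fields(data: dict, fields: list[str]) -> dict:
    """Hypothetical helper: keep only the requested keys that exist in `data`."""
    return {name: data[name] for name in fields if name in data}


source = {"title": "Talk", "year": 1998, "internal_id": 42}
selected = pick_fields(source, ["title", "year", "missing"])
print(selected)
```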

from_yaml(yaml_str) classmethod

Create Metadata instance from YAML string.

Parameters:

Name Type Description Default
yaml_str str

YAML formatted string

required

Returns:

Type Description
Metadata

New Metadata instance

Raises:

Type Description
YAMLError

If YAML parsing fails

text_embed(content)
to_dict()

Convert to plain dict for JSON serialization.

to_yaml()

Return metadata as YAML formatted string

ProcessMetadata

Bases: Metadata

Records information about a specific processing operation.

__init__(step, processor, tool=None, **additional_params)

metadata

JsonValue = Union[str, int, float, bool, list, dict, None] module-attribute
logger = get_child_logger(__name__) module-attribute
Frontmatter

Handles YAML frontmatter embedding and extraction.

Note: extract is pure (no I/O). extract_from_file performs I/O and should be treated as adapter-level convenience, not domain-level parsing.

embed(metadata, content) classmethod

Embed metadata as YAML frontmatter.

Parameters:

Name Type Description Default
metadata Metadata

Dictionary of metadata

required
content str

Content text

required

Returns:

Type Description
str

Text with embedded frontmatter

extract(content) staticmethod

Extract frontmatter and content from text.

Parameters:

Name Type Description Default
content str

Text with optional YAML frontmatter

required

Returns:

Type Description
tuple[Metadata, str]

Tuple of (metadata object, remaining content)

extract_from_file(file) classmethod

Adapter-level convenience wrapper that reads from disk then parses.

generate(metadata) staticmethod
Metadata

Bases: MutableMapping

Flexible metadata container that behaves like a dict while ensuring JSON serializability. Designed for AI processing pipelines where schema flexibility is prioritized over structure.

process_history property

Access process history with proper typing.

__delitem__(key)
__get_pydantic_core_schema__(source_type, handler) classmethod

Defines the Pydantic core schema for the Metadata class.

This method allows Pydantic to validate Metadata objects as dictionaries. It handles both direct Metadata instances and dictionaries during validation, providing flexibility for data input.

Parameters:

Name Type Description Default
source_type Any

The source type being validated.

required
handler Callable[[Any], CoreSchema]

A callable to handle schema generation for other types.

required

Returns:

Type Description
CoreSchema

A Pydantic core schema that validates either a Metadata instance (by converting it to a dictionary) or a standard dictionary.

__getitem__(key)
__init__(data=None)
__ior__(other)
__iter__()
__len__()
__or__(other)
__repr__()
__ror__(other)
__setitem__(key, value)

Process and set value, ensuring JSON serializability.

__str__()
add_process_info(process_metadata)

Add process metadata to history.

copy()

Create a deep copy of the metadata object.

from_dict(data) classmethod

Create from a plain dict.

from_fields(data, fields) classmethod

Create a Metadata object by extracting specified fields from a dictionary.

Parameters:

Name Type Description Default
data dict

Source dictionary

required
fields list[str]

List of field names to extract

required

Returns:

Type Description
Metadata

New Metadata instance with only specified fields

from_yaml(yaml_str) classmethod

Create Metadata instance from YAML string.

Parameters:

Name Type Description Default
yaml_str str

YAML formatted string

required

Returns:

Type Description
Metadata

New Metadata instance

Raises:

Type Description
YAMLError

If YAML parsing fails

text_embed(content)
to_dict()

Convert to plain dict for JSON serialization.

to_yaml()

Return metadata as YAML formatted string

ProcessMetadata

Bases: Metadata

Records information about a specific processing operation.

__init__(step, processor, tool=None, **additional_params)
safe_yaml_load(yaml_str, *, context='unknown')

ocr_processing

__all__ = ['PDFParseWarning', 'annotate_image_with_text', 'build_processed_pdf', 'deserialize_entity_annotations_from_json', 'extract_image_from_page', 'get_page_dimensions', 'load_pdf_pages', 'load_processed_PDF_data', 'make_image_preprocess_mask', 'pil_to_bytes', 'process_page', 'process_single_image', 'save_processed_pdf_data', 'serialize_entity_annotations_to_json', 'start_image_annotator_client'] module-attribute

PDFParseWarning

Bases: Warning

Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.

warn(message) staticmethod

Display a warning message with custom formatting.

Parameters:

Name Type Description Default
message str

The warning message to display.

required

annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)

Annotates a PIL image with bounding boxes and text descriptions from OCR results.

Parameters:

Name Type Description Default
image Image

The input PIL image to annotate.

required
text_annotations List[EntityAnnotation]

OCR results containing bounding boxes and text.

required
annotation_font_path str

Path to the font file for text annotations.

required
font_size int

Font size for text annotations.

12

Returns:

Type Description
Image

Image.Image: The annotated PIL image.

Raises:

Type Description
ValueError

If the input image is None.

IOError

If the font file cannot be loaded.

Exception

For any other unexpected errors.

build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)

Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.

Parameters:

Name Type Description Default
pdf_path Path

Path to the PDF file.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
annotation_font_path Path

Path to the font file for annotations.

DEFAULT_ANNOTATION_FONT_PATH

Returns:

Type Description
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]

A tuple containing:
  • List of extracted full-page texts (one entry per page).
  • List of word locations (a list of vision.EntityAnnotation objects for each page).
  • List of annotated images (with bounding boxes and text annotations).
  • List of unannotated images (raw page images).

Raises:

Type Description
FileNotFoundError

If the specified PDF file does not exist.

ValueError

If the PDF file is invalid or contains no pages.

Exception

For any unexpected errors during processing.

Example

    from pathlib import Path
    from google.cloud import vision

    pdf_path = Path("/path/to/example.pdf")
    font_path = Path("/path/to/fonts/Arial.ttf")
    client = vision.ImageAnnotatorClient()
    try:
        # Pass the font by keyword; the third positional parameter is `preprocessor`.
        text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf(
            pdf_path, client, annotation_font_path=font_path
        )
        print(f"Processed {len(text_pages)} pages successfully!")
    except Exception as e:
        print(f"Error processing PDF: {e}")

deserialize_entity_annotations_from_json(data)

Deserializes JSON data into a nested list of EntityAnnotation objects.

Parameters:

Name Type Description Default
data str

The JSON string containing serialized annotations.

required

Returns:

Type Description
List[List[EntityAnnotation]]

List[List[EntityAnnotation]]: The reconstructed nested list of EntityAnnotation objects.

extract_image_from_page(page)

Extracts the first image from the given PDF page and returns it as a PIL Image.

Parameters:

Name Type Description Default
page Page

The PDF page object.

required

Returns:

Type Description
Image

Image.Image: The first image on the page as a Pillow Image object.

Raises:

Type Description
ValueError

If no images are found on the page or the image data is incomplete.

Exception

For unexpected errors during image extraction.

Example

    import fitz
    from PIL import Image

    doc = fitz.open("/path/to/document.pdf")
    page = doc.load_page(0)  # Load the first page
    try:
        image = extract_image_from_page(page)
        image.show()  # Display the image
    except Exception as e:
        print(f"Error extracting image: {e}")

get_page_dimensions(page)

Extracts the width and height of a single PDF page in both inches and pixels.

Parameters:

Name Type Description Default
page Page

A single PDF page object from PyMuPDF.

required

Returns:

Name Type Description
dict dict

A dictionary containing the width and height of the page in inches and pixels.
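The unit conversion behind such a result can be sketched as follows, assuming the page size is available in PDF points (72 per inch). The function name, the dictionary keys, and the 300 DPI default are hypothetical; this reference does not specify them.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def page_dimensions(width_pt: float, height_pt: float, dpi: int = 300) -> dict:
    """Hypothetical helper: convert a page size in points to inches and pixels."""
    width_in = width_pt / POINTS_PER_INCH
    height_in = height_pt / POINTS_PER_INCH
    return {
        "width_in": width_in,
        "height_in": height_in,
        "width_px": round(width_in * dpi),
        "height_px": round(height_in * dpi),
    }


letter = page_dimensions(612, 792)  # US Letter expressed in points
print(letter)
```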

load_pdf_pages(pdf_path)

Opens the PDF document and returns the fitz Document object.

Parameters:

Name Type Description Default
pdf_path Path

The path to the PDF file.

required

Returns:

Type Description
Document

fitz.Document: The loaded PDF document.

Raises:

Type Description
FileNotFoundError

If the specified file does not exist.

ValueError

If the file is not a valid PDF document.

Exception

For any unexpected error.

Example

    from pathlib import Path

    pdf_path = Path("/path/to/example.pdf")
    try:
        pdf_doc = load_pdf_pages(pdf_path)
        print(f"PDF contains {pdf_doc.page_count} pages.")
    except Exception as e:
        print(f"Error loading PDF: {e}")

load_processed_PDF_data(base_path)

Loads processed PDF data from files using metadata for file references.

Parameters:

Name Type Description Default
base_path Path

Base path where processed assets are stored.

required

Returns:

Type Description
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]

A tuple containing:
  • Loaded text pages.
  • Word locations (a list of EntityAnnotation objects for each page).
  • Annotated images.
  • Unannotated images.

Raises:

Type Description
FileNotFoundError

If any required files are missing.

ValueError

If the metadata file is incomplete or invalid.

make_image_preprocess_mask(mask_height)

Creates a preprocessing function that masks a specified height at the bottom of the image.

Parameters:

Name Type Description Default
mask_height float

The proportion of the image height to mask at the bottom (0.0 to 1.0).

required

Returns:

Type Description
Callable[[Image, int], Image]

A preprocessing function that takes an image and page number as input and returns the processed image.

pil_to_bytes(image, format='PNG')

Converts a Pillow image to raw bytes.

Parameters:

Name Type Description Default
image Image

The Pillow image object to convert.

required
format str

The format to save the image as (e.g., "PNG", "JPEG"). Default is "PNG".

'PNG'

Returns:

Name Type Description
bytes bytes

The raw bytes of the image.

process_page(page, client, annotation_font_path, preprocessor=None)

Processes a single PDF page, extracting text, word locations, and annotated images.

Parameters:

Name Type Description Default
page Page

The PDF page object.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
preprocessor Callable[[Image, int], Image]

Preprocessing function for the image.

None
annotation_font_path str

Path to the font file for annotations.

required

Returns:

Type Description
Tuple[str, List[EntityAnnotation], Image, Image, dict]

A tuple containing:
  • Full page text (str).
  • Word locations (list of vision.EntityAnnotation).
  • Annotated image (Pillow Image object).
  • Original unprocessed image (Pillow Image object).
  • Page dimensions (dict).

process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)

Processes a single image with the Google Vision API and returns text annotations.

Parameters:

Name Type Description Default
image Image

The preprocessed Pillow image object.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
feature_type str

Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION').

DEFAULT_ANNOTATION_METHOD
language_hints List

Language hints for OCR.

DEFAULT_ANNOTATION_LANGUAGE_HINTS

Returns:

Type Description
Any

List[vision.EntityAnnotation]: Text annotations from the Vision API response.

Raises:

Type Description
ValueError

If no text is detected.

save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)

Saves processed PDF data to files for later reloading.

Parameters:

Name Type Description Default
output_dir Path

Directory to save the data (as a Path object).

required
journal_name str

Name for the output directory (usually the PDF name without extension).

required
text_pages List[str]

Extracted full-page text.

required
word_locations List[List[EntityAnnotation]]

Word locations and annotations from Vision API.

required
annotated_images List[Image]

Annotated images with bounding boxes.

required
unannotated_images List[Image]

Raw unannotated images.

required

Returns:

Type Description
None

None

serialize_entity_annotations_to_json(annotations)

Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.

Parameters:

Name Type Description Default
annotations List[List[EntityAnnotation]]

The nested list of EntityAnnotation objects.

required

Returns:

Name Type Description
str str

The serialized data in JSON format as a string.
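The Base64 step exists because JSON cannot carry raw bytes. The sketch below shows the round trip for a nested list of byte strings. The byte strings stand in for serialized annotations, and `encode_nested` / `decode_nested` are hypothetical names; how the real functions turn EntityAnnotation objects into bytes is not shown here.

```python
import base64
import json


def encode_nested(blobs: list[list[bytes]]) -> str:
    """Encode each blob as Base64 text so the nested list fits in JSON."""
    return json.dumps(
        [[base64.b64encode(blob).decode("ascii") for blob in page] for page in blobs]
    )


def decode_nested(data: str) -> list[list[bytes]]:
    """Reverse encode_nested: parse the JSON, then decode each Base64 string."""
    return [[base64.b64decode(item) for item in page] for page in json.loads(data)]


pages = [[b"\x08\x01word", b"\xff\x00"], []]  # two pages; the second is empty
serialized = encode_nested(pages)
restored = decode_nested(serialized)
print(serialized)
```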

start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)

Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.

Parameters:

Name Type Description Default
credentials_file str

Path to the credentials JSON file. If None, uses the default environment variable.

None
api_endpoint str

Custom API endpoint for the Vision API. Default is the global endpoint.

'vision.googleapis.com'
timeout Tuple[int, int]

Connection and read timeouts in seconds. Default is (10, 30).

(10, 30)
enable_logging bool

Enable detailed logging for debugging. Default is False.

False

Returns:

Type Description
ImageAnnotatorClient

vision.ImageAnnotatorClient: Configured Vision API client.

Raises:

Type Description
FileNotFoundError

If the specified credentials file is not found.

Exception

For unexpected errors during client setup.

Example

    client = start_image_annotator_client(
        credentials_file="/path/to/credentials.json",
        api_endpoint="vision.googleapis.com",
        timeout=(10, 30),
        enable_logging=True,
    )
    print("Google Vision API client initialized.")

ocr_editor

current_image = st.session_state.current_image module-attribute
current_page_index = st.session_state.current_page_index module-attribute
current_text = pages[current_page_index] module-attribute
edited_text = st.text_area('Edit OCR Text', value=(st.session_state.current_text), key=f'text_area_{st.session_state.current_page_index}', height=400) module-attribute
image_directory = st.sidebar.text_input('Image Directory', value='./images') module-attribute
ocr_text_directory = st.sidebar.text_input('OCR Text Directory', value='./ocr_text') module-attribute
pages = st.session_state.pages module-attribute
save_path = os.path.join(ocr_text_directory, 'updated_ocr.xml') module-attribute
tree = st.session_state.tree module-attribute
uploaded_image_file = st.sidebar.file_uploader('Upload an Image', type=['jpg', 'jpeg', 'png', 'pdf']) module-attribute
uploaded_text_file = st.sidebar.file_uploader('Upload OCR Text File', type=['xml']) module-attribute
extract_pages(tree)

Extract page data from the XML tree.

Parameters:

Name Type Description Default
tree ElementTree

Parsed XML tree.

required

Returns:

Name Type Description
list list

A list of dictionaries containing 'number' and 'text' for each page.
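A possible shape for this extraction, using xml.etree.ElementTree, is sketched below. The XML layout (a `page` element with a `number` attribute and the text as element content) is an assumption made for the example, as is the function name `extract_pages_demo`.

```python
import io
import xml.etree.ElementTree as ET


def extract_pages_demo(tree: ET.ElementTree) -> list:
    """Hypothetical sketch: collect 'number' and 'text' for each <page> element."""
    pages = []
    for page in tree.getroot().iter("page"):
        pages.append(
            {"number": page.get("number"), "text": (page.text or "").strip()}
        )
    return pages


xml_text = (
    "<document>"
    '<page number="1">First page</page>'
    '<page number="2">Second page</page>'
    "</document>"
)
tree = ET.parse(io.StringIO(xml_text))  # parse() accepts a file-like object
pages = extract_pages_demo(tree)
print(pages)
```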

load_xml(file_obj)

Load an XML file from a file-like object.

save_xml(tree, file_path)

Save the modified XML tree to a file.

ocr_processing

DEFAULT_ANNOTATION_FONT_PATH = Path('/System/Library/Fonts/Supplemental/Arial.ttf') module-attribute
DEFAULT_ANNOTATION_FONT_SIZE = 12 module-attribute
DEFAULT_ANNOTATION_LANGUAGE_HINTS = ['vi'] module-attribute
DEFAULT_ANNOTATION_METHOD = 'DOCUMENT_TEXT_DETECTION' module-attribute
DEFAULT_ANNOTATION_OFFSET = 2 module-attribute
logger = logging.getLogger('ocr_processing') module-attribute
PDFParseWarning

Bases: Warning

Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.

warn(message) staticmethod

Display a warning message with custom formatting.

Parameters:

Name Type Description Default
message str

The warning message to display.

required
annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)

Annotates a PIL image with bounding boxes and text descriptions from OCR results.

Parameters:

Name Type Description Default
image Image

The input PIL image to annotate.

required
text_annotations List[EntityAnnotation]

OCR results containing bounding boxes and text.

required
annotation_font_path str

Path to the font file for text annotations.

required
font_size int

Font size for text annotations.

12

Returns:

Type Description
Image

Image.Image: The annotated PIL image.

Raises:

Type Description
ValueError

If the input image is None.

IOError

If the font file cannot be loaded.

Exception

For any other unexpected errors.

build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)

Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.

Parameters:

Name Type Description Default
pdf_path Path

Path to the PDF file.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
annotation_font_path Path

Path to the font file for annotations.

DEFAULT_ANNOTATION_FONT_PATH

Returns:

Type Description
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]

A tuple containing:
  • List of extracted full-page texts (one entry per page).
  • List of word locations (a list of vision.EntityAnnotation objects for each page).
  • List of annotated images (with bounding boxes and text annotations).
  • List of unannotated images (raw page images).

Raises:

Type Description
FileNotFoundError

If the specified PDF file does not exist.

ValueError

If the PDF file is invalid or contains no pages.

Exception

For any unexpected errors during processing.

Example

    from pathlib import Path
    from google.cloud import vision

    pdf_path = Path("/path/to/example.pdf")
    font_path = Path("/path/to/fonts/Arial.ttf")
    client = vision.ImageAnnotatorClient()
    try:
        # Pass the font by keyword; the third positional parameter is `preprocessor`.
        text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf(
            pdf_path, client, annotation_font_path=font_path
        )
        print(f"Processed {len(text_pages)} pages successfully!")
    except Exception as e:
        print(f"Error processing PDF: {e}")

deserialize_entity_annotations_from_json(data)

Deserializes JSON data into a nested list of EntityAnnotation objects.

Parameters:

Name Type Description Default
data str

The JSON string containing serialized annotations.

required

Returns:

Type Description
List[List[EntityAnnotation]]

List[List[EntityAnnotation]]: The reconstructed nested list of EntityAnnotation objects.

extract_image_from_page(page)

Extracts the first image from the given PDF page and returns it as a PIL Image.

Parameters:

Name Type Description Default
page Page

The PDF page object.

required

Returns:

Type Description
Image

Image.Image: The first image on the page as a Pillow Image object.

Raises:

Type Description
ValueError

If no images are found on the page or the image data is incomplete.

Exception

For unexpected errors during image extraction.

Example

import fitz
from PIL import Image

doc = fitz.open("/path/to/document.pdf")
page = doc.load_page(0)  # Load the first page

try:
    image = extract_image_from_page(page)
    image.show()  # Display the image
except Exception as e:
    print(f"Error extracting image: {e}")

get_page_dimensions(page)

Extracts the width and height of a single PDF page in both inches and pixels.

Parameters:

Name Type Description Default
page Page

A single PDF page object from PyMuPDF.

required

Returns:

Name Type Description
dict dict

A dictionary containing the width and height of the page in inches and pixels.
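The unit conversion behind a result like this can be sketched without PyMuPDF. PDF page sizes are expressed in points (1/72 inch). The helper name `page_dimensions_from_points`, the `dpi` parameter, and the dictionary keys are assumptions for illustration and may differ from what `get_page_dimensions` actually returns.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def page_dimensions_from_points(width_pt: float, height_pt: float, dpi: int = 300) -> dict:
    """Convert a page size in points to inches and pixels (hypothetical helper)."""
    width_in = width_pt / POINTS_PER_INCH
    height_in = height_pt / POINTS_PER_INCH
    return {
        "width_in": width_in,
        "height_in": height_in,
        # Pixel size depends on the rendering resolution chosen by the caller.
        "width_px": round(width_in * dpi),
        "height_px": round(height_in * dpi),
    }


# US Letter is 612 x 792 points, i.e. 8.5 x 11 inches.
dims = page_dimensions_from_points(612, 792, dpi=300)
print(dims)
```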

load_pdf_pages(pdf_path)

Opens the PDF document and returns the fitz Document object.

Parameters:

Name Type Description Default
pdf_path Path

The path to the PDF file.

required

Returns:

Type Description
Document

fitz.Document: The loaded PDF document.

Raises:

Type Description
FileNotFoundError

If the specified file does not exist.

ValueError

If the file is not a valid PDF document.

Exception

For any unexpected error.

Example

from pathlib import Path

pdf_path = Path("/path/to/example.pdf")

try:
    pdf_doc = load_pdf_pages(pdf_path)
    print(f"PDF contains {pdf_doc.page_count} pages.")
except Exception as e:
    print(f"Error loading PDF: {e}")

load_processed_PDF_data(base_path)

Loads processed PDF data from files using metadata for file references.

Parameters:

Name Type Description Default
base_path Path

Base path where processed assets are stored.

required

Returns:

Type Description
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]

Tuple[List[str], List[List[EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- Loaded text pages.
- Word locations (list of EntityAnnotation objects for each page).
- Annotated images.
- Unannotated images.

Raises:

Type Description
FileNotFoundError

If any required files are missing.

ValueError

If the metadata file is incomplete or invalid.

make_image_preprocess_mask(mask_height)

Creates a preprocessing function that masks a specified height at the bottom of the image.

Parameters:

Name Type Description Default
mask_height float

The proportion of the image height to mask at the bottom (0.0 to 1.0).

required

Returns:

Type Description
Callable[[Image, int], Image]

Callable[[Image.Image, int], Image.Image]: A preprocessing function that takes an image and page number as input and returns the processed image.

pil_to_bytes(image, format='PNG')

Converts a Pillow image to raw bytes.

Parameters:

Name Type Description Default
image Image

The Pillow image object to convert.

required
format str

The format to save the image as (e.g., "PNG", "JPEG"). Default is "PNG".

'PNG'

Returns:

Name Type Description
bytes bytes

The raw bytes of the image.

process_page(page, client, annotation_font_path, preprocessor=None)

Processes a single PDF page, extracting text, word locations, and annotated images.

Parameters:

Name Type Description Default
page Page

The PDF page object.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
preprocessor Callable[[Image, int], Image]

Preprocessing function for the image.

None
annotation_font_path str

Path to the font file for annotations.

required

Returns:

Type Description
Tuple[str, List[EntityAnnotation], Image, Image, dict]

Tuple[str, List[vision.EntityAnnotation], Image.Image, Image.Image, dict]:
- Full page text (str)
- Word locations (List of vision.EntityAnnotation)
- Annotated image (Pillow Image object)
- Original unprocessed image (Pillow Image object)
- Page dimensions (dict)

process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)

Processes a single image with the Google Vision API and returns text annotations.

Parameters:

Name Type Description Default
image Image

The preprocessed Pillow image object.

required
client ImageAnnotatorClient

Google Vision API client for text detection.

required
feature_type str

Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION').

DEFAULT_ANNOTATION_METHOD
language_hints List

Language hints for OCR.

DEFAULT_ANNOTATION_LANGUAGE_HINTS

Returns:

Type Description
Any

List[vision.EntityAnnotation]: Text annotations from the Vision API response.

Raises:

Type Description
ValueError

If no text is detected.

save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)

Saves processed PDF data to files for later reloading.

Parameters:

Name Type Description Default
output_dir Path

Directory to save the data (as a Path object).

required
journal_name str

Name for the output directory (usually the PDF name without extension).

required
text_pages List[str]

Extracted full-page text.

required
word_locations List[List[EntityAnnotation]]

Word locations and annotations from Vision API.

required
annotated_images List[Image]

Annotated images with bounding boxes.

required
unannotated_images List[Image]

Raw unannotated images.

required

Returns:

Type Description
None

None

serialize_entity_annotations_to_json(annotations)

Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.

Parameters:

Name Type Description Default
annotations List[List[EntityAnnotation]]

The nested list of EntityAnnotation objects.

required

Returns:

Name Type Description
str str

The serialized data in JSON format as a string.
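The Base64 step exists because JSON cannot carry raw bytes. A minimal round-trip sketch, assuming each annotation has already been reduced to a byte string: plain `bytes` values stand in for serialized `EntityAnnotation` objects, and both helper names are hypothetical.

```python
import base64
import json
from typing import List


def serialize_nested_bytes(pages: List[List[bytes]]) -> str:
    """Encode a nested list of byte strings as a JSON string of Base64 text."""
    encoded = [[base64.b64encode(item).decode("ascii") for item in page] for page in pages]
    return json.dumps(encoded)


def deserialize_nested_bytes(data: str) -> List[List[bytes]]:
    """Reverse serialize_nested_bytes, restoring the nested list of byte strings."""
    return [[base64.b64decode(item) for item in page] for page in json.loads(data)]


# One inner list per page; an empty page stays an empty list.
pages = [[b"\x08\x01word-a", b"\x08\x02word-b"], [], [b"\xff\x00"]]
payload = serialize_nested_bytes(pages)
restored = deserialize_nested_bytes(payload)
```

The nesting (pages, then annotations per page) survives the round trip because only the leaf values are encoded.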

start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)

Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.

Parameters:

Name Type Description Default
credentials_file str

Path to the credentials JSON file. If None, uses the default environment variable.

None
api_endpoint str

Custom API endpoint for the Vision API. Default is the global endpoint.

'vision.googleapis.com'
timeout Tuple[int, int]

Connection and read timeouts in seconds. Default is (10, 30).

(10, 30)
enable_logging bool

Enable detailed logging for debugging. Default is False.

False

Returns:

Type Description
ImageAnnotatorClient

vision.ImageAnnotatorClient: Configured Vision API client.

Raises:

Type Description
FileNotFoundError

If the specified credentials file is not found.

Exception

For unexpected errors during client setup.

Example

client = start_image_annotator_client(
    credentials_file="/path/to/credentials.json",
    api_endpoint="vision.googleapis.com",
    timeout=(10, 30),
    enable_logging=True
)
print("Google Vision API client initialized.")

prompt_system

Prompt system package scaffolding per ADR-PT04.

Modules will provide object-service compliant prompt catalog, rendering, and validation.

adapters

Prompt catalog adapters.

filesystem_catalog_adapter

Filesystem-backed prompt catalog adapter.

FilesystemPromptCatalog

Bases: PromptCatalogPort

Filesystem-backed catalog for offline/packaged distributions.

__init__(config, mapper, loader, cache=None, transport=None)
catalog_health()

Return the accumulated catalog health report.

get(key)
list()
frontmatter_fallback

Shared helpers for resilient prompt body extraction.

extract_best_effort_body(content)

Return prompt body even when frontmatter is missing or malformed.
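One way such a fallback can work is sketched here. It assumes frontmatter is delimited by `---` lines; the function name `best_effort_body` and the choice to return the content unchanged when the closing delimiter is missing are assumptions, not the documented behavior of `extract_best_effort_body`.

```python
def best_effort_body(content: str) -> str:
    """Return the text after a '---' delimited frontmatter block, if one is present."""
    lines = content.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return content  # No frontmatter: the whole content is the body.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # The frontmatter is never parsed, so malformed YAML cannot break extraction.
            return "".join(lines[index + 1:])
    return content  # Assumption: without a closing delimiter, keep everything.


with_frontmatter = best_effort_body("---\nname: x\n---\nHello\n")
malformed_yaml = best_effort_body("---\nname: [unclosed\n---\nBody")
no_frontmatter = best_effort_body("Hello\n")
```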

git_catalog_adapter

Git-backed prompt catalog adapter.

GitPromptCatalog

Bases: PromptCatalogPort

Git-backed prompt catalog adapter (implements PromptCatalogPort).

__init__(config, transport, loader, mapper=None, cache=None)
catalog_health()

Return the accumulated catalog health report.

get(key)
list()
refresh()

config

Configuration models and policies for the prompt system.

policy

Behavior policies controlling prompt rendering and validation.

PromptRenderPolicy

Bases: BaseModel

Policy for prompt rendering precedence and behavior.

allow_undefined_vars = False class-attribute instance-attribute
merge_strategy = 'override' class-attribute instance-attribute
policy_version = '1.0' class-attribute instance-attribute
precedence_order = ['caller_context', 'frontmatter_defaults', 'settings_defaults'] class-attribute instance-attribute
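A sketch of how a precedence list with the `'override'` merge strategy could resolve template variables. It assumes the first entry in `precedence_order` has the highest precedence; `merge_variables` is a hypothetical helper, not part of the package.

```python
from typing import Dict, List, Mapping

PRECEDENCE_ORDER = ["caller_context", "frontmatter_defaults", "settings_defaults"]


def merge_variables(
    sources: Mapping[str, Mapping[str, object]],
    precedence_order: List[str],
) -> Dict[str, object]:
    """Merge variable sources so that earlier names in precedence_order win."""
    merged: Dict[str, object] = {}
    # Apply the lowest-precedence source first so higher ones override it.
    for source_name in reversed(precedence_order):
        merged.update(sources.get(source_name, {}))
    return merged


variables = merge_variables(
    {
        "settings_defaults": {"language": "en", "tone": "neutral"},
        "frontmatter_defaults": {"tone": "formal"},
        "caller_context": {"language": "vi"},
    },
    PRECEDENCE_ORDER,
)
```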
ValidationPolicy

Bases: BaseModel

Policy controlling validation strictness.

allow_extra_variables = False class-attribute instance-attribute
fail_on_missing_required = True class-attribute instance-attribute
mode = 'strict' class-attribute instance-attribute
policy_version = '1.0' class-attribute instance-attribute
prompt_catalog_config

Construction-time configuration models for prompt catalog and transports.

GitTransportConfig

Bases: BaseModel

Git transport layer configuration.

auto_pull = False class-attribute instance-attribute
default_branch = 'main' class-attribute instance-attribute
pull_timeout_s = 30.0 class-attribute instance-attribute
repository_path instance-attribute
PromptCatalogConfig

Bases: BaseModel

Configuration for building a prompt catalog.

cache_ttl_s = 300 class-attribute instance-attribute
enable_git_refresh = True class-attribute instance-attribute
repository_path instance-attribute
validation_on_load = True class-attribute instance-attribute

domain

Domain models and protocols for prompt handling.

models

Domain models for the prompt system.

CatalogHealth

Bases: BaseModel

Aggregated prompt catalog health report.

error_count property

Return the number of fatal prompt issues.

errors = Field(default_factory=list) class-attribute instance-attribute
warning_count property

Return the number of non-fatal prompt issues.

warnings = Field(default_factory=list) class-attribute instance-attribute
CatalogIssue

Bases: BaseModel

Single prompt catalog health issue.

issue_type instance-attribute
message instance-attribute
prompt_key instance-attribute
CatalogIssueType

Bases: str, Enum

Catalog health issue classifications.

FRONTMATTER_PARSE_ERROR = 'frontmatter_parse_error' class-attribute instance-attribute
METADATA_WARNING = 'metadata_warning' class-attribute instance-attribute
VALIDATION_ERROR = 'validation_error' class-attribute instance-attribute
InputStrictness

Bases: str, Enum

Input declaration strictness.

loose = 'loose' class-attribute instance-attribute
strict = 'strict' class-attribute instance-attribute
Message

Bases: BaseModel

Single message in a conversation.

content instance-attribute
role instance-attribute
Prompt

Bases: BaseModel

Prompt domain model.

metadata instance-attribute
name instance-attribute
template instance-attribute
version instance-attribute
PromptArtifactSpec

Bases: BaseModel

Artifact declaration for artifact-producing prompts.

path instance-attribute
required = False class-attribute instance-attribute
PromptInputSpec

Bases: BaseModel

Prompt input declaration.

description = None class-attribute instance-attribute
name instance-attribute
required = False class-attribute instance-attribute
source = None class-attribute instance-attribute
strictness = InputStrictness.loose class-attribute instance-attribute
type = None class-attribute instance-attribute
PromptMetadata

Bases: BaseModel

Prompt front matter metadata.

content_flags = Field(default_factory=list) class-attribute instance-attribute
created_at = None class-attribute instance-attribute
default_model = None class-attribute instance-attribute
default_variables = Field(default_factory=dict) class-attribute instance-attribute
description instance-attribute
input_contract_ref = None class-attribute instance-attribute
inputs = Field(default_factory=list) class-attribute instance-attribute
key = '' class-attribute instance-attribute
model_config = ConfigDict(extra='forbid') class-attribute instance-attribute
name instance-attribute
optional_variables = Field(default_factory=list) class-attribute instance-attribute
output_contract = None class-attribute instance-attribute
output_contract_ref = None class-attribute instance-attribute
output_mode = None class-attribute instance-attribute
pii_handling = None class-attribute instance-attribute
prompt_id = None class-attribute instance-attribute
required_variables = Field(default_factory=list) class-attribute instance-attribute
role = None class-attribute instance-attribute
safety_level = None class-attribute instance-attribute
schema_version = '1.0' class-attribute instance-attribute
tags = Field(default_factory=list) class-attribute instance-attribute
updated_at = None class-attribute instance-attribute
version instance-attribute
warnings = Field(default_factory=list) class-attribute instance-attribute
canonical_key()

Return canonical key without version suffix.

immutable_ref()

Return immutable prompt reference key.v.

resolved_output_mode()

Return normalized platform output mode.

PromptOutputContract

Bases: BaseModel

Prompt output contract declaration.

artifacts = Field(default_factory=list) class-attribute instance-attribute
mode instance-attribute
schema_ref = None class-attribute instance-attribute
PromptOutputMode

Bases: str, Enum

Prompt output modes supported by the platform.

artifacts = 'artifacts' class-attribute instance-attribute
json = 'json' class-attribute instance-attribute
text = 'text' class-attribute instance-attribute
PromptValidationResult

Bases: BaseModel

Result of prompt validation.

valid is maintained for API ergonomics and derived from errors to keep a single source of truth.

errors = Field(default_factory=list) class-attribute instance-attribute
fingerprint_data = Field(default_factory=dict) class-attribute instance-attribute
valid = True class-attribute instance-attribute
warnings = Field(default_factory=list) class-attribute instance-attribute
succeeded()
RenderParams

Bases: BaseModel

Per-call rendering parameters.

preserve_whitespace = False class-attribute instance-attribute
strict_undefined = True class-attribute instance-attribute
user_input = None class-attribute instance-attribute
variables = Field(default_factory=dict) class-attribute instance-attribute
RenderedPrompt

Bases: BaseModel

Rendered prompt ready for the provider.

messages = Field(default_factory=list) class-attribute instance-attribute
system = None class-attribute instance-attribute
ValidationIssue

Bases: BaseModel

Single validation issue.

code instance-attribute
field = None class-attribute instance-attribute
level instance-attribute
line = None class-attribute instance-attribute
message instance-attribute
protocols

Protocols defining prompt system behavior contracts.

PromptCatalogPort

Bases: Protocol

Repository interface for prompt storage and retrieval.

catalog_health()

Return aggregated catalog health information.

get(key)

Retrieve prompt by key.

list()

List available prompts.

PromptRendererPort

Bases: Protocol

Renders prompts with variable substitution.

render(prompt, params)

Render prompt with templating.

PromptValidatorPort

Bases: Protocol

Validates prompt schema and render parameters.

validate(prompt)

Validate prompt metadata and template.

validate_render(prompt, params)

Validate render inputs against prompt requirements.

mappers

Mappers for translating transport data to domain models and back.

prompt_mapper

Mapper for translating prompt files to domain models.

PromptMapper

Maps transport-layer prompt data into domain objects.

to_domain_prompt(file_content, source_key=None)

Map raw file content (including front matter) to a Prompt.

to_file_request(key, base_path)

Map prompt key to a filesystem path for transport.

to_key_from_path(path, base_path)

Map a prompt file path to canonical key.

Absolute paths are relativized only when they live under base_path. Relative paths also strip a matching relative base_path prefix, which keeps tnh-prompts/foo.md and foo.md equivalent for callers such as --prompt-dir ./tnh-prompts. Paths outside the base are preserved.
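The path rules above can be sketched with `pathlib`. `key_from_path` is a hypothetical stand-in, and dropping the file suffix to form the key is an assumption about the canonical key format.

```python
from pathlib import PurePosixPath


def key_from_path(path: PurePosixPath, base_path: PurePosixPath) -> str:
    """Relativize path against base_path when possible; otherwise keep it as given."""
    try:
        relative = path.relative_to(base_path)
    except ValueError:
        # The path is outside the base (or one is absolute and the other relative).
        relative = path
    # Assumption: the key is the path without its suffix, using '/' separators.
    return relative.with_suffix("").as_posix()


base = PurePosixPath("./tnh-prompts")  # pathlib normalizes this to 'tnh-prompts'
prefixed = key_from_path(PurePosixPath("tnh-prompts/foo.md"), base)
bare = key_from_path(PurePosixPath("foo.md"), base)
outside = key_from_path(PurePosixPath("/elsewhere/x.md"), PurePosixPath("/repo/tnh-prompts"))
```

With these rules `tnh-prompts/foo.md` and `foo.md` resolve to the same key, matching the `--prompt-dir ./tnh-prompts` case described above.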

service

Prompt system services (rendering, validation, loading).

contract_schema

Prompt contract schema resolution and validation.

SCHEMA_DIRECTORY_PARTS = ('schemas', 'prompt-contracts') module-attribute
SCHEMA_SUFFIX = '.schema.json' module-attribute
PromptContractSchemaResolver

Resolve and validate prompt-contract JSON Schema artifacts.

__init__(context)
for_prompt_directory(prompts_base) classmethod

Build a resolver using runtime-context discovery for a prompt directory.

resolve(schema_ref)

Resolve a schema_ref to the highest-precedence schema file.

resolve_validated(schema_ref)

Resolve a schema_ref and confirm the artifact is valid JSON Schema.

search_roots()

Return schema search roots in workspace/user/built-in precedence.

validate_instance(resolved, payload)

Validate a JSON payload against a resolved schema.
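Precedence-based resolution can be sketched as a first-match search over ordered roots. The two constants mirror the module attributes listed above; the file layout `<root>/schemas/prompt-contracts/<schema_ref>.schema.json` and the helper `resolve_schema` are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

SCHEMA_DIRECTORY_PARTS = ("schemas", "prompt-contracts")
SCHEMA_SUFFIX = ".schema.json"


def resolve_schema(schema_ref: str, search_roots: List[Path]) -> Optional[Path]:
    """Return the first matching schema file, searching roots in precedence order."""
    for root in search_roots:
        candidate = root.joinpath(*SCHEMA_DIRECTORY_PARTS, f"{schema_ref}{SCHEMA_SUFFIX}")
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    workspace, user, builtin = (Path(tmp) / name for name in ("workspace", "user", "builtin"))
    # Only the user and built-in roots provide the schema; the workspace root is empty.
    for root in (user, builtin):
        schema_dir = root.joinpath(*SCHEMA_DIRECTORY_PARTS)
        schema_dir.mkdir(parents=True)
        (schema_dir / f"summary{SCHEMA_SUFFIX}").write_text("{}")
    roots = [workspace, user, builtin]
    resolved_root = resolve_schema("summary", roots).relative_to(tmp).parts[0]
    missing = resolve_schema("absent", roots)
```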

ResolvedPromptContractSchema

Bases: BaseModel

Resolved prompt-contract schema artifact.

document instance-attribute
path instance-attribute
schema_ref instance-attribute
format_contract_validation_error(*, schema_ref, error)

Build a user-facing contract validation failure message.

loader

Prompt loader orchestration service.

PromptLoader

Responsible for preparing prompts (parse + validate).

__init__(validator)
parse_error_issue(prompt_key, message)

Build a fatal issue for unreadable or invalid frontmatter.

validate(prompt)

Validate prompt using configured validator.

validation_issues(prompt_key, validation)

Convert validation errors into fatal catalog issues.

warning_issues(prompt_key, warnings)

Convert non-fatal prompt warnings into catalog issues.

renderer

Prompt rendering service.

PromptRenderer

Bases: PromptRendererPort

Renders prompts using configured policy.

__init__(policy, settings_defaults=None)
render(prompt, params)

Render prompt with templating and precedence rules.

validator

Prompt validation service.

PromptValidator

Bases: PromptValidatorPort

Validates prompt metadata and render parameters.

__init__(policy, schema_resolver=None)
validate(prompt)

Validate prompt metadata and template syntax.

validate_render(prompt, params)

Validate render inputs against prompt requirements.

transport

Transport layer for prompt system (git/filesystem/cache).

cache

Cache transport abstractions.

T = TypeVar('T') module-attribute
CacheTransport

Bases: Protocol, Generic[T]

Abstract cache transport.

clear()
get(key)
invalidate(key)
set(key, value, ttl_s=None)
InMemoryCacheTransport

Bases: Generic[T]

In-memory cache implementation with TTL.

__init__(default_ttl_s=300)
clear()
get(key)
invalidate(key)
set(key, value, ttl_s=None)
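A minimal sketch of a TTL cache with the same four operations. `TtlCache` is not the package's implementation; the injectable `clock` parameter is an addition made here so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """In-memory cache whose entries expire after a time-to-live in seconds."""

    def __init__(self, default_ttl_s: float = 300, clock: Callable[[], float] = time.monotonic):
        self._default_ttl_s = default_ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, T]] = {}

    def get(self, key: str) -> Optional[T]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # Drop expired entries lazily on read.
            return None
        return value

    def set(self, key: str, value: T, ttl_s: Optional[float] = None) -> None:
        ttl = self._default_ttl_s if ttl_s is None else ttl_s
        self._entries[key] = (self._clock() + ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)

    def clear(self) -> None:
        self._entries.clear()


now = [0.0]  # Fake clock value, advanced manually below.
cache: TtlCache[str] = TtlCache(default_ttl_s=300, clock=lambda: now[0])
cache.set("prompt:a", "body")
fresh = cache.get("prompt:a")
now[0] = 301.0
expired = cache.get("prompt:a")
```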
filesystem

Filesystem transport for prompt files.

FilesystemTransport

Reads prompt files from the filesystem.

__init__(mapper)
list_files(base_path, pattern='**/*.md')

List prompt files under base path.

read_file(request)

Read a prompt file from disk.
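The default pattern `'**/*.md'` matches Markdown files at any depth under the base path. A sketch using `pathlib.Path.glob`, which is an assumption about how the listing is done; `list_prompt_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import List


def list_prompt_files(base_path: Path, pattern: str = "**/*.md") -> List[Path]:
    """Return files under base_path matching the glob pattern, in sorted order."""
    return sorted(path for path in base_path.glob(pattern) if path.is_file())


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    (base / "top.md").write_text("top")
    (base / "nested" / "inner.md").write_text("inner")
    (base / "notes.txt").write_text("ignored")
    found = [path.relative_to(base).as_posix() for path in list_prompt_files(base)]
```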

git_client

Git transport client for prompt files.

GitTransportClient

Minimal git transport operations.

__init__(config, mapper)
get_current_commit()
list_files(pattern='**/*.md')
pull_latest()
read_file_at_commit(request)
models

Transport models for prompt system I/O.

GitRefreshRequest

Bases: BaseModel

Request to refresh git repository.

repository_path instance-attribute
target_ref = None class-attribute instance-attribute
GitRefreshResponse

Bases: BaseModel

Git refresh operation result.

branch instance-attribute
changed_files instance-attribute
current_commit instance-attribute
refreshed_at instance-attribute
PromptFileRequest

Bases: BaseModel

Transport-level request to load a prompt file.

commit_sha = None class-attribute instance-attribute
path instance-attribute
PromptFileResponse

Bases: BaseModel

Transport-level prompt file data.

content instance-attribute
file_hash instance-attribute
loaded_at instance-attribute
metadata_raw instance-attribute

text_processing

__all__ = ['bracket_lines', 'unbracket_lines', 'lines_from_bracketed_text', 'NumberedText', 'normalize_newlines', 'clean_text'] module-attribute

NumberedText

Immutable container for text documents with numbered lines.

Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.

Immutability Note

NumberedText is designed to be used immutably after construction. While not enforced at runtime (for performance reasons as a low-level container), instances should not be modified after creation. All operations return new data rather than mutating the instance.

Whitespace and Blank Line Handling (Monaco Editor as standard for compatibility): NumberedText follows Monaco Editor's verbatim line and whitespace handling. Monaco Editor: https://microsoft.github.io/monaco-editor/typedoc/interfaces/IRange.html

- Blank lines: Preserved as empty strings in the lines list
- Whitespace: Leading/trailing whitespace preserved (never stripped)
- Line count: Blank lines count as lines (e.g., "a\n\nb" has 3 lines)
- Indexing: 1-based line numbers with inclusive end semantics (Monaco IRange)

Numbered Input Detection:
When input contains line numbers (e.g., "1: foo\n2:\n3: bar"):
- Pattern validation: Only non-blank lines validated for sequential numbering
- Number extraction: Removes number prefix (e.g., "2: ") from all lines
- Blank line handling: After number removal, blank lines become empty strings
- Example: "1: foo\n2:\n3: bar" → lines=[' foo', '', ' bar']
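The detection rules above can be sketched as follows. This is not the class's implementation: `strip_line_numbers` is a hypothetical helper, it takes the separator as given instead of detecting it, and it treats only completely empty lines as blank.

```python
import re
from typing import List, Optional


def strip_line_numbers(text: str, separator: str = ":") -> Optional[List[str]]:
    """Return lines with number prefixes removed, or None if text is not numbered."""
    pattern = re.compile(rf"^(\d+){re.escape(separator)}(.*)$")
    stripped: List[str] = []
    expected: Optional[int] = None
    for line in text.split("\n"):
        if line == "":
            # Blank lines are not validated but still occupy a line number.
            stripped.append("")
            if expected is not None:
                expected += 1
            continue
        match = pattern.match(line)
        if match is None:
            return None  # A non-blank line lacks the number prefix.
        number = int(match.group(1))
        if expected is not None and number != expected:
            return None  # Numbering is not sequential.
        expected = number + 1
        stripped.append(match.group(2))  # Whitespace after the separator is kept.
    return stripped


numbered = strip_line_numbers("1: foo\n2:\n3: bar")
plain_list = strip_line_numbers("1. First item\n2. Second item")
```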

Attributes:

Name Type Description
lines List[str]

List of text lines (do not modify after construction)

start int

Starting line number (do not modify after construction)

separator str

Separator between line number and content (do not modify after construction)

Examples:

>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
>>> print(doc.get_line(2))
Second line
>>> for num, line in doc:
...     print(f"Line {num}: {len(line)} chars")
content property

Get original text without line numbers.

end property
lines = [] instance-attribute
numbered_content property

Get text with line numbers as a string. Equivalent to str(self)

numbered_lines property

Get list of lines with line numbers included.

Returns:

Type Description
List[str]

List[str]: Lines with numbers and separator prefixed

Examples:

>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
  • Unlike str(self), this returns a list rather than joined string
  • Maintains consistent formatting with separator
  • Useful for processing or displaying individual numbered lines
separator = separator instance-attribute
size property

Get the number of lines.

start = start instance-attribute
LineSegment dataclass

Represents a segment of lines with start and end indices in 1-based indexing.

The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.

Attributes:

Name Type Description
start int

Starting line number (inclusive, 1-based)

end int

Ending line number (exclusive, 1-based)

end instance-attribute
start instance-attribute
__init__(start, end)
__iter__()

Allow unpacking into start, end pairs.

SegmentIterator

Iterator for generating line segments of specified size.

Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.

Attributes:

Name Type Description
total_lines

Total number of lines in text

segment_size

Number of lines per segment

start_line

Starting line number (1-based)

min_segment_size

Minimum size for the final segment

min_segment_size = min_segment_size instance-attribute
num_segments = (remaining_lines + segment_size - 1) // segment_size instance-attribute
segment_size = segment_size instance-attribute
start_line = start_line instance-attribute
total_lines = total_lines instance-attribute
__init__(total_lines, segment_size, start_line=1, min_segment_size=None)

Initialize the segment iterator.

Parameters:

Name Type Description Default
total_lines int

Total number of lines to iterate over

required
segment_size int

Desired size of each segment

required
start_line int

First line number (default: 1)

1
min_segment_size Optional[int]

Minimum size for final segment (default: None) If specified, the last segment will be merged with the previous one if it would be smaller than this size.

None

Raises:

Type Description
ValueError

If segment_size < 1 or total_lines < 1

ValueError

If start_line < 1 (must use 1-based indexing)

ValueError

If min_segment_size >= segment_size

__iter__()

Iterate over line segments.

Yields:

Type Description
LineSegment

LineSegment containing start (inclusive) and end (exclusive) indices
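The segmentation rule, including the merge of an undersized final segment, can be sketched as a generator of `(start, end)` pairs. `iter_line_segments` is a hypothetical stand-in for `SegmentIterator`, and it assumes `total_lines` is the number of the last line.

```python
from typing import Iterator, Optional, Tuple


def iter_line_segments(
    total_lines: int,
    segment_size: int,
    start_line: int = 1,
    min_segment_size: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) pairs: 1-based, start inclusive, end exclusive."""
    if segment_size < 1 or total_lines < 1:
        raise ValueError("segment_size and total_lines must be at least 1")
    if start_line < 1:
        raise ValueError("start_line must be at least 1")
    if min_segment_size is not None and min_segment_size >= segment_size:
        raise ValueError("min_segment_size must be smaller than segment_size")

    stop = total_lines + 1  # Exclusive end of the whole text.
    current = start_line
    while current < stop:
        end = min(current + segment_size, stop)
        leftover = stop - end
        # Absorb a too-small tail into the current segment instead of yielding it alone.
        if min_segment_size is not None and 0 < leftover < min_segment_size:
            end = stop
        yield (current, end)
        current = end


plain = list(iter_line_segments(5, 2))
merged = list(iter_line_segments(7, 3, min_segment_size=2))
```

Without `min_segment_size`, seven lines in segments of three would end with a one-line segment `(7, 8)`; with `min_segment_size=2` that tail is folded into the preceding segment.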

__getitem__(index)

Get line content by line number (1-based indexing).

__init__(content=None, start=1, separator=':')

Initialize a numbered text document, detecting and preserving existing numbering.

Valid numbered text must have:
- Sequential line numbers
- Consistent separator character(s)
- Every non-empty line must follow the numbering pattern

Parameters:

Name Type Description Default
content Optional[str]

Initial text content, if any

None
start int

Starting line number (used only if content isn't already numbered)

1
separator str

Separator between line numbers and content (only if content isn't numbered)

':'

Examples:

>>> # Custom separators
>>> doc = NumberedText("1→First line\n2→Second line")
>>> doc.separator == "→"
True
>>> # Preserves starting number
>>> doc = NumberedText("5#First\n6#Second")
>>> doc.start == 5
True
>>> # Regular numbered list isn't treated as line numbers
>>> doc = NumberedText("1. First item\n2. Second item")
>>> doc.numbered_lines
['1: 1. First item', '2: 2. Second item']
__iter__()

Iterate over (line_number, line_content) pairs.

__len__()

Return the number of lines.

__str__()

Return the numbered text representation.

from_file(path, **kwargs) classmethod

Create a NumberedText instance from a file.

get_coverage_report(section_start_lines)

Return coverage statistics for sections defined by start lines.

get_line(line_num)

Get content of specified line number.

get_lines(start, end)

Deprecated: use get_lines_exclusive; end index remains exclusive.

get_lines_exclusive(start, end)

Get content of line range [start, end) using 1-based line numbers.

Parameters:

Name Type Description Default
start int

Inclusive start line (1-based external indexing).

required
end int

Exclusive end line (1-based; not included), matching Python slicing semantics.

required
get_numbered_line(line_num)

Get specified line with line number.

get_numbered_lines(start, end)

Get numbered lines for [start, end) using 1-based numbering.

get_numbered_segment(start, end)
get_segment(start, end)

Return the segment from start line (inclusive) up to end line (inclusive).

This aligns with Monaco's inclusive range semantics. Internally we convert to Python's exclusive upper bound when slicing.
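The two range conventions differ only in how the end line maps onto a Python slice. A sketch on a plain list, assuming numbering starts at 1; both function names are hypothetical and do not call `NumberedText`.

```python
from typing import List

lines = ["alpha", "beta", "gamma", "delta"]  # Line numbers 1 through 4.


def lines_exclusive(lines: List[str], start: int, end: int) -> List[str]:
    """Lines in [start, end): 1-based start, end not included."""
    return lines[start - 1 : end - 1]


def segment_inclusive(lines: List[str], start: int, end: int) -> List[str]:
    """Lines in [start, end]: 1-based start, end included (Monaco-style)."""
    return lines[start - 1 : end]


exclusive = lines_exclusive(lines, 2, 4)
inclusive = segment_inclusive(lines, 2, 4)
```

An inclusive range `(start, end)` therefore covers the same lines as the exclusive range `(start, end + 1)`.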

iter_segments(segment_size, min_segment_size=None)

Iterate over segments of the text with specified size.

Parameters:

Name Type Description Default
segment_size int

Number of lines per segment

required
min_segment_size Optional[int]

Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size.

None

Yields:

Type Description
LineSegment

LineSegment objects containing start and end line numbers

Example

>>> text = NumberedText("line1\nline2\nline3\nline4\nline5")
>>> for segment in text.iter_segments(2):
...     print(f"Lines {segment.start}-{segment.end}")
Lines 1-3
Lines 3-5
Lines 5-6

save(path, numbered=True)

Save document to file.

Parameters:

Name Type Description Default
path Path

Output file path

required
numbered bool

Whether to save with line numbers (default: True)

True
validate_section_boundaries(section_start_lines)

Validate section boundaries for gaps, overlaps, and out-of-bounds errors.

Sections are defined by their start lines; the end of each section is implicit: it ends at the line before the next section starts, with the final section ending at the last line of the text.

Validation enforces:
- First section starts at self.start
- No overlaps (next start must be > previous start)
- No gaps (next start must be exactly previous start + 1)
- All start lines within [self.start, self.end]
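The implicit-end rule can be sketched by turning start lines into inclusive `(start, end)` ranges. `section_ranges` is a hypothetical helper. It covers only the first-line, ordering, and bounds checks, leaves out the gap rule, and is not the method's implementation.

```python
from typing import List, Tuple


def section_ranges(
    section_start_lines: List[int], first_line: int, last_line: int
) -> List[Tuple[int, int]]:
    """Derive inclusive (start, end) ranges from section start lines."""
    if not section_start_lines:
        return []
    if section_start_lines[0] != first_line:
        raise ValueError("first section must start at the first line")
    ranges: List[Tuple[int, int]] = []
    for index, start in enumerate(section_start_lines):
        if not first_line <= start <= last_line:
            raise ValueError(f"start line {start} is out of bounds")
        if index + 1 < len(section_start_lines):
            next_start = section_start_lines[index + 1]
            if next_start <= start:
                raise ValueError("section starts must be strictly increasing")
            end = next_start - 1  # Ends on the line before the next section.
        else:
            end = last_line  # The final section runs to the end of the text.
        ranges.append((start, end))
    return ranges


ranges = section_ranges([1, 4, 6], first_line=1, last_line=8)
```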

bracket_lines(text, number=False)

Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.

Args:
    text (str): The input string containing lines separated by '\n'.
    number (bool): Whether to prepend line numbers to each line.

Returns:
    str: A string where each line is enclosed in angle brackets.

Examples:
    >>> bracket_lines("This is a string with\n two lines.")
    '<This is a string with>\n< two lines.>'

    >>> bracket_lines("This is a string with\n two lines.", number=True)
    '<1:This is a string with>\n<2: two lines.>'

clean_text(text, newline=False)

Cleans a given text by replacing specific unwanted characters such as tab, and non-breaking spaces with regular spaces.

This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.

Parameters:

Name Type Description Default
text str

The text to be cleaned.

required

Returns:

Name Type Description
str str

The cleaned text with unwanted characters replaced by spaces.

Example

>>> text = "This is\n an example\ttext with\xa0extra spaces."
>>> clean_text(text)
'This is an example text with extra spaces.'
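Character replacement from a fixed mapping can be sketched with `str.translate`. The mapping covers only the tab and non-breaking space named in the description; `replace_unwanted` is a hypothetical helper, and the effect of the `newline` flag is deliberately not modeled.

```python
# Characters to replace, each mapped to a regular space.
REPLACEMENTS = {"\t": " ", "\xa0": " "}


def replace_unwanted(text: str) -> str:
    """Replace tabs and non-breaking spaces with regular spaces."""
    return text.translate(str.maketrans(REPLACEMENTS))


cleaned = replace_unwanted("an example\ttext with\xa0extra spaces.")
```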

lines_from_bracketed_text(text, start, end, keep_brackets=False)

Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.

Args:
    text (str): The input bracketed text containing lines like <...>.
    start (int): The starting line number (1-based).
    end (int): The ending line number (1-based).

Returns:
    list[str]: The lines from start to end inclusive, with angle brackets removed.

Raises:
    FormattingError: If the text contains improperly formatted lines (missing angle brackets).
    ValueError: If start or end indices are invalid or out of bounds.

Examples:
    >>> text = "<1:Line 1>\n<2:Line 2>\n<3:Line 3>"
    >>> lines_from_bracketed_text(text, 1, 2)
    ['Line 1', 'Line 2']

    >>> text = "<Line 1>\n<Line 2>\n<Line 3>"
    >>> lines_from_bracketed_text(text, 2, 3)
    ['Line 2', 'Line 3']

normalize_newlines(text, spacing=2)

Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.

Parameters:
----------
text : str
    The input text containing inconsistent newline spacing.
spacing : int, optional
    The number of newlines to insert between lines. Defaults to 2.

Returns:
-------
str
    The text with consecutive newlines reduced to the specified number of newlines.

Example:
--------
>>> raw_text = "Heading\n\n\nParagraph text 1\nParagraph text 2\n\n\n"
>>> normalize_newlines(raw_text, spacing=2)
'Heading\n\nParagraph text 1\n\nParagraph text 2\n\n'
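The normalization described above can be sketched with one regular expression, assuming every run of one or more newlines is rewritten to exactly `spacing` newlines. `collapse_newlines` is a hypothetical stand-in for `normalize_newlines`.

```python
import re


def collapse_newlines(text: str, spacing: int = 2) -> str:
    """Replace each run of newlines with exactly `spacing` newlines."""
    return re.sub(r"\n+", "\n" * spacing, text)


raw_text = "Heading\n\n\nParagraph text 1\nParagraph text 2\n\n\n"
normalized = collapse_newlines(raw_text, spacing=2)
```

Under this assumption a single newline between two paragraphs is widened to `spacing` newlines, just as a longer run is shortened.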

unbracket_lines(text, number=False)

Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.

Args:
    text (str): The input string with encapsulated lines.
    number (bool): If True, removes line numbers in the format 'digit:'.
        Raises a ValueError if `number=True` and a line does not start
        with a digit followed by a colon.

Returns:
    str: A newline-separated string with the encapsulation removed,
        and line numbers stripped if specified.

Examples:
    >>> unbracket_lines("<1:Line 1>\n<2:Line 2>", number=True)
    'Line 1\nLine 2'

    >>> unbracket_lines("<Line 1>\n<Line 2>")
    'Line 1\nLine 2'

    >>> unbracket_lines("<1Line 1>", number=True)
    ValueError: Line does not start with a valid number: '1Line 1'

bracket

FormattingError

Bases: Exception

Custom exception raised for formatting-related errors.

__init__(message='An error occurred due to invalid formatting.')
bracket_all_lines(pages)
bracket_lines(text, number=False)
Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.

Args:
    text (str): The input string containing lines separated by '\n'.
    number (bool): Whether to prepend line numbers to each line.

Returns:
    str: A string where each line is enclosed in angle brackets.

Examples:
    >>> bracket_lines("This is a string with\n two lines.")
    '<This is a string with>\n< two lines.>'

    >>> bracket_lines("This is a string with\n two lines.", number=True)
    '<1:This is a string with>\n<2: two lines.>'

lines_from_bracketed_text(text, start, end, keep_brackets=False)
Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.

Args:
    text (str): The input bracketed text containing lines like <...>.
    start (int): The starting line number (1-based).
    end (int): The ending line number (1-based).

Returns:
    list[str]: The lines from start to end inclusive, with angle brackets removed.

Raises:
    FormattingError: If the text contains improperly formatted lines (missing angle brackets).
    ValueError: If start or end indices are invalid or out of bounds.

Examples:
    >>> text = "<1:Line 1>\n<2:Line 2>\n<3:Line 3>"
    >>> lines_from_bracketed_text(text, 1, 2)
    ['Line 1', 'Line 2']

    >>> text = "<Line 1>\n<Line 2>\n<Line 3>"
    >>> lines_from_bracketed_text(text, 2, 3)
    ['Line 2', 'Line 3']

number_lines(text, start=1, separator=': ')

Numbers each line of text with a readable format, including empty lines.

Parameters:

Name Type Description Default
text str

Input text to be numbered. Can be multi-line.

required
start int

Starting line number. Defaults to 1.

1
separator str

Separator between line number and content. Defaults to ": ".

': '

Returns:

Name Type Description
str str

Numbered text where each line starts with "{number}: ".

Examples:

>>> text = "First line\nSecond line\n\nFourth line"
>>> print(number_lines(text))
1: First line
2: Second line
3:
4: Fourth line
>>> print(number_lines(text, start=5, separator=" | "))
5 | First line
6 | Second line
7 |
8 | Fourth line
Notes
  • All lines are numbered, including empty lines, to maintain text structure
  • Line numbers are aligned through natural string formatting
  • Customizable separator allows for different formatting needs
  • Can start from any line number for flexibility in text processing
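The numbering behavior described above can be approximated with `enumerate`. This is a minimal sketch rather than the library's implementation: `number_lines_sketch` is a hypothetical name, and it assumes lines are split on "\n" only.

```python
def number_lines_sketch(text: str, start: int = 1, separator: str = ": ") -> str:
    """Prefix every line, including empty ones, with its line number."""
    lines = text.split("\n")
    return "\n".join(
        f"{number}{separator}{line}" for number, line in enumerate(lines, start)
    )


numbered = number_lines_sketch("First line\nSecond line\n\nFourth line")
print(numbered)
```

Because empty lines are numbered too, the line count of the output always matches the input, which keeps later line references stable.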
unbracket_all_lines(pages)
unbracket_lines(text, number=False)
Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.

Args:
    text (str): The input string with encapsulated lines.
    number (bool): If True, removes line numbers in the format 'digit:'.
        Raises a ValueError if `number=True` and a line does not start
        with a digit followed by a colon.

Returns:
    str: A newline-separated string with the encapsulation removed,
        and line numbers stripped if specified.

Examples:
    >>> unbracket_lines("<1:Line 1>\n<2:Line 2>", number=True)
    'Line 1\nLine 2'

    >>> unbracket_lines("<Line 1>\n<Line 2>")
    'Line 1\nLine 2'

    >>> unbracket_lines("<1Line 1>", number=True)
    ValueError: Line does not start with a valid number: '1Line 1'
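To show how bracketing and unbracketing relate, here is a small round-trip sketch. The names `bracket_sketch` and `unbracket_sketch` are hypothetical stand-ins written for this example, not the module's own code, and the sketch raises a plain ValueError for missing brackets where the library may use FormattingError.

```python
import re


def bracket_sketch(text: str, number: bool = False) -> str:
    """Wrap each line in angle brackets, optionally prefixed with 'N:'."""
    lines = text.split("\n")
    if number:
        return "\n".join(f"<{i}:{line}>" for i, line in enumerate(lines, 1))
    return "\n".join(f"<{line}>" for line in lines)


def unbracket_sketch(text: str, number: bool = False) -> str:
    """Strip the angle brackets, and the 'N:' prefix when number is True."""
    result = []
    for line in text.split("\n"):
        match = re.fullmatch(r"<(.*)>", line)
        if match is None:
            raise ValueError(f"Line is not bracketed: {line!r}")
        content = match.group(1)
        if number:
            numbered = re.fullmatch(r"\d+:(.*)", content)
            if numbered is None:
                raise ValueError(
                    f"Line does not start with a valid number: {content!r}"
                )
            content = numbered.group(1)
        result.append(content)
    return "\n".join(result)


original = "Line 1\nLine 2"
bracketed = bracket_sketch(original, number=True)
restored = unbracket_sketch(bracketed, number=True)
print(bracketed)
print(restored == original)
```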

match_section

MatchObject

Bases: BaseModel

Basic Match Object definition.

case_sensitive = False class-attribute instance-attribute
decorator = None class-attribute instance-attribute
level = None class-attribute instance-attribute
pattern = None class-attribute instance-attribute
type instance-attribute
words = None class-attribute instance-attribute
SectionConfig

Bases: BaseModel

Configuration for section detection.

description = None class-attribute instance-attribute
name instance-attribute
patterns instance-attribute
find_keyword(line, words, case_sensitive, decorator)

Check if line matches keyword pattern.

find_markdown_header(line, level)

Check if line matches markdown header pattern.

find_regex(line, pattern)

Check if line matches regex pattern.

find_section_boundaries(text, config)

Find all section boundary line numbers.

numbered_text

NumberedFormat

Bases: NamedTuple

is_numbered instance-attribute
separator = None class-attribute instance-attribute
start_num = None class-attribute instance-attribute
NumberedText

Immutable container for text documents with numbered lines.

Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.

Immutability Note

NumberedText is designed to be used immutably after construction. While not enforced at runtime (for performance reasons as a low-level container), instances should not be modified after creation. All operations return new data rather than mutating the instance.

Whitespace and Blank Line Handling (Monaco Editor as standard for compatibility): NumberedText follows Monaco Editor's verbatim line and whitespace handling. Monaco Editor: https://microsoft.github.io/monaco-editor/typedoc/interfaces/IRange.html

- Blank lines: Preserved as empty strings in the lines list
- Whitespace: Leading/trailing whitespace preserved (never stripped)
- Line count: Blank lines count as lines (e.g., "a\n\nb" has 3 lines)
- Indexing: 1-based line numbers with inclusive end semantics (Monaco IRange)

Numbered Input Detection:
When input contains line numbers (e.g., "1: foo\n2:\n3: bar"):
- Pattern validation: Only non-blank lines validated for sequential numbering
- Number extraction: Removes number prefix (e.g., "2: ") from all lines
- Blank line handling: After number removal, blank lines become empty strings
- Example: "1: foo\n2:\n3: bar" → lines=[' foo', '', ' bar']
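The prefix removal described above can be sketched with a regular expression. This is an illustration under stated assumptions, not the class's actual logic: `strip_line_numbers` is a hypothetical helper, the separator is passed in rather than detected, and every line (not only non-blank ones) is checked for sequential numbering.

```python
import re


def strip_line_numbers(text: str, separator: str = ":") -> list[str]:
    """Remove sequential 'N<separator>' prefixes, keeping all other whitespace."""
    pattern = re.compile(rf"(\d+){re.escape(separator)}(.*)")
    lines: list[str] = []
    expected = None
    for raw in text.split("\n"):
        match = pattern.fullmatch(raw)
        if match is None:
            raise ValueError(f"Line is not numbered: {raw!r}")
        number = int(match.group(1))
        if expected is not None and number != expected:
            raise ValueError(f"Expected line {expected}, found {number}")
        expected = number + 1
        # Keep the remainder verbatim: leading spaces survive, blanks become ''.
        lines.append(match.group(2))
    return lines


print(strip_line_numbers("1: foo\n2:\n3: bar"))
```

Only the number and the separator are removed, so the space after the colon stays part of the line content.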

Attributes:

Name Type Description
lines List[str]

List of text lines (do not modify after construction)

start int

Starting line number (do not modify after construction)

separator str

Separator between line number and content (do not modify after construction)

Examples:

>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
>>> print(doc.get_line(2))
Second line
>>> for num, line in doc:
...     print(f"Line {num}: {len(line)} chars")
content property

Get original text without line numbers.

end property
lines = [] instance-attribute
numbered_content property

Get text with line numbers as a string. Equivalent to str(self)

numbered_lines property

Get list of lines with line numbers included.

Returns:

Type Description
List[str]

List[str]: Lines with numbers and separator prefixed

Examples:

>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
  • Unlike str(self), this returns a list rather than joined string
  • Maintains consistent formatting with separator
  • Useful for processing or displaying individual numbered lines
separator = separator instance-attribute
size property

Get the number of lines.

start = start instance-attribute
LineSegment dataclass

Represents a segment of lines with start and end indices in 1-based indexing.

The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.

Attributes:

Name Type Description
start int

Starting line number (inclusive, 1-based)

end int

Ending line number (exclusive, 1-based)

end instance-attribute
start instance-attribute
__init__(start, end)
__iter__()

Allow unpacking into start, end pairs.

SegmentIterator

Iterator for generating line segments of specified size.

Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.

Attributes:

Name Type Description
total_lines

Total number of lines in text

segment_size

Number of lines per segment

start_line

Starting line number (1-based)

min_segment_size

Minimum size for the final segment

min_segment_size = min_segment_size instance-attribute
num_segments = (remaining_lines + segment_size - 1) // segment_size instance-attribute
segment_size = segment_size instance-attribute
start_line = start_line instance-attribute
total_lines = total_lines instance-attribute
__init__(total_lines, segment_size, start_line=1, min_segment_size=None)

Initialize the segment iterator.

Parameters:

Name Type Description Default
total_lines int

Total number of lines to iterate over

required
segment_size int

Desired size of each segment

required
start_line int

First line number (default: 1)

1
min_segment_size Optional[int]

Minimum size for final segment (default: None). If specified, the last segment will be merged with the previous one if it would be smaller than this size.

None

Raises:

Type Description
ValueError

If segment_size < 1 or total_lines < 1

ValueError

If start_line < 1 (must use 1-based indexing)

ValueError

If min_segment_size >= segment_size

__iter__()

Iterate over line segments.

Yields:

Type Description
LineSegment

LineSegment containing start (inclusive) and end (exclusive) indices

__getitem__(index)

Get line content by line number (1-based indexing).

__init__(content=None, start=1, separator=':')

Initialize a numbered text document, detecting and preserving existing numbering.

Valid numbered text must have:
- Sequential line numbers
- Consistent separator character(s)
- Every non-empty line must follow the numbering pattern

Parameters:

Name Type Description Default
content Optional[str]

Initial text content, if any

None
start int

Starting line number (used only if content isn't already numbered)

1
separator str

Separator between line numbers and content (only if content isn't numbered)

':'

Examples:

>>> # Custom separators
>>> doc = NumberedText("1→First line\n2→Second line")
>>> doc.separator == "→"
True
>>> # Preserves starting number
>>> doc = NumberedText("5#First\n6#Second")
>>> doc.start == 5
True
>>> # Regular numbered list isn't treated as line numbers
>>> doc = NumberedText("1. First item\n2. Second item")
>>> doc.numbered_lines
['1: 1. First item', '2: 2. Second item']
__iter__()

Iterate over (line_number, line_content) pairs.

__len__()

Return the number of lines.

__str__()

Return the numbered text representation.

from_file(path, **kwargs) classmethod

Create a NumberedText instance from a file.

get_coverage_report(section_start_lines)

Return coverage statistics for sections defined by start lines.

get_line(line_num)

Get content of specified line number.

get_lines(start, end)

Deprecated: use get_lines_exclusive; end index remains exclusive.

get_lines_exclusive(start, end)

Get content of line range [start, end) using 1-based line numbers.

Parameters:

Name Type Description Default
start int

Inclusive start line (1-based external indexing).

required
end int

Exclusive end line (1-based; not included), matching Python slicing semantics.

required
get_numbered_line(line_num)

Get specified line with line number.

get_numbered_lines(start, end)

Get numbered lines for [start, end) using 1-based numbering.

get_numbered_segment(start, end)
get_segment(start, end)

Return the segment from start line (inclusive) up to end line (inclusive).

This aligns with Monaco's inclusive range semantics. Internally we convert to Python's exclusive upper bound when slicing.
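The conversion from an inclusive 1-based range to a Python slice can be sketched as follows. `get_segment_sketch` is a hypothetical stand-alone function, and returning a newline-joined string is an assumption made for this example.

```python
def get_segment_sketch(
    lines: list[str], start: int, end: int, first_line: int = 1
) -> str:
    """Return lines start..end (both inclusive) joined by newlines."""
    last_line = first_line + len(lines) - 1
    if not (first_line <= start <= end <= last_line):
        raise IndexError(f"Range {start}-{end} is outside {first_line}-{last_line}")
    low = start - first_line        # inclusive 0-based index
    high = end - first_line + 1     # exclusive 0-based upper bound
    return "\n".join(lines[low:high])


doc_lines = ["alpha", "beta", "gamma", "delta"]
print(get_segment_sketch(doc_lines, 2, 3))
```

The `+ 1` on the upper bound is the whole difference between the inclusive form and the exclusive `[start, end)` form used by `get_lines_exclusive`.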

iter_segments(segment_size, min_segment_size=None)

Iterate over segments of the text with specified size.

Parameters:

Name Type Description Default
segment_size int

Number of lines per segment

required
min_segment_size Optional[int]

Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size.

None

Yields:

Type Description
LineSegment

LineSegment objects containing start and end line numbers

Example

>>> text = NumberedText("line1\nline2\nline3\nline4\nline5")
>>> for segment in text.iter_segments(2):
...     print(f"Lines {segment.start}-{segment.end}")
Lines 1-3
Lines 3-5
Lines 5-6
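The segmenting rule, including the merge of a short final segment, can be sketched independently of the class. `Segment` and `iter_segments_sketch` are hypothetical names; the sketch assumes lines are numbered 1 through `total_lines` and mirrors the documented ValueError conditions.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Segment:
    start: int  # inclusive, 1-based
    end: int    # exclusive, 1-based


def iter_segments_sketch(
    total_lines: int,
    segment_size: int,
    start_line: int = 1,
    min_segment_size: Optional[int] = None,
) -> Iterator[Segment]:
    if segment_size < 1 or total_lines < 1:
        raise ValueError("segment_size and total_lines must be >= 1")
    if start_line < 1:
        raise ValueError("start_line must be >= 1")
    if min_segment_size is not None and min_segment_size >= segment_size:
        raise ValueError("min_segment_size must be smaller than segment_size")
    stop = total_lines + 1  # exclusive bound one past the last line
    current = start_line
    while current < stop:
        end = min(current + segment_size, stop)
        leftover = stop - end
        # Absorb a too-short tail into the current segment.
        if min_segment_size is not None and 0 < leftover < min_segment_size:
            end = stop
        yield Segment(current, end)
        current = end


segments = list(iter_segments_sketch(5, 2))
merged = list(iter_segments_sketch(7, 3, min_segment_size=2))
print(segments)
print(merged)
```

With seven lines and a segment size of three, the single leftover line is folded into the second segment instead of forming its own.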

save(path, numbered=True)

Save document to file.

Parameters:

Name Type Description Default
path Path

Output file path

required
numbered bool

Whether to save with line numbers (default: True)

True
validate_section_boundaries(section_start_lines)

Validate section boundaries for gaps, overlaps, and out-of-bounds errors.

Sections are defined by their start lines; the end of each section is implicit: it ends at the line before the next section starts, with the final section ending at the last line of the text.

Validation enforces:
- First section starts at self.start
- No overlaps (next start must be > previous start)
- No gaps (next start must be exactly previous start + 1)
- All start lines within [self.start, self.end]
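The implicit-end rule can be made concrete with a short sketch that turns start lines into inclusive (start, end) ranges. `implicit_section_ranges` is a hypothetical helper; it raises ValueError for brevity, whereas the method described here reports problems as data, and it checks only the first-start, ordering, and bounds rules.

```python
def implicit_section_ranges(
    start_lines: list[int], first_line: int, last_line: int
) -> list[tuple[int, int]]:
    """Derive inclusive (start, end) ranges from section start lines."""
    if not start_lines:
        return []
    if start_lines[0] != first_line:
        raise ValueError(f"First section must start at line {first_line}")
    for previous, current in zip(start_lines, start_lines[1:]):
        if current <= previous:
            raise ValueError(f"Section at {current} overlaps section at {previous}")
    if start_lines[-1] > last_line:
        raise ValueError(f"Start line {start_lines[-1]} is past line {last_line}")
    # Each section ends one line before the next begins; the last runs to the end.
    ends = [nxt - 1 for nxt in start_lines[1:]] + [last_line]
    return list(zip(start_lines, ends))


print(implicit_section_ranges([1, 4, 9], first_line=1, last_line=12))
```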

SectionValidationError

Bases: BaseModel

Error found in section boundaries.

Error metadata class following tnh-scholar standards:
- Pydantic v2 BaseModel for validation and serialization
- Frozen for immutability
- Used as data structure returned from validation methods

See: src/tnh_scholar/exceptions.py for exception classes

actual_start instance-attribute
error_type instance-attribute
expected_start instance-attribute
message instance-attribute
model_config = ConfigDict(frozen=True, extra='forbid') class-attribute instance-attribute
section_index instance-attribute
section_input_index instance-attribute
get_numbered_format(text)

Analyze text to determine if it follows a consistent line numbering format.

Valid formats have:
- Sequential numbers starting from some value
- Consistent separator character(s)
- Every line must follow the format

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
NumberedFormat

Tuple of (is_numbered, separator, start_number)

Examples:

>>> get_numbered_format("1→First\n2→Second")
(True, "→", 1)
>>> get_numbered_format("1. First")  # Numbered list format
(False, None, None)
>>> get_numbered_format("5#Line\n6#Other")
(True, "#", 5)

text_processing

clean_text(text, newline=False)

Cleans the given text by replacing unwanted characters, such as tabs and non-breaking spaces, with regular spaces.

This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.

Parameters:

Name Type Description Default
text str

The text to be cleaned.

required

Returns:

Name Type Description
str str

The cleaned text with unwanted characters replaced by spaces.

Example

>>> text = "This is\n an example\ttext with\xa0extra spaces."
>>> clean_text(text)
'This is an example text with extra spaces.'
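A mapping-based replacement like the one described can be sketched with `str.translate`. The mapping contents and the `replace_newlines` flag are assumptions made for illustration; the exact behavior of the real `newline` parameter is not documented here.

```python
# Hypothetical mapping of unwanted characters to their replacement.
REPLACEMENTS = {"\t": " ", "\xa0": " "}


def clean_text_sketch(text: str, replace_newlines: bool = False) -> str:
    """Replace tabs and non-breaking spaces (optionally newlines) with spaces."""
    mapping = dict(REPLACEMENTS)
    if replace_newlines:
        mapping["\n"] = " "
    return text.translate(str.maketrans(mapping))


print(clean_text_sketch("tab\there\xa0now"))
```

Using a translation table applies all replacements in a single pass over the string.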

normalize_newlines(text, spacing=2)
Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.

Parameters:
----------
text : str
    The input text containing inconsistent newline spacing.
spacing : int, optional
    The number of newlines to insert between lines. Defaults to 2.

Returns:
-------
str
    The text with consecutive newlines reduced to the specified number of newlines.

Example:
--------
>>> raw_text = "Heading\n\n\nParagraph text 1\nParagraph text 2\n\n\n"
>>> normalize_newlines(raw_text, spacing=2)
'Heading\n\nParagraph text 1\n\nParagraph text 2\n\n'
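One way to get this behavior is a single regular-expression substitution that replaces every run of newlines with a fixed number of them. This is a sketch under that assumption, with `normalize_newlines_sketch` as a hypothetical name.

```python
import re


def normalize_newlines_sketch(text: str, spacing: int = 2) -> str:
    """Replace each run of one or more newlines with exactly `spacing` newlines."""
    return re.sub(r"\n+", "\n" * spacing, text)


raw_text = "Heading\n\n\nParagraph text 1\nParagraph text 2\n\n\n"
normalized = normalize_newlines_sketch(raw_text, spacing=2)
print(repr(normalized))
```

Note that single newlines are expanded as well, so lines separated by one newline end up separated by `spacing` newlines.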

tools

Internal helper utilities for dev workflows.

notebook_prep

Utilities for maintaining paired *_local.ipynb notebooks.

EXCLUDED_PARTS = {'.ipynb_checkpoints'} module-attribute
prep_notebooks(directory, dry_run=True)

Create *_local notebooks and strip outputs from originals.

Parameters

directory: Directory whose notebooks will be processed.
dry_run: When True, only report pending work without copying files or invoking nbconvert.

tree_builder

Helpers for generating directory-tree text files.

build_tree(root_dir, src_dir=None)

Generate directory trees for the project and optionally its source directory.

utils

__all__ = ['copy_files_with_regex', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_as_str', 'read_str_from_file', 'sanitize_filename', 'to_slug', 'write_str_to_file', 'load_json_into_model', 'load_jsonl_to_dict', 'save_model_to_json', 'get_language_code_from_text', 'get_language_from_code', 'get_language_name_from_text', 'fraction_to_percent', 'ExpectedTimeTQDM', 'TimeProgress', 'TimeMs', 'TNHAudioSegment', 'convert_ms_to_sec', 'convert_sec_to_ms', 'get_user_confirmation', 'check_ocr_env', 'check_openai_env'] module-attribute

ExpectedTimeTQDM

A context manager for a time-based tqdm progress bar with optional delay.

  • 'expected_time': number of seconds we anticipate the task might take.
  • 'display_interval': how often (seconds) to refresh the bar.
  • 'desc': a short description for the bar.
  • 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.

If the task finishes before 'delay_start' has elapsed, the bar may never appear.

delay_start = delay_start instance-attribute
desc = desc instance-attribute
display_interval = display_interval instance-attribute
expected_time = round(expected_time) instance-attribute
__enter__()
__exit__(exc_type, exc_value, traceback)
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)

TNHAudioSegment

raw property

Access the underlying pydub.AudioSegment if needed.

__add__(other)
__getitem__(key)
__iadd__(other)
__init__(segment)
__len__()
empty() staticmethod
export(out_f, format, **kwargs)

Wrapper: Export the audio segment to a file-like object or file path.

Parameters:

Name Type Description Default
out_f str | BinaryIO

File path or file-like object to write the audio data to.

required
format str

Audio format (e.g., 'mp3', 'wav').

required
**kwargs Any

Additional keyword arguments passed to pydub.AudioSegment.export.

{}
from_file(file, format=None, **kwargs) staticmethod

Wrapper: Load an audio file into a TNHAudioSegment.

Parameters:

Name Type Description Default
file str | Path | BytesIO

Path to the audio file.

required
format str | None

Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it.

None
**kwargs Any

Additional keyword arguments passed to pydub.AudioSegment.from_file.

{}

Returns:

Type Description
TNHAudioSegment

TNHAudioSegment instance containing the loaded audio.

silent(duration) staticmethod

TimeMs

Bases: int

Lightweight representation of a time interval or timestamp in milliseconds. Allows negative values.

__add__(other)
__get_pydantic_core_schema__(source_type, handler) classmethod
__new__(ms)
__radd__(other)
__repr__()
__rsub__(other)
__sub__(other)
from_seconds(seconds) classmethod
to_ms()
to_seconds()
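The idea of an `int` subclass that keeps its type through arithmetic can be sketched as below. `Millis` is a hypothetical class written for this example; it covers only a few of the methods listed above and omits the Pydantic schema hook.

```python
class Millis(int):
    """Integer number of milliseconds; negative values are allowed."""

    def __new__(cls, ms: float) -> "Millis":
        return super().__new__(cls, round(ms))

    @classmethod
    def from_seconds(cls, seconds: float) -> "Millis":
        return cls(seconds * 1000)

    def to_seconds(self) -> float:
        return int(self) / 1000

    def __add__(self, other: int) -> "Millis":
        return Millis(int(self) + int(other))

    def __sub__(self, other: int) -> "Millis":
        return Millis(int(self) - int(other))

    def __repr__(self) -> str:
        return f"Millis({int(self)})"


start = Millis.from_seconds(1.5)
elapsed = Millis(250) - start
print(start, elapsed, elapsed.to_seconds())
```

Overriding `__add__` and `__sub__` matters because plain `int` arithmetic would otherwise return an ordinary `int` and lose the wrapper type.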

TimeProgress

A context manager for a time-based progress display using dots.

The display updates once per second, printing a dot and showing:
- Expected time (if provided)
- Elapsed time (always displayed)

Example:

>>> import time
>>> with TimeProgress(expected_time=60, desc="Transcribing..."):
...     time.sleep(5)  # Simulate work
[Expected Time: 1:00, Elapsed Time: 0:05] .....

Parameters:

Name Type Description Default
expected_time Optional[float]

Expected time in seconds. Optional.

None
display_interval float

How often to print a dot (seconds).

1.0
desc str

Description to display alongside the progress.

''
desc = desc instance-attribute
display_interval = display_interval instance-attribute
expected_time = expected_time instance-attribute
__enter__()
__exit__(exc_type, exc_value, traceback)
__init__(expected_time=None, display_interval=1.0, desc='')
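A dot-printing context manager of this kind is commonly built on a background thread and a `threading.Event`. The sketch below shows that pattern; `DotProgress` and its `stream` parameter are hypothetical, and it omits the expected/elapsed time header.

```python
import sys
import threading
import time


class DotProgress:
    """Print a dot every `display_interval` seconds while the block runs."""

    def __init__(self, display_interval: float = 1.0, desc: str = "", stream=None):
        self.display_interval = display_interval
        self.desc = desc
        self.stream = stream if stream is not None else sys.stdout
        self.dots = 0
        self._stop = threading.Event()
        self._thread = None

    def _run(self) -> None:
        # wait() returns False on timeout, True once the event is set.
        while not self._stop.wait(self.display_interval):
            self.stream.write(".")
            self.stream.flush()
            self.dots += 1

    def __enter__(self) -> "DotProgress":
        self.stream.write(f"{self.desc} ")
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self._stop.set()
        self._thread.join()
        self.stream.write("\n")
        return False  # never suppress exceptions from the block


with DotProgress(display_interval=0.05, desc="Transcribing..."):
    time.sleep(0.2)  # Simulate work
```

Waiting on the event instead of calling `time.sleep` lets the worker thread stop promptly when the block exits.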

check_ocr_env(output=True)

Check OCR processing requirements.

check_openai_env(output=True)

Check OpenAI API requirements.

convert_ms_to_sec(ms)

Convert time from milliseconds (int) to seconds (float).

convert_sec_to_ms(val)

Convert seconds to milliseconds, rounding to the nearest integer.

copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)

Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.

Parameters:

Name Type Description Default
source_dir Path

Path to the source directory to search files in.

required
destination_dir Path

Path to the destination directory where files will be copied.

required
regex_patterns list[str]

List of regex patterns to match file names.

required
preserve_structure bool

Whether to preserve the directory structure. Defaults to True.

True

Raises:

Type Description
ValueError

If the source directory does not exist or is not a directory.

Example

>>> copy_files_with_regex(
...     source_dir=Path("/path/to/source"),
...     destination_dir=Path("/path/to/destination"),
...     regex_patterns=[r'.*\.txt$', r'.*\.log$'],
...     preserve_structure=True
... )
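The copy logic described above can be sketched with `pathlib`, `re`, and `shutil`. `copy_matching_files` is a hypothetical stand-in, and returning the list of copied paths is an addition made for this example.

```python
import re
import shutil
from pathlib import Path


def copy_matching_files(
    source_dir: Path,
    destination_dir: Path,
    regex_patterns: list[str],
    preserve_structure: bool = True,
) -> list[Path]:
    """Copy files found one level below source_dir whose names match a pattern."""
    if not source_dir.is_dir():
        raise ValueError(f"{source_dir} does not exist or is not a directory")
    compiled = [re.compile(pattern) for pattern in regex_patterns]
    copied: list[Path] = []
    for subdir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        for file in sorted(p for p in subdir.iterdir() if p.is_file()):
            if not any(rx.match(file.name) for rx in compiled):
                continue
            # Keep the subdirectory name only when structure is preserved.
            target_dir = destination_dir / subdir.name if preserve_structure else destination_dir
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / file.name
            shutil.copy2(file, target)
            copied.append(target)
    return copied
```

With `preserve_structure=False`, files with the same name in different subdirectories would overwrite each other in this sketch.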

ensure_directory_exists(dir_path)

Create directory if it doesn't exist.

Parameters:

Name Type Description Default
dir_path Path

Directory path to ensure exists.

required

Returns:

Name Type Description
bool bool

True if the directory exists or was created successfully, False otherwise.

ensure_directory_writable(dir_path)

Ensure the directory exists and is writable. Creates the directory if it does not exist.

Parameters:

Name Type Description Default
dir_path Path

Directory to verify or create.

required

Raises:

Type Description
ValueError

If the directory cannot be created or is not writable.

TypeError

If the provided path is not a Path instance.

fraction_to_percent(numerator, denominator)

Convert a fraction to a percentage (0.0 if denominator is zero).
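The zero-denominator guard amounts to a single conditional. A minimal sketch, with `fraction_to_percent_sketch` as a hypothetical name:

```python
def fraction_to_percent_sketch(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage, or 0.0 when dividing by zero."""
    if denominator == 0:
        return 0.0
    return numerator / denominator * 100


print(fraction_to_percent_sketch(1, 4))
print(fraction_to_percent_sketch(3, 0))
```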

get_language_code_from_text(text)

Detect the language of the provided text using langdetect.

Parameters:

Name Type Description Default
text str

Text to analyze.

required

Returns:

Name Type Description
str str

ISO 639-1 code of the detected language.

Raises:

Type Description
ValueError

If text is empty or invalid

get_language_from_code(code)

get_language_name_from_text(text)

get_user_confirmation(prompt, default=True)

Prompt the user for a yes/no confirmation with single-character input. Cross-platform implementation. Returns True if 'y' is entered and False if 'n' is entered. If Return is pressed, the default value is used.

Example usage:

if get_user_confirmation("Do you want to continue"):
    print("Continuing...")
else:
    print("Exiting...")

iterate_subdir(directory, recursive=False)

Iterates through subdirectories in the given directory.

Parameters:

Name Type Description Default
directory Path

The root directory to start the iteration.

required
recursive bool

If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories.

False

Yields:

Name Type Description
Path Path

Paths to each subdirectory.

Example

>>> for subdir in iterate_subdir(Path('/root'), recursive=False):
...     print(subdir)
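The difference between immediate and recursive iteration maps directly onto `Path.iterdir` and `Path.rglob`. The following sketch assumes that mapping; `iterate_subdir_sketch` is a hypothetical name, and raising ValueError for a missing directory is an assumption.

```python
from pathlib import Path
from typing import Iterator


def iterate_subdir_sketch(directory: Path, recursive: bool = False) -> Iterator[Path]:
    """Yield subdirectories of `directory`, optionally descending into them."""
    if not directory.is_dir():
        raise ValueError(f"{directory} is not a directory")
    candidates = directory.rglob("*") if recursive else directory.iterdir()
    for path in candidates:
        if path.is_dir():
            yield path
```

Iteration order follows the file system, so callers that need a stable order should sort the results.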

load_json_into_model(file, model)

Loads a JSON file and validates it against a Pydantic model.

Parameters:

Name Type Description Default
file Path

Path to the JSON file.

required
model type[BaseModel]

The Pydantic model to validate against.

required

Returns:

Name Type Description
BaseModel BaseModel

An instance of the validated Pydantic model.

Raises:

Type Description
ValueError

If the file content is invalid JSON or does not match the model.

Example:

class ExampleModel(BaseModel):
    name: str
    age: int
    city: str

if __name__ == "__main__":
    json_file = Path("example.json")
    try:
        data = load_json_into_model(json_file, ExampleModel)
        print(data)
    except ValueError as e:
        print(e)

load_jsonl_to_dict(file_path)

Load a JSONL file into a list of dictionaries.

Parameters:

Name Type Description Default
file_path Path

Path to the JSONL file.

required

Returns:

Type Description
List[Dict]

List[Dict]: A list of dictionaries, each representing a line in the JSONL file.

Example

>>> from pathlib import Path
>>> file_path = Path("data.jsonl")
>>> data = load_jsonl_to_dict(file_path)
>>> print(data)
[{'key1': 'value1'}, {'key2': 'value2'}]
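Reading JSONL needs only the standard `json` module, parsing one object per line. A minimal sketch, with `load_jsonl_sketch` as a hypothetical name; skipping blank lines is an assumption.

```python
import json
from pathlib import Path


def load_jsonl_sketch(file_path: Path) -> list[dict]:
    """Parse each non-blank line of a JSONL file into a dictionary."""
    records: list[dict] = []
    with file_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines (an assumption)
                records.append(json.loads(line))
    return records
```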

path_as_str(path)

read_str_from_file(file_path)

Reads the entire content of a text file.

Parameters:

Name Type Description Default
file_path Path

The path to the text file.

required

Returns:

Type Description
str

The content of the text file as a single string.

sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)

Sanitize a filename for use on Unix systems.

save_model_to_json(file, model, indent=4, ensure_ascii=False)

Saves a Pydantic model to a JSON file, formatted with indentation for readability.

Parameters:

Name Type Description Default
file Path

Path to the JSON file where the model will be saved.

required
model BaseModel

The Pydantic model instance to save.

required
indent int

Number of spaces for JSON indentation. Defaults to 4.

4
ensure_ascii bool

Whether to escape non-ASCII characters. Defaults to False.

False

Raises:

Type Description
ValueError

If the model cannot be serialized to JSON.

IOError

If there is an issue writing to the file.

Example

class ExampleModel(BaseModel):
    name: str
    age: int

if __name__ == "__main__":
    model_instance = ExampleModel(name="John", age=30)
    json_file = Path("example.json")
    try:
        save_model_to_json(json_file, model_instance)
        print(f"Model saved to {json_file}")
    except (ValueError, IOError) as e:
        print(e)

to_slug(string)

Slugify a Unicode string.

Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.

Example

>>> to_slug("Héllø_Wörld!")
'hello-world'
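A strict slug of this kind can be sketched with `unicodedata` and `re`. `to_slug_sketch` is a hypothetical stand-in and may differ from the library: it drops characters that have no ASCII decomposition (such as "ø") instead of transliterating them, so the sample input below avoids them.

```python
import re
import unicodedata


def to_slug_sketch(string: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    decomposed = unicodedata.normalize("NFKD", string)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    hyphenated = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return hyphenated.strip("-")


print(to_slug_sketch("Héllo_Wörld!"))
```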

write_str_to_file(file_path, text, overwrite=False)

Writes text to a file with file locking.

Parameters:

Name Type Description Default
file_path PathLike

The path to the file to write.

required
text str

The text to write to the file.

required
overwrite bool

Whether to overwrite the file if it exists.

False

Raises:

Type Description
FileExistsError

If the file exists and overwrite is False.

OSError

If there's an issue with file locking or writing.

file_utils

DEFAULT_MAX_FILENAME_LENGTH = 25 module-attribute
PathLike = Union[str, Path] module-attribute
__all__ = ['DEFAULT_MAX_FILENAME_LENGTH', 'FileExistsWarning', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_source_str', 'copy_files_with_regex', 'read_str_from_file', 'write_str_to_file', 'sanitize_filename', 'to_slug', 'path_as_str'] module-attribute
FileExistsWarning

Bases: UserWarning

copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)

Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.

Parameters:

Name Type Description Default
source_dir Path

Path to the source directory to search files in.

required
destination_dir Path

Path to the destination directory where files will be copied.

required
regex_patterns list[str]

List of regex patterns to match file names.

required
preserve_structure bool

Whether to preserve the directory structure. Defaults to True.

True

Raises:

Type Description
ValueError

If the source directory does not exist or is not a directory.

Example

>>> copy_files_with_regex(
...     source_dir=Path("/path/to/source"),
...     destination_dir=Path("/path/to/destination"),
...     regex_patterns=[r'.*\.txt$', r'.*\.log$'],
...     preserve_structure=True
... )

ensure_directory_exists(dir_path)

Create directory if it doesn't exist.

Parameters:

Name Type Description Default
dir_path Path

Directory path to ensure exists.

required

Returns:

Name Type Description
bool bool

True if the directory exists or was created successfully, False otherwise.

ensure_directory_writable(dir_path)

Ensure the directory exists and is writable. Creates the directory if it does not exist.

Parameters:

Name Type Description Default
dir_path Path

Directory to verify or create.

required

Raises:

Type Description
ValueError

If the directory cannot be created or is not writable.

TypeError

If the provided path is not a Path instance.

iterate_subdir(directory, recursive=False)

Iterates through subdirectories in the given directory.

Parameters:

Name Type Description Default
directory Path

The root directory to start the iteration.

required
recursive bool

If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories.

False

Yields:

Name Type Description
Path Path

Paths to each subdirectory.

Example

>>> for subdir in iterate_subdir(Path('/root'), recursive=False):
...     print(subdir)

path_as_str(path)
path_source_str(path)
read_str_from_file(file_path)

Reads the entire content of a text file.

Parameters:

Name Type Description Default
file_path Path

The path to the text file.

required

Returns:

Type Description
str

The content of the text file as a single string.

sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)

Sanitize a filename for use on Unix systems.

to_slug(string)

Slugify a Unicode string.

Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.

Example

>>> to_slug("Héllø_Wörld!")
'hello-world'

write_str_to_file(file_path, text, overwrite=False)

Writes text to a file with file locking.

Parameters:

Name Type Description Default
file_path PathLike

The path to the file to write.

required
text str

The text to write to the file.

required
overwrite bool

Whether to overwrite the file if it exists.

False

Raises:

Type Description
FileExistsError

If the file exists and overwrite is False.

OSError

If there's an issue with file locking or writing.

json_utils

format_json(file)

Formats a JSON file with line breaks and indentation for readability.

Parameters:

Name Type Description Default
file Path

Path to the JSON file to be formatted.

required
Example

format_json(Path("data.json"))

load_json_into_model(file, model)

Loads a JSON file and validates it against a Pydantic model.

Parameters:

Name Type Description Default
file Path

Path to the JSON file.

required
model type[BaseModel]

The Pydantic model to validate against.

required

Returns:

Name Type Description
BaseModel BaseModel

An instance of the validated Pydantic model.

Raises:

Type Description
ValueError

If the file content is invalid JSON or does not match the model.

Example:

class ExampleModel(BaseModel):
    name: str
    age: int
    city: str

if __name__ == "__main__":
    json_file = Path("example.json")
    try:
        data = load_json_into_model(json_file, ExampleModel)
        print(data)
    except ValueError as e:
        print(e)
load_jsonl_to_dict(file_path)

Load a JSONL file into a list of dictionaries.

Parameters:

Name Type Description Default
file_path Path

Path to the JSONL file.

required

Returns:

Type Description
List[Dict]

List[Dict]: A list of dictionaries, each representing a line in the JSONL file.

Example

>>> from pathlib import Path
>>> file_path = Path("data.jsonl")
>>> data = load_jsonl_to_dict(file_path)
>>> print(data)
[{'key1': 'value1'}, {'key2': 'value2'}]

save_model_to_json(file, model, indent=4, ensure_ascii=False)

Saves a Pydantic model to a JSON file, formatted with indentation for readability.

Parameters:

Name Type Description Default
file Path

Path to the JSON file where the model will be saved.

required
model BaseModel

The Pydantic model instance to save.

required
indent int

Number of spaces for JSON indentation. Defaults to 4.

4
ensure_ascii bool

Whether to escape non-ASCII characters. Defaults to False.

False

Raises:

Type Description
ValueError

If the model cannot be serialized to JSON.

IOError

If there is an issue writing to the file.

Example

class ExampleModel(BaseModel):
    name: str
    age: int

if __name__ == "__main__":
    model_instance = ExampleModel(name="John", age=30)
    json_file = Path("example.json")
    try:
        save_model_to_json(json_file, model_instance)
        print(f"Model saved to {json_file}")
    except (ValueError, IOError) as e:
        print(e)

write_data_to_json_file(file, data, indent=4, ensure_ascii=False)

Writes a dictionary or list as a JSON string to a file, ensuring the parent directory exists, and supports formatting with indentation and ASCII control.

Parameters:

Name Type Description Default
file Path

Path to the JSON file where the data will be written.

required
data Union[dict, list]

The data to write to the file. Typically a dict or list.

required
indent int

Number of spaces for JSON indentation. Defaults to 4.

4
ensure_ascii bool

Whether to escape non-ASCII characters. Defaults to False.

False

Raises:

Type Description
ValueError

If the data cannot be serialized to JSON.

IOError

If there is an issue writing to the file.

Example

from pathlib import Path
data = {"key": "value"}
write_data_to_json_file(Path("output.json"), data, indent=2, ensure_ascii=True)
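The behavior described for this function (create the parent directory, serialize with indentation, optionally keep non-ASCII characters) can be sketched with the standard library alone. `write_json` below is a hypothetical stand-in, not the package's code; mapping `TypeError` to `ValueError` is an assumption based on the documented exceptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def write_json(file: Path, data: Union[dict, list], indent: int = 4, ensure_ascii: bool = False) -> None:
    """Hypothetical stand-in for write_data_to_json_file."""
    file.parent.mkdir(parents=True, exist_ok=True)  # create missing parent directories
    try:
        text = json.dumps(data, indent=indent, ensure_ascii=ensure_ascii)
    except (TypeError, ValueError) as exc:
        # json.dumps raises TypeError for unsupported types; surface it as ValueError
        raise ValueError(f"Data cannot be serialized to JSON: {exc}") from exc
    file.write_text(text, encoding="utf-8")


# Demo: the "nested" directory does not exist yet and is created on write.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "output.json"
    write_json(target, {"name": "Thích Nhất Hạnh"}, indent=2)
    written = target.read_text(encoding="utf-8")
    round_trip = json.loads(written)
```

With `ensure_ascii=False` the accented characters are stored as-is instead of as `\uXXXX` escapes, which keeps Vietnamese text readable in the saved file.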

lang

logger = get_child_logger(__name__) module-attribute
get_language_code_from_text(text)

Detect the language of the provided text using langdetect.

Parameters:

Name Type Description Default
text str

Text to analyze.

required

Returns:

Name Type Description
str str

ISO 639-1 code of the detected language.

Raises:

Type Description
ValueError

If text is empty or invalid

get_language_from_code(code)
get_language_name_from_text(text)

math_utils

fraction_to_percent(numerator, denominator)

Convert a fraction to a percentage (0.0 if denominator is zero).
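As a quick illustration of the zero-denominator rule, here is a minimal sketch under the assumption that the result is a float in the range implied by the inputs. The name `fraction_to_percent_sketch` is hypothetical.

```python
def fraction_to_percent_sketch(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage, or 0.0 when the denominator is zero."""
    if denominator == 0:
        return 0.0  # avoid ZeroDivisionError, as the description above specifies
    return 100.0 * numerator / denominator


quarter = fraction_to_percent_sketch(1, 4)
undefined = fraction_to_percent_sketch(3, 0)
```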

progress_utils

BAR_FORMAT = '{desc}: {percentage:3.0f}%|{bar}| Total: {total_fmt} sec. [elapsed: {elapsed}]' module-attribute
ExpectedTimeTQDM

A context manager for a time-based tqdm progress bar with optional delay.

  • 'expected_time': number of seconds we anticipate the task might take.
  • 'display_interval': how often (seconds) to refresh the bar.
  • 'desc': a short description for the bar.
  • 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.

If the task finishes before 'delay_start' has elapsed, the bar may never appear.

delay_start = delay_start instance-attribute
desc = desc instance-attribute
display_interval = display_interval instance-attribute
expected_time = round(expected_time) instance-attribute
__enter__()
__exit__(exc_type, exc_value, traceback)
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)
TimeProgress

A context manager for a time-based progress display using dots.

The display updates once per second, printing a dot and showing:

  • Expected time (if provided)
  • Elapsed time (always displayed)

Example:

import time
with TimeProgress(expected_time=60, desc="Transcribing..."):
    time.sleep(5)  # Simulate work
# [Expected Time: 1:00, Elapsed Time: 0:05] .....

Parameters:

Name Type Description Default
expected_time Optional[float]

Expected time in seconds. Optional.

None
display_interval float

How often to print a dot (seconds).

1.0
desc str

Description to display alongside the progress.

''
desc = desc instance-attribute
display_interval = display_interval instance-attribute
expected_time = expected_time instance-attribute
__enter__()
__exit__(exc_type, exc_value, traceback)
__init__(expected_time=None, display_interval=1.0, desc='')
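To make the context-manager pattern concrete, the sketch below shows one way a dot-per-interval display could be driven from a background thread. `DotProgress` and its `dots` and `elapsed` attributes are hypothetical; the actual class may be structured differently.

```python
import threading
import time
from typing import Optional


class DotProgress:
    """Hypothetical analogue of TimeProgress: prints one dot per interval until the block exits."""

    def __init__(self, expected_time: Optional[float] = None,
                 display_interval: float = 1.0, desc: str = "") -> None:
        self.expected_time = expected_time
        self.display_interval = display_interval
        self.desc = desc
        self.dots = 0
        self.elapsed = 0.0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop ticks once per interval until stopped.
        while not self._stop.wait(self.display_interval):
            self.dots += 1
            print(".", end="", flush=True)

    def __enter__(self) -> "DotProgress":
        self._start = time.monotonic()
        if self.desc:
            print(self.desc, end=" ", flush=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self._stop.set()      # ask the ticker thread to finish
        self._thread.join()   # wait for it so no dot is printed after the summary
        self.elapsed = time.monotonic() - self._start
        print(f" [Elapsed: {self.elapsed:.2f}s]")


# Demo with a short interval so it finishes quickly.
with DotProgress(display_interval=0.01, desc="Working") as progress:
    time.sleep(0.1)
```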

timing_utils

TimeMs

Bases: int

Lightweight representation of a time interval or timestamp in milliseconds. Allows negative values.

__add__(other)
__get_pydantic_core_schema__(source_type, handler) classmethod
__new__(ms)
__radd__(other)
__repr__()
__rsub__(other)
__sub__(other)
from_seconds(seconds) classmethod
to_ms()
to_seconds()
convert_ms_to_sec(ms)

Convert time from milliseconds (int) to seconds (float).

convert_sec_to_ms(val)

Convert seconds to milliseconds, rounding to the nearest integer.
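The idea of an `int` subclass that keeps its type under addition and subtraction can be sketched as follows. `Milliseconds` is a hypothetical analogue of `TimeMs` that mirrors only the method names listed above; the Pydantic schema hook is omitted, and the rounding behavior is an assumption.

```python
class Milliseconds(int):
    """Hypothetical analogue of TimeMs: an int that stays typed under + and -."""

    def __new__(cls, ms: float) -> "Milliseconds":
        return super().__new__(cls, round(ms))

    @classmethod
    def from_seconds(cls, seconds: float) -> "Milliseconds":
        return cls(seconds * 1000)

    def to_seconds(self) -> float:
        return int(self) / 1000.0

    def __add__(self, other: int) -> "Milliseconds":
        return Milliseconds(int(self) + int(other))

    __radd__ = __add__  # addition is commutative, so the reflected form is identical

    def __sub__(self, other: int) -> "Milliseconds":
        return Milliseconds(int(self) - int(other))

    def __rsub__(self, other: int) -> "Milliseconds":
        return Milliseconds(int(other) - int(self))

    def __repr__(self) -> str:
        return f"Milliseconds({int(self)})"


start = Milliseconds.from_seconds(1.5)
shifted = start + 250
negative = Milliseconds(100) - 400   # negative intervals are allowed
reflected = 2000 - start             # uses __rsub__
```

Without the operator overloads, `start + 250` would silently fall back to a plain `int`, losing the unit information the type is meant to carry.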

tnh_audio_segment

TNHAudioSegment: A typed, minimal wrapper for pydub.AudioSegment.

This class provides a type-safe interface for working with audio segments using pydub, enabling easier composition, slicing, and manipulation of audio data. It exposes common operations such as concatenation, slicing, and length retrieval, while hiding the underlying pydub implementation.

Key features
  • Type-annotated methods for static analysis and IDE support
  • Static constructors for silent and empty segments
  • Operator overloads for concatenation and slicing
  • Access to the underlying pydub.AudioSegment via the raw property

Extend this class with additional methods as needed for your audio processing workflows.

TNHAudioSegment
raw property

Access the underlying pydub.AudioSegment if needed.

__add__(other)
__getitem__(key)
__iadd__(other)
__init__(segment)
__len__()
empty() staticmethod
export(out_f, format, **kwargs)

Wrapper: Export the audio segment to a file-like object or file path.

Parameters:

Name Type Description Default
out_f str | BinaryIO

File path or file-like object to write the audio data to.

required
format str

Audio format (e.g., 'mp3', 'wav').

required
**kwargs Any

Additional keyword arguments passed to pydub.AudioSegment.export.

{}
from_file(file, format=None, **kwargs) staticmethod

Wrapper: Load an audio file into a TNHAudioSegment.

Parameters:

Name Type Description Default
file str | Path | BytesIO

Path to the audio file.

required
format str | None

Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it.

None
**kwargs Any

Additional keyword arguments passed to pydub.AudioSegment.from_file.

{}

Returns:

Type Description
TNHAudioSegment

TNHAudioSegment instance containing the loaded audio.

silent(duration) staticmethod

user_io_utils

get_single_char(prompt=None)

Get a single character from input, adapting to the execution environment.

Parameters:

Name Type Description Default
prompt Optional[str]

Optional prompt to display before getting input

None

Returns:

Type Description
str

A single character string from user input

Note
  • In terminal environments, uses raw input mode without requiring Enter
  • In Jupyter/IPython, falls back to regular input and tells the user to press Enter
get_user_confirmation(prompt, default=True)

Prompt the user for a yes/no confirmation with single-character input. Cross-platform implementation. Returns True if 'y' is entered and False if 'n' is entered. If Return is pressed without input, the default value is used.

Example usage

if get_user_confirmation("Do you want to continue"):
    print("Continuing...")
else:
    print("Exiting...")
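The decision rule described above (y, n, or Return for the default) can be isolated from terminal handling. The helper below is a hypothetical sketch; treating input case-insensitively and returning `None` for unrecognized keys are assumptions.

```python
from typing import Optional


def interpret_confirmation(char: str, default: bool = True) -> Optional[bool]:
    """Hypothetical helper: map one typed character to True, False, or None (ask again)."""
    if char in ("\r", "\n", ""):
        return default  # Return pressed without input: use the default
    lowered = char.lower()
    if lowered == "y":
        return True
    if lowered == "n":
        return False
    return None  # unrecognized key; a caller would typically re-prompt


yes = interpret_confirmation("y")
no = interpret_confirmation("N")
enter_default = interpret_confirmation("\n", default=False)
unknown = interpret_confirmation("x")
```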

validate

OCR_ENV_VARS = {'GOOGLE_APPLICATION_CREDENTIALS'} module-attribute
OPENAI_ENV_VARS = {'OPENAI_API_KEY'} module-attribute
logger = get_child_logger(__name__) module-attribute
check_env(required_vars, feature='this feature', output=True)

Check environment variables and provide user-friendly error messages.

Parameters:

Name Type Description Default
required_vars Set[str]

Set of environment variable names to check

required
feature str

Description of feature requiring these variables

'this feature'

Returns:

Name Type Description
bool bool

True if all required variables are set
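A minimal sketch of this kind of check, assuming that an empty value counts as missing. `missing_env_vars` and the injectable `environ` mapping are hypothetical and exist only to keep the example testable without touching the real environment.

```python
import os
from typing import List, Mapping, Set


def missing_env_vars(required_vars: Set[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty, sorted for stable output."""
    return sorted(name for name in required_vars if not environ.get(name))


# Demo against a fake environment rather than the real os.environ.
fake_env = {"OPENAI_API_KEY": "sk-test"}
missing = missing_env_vars({"OPENAI_API_KEY", "GOOGLE_APPLICATION_CREDENTIALS"}, fake_env)
all_set = not missing_env_vars({"OPENAI_API_KEY"}, fake_env)
```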

check_ocr_env(output=True)

Check OCR processing requirements.

check_openai_env(output=True)

Check OpenAI API requirements.

get_env_message(missing_vars, feature='this feature')

Generate user-friendly environment setup message.

Parameters:

Name Type Description Default
missing_vars List[str]

List of missing environment variable names

required
feature str

Name of feature requiring the variables

'this feature'

Returns:

Type Description
str

Formatted error message with setup instructions

version_check

Version checker package for monitoring package version compatibility.

__all__ = ['PackageVersionChecker', 'VersionCheckerConfig', 'VersionStrategy', 'Result', 'PackageInfo'] module-attribute
PackageInfo dataclass

Information about a package and its versions.

installed_version = None class-attribute instance-attribute
latest_version = None class-attribute instance-attribute
name instance-attribute
required_version = None class-attribute instance-attribute
__init__(name, installed_version=None, latest_version=None, required_version=None)
PackageVersionChecker

Main class for checking package versions against requirements.

cache = cache or VersionCache() instance-attribute
provider = provider or StandardVersionProvider() instance-attribute
__init__(provider=None, cache=None)
check_version(package_name, config=None)

Check if package meets version requirements based on config.

Result dataclass

Result of a version check operation.

diff_details = None class-attribute instance-attribute
error = None class-attribute instance-attribute
is_compatible instance-attribute
needs_update instance-attribute
package_info instance-attribute
warning_level = None class-attribute instance-attribute
__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
get_upgrade_command()

Return pip command to upgrade package.

VersionCheckerConfig

Configuration for version checking behavior.

cache_duration = cache_duration instance-attribute
fail_on_error = fail_on_error instance-attribute
network_timeout = network_timeout instance-attribute
requirement = requirement instance-attribute
strategy = strategy instance-attribute
vdiff_fail_matrix = vdiff_fail_matrix instance-attribute
vdiff_warn_matrix = vdiff_warn_matrix instance-attribute
__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)

Initialize version checker configuration.

get_required_version()

Get required version as a Version object.

VersionStrategy

Bases: Enum

Enumeration of version checking strategies.

EXACT = 'exact' class-attribute instance-attribute
LATEST = 'latest' class-attribute instance-attribute
MINIMUM = 'minimum' class-attribute instance-attribute
RANGE = 'range' class-attribute instance-attribute
VERSION_DIFF = 'vdiff' class-attribute instance-attribute
cache

Simple caching mechanism for version information.

VersionCache

Simple time-based cache for version information.

cache = {} instance-attribute
cache_duration = cache_duration instance-attribute
timestamps = {} instance-attribute
__init__(cache_duration=3600)

Initialize cache with specified expiration time in seconds.

get(key)

Get cached version if still valid.

is_valid(key)

Check if cached value is still valid.

set(key, value)

Cache version with current timestamp.
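The time-based expiry described for `VersionCache` can be sketched as below. `TimedCache` is a hypothetical analogue that reuses the attribute and method names listed above; the injectable `clock` parameter is an assumption added so the expiry can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Dict, Optional


class TimedCache:
    """Hypothetical analogue of VersionCache: entries expire after cache_duration seconds."""

    def __init__(self, cache_duration: float = 3600, clock: Callable[[], float] = time.time) -> None:
        self.cache_duration = cache_duration
        self.cache: Dict[str, Any] = {}
        self.timestamps: Dict[str, float] = {}
        self._clock = clock  # assumption: injectable clock, for testing only

    def is_valid(self, key: str) -> bool:
        stamp = self.timestamps.get(key)
        return stamp is not None and (self._clock() - stamp) < self.cache_duration

    def get(self, key: str) -> Optional[Any]:
        return self.cache.get(key) if self.is_valid(key) else None

    def set(self, key: str, value: Any) -> None:
        self.cache[key] = value
        self.timestamps[key] = self._clock()


# Demo with a fake clock that is advanced manually.
now = [0.0]
cache = TimedCache(cache_duration=10, clock=lambda: now[0])
cache.set("requests", "2.31.0")
fresh = cache.get("requests")
now[0] = 11.0  # move past the 10-second expiry
stale = cache.get("requests")
```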

checker

Main version checker implementation.

PackageVersionChecker

Main class for checking package versions against requirements.

cache = cache or VersionCache() instance-attribute
provider = provider or StandardVersionProvider() instance-attribute
__init__(provider=None, cache=None)
check_version(package_name, config=None)

Check if package meets version requirements based on config.

cli

Command-line interface for version checking (stub for future implementation).

main()

Command-line interface for version checking.

config

Configuration classes for version checking.

VersionCheckerConfig

Configuration for version checking behavior.

cache_duration = cache_duration instance-attribute
fail_on_error = fail_on_error instance-attribute
network_timeout = network_timeout instance-attribute
requirement = requirement instance-attribute
strategy = strategy instance-attribute
vdiff_fail_matrix = vdiff_fail_matrix instance-attribute
vdiff_warn_matrix = vdiff_warn_matrix instance-attribute
__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)

Initialize version checker configuration.

get_required_version()

Get required version as a Version object.

VersionStrategy

Bases: Enum

Enumeration of version checking strategies.

EXACT = 'exact' class-attribute instance-attribute
LATEST = 'latest' class-attribute instance-attribute
MINIMUM = 'minimum' class-attribute instance-attribute
RANGE = 'range' class-attribute instance-attribute
VERSION_DIFF = 'vdiff' class-attribute instance-attribute
models

Data models for version checking results.

PackageInfo dataclass

Information about a package and its versions.

installed_version = None class-attribute instance-attribute
latest_version = None class-attribute instance-attribute
name instance-attribute
required_version = None class-attribute instance-attribute
__init__(name, installed_version=None, latest_version=None, required_version=None)
Result dataclass

Result of a version check operation.

diff_details = None class-attribute instance-attribute
error = None class-attribute instance-attribute
is_compatible instance-attribute
needs_update instance-attribute
package_info instance-attribute
warning_level = None class-attribute instance-attribute
__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
get_upgrade_command()

Return pip command to upgrade package.

providers

Version provider implementations for retrieving package versions.

StandardVersionProvider

Bases: VersionProvider

Standard implementation of version provider using importlib and PyPI.

cache = cache or VersionCache() instance-attribute
pypi_url_template = 'https://pypi.org/pypi/{package}/json' instance-attribute
timeout = timeout instance-attribute
__init__(cache=None, timeout=5)
get_installed_version(package_name)

Get installed package version.

get_latest_version(package_name)

Get latest available package version from PyPI.

VersionProvider

Bases: ABC

Interface for retrieving package version information.

get_installed_version(package_name) abstractmethod

Get installed package version.

get_latest_version(package_name) abstractmethod

Get latest available package version.

strategies

Version comparison strategies for package version checking.

check_exact_version(installed, required)

Check if installed version exactly matches requirement.

check_minimum_version(installed, required)

Check if installed version meets minimum requirement.

check_version_diff(installed, reference, vdiff_matrix)

Check if version difference is within specified limits.

parse_vdiff_matrix(matrix_str)

Parse a version difference matrix string.
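To illustrate the minimum and exact checks above, here is a simplified sketch that compares plain dotted release numbers as integer tuples. It is not the module's implementation, which appears to work with Version objects; the helper names are hypothetical, and pre-release or local version segments are not handled.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '1.4.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def _padded(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Right-pad the shorter tuple with zeros so that 2.0 and 2.0.0 compare equal."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def meets_minimum(installed: str, required: str) -> bool:
    left, right = _padded(parse_release(installed), parse_release(required))
    return left >= right


def matches_exactly(installed: str, required: str) -> bool:
    left, right = _padded(parse_release(installed), parse_release(required))
    return left == right


newer = meets_minimum("1.10.0", "1.9")    # numeric, not lexicographic, comparison
older = meets_minimum("1.4", "1.4.1")
same = matches_exactly("2.0", "2.0.0")
```

Comparing integer tuples rather than raw strings matters because `"1.10.0" < "1.9"` as text, even though 1.10.0 is the later release.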

webhook_server

WebhookServer

A generic webhook server that can receive callbacks from external services.

app = self._create_flask_app() instance-attribute
flask_running = Event() instance-attribute
flask_server_thread = None instance-attribute
port = port instance-attribute
tunnel_process = None instance-attribute
webhook_data = None instance-attribute
webhook_received = Condition() instance-attribute
__init__(port=5050)

Initialize webhook server with configuration.

Parameters:

Name Type Description Default
port int

The port to run the Flask server on

5050
cleanup()

Clean up all resources.

close_tunnel()

Close the tunnel if it's running.

create_tunnel()

Create a public webhook URL using py-localtunnel.

Returns:

Type Description
Optional[str]

Optional[str]: The public webhook URL or None if tunnel creation failed

shutdown_server()

Gracefully shut down the Flask server.

start_server()

Start Flask server in a separate thread.

wait_for_webhook(timeout=120)

Wait for webhook data to be received.

Parameters:

Name Type Description Default
timeout int

Maximum time to wait in seconds

120

Returns:

Type Description
Optional[Dict]

Optional[Dict]: The webhook data or None if timed out
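The wait-with-timeout behavior maps naturally onto `threading.Condition`, which the attribute list above suggests is used here. The sketch below shows that pattern in isolation; `WebhookWaiter` and its `deliver` method are hypothetical, and the Flask route that would call `deliver` is left out.

```python
import threading
from typing import Any, Dict, Optional


class WebhookWaiter:
    """Hypothetical sketch of the Condition-based handoff between a request handler and a waiter."""

    def __init__(self) -> None:
        self.webhook_data: Optional[Dict[str, Any]] = None
        self.webhook_received = threading.Condition()

    def deliver(self, payload: Dict[str, Any]) -> None:
        """Called from the server thread when a callback arrives."""
        with self.webhook_received:
            self.webhook_data = payload
            self.webhook_received.notify_all()

    def wait_for_webhook(self, timeout: float = 120) -> Optional[Dict[str, Any]]:
        """Block until data arrives or the timeout elapses; return None on timeout."""
        with self.webhook_received:
            self.webhook_received.wait_for(lambda: self.webhook_data is not None, timeout=timeout)
            return self.webhook_data


# Demo: one waiter receives data from a timer thread, another times out.
waiter = WebhookWaiter()
threading.Timer(0.05, waiter.deliver, args=({"status": "done"},)).start()
received = waiter.wait_for_webhook(timeout=2)
timed_out = WebhookWaiter().wait_for_webhook(timeout=0.05)
```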

video_processing

__all__ = ['DLPDownloader', 'DownloadError', 'TranscriptError', 'VideoAudio', 'VideoProcessingError', 'VideoResource', 'VideoTranscript', 'YTDownloadService', 'extract_text_from_ttml', 'get_youtube_urls_from_csv'] module-attribute

DLPDownloader

Bases: YTDownloader

yt-dlp based implementation of YouTube content retrieval.

Ensures the temporary export file is named in the form ID.ext, where ID is the YouTube video id and ext is the appropriate extension.

Renames the export file to be based on title and ID by default, or moves the export file to the specified output file with appropriate extension.

config = config or BASE_YDL_OPTIONS instance-attribute
__init__(config=None)
get_audio(url, start=None, end=None, output_path=None)

Download audio and get metadata for a YouTube video.

get_default_export_name(url)

Get default export filename for a URL.

get_default_filename_stem(metadata)

Generate the object download filename.

get_metadata(url)

Get metadata for a YouTube video.

get_transcript(url, lang='en', output_path=None)

Downloads video transcript in TTML format.

Parameters:

Name Type Description Default
url str

YouTube video URL

required
lang str

Language code for transcript (default: "en")

'en'
output_path Optional[Path]

Optional output directory (uses current dir if None)

None

Returns:

Type Description
VideoTranscript

VideoTranscript containing TTML file path and metadata

Raises:

Type Description
TranscriptError

If no transcript found for specified language

get_video(url, quality=None, output_path=None)

Download the full video with associated metadata.

Parameters:

Name Type Description Default
url str

YouTube video URL

required
quality Optional[str]

yt-dlp format string (default: highest available)

None
output_path Optional[Path]

Optional output directory

None

Returns:

Type Description
VideoFile

VideoFile containing video file path and metadata

Raises:

Type Description
VideoDownloadError

If download fails

DownloadError

Bases: VideoProcessingError

Raised for download-related errors.

TranscriptError

Bases: VideoProcessingError

Raised for transcript-related errors.

VideoAudio dataclass

Bases: VideoResource

VideoProcessingError

Bases: Exception

Base exception for video processing errors.

VideoResource dataclass

Base class for all video resources.

filepath = None class-attribute instance-attribute
metadata instance-attribute
__init__(metadata, filepath=None)

VideoTranscript dataclass

Bases: VideoResource

YTDownloadService dataclass

Service wrapper for YouTube download operations.

Notes

Keeps Object-Service protocol alignment; behavior is delegated for now.

downloader instance-attribute
__init__(downloader)
fetch_audio(url, start=None, end=None, output_path=None)

Fetch audio via the configured downloader.

fetch_metadata(url)

Fetch metadata via the configured downloader.

fetch_transcript(url, lang='en', output_path=None)

Fetch a transcript via the configured downloader.

fetch_video(url, quality=None, output_path=None)

Fetch video via the configured downloader.

extract_text_from_ttml(ttml_path)

Extract plain text content from TTML file.

Parameters:

Name Type Description Default
ttml_path Path

Path to TTML transcript file

required

Returns:

Type Description
str

Plain text content with one sentence per line

Raises:

Type Description
ValueError

If file doesn't exist or has invalid content
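As an illustration of TTML text extraction, the sketch below pulls caption text out of `<p>` elements with the standard library. It is not the package's implementation: `ttml_to_text` is hypothetical, it takes the document as a string rather than a path, and it emits one caption paragraph per line, which may differ from the sentence-per-line output described above. The sample captions are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample document using the standard TTML namespace.
SAMPLE_TTML = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.000">First caption line.</p>
      <p begin="00:00:03.000" end="00:00:05.000">Second caption, <span>with a span</span>.</p>
    </div>
  </body>
</tt>
"""

TTML_NS = "{http://www.w3.org/ns/ttml}"


def ttml_to_text(ttml_source: str) -> str:
    """Hypothetical sketch: return the text of each <p> element on its own line."""
    try:
        root = ET.fromstring(ttml_source)
    except ET.ParseError as exc:
        raise ValueError(f"Invalid TTML content: {exc}") from exc
    lines = []
    for paragraph in root.iter(f"{TTML_NS}p"):
        # itertext() includes text inside nested <span> elements; split/join normalizes whitespace
        text = " ".join("".join(paragraph.itertext()).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


plain_text = ttml_to_text(SAMPLE_TTML)
```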

get_youtube_urls_from_csv(file_path)

Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.

Parameters:

Name Type Description Default
file_path Path

Path to the CSV file containing YouTube URLs and titles.

required

Returns:

Type Description
List[str]

List[str]: List of YouTube URLs.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the CSV file is improperly formatted.
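The expected CSV layout is not spelled out above, so the following sketch assumes a header row with `url` and `title` columns. `read_youtube_urls` is hypothetical, the sample URLs are placeholders, and the sketch reads from an open text handle instead of a path to stay self-contained.

```python
import csv
import io
from typing import List, TextIO


def read_youtube_urls(handle: TextIO) -> List[str]:
    """Hypothetical sketch: return the 'url' column of a CSV that has a header row."""
    reader = csv.DictReader(handle)
    # Accessing fieldnames reads the header row; it is None for an empty file.
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("CSV must contain a 'url' column")
    return [row["url"] for row in reader if row["url"]]


# Demo with placeholder URLs in an in-memory file.
sample_csv = io.StringIO(
    "url,title\n"
    "https://www.youtube.com/watch?v=abc123,First talk\n"
    "https://www.youtube.com/watch?v=def456,Second talk\n"
)
urls = read_youtube_urls(sample_csv)
```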

ops_check

OpsCheckConfig dataclass
output_dir instance-attribute
url_limit instance-attribute
urls_path instance-attribute
__init__(urls_path, url_limit, output_dir)
OpsCheckFailure dataclass
reason instance-attribute
url instance-attribute
__init__(url, reason)
OpsCheckProgressReporter

Bases: Protocol

Observer for live yt-dlp ops-check progress events.

on_run_finished(report)

Called when the full ops check completes.

on_run_started(total_urls)

Called when the ops check begins.

on_url_failed(index, total_urls, url, reason)

Called when one URL fails.

on_url_started(index, total_urls, url)

Called before validating one URL.

on_url_succeeded(index, total_urls, url)

Called when one URL succeeds.
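The observer callbacks above can be exercised with a small recording reporter. Everything in this sketch is hypothetical: `Report`, `ProgressReporter`, `RecordingReporter`, and `run_checks` only mirror the documented method names, and treating `ok()` as "no failures" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


@dataclass
class Report:
    successes: List[str]
    failures: List[Tuple[str, str]]  # (url, reason) pairs

    def ok(self) -> bool:
        return not self.failures  # assumption: a run is ok when nothing failed


class ProgressReporter(Protocol):
    def on_run_started(self, total_urls: int) -> None: ...
    def on_url_started(self, index: int, total_urls: int, url: str) -> None: ...
    def on_url_succeeded(self, index: int, total_urls: int, url: str) -> None: ...
    def on_url_failed(self, index: int, total_urls: int, url: str, reason: str) -> None: ...
    def on_run_finished(self, report: Report) -> None: ...


class RecordingReporter:
    """Satisfies ProgressReporter structurally; stores events instead of printing them."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def on_run_started(self, total_urls: int) -> None:
        self.events.append(f"start:{total_urls}")

    def on_url_started(self, index: int, total_urls: int, url: str) -> None:
        self.events.append(f"url:{index}/{total_urls}")

    def on_url_succeeded(self, index: int, total_urls: int, url: str) -> None:
        self.events.append(f"ok:{index}/{total_urls}")

    def on_url_failed(self, index: int, total_urls: int, url: str, reason: str) -> None:
        self.events.append(f"fail:{index}/{total_urls}:{reason}")

    def on_run_finished(self, report: Report) -> None:
        self.events.append(f"finish:{report.ok()}")


def run_checks(urls: List[str], validate: Callable[[str], None], reporter: ProgressReporter) -> Report:
    """Validate each URL in order, notifying the reporter at every step."""
    total = len(urls)
    reporter.on_run_started(total)
    successes: List[str] = []
    failures: List[Tuple[str, str]] = []
    for index, url in enumerate(urls, start=1):
        reporter.on_url_started(index, total, url)
        try:
            validate(url)
        except Exception as exc:
            failures.append((url, str(exc)))
            reporter.on_url_failed(index, total, url, str(exc))
        else:
            successes.append(url)
            reporter.on_url_succeeded(index, total, url)
    report = Report(successes, failures)
    reporter.on_run_finished(report)
    return report


def fake_validate(url: str) -> None:
    if "bad" in url:
        raise ValueError("unavailable")


reporter = RecordingReporter()
report = run_checks(["https://example.com/good", "https://example.com/bad"], fake_validate, reporter)
```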

OpsCheckReport dataclass
failures instance-attribute
successes instance-attribute
__init__(successes, failures)
ok()
OpsCheckRunner
__init__(downloader, config, reporter=None)
run()

video_processing

video_processing.py

BASE_YDL_OPTIONS = {'quiet': False, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger} module-attribute
DEFAULT_AUDIO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True} module-attribute
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags'] module-attribute
DEFAULT_METADATA_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True} module-attribute
DEFAULT_TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True, 'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'} module-attribute
DEFAULT_VIDEO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestvideo+bestaudio/best', 'merge_output_format': 'mp4', 'noplaylist': True} module-attribute
TEMP_FILENAME_FORMAT = 'temp_%(id)s' module-attribute
TEMP_FILENAME_STR = 'temp_{id}' module-attribute
logger = get_child_logger(__name__) module-attribute
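The option sets above are built with the dict union operator (`|`, Python 3.9+), which returns a new dict and lets keys on the right override those on the left. A minimal sketch with shortened, illustrative option dicts (the names `BASE_OPTIONS`, `METADATA_OPTIONS`, and `QUIET_AUDIO_OPTIONS` are hypothetical):

```python
# Shortened stand-in for a shared base configuration.
BASE_OPTIONS = {"quiet": False, "no_warnings": True, "retries": 3}

# Union adds keys without mutating the base dict.
METADATA_OPTIONS = BASE_OPTIONS | {"skip_download": True}

# Keys on the right-hand side win when both sides define them.
QUIET_AUDIO_OPTIONS = BASE_OPTIONS | {"quiet": True, "format": "bestaudio/best"}
```

The merge is shallow: nested values such as a `postprocessors` list would be shared between the merged dict and its source rather than copied.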
DLPDownloader

Bases: YTDownloader

yt-dlp based implementation of YouTube content retrieval.

Ensures the temporary export file is named in the form ID.ext, where ID is the YouTube video id and ext is the appropriate extension.

Renames the export file to be based on title and ID by default, or moves the export file to the specified output file with appropriate extension.

config = config or BASE_YDL_OPTIONS instance-attribute
__init__(config=None)
get_audio(url, start=None, end=None, output_path=None)

Download audio and get metadata for a YouTube video.

get_default_export_name(url)

Get default export filename for a URL.

get_default_filename_stem(metadata)

Generate the object download filename.

get_metadata(url)

Get metadata for a YouTube video.

get_transcript(url, lang='en', output_path=None)

Downloads video transcript in TTML format.

Parameters:

Name Type Description Default
url str

YouTube video URL

required
lang str

Language code for transcript (default: "en")

'en'
output_path Optional[Path]

Optional output directory (uses current dir if None)

None

Returns:

Type Description
VideoTranscript

VideoTranscript containing TTML file path and metadata

Raises:

Type Description
TranscriptError

If no transcript found for specified language

get_video(url, quality=None, output_path=None)

Download the full video with associated metadata.

Parameters:

Name Type Description Default
url str

YouTube video URL

required
quality Optional[str]

yt-dlp format string (default: highest available)

None
output_path Optional[Path]

Optional output directory

None

Returns:

Type Description
VideoFile

VideoFile containing video file path and metadata

Raises:

Type Description
VideoDownloadError

If download fails

DownloadError

Bases: VideoProcessingError

Raised for download-related errors.

TranscriptError

Bases: VideoProcessingError

Raised for transcript-related errors.

VideoAudio dataclass

Bases: VideoResource

VideoDownloadError

Bases: VideoProcessingError

Raised for video download-related errors.

VideoFile dataclass

Bases: VideoResource

Represents a downloaded video file and its metadata.

VideoProcessingError

Bases: Exception

Base exception for video processing errors.

VideoResource dataclass

Base class for all video resources.

filepath = None class-attribute instance-attribute
metadata instance-attribute
__init__(metadata, filepath=None)
VideoTranscript dataclass

Bases: VideoResource

YTDownloader

Abstract base class for YouTube content retrieval.

get_audio(url, start, end, output_path)

Extract audio with associated metadata.

get_metadata(url)

Retrieve video metadata only.

get_transcript(url, lang='en', output_path=None)

Retrieve video transcript with associated metadata.

get_video(url, quality=None, output_path=None)

Download the full video with associated metadata.

Parameters:

Name Type Description Default
url str

YouTube video URL

required
quality Optional[str]

yt-dlp format string (default: highest available)

None
output_path Optional[Path]

Optional output directory

None

Returns:

Type Description
VideoFile

VideoFile containing video file path and metadata

Raises:

Type Description
VideoDownloadError

If download fails

extract_text_from_ttml(ttml_path)

Extract plain text content from TTML file.

Parameters:

Name Type Description Default
ttml_path Path

Path to TTML transcript file

required

Returns:

Type Description
str

Plain text content with one sentence per line

Raises:

Type Description
ValueError

If file doesn't exist or has invalid content

get_youtube_urls_from_csv(file_path)

Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.

Parameters:

Name Type Description Default
file_path Path

Path to the CSV file containing YouTube URLs and titles.

required

Returns:

Type Description
List[str]

List[str]: List of YouTube URLs.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the CSV file is improperly formatted.

video_processing_old1

DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts' module-attribute
DEFAULT_TRANSCRIPT_OPTIONS = {'skip_download': True, 'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger} module-attribute
logger = get_child_logger(__name__) module-attribute
SubtitleTrack

Bases: TypedDict

Type definition for a subtitle track entry.

ext instance-attribute
name instance-attribute
url instance-attribute
TranscriptNotFoundError

Bases: Exception

Raised when no transcript is available for the requested language.

language = language instance-attribute
video_url = video_url instance-attribute
__init__(video_url, language)

Initialize TranscriptNotFoundError.

Parameters:

Name Type Description Default
video_url str

URL of the video where transcript was not found

required
language str

Language code that was requested

required
VideoInfo

Bases: TypedDict

Type definition for relevant video info fields.

automatic_captions instance-attribute
subtitles instance-attribute
download_audio_yt(url, output_dir, start_time=None, prompt_overwrite=True)

Downloads audio from a YouTube video using yt_dlp.YoutubeDL, with an optional start time.

Parameters:

Name Type Description Default
url str

URL of the YouTube video.

required
output_dir Path

Directory to save the downloaded audio file.

required
start_time str

Optional start time (e.g., '00:01:30' for 1 minute 30 seconds).

None

Returns:

Name Type Description
Path Path

Path to the downloaded audio file.

get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)

Downloads and extracts the transcript for a given YouTube video URL.

Retrieves the transcript file, extracts the text content, and returns the raw text.

Parameters:

Name Type Description Default
url str

The URL of the YouTube video.

required
lang str

The language code for the transcript (default: 'en').

'en'
download_dir Path

The directory to download the transcript to.

DEFAULT_TRANSCRIPT_DIR
keep_transcript_file bool

Whether to keep the downloaded transcript file (default: False).

False

Returns:

Type Description
str

The extracted transcript text.

Raises:

Type Description
TranscriptNotFoundError

If no transcript is available in the specified language.

DownloadError

If video info extraction or download fails.

ValueError

If the downloaded transcript file is invalid or empty.

ParseError

If XML parsing of the transcript fails.

get_transcript_info(video_url, lang='en')

Retrieves the transcript URL for a video in the specified language.

Parameters:

Name Type Description Default
video_url str

The URL of the video

required
lang str

The desired language code

'en'

Returns:

Type Description
str

URL of the transcript

Raises:

Type Description
TranscriptNotFoundError

If no transcript is available in the specified language

DownloadError

If video info extraction fails

get_video_download_path_yt(output_dir, url)

Extracts the video title using yt-dlp.

Parameters:

Name Type Description Default
url str

The YouTube URL.

required

Returns:

Name Type Description
str Path

The title of the video.

get_youtube_urls_from_csv(file_path)

Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.

Parameters:

Name Type Description Default
file_path Path

Path to the CSV file containing YouTube URLs and titles.

required

Returns:

Type Description
List[str]

List[str]: List of YouTube URLs.

Raises:

Type Description
FileNotFoundError

If the file does not exist.

ValueError

If the CSV file is improperly formatted.

video_processing_old2

AUDIO_DOWNLOAD_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True} module-attribute
BASE_YDL_OPTIONS = {'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger} module-attribute
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags'] module-attribute
DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts' module-attribute
TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'} module-attribute
logger = get_child_logger(__name__) module-attribute
SubtitleTrack

Bases: TypedDict

Type definition for a subtitle track entry.

ext instance-attribute
name instance-attribute
url instance-attribute
TranscriptNotFoundError

Bases: Exception

Raised when no transcript is available for the requested language.

language = language instance-attribute
video_url = video_url instance-attribute
__init__(video_url, language)
VideoDownload dataclass

Bases: VideoMetadata

Result of download operations.

filepath instance-attribute
__init__(metadata, filepath)
VideoInfo

Bases: TypedDict

Type definition for relevant video info fields.

automatic_captions instance-attribute
subtitles instance-attribute
VideoMetadata dataclass

Base class for video operations containing common metadata.

metadata instance-attribute
__init__(metadata)
VideoTranscript dataclass

Bases: VideoMetadata

Result of transcript operations.

content instance-attribute
__init__(metadata, content)
download_audio_yt(url, output_dir, start_time=None)

Downloads audio from YouTube URL with optional start time.

get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)

Downloads and extracts transcript with metadata.

get_video_download_path_yt(output_dir, url)

Get video metadata and expected download path.

get_video_metadata(url)

Get metadata for a YouTube video without downloading content.

Parameters:

Name Type Description Default
url str

YouTube video URL

required

Returns:

Type Description
VideoMetadata

VideoMetadata with only metadata field populated

Raises:

Type Description
DownloadError

If video info extraction fails

get_youtube_urls_from_csv(file_path)

Reads YouTube URLs from a CSV file containing URLs and titles.

yt_download_service

YTDownloadService dataclass

Service wrapper for YouTube download operations.

Notes

Keeps the module aligned with the Object-Service protocol; for now, all behavior is delegated to the configured downloader.

downloader instance-attribute
__init__(downloader)
fetch_audio(url, start=None, end=None, output_path=None)

Fetch audio via the configured downloader.

fetch_metadata(url)

Fetch metadata via the configured downloader.

fetch_transcript(url, lang='en', output_path=None)

Fetch a transcript via the configured downloader.

fetch_video(url, quality=None, output_path=None)

Fetch video via the configured downloader.
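The delegation described above can be sketched as follows. The downloader interface is not shown in this reference, so the Downloader protocol, its method names, and the FakeDownloader double are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Protocol


class Downloader(Protocol):
    """Hypothetical interface; the real downloader type is not shown here."""

    def get_metadata(self, url: str) -> Dict[str, Any]: ...

    def get_transcript(self, url: str, lang: str) -> str: ...


class FakeDownloader:
    """Test double that returns canned values instead of hitting the network."""

    def get_metadata(self, url: str) -> Dict[str, Any]:
        return {"webpage_url": url, "title": "Sample"}

    def get_transcript(self, url: str, lang: str) -> str:
        return f"[{lang}] transcript for {url}"


@dataclass
class DownloadService:
    """Stand-in for the service: every call is passed to the downloader."""

    downloader: Downloader

    def fetch_metadata(self, url: str) -> Dict[str, Any]:
        return self.downloader.get_metadata(url)

    def fetch_transcript(self, url: str, lang: str = "en") -> str:
        return self.downloader.get_transcript(url, lang)


service = DownloadService(downloader=FakeDownloader())
meta = service.fetch_metadata("https://example.com/v/1")
text = service.fetch_transcript("https://example.com/v/1", lang="vi")
print(meta["title"], text)
```

Injecting the downloader keeps the service testable without network access, which is one plausible reason for a thin wrapper of this kind.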

yt_environment

JsRuntime dataclass
name instance-attribute
path instance-attribute
__init__(name, path)
YTDLPEnvironmentInspector

Preflight checks for yt-dlp runtime dependencies.

has_remote_components()
inspect()

Inspect the environment for common yt-dlp runtime gaps.

inspect_report()

Return a structured preflight report for user-facing output.

resolve_js_runtime()
YTDLPEnvironmentReport dataclass

Report describing preflight environment concerns for yt-dlp.

warnings instance-attribute
__init__(warnings)
has_warnings()

Return True when warnings are present.
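How the inspector locates a JavaScript runtime is not documented here. The sketch below shows one plausible approach, a PATH lookup with `shutil.which`; the candidate names, the lookup order, and the warning text are assumptions, and the classes are local stand-ins.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class JsRuntime:
    name: str
    path: str


@dataclass
class EnvironmentReport:
    warnings: List[str]

    def has_warnings(self) -> bool:
        return bool(self.warnings)


def resolve_js_runtime(
    candidates: Sequence[str] = ("deno", "node"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[JsRuntime]:
    """Return the first candidate found on PATH, or None (assumed lookup rule)."""
    for name in candidates:
        path = which(name)
        if path:
            return JsRuntime(name=name, path=path)
    return None


def inspect(which: Callable[[str], Optional[str]] = shutil.which) -> EnvironmentReport:
    """Collect warnings; only the JS runtime check is sketched here."""
    warnings: List[str] = []
    if resolve_js_runtime(which=which) is None:
        warnings.append("No JavaScript runtime found on PATH.")
    return EnvironmentReport(warnings=warnings)


# Simulate two machines by injecting fake lookups instead of touching PATH.
with_node = inspect(which=lambda name: "/usr/bin/node" if name == "node" else None)
without_js = inspect(which=lambda name: None)
print(with_node.has_warnings(), without_js.has_warnings())
```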

yt_preflight_report

YTPreflightItem dataclass

Structured preflight warning item for yt-dlp runtime checks.

code instance-attribute
message instance-attribute
severity = YTPreflightSeverity.WARNING class-attribute instance-attribute
__init__(code, message, severity=YTPreflightSeverity.WARNING)
YTPreflightReport dataclass

Aggregated preflight report.

items instance-attribute
__init__(items)
has_items()

Return True when preflight items exist.

YTPreflightSeverity

Bases: str, Enum

WARNING = 'warning' class-attribute instance-attribute
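The sketch below rebuilds the three preflight types as local stand-ins to show how they fit together. The item code `no_js_runtime` and its message are hypothetical examples.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(str, Enum):
    WARNING = "warning"


@dataclass
class PreflightItem:
    code: str
    message: str
    severity: Severity = Severity.WARNING


@dataclass
class PreflightReport:
    items: List[PreflightItem]

    def has_items(self) -> bool:
        return bool(self.items)


report = PreflightReport(
    items=[PreflightItem(code="no_js_runtime", message="No JS runtime found.")]
)

# Because the enum subclasses str, members compare equal to plain strings.
print(report.has_items(), report.items[0].severity == "warning")
```

Mixing in `str` lets severity values pass through JSON output or string comparisons without an explicit `.value` lookup.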

xml_processing

__all__ = ['FormattingError', 'PagebreakXMLParser', 'join_xml_data_to_doc', 'remove_page_tags', 'save_pages_to_xml', 'split_xml_on_pagebreaks', 'split_xml_pages'] module-attribute

FormattingError

Bases: Exception

Custom exception raised for formatting-related errors.

__init__(message='An error occurred due to invalid formatting.')

PagebreakXMLParser

Parses XML documents split by pagebreak tags, with optional grouping and tag retention.

cleaned_text = '' instance-attribute
original_text = text instance-attribute
pagebreak_tags = [] instance-attribute
pages = [] instance-attribute
__init__(text)
parse(page_groups=None, keep_pagebreaks=True)

Parses the XML and returns a list of page contents, optionally grouped and with pagebreaks retained.

join_xml_data_to_doc(file_path, data, overwrite=False)

Joins a list of XML-tagged data strings with newlines, wraps the result in an enclosing pair of tags, and writes it to the specified file. Raises an exception if the file exists and overwrite is not set.

Parameters:

Name Type Description Default
file_path Path

Path to the output file.

required
data List[str]

List of XML-tagged data strings.

required
overwrite bool

Whether to overwrite the file if it exists.

False

Raises:

Type Description
FileExistsError

If the file exists and overwrite is False.

ValueError

If the data list is empty.

Example

join_xml_data_to_doc(Path("output.xml"), ["Data"], overwrite=True)
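The documented behavior (join, wrap, refuse to overwrite) can be sketched as below. The wrapper element name is not given in this reference, so `document` is an assumption, as is the stand-in function name join_xml_data.

```python
import tempfile
from pathlib import Path
from typing import List


def join_xml_data(
    file_path: Path,
    data: List[str],
    overwrite: bool = False,
    wrapper: str = "document",  # assumed wrapper element name
) -> None:
    """Stand-in: join entries with newlines, wrap them, and write the file."""
    if not data:
        raise ValueError("data must not be empty")
    if file_path.exists() and not overwrite:
        raise FileExistsError(f"{file_path} already exists")
    body = "\n".join(data)
    file_path.write_text(f"<{wrapper}>\n{body}\n</{wrapper}>\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.xml"
    join_xml_data(out, ["<p>First</p>", "<p>Second</p>"])
    written = out.read_text(encoding="utf-8")
    try:
        join_xml_data(out, ["<p>Third</p>"])  # overwrite defaults to False
        refused = False
    except FileExistsError:
        refused = True

print(written)
print(refused)
```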

remove_page_tags(text)

Removes opening and closing page tags from a text string.

Parameters: - text (str): The input text containing page tags.

Returns: - str: The text with the page tags removed.
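A regular expression is one way to strip such tags. The sketch below assumes the tags look like `<page>`, `<page ...attributes>`, and `</page>`; the exact shapes handled by the real function are not shown here, and strip_page_tags is a stand-in name.

```python
import re

# Assumed tag shapes: <page>, <page ...attributes>, and </page>.
# The \b keeps longer names such as <pagebreak ...> from matching.
PAGE_TAG = re.compile(r"</?page\b[^>]*>")


def strip_page_tags(text: str) -> str:
    """Stand-in: delete page tags and keep everything between them."""
    return PAGE_TAG.sub("", text)


sample = '<page page="1">Breathing in</page><page page="2">Breathing out</page>'
print(strip_page_tags(sample))
```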

save_pages_to_xml(output_xml_path, text_pages, overwrite=False)

Generates and saves an XML file containing text pages, with a pagebreak tag marking the end of each page.

Parameters:

Name Type Description Default
output_xml_path Path

The Path object for the file where the XML file will be saved.

required
text_pages List[str]

A list of strings, each representing the text content of a page.

required
overwrite bool

If True, overwrites the file if it exists. Default is False.

False

Returns:

Type Description
None

None

Raises:

Type Description
ValueError

If the input list of text_pages is empty or contains invalid types.

FileExistsError

If the file already exists and overwrite is False.

PermissionError

If the file cannot be created due to insufficient permissions.

OSError

For other file I/O-related errors.
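The validation and page-marking steps can be sketched without touching the file system. In this stand-in, the root element name `document` and the `<pagebreak page="N" />` tag shape are assumptions; the real output format is not shown in this reference.

```python
from typing import List


def render_pages(text_pages: List[str]) -> str:
    """Stand-in: build the XML text; writing it to disk is left out."""
    if not text_pages:
        raise ValueError("text_pages must not be empty")
    if not all(isinstance(page, str) for page in text_pages):
        raise ValueError("every page must be a string")
    lines = ["<document>"]  # assumed root element name
    for number, page in enumerate(text_pages, start=1):
        lines.append(page)
        # Assumed tag shape marking the end of each page.
        lines.append(f'<pagebreak page="{number}" />')
    lines.append("</document>")
    return "\n".join(lines)


xml_text = render_pages(["First page", "Second page"])
print(xml_text)
```

Page text is inserted as-is in this sketch, so any escaping of special characters would need to happen before or inside the loop.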

split_xml_on_pagebreaks(text, page_groups=None, keep_pagebreaks=True)

Splits an XML document into individual pages based on pagebreak tags. Optionally groups pages together based on page_groups and retains the pagebreak tags if keep_pagebreaks is True.
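A minimal sketch of this splitting logic follows. It assumes self-closing tags of the form `<pagebreak ... />` and treats page_groups as 1-based, inclusive (first_page, last_page) ranges; both are assumptions, and the real parser may differ.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag shape; the real parser's pattern is not shown in this reference.
PAGEBREAK = re.compile(r"<pagebreak\b[^>]*/>")


def split_on_pagebreaks(
    text: str,
    page_groups: Optional[List[Tuple[int, int]]] = None,
    keep_pagebreaks: bool = True,
) -> List[str]:
    pages: List[str] = []
    start = 0
    for match in PAGEBREAK.finditer(text):
        # Each page ends at its pagebreak tag; keep or drop the tag itself.
        end = match.end() if keep_pagebreaks else match.start()
        pages.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        pages.append(tail)  # content after the last pagebreak
    if page_groups is None:
        return pages
    # Assumed convention: 1-based, inclusive (first_page, last_page) ranges.
    return ["\n".join(pages[first - 1 : last]) for first, last in page_groups]


doc = 'One<pagebreak page="1" />Two<pagebreak page="2" />Three<pagebreak page="3" />'
print(split_on_pagebreaks(doc, keep_pagebreaks=False))
print(split_on_pagebreaks(doc, page_groups=[(1, 2), (3, 3)], keep_pagebreaks=False))
```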

split_xml_pages(text)

Backwards-compatible helper that returns the page contents without pagebreak tags.

Parameters:

Name Type Description Default
text str

XML document string.

required

Returns:

Type Description
List[str]

List of page strings.

extract_tags

extract_unique_tags(xml_file)

Extract all unique tags from an XML file using lxml.

Parameters:

Name Type Description Default
xml_file str

Path to the XML file.

required

Returns:

Type Description
Set[str]

A set of unique tags in the XML document.
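The idea can be sketched with the standard library. Note the swap: the documented function uses lxml, while this stand-in uses `xml.etree.ElementTree` so that it runs without extra dependencies; the name unique_tags is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import Set


def unique_tags(xml_source) -> Set[str]:
    """Stand-in: collect every distinct element name in the document."""
    tree = ET.parse(xml_source)  # accepts a path or a file-like object
    return {element.tag for element in tree.iter()}


sample = io.StringIO("<document><page><p>a</p><p>b</p></page><page/></document>")
tags = unique_tags(sample)
print(sorted(tags))
```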

main()

xml_processing

FormattingError

Bases: Exception

Custom exception raised for formatting-related errors.

__init__(message='An error occurred due to invalid formatting.')
PagebreakXMLParser

Parses XML documents split by pagebreak tags, with optional grouping and tag retention.

cleaned_text = '' instance-attribute
original_text = text instance-attribute
pagebreak_tags = [] instance-attribute
pages = [] instance-attribute
__init__(text)
parse(page_groups=None, keep_pagebreaks=True)

Parses the XML and returns a list of page contents, optionally grouped and with pagebreaks retained.

join_xml_data_to_doc(file_path, data, overwrite=False)

Joins a list of XML-tagged data strings with newlines, wraps the result in an enclosing pair of tags, and writes it to the specified file. Raises an exception if the file exists and overwrite is not set.

Parameters:

Name Type Description Default
file_path Path

Path to the output file.

required
data List[str]

List of XML-tagged data strings.

required
overwrite bool

Whether to overwrite the file if it exists.

False

Raises:

Type Description
FileExistsError

If the file exists and overwrite is False.

ValueError

If the data list is empty.

Example

join_xml_data_to_doc(Path("output.xml"), ["Data"], overwrite=True)

remove_page_tags(text)

Removes opening and closing page tags from a text string.

Parameters: - text (str): The input text containing page tags.

Returns: - str: The text with the page tags removed.

save_pages_to_xml(output_xml_path, text_pages, overwrite=False)

Generates and saves an XML file containing text pages, with a pagebreak tag marking the end of each page.

Parameters:

Name Type Description Default
output_xml_path Path

The Path object for the file where the XML file will be saved.

required
text_pages List[str]

A list of strings, each representing the text content of a page.

required
overwrite bool

If True, overwrites the file if it exists. Default is False.

False

Returns:

Type Description
None

None

Raises:

Type Description
ValueError

If the input list of text_pages is empty or contains invalid types.

FileExistsError

If the file already exists and overwrite is False.

PermissionError

If the file cannot be created due to insufficient permissions.

OSError

For other file I/O-related errors.

split_xml_on_pagebreaks(text, page_groups=None, keep_pagebreaks=True)

Splits an XML document into individual pages based on pagebreak tags. Optionally groups pages together based on page_groups and retains the pagebreak tags if keep_pagebreaks is True.

split_xml_pages(text)

Backwards-compatible helper that returns the page contents without pagebreak tags.

Parameters:

Name Type Description Default
text str

XML document string.

required

Returns:

Type Description
List[str]

List of page strings.