API Reference¶
tnh_scholar
¶
TNH Scholar: Text Processing and Analysis Tools
TNH Scholar is an AI-driven project designed to explore, query, process and translate the teachings of Thich Nhat Hanh and other Plum Village Dharma Teachers. The project aims to create a resource for practitioners and scholars to deeply engage with mindfulness and spiritual wisdom through natural language processing and machine learning models.
Core Features
- Audio transcription and processing
- Multi-lingual text processing and translation
- Prompt-based text analysis
- OCR processing for historical documents
- CLI tools for batch processing
Package Structure
- tnh_scholar/
- CLI_tools/ - Command line interface tools
- audio_processing/ - Audio file handling and transcription
- journal_processing/ - Journal and publication processing
- ocr_processing/ - Optical character recognition tools
- text_processing/ - Core text processing utilities
- video_processing/ - Video file handling and transcription
- utils/ - Shared utility functions
- xml_processing/ - XML parsing and generation
Environment Configuration
- The package uses environment variables for configuration, including:
- TNH_PROMPT_DIR - Directory for text processing prompts
- OPENAI_API_KEY - OpenAI API authentication
- GOOGLE_VISION_KEY - Google Cloud Vision API key for OCR
CLI Tools
- audio-transcribe - Audio file transcription utility
- tnh-gen - GenAI CLI for text processing and analysis
For more information, see: - Documentation: https://aaronksolomon.github.io/tnh-scholar/ - Source: https://github.com/aaronksolomon/tnh-scholar - Issues: https://github.com/aaronksolomon/tnh-scholar/issues
Dependencies
- Core: click, pydantic, openai, yt-dlp
- Optional: streamlit (GUI), spacy (NLP), google-cloud-vision (OCR)
TNH_CLI_TOOLS_DIR = TNH_ROOT_SRC_DIR / 'cli_tools'
module-attribute
¶
TNH_METADATA_PROCESS_FIELD = 'tnh_processing'
module-attribute
¶
TNH_PROJECT_ROOT_DIR = TNH_ROOT_SRC_DIR.resolve().parent.parent
module-attribute
¶
TNH_ROOT_SRC_DIR = Path(__file__).resolve().parent
module-attribute
¶
__version__ = '0.4.2'
module-attribute
¶
agent_orchestration
¶
Agent orchestration package.
app
¶
Maintained application-layer bootstrap surface for agent orchestration.
__all__ = ['BootstrapRuntimeProfile', 'HeadlessBootstrapConfig', 'HeadlessBootstrapParams', 'HeadlessBootstrapResult', 'HeadlessBootstrapService', 'HeadlessPolicyConfig', 'HeadlessRunnerConfig', 'HeadlessStorageConfig', 'HeadlessValidationConfig', 'build_bootstrap_runtime_profile']
module-attribute
¶
BootstrapRuntimeProfile
dataclass
¶
HeadlessBootstrapConfig
¶
Bases: BaseModel
Construction-time config for the maintained headless bootstrap path.
base_ref = 'HEAD'
class-attribute
instance-attribute
¶
branch_prefix = 'tnh/run-'
class-attribute
instance-attribute
¶
policy
instance-attribute
¶
repo_root
instance-attribute
¶
runner = Field(default_factory=HeadlessRunnerConfig)
class-attribute
instance-attribute
¶
storage
instance-attribute
¶
validation
instance-attribute
¶
HeadlessBootstrapParams
¶
Bases: BaseModel
Per-run parameters for the maintained headless bootstrap path.
workflow_path
instance-attribute
¶
HeadlessBootstrapResult
¶
Bases: BaseModel
Stable summary returned by the maintained headless bootstrap path.
HeadlessBootstrapService
dataclass
¶
Load one workflow and run the maintained kernel end to end.
config
instance-attribute
¶
kernel_factory = None
class-attribute
instance-attribute
¶
workflow_loader = field(default_factory=YamlWorkflowLoader)
class-attribute
instance-attribute
¶
__init__(config, workflow_loader=YamlWorkflowLoader(), kernel_factory=None)
¶
run(params)
¶
Execute one maintained headless bootstrap run.
HeadlessPolicyConfig
¶
Bases: BaseModel
Construction-time execution policy config.
execution_policy_settings
instance-attribute
¶
HeadlessRunnerConfig
¶
HeadlessStorageConfig
¶
HeadlessValidationConfig
¶
Bases: BaseModel
Construction-time builtin validator mapping config.
builtin_commands = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
build_bootstrap_runtime_profile()
¶
Return the explicit bootstrap profile used by the maintained CLI.
factory
¶
Composition helpers for the maintained headless bootstrap app layer.
models
¶
Typed models for the maintained headless bootstrap app layer.
HeadlessBootstrapConfig
¶
Bases: BaseModel
Construction-time config for the maintained headless bootstrap path.
base_ref = 'HEAD'
class-attribute
instance-attribute
¶branch_prefix = 'tnh/run-'
class-attribute
instance-attribute
¶policy
instance-attribute
¶repo_root
instance-attribute
¶runner = Field(default_factory=HeadlessRunnerConfig)
class-attribute
instance-attribute
¶storage
instance-attribute
¶validation
instance-attribute
¶
HeadlessBootstrapParams
¶
Bases: BaseModel
Per-run parameters for the maintained headless bootstrap path.
workflow_path
instance-attribute
¶
HeadlessBootstrapResult
¶
Bases: BaseModel
Stable summary returned by the maintained headless bootstrap path.
HeadlessPolicyConfig
¶
Bases: BaseModel
Construction-time execution policy config.
execution_policy_settings
instance-attribute
¶
HeadlessRunnerConfig
¶
HeadlessStorageConfig
¶
profile
¶
Explicit bootstrap profile assembly for the maintained headless app layer.
service
¶
Maintained application-layer bootstrap service for headless orchestration.
HeadlessBootstrapService
dataclass
¶
Load one workflow and run the maintained kernel end to end.
config
instance-attribute
¶kernel_factory = None
class-attribute
instance-attribute
¶workflow_loader = field(default_factory=YamlWorkflowLoader)
class-attribute
instance-attribute
¶__init__(config, workflow_loader=YamlWorkflowLoader(), kernel_factory=None)
¶run(params)
¶Execute one maintained headless bootstrap run.
codex_harness
¶
Suspended Codex harness spike preserved as reference-only code.
adapters
¶
models
¶
Domain models for the Codex harness.
CodexDefaults
dataclass
¶
Default values for harness settings and parameters.
default_system_prompt = 'Use the provided tools to inspect the repo. Use repo-relative paths. Return ONLY JSON with keys: patch (string or null), rationale (string), risk_flags (list of strings), open_questions (list of strings), status (complete|partial|blocked). No extra keys.'
class-attribute
instance-attribute
¶max_output_tokens = 2000
class-attribute
instance-attribute
¶max_tool_rounds = 12
class-attribute
instance-attribute
¶model = 'gpt-5.2-codex'
class-attribute
instance-attribute
¶runs_root = Path('.tnh-codex/runs')
class-attribute
instance-attribute
¶temperature = None
class-attribute
instance-attribute
¶timeout_seconds = 900
class-attribute
instance-attribute
¶__init__(runs_root=Path('.tnh-codex/runs'), model='gpt-5.2-codex', timeout_seconds=900, max_output_tokens=2000, temperature=None, max_tool_rounds=12, default_system_prompt='Use the provided tools to inspect the repo. Use repo-relative paths. Return ONLY JSON with keys: patch (string or null), rationale (string), risk_flags (list of strings), open_questions (list of strings), status (complete|partial|blocked). No extra keys.')
¶
CodexMessage
¶
CodexOutputStatus
¶
CodexRequest
¶
CodexResponseText
¶
CodexRunArtifacts
¶
Bases: BaseModel
Paths to files generated by a run.
CodexRunConfig
¶
CodexRunMetadata
¶
Bases: BaseModel
Metadata for a Codex harness run.
artifacts
instance-attribute
¶ended_at
instance-attribute
¶error_message = None
class-attribute
instance-attribute
¶model
instance-attribute
¶output_status = None
class-attribute
instance-attribute
¶patch_applied = False
class-attribute
instance-attribute
¶run_id
instance-attribute
¶started_at
instance-attribute
¶status
instance-attribute
¶test_exit_code = None
class-attribute
instance-attribute
¶
CodexRunParams
¶
Bases: BaseModel
Per-run parameters for the harness.
apply_patch = True
class-attribute
instance-attribute
¶max_output_tokens = Field(default_factory=(lambda: CodexDefaults().max_output_tokens))
class-attribute
instance-attribute
¶max_tool_rounds = Field(default_factory=(lambda: CodexDefaults().max_tool_rounds))
class-attribute
instance-attribute
¶run_tests_command = None
class-attribute
instance-attribute
¶system_prompt = None
class-attribute
instance-attribute
¶task
instance-attribute
¶temperature = Field(default_factory=(lambda: CodexDefaults().temperature))
class-attribute
instance-attribute
¶timeout_seconds = Field(default_factory=(lambda: CodexDefaults().timeout_seconds))
class-attribute
instance-attribute
¶
CodexRunStatus
¶
CodexSettings
¶
Bases: BaseSettings
Environment-driven settings for the Codex harness.
model = Field(default_factory=(lambda: CodexDefaults().model))
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(extra='ignore')
class-attribute
instance-attribute
¶openai_api_key = None
class-attribute
instance-attribute
¶runs_root = Field(default_factory=(lambda: CodexDefaults().runs_root))
class-attribute
instance-attribute
¶from_env()
classmethod
¶Create settings from environment.
CodexStructuredOutput
¶
Bases: BaseModel
Structured output expected from Codex.
PatchApplyResult
¶
protocols
¶
Protocol definitions for the Codex harness.
ArtifactWriterProtocol
¶
ClockProtocol
¶
PatchApplierProtocol
¶
ResponsesClientProtocol
¶
Bases: Protocol
Call the OpenAI Responses API.
run(request, tool_registry)
¶Execute the request and return response text.
RunIdGeneratorProtocol
¶
SearcherProtocol
¶
Bases: Protocol
Search for text in the repository.
search(query, root)
¶Return matching lines for the query.
TestRunnerProtocol
¶
Bases: Protocol
Run test commands.
run(command, timeout_seconds)
¶Execute a test command and capture results.
ToolExecutorProtocol
¶
Bases: Protocol
Execute tool calls for the Codex harness.
execute(call)
¶Execute the tool call and return output.
ToolRegistryProtocol
¶
providers
¶
Providers for codex harness infrastructure.
chat_completions_client
¶
OpenAI Chat Completions API client for Codex harness.
ChatCompletionsClient
dataclass
¶
Bases: ResponsesClientProtocol
Chat Completions API client for Codex harness.
openai_responses_client
¶
OpenAI Responses API client for Codex harness.
OpenAIResponsesClient
dataclass
¶
Bases: ResponsesClientProtocol
Responses API client for Codex harness.
patch_applier
¶
Patch application provider for Codex harness.
GitPatchApplier
dataclass
¶
Bases: PatchApplierProtocol
Apply unified diff patches using git.
run_id
¶
Run id generator for Codex harness.
TimestampRunIdGenerator
dataclass
¶
Bases: RunIdGeneratorProtocol
Timestamp-based run id generator.
searcher
¶
test_runner
¶
Test runner provider for Codex harness.
ShellTestRunner
dataclass
¶
Bases: TestRunnerProtocol
Run test commands via the shell.
tool_executor
¶
Tool execution provider for Codex harness.
CodexToolExecutor
dataclass
¶
Bases: ToolExecutorProtocol
Execute Codex tool calls against the repo.
tool_registry
¶
service
¶
Service orchestrator for the Codex harness.
CodexHarnessService
dataclass
¶
Coordinate Codex harness execution and artifacts.
artifact_writer
instance-attribute
¶clock
instance-attribute
¶output_parser
instance-attribute
¶patch_applier
instance-attribute
¶responses_client
instance-attribute
¶run_id_generator
instance-attribute
¶test_runner
instance-attribute
¶tool_registry
instance-attribute
¶__init__(clock, run_id_generator, responses_client, artifact_writer, output_parser, patch_applier, test_runner, tool_registry)
¶run(params, config)
¶Execute a Codex harness run.
tools
¶
Tooling definitions for Codex harness.
ApplyPatchResult
¶
ListFilesResult
¶
ReadFileResult
¶
RunTestsResult
¶
SearchRepoResult
¶
ToolCall
¶
ToolDefinition
¶
ToolName
¶
Bases: str, Enum
apply_patch = 'apply_patch'
class-attribute
instance-attribute
¶list_files = 'list_files'
class-attribute
instance-attribute
¶read_file = 'read_file'
class-attribute
instance-attribute
¶run_tests = 'run_tests'
class-attribute
instance-attribute
¶search_repo = 'search_repo'
class-attribute
instance-attribute
¶
ToolResult
¶
common
¶
Shared primitives for agent orchestration subsystems.
__all__ = ['local_now', 'strftime_run_id', 'utc_now']
module-attribute
¶
local_now()
¶
Return current local timestamp with timezone information.
strftime_run_id(now, format_string)
¶
Return run id generated from timestamp and format.
utc_now()
¶
Return current UTC timestamp.
conductor_mvp
¶
MVP conductor kernel for workflow execution.
__all__ = ['ArtifactPaths', 'BuiltinValidatorSpec', 'ConductorKernelService', 'EvaluateStep', 'GateStep', 'HarnessValidatorSpec', 'KernelRunResult', 'MechanicalOutcome', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RunAgentStep', 'RunValidationStep', 'StopStep', 'ValidatorExecutionSpec', 'WorkflowDefinition', 'WorkflowValidationError']
module-attribute
¶
ArtifactPaths
¶
BuiltinValidatorSpec
¶
ConductorKernelService
dataclass
¶
Execute a validated workflow deterministically.
agent_runner
instance-attribute
¶
artifact_store
instance-attribute
¶
clock
instance-attribute
¶
gate_approver
instance-attribute
¶
planner_evaluator
instance-attribute
¶
run_id_generator
instance-attribute
¶
validation_runner
instance-attribute
¶
workflow_validator
instance-attribute
¶
workspace
instance-attribute
¶
__init__(clock, run_id_generator, artifact_store, workspace, agent_runner, validation_runner, planner_evaluator, gate_approver, workflow_validator)
¶
run(workflow, run_root)
¶
Execute workflow and return run summary.
EvaluateStep
¶
GateStep
¶
HarnessValidatorSpec
¶
Bases: BaseModel
Generated harness validator reference resolved by a provider.
KernelRunResult
¶
MechanicalOutcome
¶
Bases: str, Enum
Mechanical execution outcomes.
completed = 'completed'
class-attribute
instance-attribute
¶
error = 'error'
class-attribute
instance-attribute
¶
killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶
killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶
killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
PlannerDecision
¶
Bases: BaseModel
Structured planner output consumed by EVALUATE.
PlannerStatus
¶
Bases: str, Enum
Semantic planner statuses.
RollbackStep
¶
RunAgentStep
¶
RunValidationStep
¶
StopStep
¶
ValidatorExecutionSpec
¶
WorkflowDefinition
¶
WorkflowValidationError
¶
Bases: Exception
Raised when a workflow fails static or runtime kernel validation.
adapters
¶
models
¶
Typed domain models for conductor MVP workflow execution.
StepDefinition = Annotated[RunAgentStep | RunValidationStep | EvaluateStep | GateStep | RollbackStep | StopStep, Field(discriminator='opcode')]
module-attribute
¶
ValidatorSpec = Annotated[BuiltinValidatorSpec | HarnessValidatorSpec, Field(discriminator='kind')]
module-attribute
¶
AgentRunResult
¶
ArtifactPaths
¶
BaseStep
¶
BuiltinValidatorName
¶
BuiltinValidatorSpec
¶
EvaluateStep
¶
GateOutcome
¶
GateStep
¶
HarnessReport
¶
Bases: BaseModel
Minimal harness report fields used by kernel runtime checks.
proposed_goldens = Field(default_factory=list)
class-attribute
instance-attribute
¶
HarnessValidatorName
¶
Bases: str, Enum
Trusted generated harness validator identifiers.
generated_harness = 'generated_harness'
class-attribute
instance-attribute
¶
HarnessValidatorSpec
¶
Bases: BaseModel
Generated harness validator reference resolved by a provider.
KernelRunResult
¶
MechanicalOutcome
¶
Bases: str, Enum
Mechanical execution outcomes.
completed = 'completed'
class-attribute
instance-attribute
¶error = 'error'
class-attribute
instance-attribute
¶killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
Opcode
¶
Bases: str, Enum
Kernel opcode names.
evaluate = 'EVALUATE'
class-attribute
instance-attribute
¶gate = 'GATE'
class-attribute
instance-attribute
¶rollback = 'ROLLBACK'
class-attribute
instance-attribute
¶run_agent = 'RUN_AGENT'
class-attribute
instance-attribute
¶run_validation = 'RUN_VALIDATION'
class-attribute
instance-attribute
¶stop = 'STOP'
class-attribute
instance-attribute
¶
PlannerDecision
¶
Bases: BaseModel
Structured planner output consumed by EVALUATE.
PlannerStatus
¶
Bases: str, Enum
Semantic planner statuses.
RollbackStep
¶
RouteRule
¶
RunAgentStep
¶
RunValidationStep
¶
StopStep
¶
ValidationRunResult
¶
ValidatorExecutionSpec
¶
WorkflowDefaults
¶
protocols
¶
Protocols for conductor MVP collaborators.
AgentRunnerProtocol
¶
ArtifactStoreProtocol
¶
GateApproverProtocol
¶
PlannerEvaluatorProtocol
¶
Bases: Protocol
Execute EVALUATE steps.
evaluate(step, run_dir)
¶Return structured planner decision.
RunIdGeneratorProtocol
¶
Bases: Protocol
Abstraction for generating run identifiers.
next_id(now)
¶Generate next run identifier.
ValidationRunnerProtocol
¶
providers
¶
Providers for conductor MVP.
__all__ = ['FileArtifactStore', 'LocalValidationRunner', 'StaticValidatorResolver', 'SystemClock', 'TimestampRunIdGenerator']
module-attribute
¶
FileArtifactStore
dataclass
¶
Bases: ArtifactStoreProtocol
Write run artifacts to local filesystem.
LocalValidationRunner
dataclass
¶
Bases: ValidationRunnerProtocol
Run validators via subprocess in the local worktree.
StaticValidatorResolver
dataclass
¶
Bases: ValidatorResolverProtocol
Resolve trusted validator refs from code-owned mappings.
entries
instance-attribute
¶harness_report_name = 'harness_report.json'
class-attribute
instance-attribute
¶harness_script_name = 'generated_harness.py'
class-attribute
instance-attribute
¶__init__(entries, harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
¶resolve(validator, run_dir)
¶Resolve validator into a trusted execution spec.
SystemClock
dataclass
¶
TimestampRunIdGenerator
dataclass
¶
Bases: RunIdGeneratorProtocol
Generate run IDs from UTC timestamps.
artifact_store
¶
Artifact store provider for conductor MVP.
FileArtifactStore
dataclass
¶
Bases: ArtifactStoreProtocol
Write run artifacts to local filesystem.
run_id
¶
Run id generator provider for conductor MVP.
TimestampRunIdGenerator
dataclass
¶
Bases: RunIdGeneratorProtocol
Generate run IDs from UTC timestamps.
validation_runner
¶
Local RUN_VALIDATION executor with artifact capture.
TODO(agent-orch, high-priority): This provider still renders trusted validator
execution specs into argv tuples for subprocess.run. That is an acceptable
short-term infrastructure boundary for PR #35, but it is not the intended
end-state architecture for conductor MVP.
Required follow-up:
- replace ValidatorExecutionSpec.command: tuple[str, ...] with typed command
objects per validator family
- move argv rendering into a final executor-only translation layer
- eliminate naked command vectors from provider contracts entirely
Do not treat the current implementation as the final security/architecture fix.
BuiltinCommandEntry
¶LocalValidationRunner
dataclass
¶
Bases: ValidationRunnerProtocol
Run validators via subprocess in the local worktree.
StaticValidatorResolver
dataclass
¶
Bases: ValidatorResolverProtocol
Resolve trusted validator refs from code-owned mappings.
entries
instance-attribute
¶harness_report_name = 'harness_report.json'
class-attribute
instance-attribute
¶harness_script_name = 'generated_harness.py'
class-attribute
instance-attribute
¶__init__(entries, harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
¶resolve(validator, run_dir)
¶Resolve validator into a trusted execution spec.
service
¶
Deterministic conductor kernel service for MVP workflows.
ConductorKernelService
dataclass
¶
Execute a validated workflow deterministically.
agent_runner
instance-attribute
¶artifact_store
instance-attribute
¶clock
instance-attribute
¶gate_approver
instance-attribute
¶planner_evaluator
instance-attribute
¶run_id_generator
instance-attribute
¶validation_runner
instance-attribute
¶workflow_validator
instance-attribute
¶workspace
instance-attribute
¶__init__(clock, run_id_generator, artifact_store, workspace, agent_runner, validation_runner, planner_evaluator, gate_approver, workflow_validator)
¶run(workflow, run_root)
¶Execute workflow and return run summary.
KernelState
dataclass
¶
Mutable execution state represented immutably.
current_step_id
instance-attribute
¶pending_golden_gate = False
class-attribute
instance-attribute
¶trace = field(default_factory=list)
class-attribute
instance-attribute
¶__init__(current_step_id, pending_golden_gate=False, trace=list())
¶advance(step_id, next_step_id, pending_gate=None)
¶Advance state with trace update.
log_text()
¶Render trace log text.
WorkflowCatalog
dataclass
¶
Indexed workflow helper for step lookups.
workflow
instance-attribute
¶__init__(workflow)
¶find_step(step_id)
¶Find a step by id or raise.
has_step_id(step_id)
¶Return True if workflow contains provided step id.
has_step_type(opcode)
¶Return True if workflow contains at least one opcode.
route_target(step, outcome_key, *, context)
¶Return route target for an outcome or raise with context.
transition_targets(step)
¶Return all declared transition targets for a step.
WorkflowValidationError
¶
Bases: Exception
Raised when a workflow fails static or runtime kernel validation.
execution
¶
Maintained execution subsystem for agent orchestration.
__all__ = ['CliExecutableInvocation', 'ExecutionOutputCapturePolicy', 'ExecutionRequest', 'ExecutionResult', 'ExecutionServiceProtocol', 'ExecutionTermination', 'ExplicitEnvironmentPolicy', 'InheritParentEnvironmentPolicy', 'IsolatedEnvironmentPolicy', 'PythonScriptInvocation', 'SubprocessExecutionService', 'TimeoutPolicy']
module-attribute
¶
CliExecutableInvocation
¶
ExecutionOutputCapturePolicy
¶
ExecutionRequest
¶
Bases: BaseModel
Trusted execution request.
environment_policy
instance-attribute
¶
invocation
instance-attribute
¶
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy)
class-attribute
instance-attribute
¶
timeout_policy = Field(default_factory=TimeoutPolicy)
class-attribute
instance-attribute
¶
working_directory
instance-attribute
¶
ExecutionResult
¶
Bases: BaseModel
Low-level subprocess result.
exit_code = None
class-attribute
instance-attribute
¶
failure_message = None
class-attribute
instance-attribute
¶
stderr_text = ''
class-attribute
instance-attribute
¶
stdout_text = ''
class-attribute
instance-attribute
¶
termination
instance-attribute
¶
timed_out = False
class-attribute
instance-attribute
¶
ExecutionServiceProtocol
¶
Bases: Protocol
Execute one trusted execution request.
run(request)
¶
Execute one trusted request and return the normalized result.
ExecutionTermination
¶
Bases: str, Enum
Mechanical subprocess outcomes.
completed = 'completed'
class-attribute
instance-attribute
¶
idle_timeout = 'idle_timeout'
class-attribute
instance-attribute
¶
non_zero_exit = 'non_zero_exit'
class-attribute
instance-attribute
¶
policy_kill = 'policy_kill'
class-attribute
instance-attribute
¶
startup_failure = 'startup_failure'
class-attribute
instance-attribute
¶
wall_clock_timeout = 'wall_clock_timeout'
class-attribute
instance-attribute
¶
ExplicitEnvironmentPolicy
¶
InheritParentEnvironmentPolicy
¶
IsolatedEnvironmentPolicy
¶
PythonScriptInvocation
¶
SubprocessExecutionService
dataclass
¶
TimeoutPolicy
¶
Bases: BaseModel
Execution timeout settings.
idle_seconds is reserved for future streaming/idleness enforcement.
The current subprocess service only enforces wall-clock timeouts.
idle_seconds = Field(default=None, description='Maximum allowed idle time in seconds before termination. Reserved for future enforcement in the subprocess service.')
class-attribute
instance-attribute
¶
wall_clock_seconds = Field(default=None, description='Maximum wall-clock runtime in seconds before termination.')
class-attribute
instance-attribute
¶
models
¶
Typed models for the execution subsystem.
EnvironmentPolicy = Annotated[InheritParentEnvironmentPolicy | ExplicitEnvironmentPolicy | IsolatedEnvironmentPolicy, Field(discriminator='kind')]
module-attribute
¶
Invocation = Annotated[CliExecutableInvocation | PythonScriptInvocation, Field(discriminator='family')]
module-attribute
¶
CliExecutableInvocation
¶
ExecutionOutputCapturePolicy
¶
ExecutionRequest
¶
Bases: BaseModel
Trusted execution request.
environment_policy
instance-attribute
¶invocation
instance-attribute
¶output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy)
class-attribute
instance-attribute
¶timeout_policy = Field(default_factory=TimeoutPolicy)
class-attribute
instance-attribute
¶working_directory
instance-attribute
¶
ExecutionResult
¶
Bases: BaseModel
Low-level subprocess result.
exit_code = None
class-attribute
instance-attribute
¶failure_message = None
class-attribute
instance-attribute
¶stderr_text = ''
class-attribute
instance-attribute
¶stdout_text = ''
class-attribute
instance-attribute
¶termination
instance-attribute
¶timed_out = False
class-attribute
instance-attribute
¶
ExecutionTermination
¶
Bases: str, Enum
Mechanical subprocess outcomes.
completed = 'completed'
class-attribute
instance-attribute
¶idle_timeout = 'idle_timeout'
class-attribute
instance-attribute
¶non_zero_exit = 'non_zero_exit'
class-attribute
instance-attribute
¶policy_kill = 'policy_kill'
class-attribute
instance-attribute
¶startup_failure = 'startup_failure'
class-attribute
instance-attribute
¶wall_clock_timeout = 'wall_clock_timeout'
class-attribute
instance-attribute
¶
ExplicitEnvironmentPolicy
¶
InheritParentEnvironmentPolicy
¶
IsolatedEnvironmentPolicy
¶
OutputEncoding
¶
PythonScriptInvocation
¶
TimeoutPolicy
¶
Bases: BaseModel
Execution timeout settings.
idle_seconds is reserved for future streaming/idleness enforcement.
The current subprocess service only enforces wall-clock timeouts.
idle_seconds = Field(default=None, description='Maximum allowed idle time in seconds before termination. Reserved for future enforcement in the subprocess service.')
class-attribute
instance-attribute
¶wall_clock_seconds = Field(default=None, description='Maximum wall-clock runtime in seconds before termination.')
class-attribute
instance-attribute
¶
protocols
¶
execution_policy
¶
Maintained execution-policy subsystem for agent orchestration.
__all__ = ['ApprovalPosture', 'EffectiveExecutionPolicy', 'ExecutionPolicyAssembler', 'ExecutionPolicyAssemblerProtocol', 'ExecutionPolicyAssemblyError', 'ExecutionPolicySettings', 'ExecutionPosture', 'NetworkPosture', 'PolicySummary', 'PolicyViolation', 'PolicyViolationClass', 'RequestedExecutionPolicy']
module-attribute
¶
ApprovalPosture
¶
EffectiveExecutionPolicy
¶
Bases: BaseModel
Concrete enforced policy after derivation.
allowed_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
approval_posture = ApprovalPosture.fail_on_prompt
class-attribute
instance-attribute
¶
execution_posture = ExecutionPosture.read_only
class-attribute
instance-attribute
¶
forbidden_operations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
forbidden_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
network_posture = NetworkPosture.deny
class-attribute
instance-attribute
¶
policy_reference = None
class-attribute
instance-attribute
¶
ExecutionPolicyAssembler
dataclass
¶
ExecutionPolicyAssemblerProtocol
¶
Bases: Protocol
Assemble requested and effective execution policy records.
assemble(*, settings, workflow_policy_ref=None, step_policy_ref=None, runtime_overrides=None)
¶
Assemble one canonical policy summary.
ExecutionPolicyAssemblyError
¶
Bases: ValueError
Raised when execution policy references cannot be assembled.
ExecutionPolicySettings
¶
Bases: BaseModel
System-level execution policy defaults and named references.
ExecutionPosture
¶
NetworkPosture
¶
PolicySummary
¶
Bases: BaseModel
Canonical persisted policy record for one executed step.
capability_notes = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
effective_policy
instance-attribute
¶
enforcement_notes = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
requested_policy
instance-attribute
¶
runtime_overrides = None
class-attribute
instance-attribute
¶
violations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
PolicyViolation
¶
PolicyViolationClass
¶
Bases: str, Enum
Stable policy violation classes.
forbidden_operation = 'forbidden_operation'
class-attribute
instance-attribute
¶
forbidden_path = 'forbidden_path'
class-attribute
instance-attribute
¶
interactive_prompt_violation = 'interactive_prompt_violation'
class-attribute
instance-attribute
¶
native_policy_block = 'native_policy_block'
class-attribute
instance-attribute
¶
network_violation = 'network_violation'
class-attribute
instance-attribute
¶
protected_branch_violation = 'protected_branch_violation'
class-attribute
instance-attribute
¶
RequestedExecutionPolicy
¶
Bases: BaseModel
Policy intent requested by the control plane.
allowed_paths = None
class-attribute
instance-attribute
¶
approval_posture = None
class-attribute
instance-attribute
¶
execution_posture = None
class-attribute
instance-attribute
¶
forbidden_operations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
forbidden_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
network_posture = None
class-attribute
instance-attribute
¶
policy_reference = None
class-attribute
instance-attribute
¶
assembly
¶
models
¶
Typed models for maintained execution policy contracts.
ApprovalPosture
¶
EffectiveExecutionPolicy
¶
Bases: BaseModel
Concrete enforced policy after derivation.
allowed_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶approval_posture = ApprovalPosture.fail_on_prompt
class-attribute
instance-attribute
¶execution_posture = ExecutionPosture.read_only
class-attribute
instance-attribute
¶forbidden_operations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶forbidden_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶network_posture = NetworkPosture.deny
class-attribute
instance-attribute
¶policy_reference = None
class-attribute
instance-attribute
¶
ExecutionPolicySettings
¶
Bases: BaseModel
System-level execution policy defaults and named references.
ExecutionPosture
¶
NetworkPosture
¶
PolicySummary
¶
Bases: BaseModel
Canonical persisted policy record for one executed step.
capability_notes = Field(default_factory=tuple)
class-attribute
instance-attribute
¶effective_policy
instance-attribute
¶enforcement_notes = Field(default_factory=tuple)
class-attribute
instance-attribute
¶requested_policy
instance-attribute
¶runtime_overrides = None
class-attribute
instance-attribute
¶violations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
PolicyViolation
¶
PolicyViolationClass
¶
Bases: str, Enum
Stable policy violation classes.
forbidden_operation = 'forbidden_operation'
class-attribute
instance-attribute
¶forbidden_path = 'forbidden_path'
class-attribute
instance-attribute
¶interactive_prompt_violation = 'interactive_prompt_violation'
class-attribute
instance-attribute
¶native_policy_block = 'native_policy_block'
class-attribute
instance-attribute
¶network_violation = 'network_violation'
class-attribute
instance-attribute
¶protected_branch_violation = 'protected_branch_violation'
class-attribute
instance-attribute
¶
RequestedExecutionPolicy
¶
Bases: BaseModel
Policy intent requested by the control plane.
allowed_paths = None
class-attribute
instance-attribute
¶approval_posture = None
class-attribute
instance-attribute
¶execution_posture = None
class-attribute
instance-attribute
¶forbidden_operations = Field(default_factory=tuple)
class-attribute
instance-attribute
¶forbidden_paths = Field(default_factory=tuple)
class-attribute
instance-attribute
¶network_posture = None
class-attribute
instance-attribute
¶policy_reference = None
class-attribute
instance-attribute
¶
protocols
¶
kernel
¶
Maintained kernel subsystem for agent orchestration.
__all__ = ['EvaluateStep', 'GateOutcome', 'GateStep', 'KernelRunResult', 'KernelRunService', 'MechanicalOutcome', 'Opcode', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RouteRule', 'RunAgentStep', 'RunValidationStep', 'StopStep', 'WorkflowDefinition', 'WorkflowValidationError', 'WorkflowValidator']
module-attribute
¶
EvaluateStep
¶
GateOutcome
¶
GateStep
¶
KernelRunResult
¶
Bases: BaseModel
Kernel run summary.
ended_at
instance-attribute
¶
final_state_path
instance-attribute
¶
last_step_id
instance-attribute
¶
metadata_path
instance-attribute
¶
run_directory
instance-attribute
¶
run_id
instance-attribute
¶
started_at
instance-attribute
¶
status
instance-attribute
¶
status_path
instance-attribute
¶
workflow_id
instance-attribute
¶
KernelRunService
dataclass
¶
Execute a workflow deterministically.
artifact_store
instance-attribute
¶
clock
instance-attribute
¶
execution_policy_assembler = field(default_factory=ExecutionPolicyAssembler)
class-attribute
instance-attribute
¶
execution_policy_settings = field(default_factory=ExecutionPolicySettings)
class-attribute
instance-attribute
¶
gate_approver
instance-attribute
¶
heartbeat_executor = field(default_factory=(lambda: ThreadPoolExecutor(max_workers=1)), repr=False, compare=False)
class-attribute
instance-attribute
¶
heartbeat_interval_seconds = 30.0
class-attribute
instance-attribute
¶
planner_evaluator
instance-attribute
¶
run_id_generator
instance-attribute
¶
runner_service
instance-attribute
¶
validation_service
instance-attribute
¶
workflow_validator
instance-attribute
¶
workspace
instance-attribute
¶
__init__(clock, run_id_generator, artifact_store, workspace, runner_service, validation_service, planner_evaluator, gate_approver, workflow_validator, execution_policy_settings=ExecutionPolicySettings(), execution_policy_assembler=ExecutionPolicyAssembler(), heartbeat_interval_seconds=30.0, heartbeat_executor=(lambda: ThreadPoolExecutor(max_workers=1))())
¶
run(workflow, run_root)
¶
Execute a workflow and return summary.
MechanicalOutcome
¶
Bases: str, Enum
Mechanical outcomes used for kernel routing.
completed = 'completed'
class-attribute
instance-attribute
¶
error = 'error'
class-attribute
instance-attribute
¶
killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶
killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶
killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
Opcode
¶
Bases: str, Enum
Kernel opcode names.
Values intentionally mirror the accepted OA04 workflow schema tokens.
evaluate = 'EVALUATE'
class-attribute
instance-attribute
¶
gate = 'GATE'
class-attribute
instance-attribute
¶
rollback = 'ROLLBACK'
class-attribute
instance-attribute
¶
run_agent = 'RUN_AGENT'
class-attribute
instance-attribute
¶
run_validation = 'RUN_VALIDATION'
class-attribute
instance-attribute
¶
stop = 'STOP'
class-attribute
instance-attribute
¶
PlannerDecision
¶
Bases: BaseModel
Structured planner output.
PlannerStatus
¶
Bases: str, Enum
Semantic planner statuses.
RollbackStep
¶
RouteRule
¶
RunAgentStep
¶
RunValidationStep
¶
StopStep
¶
WorkflowDefinition
¶
WorkflowValidationError
¶
Bases: Exception
Raised when workflow invariants are violated.
WorkflowValidator
dataclass
¶
adapters
¶
catalog
¶
Workflow graph helpers.
WorkflowCatalog
dataclass
¶
Indexed workflow helper.
step_index = None
class-attribute
instance-attribute
¶workflow
instance-attribute
¶__init__(workflow, step_index=None)
¶__post_init__()
¶find_step(step_id)
¶Find a step or raise.
has_step_id(step_id)
¶Return whether workflow contains a step id.
has_step_type(opcode)
¶Return whether workflow contains an opcode.
path_contains_gate(start_id)
¶Return whether any reachable path contains a gate step.
reachable_step_ids(start_id)
¶Return the set of reachable step ids from one step.
route_target(step, outcome_key, *, context)
¶Return one transition target.
transition_targets(step)
¶Return declared transition targets.
enums
¶
Shared kernel enums used across orchestration contracts.
__all__ = ['AgentFamily', 'GateOutcome', 'MechanicalOutcome', 'Opcode', 'PlannerStatus', 'RunnerTermination']
module-attribute
¶
AgentFamily
¶
GateOutcome
¶
MechanicalOutcome
¶
Bases: str, Enum
Mechanical outcomes used for kernel routing.
completed = 'completed'
class-attribute
instance-attribute
¶error = 'error'
class-attribute
instance-attribute
¶killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
Opcode
¶
Bases: str, Enum
Kernel opcode names.
Values intentionally mirror the accepted OA04 workflow schema tokens.
evaluate = 'EVALUATE'
class-attribute
instance-attribute
¶gate = 'GATE'
class-attribute
instance-attribute
¶rollback = 'ROLLBACK'
class-attribute
instance-attribute
¶run_agent = 'RUN_AGENT'
class-attribute
instance-attribute
¶run_validation = 'RUN_VALIDATION'
class-attribute
instance-attribute
¶stop = 'STOP'
class-attribute
instance-attribute
¶
PlannerStatus
¶
Bases: str, Enum
Semantic planner statuses.
RunnerTermination
¶
Bases: str, Enum
Mechanical outcomes exposed to the kernel by maintained runners.
completed = 'completed'
class-attribute
instance-attribute
¶error = 'error'
class-attribute
instance-attribute
¶killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
errors
¶
Kernel-specific exceptions.
WorkflowValidationError
¶
Bases: Exception
Raised when workflow invariants are violated.
models
¶
Typed models for the maintained kernel subsystem.
StepDefinition = Annotated[RunAgentStep | RunValidationStep | EvaluateStep | GateStep | RollbackStep | StopStep, Field(discriminator='opcode')]
module-attribute
¶
__all__ = ['BaseStep', 'EvaluateStep', 'GateOutcome', 'GateStep', 'KernelRunResult', 'MechanicalOutcome', 'Opcode', 'PlannerDecision', 'PlannerStatus', 'RollbackStep', 'RouteRule', 'RunAgentStep', 'RunValidationStep', 'StepDefinition', 'StopStep', 'WorkflowDefaults', 'WorkflowDefinition']
module-attribute
¶
BaseStep
¶
EvaluateStep
¶
GateOutcome
¶
GateStep
¶
KernelRunResult
¶
Bases: BaseModel
Kernel run summary.
ended_at
instance-attribute
¶final_state_path
instance-attribute
¶last_step_id
instance-attribute
¶metadata_path
instance-attribute
¶run_directory
instance-attribute
¶run_id
instance-attribute
¶started_at
instance-attribute
¶status
instance-attribute
¶status_path
instance-attribute
¶workflow_id
instance-attribute
¶
MechanicalOutcome
¶
Bases: str, Enum
Mechanical outcomes used for kernel routing.
completed = 'completed'
class-attribute
instance-attribute
¶error = 'error'
class-attribute
instance-attribute
¶killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
Opcode
¶
Bases: str, Enum
Kernel opcode names.
Values intentionally mirror the accepted OA04 workflow schema tokens.
evaluate = 'EVALUATE'
class-attribute
instance-attribute
¶gate = 'GATE'
class-attribute
instance-attribute
¶rollback = 'ROLLBACK'
class-attribute
instance-attribute
¶run_agent = 'RUN_AGENT'
class-attribute
instance-attribute
¶run_validation = 'RUN_VALIDATION'
class-attribute
instance-attribute
¶stop = 'STOP'
class-attribute
instance-attribute
¶
PlannerDecision
¶
Bases: BaseModel
Structured planner output.
PlannerStatus
¶
Bases: str, Enum
Semantic planner statuses.
RollbackStep
¶
RouteRule
¶
RunAgentStep
¶
RunValidationStep
¶
StopStep
¶
WorkflowDefaults
¶
protocols
¶
Protocols required by the maintained kernel.
__all__ = ['ClockProtocol', 'GateApproverProtocol', 'PlannerEvaluatorProtocol', 'RunArtifactPaths', 'RunArtifactStoreProtocol', 'RunIdGeneratorProtocol', 'RunnerResult', 'RunnerServiceProtocol', 'RunnerTaskRequest', 'RollbackStep', 'ValidationResult', 'ValidationServiceProtocol', 'ValidationStepRequest', 'WorkspaceServiceProtocol']
module-attribute
¶
GateApproverProtocol
¶
PlannerEvaluatorProtocol
¶
RollbackStep
¶
RunArtifactPaths
¶
Bases: BaseModel
Canonical run-scoped filesystem paths.
RunArtifactStoreProtocol
¶
Bases: Protocol
Persist run-scoped artifacts.
append_event(event, paths)
¶Append one event record.
artifact_step_dir(step_id, paths)
¶Return the canonical artifact directory for one step.
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
¶Copy one existing file into canonical artifact storage.
create_run(run_id, root_directory)
¶Create and return canonical run paths.
read_status(run_id, root_directory)
¶Read and validate live run status for one run id.
status_path_for_run(run_id, root_directory)
¶Return the canonical status path for one run id.
write_final_state(final_state, paths)
¶Persist the terminal workflow state summary.
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
¶Write and register one JSON artifact.
write_metadata(metadata, paths)
¶Persist run metadata.
write_status(status, paths)
¶Persist live run status.
write_step_manifest(manifest, paths)
¶Persist one step manifest and return its path.
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
¶Write and register one text artifact.
RunnerResult
¶
RunnerServiceProtocol
¶
RunnerTaskRequest
¶
Bases: BaseModel
Kernel-facing runner task request.
ValidationResult
¶
Bases: BaseModel
Validation result exposed to the kernel.
ValidationServiceProtocol
¶
Bases: Protocol
Execute a validation step.
run(request)
¶Execute validators and return normalized result.
ValidationStepRequest
¶
WorkspaceServiceProtocol
¶
Bases: Protocol
Workspace safety and rollback operations.
current_context()
¶Return the current managed workspace context, if any.
diff_summary()
¶Return normalized diff summary.
planned_worktree_path(run_id)
¶Return the planned worktree path for one run, if managed.
prepare_pre_run(run_id)
¶Create and persist the pre-run workspace context.
rollback_pre_run()
¶Rollback to the pre-run state and return updated context.
snapshot()
¶Return current semantic snapshot.
provenance
¶
Kernel-side provenance recording helpers.
KernelProvenanceRecorder
dataclass
¶
Persist kernel-side provenance artifacts, manifests, and events.
artifact_store
instance-attribute
¶clock
instance-attribute
¶workspace
instance-attribute
¶__init__(artifact_store, workspace, clock)
¶record_failed_step(*, run_id, step_id, opcode, started_at, ended_at, paths, extra_artifacts=(), notes)
¶Persist the canonical failure manifest and event for one step.
record_gate_requested(*, run_id, step_id, paths)
¶Append the canonical gate-requested event.
record_gate_resolved(*, run_id, step_id, paths)
¶Append the canonical gate-resolved event.
record_rollback_completed(*, run_id, step_id, paths)
¶Append the canonical rollback-completed event.
record_route_selected(*, run_id, step_id, next_step_id, opcode, paths)
¶Append the canonical route-selected event.
record_runner_completed(*, run_id, step_id, runner_family, paths)
¶Append the canonical runner-completed event.
record_runner_started(*, run_id, step_id, runner_family, paths)
¶Append the canonical runner-started event.
record_status_updated(*, run_id, step_id, lifecycle_state, paths)
¶Append the canonical status-updated event.
record_step_blocked(*, run_id, step_id, paths)
¶Append the canonical step-blocked event.
record_step_manifest(*, run_id, step_id, opcode, termination, started_at, ended_at, paths, extra_artifacts=(), notes=(), next_step_id=None)
¶Persist the canonical manifest and completion events for one step.
record_step_started(*, run_id, step_id, paths)
¶Append the canonical step-started event.
record_step_waiting(*, run_id, step_id, paths)
¶Append the canonical step-waiting event.
service
¶
Top-level maintained kernel service.
T = TypeVar('T')
module-attribute
¶
KernelRunService
dataclass
¶
Execute a workflow deterministically.
artifact_store
instance-attribute
¶clock
instance-attribute
¶execution_policy_assembler = field(default_factory=ExecutionPolicyAssembler)
class-attribute
instance-attribute
¶execution_policy_settings = field(default_factory=ExecutionPolicySettings)
class-attribute
instance-attribute
¶gate_approver
instance-attribute
¶heartbeat_executor = field(default_factory=(lambda: ThreadPoolExecutor(max_workers=1)), repr=False, compare=False)
class-attribute
instance-attribute
¶heartbeat_interval_seconds = 30.0
class-attribute
instance-attribute
¶planner_evaluator
instance-attribute
¶run_id_generator
instance-attribute
¶runner_service
instance-attribute
¶validation_service
instance-attribute
¶workflow_validator
instance-attribute
¶workspace
instance-attribute
¶__init__(clock, run_id_generator, artifact_store, workspace, runner_service, validation_service, planner_evaluator, gate_approver, workflow_validator, execution_policy_settings=ExecutionPolicySettings(), execution_policy_assembler=ExecutionPolicyAssembler(), heartbeat_interval_seconds=30.0, heartbeat_executor=(lambda: ThreadPoolExecutor(max_workers=1))())
¶run(workflow, run_root)
¶Execute a workflow and return summary.
StepContext
dataclass
¶
Per-run step execution context.
paths
instance-attribute
¶provenance
instance-attribute
¶run_directory
instance-attribute
¶run_id
instance-attribute
¶started_at
instance-attribute
¶workflow_id
instance-attribute
¶workflow_policy_ref = None
class-attribute
instance-attribute
¶workspace_context
instance-attribute
¶__init__(run_id, workflow_id, paths, run_directory, started_at, workspace_context, provenance, workflow_policy_ref=None)
¶
state
¶
Kernel state models.
KernelState
dataclass
¶
Immutable runtime state.
current_step_id
instance-attribute
¶pending_golden_gate = False
class-attribute
instance-attribute
¶trace = field(default_factory=list)
class-attribute
instance-attribute
¶__init__(current_step_id, pending_golden_gate=False, trace=list())
¶advance(step_id, next_step_id, pending_gate=None)
¶Advance state immutably.
log_text()
¶Render trace text.
pending_gate_after_outcome(outcome)
¶Return pending gate state after one gate decision.
with_pending_gate()
¶Return a state with pending golden-gate approval.
run_artifacts
¶
Run artifact persistence subsystem.
__all__ = ['ArtifactRole', 'EvidenceReference', 'EvidenceSummary', 'FilesystemRunArtifactStore', 'RunArtifactPaths', 'RunArtifactStoreProtocol', 'RunEventRecord', 'RunEventType', 'RunLifecycleState', 'RunMetadata', 'RunStatus', 'SchemaVersionRecord', 'StepArtifactEntry', 'StepManifest']
module-attribute
¶
ArtifactRole
¶
Bases: str, Enum
Canonical artifact roles reserved for maintained consumers.
gate_outcome = 'gate_outcome'
class-attribute
instance-attribute
¶
gate_request = 'gate_request'
class-attribute
instance-attribute
¶
harness_fixture = 'harness_fixture'
class-attribute
instance-attribute
¶
planner_decision = 'planner_decision'
class-attribute
instance-attribute
¶
policy_summary = 'policy_summary'
class-attribute
instance-attribute
¶
runner_final_response = 'runner_final_response'
class-attribute
instance-attribute
¶
runner_metadata = 'runner_metadata'
class-attribute
instance-attribute
¶
runner_transcript = 'runner_transcript'
class-attribute
instance-attribute
¶
validation_report = 'validation_report'
class-attribute
instance-attribute
¶
validation_stderr = 'validation_stderr'
class-attribute
instance-attribute
¶
validation_stdout = 'validation_stdout'
class-attribute
instance-attribute
¶
workspace_diff = 'workspace_diff'
class-attribute
instance-attribute
¶
workspace_status = 'workspace_status'
class-attribute
instance-attribute
¶
EvidenceReference
¶
EvidenceSummary
¶
FilesystemRunArtifactStore
dataclass
¶
Bases: RunArtifactStoreProtocol
Persist run artifacts to the local filesystem.
__init__()
¶
append_event(event, paths)
¶
artifact_step_dir(step_id, paths)
¶
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
¶
create_run(run_id, root_directory)
¶
read_status(run_id, root_directory)
¶
status_path_for_run(run_id, root_directory)
¶
write_final_state(final_state, paths)
¶
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
¶
write_metadata(metadata, paths)
¶
write_status(status, paths)
¶
write_step_manifest(manifest, paths)
¶
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
¶
RunArtifactPaths
¶
Bases: BaseModel
Canonical run-scoped filesystem paths.
RunArtifactStoreProtocol
¶
Bases: Protocol
Persist run-scoped artifacts.
append_event(event, paths)
¶
Append one event record.
artifact_step_dir(step_id, paths)
¶
Return the canonical artifact directory for one step.
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
¶
Copy one existing file into canonical artifact storage.
create_run(run_id, root_directory)
¶
Create and return canonical run paths.
read_status(run_id, root_directory)
¶
Read and validate live run status for one run id.
status_path_for_run(run_id, root_directory)
¶
Return the canonical status path for one run id.
write_final_state(final_state, paths)
¶
Persist the terminal workflow state summary.
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
¶
Write and register one JSON artifact.
write_metadata(metadata, paths)
¶
Persist run metadata.
write_status(status, paths)
¶
Persist live run status.
write_step_manifest(manifest, paths)
¶
Persist one step manifest and return its path.
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
¶
Write and register one text artifact.
RunEventRecord
¶
Bases: BaseModel
Appendable runtime event record.
artifact_path = None
class-attribute
instance-attribute
¶
artifact_role = None
class-attribute
instance-attribute
¶
event_type
instance-attribute
¶
lifecycle_state = None
class-attribute
instance-attribute
¶
next_step_id = None
class-attribute
instance-attribute
¶
opcode = None
class-attribute
instance-attribute
¶
run_id
instance-attribute
¶
runner_family = None
class-attribute
instance-attribute
¶
step_id
instance-attribute
¶
timestamp
instance-attribute
¶
RunEventType
¶
Bases: str, Enum
Canonical event types for one workflow run.
artifact_recorded = 'artifact_recorded'
class-attribute
instance-attribute
¶
gate_requested = 'gate_requested'
class-attribute
instance-attribute
¶
gate_resolved = 'gate_resolved'
class-attribute
instance-attribute
¶
rollback_completed = 'rollback_completed'
class-attribute
instance-attribute
¶
route_selected = 'route_selected'
class-attribute
instance-attribute
¶
runner_completed = 'runner_completed'
class-attribute
instance-attribute
¶
runner_started = 'runner_started'
class-attribute
instance-attribute
¶
status_updated = 'status_updated'
class-attribute
instance-attribute
¶
step_blocked = 'step_blocked'
class-attribute
instance-attribute
¶
step_completed = 'step_completed'
class-attribute
instance-attribute
¶
step_failed = 'step_failed'
class-attribute
instance-attribute
¶
step_started = 'step_started'
class-attribute
instance-attribute
¶
step_waiting = 'step_waiting'
class-attribute
instance-attribute
¶
RunLifecycleState
¶
Bases: str, Enum
Bounded operator-facing lifecycle state for one run.
RunMetadata
¶
Bases: BaseModel
Run-level metadata persisted for one workflow execution.
artifacts_root
instance-attribute
¶
ended_at = None
class-attribute
instance-attribute
¶
entry_step
instance-attribute
¶
last_step_id = None
class-attribute
instance-attribute
¶
run_id
instance-attribute
¶
schema_versions = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
started_at
instance-attribute
¶
termination = None
class-attribute
instance-attribute
¶
workflow_id
instance-attribute
¶
workflow_version
instance-attribute
¶
workspace_context = None
class-attribute
instance-attribute
¶
RunStatus
¶
Bases: BaseModel
Live operator-facing status for one workflow run.
active_attempt = None
class-attribute
instance-attribute
¶
active_opcode = None
class-attribute
instance-attribute
¶
active_runner_family = None
class-attribute
instance-attribute
¶
blocking_reason = None
class-attribute
instance-attribute
¶
current_step_id = None
class-attribute
instance-attribute
¶
elapsed_seconds = None
class-attribute
instance-attribute
¶
last_artifact_write = None
class-attribute
instance-attribute
¶
last_completed_step_id = None
class-attribute
instance-attribute
¶
last_route_target = None
class-attribute
instance-attribute
¶
lifecycle_state
instance-attribute
¶
operator_note = None
class-attribute
instance-attribute
¶
run_id
instance-attribute
¶
started_at
instance-attribute
¶
termination = None
class-attribute
instance-attribute
¶
updated_at
instance-attribute
¶
workflow_id
instance-attribute
¶
worktree_path = None
class-attribute
instance-attribute
¶
SchemaVersionRecord
¶
StepArtifactEntry
¶
StepManifest
¶
Bases: BaseModel
Canonical manifest for one executed step.
artifacts = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
ended_at
instance-attribute
¶
evidence_summary
instance-attribute
¶
opcode
instance-attribute
¶
started_at
instance-attribute
¶
step_id
instance-attribute
¶
termination
instance-attribute
¶
artifact_for_role(role)
¶
Return the first artifact matching the requested role.
filesystem_store
¶
Filesystem-backed run artifact store.
FilesystemRunArtifactStore
dataclass
¶
Bases: RunArtifactStoreProtocol
Persist run artifacts to the local filesystem.
__init__()
¶append_event(event, paths)
¶artifact_step_dir(step_id, paths)
¶copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
¶create_run(run_id, root_directory)
¶read_status(run_id, root_directory)
¶status_path_for_run(run_id, root_directory)
¶write_final_state(final_state, paths)
¶write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
¶write_metadata(metadata, paths)
¶write_status(status, paths)
¶write_step_manifest(manifest, paths)
¶write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
¶
models
¶
Typed models for run artifact persistence.
ArtifactRole
¶
Bases: str, Enum
Canonical artifact roles reserved for maintained consumers.
gate_outcome = 'gate_outcome'
class-attribute
instance-attribute
¶gate_request = 'gate_request'
class-attribute
instance-attribute
¶harness_fixture = 'harness_fixture'
class-attribute
instance-attribute
¶planner_decision = 'planner_decision'
class-attribute
instance-attribute
¶policy_summary = 'policy_summary'
class-attribute
instance-attribute
¶runner_final_response = 'runner_final_response'
class-attribute
instance-attribute
¶runner_metadata = 'runner_metadata'
class-attribute
instance-attribute
¶runner_transcript = 'runner_transcript'
class-attribute
instance-attribute
¶validation_report = 'validation_report'
class-attribute
instance-attribute
¶validation_stderr = 'validation_stderr'
class-attribute
instance-attribute
¶validation_stdout = 'validation_stdout'
class-attribute
instance-attribute
¶workspace_diff = 'workspace_diff'
class-attribute
instance-attribute
¶workspace_status = 'workspace_status'
class-attribute
instance-attribute
¶
EvidenceReference
¶
EvidenceSummary
¶
GateOutcomeArtifact
¶
GateRequestArtifact
¶
RunArtifactPaths
¶
Bases: BaseModel
Canonical run-scoped filesystem paths.
RunEventRecord
¶
Bases: BaseModel
Appendable runtime event record.
artifact_path = None
class-attribute
instance-attribute
¶artifact_role = None
class-attribute
instance-attribute
¶event_type
instance-attribute
¶lifecycle_state = None
class-attribute
instance-attribute
¶next_step_id = None
class-attribute
instance-attribute
¶opcode = None
class-attribute
instance-attribute
¶run_id
instance-attribute
¶runner_family = None
class-attribute
instance-attribute
¶step_id
instance-attribute
¶timestamp
instance-attribute
¶
RunEventType
¶
Bases: str, Enum
Canonical event types for one workflow run.
artifact_recorded = 'artifact_recorded'
class-attribute
instance-attribute
¶gate_requested = 'gate_requested'
class-attribute
instance-attribute
¶gate_resolved = 'gate_resolved'
class-attribute
instance-attribute
¶rollback_completed = 'rollback_completed'
class-attribute
instance-attribute
¶route_selected = 'route_selected'
class-attribute
instance-attribute
¶runner_completed = 'runner_completed'
class-attribute
instance-attribute
¶runner_started = 'runner_started'
class-attribute
instance-attribute
¶status_updated = 'status_updated'
class-attribute
instance-attribute
¶step_blocked = 'step_blocked'
class-attribute
instance-attribute
¶step_completed = 'step_completed'
class-attribute
instance-attribute
¶step_failed = 'step_failed'
class-attribute
instance-attribute
¶step_started = 'step_started'
class-attribute
instance-attribute
¶step_waiting = 'step_waiting'
class-attribute
instance-attribute
¶
RunLifecycleState
¶
Bases: str, Enum
Bounded operator-facing lifecycle state for one run.
RunMetadata
¶
Bases: BaseModel
Run-level metadata persisted for one workflow execution.
artifacts_root
instance-attribute
¶ended_at = None
class-attribute
instance-attribute
¶entry_step
instance-attribute
¶last_step_id = None
class-attribute
instance-attribute
¶run_id
instance-attribute
¶schema_versions = Field(default_factory=tuple)
class-attribute
instance-attribute
¶started_at
instance-attribute
¶termination = None
class-attribute
instance-attribute
¶workflow_id
instance-attribute
¶workflow_version
instance-attribute
¶workspace_context = None
class-attribute
instance-attribute
¶
RunStatus
¶
Bases: BaseModel
Live operator-facing status for one workflow run.
active_attempt = None
class-attribute
instance-attribute
¶active_opcode = None
class-attribute
instance-attribute
¶active_runner_family = None
class-attribute
instance-attribute
¶blocking_reason = None
class-attribute
instance-attribute
¶current_step_id = None
class-attribute
instance-attribute
¶elapsed_seconds = None
class-attribute
instance-attribute
¶last_artifact_write = None
class-attribute
instance-attribute
¶last_completed_step_id = None
class-attribute
instance-attribute
¶last_route_target = None
class-attribute
instance-attribute
¶lifecycle_state
instance-attribute
¶operator_note = None
class-attribute
instance-attribute
¶run_id
instance-attribute
¶started_at
instance-attribute
¶termination = None
class-attribute
instance-attribute
¶updated_at
instance-attribute
¶workflow_id
instance-attribute
¶worktree_path = None
class-attribute
instance-attribute
¶
RunnerMetadataArtifact
¶
Bases: BaseModel
Canonical maintained runner metadata artifact.
agent_family
instance-attribute
¶capture_format = None
class-attribute
instance-attribute
¶command = Field(default_factory=tuple)
class-attribute
instance-attribute
¶ended_at = None
class-attribute
instance-attribute
¶exit_code = None
class-attribute
instance-attribute
¶invocation_mode = None
class-attribute
instance-attribute
¶prompt_reference = None
class-attribute
instance-attribute
¶started_at = None
class-attribute
instance-attribute
¶termination
instance-attribute
¶working_directory = None
class-attribute
instance-attribute
¶from_normalized_metadata(payload)
classmethod
¶Build the canonical artifact from one normalized runner metadata record.
SchemaVersionRecord
¶
StepArtifactEntry
¶
StepManifest
¶
Bases: BaseModel
Canonical manifest for one executed step.
artifacts = Field(default_factory=tuple)
class-attribute
instance-attribute
¶ended_at
instance-attribute
¶evidence_summary
instance-attribute
¶opcode
instance-attribute
¶started_at
instance-attribute
¶step_id
instance-attribute
¶termination
instance-attribute
¶artifact_for_role(role)
¶Return the first artifact matching the requested role.
protocols
¶
Protocols for run artifact persistence.
RunArtifactStoreProtocol
¶
Bases: Protocol
Persist run-scoped artifacts.
append_event(event, paths)
¶Append one event record.
artifact_step_dir(step_id, paths)
¶Return the canonical artifact directory for one step.
copy_file_artifact(*, paths, step_id, role, filename, source_path, media_type, required, important=False)
¶Copy one existing file into canonical artifact storage.
create_run(run_id, root_directory)
¶Create and return canonical run paths.
read_status(run_id, root_directory)
¶Read and validate live run status for one run id.
status_path_for_run(run_id, root_directory)
¶Return the canonical status path for one run id.
write_final_state(final_state, paths)
¶Persist the terminal workflow state summary.
write_json_artifact(*, paths, step_id, role, filename, payload, required, important=False)
¶Write and register one JSON artifact.
write_metadata(metadata, paths)
¶Persist run metadata.
write_status(status, paths)
¶Persist live run status.
write_step_manifest(manifest, paths)
¶Persist one step manifest and return its path.
write_text_artifact(*, paths, step_id, role, filename, content, media_type, required, important=False)
¶Write and register one text artifact.
runners
¶
Maintained runner subsystem for agent orchestration.
__all__ = ['AdapterCapabilities', 'DelegatingRunnerService', 'RunnerAdapterProtocol', 'RunnerCaptureFormat', 'RunnerInvocationMetadata', 'RunnerInvocationMode', 'RunnerResult', 'RunnerServiceProtocol', 'RunnerTaskRequest', 'RunnerTermination', 'RunnerTextArtifact']
module-attribute
¶
AdapterCapabilities
¶
Bases: BaseModel
Native controls that one runner adapter can honor.
agent_family
instance-attribute
¶
supports_final_response_file
instance-attribute
¶
supports_native_approval_controls
instance-attribute
¶
supports_network_controls = False
class-attribute
instance-attribute
¶
supports_path_constraints = False
class-attribute
instance-attribute
¶
supports_read_only
instance-attribute
¶
supports_structured_event_stream
instance-attribute
¶
supports_workspace_write
instance-attribute
¶
DelegatingRunnerService
dataclass
¶
Bases: RunnerServiceProtocol
Dispatch runner requests to the matching maintained adapter.
RunnerAdapterProtocol
¶
RunnerCaptureFormat
¶
RunnerInvocationMetadata
¶
Bases: BaseModel
Canonical normalized invocation metadata returned by adapters.
agent_family
instance-attribute
¶
capture_format
instance-attribute
¶
command
instance-attribute
¶
ended_at
instance-attribute
¶
exit_code = None
class-attribute
instance-attribute
¶
invocation_mode
instance-attribute
¶
prompt_reference = None
class-attribute
instance-attribute
¶
started_at
instance-attribute
¶
termination
instance-attribute
¶
working_directory
instance-attribute
¶
RunnerInvocationMode
¶
RunnerResult
¶
RunnerServiceProtocol
¶
RunnerTaskRequest
¶
Bases: BaseModel
Kernel-facing runner task request.
RunnerTermination
¶
Bases: str, Enum
Mechanical outcomes exposed to the kernel by maintained runners.
completed = 'completed'
class-attribute
instance-attribute
¶
error = 'error'
class-attribute
instance-attribute
¶
killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶
killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶
killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
RunnerTextArtifact
¶
adapters
¶
Maintained runner adapters.
__all__ = ['ClaudeCliInvocationMapper', 'ClaudeCliOutputNormalizer', 'ClaudeCliRunnerAdapter', 'CodexCliInvocationMapper', 'CodexCliOutputNormalizer', 'CodexCliRunnerAdapter']
module-attribute
¶
ClaudeCliInvocationMapper
dataclass
¶
ClaudeCliOutputNormalizer
dataclass
¶
ClaudeCliRunnerAdapter
dataclass
¶
Bases: RunnerAdapterProtocol
Execute maintained headless Claude CLI runs.
executable = None
class-attribute
instance-attribute
¶execution_service
instance-attribute
¶__init__(execution_service, executable=None)
¶agent_family()
¶Return the maintained family served by this adapter.
capabilities()
¶Return the native capabilities for Claude CLI.
run(request)
¶Execute one Claude CLI request.
CodexCliInvocationMapper
dataclass
¶
Build execution requests for maintained Codex CLI runs.
default_lang = 'en_US.UTF-8'
class-attribute
instance-attribute
¶default_shell = '/bin/zsh'
class-attribute
instance-attribute
¶default_term = 'xterm-256color'
class-attribute
instance-attribute
¶executable
instance-attribute
¶model_name = None
class-attribute
instance-attribute
¶response_filename = 'codex-last-message.txt'
class-attribute
instance-attribute
¶__init__(executable, model_name=None, response_filename='codex-last-message.txt', default_shell='/bin/zsh', default_term='xterm-256color', default_lang='en_US.UTF-8')
¶map(request, response_path)
¶Map one runner request into a trusted execution request.
CodexCliOutputNormalizer
dataclass
¶
CodexCliRunnerAdapter
dataclass
¶
Bases: RunnerAdapterProtocol
Execute maintained headless Codex CLI runs.
executable = None
class-attribute
instance-attribute
¶execution_service
instance-attribute
¶model_name = None
class-attribute
instance-attribute
¶output_normalizer = CodexCliOutputNormalizer()
class-attribute
instance-attribute
¶__init__(execution_service, executable=None, model_name=None, output_normalizer=CodexCliOutputNormalizer())
¶agent_family()
¶Return the maintained family served by this adapter.
capabilities()
¶Return the native capabilities for Codex CLI.
run(request)
¶Execute one Codex CLI request.
claude_cli
¶
Claude CLI maintained runner adapter.
ClaudeCliInvocationMapper
dataclass
¶ClaudeCliOutputNormalizer
dataclass
¶ClaudeCliRunnerAdapter
dataclass
¶
Bases: RunnerAdapterProtocol
Execute maintained headless Claude CLI runs.
executable = None
class-attribute
instance-attribute
¶execution_service
instance-attribute
¶__init__(execution_service, executable=None)
¶agent_family()
¶Return the maintained family served by this adapter.
capabilities()
¶Return the native capabilities for Claude CLI.
run(request)
¶Execute one Claude CLI request.
codex_cli
¶
Codex CLI maintained runner adapter.
CodexCliInvocationMapper
dataclass
¶Build execution requests for maintained Codex CLI runs.
default_lang = 'en_US.UTF-8'
class-attribute
instance-attribute
¶default_shell = '/bin/zsh'
class-attribute
instance-attribute
¶default_term = 'xterm-256color'
class-attribute
instance-attribute
¶executable
instance-attribute
¶model_name = None
class-attribute
instance-attribute
¶response_filename = 'codex-last-message.txt'
class-attribute
instance-attribute
¶__init__(executable, model_name=None, response_filename='codex-last-message.txt', default_shell='/bin/zsh', default_term='xterm-256color', default_lang='en_US.UTF-8')
¶map(request, response_path)
¶Map one runner request into a trusted execution request.
CodexCliOutputNormalizer
dataclass
¶CodexCliRunnerAdapter
dataclass
¶
Bases: RunnerAdapterProtocol
Execute maintained headless Codex CLI runs.
executable = None
class-attribute
instance-attribute
¶execution_service
instance-attribute
¶model_name = None
class-attribute
instance-attribute
¶output_normalizer = CodexCliOutputNormalizer()
class-attribute
instance-attribute
¶__init__(execution_service, executable=None, model_name=None, output_normalizer=CodexCliOutputNormalizer())
¶agent_family()
¶Return the maintained family served by this adapter.
capabilities()
¶Return the native capabilities for Codex CLI.
run(request)
¶Execute one Codex CLI request.
models
¶
Typed models for maintained runners.
AdapterCapabilities
¶
Bases: BaseModel
Native controls that one runner adapter can honor.
agent_family
instance-attribute
¶supports_final_response_file
instance-attribute
¶supports_native_approval_controls
instance-attribute
¶supports_network_controls = False
class-attribute
instance-attribute
¶supports_path_constraints = False
class-attribute
instance-attribute
¶supports_read_only
instance-attribute
¶supports_structured_event_stream
instance-attribute
¶supports_workspace_write
instance-attribute
¶
RunnerCaptureFormat
¶
RunnerInvocationMetadata
¶
Bases: BaseModel
Canonical normalized invocation metadata returned by adapters.
agent_family
instance-attribute
¶capture_format
instance-attribute
¶command
instance-attribute
¶ended_at
instance-attribute
¶exit_code = None
class-attribute
instance-attribute
¶invocation_mode
instance-attribute
¶prompt_reference = None
class-attribute
instance-attribute
¶started_at
instance-attribute
¶termination
instance-attribute
¶working_directory
instance-attribute
¶
RunnerInvocationMode
¶
RunnerResult
¶
RunnerTaskRequest
¶
Bases: BaseModel
Kernel-facing runner task request.
shared_enums
¶
Shared orchestration enums that must not depend on package initializers.
AgentFamily
¶
RunnerTermination
¶
Bases: str, Enum
Mechanical outcomes exposed to the kernel by maintained runners.
completed = 'completed'
class-attribute
instance-attribute
¶
error = 'error'
class-attribute
instance-attribute
¶
killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶
killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶
killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
spike
¶
Phase 0 protocol layer spike for agent orchestration.
adapters
¶
Adapters for the protocol layer spike.
command_filter
¶
Command filter adapter for the spike.
RegexCommandFilter
dataclass
¶
Bases: CommandFilterProtocol
Regex-based command filter.
noop_prompt_handler
¶
No-op prompt handler for headless CLI execution.
NoopPromptHandler
dataclass
¶
Bases: PromptHandlerProtocol
Ignore all output as non-interactive in headless runs.
prompt_handler
¶
Handle confirmation prompts in agent output.
RegexPromptHandler
dataclass
¶
Bases: PromptHandlerProtocol
Handle command confirmation prompts using regex parsing.
prompt_parser
¶
Parse command confirmation prompts from agent output.
RegexCommandPromptParser
dataclass
¶
Bases: CommandPromptParserProtocol
Regex-based prompt parser.
models
¶
Domain models for the Phase 0 protocol layer spike.
AgentRunResult
¶
Bases: BaseModel
Outcome of an agent execution.
command_decision = None
class-attribute
instance-attribute
¶exit_code = None
class-attribute
instance-attribute
¶stderr_text = None
class-attribute
instance-attribute
¶stdout_text = None
class-attribute
instance-attribute
¶termination_reason
instance-attribute
¶transcript_raw
instance-attribute
¶transcript_text
instance-attribute
¶
CommandFilterDecision
¶
CommandPromptMatch
¶
GitStatusSnapshot
¶
PromptAction
¶
PromptHandlingOutcome
¶
RunArtifactPaths
¶
Bases: BaseModel
Filesystem paths for run artifacts.
diff_patch
instance-attribute
¶events
instance-attribute
¶git_post
instance-attribute
¶git_pre
instance-attribute
¶response_path
instance-attribute
¶run_metadata
instance-attribute
¶stderr_log
instance-attribute
¶stdout_log
instance-attribute
¶transcript_normalized
instance-attribute
¶transcript_raw
instance-attribute
¶
RunEvent
¶
Bases: BaseModel
Single provenance event entry for the spike.
agent = None
class-attribute
instance-attribute
¶artifact_paths = Field(default_factory=list)
class-attribute
instance-attribute
¶event_type
instance-attribute
¶exit_code = None
class-attribute
instance-attribute
¶message = None
class-attribute
instance-attribute
¶reason = None
class-attribute
instance-attribute
¶run_id
instance-attribute
¶timestamp
instance-attribute
¶work_branch = None
class-attribute
instance-attribute
¶
RunEventType
¶
Bases: str, Enum
agent_output = 'AGENT_OUTPUT'
class-attribute
instance-attribute
¶agent_started = 'AGENT_STARTED'
class-attribute
instance-attribute
¶diff_emitted = 'DIFF_EMITTED'
class-attribute
instance-attribute
¶heartbeat = 'HEARTBEAT'
class-attribute
instance-attribute
¶run_blocked = 'RUN_BLOCKED'
class-attribute
instance-attribute
¶run_completed = 'RUN_COMPLETED'
class-attribute
instance-attribute
¶run_started = 'RUN_STARTED'
class-attribute
instance-attribute
¶workspace_captured_post = 'WORKSPACE_CAPTURED_POST'
class-attribute
instance-attribute
¶workspace_captured_pre = 'WORKSPACE_CAPTURED_PRE'
class-attribute
instance-attribute
¶
RunMetadata
¶
Bases: BaseModel
Metadata for a spike run.
agent
instance-attribute
¶artifact_paths
instance-attribute
¶ended_at
instance-attribute
¶exit_code = None
class-attribute
instance-attribute
¶git_post_summary
instance-attribute
¶git_pre_summary
instance-attribute
¶prompt_id = None
class-attribute
instance-attribute
¶run_id
instance-attribute
¶started_at
instance-attribute
¶task = None
class-attribute
instance-attribute
¶termination_reason
instance-attribute
¶work_branch
instance-attribute
¶
SpikeConfig
¶
SpikeDefaults
dataclass
¶
Default values for spike settings and policy.
allow_response = 'y\n'
class-attribute
instance-attribute
¶block_response = 'n\n'
class-attribute
instance-attribute
¶default_heartbeat_interval_seconds = 10
class-attribute
instance-attribute
¶default_idle_timeout_seconds = 600
class-attribute
instance-attribute
¶default_output_event_max_chars = 2000
class-attribute
instance-attribute
¶default_timeout_seconds = 600
class-attribute
instance-attribute
¶default_transcript_tail_lines = 200
class-attribute
instance-attribute
¶runs_root = Path('.tnh-gen/runs')
class-attribute
instance-attribute
¶work_branch_prefix = 'work'
class-attribute
instance-attribute
¶__init__(runs_root=Path('.tnh-gen/runs'), work_branch_prefix='work', default_timeout_seconds=600, default_idle_timeout_seconds=600, default_transcript_tail_lines=200, default_heartbeat_interval_seconds=10, default_output_event_max_chars=2000, allow_response='y\n', block_response='n\n')
¶
SpikeParams
¶
Bases: BaseModel
Per-run parameters for the spike.
agent
instance-attribute
¶heartbeat_interval_seconds = Field(default_factory=(lambda: SpikeDefaults().default_heartbeat_interval_seconds))
class-attribute
instance-attribute
¶idle_timeout_seconds = Field(default_factory=(lambda: SpikeDefaults().default_idle_timeout_seconds))
class-attribute
instance-attribute
¶prompt_id = None
class-attribute
instance-attribute
¶response_path = None
class-attribute
instance-attribute
¶task = None
class-attribute
instance-attribute
¶timeout_seconds = Field(default_factory=(lambda: SpikeDefaults().default_timeout_seconds))
class-attribute
instance-attribute
¶transcript_tail_lines = Field(default_factory=(lambda: SpikeDefaults().default_transcript_tail_lines))
class-attribute
instance-attribute
¶work_branch = None
class-attribute
instance-attribute
¶
SpikePolicy
¶
Bases: BaseModel
Behavioral policies for the spike.
allow_response = Field(default_factory=(lambda: SpikeDefaults().allow_response))
class-attribute
instance-attribute
¶block_response = Field(default_factory=(lambda: SpikeDefaults().block_response))
class-attribute
instance-attribute
¶blocked_command_patterns = Field(default_factory=list)
class-attribute
instance-attribute
¶cleanup_on_failure = True
class-attribute
instance-attribute
¶command_capture_patterns = Field(default_factory=list)
class-attribute
instance-attribute
¶interactive_prompt_patterns = Field(default_factory=list)
class-attribute
instance-attribute
¶output_event_max_chars = Field(default_factory=(lambda: SpikeDefaults().default_output_event_max_chars))
class-attribute
instance-attribute
¶
SpikePreflightError
¶
Bases: Exception
Raised when preflight checks fail.
SpikeSettings
¶
Bases: BaseSettings
Environment-driven settings for the spike.
model_config = SettingsConfigDict(extra='ignore')
class-attribute
instance-attribute
¶runs_root = Field(default_factory=(lambda: SpikeDefaults().runs_root))
class-attribute
instance-attribute
¶sandbox_root = None
class-attribute
instance-attribute
¶work_branch_prefix = Field(default_factory=(lambda: SpikeDefaults().work_branch_prefix))
class-attribute
instance-attribute
¶from_env()
classmethod
¶Create settings from environment.
TerminationReason
¶
Bases: str, Enum
command_blocked = 'command_blocked'
class-attribute
instance-attribute
¶completed = 'completed'
class-attribute
instance-attribute
¶idle_timeout = 'idle_timeout'
class-attribute
instance-attribute
¶interactive_prompt_detected = 'interactive_prompt_detected'
class-attribute
instance-attribute
¶killed = 'killed'
class-attribute
instance-attribute
¶nonzero_exit = 'nonzero_exit'
class-attribute
instance-attribute
¶wall_clock_timeout = 'wall_clock_timeout'
class-attribute
instance-attribute
¶
policy
¶
Policy defaults for the spike.
SpikePolicyDefaults
dataclass
¶
Default policy values for the spike.
blocked_command_patterns = ('\\brm\\s+-r(f)?\\b', '\\bgit\\s+reset\\s+--hard\\b', '\\bgit\\s+clean\\s+-fdx?\\b', '\\bgit\\s+checkout\\s+--(\\s|$)', '\\bgit\\s+restore\\s+--(worktree|staged)\\b', '\\bgit\\s+branch\\s+-D\\b', '\\bgit\\s+rebase\\b', '\\bgit\\s+merge\\b', '\\bgit\\s+push\\s+--force(-with-lease)?\\b', '\\bgit\\s+commit\\b', '\\bgit\\s+push\\b', '\\bmv\\b.*(\\s|/)\\.git(/|\\s|$)', '\\bcp\\b.*(\\s|/)\\.git(/|\\s|$)', '\\b(curl|wget|ssh|scp|rsync)\\b', '\\b(pip|poetry|npm|brew)\\b')
class-attribute
instance-attribute
¶command_capture_patterns = ('command:\\s*(?P<command>.+)', 'run\\s+command:\\s*(?P<command>.+)', 'execute:\\s*(?P<command>.+)')
class-attribute
instance-attribute
¶interactive_prompt_patterns = ('\\bconfirm\\b', '\\bpassword\\b', '\\bpress\\s+enter\\b', '\\b2fa\\b', '\\botp\\b', '\\by\\/n\\b', '\\byes\\/no\\b')
class-attribute
instance-attribute
¶__init__(blocked_command_patterns=('\\brm\\s+-r(f)?\\b', '\\bgit\\s+reset\\s+--hard\\b', '\\bgit\\s+clean\\s+-fdx?\\b', '\\bgit\\s+checkout\\s+--(\\s|$)', '\\bgit\\s+restore\\s+--(worktree|staged)\\b', '\\bgit\\s+branch\\s+-D\\b', '\\bgit\\s+rebase\\b', '\\bgit\\s+merge\\b', '\\bgit\\s+push\\s+--force(-with-lease)?\\b', '\\bgit\\s+commit\\b', '\\bgit\\s+push\\b', '\\bmv\\b.*(\\s|/)\\.git(/|\\s|$)', '\\bcp\\b.*(\\s|/)\\.git(/|\\s|$)', '\\b(curl|wget|ssh|scp|rsync)\\b', '\\b(pip|poetry|npm|brew)\\b'), interactive_prompt_patterns=('\\bconfirm\\b', '\\bpassword\\b', '\\bpress\\s+enter\\b', '\\b2fa\\b', '\\botp\\b', '\\by\\/n\\b', '\\byes\\/no\\b'), command_capture_patterns=('command:\\s*(?P<command>.+)', 'run\\s+command:\\s*(?P<command>.+)', 'execute:\\s*(?P<command>.+)'))
¶
default_spike_policy()
¶
Build the default spike policy.
protocols
¶
Protocol definitions for the Phase 0 spike.
AgentCommandBuilderProtocol
¶
AgentRunnerProtocol
¶
Bases: Protocol
Run an agent command and capture output.
run(*, command, timeout_seconds, idle_timeout_seconds, heartbeat_interval_seconds, prompt_handler, on_heartbeat, on_output)
¶Execute the agent command.
ArtifactWriterProtocol
¶
ClockProtocol
¶
CommandFilterProtocol
¶
Bases: Protocol
Evaluate whether a command should be blocked.
evaluate(command)
¶Return a decision for the provided command.
CommandPromptParserProtocol
¶
Bases: Protocol
Parse command confirmation prompts.
parse(text)
¶Parse a prompt from text, if present.
EventWriterFactoryProtocol
¶
Bases: Protocol
Create event writers for runs.
create(events_path)
¶Create an event writer for the given path.
EventWriterProtocol
¶
PromptHandlerProtocol
¶
Bases: Protocol
Handle confirmation prompts from agent output.
handle_output(text)
¶Process output text and return handling instructions.
RunIdGeneratorProtocol
¶
WorkspaceCaptureProtocol
¶
Bases: Protocol
Capture git workspace details.
capture_diff()
¶Capture unified diff for the worktree.
capture_status()
¶Capture git status snapshot.
checkout_branch(branch_name)
¶Checkout the specified branch.
create_work_branch(branch_name)
¶Create and checkout a work branch.
current_branch()
¶Return the current branch name.
delete_branch(branch_name)
¶Delete a branch.
repo_root()
¶Return the repo root path.
reset_hard()
¶Reset the current worktree to HEAD.
providers
¶
Providers for the protocol layer spike.
artifact_writer
¶
command_builder
¶
Command builder for agent invocation.
AgentCommandBuilder
dataclass
¶
Bases: AgentCommandBuilderProtocol
Build commands for supported agents.
event_writer
¶
Event stream writer for the spike.
NdjsonEventWriter
dataclass
¶
Bases: EventWriterProtocol
Append events to an NDJSON file.
pty_agent_runner
¶
PTY-based agent runner for the spike.
PtyAgentRunner
dataclass
¶
Bases: AgentRunnerProtocol
Run agents in a PTY and capture output.
subprocess_agent_runner
¶
Subprocess-based agent runner for the spike.
RunnerState
dataclass
¶Mutable state for subprocess collection.
decision
instance-attribute
¶last_heartbeat
instance-attribute
¶last_output
instance-attribute
¶output
instance-attribute
¶stderr
instance-attribute
¶stdout
instance-attribute
¶termination
instance-attribute
¶__init__(output, stdout, stderr, last_output, last_heartbeat, decision, termination)
¶SubprocessAgentRunner
dataclass
¶
Bases: AgentRunnerProtocol
Run agents via subprocess pipes and capture output.
service
¶
Spike run orchestration service.
RunContext
dataclass
¶
Context for a single spike run.
SpikeRunService
dataclass
¶
Orchestrate a single spike run.
agent_runner
instance-attribute
¶artifact_writer
instance-attribute
¶clock
instance-attribute
¶command_builder
instance-attribute
¶event_writer_factory
instance-attribute
¶prompt_handler
instance-attribute
¶run_id_generator
instance-attribute
¶workspace
instance-attribute
¶__init__(clock, run_id_generator, agent_runner, workspace, artifact_writer, event_writer_factory, command_builder, prompt_handler)
¶run(params, *, config, policy)
¶
validation
¶
Maintained validation subsystem for agent orchestration.
ValidationSpec = Annotated[BuiltinValidationSpec | HarnessValidationSpec, Field(discriminator='kind')]
module-attribute
¶
__all__ = ['BackendFamily', 'BuiltinCommandEntry', 'BuiltinValidationSpec', 'BuiltinValidatorId', 'GeneratedHarnessValidatorId', 'HarnessBackendRegistry', 'HarnessBackendRequest', 'HarnessBackendResult', 'HarnessReport', 'ValidationArtifactMergeError', 'ScriptHarnessBackend', 'HarnessValidationSpec', 'HarnessReportLoader', 'StaticHarnessBackendResolver', 'StaticValidatorResolver', 'ValidationCapturedArtifact', 'ValidationTextArtifact', 'ValidationResult', 'ValidationService', 'ValidationSpec', 'ValidationStepRequest', 'ValidationTermination']
module-attribute
¶
BackendFamily
¶
BuiltinCommandEntry
¶
Bases: BaseModel
Trusted builtin command mapping.
BuiltinValidationSpec
¶
BuiltinValidatorId
¶
GeneratedHarnessValidatorId
¶
Bases: str, Enum
Trusted generated harness validator identifiers.
generated_harness = 'generated_harness'
class-attribute
instance-attribute
¶
HarnessBackendRegistry
dataclass
¶
HarnessBackendRequest
¶
Bases: BaseModel
Backend-neutral harness execution request.
arguments = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
artifact_patterns = Field(default_factory=tuple)
class-attribute
instance-attribute
¶
backend_family
instance-attribute
¶
entrypoint = None
class-attribute
instance-attribute
¶
environment_policy
instance-attribute
¶
executable
instance-attribute
¶
output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy)
class-attribute
instance-attribute
¶
timeout_seconds = None
class-attribute
instance-attribute
¶
working_directory
instance-attribute
¶
HarnessBackendResult
¶
Bases: BaseModel
Normalized harness backend result.
HarnessReport
¶
Bases: BaseModel
Minimal harness report needed by the kernel.
proposed_goldens = Field(default_factory=list)
class-attribute
instance-attribute
¶
HarnessReportLoader
dataclass
¶
HarnessValidationSpec
¶
Bases: BaseModel
Kernel-facing generated harness validator spec.
ScriptHarnessBackend
dataclass
¶
Bases: HarnessBackendProtocol
Execute generated script harnesses via the execution subsystem.
StaticHarnessBackendResolver
dataclass
¶
Bases: HarnessBackendResolverProtocol
Resolve trusted harness validators into backend requests.
harness_report_name = 'harness_report.json'
class-attribute
instance-attribute
¶
harness_script_name = 'generated_harness.py'
class-attribute
instance-attribute
¶
__init__(harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
¶
resolve(spec, working_directory)
¶
Resolve one harness validation spec.
StaticValidatorResolver
dataclass
¶
Bases: ValidatorResolverProtocol
Resolve trusted builtin validators into execution requests.
ValidationArtifactMergeError
¶
ValidationCapturedArtifact
¶
ValidationResult
¶
Bases: BaseModel
Validation result exposed to the kernel.
ValidationService
dataclass
¶
Bases: ValidationServiceProtocol
Execute validation steps using the execution subsystem.
ValidationStepRequest
¶
ValidationTermination
¶
Bases: str, Enum
Validation outcomes exposed to the kernel.
completed = 'completed'
class-attribute
instance-attribute
¶
error = 'error'
class-attribute
instance-attribute
¶
killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶
killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶
killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
ValidationTextArtifact
¶
backends
¶
Maintained harness backends for validation.
__all__ = ['HarnessReportLoader', 'ScriptHarnessBackend']
module-attribute
¶
HarnessReportLoader
dataclass
¶
ScriptHarnessBackend
dataclass
¶
Bases: HarnessBackendProtocol
Execute generated script harnesses via the execution subsystem.
models
¶
Typed models for the validation subsystem.
ValidationSpec = Annotated[BuiltinValidationSpec | HarnessValidationSpec, Field(discriminator='kind')]
module-attribute
¶
BackendFamily
¶
BuiltinValidationSpec
¶
BuiltinValidatorId
¶
GeneratedHarnessValidatorId
¶
Bases: str, Enum
Trusted generated harness validator identifiers.
generated_harness = 'generated_harness'
class-attribute
instance-attribute
¶
HarnessBackendRequest
¶
Bases: BaseModel
Backend-neutral harness execution request.
arguments = Field(default_factory=tuple)
class-attribute
instance-attribute
¶artifact_patterns = Field(default_factory=tuple)
class-attribute
instance-attribute
¶backend_family
instance-attribute
¶entrypoint = None
class-attribute
instance-attribute
¶environment_policy
instance-attribute
¶executable
instance-attribute
¶output_capture_policy = Field(default_factory=ExecutionOutputCapturePolicy)
class-attribute
instance-attribute
¶timeout_seconds = None
class-attribute
instance-attribute
¶working_directory
instance-attribute
¶
HarnessBackendResult
¶
Bases: BaseModel
Normalized harness backend result.
HarnessReport
¶
Bases: BaseModel
Minimal harness report needed by the kernel.
proposed_goldens = Field(default_factory=list)
class-attribute
instance-attribute
¶
HarnessValidationSpec
¶
Bases: BaseModel
Kernel-facing generated harness validator spec.
ValidationCapturedArtifact
¶
ValidationResult
¶
Bases: BaseModel
Validation result exposed to the kernel.
ValidationStepRequest
¶
ValidationTermination
¶
Bases: str, Enum
Validation outcomes exposed to the kernel.
completed = 'completed'
class-attribute
instance-attribute
¶error = 'error'
class-attribute
instance-attribute
¶killed_idle = 'killed_idle'
class-attribute
instance-attribute
¶killed_policy = 'killed_policy'
class-attribute
instance-attribute
¶killed_timeout = 'killed_timeout'
class-attribute
instance-attribute
¶
protocols
¶
Protocols for the validation subsystem.
HarnessBackendProtocol
¶
Bases: Protocol
Execute one normalized harness backend request.
run(request)
¶Execute one harness request and normalize outputs.
HarnessBackendRegistryProtocol
¶
Bases: Protocol
Resolve one backend implementation for a harness family.
resolve(family)
¶Return the backend implementation for one harness family.
HarnessBackendResolverProtocol
¶
Bases: Protocol
Resolve harness validators into backend requests.
resolve(spec, working_directory)
¶Resolve one harness validator into a trusted backend request.
service
¶
Validation service built on the execution subsystem.
BuiltinCommandEntry
¶
Bases: BaseModel
Trusted builtin command mapping.
HarnessBackendRegistry
dataclass
¶
StaticHarnessBackendResolver
dataclass
¶
Bases: HarnessBackendResolverProtocol
Resolve trusted harness validators into backend requests.
harness_report_name = 'harness_report.json'
class-attribute
instance-attribute
¶harness_script_name = 'generated_harness.py'
class-attribute
instance-attribute
¶__init__(harness_script_name='generated_harness.py', harness_report_name='harness_report.json')
¶resolve(spec, working_directory)
¶Resolve one harness validation spec.
StaticValidatorResolver
dataclass
¶
Bases: ValidatorResolverProtocol
Resolve trusted builtin validators into execution requests.
ValidationService
dataclass
¶
Bases: ValidationServiceProtocol
Execute validation steps using the execution subsystem.
termination
¶
Shared validation termination helpers.
workspace
¶
Workspace subsystem for agent orchestration.
__all__ = ['GitWorktreeWorkspaceService', 'NullWorkspaceService', 'RollbackTarget', 'WorkspaceContext', 'WorkspaceSnapshot']
module-attribute
¶
GitWorktreeWorkspaceService
dataclass
¶
Bases: WorkspaceServiceProtocol
Manage one conductor-owned git worktree for a workflow run.
base_ref = 'HEAD'
class-attribute
instance-attribute
¶
branch_prefix = 'tnh/run-'
class-attribute
instance-attribute
¶
current_context_value = field(default=None, init=False)
class-attribute
instance-attribute
¶
repo_root
instance-attribute
¶
workspace_root
instance-attribute
¶
__init__(repo_root, workspace_root, base_ref='HEAD', branch_prefix='tnh/run-')
¶
current_context()
¶
Return the active managed workspace context.
diff_summary()
¶
Return the normalized diff for the active worktree.
planned_worktree_path(run_id)
¶
Return the managed worktree path for one run.
prepare_pre_run(run_id)
¶
Create the managed worktree and record its base state.
rollback_pre_run()
¶
Discard and recreate the managed worktree at the recorded base state.
snapshot()
¶
Return the current semantic snapshot for the active worktree.
NullWorkspaceService
dataclass
¶
Bases: WorkspaceServiceProtocol
Workspace service for tests and explicit non-operational contexts.
repo_root
instance-attribute
¶
__init__(repo_root)
¶
current_context()
¶
Return no managed workspace context.
diff_summary()
¶
Return a stable empty diff summary.
planned_worktree_path(run_id)
¶
Return no managed worktree path.
prepare_pre_run(run_id)
¶
Return a stable no-op workspace context.
rollback_pre_run()
¶
Return the stable no-op workspace context.
snapshot()
¶
Return an empty semantic snapshot.
RollbackTarget
¶
Bases: str, Enum
Supported rollback targets.
pre_run = 'pre_run'
class-attribute
instance-attribute
¶
WorkspaceContext
¶
Bases: BaseModel
Managed workspace identity for one mutable run.
base_ref
instance-attribute
¶
base_sha
instance-attribute
¶
branch_name
instance-attribute
¶
created_at = None
class-attribute
instance-attribute
¶
head_sha = None
class-attribute
instance-attribute
¶
repo_root
instance-attribute
¶
run_id = None
class-attribute
instance-attribute
¶
worktree_path
instance-attribute
¶
WorkspaceSnapshot
¶
Bases: BaseModel
Semantic snapshot of workspace state.
base_ref = None
class-attribute
instance-attribute
¶
base_sha = None
class-attribute
instance-attribute
¶
branch_name = None
class-attribute
instance-attribute
¶
diff_summary = None
class-attribute
instance-attribute
¶
head_sha = None
class-attribute
instance-attribute
¶
is_dirty = False
class-attribute
instance-attribute
¶
repo_root
instance-attribute
¶
staged_count = 0
class-attribute
instance-attribute
¶
unstaged_count = 0
class-attribute
instance-attribute
¶
worktree_path = None
class-attribute
instance-attribute
¶
models
¶
Typed models for workspace operations.
RollbackTarget
¶
Bases: str, Enum
Supported rollback targets.
pre_run = 'pre_run'
class-attribute
instance-attribute
¶
WorkspaceContext
¶
Bases: BaseModel
Managed workspace identity for one mutable run.
base_ref
instance-attribute
¶base_sha
instance-attribute
¶branch_name
instance-attribute
¶created_at = None
class-attribute
instance-attribute
¶head_sha = None
class-attribute
instance-attribute
¶repo_root
instance-attribute
¶run_id = None
class-attribute
instance-attribute
¶worktree_path
instance-attribute
¶
WorkspaceSnapshot
¶
Bases: BaseModel
Semantic snapshot of workspace state.
base_ref = None
class-attribute
instance-attribute
¶base_sha = None
class-attribute
instance-attribute
¶branch_name = None
class-attribute
instance-attribute
¶diff_summary = None
class-attribute
instance-attribute
¶head_sha = None
class-attribute
instance-attribute
¶is_dirty = False
class-attribute
instance-attribute
¶repo_root
instance-attribute
¶staged_count = 0
class-attribute
instance-attribute
¶unstaged_count = 0
class-attribute
instance-attribute
¶worktree_path = None
class-attribute
instance-attribute
¶
protocols
¶
Protocols for workspace operations.
WorkspaceServiceProtocol
¶
Bases: Protocol
Workspace safety and rollback operations.
current_context()
¶Return the current managed workspace context, if any.
diff_summary()
¶Return normalized diff summary.
planned_worktree_path(run_id)
¶Return the planned worktree path for one run, if managed.
prepare_pre_run(run_id)
¶Create and persist the pre-run workspace context.
rollback_pre_run()
¶Rollback to the pre-run state and return updated context.
snapshot()
¶Return current semantic snapshot.
service
¶
Workspace services.
GitWorktreeWorkspaceService
dataclass
¶
Bases: WorkspaceServiceProtocol
Manage one conductor-owned git worktree for a workflow run.
base_ref = 'HEAD'
class-attribute
instance-attribute
¶branch_prefix = 'tnh/run-'
class-attribute
instance-attribute
¶current_context_value = field(default=None, init=False)
class-attribute
instance-attribute
¶repo_root
instance-attribute
¶workspace_root
instance-attribute
¶__init__(repo_root, workspace_root, base_ref='HEAD', branch_prefix='tnh/run-')
¶current_context()
¶Return the active managed workspace context.
diff_summary()
¶Return the normalized diff for the active worktree.
planned_worktree_path(run_id)
¶Return the managed worktree path for one run.
prepare_pre_run(run_id)
¶Create the managed worktree and record its base state.
rollback_pre_run()
¶Discard and recreate the managed worktree at the recorded base state.
snapshot()
¶Return the current semantic snapshot for the active worktree.
NullWorkspaceService
dataclass
¶
Bases: WorkspaceServiceProtocol
Workspace service for tests and explicit non-operational contexts.
repo_root
instance-attribute
¶__init__(repo_root)
¶current_context()
¶Return no managed workspace context.
diff_summary()
¶Return a stable empty diff summary.
planned_worktree_path(run_id)
¶Return no managed worktree path.
prepare_pre_run(run_id)
¶Return a stable no-op workspace context.
rollback_pre_run()
¶Return the stable no-op workspace context.
snapshot()
¶Return an empty semantic snapshot.
ai_text_processing
¶
Public surface for tnh_scholar.ai_text_processing.
Historically this module eagerly imported multiple submodules with heavy
dependencies (e.g., audio codecs, ML toolkits) which made importing lightweight
components such as Prompt surprisingly expensive and brittle in test
environments. We now lazily import the concrete implementations on demand so
that callers can depend on just the pieces they need.
__all__ = ['OpenAIProcessor', 'SectionParser', 'SectionProcessor', 'find_sections', 'process_text', 'process_text_by_paragraphs', 'process_text_by_sections', 'get_pattern', 'translate_text_by_lines', 'openai_process_text', 'GitBackedRepository', 'LocalPromptManager', 'Prompt', 'PromptCatalog', 'AIResponse', 'LogicalSection', 'SectionEntry', 'TextObject', 'TextObjectInfo']
module-attribute
¶
AIResponse
¶
Bases: BaseModel
Class for dividing large texts into AI-processable segments while maintaining broader document context.
document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format')
class-attribute
instance-attribute
¶
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose")
class-attribute
instance-attribute
¶
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text')
class-attribute
instance-attribute
¶
language = Field(..., description='ISO 639-1 language code')
class-attribute
instance-attribute
¶
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole')
class-attribute
instance-attribute
¶
sections
instance-attribute
¶
GitBackedRepository
¶
Manages versioned storage of prompts using Git.
Provides basic Git operations while hiding complexity: - Automatic versioning of changes - Basic conflict resolution - History tracking
repo = Repo(repo_path)
instance-attribute
¶
repo_path = repo_path
instance-attribute
¶
__init__(repo_path)
¶
Initialize or connect to Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
repo_path
|
Path
|
Path to repository directory |
required |
Raises:
| Type | Description |
|---|---|
GitCommandError
|
If Git operations fail |
display_history(file_path, max_versions=0)
¶
Display history of changes for a file with diffs between versions.
Shows most recent changes first, limited to max_versions entries. For each change shows: - Commit info and date - Stats summary of changes - Detailed color diff with 2 lines of context
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to file in repository |
required |
max_versions
|
int
|
Maximum number of versions to show; zero shows all revisions. |
0
|
Example
repo.display_history(Path("prompts/format_dharma_talk.yaml")) Commit abc123def (2024-12-28 14:30:22): 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/prompts/format_dharma_talk.yaml ... ...
update_file(file_path)
¶
Stage and commit changes to a file in the Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Absolute or relative path to the file. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Commit hash if changes were made. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the file does not exist. |
ValueError
|
If the file is outside the repository. |
GitCommandError
|
If Git operations fail. |
LocalPromptManager
¶
A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.
This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.
Attributes:
| Name | Type | Description |
|---|---|---|
_instance |
Optional[SingletonPromptManager]
|
The singleton instance |
_prompt_manager |
Optional[PromptManager]
|
The wrapped PromptManager instance |
prompt_manager
property
¶
Lazy initialization of the PromptManager instance.
Returns:
| Name | Type | Description |
|---|---|---|
PromptManager |
PromptCatalog
|
The wrapped PromptManager instance |
Raises:
| Type | Description |
|---|---|
RuntimeError
|
If PATTERN_REPO is not properly configured |
__new__()
¶
Create or return the singleton instance.
Returns:
| Name | Type | Description |
|---|---|---|
SingletonPromptManager |
LocalPromptManager
|
The singleton instance |
get_prompt(name)
¶
Get a prompt by name.
LogicalSection
¶
Bases: BaseModel
Represents a contextually meaningful segment of a larger text.
Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.
OpenAIProcessor
¶
Bases: TextProcessor
OpenAI-based text processor implementation.
Prompt
¶
Base Prompt class for version-controlled template prompts.
Prompts contain: - Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file. - Template fields: Default values for template variables - Metadata: Name and identifier information
Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the prompt |
instructions |
str
|
The Jinja2 template string for this prompt |
default_template_fields |
Dict[str, str]
|
Default values for template variables |
_allow_empty_vars |
bool
|
Whether to allow undefined template variables |
_env |
Environment
|
Configured Jinja2 environment instance |
default_template_fields = default_template_fields or {}
instance-attribute
¶
instructions = instructions
instance-attribute
¶
name = name
instance-attribute
¶
path = path
instance-attribute
¶
__eq__(other)
¶
Compare prompts based on their content.
__hash__()
¶
Hash based on content hash for container operations.
__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)
¶
Initialize a new Prompt instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Unique name identifying the prompt |
required |
instructions
|
MarkdownStr
|
Jinja2 template string containing the prompt |
required |
default_template_fields
|
Optional[Dict[str, str]]
|
Optional default values for template variables |
None
|
allow_empty_vars
|
bool
|
Whether to allow undefined template variables |
False
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If name or instructions are empty |
TemplateError
|
If template syntax is invalid |
apply_template(field_values=None)
¶
Apply template values to prompt instructions using Jinja2.
Values precedence (highest to lowest): 1. field_values (explicitly passed) 2. frontmatter values (from prompt file) 3. default_template_fields (prompt defaults)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
field_values
|
Optional[Dict[str, str]]
|
Values to substitute into the template. If None, uses frontmatter/defaults. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Rendered instructions with template values applied. |
Raises:
| Type | Description |
|---|---|
TemplateError
|
If template rendering fails |
ValueError
|
If required template variables are missing |
content_hash()
¶
Generate a SHA-256 hash of the prompt content.
Useful for quick content comparison and change detection.
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Hexadecimal string of the SHA-256 hash |
extract_frontmatter()
¶
Extract and validate YAML frontmatter from markdown instructions.
Returns:
| Type | Description |
|---|---|
Optional[Dict[str, Any]]
|
Optional[Dict]: Frontmatter data if found and valid, None otherwise |
Note
Frontmatter must be at the very start of the file and properly formatted.
from_dict(data)
classmethod
¶
Create prompt instance from dictionary data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
Dict[str, Any]
|
Dictionary containing prompt data |
required |
Returns:
| Name | Type | Description |
|---|---|---|
Prompt |
Prompt
|
New prompt instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If required fields are missing |
get_content_without_frontmatter()
¶
Get markdown content with frontmatter removed.
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Markdown content without frontmatter |
source_bytes()
¶
Best-effort raw bytes for prompt hashing.
Prefers hashing exact on-disk bytes including front-matter.
We therefore first try to read from prompt_path. If that fails, we fall back
to hashing the concatenation of known templates. In V1, only
the instructions (system template) are used for rendering.
to_dict()
¶
Convert prompt to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
Dict[str, Any]
|
Dict containing all prompt data in serializable format |
update_frontmatter(new_data)
¶
Update or add frontmatter to the markdown content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
new_data
|
Dict[str, Any]
|
Dictionary of frontmatter fields to update |
required |
PromptCatalog
¶
Main interface for prompt management system.
Provides high-level operations: - Prompt creation and loading - Automatic versioning - Safe concurrent access - Basic history tracking - Case-insensitive prompt names (stored as lowercase)
access_manager = ConcurrentAccessManager(self.base_path / '.locks')
instance-attribute
¶
base_path = Path(base_path).resolve()
instance-attribute
¶
repo = GitBackedRepository(self.base_path)
instance-attribute
¶
__init__(base_path)
¶
Initialize prompt management system.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Base directory for prompt storage |
required |
get_path(prompt_name)
¶
Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt_name
|
str
|
prompt name (without extension) to search for |
required |
Returns:
| Type | Description |
|---|---|
Optional[Path]
|
Optional[Path]: Full path to the found prompt file, or None if not found |
load(prompt_name)
¶
Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt_name
|
str
|
Name of the prompt (without .md extension). |
required |
Returns:
| Type | Description |
|---|---|
Prompt
|
A new Prompt object whose 'instructions' is the file's text |
Prompt
|
and whose 'template_fields' are inferred from placeholders in |
Prompt
|
those instructions. |
save(prompt, subdir=None)
¶
show_history(prompt_name)
¶
verify_repository(base_path)
classmethod
¶
Verify repository integrity and uniqueness of prompt names.
Performs the following checks: 1. Validates Git repository structure. 2. Ensures no duplicate prompt names exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Repository path to verify. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if the repository is valid |
bool
|
and contains no duplicate prompt files. |
SectionEntry
¶
SectionParser
¶
Generates structured section breakdowns of text content.
review_count = review_count
instance-attribute
¶
section_pattern = section_pattern
instance-attribute
¶
section_scanner = section_scanner
instance-attribute
¶
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)
¶
Initialize section generator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
section_scanner
|
TextProcessor
|
Text processor used to extract sections |
required |
section_pattern
|
Prompt
|
Pattern object containing section generation instructions |
required |
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)
¶
Generate section breakdown of input text. The text must be split up by newlines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
Input TextObject to process |
required |
section_count_target
|
Optional[int]
|
the target for the number of sections to find |
None
|
segment_size_target
|
Optional[int]
|
the target for the number of lines per section (if section_count_target is specified, this value will be set to generate correct segments) |
None
|
template_dict
|
Optional[Dict[str, str]]
|
Optional additional template variables |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject containing section breakdown |
SectionProcessor
¶
Handles section-based XML text processing with configurable output handling.
pattern = pattern
instance-attribute
¶
processor = processor
instance-attribute
¶
template_dict = template_dict
instance-attribute
¶
wrap_in_document = wrap_in_document
instance-attribute
¶
__init__(processor, pattern, template_dict, wrap_in_document=True)
¶
Initialize the XML section processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
processor
|
TextProcessor
|
Implementation of TextProcessor to use |
required |
pattern
|
Prompt
|
Pattern object containing processing instructions |
required |
template_dict
|
Dict
|
Dictionary for template substitution |
required |
wrap_in_document
|
bool
|
Whether to wrap output in |
True
|
process_paragraphs(text)
¶
Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be given as newline separated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
TextObject to process |
required |
Yields:
| Name | Type | Description |
|---|---|---|
ProcessedSection |
ProcessedSection
|
One processed paragraph at a time, containing: - title: Paragraph number (e.g., 'Paragraph 1') - original_str: Raw paragraph text - processed_str: Processed paragraph text - metadata: Optional metadata dict |
process_sections(text_object)
¶
Process transcript sections and yield results one section at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text_object
|
TextObject
|
Object containing section definitions |
required |
Yields:
| Name | Type | Description |
|---|---|---|
ProcessedSection |
ProcessedSection
|
One processed section at a time, containing: - title: Section title (English or original language) - original_text: Raw text segment - processed_text: Processed text content - start_line: Starting line number |
TextObject
¶
Manages text content with section organization and metadata tracking.
TextObject serves as the core container for text processing, providing: - Line-numbered text content management - Language identification - Section organization and access - Metadata tracking including incorporated processing stages
The class allows for section boundaries through line numbering, allowing sections to be defined by start lines without explicit end lines. Subsequent sections implicitly end where the next section begins. SectionObjects are utilized to represent sections.
Attributes:
| Name | Type | Description |
|---|---|---|
num_text |
Line-numbered text content manager |
|
language |
ISO 639-1 language code for the text content |
|
sections |
List of text sections with boundaries |
|
metadata |
Processing and content metadata container |
Example
content = NumberedText("Line 1\nLine 2\nLine 3") obj = TextObject(content, language="en")
content
property
¶
Get the raw text content without line numbers.
Returns:
| Type | Description |
|---|---|
str
|
Plain text content as string |
language = language or get_language_code_from_text(num_text.content)
instance-attribute
¶
last_line_num
property
¶
Get the last line number in the text.
Returns:
| Type | Description |
|---|---|
int
|
Last line number (1-based indexing) |
metadata = metadata or Metadata()
instance-attribute
¶
metadata_str
property
¶
Get metadata as YAML-formatted string.
Returns:
| Type | Description |
|---|---|
str
|
YAML representation of metadata |
Example
print(obj.metadata_str) author: Thich Nhat Hanh language: en
num_text = num_text
instance-attribute
¶
numbered_content
property
¶
Get text content with line numbers prefixed.
Returns:
| Type | Description |
|---|---|
str
|
Text with line numbers in format " 1 | line content" |
Example
print(obj.numbered_content) 1 | First line 2 | Second line
section_count
property
¶
Get the total number of sections.
Returns:
| Type | Description |
|---|---|
int
|
Number of sections, or 0 if no sections defined |
sections = sections or []
instance-attribute
¶
__init__(num_text, language=None, sections=None, metadata=None, validate_on_init=True)
¶
__iter__()
¶
Iterate through sections, yielding full section information.
Note: Pydantic BaseModel defines iter for dict-like iteration over fields. We override it here for domain-specific section iteration. The type: ignore is intentional as we're providing a different iteration interface.
__str__()
¶
export_info(source_file=None)
¶
Export serializable state for persistence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_file
|
Optional[Path]
|
Optional path to source file to record in metadata |
None
|
Returns:
| Type | Description |
|---|---|
TextObjectInfo
|
TextObjectInfo instance containing serializable state |
Note
If source_file is provided, it will be resolved to an absolute path.
from_info(info, metadata, num_text)
classmethod
¶
Create TextObject from serialized info and content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
info
|
TextObjectInfo
|
Serialized TextObjectInfo with section and language data |
required |
metadata
|
Metadata
|
Base metadata to merge into the object |
required |
num_text
|
NumberedText
|
NumberedText instance with the actual content |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with combined info and metadata |
Example
info = TextObjectInfo.model_validate_json(json_str) text = read_str_from_file(info.source_file) obj = TextObject.from_info(info, Metadata(), NumberedText(text))
from_response(response, existing_metadata, num_text)
classmethod
¶
Create TextObject from AI response with section boundaries and metadata.
Extracts sections, language, and metadata from an AI-generated response (e.g., from sectioning or translation processing).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
response
|
AIResponse
|
AIResponse model containing sections and metadata |
required |
existing_metadata
|
Metadata
|
Base metadata to start with |
required |
num_text
|
NumberedText
|
NumberedText instance with the text content |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject with sections and merged metadata from AI response |
Note
Merges metadata in order: existing → ai_summary/concepts/context → document_metadata
from_section_file(section_file, source=None)
classmethod
¶
Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
section_file
|
Path
|
Path to JSON file containing TextObjectInfo |
required |
source
|
Optional[str]
|
Optional source string in case no source file is found. |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If source_file is missing from section info |
FileNotFoundError
|
If either section_file or source_file not found |
from_str(text, language=None, sections=None, metadata=None)
classmethod
¶
Create a TextObject from a string, extracting any frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Input text string, potentially containing frontmatter |
required |
language
|
Optional[str]
|
ISO language code |
None
|
sections
|
Optional[List[SectionObject]]
|
List of section objects |
None
|
metadata
|
Optional[Metadata]
|
Optional base metadata to merge with frontmatter |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with combined metadata |
from_text_file(file)
classmethod
¶
Create TextObject from a text file.
Reads the file and extracts any frontmatter metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to text file |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with extracted content and metadata |
Example
obj = TextObject.from_text_file(Path("document.txt"))
get_section_content(index)
¶
Get content for a section by index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
index
|
int
|
Zero-based section index |
required |
Returns:
| Type | Description |
|---|---|
str
|
Section content as string |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no sections are available |
IndexError
|
If index is out of range |
Example
obj = TextObject(num_text, sections=[...]) content = obj.get_section_content(0) # First section
load(path, config=None)
classmethod
¶
Load TextObject from file with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Input file path |
required |
config
|
Optional[LoadConfig]
|
Optional loading configuration. If not provided, loads directly from text file. |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance |
Usage
Load from text file with frontmatter¶
obj = TextObject.load(Path("content.txt"))
Load state from JSON with source content string¶
config = LoadConfig( format=StorageFormat.JSON, source_str="Text content..." ) obj = TextObject.load(Path("state.json"), config)
Load state from JSON with source content file¶
config = LoadConfig( format=StorageFormat.JSON, source_file=Path("content.txt") ) obj = TextObject.load(Path("state.json"), config)
merge_metadata(new_metadata, strategy=MergeStrategy.PRESERVE, source=None)
¶
Merge metadata with explicit strategy and optional provenance tracking.
merge_metadata_legacy(new_metadata, override=False)
¶
Deprecated legacy merge interface that maps to MergeStrategy.
save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)
¶
Save TextObject to file in specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Output file path |
required |
output_format
|
StorageFormatType
|
"text" for full content+metadata or "json" for serialized state |
TEXT
|
source_file
|
Optional[Path]
|
Optional source file to record in metadata |
None
|
pretty
|
bool
|
For JSON output, whether to pretty print |
True
|
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)
¶
Return a new TextObject with requested changes; does not mutate the original.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_str
|
Optional[str]
|
Optional new text content |
None
|
language
|
Optional[str]
|
Optional new language code |
None
|
metadata
|
Optional[Metadata]
|
Metadata to merge into the new object |
None
|
process_metadata
|
Optional[ProcessMetadata]
|
Identifier/details for the process performed |
None
|
sections
|
Optional[List[SectionObject]]
|
Optional replacement list of sections |
None
|
update_metadata(**kwargs)
¶
Update metadata with new key-value pairs using PRESERVE strategy.
Convenience method for adding metadata without overriding existing keys.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Key-value pairs to add to metadata |
{}
|
Example
obj.update_metadata(author="Thich Nhat Hanh", year=2020)
validate_sections(raise_on_error=True)
¶
Validate section integrity using NumberedText boundary checks.
TextObjectInfo
¶
Bases: BaseModel
Serializable information about a text and its sections.
__dir__()
¶
__getattr__(name)
¶
find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)
¶
High-level function for generating text sections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
Input text |
required |
source_language
|
Optional[str]
|
ISO 639-1 language code |
None
|
section_pattern
|
Optional[Prompt]
|
Optional custom pattern (uses default if None) |
None
|
section_model
|
Optional[str]
|
Optional model identifier |
None
|
max_tokens
|
int
|
Maximum tokens for response |
DEFAULT_SECTION_RESULT_MAX_SIZE
|
section_count
|
Optional[int]
|
Target number of sections |
None
|
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
template_dict
|
Optional[Dict[str, str]]
|
Optional additional template variables |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject containing section breakdown |
get_pattern(name)
¶
Get a pattern by name using the singleton PatternManager.
This is a more efficient version that reuses a single PatternManager instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Name of the pattern to load |
required |
Returns:
| Type | Description |
|---|---|
Prompt
|
The loaded pattern |
Raises:
| Type | Description |
|---|---|
ValueError
|
If pattern name is invalid |
FileNotFoundError
|
If pattern file doesn't exist |
openai_process_text(text_input, process_instructions, model=None, response_format=None, batch=False, max_tokens=0)
¶
postprocessing a transcription.
process_text(text, pattern, source_language=None, model=None, template_dict=None)
¶
process_text_by_paragraphs(text, template_dict, pattern=None, model=None)
¶
High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
TextObject to process |
required |
template_dict
|
Dict[str, str]
|
Dictionary for template substitution |
required |
pattern
|
Optional[Prompt]
|
Pattern object containing processing instructions |
None
|
model
|
Optional[str]
|
Optional model identifier for processor |
None
|
Returns:
| Type | Description |
|---|---|
None
|
Generator for ProcessedSection objects (one per paragraph) |
process_text_by_sections(text_object, template_dict, pattern, model=None)
¶
High-level function for processing text sections with configurable output handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text_object
|
TextObject
|
Object containing section definitions |
required |
pattern
|
Prompt
|
Pattern object containing processing instructions |
required |
template_dict
|
Dict
|
Dictionary for template substitution |
required |
model
|
Optional[str]
|
Optional model identifier for processor |
None
|
Returns:
| Type | Description |
|---|---|
None
|
Generator for ProcessedSections |
translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)
¶
ai_text_processing
¶
DEFAULT_MIN_SECTION_COUNT = 3
module-attribute
¶
DEFAULT_OPENAI_MODEL = 'gpt-4o'
module-attribute
¶
DEFAULT_PARAGRAPH_FORMAT_PATTERN = 'default_xml_paragraph_format'
module-attribute
¶
DEFAULT_PUNCTUATE_MODEL = 'gpt-4o'
module-attribute
¶
DEFAULT_PUNCTUATE_PATTERN = 'default_punctuate'
module-attribute
¶
DEFAULT_PUNCTUATE_STYLE = 'APA'
module-attribute
¶
DEFAULT_REVIEW_COUNT = 5
module-attribute
¶
DEFAULT_SECTION_PATTERN = 'default_section'
module-attribute
¶
DEFAULT_SECTION_RANGE_VAR = 2
module-attribute
¶
DEFAULT_SECTION_RESULT_MAX_SIZE = 4000
module-attribute
¶
DEFAULT_SECTION_TOKEN_SIZE = 650
module-attribute
¶
DEFAULT_XML_FORMAT_PATTERN = 'default_xml_format'
module-attribute
¶
SECTION_SEGMENT_SIZE_WARNING_LIMIT = 5
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
GeneralProcessor
¶
pattern = pattern
instance-attribute
¶
processor = processor
instance-attribute
¶
review_count = review_count
instance-attribute
¶
source_language = source_language
instance-attribute
¶
__init__(processor, pattern, source_language=None, review_count=DEFAULT_REVIEW_COUNT)
¶
Initialize general processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
processor
|
TextProcessor
|
Implementation of TextProcessor |
required |
pattern
|
Prompt
|
Pattern object containing processing instructions |
required |
source_language
|
Optional[str]
|
ISO code for the source language |
None
|
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
process_text(text, template_dict=None)
¶
process a text based on a pattern and source language.
OpenAIProcessor
¶
Bases: TextProcessor
OpenAI-based text processor implementation.
ProcessedSection
dataclass
¶
SectionParser
¶
Generates structured section breakdowns of text content.
review_count = review_count
instance-attribute
¶
section_pattern = section_pattern
instance-attribute
¶
section_scanner = section_scanner
instance-attribute
¶
__init__(section_scanner, section_pattern, review_count=DEFAULT_REVIEW_COUNT)
¶
Initialize section generator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
section_scanner
|
TextProcessor
|
Text processor used to extract sections |
required |
section_pattern
|
Prompt
|
Pattern object containing section generation instructions |
required |
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
find_sections(text, section_count_target=None, segment_size_target=None, template_dict=None)
¶
Generate section breakdown of input text. The text must be split up by newlines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
Input TextObject to process |
required |
section_count_target
|
Optional[int]
|
the target for the number of sections to find |
None
|
segment_size_target
|
Optional[int]
|
the target for the number of lines per section (if section_count_target is specified, this value will be set to generate correct segments) |
None
|
template_dict
|
Optional[Dict[str, str]]
|
Optional additional template variables |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject containing section breakdown |
SectionProcessor
¶
Handles section-based XML text processing with configurable output handling.
pattern = pattern
instance-attribute
¶
processor = processor
instance-attribute
¶
template_dict = template_dict
instance-attribute
¶
wrap_in_document = wrap_in_document
instance-attribute
¶
__init__(processor, pattern, template_dict, wrap_in_document=True)
¶
Initialize the XML section processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
processor
|
TextProcessor
|
Implementation of TextProcessor to use |
required |
pattern
|
Prompt
|
Pattern object containing processing instructions |
required |
template_dict
|
Dict
|
Dictionary for template substitution |
required |
wrap_in_document
|
bool
|
Whether to wrap output in |
True
|
process_paragraphs(text)
¶
Process transcript by paragraphs (as sections), yielding ProcessedSection objects. Paragraphs are assumed to be given as newline separated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
TextObject to process |
required |
Yields:
| Name | Type | Description |
|---|---|---|
ProcessedSection |
ProcessedSection
|
One processed paragraph at a time, containing: - title: Paragraph number (e.g., 'Paragraph 1') - original_str: Raw paragraph text - processed_str: Processed paragraph text - metadata: Optional metadata dict |
process_sections(text_object)
¶
Process transcript sections and yield results one section at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text_object
|
TextObject
|
Object containing section definitions |
required |
Yields:
| Name | Type | Description |
|---|---|---|
ProcessedSection |
ProcessedSection
|
One processed section at a time, containing: - title: Section title (English or original language) - original_text: Raw text segment - processed_text: Processed text content - start_line: Starting line number |
TextProcessor
¶
Bases: ABC
Abstract base class for text processors that can return Pydantic objects.
process_text(input_str, instructions, response_format=None, **kwargs)
abstractmethod
¶
Process text according to instructions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_str
|
str
|
Input text to process |
required |
instructions
|
str
|
Processing instructions |
required |
response_format
|
Optional[Type[BaseModel]]
|
Optional Pydantic class for structured output |
None
|
**kwargs
|
Any
|
Additional processing parameters |
{}
|
Returns:
| Type | Description |
|---|---|
ProcessorResult
|
Either string or Pydantic model instance based on response_model |
find_sections(text, source_language=None, section_pattern=None, section_model=None, max_tokens=DEFAULT_SECTION_RESULT_MAX_SIZE, section_count=None, review_count=DEFAULT_REVIEW_COUNT, template_dict=None)
¶
High-level function for generating text sections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
Input text |
required |
source_language
|
Optional[str]
|
ISO 639-1 language code |
None
|
section_pattern
|
Optional[Prompt]
|
Optional custom pattern (uses default if None) |
None
|
section_model
|
Optional[str]
|
Optional model identifier |
None
|
max_tokens
|
int
|
Maximum tokens for response |
DEFAULT_SECTION_RESULT_MAX_SIZE
|
section_count
|
Optional[int]
|
Target number of sections |
None
|
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
template_dict
|
Optional[Dict[str, str]]
|
Optional additional template variables |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject containing section breakdown |
get_pattern(name)
¶
Get a pattern by name using the singleton PatternManager.
This is a more efficient version that reuses a single PatternManager instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Name of the pattern to load |
required |
Returns:
| Type | Description |
|---|---|
Prompt
|
The loaded pattern |
Raises:
| Type | Description |
|---|---|
ValueError
|
If pattern name is invalid |
FileNotFoundError
|
If pattern file doesn't exist |
process_text(text, pattern, source_language=None, model=None, template_dict=None)
¶
process_text_by_paragraphs(text, template_dict, pattern=None, model=None)
¶
High-level function for processing text paragraphs, yielding ProcessedSection objects. Assumes paragraphs are separated by newlines. Uses DEFAULT_XML_FORMAT_PATTERN as default pattern for text processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
TextObject to process |
required |
template_dict
|
Dict[str, str]
|
Dictionary for template substitution |
required |
pattern
|
Optional[Prompt]
|
Pattern object containing processing instructions |
None
|
model
|
Optional[str]
|
Optional model identifier for processor |
None
|
Returns:
| Type | Description |
|---|---|
None
|
Generator for ProcessedSection objects (one per paragraph) |
process_text_by_sections(text_object, template_dict, pattern, model=None)
¶
High-level function for processing text sections with configurable output handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text_object
|
TextObject
|
Object containing section definitions |
required |
pattern
|
Prompt
|
Pattern object containing processing instructions |
required |
template_dict
|
Dict
|
Dictionary for template substitution |
required |
model
|
Optional[str]
|
Optional model identifier for processor |
None
|
Returns:
| Type | Description |
|---|---|
None
|
Generator for ProcessedSections |
general_processor
¶
line_translator
¶
DEFAULT_TARGET_LANGUAGE = 'en'
module-attribute
¶
DEFAULT_TRANSLATE_CONTEXT_LINES = 3
module-attribute
¶
DEFAULT_TRANSLATE_STYLE = "'American Dharma Teaching'"
module-attribute
¶
DEFAULT_TRANSLATION_PATTERN = 'default_line_translate'
module-attribute
¶
DEFAULT_TRANSLATION_TARGET_TOKENS = 300
module-attribute
¶
FOLLOWING_CONTEXT_MARKER = 'FOLLOWING_CONTEXT'
module-attribute
¶
MAX_RETRIES = 6
module-attribute
¶
MIN_SEGMENT_SIZE = 4
module-attribute
¶
PRECEDING_CONTEXT_MARKER = 'PRECEDING_CONTEXT'
module-attribute
¶
TRANSCRIPT_SEGMENT_MARKER = 'TRANSCRIPT_SEGMENT'
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
LineTranslator
¶
Translates text line by line while maintaining line numbers and context.
context_lines = context_lines
instance-attribute
¶
pattern = pattern
instance-attribute
¶
processor = processor
instance-attribute
¶
review_count = review_count
instance-attribute
¶
style = style
instance-attribute
¶
__init__(processor, pattern, review_count=DEFAULT_REVIEW_COUNT, style=DEFAULT_TRANSLATE_STYLE, context_lines=DEFAULT_TRANSLATE_CONTEXT_LINES)
¶
Initialize line translator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
processor
|
TextProcessor
|
Implementation of TextProcessor |
required |
pattern
|
Prompt
|
Pattern object containing translation instructions |
required |
review_count
|
int
|
Number of review passes |
DEFAULT_REVIEW_COUNT
|
style
|
str
|
Translation style to apply |
DEFAULT_TRANSLATE_STYLE
|
context_lines
|
int
|
Number of context lines to include before/after |
DEFAULT_TRANSLATE_CONTEXT_LINES
|
translate_segment(num_text, start_line, end_line, metadata, target_language, source_language, template_dict=None)
¶
Translate a segment of text with context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_text
|
NumberedText
|
Numbered text to extract segment from |
required |
start_line
|
int
|
Starting line number of segment |
required |
end_line
|
int
|
Ending line number of segment |
required |
metadata
|
Metadata
|
metadata for text |
required |
source_language
|
str
|
Source language code |
required |
target_language
|
str
|
Target language code (default: en for English) |
required |
template_dict
|
Optional[Dict]
|
Optional additional template values |
None
|
Returns:
| Type | Description |
|---|---|
str
|
Translated text segment with line numbers preserved |
translate_text(text, source_language, segment_size=None, target_language=DEFAULT_TARGET_LANGUAGE, template_dict=None)
¶
Translate entire text in segments while maintaining line continuity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
TextObject
|
Text to translate |
required |
segment_size
|
Optional[int]
|
Number of lines per translation segment |
None
|
source_language
|
str
|
Source language code |
required |
target_language
|
str
|
Target language code (default: en for English) |
DEFAULT_TARGET_LANGUAGE
|
template_dict
|
Optional[Dict]
|
Optional additional template values |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
Complete translated text with line numbers preserved |
translate_text_by_lines(text, source_language=None, target_language=DEFAULT_TARGET_LANGUAGE, pattern=None, model=None, style=None, segment_size=None, context_lines=None, review_count=None, template_dict=None)
¶
openai_process_interface
¶
prompts
¶
MANAGER_UPDATE_MESSAGE = 'PromptManager Update:'
module-attribute
¶
MarkdownStr = NewType('MarkdownStr', str)
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
ConcurrentAccessManager
¶
Manages concurrent access to prompt files.
Provides: - File-level locking - Safe concurrent access prompts - Lock cleanup
lock_dir = Path(lock_dir)
instance-attribute
¶
__init__(lock_dir)
¶
Initialize access manager.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
lock_dir
|
Path
|
Directory for lock files |
required |
file_lock(file_path)
¶
Context manager for safely accessing files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to file to lock |
required |
Yields:
| Type | Description |
|---|---|
None
|
None when lock is acquired |
Raises:
| Type | Description |
|---|---|
RuntimeError
|
If file is already locked |
OSError
|
If lock file operations fail |
is_locked(file_path)
¶
Check if a file is currently locked.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to file to check |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if file is locked |
GitBackedRepository
¶
Manages versioned storage of prompts using Git.
Provides basic Git operations while hiding complexity: - Automatic versioning of changes - Basic conflict resolution - History tracking
repo = Repo(repo_path)
instance-attribute
¶
repo_path = repo_path
instance-attribute
¶
__init__(repo_path)
¶
Initialize or connect to Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
repo_path
|
Path
|
Path to repository directory |
required |
Raises:
| Type | Description |
|---|---|
GitCommandError
|
If Git operations fail |
display_history(file_path, max_versions=0)
¶
Display history of changes for a file with diffs between versions.
Shows most recent changes first, limited to max_versions entries. For each change shows: - Commit info and date - Stats summary of changes - Detailed color diff with 2 lines of context
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to file in repository |
required |
max_versions
|
int
|
Maximum number of versions to show; zero shows all revisions. |
0
|
Example
repo.display_history(Path("prompts/format_dharma_talk.yaml")) Commit abc123def (2024-12-28 14:30:22): 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/prompts/format_dharma_talk.yaml ... ...
update_file(file_path)
¶
Stage and commit changes to a file in the Git repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Absolute or relative path to the file. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Commit hash if changes were made. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the file does not exist. |
ValueError
|
If the file is outside the repository. |
GitCommandError
|
If Git operations fail. |
LocalPromptManager
¶
A simple singleton implementation of PromptManager that ensures only one instance is created and reused throughout the application lifecycle.
This class wraps the PromptManager to provide efficient prompt loading by maintaining a single reusable instance.
Attributes:
| Name | Type | Description |
|---|---|---|
_instance |
Optional[SingletonPromptManager]
|
The singleton instance |
_prompt_manager |
Optional[PromptManager]
|
The wrapped PromptManager instance |
prompt_manager
property
¶
Lazy initialization of the PromptManager instance.
Returns:
| Name | Type | Description |
|---|---|---|
PromptManager |
PromptCatalog
|
The wrapped PromptManager instance |
Raises:
| Type | Description |
|---|---|
RuntimeError
|
If PATTERN_REPO is not properly configured |
__new__()
¶
Create or return the singleton instance.
Returns:
| Name | Type | Description |
|---|---|---|
SingletonPromptManager |
LocalPromptManager
|
The singleton instance |
get_prompt(name)
¶
Get a prompt by name.
Prompt
¶
Base Prompt class for version-controlled template prompts.
Prompts contain: - Instructions: The main prompt instructions as a Jinja2 template. Note: Instructions are intended to be saved in markdown format in a .md file. - Template fields: Default values for template variables - Metadata: Name and identifier information
Version control is handled externally through Git, not in the prompt itself. Prompt identity is determined by the combination of identifiers.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the prompt |
instructions |
str
|
The Jinja2 template string for this prompt |
default_template_fields |
Dict[str, str]
|
Default values for template variables |
_allow_empty_vars |
bool
|
Whether to allow undefined template variables |
_env |
Environment
|
Configured Jinja2 environment instance |
default_template_fields = default_template_fields or {}
instance-attribute
¶
instructions = instructions
instance-attribute
¶
name = name
instance-attribute
¶
path = path
instance-attribute
¶
__eq__(other)
¶
Compare prompts based on their content.
__hash__()
¶
Hash based on content hash for container operations.
__init__(name, instructions, path=None, default_template_fields=None, allow_empty_vars=False)
¶
Initialize a new Prompt instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Unique name identifying the prompt |
required |
instructions
|
MarkdownStr
|
Jinja2 template string containing the prompt |
required |
default_template_fields
|
Optional[Dict[str, str]]
|
Optional default values for template variables |
None
|
allow_empty_vars
|
bool
|
Whether to allow undefined template variables |
False
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If name or instructions are empty |
TemplateError
|
If template syntax is invalid |
apply_template(field_values=None)
¶
Apply template values to prompt instructions using Jinja2.
Values precedence (highest to lowest): 1. field_values (explicitly passed) 2. frontmatter values (from prompt file) 3. default_template_fields (prompt defaults)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
field_values
|
Optional[Dict[str, str]]
|
Values to substitute into the template. If None, uses frontmatter/defaults. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Rendered instructions with template values applied. |
Raises:
| Type | Description |
|---|---|
TemplateError
|
If template rendering fails |
ValueError
|
If required template variables are missing |
content_hash()
¶
Generate a SHA-256 hash of the prompt content.
Useful for quick content comparison and change detection.
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Hexadecimal string of the SHA-256 hash |
extract_frontmatter()
¶
Extract and validate YAML frontmatter from markdown instructions.
Returns:
| Type | Description |
|---|---|
Optional[Dict[str, Any]]
|
Optional[Dict]: Frontmatter data if found and valid, None otherwise |
Note
Frontmatter must be at the very start of the file and properly formatted.
from_dict(data)
classmethod
¶
Create prompt instance from dictionary data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
Dict[str, Any]
|
Dictionary containing prompt data |
required |
Returns:
| Name | Type | Description |
|---|---|---|
Prompt |
Prompt
|
New prompt instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If required fields are missing |
get_content_without_frontmatter()
¶
Get markdown content with frontmatter removed.
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Markdown content without frontmatter |
source_bytes()
¶
Best-effort raw bytes for prompt hashing.
Prefers hashing exact on-disk bytes including front-matter.
We therefore first try to read from prompt_path. If that fails, we fall back
to hashing the concatenation of known templates. In V1, only
the instructions (system template) are used for rendering.
to_dict()
¶
Convert prompt to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
Dict[str, Any]
|
Dict containing all prompt data in serializable format |
update_frontmatter(new_data)
¶
Update or add frontmatter to the markdown content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
new_data
|
Dict[str, Any]
|
Dictionary of frontmatter fields to update |
required |
PromptCatalog
¶
Main interface for prompt management system.
Provides high-level operations: - Prompt creation and loading - Automatic versioning - Safe concurrent access - Basic history tracking - Case-insensitive prompt names (stored as lowercase)
access_manager = ConcurrentAccessManager(self.base_path / '.locks')
instance-attribute
¶
base_path = Path(base_path).resolve()
instance-attribute
¶
repo = GitBackedRepository(self.base_path)
instance-attribute
¶
__init__(base_path)
¶
Initialize prompt management system.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Base directory for prompt storage |
required |
get_path(prompt_name)
¶
Recursively search for a prompt file with the given name (case-insensitive) in base_path and all subdirectories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt_name
|
str
|
prompt name (without extension) to search for |
required |
Returns:
| Type | Description |
|---|---|
Optional[Path]
|
Optional[Path]: Full path to the found prompt file, or None if not found |
load(prompt_name)
¶
Load the .md prompt file by name, extract placeholders, and return a fully constructed Prompt object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt_name
|
str
|
Name of the prompt (without .md extension). |
required |
Returns:
| Type | Description |
|---|---|
Prompt
|
A new Prompt object whose 'instructions' is the file's text |
Prompt
|
and whose 'template_fields' are inferred from placeholders in |
Prompt
|
those instructions. |
save(prompt, subdir=None)
¶
show_history(prompt_name)
¶
verify_repository(base_path)
classmethod
¶
Verify repository integrity and uniqueness of prompt names.
Performs the following checks: 1. Validates Git repository structure. 2. Ensures no duplicate prompt names exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Repository path to verify. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if the repository is valid |
bool
|
and contains no duplicate prompt files. |
response_format
¶
TEXT_SECTIONS_DESCRIPTION = 'Ordered list of logical sections for the text. The sequence of line ranges for the sections must cover every line from start to finish without any overlaps or gaps.'
module-attribute
¶
LogicalSection
¶
Bases: BaseModel
A logically coherent section of text.
end_line = Field(..., description='Ending line number of the section (inclusive).')
class-attribute
instance-attribute
¶
start_line = Field(..., description='Starting line number of the section (inclusive).')
class-attribute
instance-attribute
¶
title = Field(..., description='Meaningful title for the section in the original language of the section.')
class-attribute
instance-attribute
¶
TextObject
¶
section_processor
¶
text_object
¶
StorageFormatType = Union[StorageFormat, Literal['text', 'json']]
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
AIResponse
¶
Bases: BaseModel
Class for dividing large texts into AI-processable segments while maintaining broader document context.
document_metadata = Field(..., description='Available Dublin Core standard metadata in human-readable YAML format')
class-attribute
instance-attribute
¶
document_summary = Field(..., description="Concise, comprehensive overview of the text's content and purpose")
class-attribute
instance-attribute
¶
key_concepts = Field(..., description='Important terms, ideas, or references that appear throughout the text')
class-attribute
instance-attribute
¶
language = Field(..., description='ISO 639-1 language code')
class-attribute
instance-attribute
¶
narrative_context = Field(..., description='Concise overview of how the text develops or progresses as a whole')
class-attribute
instance-attribute
¶
sections
instance-attribute
¶
LoadConfig
dataclass
¶
Configuration for loading a TextObject.
Attributes:
| Name | Type | Description |
|---|---|---|
format |
StorageFormat
|
Storage format of the input file |
source_str |
Optional[str]
|
Optional source content as string |
source_file |
Optional[Path]
|
Optional path to source content file |
Note
For JSON format, exactly one of source_str or source_file may be provided. Both fields are ignored for TEXT format.
format = StorageFormat.TEXT
class-attribute
instance-attribute
¶
source_file = None
class-attribute
instance-attribute
¶
source_str = None
class-attribute
instance-attribute
¶
__init__(format=StorageFormat.TEXT, source_str=None, source_file=None)
¶
__post_init__()
¶
Validate LoadConfig constraints.
Ensures exactly one source is provided for JSON format using XOR logic.
Raises:
| Type | Description |
|---|---|
ValueError
|
If JSON format specified without exactly one source |
get_source_text()
¶
Get source content as text.
Reads from source_file if provided, otherwise returns source_str.
Returns:
| Type | Description |
|---|---|
Optional[str]
|
Source text content, or None if neither source is set |
Note
This method is primarily used internally by TextObject.load() for JSON format loading.
LogicalSection
¶
Bases: BaseModel
Represents a contextually meaningful segment of a larger text.
Sections should preserve natural breaks in content (explicit section markers, topic shifts, argument development, narrative progression) while staying within specified size limits in order to create chunks suitable for AI processing.
MergeStrategy
¶
SectionBoundaryError
¶
Bases: ValidationError
Raised when section boundaries have gaps, overlaps, or out-of-bounds errors.
Attributes:
| Name | Type | Description |
|---|---|---|
errors |
List of SectionValidationError instances from NumberedText |
|
coverage_report |
Coverage statistics (coverage_pct, gaps, overlaps) |
SectionEntry
¶
SectionObject
dataclass
¶
Represents a section of text with computed boundaries and optional metadata.
SectionObject is used internally by TextObject to track section ranges. Unlike LogicalSection (which only has start_line), SectionObject includes the computed end boundary.
Attributes:
| Name | Type | Description |
|---|---|---|
title |
str
|
Descriptive title of the section |
section_range |
SectionRange
|
Line range (start inclusive, end exclusive) |
metadata |
Optional[Metadata]
|
Optional section-specific metadata |
metadata
instance-attribute
¶
section_range
instance-attribute
¶
title
instance-attribute
¶
__init__(title, section_range, metadata)
¶
from_logical_section(logical_section, end_line, metadata=None)
classmethod
¶
Create a SectionObject from a LogicalSection with computed end boundary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
logical_section
|
LogicalSection
|
AI-generated section with start_line and title |
required |
end_line
|
int
|
Computed end boundary (exclusive) |
required |
metadata
|
Optional[Metadata]
|
Optional metadata for this section |
None
|
Returns:
| Type | Description |
|---|---|
SectionObject
|
SectionObject with complete range information |
SectionRange
¶
StorageFormat
¶
TextObject
¶
Manages text content with section organization and metadata tracking.
TextObject serves as the core container for text processing, providing: - Line-numbered text content management - Language identification - Section organization and access - Metadata tracking including incorporated processing stages
The class allows for section boundaries through line numbering, allowing sections to be defined by start lines without explicit end lines. Subsequent sections implicitly end where the next section begins. SectionObjects are utilized to represent sections.
Attributes:
| Name | Type | Description |
|---|---|---|
num_text |
Line-numbered text content manager |
|
language |
ISO 639-1 language code for the text content |
|
sections |
List of text sections with boundaries |
|
metadata |
Processing and content metadata container |
Example
content = NumberedText("Line 1\nLine 2\nLine 3") obj = TextObject(content, language="en")
content
property
¶
Get the raw text content without line numbers.
Returns:
| Type | Description |
|---|---|
str
|
Plain text content as string |
language = language or get_language_code_from_text(num_text.content)
instance-attribute
¶
last_line_num
property
¶
Get the last line number in the text.
Returns:
| Type | Description |
|---|---|
int
|
Last line number (1-based indexing) |
metadata = metadata or Metadata()
instance-attribute
¶
metadata_str
property
¶
Get metadata as YAML-formatted string.
Returns:
| Type | Description |
|---|---|
str
|
YAML representation of metadata |
Example
print(obj.metadata_str) author: Thich Nhat Hanh language: en
num_text = num_text
instance-attribute
¶
numbered_content
property
¶
Get text content with line numbers prefixed.
Returns:
| Type | Description |
|---|---|
str
|
Text with line numbers in format " 1 | line content" |
Example
print(obj.numbered_content) 1 | First line 2 | Second line
section_count
property
¶
Get the total number of sections.
Returns:
| Type | Description |
|---|---|
int
|
Number of sections, or 0 if no sections defined |
sections = sections or []
instance-attribute
¶
__init__(num_text, language=None, sections=None, metadata=None, validate_on_init=True)
¶
__iter__()
¶
Iterate through sections, yielding full section information.
Note: Pydantic BaseModel defines iter for dict-like iteration over fields. We override it here for domain-specific section iteration. The type: ignore is intentional as we're providing a different iteration interface.
__str__()
¶
export_info(source_file=None)
¶
Export serializable state for persistence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_file
|
Optional[Path]
|
Optional path to source file to record in metadata |
None
|
Returns:
| Type | Description |
|---|---|
TextObjectInfo
|
TextObjectInfo instance containing serializable state |
Note
If source_file is provided, it will be resolved to an absolute path.
from_info(info, metadata, num_text)
classmethod
¶
Create TextObject from serialized info and content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
info
|
TextObjectInfo
|
Serialized TextObjectInfo with section and language data |
required |
metadata
|
Metadata
|
Base metadata to merge into the object |
required |
num_text
|
NumberedText
|
NumberedText instance with the actual content |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with combined info and metadata |
Example
info = TextObjectInfo.model_validate_json(json_str) text = read_str_from_file(info.source_file) obj = TextObject.from_info(info, Metadata(), NumberedText(text))
from_response(response, existing_metadata, num_text)
classmethod
¶
Create TextObject from AI response with section boundaries and metadata.
Extracts sections, language, and metadata from an AI-generated response (e.g., from sectioning or translation processing).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
response
|
AIResponse
|
AIResponse model containing sections and metadata |
required |
existing_metadata
|
Metadata
|
Base metadata to start with |
required |
num_text
|
NumberedText
|
NumberedText instance with the text content |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject with sections and merged metadata from AI response |
Note
Merges metadata in order: existing → ai_summary/concepts/context → document_metadata
from_section_file(section_file, source=None)
classmethod
¶
Create TextObject from a section info file, loading content from source_file. Metadata is extracted from the source_file or from content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
section_file
|
Path
|
Path to JSON file containing TextObjectInfo |
required |
source
|
Optional[str]
|
Optional source string in case no source file is found. |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If source_file is missing from section info |
FileNotFoundError
|
If either section_file or source_file not found |
from_str(text, language=None, sections=None, metadata=None)
classmethod
¶
Create a TextObject from a string, extracting any frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Input text string, potentially containing frontmatter |
required |
language
|
Optional[str]
|
ISO language code |
None
|
sections
|
Optional[List[SectionObject]]
|
List of section objects |
None
|
metadata
|
Optional[Metadata]
|
Optional base metadata to merge with frontmatter |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with combined metadata |
from_text_file(file)
classmethod
¶
Create TextObject from a text file.
Reads the file and extracts any frontmatter metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to text file |
required |
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance with extracted content and metadata |
Example
obj = TextObject.from_text_file(Path("document.txt"))
get_section_content(index)
¶
Get content for a section by index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
index
|
int
|
Zero-based section index |
required |
Returns:
| Type | Description |
|---|---|
str
|
Section content as string |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no sections are available |
IndexError
|
If index is out of range |
Example
obj = TextObject(num_text, sections=[...]) content = obj.get_section_content(0) # First section
load(path, config=None)
classmethod
¶
Load TextObject from file with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Input file path |
required |
config
|
Optional[LoadConfig]
|
Optional loading configuration. If not provided, loads directly from text file. |
None
|
Returns:
| Type | Description |
|---|---|
TextObject
|
TextObject instance |
Usage
Load from text file with frontmatter¶
obj = TextObject.load(Path("content.txt"))
Load state from JSON with source content string¶
config = LoadConfig( format=StorageFormat.JSON, source_str="Text content..." ) obj = TextObject.load(Path("state.json"), config)
Load state from JSON with source content file¶
config = LoadConfig( format=StorageFormat.JSON, source_file=Path("content.txt") ) obj = TextObject.load(Path("state.json"), config)
merge_metadata(new_metadata, strategy=MergeStrategy.PRESERVE, source=None)
¶
Merge metadata with explicit strategy and optional provenance tracking.
merge_metadata_legacy(new_metadata, override=False)
¶
Deprecated legacy merge interface that maps to MergeStrategy.
save(path, output_format=StorageFormat.TEXT, source_file=None, pretty=True)
¶
Save TextObject to file in specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Output file path |
required |
output_format
|
StorageFormatType
|
"text" for full content+metadata or "json" for serialized state |
TEXT
|
source_file
|
Optional[Path]
|
Optional source file to record in metadata |
None
|
pretty
|
bool
|
For JSON output, whether to pretty print |
True
|
transform(data_str=None, language=None, metadata=None, process_metadata=None, sections=None)
¶
Return a new TextObject with requested changes; does not mutate the original.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_str
|
Optional[str]
|
Optional new text content |
None
|
language
|
Optional[str]
|
Optional new language code |
None
|
metadata
|
Optional[Metadata]
|
Metadata to merge into the new object |
None
|
process_metadata
|
Optional[ProcessMetadata]
|
Identifier/details for the process performed |
None
|
sections
|
Optional[List[SectionObject]]
|
Optional replacement list of sections |
None
|
update_metadata(**kwargs)
¶
Update metadata with new key-value pairs using PRESERVE strategy.
Convenience method for adding metadata without overriding existing keys.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Key-value pairs to add to metadata |
{}
|
Example
obj.update_metadata(author="Thich Nhat Hanh", year=2020)
validate_sections(raise_on_error=True)
¶
Validate section integrity using NumberedText boundary checks.
TextObjectInfo
¶
Bases: BaseModel
Serializable information about a text and its sections.
audio_processing
¶
__all__ = ['ArtifactRetention', 'DiarizationConfig', 'MultilingualTranscriptionRequest', 'MultilingualTranscriptionService', 'TranscriptionProvider']
module-attribute
¶
__getattr__(name)
¶
Lazily expose audio processing exports to avoid heavy import side effects.
audio_slice_utils
¶
diarization
¶
__all__ = ['DiarizationProcessor', 'diarize', 'diarize_to_file', 'DiarizationParams', 'PyannoteClient', 'PyannoteConfig']
module-attribute
¶
DiarizationParams
¶
Bases: BaseModel
Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.
confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.')
class-attribute
instance-attribute
¶
model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid')
class-attribute
instance-attribute
¶
num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.")
class-attribute
instance-attribute
¶
webhook = Field(default=None, description='Webhook URL for job status callbacks.')
class-attribute
instance-attribute
¶
to_api_dict()
¶
Return payload dict using API field names (camelCase) and excluding Nones.
DiarizationProcessor
¶
Orchestrator over a DiarizationService.
This layer delegates to the service for generation and handles persistence.
audio_file_path = audio_file_path.resolve()
instance-attribute
¶
output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json'
instance-attribute
¶
params = params
instance-attribute
¶
service = service or PyannoteService(default_client)
instance-attribute
¶
writer = writer or FileResultWriter()
instance-attribute
¶
__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
¶
export(response=None)
¶
Write the provided or last response to self.output_path.
generate(*, wait_until_complete=True)
¶
One-shot convenience: delegate to the service and cache the response.
get_response(job=None, *, wait_until_complete=False)
¶
Fetch current/final response for a job, caching the last response.
start()
¶
Start a job and cache its job_id.
PyannoteClient
¶
Client for interacting with the pyannote.ai speaker diarization API.
api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN')
instance-attribute
¶
config = config or PyannoteConfig()
instance-attribute
¶
headers = {'Authorization': f'Bearer {self.api_key}'}
instance-attribute
¶
network_timeout = self.config.network_timeout
instance-attribute
¶
polling_config = self.config.polling_config
instance-attribute
¶
upload_max_retries = self.config.upload_max_retries
instance-attribute
¶
upload_timeout = self.config.upload_timeout
instance-attribute
¶
JobPoller
¶
Generic job polling helper for long-running async jobs.
job_id = job_id
instance-attribute
¶last_status = None
instance-attribute
¶poll_count = 0
instance-attribute
¶polling_config = polling_config
instance-attribute
¶start_time = time.time()
instance-attribute
¶status_fn = status_fn
instance-attribute
¶__init__(status_fn, job_id, polling_config)
¶run()
¶
__init__(api_key=None, config=None)
¶
Initialize with API key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
api_key
|
Optional[str]
|
Pyannote.ai API key (defaults to environment variable) |
None
|
check_job_status(job_id)
¶
Check the status of a diarization job.
Returns a typed transport model (JobStatusResponse) or None on failure.
poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)
¶
Poll until the job reaches a terminal state or a client-side stop condition, and
return a unified JobStatusResponse (JSR) that includes both the server payload
and polling context via outcome, polls, and elapsed_s.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
Remote job identifier to poll. |
required |
estimated_duration
|
Optional[float]
|
Optional hint; currently unused (reserved for adaptive backoff). |
None
|
timeout
|
Optional[float]
|
Optional hard timeout in seconds for this poll call. If provided, it overrides
the client's default polling timeout. Ignored if |
None
|
wait_until_complete
|
Optional[bool]
|
If True, ignore timeout and poll indefinitely (subject to process lifetime). |
False
|
Returns:
| Name | Type | Description |
|---|---|---|
JobStatusResponse |
JobStatusResponse
|
unified transport + polling-context result. |
start_diarization(media_id, params=None)
¶
Start diarization job with pyannote.ai API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
media_id
|
str
|
The media ID from upload_audio |
required |
params
|
Optional[DiarizationParams]
|
Optional parameters for diarization |
None
|
Returns:
| Type | Description |
|---|---|
Optional[str]
|
Optional[str]: The job ID if started successfully, None otherwise |
upload_audio(file_path)
¶
Upload audio file with retry logic for network robustness.
Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
PyannoteConfig
¶
Bases: BaseSettings
Configuration constants for Pyannote API.
base_url = 'https://api.pyannote.ai/v1'
class-attribute
instance-attribute
¶
diarize_endpoint
property
¶
job_status_endpoint
property
¶
media_content_type = 'audio/mpeg'
class-attribute
instance-attribute
¶
media_input_endpoint
property
¶
media_prefix = 'media://diarization-'
class-attribute
instance-attribute
¶
model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore')
class-attribute
instance-attribute
¶
network_timeout = 3
class-attribute
instance-attribute
¶
polling_config = PollingConfig()
class-attribute
instance-attribute
¶
upload_max_retries = 3
class-attribute
instance-attribute
¶
upload_timeout = 300
class-attribute
instance-attribute
¶
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
¶
One-shot convenience to generate a result and (optionally) write it.
This returns the DiarizationResponse. Writing is left to callers or
diarize_to_file below.
diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
¶
Convenience helper: generate then export to JSON if successful; returns response
audio
¶
__all__ = ['AudioHandler', 'AudioHandlerConfig']
module-attribute
¶
AudioHandler
¶
Isolates audio operations and external dependencies (pydub, ffmpeg).
base_audio
instance-attribute
¶config = config
instance-attribute
¶input_format = None
instance-attribute
¶output_format = config.output_format
instance-attribute
¶__init__(config=AudioHandlerConfig())
¶build_audio_chunk(chunk, audio_file)
¶builds and sets the internal chunk.audio to be the new AudioChunk
export_audio_bytes(audio_segment, format_str=None)
¶Export AudioSegment to BytesIO for services/modules that require file-like objects.
AudioHandlerConfig
¶
Bases: BaseSettings
Configuration settings for the AudioHandler. All audio time units are milliseconds (int)
SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'})
class-attribute
instance-attribute
¶max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).')
class-attribute
instance-attribute
¶output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').")
class-attribute
instance-attribute
¶silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.')
class-attribute
instance-attribute
¶temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).')
class-attribute
instance-attribute
¶
config
¶
AudioHandlerConfig
¶
Bases: BaseSettings
Configuration settings for the AudioHandler. All audio time units are milliseconds (int)
SUPPORTED_FORMATS = frozenset({'mp3', 'wav', 'flac', 'ogg', 'm4a', 'mp4'})
class-attribute
instance-attribute
¶max_segment_length = Field(default=None, description='Maximum allowed segment length (in milliseconds).')
class-attribute
instance-attribute
¶output_format = Field(default=None, description="Audio output format used when exporting segments (e.g., 'wav', 'mp3').")
class-attribute
instance-attribute
¶silence_all_intervals = Field(default=False, description='If True, replace every non-zero interval between consecutive diarization segments with silence of length spacing_time.')
class-attribute
instance-attribute
¶temp_storage_dir = Field(default=None, description='Optional directory path for storing temporary audio files (currently unused).')
class-attribute
instance-attribute
¶
handler
¶
Audio handler utilities for slicing and assembling audio around diarization chunks. Designed for pipeline-friendly, single-responsibility methods so that higher-level services can remain agnostic of the underlying audio library.
This implementation purposely keeps logic minimal for testing.
logger = get_child_logger(__name__)
module-attribute
¶AudioHandler
¶Isolates audio operations and external dependencies (pydub, ffmpeg).
base_audio
instance-attribute
¶config = config
instance-attribute
¶input_format = None
instance-attribute
¶output_format = config.output_format
instance-attribute
¶__init__(config=AudioHandlerConfig())
¶build_audio_chunk(chunk, audio_file)
¶builds and sets the internal chunk.audio to be the new AudioChunk
export_audio_bytes(audio_segment, format_str=None)
¶Export AudioSegment to BytesIO for services/modules that require file-like objects.
chunker
¶
logger = get_child_logger(__name__)
module-attribute
¶
DiarizationChunker
¶
Class for chunking diarization results into processing units based on configurable duration targets.
config = ChunkConfig()
instance-attribute
¶__init__(**config_options)
¶Initialize chunker with additional config_options.
extract_contiguous_chunks(segments)
¶Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segments
|
List[DiarizedSegment]
|
List of speaker segments from diarization |
required |
Returns:
| Type | Description |
|---|---|
List[DiarizationChunk]
|
List[Chunk]: Flat list of contiguous chunks |
config
¶
ChunkConfig
¶
Bases: BaseSettings
Configuration for chunking
gap_spacing_time = 1000
class-attribute
instance-attribute
¶gap_threshold = 4000
class-attribute
instance-attribute
¶min_duration = 30000
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='CHUNK_', extra='ignore')
class-attribute
instance-attribute
¶target_duration = 300000
class-attribute
instance-attribute
¶
DiarizationConfig
¶
Bases: BaseSettings
chunk = ChunkConfig()
class-attribute
instance-attribute
¶language = LanguageConfig()
class-attribute
instance-attribute
¶mapping = MappingPolicy()
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='DIARIZATION_', extra='ignore')
class-attribute
instance-attribute
¶speaker = SpeakerConfig()
class-attribute
instance-attribute
¶
LanguageConfig
¶
Bases: BaseSettings
default_language = 'en'
class-attribute
instance-attribute
¶export_format = 'wav'
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='LANGUAGE_', extra='ignore')
class-attribute
instance-attribute
¶probe_time = 10000
class-attribute
instance-attribute
¶
MappingPolicy
¶
Bases: BaseSettings
Mapping policy for transport→domain shaping.
TODO (future parameters to consider): - min_segment_ms: int # drop micro-segments below threshold - merge_gap_ms: int # merge adjacent same-speaker if gap ≤ this - round_ms_to: int # quantize boundaries (e.g., 10ms) - confidence_floor: float | None # filter out low-confidence segments - suppress_unlabeled: bool # drop segments missing speaker id - attach_raw_payload: bool # persist raw API payload in metadata - version: int # policy versioning for reproducibility
default_speaker_label = 'SPEAKER_00'
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='MAPPING_', extra='ignore')
class-attribute
instance-attribute
¶single_speaker = False
class-attribute
instance-attribute
¶
PollingConfig
¶
Bases: BaseSettings
Configuration constants for a generic polling class used to for Pyannote API polling.
exp_base = 2
class-attribute
instance-attribute
¶initial_poll_time = 7
class-attribute
instance-attribute
¶max_interval = 30
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_POLL_', extra='ignore')
class-attribute
instance-attribute
¶polling_interval = 15
class-attribute
instance-attribute
¶polling_timeout = 300.0
class-attribute
instance-attribute
¶
PyannoteConfig
¶
Bases: BaseSettings
Configuration constants for Pyannote API.
base_url = 'https://api.pyannote.ai/v1'
class-attribute
instance-attribute
¶diarize_endpoint
property
¶job_status_endpoint
property
¶media_content_type = 'audio/mpeg'
class-attribute
instance-attribute
¶media_input_endpoint
property
¶media_prefix = 'media://diarization-'
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='PYANNOTE_', extra='ignore')
class-attribute
instance-attribute
¶network_timeout = 3
class-attribute
instance-attribute
¶polling_config = PollingConfig()
class-attribute
instance-attribute
¶upload_max_retries = 3
class-attribute
instance-attribute
¶upload_timeout = 300
class-attribute
instance-attribute
¶
SpeakerConfig
¶
Bases: BaseSettings
Configuration settings for speaker block generation.
default_speaker_label = 'SPEAKER_00'
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', case_sensitive=False, env_prefix='SPEAKER_', extra='ignore')
class-attribute
instance-attribute
¶same_speaker_gap_threshold = TimeMs.from_seconds(2)
class-attribute
instance-attribute
¶single_speaker = False
class-attribute
instance-attribute
¶
models
¶
logger = get_child_logger(__name__)
module-attribute
¶
AudioChunk
¶
AugDiarizedSegment
¶
Bases: DiarizedSegment
DiarizedSegment with additional chunking/processing metadata.
This class extends DiarizationSegment and adds fields that are only set during
chunk accumulation or downstream processing.
Attributes:
| Name | Type | Description |
|---|---|---|
gap_before |
bool
|
Indicates if there is a gap greater than the configured threshold before this segment. Set only during chunk accumulation. |
spacing_time |
TimeMs
|
The spacing (in ms) between this and the previous segment, possibly adjusted if there is a gap before. Set only during chunk accumulation. |
audio |
TNHAudioSegment
|
The audio data for this segment, sliced from the original audio. |
Notes
- The
audiofield is a slice of the original audio corresponding to this segment. - All time values (start, end, duration) are relative to the original audio.
- When slicing or probing the
audiofield, use times relative to 0 (i.e., 0 to duration). - For language probing or any operation on
audio, always use 0 as the start anddurationas the end.
audio
instance-attribute
¶gap_before_new
instance-attribute
¶relative_end
property
¶End time relative to the segment audio (duration of segment).
relative_start
property
¶Start time relative to the segment audio (always 0).
spacing_time_new
instance-attribute
¶from_segment(segment, gap_before=None, spacing_time_new=None, audio=None, **kwargs)
classmethod
¶Create an AugDiarizedSegment from a DiarizedSegment, with optional new fields. Args: segment (DiarizedSegment): The base segment to copy fields from. gap_before_new (bool, optional): Value for gap_before_new. Defaults to False. spacing_time_new (TimeMs, optional): Value for spacing_time_new. Defaults to None. audio (AudioSegment, optional): Audio data for this segment. Defaults to None. **kwargs: Any additional fields to override. Returns: AugDiarizedSegment: The new augmented segment.
DiarizationChunk
¶
Bases: BaseModel
Represents a chunk of segments to be processed together.
accumulated_time = 0
class-attribute
instance-attribute
¶audio = None
class-attribute
instance-attribute
¶end_time
instance-attribute
¶segments
instance-attribute
¶start_time
instance-attribute
¶total_duration
property
¶Get chunk duration in milliseconds.
total_duration_sec
property
¶total_duration_time
property
¶
DiarizedSegment
¶
Bases: BaseModel
Represents a diarized audio segment for a single speaker.
Attributes:
| Name | Type | Description |
|---|---|---|
speaker |
str
|
The speaker label for this segment. |
start |
TimeMs
|
Start time in milliseconds. |
end |
TimeMs
|
End time in milliseconds. |
audio_map_start |
Optional[int]
|
Location in the audio output file, if mapped. |
gap_before |
Optional[bool]
|
Indicates if there is a gap greater than the configured threshold
before this segment. This attribute is set exclusively by |
spacing_time |
Optional[int]
|
The spacing (in ms) between this and the previous segment,
possibly adjusted if there is a gap before. This attribute is also set exclusively by
|
Notes
gap_beforeandspacing_timeare not set during initial diarization, but are assigned only when the segment is accumulated into a chunk for downstream audio handling.- These fields should be considered write-once and must not be mutated elsewhere.
audio_map_start
instance-attribute
¶duration
property
¶Get segment duration in milliseconds.
duration_sec
property
¶end
instance-attribute
¶end_time
property
¶gap_before
instance-attribute
¶mapped_end
property
¶mapped_start
property
¶Downstream registry field set by the audio handler
spacing_time
instance-attribute
¶speaker
instance-attribute
¶start
instance-attribute
¶start_time
property
¶normalize()
¶Normalize the duration of the segment to be nonzero and validate start/end values.
SpeakerBlock
¶
Bases: BaseModel
A block of contiguous or near-contiguous segments spoken by the same speaker.
Used as a higher-level abstraction over diarization segments to simplify chunking strategies (e.g., language-aware sampling, re-segmentation).
duration
property
¶duration_sec
property
¶end
property
¶segment_count
property
¶segments
instance-attribute
¶speaker
instance-attribute
¶start
property
¶from_dict(data)
classmethod
¶Create a SpeakerBlock from a dictionary (output of to_dict). Args: data (dict): Dictionary with keys matching SpeakerBlock fields. Returns: SpeakerBlock: Deserialized SpeakerBlock instance. Raises: ValueError, TypeError: If validation fails.
to_dict()
¶custom serializer for SpeakerBlock with validation.
protocols
¶
Interfaces shared by diarization strategy classes.
AudioFetcher
¶
ChunkingStrategy
¶
DiarizationService
¶
Bases: Protocol
Protocol for any diarization service.
generate(audio_path, params=None, *, wait_until_complete=True)
¶One-shot convenience: start + (optionally) wait + fetch + map.
Implementations may optimize this path; default behavior can be start() followed by get_response().
get_response(job_id, *, wait_until_complete=False)
¶Return the current state or final result as a DiarizationResponse.
When wait_until_complete is True, the service blocks until a terminal
state (succeeded/failed/timeout) and returns that envelope.
start(audio_path, params=None)
¶Start a diarization job and return an opaque job_id.
pyannote_adapter
¶
logger = get_child_logger(__name__)
module-attribute
¶
PyannoteAdapter
¶
Bases: SegmentAdapter
pyannote_client
¶
pyannote_client.py
Client interface for interacting with the pyannote.ai speaker diarization API.
This module provides a robust, object-oriented client for uploading audio files, starting diarization jobs, polling for job completion, and retrieving results from the pyannote.ai API. It includes retry logic, configurable timeouts, and support for advanced diarization parameters.
Typical usage
client = PyannoteClient(api_key="your_api_key") media_id = client.upload_audio(Path("audio.mp3")) job_id = client.start_diarization(media_id) result = client.poll_job_until_complete(job_id)
JOB_ID_FIELD = 'jobId'
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
APIKeyError
¶
Bases: Exception
Raised when API key is missing or invalid.
PyannoteClient
¶
Client for interacting with the pyannote.ai speaker diarization API.
api_key = api_key or os.getenv('PYANNOTEAI_API_TOKEN')
instance-attribute
¶config = config or PyannoteConfig()
instance-attribute
¶headers = {'Authorization': f'Bearer {self.api_key}'}
instance-attribute
¶network_timeout = self.config.network_timeout
instance-attribute
¶polling_config = self.config.polling_config
instance-attribute
¶upload_max_retries = self.config.upload_max_retries
instance-attribute
¶upload_timeout = self.config.upload_timeout
instance-attribute
¶JobPoller
¶Generic job polling helper for long-running async jobs.
job_id = job_id
instance-attribute
¶last_status = None
instance-attribute
¶poll_count = 0
instance-attribute
¶polling_config = polling_config
instance-attribute
¶start_time = time.time()
instance-attribute
¶status_fn = status_fn
instance-attribute
¶__init__(status_fn, job_id, polling_config)
¶run()
¶__init__(api_key=None, config=None)
¶Initialize with API key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
api_key
|
Optional[str]
|
Pyannote.ai API key (defaults to environment variable) |
None
|
check_job_status(job_id)
¶Check the status of a diarization job.
Returns a typed transport model (JobStatusResponse) or None on failure.
poll_job_until_complete(job_id, estimated_duration=None, timeout=None, wait_until_complete=False)
¶Poll until the job reaches a terminal state or a client-side stop condition, and
return a unified JobStatusResponse (JSR) that includes both the server payload
and polling context via outcome, polls, and elapsed_s.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
Remote job identifier to poll. |
required |
estimated_duration
|
Optional[float]
|
Optional hint; currently unused (reserved for adaptive backoff). |
None
|
timeout
|
Optional[float]
|
Optional hard timeout in seconds for this poll call. If provided, it overrides
the client's default polling timeout. Ignored if |
None
|
wait_until_complete
|
Optional[bool]
|
If True, ignore timeout and poll indefinitely (subject to process lifetime). |
False
|
Returns:
| Name | Type | Description |
|---|---|---|
JobStatusResponse |
JobStatusResponse
|
unified transport + polling-context result. |
start_diarization(media_id, params=None)
¶Start diarization job with pyannote.ai API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
media_id
|
str
|
The media ID from upload_audio |
required |
params
|
Optional[DiarizationParams]
|
Optional parameters for diarization |
None
|
Returns:
| Type | Description |
|---|---|
Optional[str]
|
Optional[str]: The job ID if started successfully, None otherwise |
upload_audio(file_path)
¶Upload audio file with retry logic for network robustness.
Retries on network errors with exponential backoff. Fails fast on permanent errors (auth, file not found, etc.).
pyannote_diarize
¶
PYANNOTE_FILE_STR = '_pyannote_diarization'
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
DiarizationProcessor
¶
Orchestrator over a DiarizationService.
This layer delegates to the service for generation and handles persistence.
audio_file_path = audio_file_path.resolve()
instance-attribute
¶output_path = output_path.resolve() if output_path is not None else self.audio_file_path.parent / f'{self.audio_file_path.stem}{PYANNOTE_FILE_STR}.json'
instance-attribute
¶params = params
instance-attribute
¶service = service or PyannoteService(default_client)
instance-attribute
¶writer = writer or FileResultWriter()
instance-attribute
¶__init__(audio_file_path, output_path=None, *, service=None, params=None, api_key=None, writer=None)
¶export(response=None)
¶Write the provided or last response to self.output_path.
generate(*, wait_until_complete=True)
¶One-shot convenience: delegate to the service and cache the response.
get_response(job=None, *, wait_until_complete=False)
¶Fetch current/final response for a job, caching the last response.
start()
¶Start a job and cache its job_id.
PyannoteService
¶
Bases: DiarizationService
Concrete implementation of DiarizationService for pyannote.ai.
Bridges transport (PyannoteClient) and mapping (PyannoteAdapter) while exposing a clean domain-facing API.
adapter = adapter or PyannoteAdapter()
instance-attribute
¶client = client or PyannoteClient()
instance-attribute
¶__init__(client=None, adapter=None)
¶generate(audio_path, params=None, *, wait_until_complete=True)
¶get_response(job_id, *, wait_until_complete=False)
¶start(audio_path, params=None)
¶
diarize(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
¶
One-shot convenience to generate a result and (optionally) write it.
This returns the DiarizationResponse. Writing is left to callers or
diarize_to_file below.
diarize_to_file(audio_file_path, output_path=None, *, params=None, service=None, api_key=None, wait_until_complete=True)
¶
Convenience helper: generate then export to JSON if successful; returns response
schemas
¶
DiarizationResponse = Annotated[Union[DiarizationSucceeded, DiarizationFailed, DiarizationPending, DiarizationRunning], Field(discriminator='status')]
module-attribute
¶
__all__ = ['PollOutcome', 'DiarizationParams', 'StartDiarizationResponse', 'JobStatus', 'JobStatusResponse', 'ErrorCode', 'ErrorInfo', 'DiarizationResult', 'DiarizationSucceeded', 'DiarizationFailed', 'DiarizationPending', 'DiarizationRunning', 'DiarizationResponse']
module-attribute
¶
DiarizationParams
¶
Bases: BaseModel
Per-request diarization options; maps to pyannote API payload. Use .to_api_dict() to emit API field names.
confidence = Field(default=None, ge=0.0, le=1.0, description='Confidence threshold for segments.')
class-attribute
instance-attribute
¶model_config = ConfigDict(frozen=True, populate_by_name=True, extra='forbid')
class-attribute
instance-attribute
¶num_speakers = Field(default=None, alias='numSpeakers', description="Fixed number of speakers or 'auto' for detection.")
class-attribute
instance-attribute
¶webhook = Field(default=None, description='Webhook URL for job status callbacks.')
class-attribute
instance-attribute
¶to_api_dict()
¶Return payload dict using API field names (camelCase) and excluding Nones.
DiarizationResult
¶
Bases: BaseModel
Domain-level diarization payload used by the rest of the system.
NOTE: segments is intentionally typed as list[Any] so that it can
hold your project’s DiarizedSegment instances from models.py without
creating an import cycle. You can tighten this typing later to
list[DiarizedSegment] and import under TYPE_CHECKING if desired.
ErrorCode
¶
Bases: str, Enum
Client- and adapter-level error taxonomy (not server statuses).
API_ERROR = 'api_error'
class-attribute
instance-attribute
¶BAD_REQUEST = 'bad_request'
class-attribute
instance-attribute
¶CANCELLED = 'cancelled'
class-attribute
instance-attribute
¶PARSE_ERROR = 'parse_error'
class-attribute
instance-attribute
¶TIMEOUT = 'timeout'
class-attribute
instance-attribute
¶TRANSIENT = 'transient'
class-attribute
instance-attribute
¶UNKNOWN = 'unknown'
class-attribute
instance-attribute
¶
ErrorInfo
¶
JobHandle
dataclass
¶
JobStatus
¶
Bases: str, Enum
JobStatusResponse
¶
Bases: BaseModel
Job Status Result (JSR): unified transport payload + client polling context. Combines transport-level fields with client-side polling metadata.
Semantics:
- outcome describes how polling concluded (terminal success/failure, timeout, network error, etc.).
- status is the last known server job status (SUCCEEDED, FAILED, RUNNING, PENDING)
- server_error_msg and payload mirror the remote payload when present.
- polls and elapsed_s report client polling metrics.
elapsed_s = 0.0
class-attribute
instance-attribute
¶job_id = Field(alias='jobId')
class-attribute
instance-attribute
¶model_config = ConfigDict(frozen=True, extra='ignore', populate_by_name=True)
class-attribute
instance-attribute
¶outcome = PollOutcome.ERROR
class-attribute
instance-attribute
¶payload = Field(default=None, alias='output')
class-attribute
instance-attribute
¶polls = 0
class-attribute
instance-attribute
¶server_error_msg = Field(default=None, alias='error')
class-attribute
instance-attribute
¶status = None
class-attribute
instance-attribute
¶normalize_created_status(value)
classmethod
¶Normalize pyannote pre-running status to the existing domain contract.
PollOutcome
¶
Bases: str, Enum
ERROR = 'error'
class-attribute
instance-attribute
¶FAILED = 'failed'
class-attribute
instance-attribute
¶INTERRUPTED = 'interrupted'
class-attribute
instance-attribute
¶NETWORK_ERROR = 'network_error'
class-attribute
instance-attribute
¶SUCCEEDED = 'succeeded'
class-attribute
instance-attribute
¶TIMEOUT = 'timeout'
class-attribute
instance-attribute
¶
strategies
¶
__all__ = ['LanguageDetector', 'LanguageProbe', 'WhisperLanguageDetector', 'group_speaker_blocks', 'TimeGapChunker']
module-attribute
¶
LanguageDetector
¶
Bases: Protocol
Abstract language detector (e.g., fastText, Whisper-lang).
detect(audio, format_str)
¶
LanguageProbe
¶
detector = detector
instance-attribute
¶export_format = config.language.export_format
instance-attribute
¶probe_time = config.language.probe_time
instance-attribute
¶__init__(config, detector)
¶segment_language(aug_segment)
¶Get segment ISO-639 language code from an Augmented Diarize Segment which contains audio.
The probe window is always relative to the segment audio (0=start, duration=end).
TimeGapChunker
¶
Bases: ChunkingStrategy
Chunker that ignores speaker/language and uses only time-gap logic.
WhisperLanguageDetector
¶
group_speaker_blocks(segments, config=DiarizationConfig())
¶
Group contiguous or near-contiguous segments by speaker identity.
Segments are grouped into SpeakerBlocks when the speaker remains the same
and the gap between consecutive segments is less than the configured threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segments
|
List[DiarizedSegment]
|
A list of diarization segments (must be sorted by start time). |
required |
config
|
DiarizationConfig
|
Configuration containing the allowed gap between segments. |
DiarizationConfig()
|
Returns:
| Type | Description |
|---|---|
List[SpeakerBlock]
|
A list of SpeakerBlock objects representing grouped speaker runs. |
language_based
¶
LanguageChunker – chunking informed by speaker blocks + language probing.
logger = get_child_logger(__name__)
module-attribute
¶LanguageChunker
¶
Bases: ChunkingStrategy
Strategy:
- Group contiguous segments into SpeakerBlock objects.
- For each block longer than
language_probe_thresholdprobe language at configurable offsets; if mismatch, split on language change. - Build chunks respecting
target_timesimilar to TimeGapChunker.
language_probe
¶
Lightweight language-detection helpers pluggable into chunkers.
logger = get_child_logger(__name__)
module-attribute
¶LanguageProbe
¶detector = detector
instance-attribute
¶export_format = config.language.export_format
instance-attribute
¶probe_time = config.language.probe_time
instance-attribute
¶__init__(config, detector)
¶segment_language(aug_segment)
¶Get segment ISO-639 language code from an Augmented Diarize Segment which contains audio.
The probe window is always relative to the segment audio (0=start, duration=end).
speaker_blocker
¶
group_speaker_blocks(segments, config=DiarizationConfig())
¶Group contiguous or near-contiguous segments by speaker identity.
Segments are grouped into SpeakerBlocks when the speaker remains the same
and the gap between consecutive segments is less than the configured threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segments
|
List[DiarizedSegment]
|
A list of diarization segments (must be sorted by start time). |
required |
config
|
DiarizationConfig
|
Configuration containing the allowed gap between segments. |
DiarizationConfig()
|
Returns:
| Type | Description |
|---|---|
List[SpeakerBlock]
|
A list of SpeakerBlock objects representing grouped speaker runs. |
time_gap
¶
TimeGapChunker – baseline strategy: split purely on accumulated time.
logger = get_child_logger(__name__)
module-attribute
¶TimeGapChunker
¶
Bases: ChunkingStrategy
Chunker that ignores speaker/language and uses only time-gap logic.
timeline_mapper
¶
Timeline mapping utilities for transforming timestamps from chunk-relative coordinates to original audio coordinates.
This module enables mapping transcript segments back to their original positions in the source audio after processing chunked audio.
logger = get_child_logger(__name__)
module-attribute
¶
TimelineMapper
¶
Maps timestamps from chunk-relative coordinates to original audio coordinates.
config = config or TimelineMapperConfig()
instance-attribute
¶__init__(config=None)
¶Initialize with optional configuration.
remap(timed_text, chunk)
¶Remap all timestamps in a TimedText object from chunk-relative to original audio coordinates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
timed_text
|
TimedText
|
TimedText with chunk-relative timestamps |
required |
chunk
|
DiarizationChunk
|
DiarizationChunk containing mapping information |
required |
Returns:
| Type | Description |
|---|---|
TimedText
|
New TimedText object with remapped timestamps |
TimelineMapperConfig
¶
Bases: BaseModel
Configuration options for timeline mapping.
types
¶
viewer
¶
close_segment_viewer(pid)
¶
Terminate the Streamlit viewer process by PID.
launch_segment_viewer(segments, master_audio_file)
¶
Export segment data to a temporary JSON file and launch Streamlit viewer. Args: segments: List of dicts with diarization info (start, end, speaker). master_audio_file: Path to the master audio file.
load_segments_from_file(path)
¶
main()
¶
language_utils
¶
normalize_language_code(language)
¶
Normalize common language labels to compact language codes.
multilingual_models
¶
ArtifactRetention
¶
LanguageDetectionResult
¶
Bases: BaseModel
Language detection metadata for a routed segment or block.
MergedSubtitleArtifact
¶
Bases: BaseModel
Final user-facing subtitle artifact for the current MVP path.
artifact_retention = ArtifactRetention.MINIMAL
class-attribute
instance-attribute
¶
final_english_srt
instance-attribute
¶
provider
instance-attribute
¶
source_language = None
class-attribute
instance-attribute
¶
source_srt
instance-attribute
¶
target_language = 'en'
class-attribute
instance-attribute
¶
MultilingualTranscriptionRequest
¶
Bases: BaseModel
Top-level request for the multilingual transcription service.
artifact_retention = ArtifactRetention.MINIMAL
class-attribute
instance-attribute
¶
audio_file
instance-attribute
¶
chars_per_caption = Field(default=42, ge=1)
class-attribute
instance-attribute
¶
diarization_segments = None
class-attribute
instance-attribute
¶
metadata_file = None
class-attribute
instance-attribute
¶
model_config = ConfigDict(arbitrary_types_allowed=True)
class-attribute
instance-attribute
¶
provider = TranscriptionProvider.WHISPER
class-attribute
instance-attribute
¶
skip_translation = False
class-attribute
instance-attribute
¶
source_language = None
class-attribute
instance-attribute
¶
target_language = 'en'
class-attribute
instance-attribute
¶
transcription_model = None
class-attribute
instance-attribute
¶
translation_model = None
class-attribute
instance-attribute
¶
translation_pattern = None
class-attribute
instance-attribute
¶
use_speaker_blocks = False
class-attribute
instance-attribute
¶
SegmentTranscriptionRequest
¶
Bases: BaseModel
Segment-level transcription request for provider-neutral orchestration.
audio_file
instance-attribute
¶
audio_file_extension = None
class-attribute
instance-attribute
¶
chars_per_caption = Field(default=42, ge=1)
class-attribute
instance-attribute
¶
model_config = ConfigDict(arbitrary_types_allowed=True)
class-attribute
instance-attribute
¶
provider
instance-attribute
¶
source_language = None
class-attribute
instance-attribute
¶
target_language = 'en'
class-attribute
instance-attribute
¶
transcription_model = None
class-attribute
instance-attribute
¶
SegmentTranscriptionResult
¶
Bases: BaseModel
Segment-level subtitle generation result.
error_message = None
class-attribute
instance-attribute
¶
provider
instance-attribute
¶
segment_start_ms = Field(default=0, ge=0)
class-attribute
instance-attribute
¶
source_language = None
class-attribute
instance-attribute
¶
source_srt
instance-attribute
¶
target_language = 'en'
class-attribute
instance-attribute
¶
translated_srt = None
class-attribute
instance-attribute
¶
translation_skipped = False
class-attribute
instance-attribute
¶
SpeakerLanguageBlock
¶
Bases: BaseModel
A speaker-contiguous block with language metadata.
multilingual_protocols
¶
LanguageSegmentationServiceProtocol
¶
Bases: Protocol
Build language-tagged speaker blocks for downstream routing.
build_blocks(request)
¶
Return speaker blocks for multilingual processing.
SegmentTranscriptionServiceProtocol
¶
Bases: Protocol
Provider-neutral segment transcription contract.
transcribe_segment(request)
¶
Generate source-language subtitles for a segment.
multilingual_service
¶
logger = get_logger(__name__)
module-attribute
¶
MultilingualTranscriptionService
¶
PassThroughSubtitleMergeService
¶
Bases: SubtitleMergeServiceProtocol
Merge segment SRTs into a single artifact.
ProviderBackedSegmentTranscriptionService
¶
Bases: SegmentTranscriptionServiceProtocol
Bridge to the existing provider transcription services.
transcribe_segment(request)
¶
SpeakerBlockLanguageSegmentationService
¶
Bases: LanguageSegmentationServiceProtocol
Build language-tagged speaker blocks from diarized segments.
SrtSegmentTranslationService
¶
Bases: SegmentTranslationServiceProtocol
Translate generated SRT content using the existing SRT translator.
timed_object
¶
__all__ = ['Granularity', 'TimedText', 'TimedTextUnit']
module-attribute
¶
Granularity
¶
TimedText
¶
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
duration
property
¶
Get the total duration in milliseconds.
end_ms
property
¶
Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
¶
segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
¶
start_ms
property
¶
Get the start time of the earliest unit.
units
property
¶
Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
¶
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
¶
Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
__len__()
¶
Return the number of units.
append(unit)
¶
Add a unit to the end.
clear()
¶
Remove all units.
export_text(separator='\n', skip_empty=True, show_speaker=True)
¶
Export the text content of all units as a single string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
separator
|
str
|
String used to separate units (default: newline). |
'\n'
|
skip_empty
|
bool
|
If True, skip units with empty or whitespace-only text. |
True
|
show_speaker
|
bool
|
If True, add speaker info. |
True
|
Returns:
| Type | Description |
|---|---|
str
|
Concatenated text of all units, separated by |
extend(units)
¶
Add multiple units to the end.
filter_by_min_duration(min_duration_ms)
¶
Return a new TimedText object containing only units with a minimum duration.
is_segment_granularity()
¶
Return True if granularity is SEGMENT.
is_word_granularity()
¶
Return True if granularity is WORD.
iter()
¶
Unified iterator over the units of the correct granularity.
iter_segments()
¶
Iterate over segment-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not SEGMENT. |
iter_words()
¶
Iterate over word-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not WORD. |
merge(items)
classmethod
¶
Merge a list of TimedText objects of the same granularity into a single TimedText object.
model_post_init(__context)
¶
After initialization, sort units by start time and normalize durations.
set_all_speakers(speaker)
¶
Set the same speaker for all units.
set_speaker(index, speaker)
¶
Set speaker for a specific unit by index.
shift(offset_ms)
¶
Shift all units by a given offset in milliseconds.
slice(start_ms, end_ms)
¶
Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
sort_by_start()
¶
Sort units by start time.
TimedTextUnit
¶
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
¶
duration_ms
property
¶
Get duration in milliseconds.
duration_sec
property
¶
Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
¶
end_sec
property
¶
Get end time in seconds.
granularity
instance-attribute
¶
index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
¶
speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
¶
start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
¶
start_sec
property
¶
Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
¶
normalize()
¶
Normalize the duration of the segment to be nonzero
overlaps_with(other)
¶
Check if this unit overlaps with another.
set_speaker(speaker)
¶
Set the speaker label.
shift_time(offset_ms)
¶
Create a new TimedUnit with timestamps shifted by offset.
timed_text
¶
Module for handling timed text objects. For example, can be used subtitles like VTT and SRT.
This module provides classes and utilities for parsing, manipulating, and generating timed text objects useful in subtitle and transcript processing. It uses Pydantic for robust data validation and type safety.
Granularity
¶
TimedText
¶
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
duration
property
¶Get the total duration in milliseconds.
end_ms
property
¶Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
¶segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
¶start_ms
property
¶Get the start time of the earliest unit.
units
property
¶Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
¶__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
¶Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
__len__()
¶Return the number of units.
append(unit)
¶Add a unit to the end.
clear()
¶Remove all units.
export_text(separator='\n', skip_empty=True, show_speaker=True)
¶Export the text content of all units as a single string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
separator
|
str
|
String used to separate units (default: newline). |
'\n'
|
skip_empty
|
bool
|
If True, skip units with empty or whitespace-only text. |
True
|
show_speaker
|
bool
|
If True, add speaker info. |
True
|
Returns:
| Type | Description |
|---|---|
str
|
Concatenated text of all units, separated by |
extend(units)
¶Add multiple units to the end.
filter_by_min_duration(min_duration_ms)
¶Return a new TimedText object containing only units with a minimum duration.
is_segment_granularity()
¶Return True if granularity is SEGMENT.
is_word_granularity()
¶Return True if granularity is WORD.
iter()
¶Unified iterator over the units of the correct granularity.
iter_segments()
¶Iterate over segment-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not SEGMENT. |
iter_words()
¶Iterate over word-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not WORD. |
merge(items)
classmethod
¶Merge a list of TimedText objects of the same granularity into a single TimedText object.
model_post_init(__context)
¶After initialization, sort units by start time and normalize durations.
set_all_speakers(speaker)
¶Set the same speaker for all units.
set_speaker(index, speaker)
¶Set speaker for a specific unit by index.
shift(offset_ms)
¶Shift all units by a given offset in milliseconds.
slice(start_ms, end_ms)
¶Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
sort_by_start()
¶Sort units by start time.
TimedTextUnit
¶
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
¶duration_ms
property
¶Get duration in milliseconds.
duration_sec
property
¶Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
¶end_sec
property
¶Get end time in seconds.
granularity
instance-attribute
¶index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
¶speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
¶start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
¶start_sec
property
¶Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
¶normalize()
¶Normalize the duration of the segment to be nonzero
overlaps_with(other)
¶Check if this unit overlaps with another.
set_speaker(speaker)
¶Set the speaker label.
shift_time(offset_ms)
¶Create a new TimedUnit with timestamps shifted by offset.
transcription
¶
__all__ = ['patch_whisper_options', 'DiarizationChunker', 'TimedText', 'TextSegmentBuilder', 'TimedTextUnit', 'Granularity', 'TranscriptionService', 'TranscriptionServiceFactory']
module-attribute
¶
DiarizationChunker
¶
Class for chunking diarization results into processing units based on configurable duration targets.
config = ChunkConfig()
instance-attribute
¶
__init__(**config_options)
¶
Initialize chunker with additional config_options.
extract_contiguous_chunks(segments)
¶
Split diarization segments into contiguous chunks of approximately target_duration, without splitting on speaker changes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segments
|
List[DiarizedSegment]
|
List of speaker segments from diarization |
required |
Returns:
| Type | Description |
|---|---|
List[DiarizationChunk]
|
List[Chunk]: Flat list of contiguous chunks |
Granularity
¶
TextSegmentBuilder
¶
avoid_orphans = avoid_orphans
instance-attribute
¶
current_characters = 0
instance-attribute
¶
current_words = []
instance-attribute
¶
ignore_speaker = ignore_speaker
instance-attribute
¶
max_duration = max_duration_ms
instance-attribute
¶
max_gap_duration = max_gap_duration_ms
instance-attribute
¶
segments = []
instance-attribute
¶
target_characters = target_characters
instance-attribute
¶
__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
¶
build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)
¶
Build or rebuild segments from the contents of words.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
target_duration
|
Optional[int]
|
Maximum desired segment duration in milliseconds. |
None
|
target_characters
|
Optional[int]
|
Maximum desired character length of a segment. |
None
|
avoid_orphans
|
Optional[bool]
|
If True, prevent extremely short trailing segments. |
True
|
Note
This is a stub. Concrete algorithms will be implemented later.
Raises:
| Type | Description |
|---|---|
NotImplementedError
|
Always, until implemented. |
create_segments(timed_text)
¶
TimedText
¶
Bases: BaseModel
Represents a collection of timed text units of a single granularity.
Only one of segments or words is populated, determined by granularity.
All units must match the declared granularity.
Notes
- Start times must be non-decreasing (overlaps allowed for multiple speakers).
- Negative start_ms or end_ms values are not allowed.
- Durations must be strictly positive (>0 ms).
- Mixed granularity is strictly prohibited.
duration
property
¶
Get the total duration in milliseconds.
end_ms
property
¶
Get the end time of the latest unit.
granularity = Field(..., description='Granularity type for all units.')
class-attribute
instance-attribute
¶
segments = Field(default_factory=list, description='Phrase-level timed units')
class-attribute
instance-attribute
¶
start_ms
property
¶
Get the start time of the earliest unit.
units
property
¶
Return the list of units matching the granularity.
words = Field(default_factory=list, description='Word-level timed units')
class-attribute
instance-attribute
¶
__init__(*, granularity=None, segments=None, words=None, units=None, **kwargs)
¶
Custom initializer for TimedText.
If units is provided, granularity is inferred from the first unit unless explicitly set.
If only segments or words is provided, granularity is set accordingly.
If all are empty, granularity must be provided.
__len__()
¶
Return the number of units.
append(unit)
¶
Add a unit to the end.
clear()
¶
Remove all units.
export_text(separator='\n', skip_empty=True, show_speaker=True)
¶
Export the text content of all units as a single string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
separator
|
str
|
String used to separate units (default: newline). |
'\n'
|
skip_empty
|
bool
|
If True, skip units with empty or whitespace-only text. |
True
|
show_speaker
|
bool
|
If True, add speaker info. |
True
|
Returns:
| Type | Description |
|---|---|
str
|
Concatenated text of all units, separated by |
extend(units)
¶
Add multiple units to the end.
filter_by_min_duration(min_duration_ms)
¶
Return a new TimedText object containing only units with a minimum duration.
is_segment_granularity()
¶
Return True if granularity is SEGMENT.
is_word_granularity()
¶
Return True if granularity is WORD.
iter()
¶
Unified iterator over the units of the correct granularity.
iter_segments()
¶
Iterate over segment-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not SEGMENT. |
iter_words()
¶
Iterate over word-level units.
Raises:
| Type | Description |
|---|---|
ValueError
|
If granularity is not WORD. |
merge(items)
classmethod
¶
Merge a list of TimedText objects of the same granularity into a single TimedText object.
model_post_init(__context)
¶
After initialization, sort units by start time and normalize durations.
set_all_speakers(speaker)
¶
Set the same speaker for all units.
set_speaker(index, speaker)
¶
Set speaker for a specific unit by index.
shift(offset_ms)
¶
Shift all units by a given offset in milliseconds.
slice(start_ms, end_ms)
¶
Return a new TimedText object containing only units within [start_ms, end_ms]. Units must overlap with the interval to be included.
sort_by_start()
¶
Sort units by start time.
TimedTextUnit
¶
Bases: BaseModel
Represents a timed unit with timestamps.
A fundamental building block for subtitle and transcript processing that associates text content with start/end times and optional metadata. Can represent either a segment (phrase/sentence) or a word.
confidence = Field(None, description='Optional confidence score')
class-attribute
instance-attribute
¶
duration_ms
property
¶
Get duration in milliseconds.
duration_sec
property
¶
Get duration in seconds.
end_ms = Field(..., description='End time in milliseconds')
class-attribute
instance-attribute
¶
end_sec
property
¶
Get end time in seconds.
granularity
instance-attribute
¶
index = Field(None, description='Entry index or sequence number')
class-attribute
instance-attribute
¶
speaker = Field(None, description='Speaker identifier if available')
class-attribute
instance-attribute
¶
start_ms = Field(..., description='Start time in milliseconds')
class-attribute
instance-attribute
¶
start_sec
property
¶
Get start time in seconds.
text = Field(..., description='The text content')
class-attribute
instance-attribute
¶
normalize()
¶
Normalize the duration of the segment to be nonzero
overlaps_with(other)
¶
Check if this unit overlaps with another.
set_speaker(speaker)
¶
Set the speaker label.
shift_time(offset_ms)
¶
Create a new TimedUnit with timestamps shifted by offset.
TranscriptionService
¶
Bases: ABC
Abstract base class defining the interface for transcription services.
This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).
get_result(job_id)
abstractmethod
¶
Get results for an existing transcription job.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
ID of the transcription job |
required |
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing transcription results in the same |
TranscriptionResult
|
standardized format as transcribe() |
transcribe(audio_file, options=None)
abstractmethod
¶
Transcribe audio file to text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path to audio file or file-like object |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription |
None
|
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
TranscriptionResult |
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
abstractmethod
¶
Transcribe audio and return result in specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path, file-like object, or URL of audio file |
required |
format_type
|
str
|
Format type (e.g., "srt", "vtt", "text") |
'srt'
|
transcription_options
|
Optional[Dict[str, Any]]
|
Options for transcription |
None
|
format_options
|
Optional[Dict[str, Any]]
|
Format-specific options |
None
|
Returns:
| Type | Description |
|---|---|
str
|
String representation in the requested format |
TranscriptionServiceFactory
¶
Factory for creating transcription service instances.
This factory provides a standard way to create transcription service instances based on the provider name and configuration.
create_service(provider='assemblyai', api_key=None, **kwargs)
classmethod
¶
Create a transcription service instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
provider
|
str
|
Service provider name (e.g., "whisper", "assemblyai") |
'assemblyai'
|
api_key
|
Optional[str]
|
API key for the service |
None
|
**kwargs
|
Any
|
Additional provider-specific configuration |
{}
|
Returns:
| Type | Description |
|---|---|
TranscriptionService
|
TranscriptionService instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the provider is not supported |
ImportError
|
If the provider module cannot be imported |
register_provider(name, provider_class)
classmethod
¶
Register a provider implementation with the factory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Provider name (lowercase) |
required |
provider_class
|
Callable[..., TranscriptionService]
|
Provider implementation class or factory function |
required |
Example
from my_module import MyTranscriptionService TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)
patch_whisper_options(options, file_extension)
¶
Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
options
|
Optional[Dict[str, Any]]
|
Transcription options dictionary (will not be mutated) |
required |
file_extension
|
str
|
File extension string (with or without leading dot) |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, Any]
|
New options dictionary with 'file_extension' set appropriately |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file_extension is not in the allowed list |
assemblyai_service
¶
AssemblyAI implementation of the TranscriptionService interface.
This module provides a complete implementation of the TranscriptionService interface using the AssemblyAI Python SDK, with support for all major features including:
- Transcription with configurable options
- Speaker diarization
- Automatic language detection
- Audio intelligence features
- Subtitle generation
- Regional endpoint support
- Webhook callbacks
The implementation follows a modular design with single-action methods and supports both synchronous and asynchronous usage patterns.
logger = get_child_logger(__name__)
module-attribute
¶
AAIConfig
dataclass
¶
Comprehensive configuration for AssemblyAI transcription service.
This class contains all configurable options for the AssemblyAI API, organized by feature category.
api_key = None
class-attribute
instance-attribute
¶auto_chapters = False
class-attribute
instance-attribute
¶auto_highlights = False
class-attribute
instance-attribute
¶chars_per_caption = 60
class-attribute
instance-attribute
¶content_safety = False
class-attribute
instance-attribute
¶custom_spelling = field(default_factory=dict)
class-attribute
instance-attribute
¶disfluencies = False
class-attribute
instance-attribute
¶dual_channel = False
class-attribute
instance-attribute
¶entity_detection = False
class-attribute
instance-attribute
¶filter_profanity = False
class-attribute
instance-attribute
¶format_text = True
class-attribute
instance-attribute
¶iab_categories = False
class-attribute
instance-attribute
¶language_code = None
class-attribute
instance-attribute
¶language_detection = True
class-attribute
instance-attribute
¶polling_interval = 4
class-attribute
instance-attribute
¶punctuate = True
class-attribute
instance-attribute
¶sentiment_analysis = False
class-attribute
instance-attribute
¶speaker_labels = True
class-attribute
instance-attribute
¶speakers_expected = None
class-attribute
instance-attribute
¶speech_model = SpeechModel.BEST
class-attribute
instance-attribute
¶summarization = False
class-attribute
instance-attribute
¶use_eu_endpoint = False
class-attribute
instance-attribute
¶webhook_auth_header_name = None
class-attribute
instance-attribute
¶webhook_auth_header_value = None
class-attribute
instance-attribute
¶webhook_url = None
class-attribute
instance-attribute
¶word_boost = field(default_factory=list)
class-attribute
instance-attribute
¶__init__(api_key=None, use_eu_endpoint=False, polling_interval=4, speech_model=SpeechModel.BEST, language_code=None, language_detection=True, dual_channel=False, format_text=True, punctuate=True, disfluencies=False, filter_profanity=False, chars_per_caption=60, speaker_labels=True, speakers_expected=None, custom_spelling=dict(), word_boost=list(), auto_chapters=False, auto_highlights=False, entity_detection=False, iab_categories=False, sentiment_analysis=False, summarization=False, content_safety=False, webhook_url=None, webhook_auth_header_name=None, webhook_auth_header_value=None)
¶
AAITranscriptionService
¶
Bases: TranscriptionService
AssemblyAI implementation of the TranscriptionService interface.
Provides comprehensive access to AssemblyAI's transcription services with support for all major features through the official Python SDK.
config = AAIConfig()
instance-attribute
¶format_converter = FormatConverter()
instance-attribute
¶transcriber = aai.Transcriber(config=(self._create_transcription_config(options)))
instance-attribute
¶__init__(api_key=None, options=None)
¶Initialize the AssemblyAI transcription service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
api_key
|
Optional[str]
|
AssemblyAI API key (defaults to ASSEMBLYAI_API_KEY env var) |
None
|
options
|
Optional[Dict[str, Any]]
|
Additional transcription configuration overrides |
None
|
get_result(job_id)
¶Get results for an existing transcription job.
This method blocks until the transcript is retrieved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
ID of the transcription job |
required |
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing transcription results |
get_subtitles(transcript_id, format_type='srt')
¶Get subtitles directly from AssemblyAI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
transcript_id
|
str
|
ID of the transcription job |
required |
format_type
|
str
|
Format type ("srt" or "vtt") |
'srt'
|
Returns:
| Type | Description |
|---|---|
str
|
String representation in the requested format |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the format type is not supported |
standardize_result(transcript)
¶Standardize AssemblyAI transcript to match common format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
transcript
|
Transcript
|
AssemblyAI transcript object |
required |
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Standardized result dictionary |
transcribe(audio_file, options=None)
¶Transcribe audio file to text using AssemblyAI's synchronous SDK approach.
This method handles: - File paths - File-like objects - URLs
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BinaryIO, str]
|
Path, file-like object, or URL of audio file |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription |
None
|
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing standardized transcription results |
transcribe_async(audio_file, options=None)
¶Submit an asynchronous transcription job using AssemblyAI's SDK.
This method submits a transcription job and returns immediately with a transcript ID that can be used to retrieve results later.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BinaryIO, str]
|
Path, file-like object, or URL of audio file |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription |
None
|
Returns:
| Type | Description |
|---|---|
Future[Any]
|
String containing the transcript ID for later retrieval |
Notes
The SDK's submit method returns a Future object, but this method extracts just the transcript ID for simpler handling.
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
¶Transcribe audio and return result in specified format.
Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BinaryIO, str]
|
Path, file-like object, or URL of audio file |
required |
format_type
|
str
|
Format type (e.g., "srt", "vtt", "text") |
'srt'
|
transcription_options
|
Optional[Dict[str, Any]]
|
Options for transcription |
None
|
format_options
|
Optional[Dict[str, Any]]
|
Format-specific options |
None
|
Returns:
| Type | Description |
|---|---|
str
|
String representation in the requested format |
format_converter
¶
tnh_scholar.audio_processing.transcription.format_converter¶
Thin facade that turns raw transcription-service output dictionaries into the formats requested by callers (plain-text, SRT - VTT coming later).
Core heavy lifting now lives in:
TimedText/TimedTextUnit- canonical internal representationSegmentBuilder- word-level -> sentence/segment chunkingSRTProcessor- rendering to.srt
Only one public method remains: meth:
FormatConverter.convert.
logger = get_child_logger(__name__)
module-attribute
¶
FormatConverter
¶
Convert a raw transcription result to text, SRT, or (placeholder) VTT.
The raw result must follow the loose schema
- {"utterances": [...]} -> already speaker-segmented
- {"words": [...]} -> word-level; we chunk via :class:SegmentBuilder
- {"text": "...", "audio_duration_ms": 12345} -> single blob fallback
FormatConverterConfig
¶
Bases: BaseModel
User-tunable knobs for :class:FormatConverter.
Only a handful remain now that the heavy logic moved to SegmentBuilder.
characters_per_entry = 42
class-attribute
instance-attribute
¶include_segment_index = True
class-attribute
instance-attribute
¶include_speaker = True
class-attribute
instance-attribute
¶max_entry_duration_ms = 6000
class-attribute
instance-attribute
¶max_gap_duration_ms = 2000
class-attribute
instance-attribute
¶
patches
¶
patch_file_with_name(file_obj, extension)
¶
Ensures the file-like object has a .name attribute with the correct extension.
patch_whisper_options(options, file_extension)
¶
Patch routine to ensure 'file_extension' is present in transcription options dict. This is a workaround for OpenAI Whisper API, which requires file-like objects to have a filename/extension. Only allows known audio extensions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
options
|
Optional[Dict[str, Any]]
|
Transcription options dictionary (will not be mutated) |
required |
file_extension
|
str
|
File extension string (with or without leading dot) |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, Any]
|
New options dictionary with 'file_extension' set appropriately |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file_extension is not in the allowed list |
srt_processor
¶
SRTConfig
¶
Configuration options for SRT processing.
include_speaker = include_speaker
instance-attribute
¶max_chars_per_line = max_chars_per_line
instance-attribute
¶reindex_entries = reindex_entries
instance-attribute
¶speaker_format = speaker_format
instance-attribute
¶timestamp_format = timestamp_format
instance-attribute
¶use_pysrt = use_pysrt
instance-attribute
¶__init__(include_speaker=False, speaker_format='[{speaker}] {text}', reindex_entries=True, timestamp_format='{:02d}:{:02d}:{:02d},{:03d}', max_chars_per_line=42, use_pysrt=False)
¶Initialize with default settings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
include_speaker
|
bool
|
Whether to include speaker labels in output |
False
|
speaker_format
|
str
|
Format string for speaker attribution |
'[{speaker}] {text}'
|
reindex_entries
|
bool
|
Whether to reindex entries sequentially |
True
|
timestamp_format
|
str
|
Format string for timestamp formatting |
'{:02d}:{:02d}:{:02d},{:03d}'
|
max_chars_per_line
|
int
|
Maximum characters per line before splitting |
42
|
SRTProcessor
¶
Handles parsing and generating SRT format.
Provides functionality to convert between SRT text format and TimedText objects, with various formatting options. Supports both native parsing/generation and pysrt backend.
config = config or SRTConfig()
instance-attribute
¶__init__(config=None)
¶Initialize with optional configuration overrides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
Optional[SRTConfig]
|
Configuration options for SRT processing |
None
|
add_speaker_labels(srt_content, *, speaker=None, speaker_labels=None)
¶Unified entry point for adding speaker labels. (Not implemented yet.)
assign_single_speaker(srt_content, speaker)
¶Assign the same speaker to all segments in the SRT content.
assign_speaker_by_mapping(srt_content, speaker_labels)
¶Assign speakers to segments based on a mapping of speaker to segment indices. (Not implemented yet.)
combine(timed_texts)
¶generate(timed_text, include_speaker=None)
¶Generate SRT content from a TimedText object. Uses internal generator or pysrt depending on configuration.
merge_srts(srt_list)
¶Merge multiple SRT files into a single SRT string.
parse(srt_content)
¶Parse SRT content into a new TimedText object. Uses internal parser or pysrt depending on configuration.
shift_timestamps(timed_text, offset_ms)
¶
text_segment_builder
¶
SegmentBuilder for creating phrase-level segments from word-level TimedText.
This module builds higher-level segments from a TimedText object containing word-level units, based on configurable criteria like duration, character count, punctuation, pauses, and speaker changes.
COMMON_ABBREVIATIONS = frozenset({'adj.', 'adm.', 'adv.', 'al.', 'anon.', 'apr.', 'arc.', 'aug.', 'ave.', 'brig.', 'bros.', 'capt.', 'cmdr.', 'col.', 'comdr.', 'con.', 'corp.', 'cpl.', 'dr.', 'drs.', 'ed.', 'enc.', 'etc.', 'ex.', 'feb.', 'gen.', 'gov.', 'hon.', 'hosp.', 'hr.', 'inc.', 'jan.', 'jr.', 'maj.', 'mar.', 'messrs.', 'mlle.', 'mm.', 'mme.', 'mr.', 'mrs.', 'ms.', 'msgr.', 'nov.', 'oct.', 'op.', 'ord.', 'ph.d.', 'prof.', 'pvt.', 'rep.', 'reps.', 'res.', 'rev.', 'rt.', 'sen.', 'sens.', 'sep.', 'sfc.', 'sgt.', 'sr.', 'st.', 'supt.', 'surg.', 'u.s.', 'v.p.', 'vs.'})
module-attribute
¶
TextSegmentBuilder
¶
avoid_orphans = avoid_orphans
instance-attribute
¶current_characters = 0
instance-attribute
¶current_words = []
instance-attribute
¶ignore_speaker = ignore_speaker
instance-attribute
¶max_duration = max_duration_ms
instance-attribute
¶max_gap_duration = max_gap_duration_ms
instance-attribute
¶segments = []
instance-attribute
¶target_characters = target_characters
instance-attribute
¶__init__(*, max_duration_ms=None, target_characters=None, avoid_orphans=True, max_gap_duration_ms=None, ignore_speaker=True)
¶build_segments(*, target_duration=None, target_characters=None, avoid_orphans=True, max_gap_duration=None, ignore_speaker=False)
¶Build or rebuild segments from the contents of words.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
target_duration
|
Optional[int]
|
Maximum desired segment duration in milliseconds. |
None
|
target_characters
|
Optional[int]
|
Maximum desired character length of a segment. |
None
|
avoid_orphans
|
Optional[bool]
|
If True, prevent extremely short trailing segments. |
True
|
Note
This is a stub. Concrete algorithms will be implemented later.
Raises:
| Type | Description |
|---|---|
NotImplementedError
|
Always, until implemented. |
create_segments(timed_text)
¶
transcription_service
¶
TranscriptionResult
¶
Bases: BaseModel
audio_duration_ms = None
class-attribute
instance-attribute
¶confidence = None
class-attribute
instance-attribute
¶language
instance-attribute
¶raw_result = None
class-attribute
instance-attribute
¶status = None
class-attribute
instance-attribute
¶text
instance-attribute
¶transcript_id = None
class-attribute
instance-attribute
¶utterance_timing = None
class-attribute
instance-attribute
¶word_timing = None
class-attribute
instance-attribute
¶
TranscriptionService
¶
Bases: ABC
Abstract base class defining the interface for transcription services.
This interface provides a standard way to interact with different transcription service providers (e.g., OpenAI Whisper, AssemblyAI).
get_result(job_id)
abstractmethod
¶Get results for an existing transcription job.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
ID of the transcription job |
required |
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing transcription results in the same |
TranscriptionResult
|
standardized format as transcribe() |
transcribe(audio_file, options=None)
abstractmethod
¶Transcribe audio file to text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path to audio file or file-like object |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription |
None
|
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
TranscriptionResult |
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
abstractmethod
¶Transcribe audio and return result in specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path, file-like object, or URL of audio file |
required |
format_type
|
str
|
Format type (e.g., "srt", "vtt", "text") |
'srt'
|
transcription_options
|
Optional[Dict[str, Any]]
|
Options for transcription |
None
|
format_options
|
Optional[Dict[str, Any]]
|
Format-specific options |
None
|
Returns:
| Type | Description |
|---|---|
str
|
String representation in the requested format |
TranscriptionServiceFactory
¶
Factory for creating transcription service instances.
This factory provides a standard way to create transcription service instances based on the provider name and configuration.
create_service(provider='assemblyai', api_key=None, **kwargs)
classmethod
¶Create a transcription service instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
provider
|
str
|
Service provider name (e.g., "whisper", "assemblyai") |
'assemblyai'
|
api_key
|
Optional[str]
|
API key for the service |
None
|
**kwargs
|
Any
|
Additional provider-specific configuration |
{}
|
Returns:
| Type | Description |
|---|---|
TranscriptionService
|
TranscriptionService instance |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the provider is not supported |
ImportError
|
If the provider module cannot be imported |
register_provider(name, provider_class)
classmethod
¶Register a provider implementation with the factory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Provider name (lowercase) |
required |
provider_class
|
Callable[..., TranscriptionService]
|
Provider implementation class or factory function |
required |
Example
from my_module import MyTranscriptionService TranscriptionServiceFactory.register_provider("my_provider", MyTranscriptionService)
Utterance
¶
vtt_processor
¶
VTTConfig
¶
Configuration options for WebVTT processing.
include_speaker = include_speaker
instance-attribute
¶max_chars_per_line = max_chars_per_line
instance-attribute
¶reindex_entries = reindex_entries
instance-attribute
¶speaker_format = speaker_format
instance-attribute
¶timestamp_format = timestamp_format
instance-attribute
¶__init__(include_speaker=False, speaker_format='<v {speaker}>{text}', reindex_entries=False, timestamp_format='{:02d}:{:02d}:{:02d}.{:03d}', max_chars_per_line=42)
¶Initialize with default settings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
include_speaker
|
bool
|
Whether to include speaker labels in output |
False
|
speaker_format
|
str
|
Format string for speaker attribution |
'<v {speaker}>{text}'
|
reindex_entries
|
bool
|
Whether to reindex entries sequentially |
False
|
timestamp_format
|
str
|
Format string for timestamp formatting |
'{:02d}:{:02d}:{:02d}.{:03d}'
|
max_chars_per_line
|
int
|
Maximum characters per line before splitting |
42
|
VTTProcessor
¶
Handles parsing and generating WebVTT format.
config = config or VTTConfig()
instance-attribute
¶__init__(config=None)
¶Initialize with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
Optional[VTTConfig]
|
Configuration options for VTT processing |
None
|
generate(timed_texts)
¶Generate VTT content from a list of TimedUnit objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
timed_texts
|
List[TimedTextUnit]
|
List of TimedUnit objects |
required |
Returns:
| Type | Description |
|---|---|
str
|
String containing VTT formatted content |
parse(vtt_content)
¶Parse VTT content into a list of TimedUnit objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
vtt_content
|
str
|
String containing VTT formatted content |
required |
Returns:
| Type | Description |
|---|---|
List[TimedTextUnit]
|
List of TimedUnit objects |
whisper_service
¶
TODO: MAJOR REFACTOR PLANNED¶
This module currently mixes persistent service configuration (WhisperConfig) with per-call runtime options, leading to complex validation and logic. Plan is to:
- Refactor so each WhisperTranscriptionService instance is configured once at construction, with all relevant settings (including file-like/path-like mode, file extension, etc).
- Use Pydantic BaseSettings for configuration to normalize configuration and validation according to TNH Scholar style.
- Remove ad-hoc runtime options from the transcribe() entrypoint; all config should be set at init.
- If a different configuration is needed, instantiate a new service object.
- This will simplify validation, error handling, and code logic, and make the contract clear and robust.
- NOTE: This will change the TranscriptionService contract and will require similar changes in other transcription system implementations.
- Update all dependent code and tests accordingly.
logger = get_child_logger(__name__)
module-attribute
¶
WhisperBase
¶
WhisperConfig
dataclass
¶
Configuration for the Whisper transcription service.
BASE_PARAMS = ['model', 'language', 'temperature', 'prompt', 'response_format']
class-attribute
instance-attribute
¶FORMAT_PARAMS = {'verbose_json': ['timestamp_granularities'], 'json': [], 'text': [], 'srt': [], 'vtt': []}
class-attribute
instance-attribute
¶SUPPORTED_FORMATS = ['json', 'text', 'srt', 'vtt', 'verbose_json']
class-attribute
instance-attribute
¶chunking_strategy = 'auto'
class-attribute
instance-attribute
¶language = None
class-attribute
instance-attribute
¶model = 'whisper-1'
class-attribute
instance-attribute
¶prompt = None
class-attribute
instance-attribute
¶response_format = 'verbose_json'
class-attribute
instance-attribute
¶temperature = None
class-attribute
instance-attribute
¶timestamp_granularities = field(default_factory=(lambda: ['word']))
class-attribute
instance-attribute
¶__init__(model='whisper-1', response_format='verbose_json', timestamp_granularities=(lambda: ['word'])(), chunking_strategy='auto', language=None, temperature=None, prompt=None)
¶to_dict()
¶Convert configuration to dictionary for API call.
validate()
¶Validate configuration values.
WhisperResponse
¶
Bases: WhisperBase
WhisperSegment
¶
WhisperTranscriptionService
¶
Bases: TranscriptionService
OpenAI Whisper implementation of the TranscriptionService interface.
Provides transcription services using the OpenAI Whisper API.
config = WhisperConfig()
instance-attribute
¶format_converter = FormatConverter()
instance-attribute
¶__init__(api_key=None, **config_options)
¶Initialize the Whisper transcription service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
api_key
|
Optional[str]
|
OpenAI API key (defaults to OPENAI_API_KEY env var) |
None
|
**config_options
|
Any
|
Additional configuration options |
{}
|
get_result(job_id)
¶Get results for an existing transcription job.
Whisper API operates synchronously and doesn't use job IDs, so this method is not implemented.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
job_id
|
str
|
ID of the transcription job |
required |
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing transcription results |
Raises:
| Type | Description |
|---|---|
NotImplementedError
|
This method is not supported for Whisper |
set_api_key(api_key=None)
¶Set or update the API key.
This method allows refreshing the API key without re-instantiating the class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
api_key
|
Optional[str]
|
OpenAI API key (defaults to OPENAI_API_KEY env var) |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If no API key is provided or found in environment |
transcribe(audio_file, options=None)
¶Transcribe audio file to text using OpenAI Whisper API.
PATCH: If audio_file is a file-like object, options['file_extension'] must be provided (OpenAI API quirk).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path to audio file or file-like object |
required |
options
|
Optional[Dict[str, Any]]
|
Provider-specific options for transcription. If audio_file is file-like, must include 'file_extension'. |
None
|
Returns:
| Type | Description |
|---|---|
TranscriptionResult
|
Dictionary containing transcription results with standardized keys |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file-like object is provided without 'file_extension' in options |
transcribe_to_format(audio_file, format_type='srt', transcription_options=None, format_options=None)
¶Transcribe audio and return result in specified format.
PATCH: If audio_file is a file-like object, transcription_options['file_extension'] must be provided (OpenAI API quirk).
Takes advantage of the direct subtitle generation functionality when requesting SRT or VTT formats.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Union[Path, BytesIO]
|
Path, file-like object, or URL of audio file |
required |
format_type
|
str
|
Format type (e.g., "srt", "vtt", "text") |
'srt'
|
transcription_options
|
Optional[Dict[str, Any]]
|
Options for transcription. If audio_file is file-like, must include 'file_extension'. |
None
|
format_options
|
Optional[Dict[str, Any]]
|
Format-specific options |
None
|
Returns:
| Type | Description |
|---|---|
str
|
String representation in the requested format |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file-like object is provided without 'file_extension' in transcription_options |
utils
¶
__all__ = ['AudioEnhancer', 'get_segment_audio', 'play_audio_segment', 'play_bytes', 'play_from_file', 'play_diarization_segment', 'get_audio_from_file']
module-attribute
¶
AudioEnhancer
¶
compression_settings = compression_settings
instance-attribute
¶
config = config
instance-attribute
¶
__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())
¶
Initialize with enhancement configuration and compression settings.
enhance(input_path, output_path=None)
¶
Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.
extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)
¶
Extract a sample segment from the audio file.
Parameters¶
input_path : Path Path to the input audio file. start : float Start time in seconds. duration : float Duration in seconds. output_path : Path, optional Output file path. If None, auto-generated from input. output_format : str, default="flac" Output audio format/extension. codec : str, optional Audio codec to use (default: "flac" if output_format is "flac", else None). compression_level : int, default=8 Compression level for supported codecs.
Returns¶
Path Path to the extracted audio sample.
get_audio_info(file_path)
¶
Get detailed audio information using ffprobe.
play_audio(file_path)
¶
Play audio in notebook for quality assessment.
get_audio_from_file(audio_file)
¶
get_segment_audio(segment, audio)
¶
play_audio_segment(audio)
¶
play_bytes(data, format='wav')
¶
play_diarization_segment(segment, audio)
¶
play_from_file(path)
¶
audio_enhance
¶
Module review and recommendations:
Big Picture Approach:
Modular, Configurable, and Extensible: Your use of Pydantic models for settings and configs is excellent. It makes the pipeline flexible and easy to tune for different ASR or enhancement needs. Tooling: Leveraging SoX and FFmpeg is a pragmatic choice for robust, high-quality audio processing. Pipeline Structure: The AudioEnhancer class is well-structured, with clear separation of concerns for each processing step (remix, rate, gain, EQ, compand, etc.). Notebook Integration: The play_audio method and use of IPython display is great for interactive, iterative work.
Details & Points You Might Be Missing:
Error Handling & Logging:
You print errors but could benefit from more structured logging (e.g., using Python’s logging module). Consider more granular exception handling, especially for subprocess calls. Testing & Validation:
No unit tests or validation of output audio quality/format are present. Consider adding automated tests (even if just for file existence, format, and basic properties). You could add a method to compare pre/post enhancement SNR, loudness, or other metrics. Documentation & Examples:
While docstrings are good, a usage example (in code or markdown) would help new users. Consider a README or notebook cell that demonstrates a full workflow. Performance:
For large-scale or batch processing, consider parallelization or async processing. Temporary files (e.g., intermediate FLACs) could be managed/cleaned up more robustly. Extensibility:
The pipeline is modular, but adding a “custom steps” hook (e.g., user-defined SoX/FFmpeg args) would make it even more flexible. You might want to support other codecs or output formats for downstream ASR models. Feature Gaps:
The extract_sample method is a TODO. Implementing this would be useful for quick QA or dataset creation. Consider adding Voice Activity Detection (VAD) or silence trimming as optional steps. You could add a “dry run” mode to print the SoX/FFmpeg commands without executing, for debugging. ASR-Specific Enhancements:
You might want to add preset configs for different ASR models (e.g., Whisper, Wav2Vec2, etc.), as they may have different optimal preprocessing. Consider integrating with open-source ASR evaluation tools to close the loop on enhancement effectiveness. General Strategic Recommendations:
Automate QA: Add methods to check output audio quality, duration, and format, and optionally compare to input. Batch Processing: Add a method to process a directory or list of files. Config Export/Import: Allow saving/loading configs as JSON/YAML for reproducibility. CLI/Script Interface: Consider a command-line interface for use outside notebooks. Unit Tests: Add basic tests for each method, especially for error cases. Summary Table:
| Modularity | Good | Add custom step hooks | | Configurability | Excellent | Presets for more ASR models | | Error Handling | Basic | Use logging, more granular exceptions | | Testing | Missing | Add unit tests, output validation | | Documentation | Good | Add usage examples, README | | Extensibility | Good | Support more codecs, batch processing | | ASR Optimization | Good start | Add VAD, silence trim, model-specific configs |
logger = get_child_logger(__name__)
module-attribute
¶
AudioEnhancer
¶
compression_settings = compression_settings
instance-attribute
¶config = config
instance-attribute
¶__init__(config=EnhancementConfig(), compression_settings=CompressionSettings())
¶Initialize with enhancement configuration and compression settings.
enhance(input_path, output_path=None)
¶Apply enhancement routines (compression, EQ, gating, etc.) in a modular fashion. Converts input to FLAC working format for Whisper compatibility.
extract_sample(input_path, start, duration, output_path=None, output_format='flac', codec=None, compression_level=8)
¶Extract a sample segment from the audio file.
Parameters¶
input_path : Path Path to the input audio file. start : float Start time in seconds. duration : float Duration in seconds. output_path : Path, optional Output file path. If None, auto-generated from input. output_format : str, default="flac" Output audio format/extension. codec : str, optional Audio codec to use (default: "flac" if output_format is "flac", else None). compression_level : int, default=8 Compression level for supported codecs.
Returns¶
Path Path to the extracted audio sample.
get_audio_info(file_path)
¶Get detailed audio information using ffprobe.
play_audio(file_path)
¶Play audio in notebook for quality assessment.
CompressionSettings
¶
Bases: BaseSettings
Compression settings for audio enhancement routines.
Attributes:
| Name | Type | Description |
|---|---|---|
minimal |
list[str]
|
List of compand arguments for minimal compression. |
light |
list[str]
|
List of compand arguments for light compression. |
moderate |
list[str]
|
List of compand arguments for moderate compression. |
aggressive |
list[str]
|
List of compand arguments for aggressive compression. |
whisper_optimized |
list[str]
|
List of compand arguments for Whisper-optimized compression. |
whisper_aggressive |
list[str]
|
List of compand arguments for aggressive Whisper compression. |
primary_speech_only |
list[str]
|
List of compand arguments for primary speech only. |
aggressive = ['0.02,0.1', '8:-70,-55,-45,-35,-25,-15', '-5', '-90', '0.05']
class-attribute
instance-attribute
¶light = ['0.05,0.2', '6:-60,-50,-40,-30,-20,-10', '-3', '-85', '0.1']
class-attribute
instance-attribute
¶minimal = ['0.1,0.3', '3:-50,-40,-30,-20', '-3', '-80', '0.2']
class-attribute
instance-attribute
¶moderate = ['0.03,0.15', '6:-65,-50,-40,-30,-20,-10', '-4', '-85', '0.1']
class-attribute
instance-attribute
¶primary_speech_only = ['0.005,0.06', '12:-60,-45,-55,-30,-35,-18,-15,-8', '-8', '-60', '0.03']
class-attribute
instance-attribute
¶whisper_aggressive = ['0.005,0.06', '12:-75,-45,-55,-30,-35,-18,-15,-8', '-8', '-95', '0.03']
class-attribute
instance-attribute
¶whisper_optimized = ['0.005,0.06', '12:-75,-65,-55,-45,-35,-25,-15,-8', '-8', '-95', '0.03']
class-attribute
instance-attribute
¶
EQSettings
¶
Bases: BaseSettings
bass = (-5, 200)
class-attribute
instance-attribute
¶contrast = 75
class-attribute
instance-attribute
¶eq_bands = [(100, 0.9, -20), (1500, 1, 4), (4000, 0.6, 15), (10000, 1, -10)]
class-attribute
instance-attribute
¶highpass_freq = 175
class-attribute
instance-attribute
¶lowpass_freq = 15000
class-attribute
instance-attribute
¶treble = (3, 3000)
class-attribute
instance-attribute
¶
EnhancementConfig
¶
Bases: BaseModel
channels = 2
class-attribute
instance-attribute
¶codec = 'flac'
class-attribute
instance-attribute
¶compression_level = 'aggressive'
class-attribute
instance-attribute
¶eq = EQSettings()
class-attribute
instance-attribute
¶force_mono = False
class-attribute
instance-attribute
¶gate = GateSettings()
class-attribute
instance-attribute
¶include_eq = True
class-attribute
instance-attribute
¶include_gate = True
class-attribute
instance-attribute
¶norm = NormalizationSettings()
class-attribute
instance-attribute
¶rate = RateSettings()
class-attribute
instance-attribute
¶remix = RemixSettings()
class-attribute
instance-attribute
¶sample_rate = 48000
class-attribute
instance-attribute
¶target_rate = None
class-attribute
instance-attribute
¶
GateSettings
¶
Bases: BaseSettings
gate_params = ['0.1', '0.05', '-inf', '0.1', '-90', '0.1']
class-attribute
instance-attribute
¶
compress_wav_to_mp4_vbr(input_wav, output_path=None, quality=8)
¶
Compress WAV to M4A (AAC VBR) using ffmpeg.
Parameters:¶
input_wav : str or Path Path to the input .wav file output_path : str or Path, optional Output .mp4 file path. If None, auto-generated from input quality : int, default=8 VBR quality level: 1 = good (~96kbps), 2 = very good (~128kbps), 3+ = higher bitrate
Returns:¶
Path Path to the compressed .m4a file
get_sox_info(file_path)
¶
Get audio info using SoX
cli_tools
¶
TNH Scholar CLI Tools
Command-line interface tools for the TNH Scholar project:
audio-transcribe:
Audio processing pipeline that handles downloading, segmentation,
and transcription of Buddhist teachings.
tnh-gen:
Unified GenAI CLI replacing legacy tooling, including tnh-fab.
See https://aaronksolomon.github.io/tnh-scholar/architecture/tnh-gen/
See individual tool documentation for usage details and examples.
audio_transcribe
¶
__all__ = ['audio_transcribe', 'main', 'YTDVersionChecker']
module-attribute
¶
YTDVersionChecker
¶
Simple version checker for yt-dlp with robust version comparison.
This is a prototype implementation may need expansion in these areas: - Caching to prevent frequent PyPI calls - More comprehensive error handling for: - Missing/uninstalled packages - Network timeouts - JSON parsing errors - Invalid version strings - Environment detection (virtualenv, conda, system Python) - Configuration options for version pinning - Proxy support for network requests
NETWORK_TIMEOUT = 5
class-attribute
instance-attribute
¶
PYPI_URL = 'https://pypi.org/pypi/yt-dlp/json'
class-attribute
instance-attribute
¶
check_version()
¶
Check if yt-dlp needs updating.
Returns:
| Type | Description |
|---|---|
Tuple[bool, Version, Version]
|
Tuple of (needs_update, installed_version, latest_version) |
Raises:
| Type | Description |
|---|---|
ImportError
|
If yt-dlp is not installed |
RequestException
|
For network-related errors |
InvalidVersion
|
If version strings are invalid |
main()
¶
audio_transcribe
¶
CLI tool for downloading audio (YouTube or local), and transcribing to text.
Usage
audio-transcribe [OPTIONS]
e.g. audio-transcribe --yt_url https://www.youtube.com/watch?v=EXAMPLE --output_dir ./processed --service whisper --model whisper-1
DEFAULT_CHUNK_DURATION = 120
module-attribute
¶
DEFAULT_MIN_CHUNK = 10
module-attribute
¶
DEFAULT_MODEL = 'whisper-1'
module-attribute
¶
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt'
module-attribute
¶
DEFAULT_RESPONSE_FORMAT = 'text'
module-attribute
¶
DEFAULT_SERVICE = 'whisper'
module-attribute
¶
DEFAULT_TEMP_DIR = tempfile.gettempdir()
module-attribute
¶
VIDEO_EXTENSIONS = {'.mp4', '.avi', '.mov', '.mkv', '.wmv'}
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
AudioTranscribeApp
¶
Main application class for audio transcription CLI.
Organizes configuration, source resolution, and pipeline execution. All
runtime options are supplied via a validated AudioTranscribeConfig.
audio_file = self._resolve_audio_source()
instance-attribute
¶chunk_duration = TimeMs.from_seconds(config.chunk_duration)
instance-attribute
¶config = config
instance-attribute
¶diarization_config = self._build_diarization_config()
instance-attribute
¶end_time = config.end_time
instance-attribute
¶file_ = config.file_
instance-attribute
¶keep_artifacts = config.keep_artifacts
instance-attribute
¶language = config.language
instance-attribute
¶min_chunk = TimeMs.from_seconds(config.min_chunk)
instance-attribute
¶model = config.model
instance-attribute
¶output_path = Path(config.output)
instance-attribute
¶prompt = config.prompt
instance-attribute
¶response_format = config.response_format
instance-attribute
¶service = config.service
instance-attribute
¶start_time = config.start_time
instance-attribute
¶temp_dir = self.output_path.parent
instance-attribute
¶transcription_options = self._build_transcription_options()
instance-attribute
¶yt_url = config.yt_url
instance-attribute
¶yt_url_csv = config.yt_url_csv
instance-attribute
¶__init__(config)
¶Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
AudioTranscribeConfig
|
Validated AudioTranscribeConfig instance. |
required |
run()
¶Run the transcription pipeline and print results, or just download audio if no_transcribe is set.
audio_transcribe(**kwargs)
¶
CLI entry point for audio transcription.
main()
¶
config
¶
DEFAULT_OUTPUT_PATH = './audio_transcriptions/transcript.txt'
module-attribute
¶
DEFAULT_SERVICE = 'whisper'
module-attribute
¶
DEFAULT_TEMP_DIR = './audio_transcriptions/tmp'
module-attribute
¶
AudioTranscribeConfig
¶
Bases: BaseSettings
Validated runtime configuration for the audio-transcribe CLI.
chunk_duration = Field(default=120, description='Target chunk duration in seconds')
class-attribute
instance-attribute
¶end_time = Field(default=None, description='End time offset')
class-attribute
instance-attribute
¶file_ = Field(default=None, description='Path to local audio file')
class-attribute
instance-attribute
¶keep_artifacts = Field(default=False, description='Keep all intermediate artifacts in the output directory instead of using a system temp directory.')
class-attribute
instance-attribute
¶language = Field(default='en', description='Language code')
class-attribute
instance-attribute
¶min_chunk = Field(default=10, ge=10, description='Minimum chunk duration in seconds')
class-attribute
instance-attribute
¶model = Field(default='whisper-1', description='Transcription model name')
class-attribute
instance-attribute
¶model_config = SettingsConfigDict(env_file='.env', env_file_encoding='utf-8', extra='ignore')
class-attribute
instance-attribute
¶no_transcribe = Field(default=False, description='If True, only download YouTube audio to mp3, no transcription.')
class-attribute
instance-attribute
¶output = Field(default=DEFAULT_OUTPUT_PATH, description='Path to output transcript file')
class-attribute
instance-attribute
¶prompt = Field(default='', description='Prompt or keywords')
class-attribute
instance-attribute
¶response_format = Field(default='text', description='Response format')
class-attribute
instance-attribute
¶service = Field(default=DEFAULT_SERVICE, pattern='^(whisper|assemblyai)$', description='Transcription service')
class-attribute
instance-attribute
¶start_time = Field(default=None, description='Start time offset')
class-attribute
instance-attribute
¶temp_dir = Field(default=None, description='Directory for temporary processing files')
class-attribute
instance-attribute
¶yt_url = Field(default=None, description='YouTube URL')
class-attribute
instance-attribute
¶yt_url_csv = Field(default=None, description='CSV file with YouTube URLs')
class-attribute
instance-attribute
¶validate_sources()
¶Enforce coherent source selection for CLI execution.
MultipleAudioSourceError
¶
Bases: ValueError
Raised when audio source selection has multiple sources).
NoAudioSourceError
¶
Bases: ValueError
Raised when no audio source is provided.
convert_video
¶
FFMPEG_VIDEO_CONV_DEFAULT_CONFIG = {'audio_codec': 'libmp3lame', 'audio_bitrate': '192k', 'audio_samplerate': '44100'}
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
convert_video_to_audio(video_file, output_dir, conversion_params=None)
¶
Convert a video file to an audio file using ffmpeg.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
video_file
|
Path
|
Path to the video file |
required |
output_dir
|
Path
|
Directory to save the converted audio file |
required |
conversion_params
|
Optional[Dict[str, str]]
|
Optional dictionary to override default conversion parameters |
None
|
Returns:
| Type | Description |
|---|---|
Path
|
Path to the converted audio file |
environment
¶
env
¶
logger = get_child_logger(__name__)
module-attribute
¶check_env()
¶Check the environment for necessary conditions: 1. Check OpenAI key is available. 2. Check that all requirements from requirements.txt are importable.
check_requirements(requirements_file)
¶Check that all requirements listed in requirements.txt can be imported. If any cannot be imported, print a warning.
This is a heuristic check. Some packages may not share the same name as their importable module. Adjust the name mappings below as needed.
Example
check_requirements(Path("./requirements.txt"))
Prints warnings if imports fail, otherwise silent.¶
transcription_pipeline
¶
TranscriptionPipeline
¶
audio_file = audio_file
instance-attribute
¶audio_file_extension = audio_file.suffix
instance-attribute
¶diarization_config = diarization_config or DiarizationConfig()
instance-attribute
¶diarization_dir = self.output_dir / f'{self.audio_file.stem}_diarization'
instance-attribute
¶diarization_kwargs = diarization_kwargs or {}
instance-attribute
¶diarization_results_path = self.diarization_dir / 'raw_diarization_results.json'
instance-attribute
¶logger = logger or logging.getLogger(__name__)
instance-attribute
¶output_dir = output_dir
instance-attribute
¶save_diarization = save_diarization
instance-attribute
¶transcriber = transcriber
instance-attribute
¶transcription_options = patch_whisper_options(transcription_options, file_extension=(audio_file.suffix))
instance-attribute
¶__init__(audio_file, output_dir, diarization_config=None, transcriber='whisper', transcription_options=None, diarization_kwargs=None, save_diarization=True, logger=None)
¶Initialize the TranscriptionPipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_file
|
Path
|
Path to the audio file to process. |
required |
output_dir
|
Path
|
Directory to store output files. |
required |
diarization_config
|
Optional[DiarizationConfig]
|
Diarization configuration. |
None
|
transcriber
|
str
|
Transcription service provider. |
'whisper'
|
transcription_options
|
Optional[Dict[str, Any]]
|
Options for transcription. |
None
|
diarization_kwargs
|
Optional[Dict[str, Any]]
|
Additional diarization arguments. |
None
|
save_diarization
|
bool
|
Whether to save raw diarization JSON results. |
True
|
logger
|
Optional[Logger]
|
Logger for pipeline events. |
None
|
run()
¶Execute the full transcription pipeline with robust error handling.
Returns:
| Type | Description |
|---|---|
Optional[List[Dict[str, Any]]]
|
List[Dict[str, Any]]: List of transcript dicts with chunk metadata, or None on failure |
Raises:
| Type | Description |
|---|---|
RuntimeError
|
If any pipeline step fails. |
validate
¶
validate_inputs(is_download, yt_url, yt_url_list, audio_file, split, transcribe, chunk_dir, no_chunks, silence_boundaries, whisper_boundaries)
¶
Validate the CLI inputs for coherent download, split, and transcribe flows.
version_check
¶
logger = get_child_logger(__name__)
module-attribute
¶
YTDVersionChecker
¶
Simple version checker for yt-dlp with robust version comparison.
This is a prototype implementation may need expansion in these areas: - Caching to prevent frequent PyPI calls - More comprehensive error handling for: - Missing/uninstalled packages - Network timeouts - JSON parsing errors - Invalid version strings - Environment detection (virtualenv, conda, system Python) - Configuration options for version pinning - Proxy support for network requests
NETWORK_TIMEOUT = 5
class-attribute
instance-attribute
¶PYPI_URL = 'https://pypi.org/pypi/yt-dlp/json'
class-attribute
instance-attribute
¶check_version()
¶Check if yt-dlp needs updating.
Returns:
| Type | Description |
|---|---|
Tuple[bool, Version, Version]
|
Tuple of (needs_update, installed_version, latest_version) |
Raises:
| Type | Description |
|---|---|
ImportError
|
If yt-dlp is not installed |
RequestException
|
For network-related errors |
InvalidVersion
|
If version strings are invalid |
check_ytd_version()
¶
Check if yt-dlp is up to date and available.
This function checks the installed version of yt-dlp against the latest version on PyPI. Since YouTube changes frequently break older yt-dlp versions, this check is strict and requires the latest version.
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if yt-dlp is installed and up to date, False otherwise. |
Note
This is a strict check. Outdated versions return False to prevent wasting time on long-running jobs that will likely fail due to YouTube API changes.
claude_assistant
¶
Claude assistant CLI package.
claude_assistant
¶
Typer entrypoint for a minimal local Claude worker wrapper.
claude-assistant is a thin convenience CLI for launching claude --print
from a predictable environment. It is intended as a pragmatic bridge for
delegated local worker invocation while the broader orchestration surfaces are
still evolving.
app = typer.Typer(name='claude-assistant', help='Minimal wrapper around `claude --print` for delegated local worker runs.', add_completion=False, no_args_is_help=True)
module-attribute
¶
ClaudeAssistantPaths
dataclass
¶
ClaudeAssistantResult
dataclass
¶
Serializable summary of one wrapper invocation.
command
instance-attribute
¶cwd
instance-attribute
¶exit_code
instance-attribute
¶final_message
instance-attribute
¶stderr_path
instance-attribute
¶stdout_path
instance-attribute
¶__init__(command, cwd, exit_code, stdout_path, stderr_path, final_message)
¶to_json()
¶Render one JSON summary suitable for scripted callers.
main()
¶
Dispatch to the Typer app.
run_command(prompt=typer.Option(..., '--prompt', help='Prompt text to pass to `claude --print`.'), cwd=typer.Option(Path.cwd(), '--cwd', file_okay=False, dir_okay=True, resolve_path=True, help='Working directory for the Claude run.'), claude_executable=typer.Option(None, '--claude-executable', file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Claude executable.'), stdout_path=typer.Option(None, '--stdout-path', resolve_path=True, help='Optional path for captured stdout.'), stderr_path=typer.Option(None, '--stderr-path', resolve_path=True, help='Optional path for captured stderr.'), permission_mode=typer.Option('dontAsk', '--permission-mode', help='Claude permission mode, for example `dontAsk` or `acceptEdits`.'), json_output=typer.Option(True, '--json/--no-json', help='Request Claude stream-json stdout for machine-readable capture.'), verbose=typer.Option(True, '--verbose/--no-verbose', help='Include Claude verbose event output.'), inherit_env=typer.Option(False, '--inherit-env/--sanitize-env', help='Inherit the current environment instead of the sanitized env.'))
¶
Run one local Claude worker invocation and emit a JSON summary.
codex_assistant
¶
Codex assistant CLI package.
codex_assistant
¶
Typer entrypoint for a minimal local Codex worker wrapper.
codex-assistant is a thin convenience CLI for launching codex exec
from a predictable, sanitized user-like environment. It is intended as a
pragmatic bridge for delegated local worker invocation while the broader
orchestration surfaces are still evolving.
app = typer.Typer(name='codex-assistant', help='Minimal sanitized wrapper around `codex exec` for delegated local worker runs.', add_completion=False, no_args_is_help=True)
module-attribute
¶
CodexAssistantPaths
dataclass
¶
CodexAssistantResult
dataclass
¶
Serializable summary of one wrapper invocation.
command
instance-attribute
¶cwd
instance-attribute
¶exit_code
instance-attribute
¶final_message
instance-attribute
¶stderr_path
instance-attribute
¶stdout_path
instance-attribute
¶__init__(command, cwd, exit_code, stdout_path, stderr_path, final_message)
¶to_json()
¶Render one JSON summary suitable for scripted callers.
main()
¶
Dispatch to the Typer app.
run_command(prompt=typer.Option(..., '--prompt', help='Prompt text to pass to `codex exec`.'), cwd=typer.Option(Path.cwd(), '--cwd', file_okay=False, dir_okay=True, resolve_path=True, help='Working directory for the Codex run.'), codex_executable=typer.Option(None, '--codex-executable', file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Codex executable.'), profile=typer.Option('collab', '--profile', help='Codex profile to use.'), model=typer.Option(None, '--model', help='Optional model override.'), stdout_path=typer.Option(None, '--stdout-path', resolve_path=True, help='Optional path for captured stdout.'), stderr_path=typer.Option(None, '--stderr-path', resolve_path=True, help='Optional path for captured stderr.'), output_last_message_path=typer.Option(None, '--output-last-message-path', resolve_path=True, help='Optional path for Codex `--output-last-message` capture.'), json_output=typer.Option(True, '--json/--no-json', help='Request Codex JSONL stdout for machine-readable capture.'), ephemeral=typer.Option(True, '--ephemeral/--no-ephemeral', help='Use Codex ephemeral mode.'), inherit_env=typer.Option(False, '--inherit-env/--sanitize-env', help='Inherit the current environment instead of the sanitized user-like env.'), enable_feature=typer.Option([], '--enable-feature', help='Repeatable Codex feature enable flag.'), disable_feature=typer.Option([], '--disable-feature', help='Repeatable Codex feature disable flag.'))
¶
Run one local Codex worker invocation and emit a JSON summary.
json_to_srt
¶
__all__ = ['main', 'json_to_srt']
module-attribute
¶
main()
¶
Entry point for the jsonl-to-srt CLI tool.
json_to_srt
¶
Simple CLI tool for converting JSONL transcription files to SRT format.
This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format. Handles chunked transcriptions with proper timestamp accumulation.
JsonDict = dict[str, Any]
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
JsonlToSrtConverter
¶
Converts JSONL transcription files from audio-transcribe to SRT format.
accumulated_time = 0.0
instance-attribute
¶entry_index = 1
instance-attribute
¶__init__()
¶Initialize converter state.
build_srt_entry(index, start, end, text)
¶Format a single SRT entry.
convert(input_file, output_file=None)
¶Convert a JSONL transcription file to SRT format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_file
|
TextIO
|
JSONL transcription file to parse |
required |
output_file
|
Optional[Path]
|
Optional output file path |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
SRT formatted content |
extract_segment_data(segment)
¶Extract timestamp and text data from a segment.
format_timestamp(seconds)
¶Convert seconds to SRT timestamp format (HH:MM:SS,mmm).
get_segments_from_data(data)
¶Extract segments from a data object.
handle_output(srt_content, output_file)
¶Write SRT content to file or stdout.
parse_jsonl_line(line)
¶Parse a single JSONL line into a dictionary.
process_jsonl_content(lines)
¶Process all JSONL content into SRT format.
process_jsonl_line(line)
¶Process a single JSONL line into SRT entries.
process_segment(segment)
¶Process a single segment into SRT format.
process_segments_list(segments_list)
¶Process a list of segments into SRT entries.
read_input_lines(input_file)
¶Read and filter input lines from file.
json_to_srt(input_file, output=None)
¶
Convert JSONL transcription files to SRT subtitle format.
Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.
main()
¶
Entry point for the jsonl-to-srt CLI tool.
json_to_srt1
¶
Simple CLI tool for converting JSONL transcription files to SRT format.
This module provides a command line interface for transforming JSONL transcription files (from audio-transcribe) into SRT subtitle format.
JsonDict = dict[str, Any]
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
convert_to_srt(input_file, output_file=None)
¶
Convert a JSONL transcription file to SRT format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_file
|
TextIO
|
JSONL transcription file to parse |
required |
output_file
|
Optional[Path]
|
Optional output file path |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
SRT formatted content |
extract_segment_data(segment)
¶
Extract timestamp and text data from a segment.
format_srt_entry(index, start, end, text)
¶
Format a single SRT entry.
format_timestamp(seconds)
¶
Convert seconds to SRT timestamp format (HH:MM:SS,mmm).
get_segments_from_data(data)
¶
Extract segments from a data object.
handle_output(srt_content, output_file)
¶
Write SRT content to file or stdout.
json_to_srt(input_file, output=None)
¶
Convert JSONL transcription files to SRT subtitle format.
Reads from stdin if no INPUT_FILE is specified. Writes to stdout if no output file is specified.
main()
¶
Entry point for the jsonl-to-srt CLI tool.
parse_jsonl_line(line)
¶
Parse a single JSONL line into a dictionary.
process_jsonl_content(lines)
¶
Process all JSONL content into SRT format.
process_jsonl_line(line, entry_index, accumulated_time)
¶
Process a single JSONL line into SRT entries.
process_segment(segment, entry_index)
¶
Process a single segment into SRT format.
process_segments_list(segments_list, entry_index)
¶
Process a list of segments into SRT entries.
read_input_lines(input_file)
¶
Read and filter input lines from file.
nfmt
¶
sent_split
¶
__all__ = ['main', 'sent_split']
module-attribute
¶
main()
¶
sent_split
¶
Simple CLI tool for sentence splitting.
This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.
sent_split_bak
¶
Simple CLI tool for sentence splitting.
This module provides a command line interface for splitting text into sentences. Uses NLTK for robust sentence tokenization. Reads from stdin and writes to stdout by default, with optional file input/output.
ensure_nltk_data()
¶
Ensure NLTK punkt tokenizer is available.
main()
¶
process_text(text, newline=True)
¶
Split text into sentences using NLTK.
sent_split(input_file, output, space)
¶
Split text into sentences using NLTK's sentence tokenizer.
Reads from stdin if no input file is specified. Writes to stdout if no output file is specified.
srt_translate
¶
__all__ = ['main', 'srt_translate']
module-attribute
¶
main()
¶
Entry point for the srt-translate CLI tool.
srt_translate
¶
CLI tool for translating SRT subtitle files using tnh-scholar line translation.
This module provides a command line interface for translating SRT subtitle files from one language to another while preserving timecodes and subtitle structure. Uses the same translation engine as the prompt-driven line translator.
logger = get_child_logger(__name__)
module-attribute
¶
SrtEntry
¶
Represents a single subtitle entry from an SRT file.
end_time = end_time
instance-attribute
¶index = index
instance-attribute
¶line_key
property
¶Generate a unique line key for this entry.
start_time = start_time
instance-attribute
¶text = text.strip()
instance-attribute
¶__init__(index, start_time, end_time, text)
¶Initialize subtitle entry with timing and text.
__str__()
¶Format entry as SRT text.
SrtTranslator
¶
Translates SRT files while preserving timecodes.
metadata = metadata
instance-attribute
¶model = model
instance-attribute
¶pattern = pattern
instance-attribute
¶source_language = source_language
instance-attribute
¶target_language = target_language
instance-attribute
¶__init__(source_language=None, target_language='en', pattern=None, model=None, metadata=None)
¶Initialize translator with language, model settings, and metadata.
create_text_object(text)
¶Create a TextObject from the extracted SRT text with metadata.
entries_to_numbered_text(entries)
¶Convert SRT entries to numbered text for TextObject.
extract_translated_lines(translated_object)
¶Extract translated lines from TextObject with line keys.
format_srt(entries)
¶Format entries back to SRT content.
parse_srt(content)
¶Parse SRT content into structured entries.
translate_and_save(input_file, output_path)
¶Handles file reading, translation, and saving.
translate_srt(content)
¶Process SRT content through complete translation pipeline.
translate_text_object(text_object)
¶Translate the TextObject using line translation.
update_entries_with_translations(entries, translations)
¶Apply translations to original entries.
load_metadata_from_file(metadata_file)
¶
Load metadata from a file if provided.
main()
¶
Entry point for the srt-translate CLI tool.
set_output_path(input_file, output, target_language)
¶
set_pattern(pattern)
¶
srt_translate(input_file, output=None, source_language=None, target_language='en', model=None, pattern=None, debug=False, metadata=None)
¶
Translate SRT subtitle files from one language to another.
INPUT_FILE is the path to the SRT file to translate.
tnh_codex_harness
¶
Suspended CLI package for the reference-only Codex harness spike.
tnh_codex_harness
¶
Typer entrypoint for the Codex harness CLI.
app = typer.Typer(name='tnh-codex-harness', help='Standalone Codex API harness.', add_completion=False, no_args_is_help=True)
module-attribute
¶
main()
¶
Dispatch to Typer app.
run_command(task=typer.Option(..., '--task', help='Task for Codex.'), system_prompt=typer.Option(None, '--system-prompt', help='Optional system prompt.'), apply_patch=typer.Option(True, '--apply-patch/--no-apply-patch', help='Apply patch output.'), run_tests_command=typer.Option(None, '--run-tests', help='Test command to run after applying patch.'), model=typer.Option(None, '--model', help='Override the Codex model.'), runs_root=typer.Option(None, '--runs-root', help='Override runs root directory.'), timeout_seconds=typer.Option(None, '--timeout-seconds', help='Timeout for tests.'), max_output_tokens=typer.Option(None, '--max-output-tokens', help='Max output tokens.'), temperature=typer.Option(None, '--temperature', help='Sampling temperature.'), max_tool_rounds=typer.Option(None, '--max-tool-rounds', help='Maximum tool-call rounds to allow.'), use_chat_completions=typer.Option(False, '--use-chat-completions', help='Use Chat Completions API instead of Responses API.'))
¶
Run a single Codex harness execution.
tnh_conductor
¶
CLI package for the maintained tnh-conductor entry point.
__all__ = ['app', 'main']
module-attribute
¶
app = typer.Typer(name='tnh-conductor', help='Maintained local/headless workflow bootstrap runner.', add_completion=False, no_args_is_help=True)
module-attribute
¶
main()
¶
Dispatch to the Typer app.
tnh_conductor
¶
Typer entrypoint for the maintained tnh-conductor CLI.
STATUS_STORE = FilesystemRunArtifactStore()
module-attribute
¶
app = typer.Typer(name='tnh-conductor', help='Maintained local/headless workflow bootstrap runner.', add_completion=False, no_args_is_help=True)
module-attribute
¶
conductor_app()
¶
Expose tnh-conductor as a command group.
main()
¶
Dispatch to the Typer app.
run_command(workflow=typer.Option(..., '--workflow', exists=True, file_okay=True, dir_okay=False, readable=True, resolve_path=True, help='Workflow YAML file to execute.'), repo_root=typer.Option(Path.cwd(), '--repo-root', file_okay=False, dir_okay=True, resolve_path=True, help='Repository root for the managed worktree run.'), runs_root=typer.Option(None, '--runs-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the canonical runs root.'), workspace_root=typer.Option(None, '--workspace-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the managed worktree root.'), base_ref=typer.Option('HEAD', '--base-ref', help='Committed git base ref for the run.'), codex_executable=typer.Option(None, '--codex-executable', exists=True, file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Codex executable.'), claude_executable=typer.Option(None, '--claude-executable', exists=True, file_okay=True, dir_okay=False, resolve_path=True, help='Optional explicit path to the Claude executable.'))
¶
Execute one maintained local/headless bootstrap run.
status_command(run_id=typer.Argument(..., help='Run id to inspect.'), repo_root=typer.Option(Path.cwd(), '--repo-root', file_okay=False, dir_okay=True, resolve_path=True, help='Repository root used to resolve default storage roots.'), runs_root=typer.Option(None, '--runs-root', file_okay=False, dir_okay=True, resolve_path=True, help='Optional override for the canonical runs root.'), watch=typer.Option(False, '--watch', help='Poll and print status snapshots until the run reaches a terminal state.'), poll_interval_seconds=typer.Option(1.0, '--poll-interval-seconds', help='Polling interval in seconds when --watch is enabled.'))
¶
Read the maintained live status artifact for one run.
tnh_conductor_spike
¶
CLI entrypoint package for tnh-conductor-spike.
tnh_conductor_spike
¶
Typer entrypoint for the tnh-conductor-spike CLI.
app = typer.Typer(name='tnh-conductor-spike', help='Phase 0 protocol layer spike runner.', add_completion=False, no_args_is_help=True)
module-attribute
¶
main()
¶
Dispatch to the Typer app.
run_command(agent=typer.Option(..., '--agent', help='Agent identifier (claude-code, codex).'), task=typer.Option(None, '--task', help='Task text for the agent.'), prompt_id=typer.Option(None, '--prompt-id', help='Prompt id for the task.'), timeout_seconds=typer.Option(SpikeDefaults().default_timeout_seconds, '--timeout-seconds', help='Wall-clock timeout.'), idle_timeout_seconds=typer.Option(SpikeDefaults().default_idle_timeout_seconds, '--idle-timeout-seconds', help='Idle timeout.'), heartbeat_interval_seconds=typer.Option(SpikeDefaults().default_heartbeat_interval_seconds, '--heartbeat-interval-seconds', help='Heartbeat interval for progress events.'), work_branch=typer.Option(None, '--work-branch', help='Explicit work branch name.'))
¶
Run a single Phase 0 spike execution.
tnh_gen
¶
tnh-gen CLI package.
__all__ = ['app', 'main']
module-attribute
¶
app = typer.Typer(name='tnh-gen', help='TNH-Gen: Unified CLI for TNH Scholar GenAI operations.', add_completion=False, no_args_is_help=True)
module-attribute
¶
main()
¶
Dispatch execution to the Typer application.
commands
¶
tnh-gen command modules.
config
¶
ConfigValue = str | Path | float | int | None
module-attribute
¶app = typer.Typer(help='Inspect and edit tnh-gen configuration.')
module-attribute
¶get_config_value(key)
¶Retrieve a single config value by key.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
key
|
str
|
Configuration key to fetch. |
required |
list_config_keys()
¶List available configuration keys supported by the CLI.
set_config_value(key=typer.Argument(..., help=f'Config key. Supported: {', '.join(available_keys())}'), value=typer.Argument(..., help='New value for the config key.'), workspace=typer.Option(False, '--workspace', help='Persist to workspace config (.vscode/tnh-scholar.json or .tnh-gen.json).'))
¶Persist a config value to user or workspace scope.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
key
|
str
|
Configuration key to update. |
Argument(..., help=f'Config key. Supported: {join(available_keys())}')
|
value
|
str
|
New value to store. |
Argument(..., help='New value for the config key.')
|
workspace
|
bool
|
Whether to persist to workspace scope. |
Option(False, '--workspace', help='Persist to workspace config (.vscode/tnh-scholar.json or .tnh-gen.json).')
|
show_config(catalog_health=typer.Option(False, '--catalog-health', help='Include aggregated prompt catalog health in the response.'), format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False))
¶Show the effective configuration and its source precedence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
format
|
OutputFormat | None
|
Optional output format override (json or yaml). |
Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False)
|
list
¶
app = typer.Typer(help='List available prompts with metadata.', invoke_without_command=True)
module-attribute
¶list_prompts(tag=typer.Option([], '--tag', help='Filter by tag (repeatable).'), search=typer.Option(None, '--search', help='Search prompt name/description.'), keys_only=typer.Option(False, '--keys-only', help='Output only prompt keys.'), format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, text/table (human-only).', case_sensitive=False))
¶List prompts with optional filters and output formats.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tag
|
list[str]
|
Filter prompts by tag (repeatable). |
Option([], '--tag', help='Filter by tag (repeatable).')
|
search
|
str | None
|
Case-insensitive search across name/description. |
Option(None, '--search', help='Search prompt name/description.')
|
keys_only
|
bool
|
Whether to output only prompt keys. |
Option(False, '--keys-only', help='Output only prompt keys.')
|
format
|
ListOutputFormat | None
|
Desired output format (defaults to global setting). |
Option(None, '--format', help='Output format: json (requires --api), yaml, text/table (human-only).', case_sensitive=False)
|
run
¶
app = typer.Typer(help='Execute a prompt with variable substitution.', invoke_without_command=True)
module-attribute
¶logger = logging.getLogger(__name__)
module-attribute
¶RunContext
dataclass
¶Encapsulates all context needed for prompt execution.
config
instance-attribute
¶config_meta
instance-attribute
¶include_provenance
instance-attribute
¶input_metadata
instance-attribute
¶intent
instance-attribute
¶metadata
instance-attribute
¶model_override
instance-attribute
¶output_file
instance-attribute
¶output_format
instance-attribute
¶prompt_key
instance-attribute
¶quiet
instance-attribute
¶service
instance-attribute
¶trace_id
instance-attribute
¶variables
instance-attribute
¶__init__(prompt_key, config, config_meta, service, metadata, input_metadata, variables, trace_id, model_override, intent, quiet, output_format, output_file, include_provenance)
¶TnhGenCLIOptions
¶Encapsulates all CLI option definitions for the run command.
API = typer.Option(False, '--api', help='Machine-readable API contract output (JSON by default).')
class-attribute
instance-attribute
¶CONFIG = typer.Option(None, '--config', help='Path to config file that overrides user/workspace config.')
class-attribute
instance-attribute
¶FORMAT = typer.Option(None, '--format', help='Output format: json or yaml (API mode only).', case_sensitive=False)
class-attribute
instance-attribute
¶INPUT_FILE = typer.Option(..., '--input-file', help='Input file containing user content.')
class-attribute
instance-attribute
¶INTENT = typer.Option(None, '--intent', help='Intent hint for routing.')
class-attribute
instance-attribute
¶MAX_TOKENS = typer.Option(None, '--max-tokens', help='Maximum output tokens.')
class-attribute
instance-attribute
¶MODEL = typer.Option(None, '--model', help='Model override.')
class-attribute
instance-attribute
¶NO_PROVENANCE = typer.Option(False, '--no-provenance', help='Omit provenance block in files.')
class-attribute
instance-attribute
¶OUTPUT_FILE = typer.Option(None, '--output-file', help='Write result text to file.')
class-attribute
instance-attribute
¶PROMPT = typer.Option(..., '--prompt', help='Prompt key to execute.')
class-attribute
instance-attribute
¶PROMPT_DIR = typer.Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.')
class-attribute
instance-attribute
¶STREAMING = typer.Option(False, '--streaming', help='Enable streaming output (not implemented).')
class-attribute
instance-attribute
¶TEMPERATURE = typer.Option(None, '--temperature', help='Model temperature.')
class-attribute
instance-attribute
¶TOP_P = typer.Option(None, '--top-p', help='Top-p sampling (not yet supported).')
class-attribute
instance-attribute
¶VAR = typer.Option([], '--var', help='Inline variable assignment (repeatable).')
class-attribute
instance-attribute
¶VARS_FILE = typer.Option(None, '--vars', help='JSON file with variable definitions.')
class-attribute
instance-attribute
¶run_prompt(config=TnhGenCLIOptions.CONFIG, api=TnhGenCLIOptions.API, prompt_dir=TnhGenCLIOptions.PROMPT_DIR, prompt=TnhGenCLIOptions.PROMPT, input_file=TnhGenCLIOptions.INPUT_FILE, vars_file=TnhGenCLIOptions.VARS_FILE, var=TnhGenCLIOptions.VAR, model=TnhGenCLIOptions.MODEL, intent=TnhGenCLIOptions.INTENT, max_tokens=TnhGenCLIOptions.MAX_TOKENS, temperature=TnhGenCLIOptions.TEMPERATURE, top_p=TnhGenCLIOptions.TOP_P, output_file=TnhGenCLIOptions.OUTPUT_FILE, format=TnhGenCLIOptions.FORMAT, no_provenance=TnhGenCLIOptions.NO_PROVENANCE, streaming=TnhGenCLIOptions.STREAMING)
¶Execute a prompt with variable substitution and AI processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
Path | None
|
Optional path to an explicit config file. |
CONFIG
|
api
|
bool
|
Whether to emit machine-readable API contract output. |
API
|
prompt_dir
|
Path | None
|
Optional prompt catalog directory override. |
PROMPT_DIR
|
prompt
|
str
|
Key of the prompt to execute. |
PROMPT
|
input_file
|
Path
|
File containing the main user input text. |
INPUT_FILE
|
vars_file
|
Path | None
|
Optional JSON file with additional variables. |
VARS_FILE
|
var
|
list[str]
|
Inline variable assignments ( |
VAR
|
model
|
str | None
|
Optional model override for this run. |
MODEL
|
intent
|
str | None
|
Optional routing intent to pass to the service. |
INTENT
|
max_tokens
|
int | None
|
Max output tokens override. |
MAX_TOKENS
|
temperature
|
float | None
|
Temperature override. |
TEMPERATURE
|
top_p
|
float | None
|
Top-p sampling override (accepted but not applied). |
TOP_P
|
output_file
|
Path | None
|
Optional file to write the rendered text to. |
OUTPUT_FILE
|
format
|
OutputFormat | None
|
Output format for stdout. |
FORMAT
|
no_provenance
|
bool
|
Whether to omit provenance header in written files. |
NO_PROVENANCE
|
streaming
|
bool
|
Whether to request streaming (not yet implemented). |
STREAMING
|
version
¶
version(format=typer.Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False))
¶Display version information for tnh-gen and dependencies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
format
|
OutputFormat | None
|
Optional output format override (json or yaml). |
Option(None, '--format', help='Output format: json (requires --api), yaml, or text (human-only).', case_sensitive=False)
|
config_loader
¶
CLIConfig
¶
Bases: BaseModel
CLI configuration modeled with Pydantic for consistency with OS blueprint.
api_key = None
class-attribute
instance-attribute
¶cli_path = None
class-attribute
instance-attribute
¶default_model = None
class-attribute
instance-attribute
¶default_temperature = None
class-attribute
instance-attribute
¶max_dollars = None
class-attribute
instance-attribute
¶max_input_chars = None
class-attribute
instance-attribute
¶prompt_catalog_dir = Field(default=None)
class-attribute
instance-attribute
¶with_overrides(overrides)
¶Return a new config with non-null override values applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
overrides
|
ConfigData
|
Mapping of override keys to values. |
required |
Returns:
| Type | Description |
|---|---|
'CLIConfig'
|
New |
available_keys()
¶
Return the list of supported config keys.
Returns:
| Type | Description |
|---|---|
list[str]
|
List of available configuration keys. |
load_config(config_path=None, *, cwd=None, overrides=None, prompt_dir=None)
¶
Load CLI configuration with clear precedence and metadata.
The effective config is built in this order: defaults/env → user config →
workspace config → explicit config_path → CLI overrides → explicit
prompt_dir override. Overrides that are None are ignored to avoid
clobbering previous values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config_path
|
Path | None
|
Optional explicit config file to load. |
None
|
cwd
|
Path | None
|
Working directory for resolving workspace config paths. |
None
|
overrides
|
ConfigData | None
|
In-memory override values (e.g., CLI flags). |
None
|
prompt_dir
|
Path | None
|
Optional prompt catalog directory override. |
None
|
Returns:
| Type | Description |
|---|---|
tuple[CLIConfig, ConfigMeta]
|
Tuple of validated |
Raises:
| Type | Description |
|---|---|
ValueError
|
If any referenced config file contains invalid JSON. |
load_config_overrides(config_path=None, *, cwd=None)
¶
Load only user/workspace/explicit config overrides (no defaults).
persist_config_value(key, value, *, workspace=False, cwd=None)
¶
Persist a single config value to the user or workspace config file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
key
|
ConfigKey
|
Configuration key to update. |
required |
value
|
Any
|
Value to persist. |
required |
workspace
|
bool
|
Whether to target workspace scope instead of user scope. |
False
|
cwd
|
Path | None
|
Working directory for resolving workspace path. |
None
|
Returns:
| Type | Description |
|---|---|
Path
|
Path to the file that was written. |
Raises:
| Type | Description |
|---|---|
KeyError
|
If the key is not supported. |
errors
¶
ExitCode
¶
Bases: IntEnum
CLI exit codes mapped to error classes.
FORMAT_ERROR = 4
class-attribute
instance-attribute
¶INPUT_ERROR = 5
class-attribute
instance-attribute
¶POLICY_ERROR = 1
class-attribute
instance-attribute
¶PROVIDER_ERROR = 3
class-attribute
instance-attribute
¶SUCCESS = 0
class-attribute
instance-attribute
¶TRANSPORT_ERROR = 2
class-attribute
instance-attribute
¶
emit_trace_id(trace_id, error_code)
¶
Emit a trace identifier to stderr for diagnostics.
error_response(exc, *, error_code=None, suggestion=None, trace_id)
¶
Construct a serialized error response and matching exit code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
exc
|
Exception
|
The caught exception. |
required |
error_code
|
str | None
|
Optional explicit error code to surface in diagnostics. |
None
|
suggestion
|
str | None
|
Optional user-facing recovery suggestion. |
None
|
trace_id
|
str
|
Unique trace identifier for tracking this CLI request. |
required |
Returns:
| Type | Description |
|---|---|
Tuple[ErrorPayload, ExitCode]
|
A tuple containing the response payload and associated exit code. |
exit_with_error(exc, *, trace_id, format_override=None)
¶
Render error output, emit trace, and exit with mapped status.
map_exception(exc)
¶
Map a raised exception to a stable CLI exit code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
exc
|
Exception
|
Exception raised during CLI execution. |
required |
Returns:
| Type | Description |
|---|---|
ExitCode
|
ExitCode representing the failure category. |
render_error(exc, *, trace_id, format_override=None, suggestion=None)
¶
Render error output based on API vs human mode.
factory
¶
DefaultServiceFactory
¶
Default factory bridging CLI config to GenAIService.
create_genai_service(cli_config, overrides)
¶Create a fully configured GenAI service instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cli_config
|
CLIConfig
|
Effective CLI configuration. |
required |
overrides
|
ServiceOverrides
|
Execution-time overrides for model and token behavior. |
required |
Returns:
| Type | Description |
|---|---|
GenAIServiceProtocol
|
GenAIServiceProtocol implementation bound to current settings. |
ServiceFactory
¶
Bases: Protocol
Factory protocol for constructing GenAI services.
create_genai_service(cli_config, overrides)
¶Create a GenAI service given CLI config and overrides.
ServiceOverrides
dataclass
¶
cli_config_to_settings_kwargs(cli_config, overrides)
¶
Translate CLI configuration into kwargs for GenAI service settings.
output
¶
Output helpers for tnh-gen, including formatting policy utilities.
formatter
¶
format_table(headers, rows)
¶Render a simple fixed-width table for CLI display.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
headers
|
list[str]
|
Column headers. |
required |
rows
|
Iterable[list[str]]
|
Row data to render. |
required |
Returns:
| Type | Description |
|---|---|
str
|
Rendered table string. |
render_output(payload, fmt)
¶Serialize payload to the requested output format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
payload
|
Any
|
Data to serialize. |
required |
fmt
|
OutputFormat | ListOutputFormat
|
Output format enum selection. |
required |
Returns:
| Type | Description |
|---|---|
str
|
Serialized string representation for CLI display. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the requested format is unsupported. |
human_formatter
¶
LABELS = HumanOutputLabels()
module-attribute
¶HumanOutputLabels
dataclass
¶Display labels for human-friendly CLI output.
error_prefix = 'Error: '
class-attribute
instance-attribute
¶header_template = 'Available Prompts ({count})'
class-attribute
instance-attribute
¶metadata_separator = ' | '
class-attribute
instance-attribute
¶no_default_model = '(no default)'
class-attribute
instance-attribute
¶no_tags = '(no tags)'
class-attribute
instance-attribute
¶no_variables = '(none)'
class-attribute
instance-attribute
¶suggestion_prefix = 'Suggestion: '
class-attribute
instance-attribute
¶variable_prefix = ' Variables: '
class-attribute
instance-attribute
¶__init__(no_variables='(none)', no_default_model='(no default)', no_tags='(no tags)', header_template='Available Prompts ({count})', variable_prefix=' Variables: ', metadata_separator=' | ', error_prefix='Error: ', suggestion_prefix='Suggestion: ')
¶OutputColor
¶
Bases: str, Enum
ANSI color codes for human-friendly CLI output.
ERROR = 'red'
class-attribute
instance-attribute
¶MODEL = 'green'
class-attribute
instance-attribute
¶SUGGESTION = 'yellow'
class-attribute
instance-attribute
¶TAGS = 'yellow'
class-attribute
instance-attribute
¶TITLE = 'bright_blue'
class-attribute
instance-attribute
¶VARIABLES = 'cyan'
class-attribute
instance-attribute
¶format_human_friendly_error(error, suggestion=None)
¶Format errors for human-readable CLI output.
format_human_friendly_list(prompts)
¶Format prompt metadata for human-readable CLI output.
policy
¶
resolve_list_format(*, api, format_override, ctx_format)
¶Resolve list output format with API-aware defaults.
resolve_output_format(*, api, format_override, default_format)
¶Resolve output format with API-aware defaults.
validate_global_format(api, format_override)
¶Validate global format flags shared across commands.
validate_list_format(api, format_override)
¶Validate list format combinations.
validate_run_format(api, format_override)
¶Validate run format combinations.
provenance
¶
provenance_block(envelope, *, source_metadata=None, trace_id, prompt_version)
¶Build a YAML frontmatter block capturing provenance for saved files.
provenance_metadata(envelope, *, source_metadata=None, trace_id, prompt_version)
¶Build merged provenance metadata for persisted sidecars or headers.
sidecar_path(path)
¶Return the provenance sidecar path for a structured output artifact.
write_output_file(path, *, result_text, envelope, source_metadata=None, trace_id, prompt_version, include_provenance, structured_output=False)
¶Write result text to disk, optionally prefixing provenance metadata.
state
¶
ctx = CLIContext()
module-attribute
¶
CLIContext
dataclass
¶
Holds shared CLI state populated by the Typer callback.
api = False
class-attribute
instance-attribute
¶config_path = None
class-attribute
instance-attribute
¶no_color = False
class-attribute
instance-attribute
¶output_format = None
class-attribute
instance-attribute
¶quiet = False
class-attribute
instance-attribute
¶service_factory = None
class-attribute
instance-attribute
¶__init__(config_path=None, output_format=None, api=False, quiet=False, no_color=False, service_factory=None)
¶
ListOutputFormat
¶
tnh_gen
¶
Typer entrypoint for the tnh-gen CLI.
app = typer.Typer(name='tnh-gen', help='TNH-Gen: Unified CLI for TNH Scholar GenAI operations.', add_completion=False, no_args_is_help=True)
module-attribute
¶
cli_callback(click_ctx, config=typer.Option(None, '--config', help='Path to config file that overrides user/workspace config.'), prompt_dir=typer.Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.'), format=typer.Option(None, '--format', help='Output format for commands (json/yaml for API output; text/yaml for human output).', case_sensitive=False), api=typer.Option(False, '--api', help='Machine-readable API contract output (JSON by default).'), quiet=typer.Option(False, '--quiet', '-q', help='Suppress non-error output.'), no_color=typer.Option(False, '--no-color', help='Disable colored output.'))
¶
Apply global options and initialize shared context.
Default behavior: human-friendly output optimized for interactive CLI use. Use --api for machine-readable JSON contract output.
Examples:
tnh-gen list tnh-gen --api list tnh-gen --prompt-dir ./my-prompts list tnh-gen run --prompt daily --input-file notes.md tnh-gen --api run --prompt daily --input-file notes.md
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
Optional[Path]
|
Optional path to an explicit config file. |
Option(None, '--config', help='Path to config file that overrides user/workspace config.')
|
prompt_dir
|
Path | None
|
Optional prompt catalog directory override. |
Option(None, '--prompt-dir', help='Override the prompt catalog directory for this invocation.')
|
format
|
OutputFormat | None
|
Output format override for commands. |
Option(None, '--format', help='Output format for commands (json/yaml for API output; text/yaml for human output).', case_sensitive=False)
|
api
|
bool
|
Whether to emit machine-readable API contract output. |
Option(False, '--api', help='Machine-readable API contract output (JSON by default).')
|
quiet
|
bool
|
Whether to suppress non-error output. |
Option(False, '--quiet', '-q', help='Suppress non-error output.')
|
no_color
|
bool
|
Whether to disable colored terminal output. |
Option(False, '--no-color', help='Disable colored output.')
|
main()
¶
Dispatch execution to the Typer application.
types
¶
ConfigKey = Literal['prompt_catalog_dir', 'default_model', 'max_dollars', 'max_input_chars', 'default_temperature', 'api_key', 'cli_path']
module-attribute
¶
DefaultVariables = Mapping[str, Any]
module-attribute
¶
PolicyApplied = Mapping[str, Any]
module-attribute
¶
RunOutcomePayload = RunSuccessPayload | RunIncompletePayload | RunFailedPayload
module-attribute
¶
VariableMap = MutableMapping[str, Any]
module-attribute
¶
ConfigData
¶
ConfigShowPayload
¶
ConfigUpdateApiPayload
¶
ConfigValuePayload
¶
ErrorDiagnostics
¶
ErrorPayload
¶
HumanEntry
¶
ListApiEntry
¶
Bases: TypedDict
default_model
instance-attribute
¶default_variables
instance-attribute
¶description
instance-attribute
¶key
instance-attribute
¶name
instance-attribute
¶optional_variables
instance-attribute
¶output_mode
instance-attribute
¶required_variables
instance-attribute
¶tags
instance-attribute
¶version
instance-attribute
¶warnings
instance-attribute
¶
ListApiPayload
¶
RunAdapterDiagnosticsPayload
¶
RunBasePayload
¶
RunFailedPayload
¶
Bases: RunBasePayload
RunFailurePayload
¶
RunIncompletePayload
¶
Bases: RunBasePayload
RunProvenancePayload
¶
RunResultPayload
¶
RunSuccessPayload
¶
Bases: RunBasePayload
RunUsagePayload
¶
SettingsKwargs
¶
VersionHumanPayload
¶
tnh_lines
¶
Typer CLI for line numbering helpers.
tnh_lines
¶
app = typer.Typer(name='tnh-lines', help='Prepare numbered text for sectioning workflows and convert it back to plain text.', add_completion=False, no_args_is_help=True)
module-attribute
¶
main()
¶
Dispatch execution to the Typer application.
number_command(input_file=typer.Argument(..., help='Plain text source file.'), output_file=typer.Argument(..., help='Numbered output path.'), start=typer.Option(1, '--start', help='Starting line number.'), separator=typer.Option(':', '--separator', help='Line-number separator.'), no_clobber=typer.Option(False, '--no-clobber', help='Fail if the output file already exists.'))
¶
Write numbered text in N:LINE format.
unnumber_command(input_file=typer.Argument(..., help='Numbered source file.'), output_file=typer.Argument(..., help='Plain-text output path.'), no_clobber=typer.Option(False, '--no-clobber', help='Fail if the output file already exists.'))
¶
Strip numbering and write plain text.
tnh_setup
¶
__all__ = ['main', 'tnh_setup']
module-attribute
¶
main()
¶
prompt_display
¶
tnh_setup
¶
Legacy Click-based tnh-setup entrypoint.
Prefer tnh_setup_typer.py for the maintained CLI implementation.
This compatibility path is retained temporarily and should not diverge.
OPENAI_ENV_HELP_MSG = "\n>>>>>>>>>> OpenAI API key not found in environment. <<<<<<<<<\n\nFor AI processing with TNH-scholar:\n\n1. Get an API key from https://platform.openai.com/api-keys\n2. Set the OPENAI_API_KEY environment variable:\n\n export OPENAI_API_KEY='your-api-key-here' # Linux/Mac\n set OPENAI_API_KEY=your-api-key-here # Windows\n\nFor OpenAI API access help: https://platform.openai.com/\n\n>>>>>>>>>>>>>>>>>>>>>>>>>>> -- <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n"
module-attribute
¶
SetupPaths
dataclass
¶
create_config_dirs(paths)
¶
Create required configuration directories.
main()
¶
Entry point for setup CLI tool.
maybe_check_environment(*, skip_env)
¶
Load env and report missing OpenAI configuration.
maybe_setup_ytdlp_runtime(*, skip_ytdlp_runtime)
¶
Prompt for and run yt-dlp runtime setup.
report_prompt_setup(paths, *, skip_prompts)
¶
Report prompt directory setup without external downloads.
tnh_setup(skip_env, skip_prompts, skip_ytdlp_runtime)
¶
Set up TNH Scholar configuration.
tnh_setup_typer
¶
app = typer.Typer(add_completion=False, no_args_is_help=False)
module-attribute
¶
PromptDecision
dataclass
¶
SetupConfig
dataclass
¶
main()
¶
tnh_setup(skip_env=typer.Option(False, help='Skip OpenAI API key check.'), skip_prompts=typer.Option(False, help='Skip prompt directory setup guidance.'), skip_ytdlp_runtime=typer.Option(False, help='Skip yt-dlp runtime setup.'), verify_only=typer.Option(False, help='Only run environment verification.'), assume_yes=typer.Option(False, '--yes', '-y', help='Assume yes for all prompts.'), no_input=typer.Option(False, help='Fail if a prompt would be required.'))
¶
Set up TNH Scholar configuration.
tnh_tree
¶
Developer tool for the tnh-scholar project.
This legacy utility generates repository tree snapshots for manual developer reference. It is no longer part of routine CI or release validation.
main()
¶
CLI entry point registered as tnh-tree.
token_count
¶
utils
¶
T = TypeVar('T')
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
handle_cli_exception(message, exc)
¶
Convert unexpected errors to Click-friendly messages.
run_or_fail(message, operation)
¶
Execute an operation and re-raise failures as Click exceptions to avoid stack traces.
ytt_fetch
¶
__all__ = ['main', 'ytt_fetch']
module-attribute
¶
main()
¶
ytt_fetch
¶
Simple CLI tool for retrieving video transcripts.
This module provides a command line interface for downloading video transcripts in specified languages. It uses yt-dlp for video info extraction.
logger = get_child_logger(__name__)
module-attribute
¶
cleanup_files(keep, filepath)
¶
export_data(output_path, data)
¶
export_ttml_data(metadata, ttml_path, no_embed, output_path, keep)
¶
generate_metadata(service, url, keep, output_path)
¶
generate_transcript(service, url, lang, keep, no_embed, output_path)
¶
get_ttml_download(dl, url, lang, output_path)
¶
main()
¶
ytt_fetch(url, lang, keep, info, no_embed, output)
¶
YouTube Transcript Fetch: Retrieve and save transcripts for a Youtube video using yt-dlp.
configuration
¶
Configuration utilities for TNH Scholar.
context
¶
Runtime context discovery and path resolution.
ContextIdFactory
¶
Generates correlation and session identifiers.
build(correlation_id, session_id)
¶
PromptDirectoryNames
dataclass
¶
PromptPathBuilder
¶
RegistryCategory
¶
RegistryPathBuilder
¶
TNHContext
dataclass
¶
Resolved runtime context for TNH Scholar.
builtin_root
instance-attribute
¶
correlation_id
instance-attribute
¶
session_id
instance-attribute
¶
user_root
instance-attribute
¶
workspace_root
instance-attribute
¶
__init__(builtin_root, workspace_root, user_root, correlation_id, session_id)
¶
discover(*, workspace_root=None, user_root=None, correlation_id=None, session_id=None, start_path=None)
classmethod
¶
get_primary_prompt_dir()
¶
Return the highest-precedence prompt directory that exists.
get_prompt_search_paths()
¶
Return valid prompt directories in precedence order.
get_registry_search_paths(registry_type)
¶
WorkspaceDiscoveryPolicy
dataclass
¶
exceptions
¶
__all__ = ['TnhScholarError', 'ConfigurationError', 'ValidationError', 'ExternalServiceError', 'RateLimitError', 'NotRetryable', 'MetadataConflictError', 'SectionBoundaryError']
module-attribute
¶
ConfigurationError
¶
ExternalServiceError
¶
MetadataConflictError
¶
Bases: ValidationError
Raised when metadata merge encounters key conflicts in FAIL_ON_CONFLICT mode.
NotRetryable
¶
RateLimitError
¶
SectionBoundaryError
¶
Bases: ValidationError
Raised when section boundaries have gaps, overlaps, or out-of-bounds errors.
Note: Implementation is in text_object.py to avoid circular imports. This entry exists for documentation and to reserve the error name.
TnhScholarError
¶
ValidationError
¶
Bases: TnhScholarError
Input/data validation errors (precondition failures before calling providers).
journal_processing
¶
__all__ = ['batch_section', 'batch_translate', 'generate_clean_batch', 'save_cleaned_data', 'save_sectioning_data', 'save_translation_data', 'setup_logger']
module-attribute
¶
batch_section(input_xml_path, batch_jsonl, system_message, journal_name)
¶
Split journal content into sections using GPT, with retries for starting and completing the batch.
batch_translate(input_xml_path, batch_json_path, metadata_path, system_message, journal_name)
¶
Translates the journal sections using the GPT model. Saves the translated content back to XML.
generate_clean_batch(input_xml_file, output_file, system_message, user_wrap_function)
¶
Generate a batch file for the OpenAI (OA) API using a single input XML file.
save_cleaned_data(cleaned_xml_path, cleaned_wrapped_pages, journal_name)
¶
save_sectioning_data(output_json_path, raw_output_path, serial_json, journal_name)
¶
save_translation_data(xml_output_path, translation_data, journal_name)
¶
setup_logger(log_file_path)
¶
Configures the logger to write to a log file and the console. Adds a custom "PRIORITY_INFO" logging level for important messages.
journal_process
¶
BATCH_RETRY_DELAY = 5
module-attribute
¶
DEFAULT_JOURNAL_MODEL = 'gpt-4o'
module-attribute
¶
DEFAULT_MODEL_SETTINGS = {'gpt-4o': {'max_tokens': 16000, 'temperature': 1.0}, 'gpt-3.5-turbo': {'max_tokens': 4096, 'temperature': 1.0}, 'gpt-4o-mini': {'max_tokens': 16000, 'temperature': 1.0}}
module-attribute
¶
MAX_BATCH_RETRIES = 40
module-attribute
¶
MAX_TOKEN_LIMIT = 60000
module-attribute
¶
journal_schema = {'type': 'object', 'properties': {'journal_summary': {'type': 'string'}, 'sections': {'type': 'array', 'items': {'type': 'object', 'properties': {'title_vi': {'type': 'string'}, 'title_en': {'type': 'string'}, 'author': {'type': ['string', 'null']}, 'summary': {'type': 'string'}, 'keywords': {'type': 'array', 'items': {'type': 'string'}}, 'start_page': {'type': 'integer', 'minimum': 1}, 'end_page': {'type': 'integer', 'minimum': 1}}, 'required': ['title_vi', 'title_en', 'summary', 'keywords', 'start_page', 'end_page']}}}, 'required': ['journal_summary', 'sections']}
module-attribute
¶
logger = logging.getLogger('journal_process')
module-attribute
¶
batch_section(input_xml_path, batch_jsonl, system_message, journal_name)
¶
Split journal content into sections using GPT, with retries for starting and completing the batch.
batch_translate(input_xml_path, batch_json_path, metadata_path, system_message, journal_name)
¶
Translates the journal sections using the GPT model. Saves the translated content back to XML.
create_jsonl_file_for_batch(messages, output_file_path=None, max_token_list=None, model=DEFAULT_JOURNAL_MODEL, tools=None, json_mode=False)
¶
Write a JSONL batch file mirroring the legacy OpenAI format.
deserialize_json(serialized_data)
¶
Converts a serialized JSON string into a Python dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
serialized_data
|
str
|
The JSON string to deserialize. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
dict |
dict
|
The deserialized Python dictionary. |
extract_page_groups_from_metadata(metadata)
¶
Extracts page groups from the section metadata for use with split_xml_pages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metadata
|
dict
|
The section metadata containing sections with start and end pages. |
required |
Returns:
| Type | Description |
|---|---|
list
|
List[Tuple[int, int]]: A list of tuples, each representing a page range (start_page, end_page). |
generate_all_batches(processed_document_dir, system_message, user_wrap_function, file_regex='.*\\.xml')
¶
Generate cleaning batches for all journals in the specified directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
processed_document_dir
|
str
|
Path to the directory containing processed journal data. |
required |
system_message
|
str
|
System message template for batch processing. |
required |
user_wrap_function
|
callable
|
Function to wrap user input for processing pages. |
required |
file_regex
|
str
|
Regex pattern to identify target files (default: ".*.xml"). |
'.*\\.xml'
|
generate_clean_batch(input_xml_file, output_file, system_message, user_wrap_function)
¶
Generate a batch file for the OpenAI (OA) API using a single input XML file.
generate_messages(system_message, user_message_wrapper, data_list_to_process, log_system_message=True)
¶
Build OpenAI-style chat message payloads.
generate_single_oa_batch_from_pages(input_xml_file, output_file, system_message, user_wrap_function)
¶
*** Deprecated *** Generate a batch file for the OpenAI (OA) API using a single input XML file.
run_immediate_chat_process(messages, max_tokens=0, response_format=None, model=DEFAULT_JOURNAL_MODEL)
¶
Legacy-compatible immediate completion powered by GenAI simple_completion.
save_cleaned_data(cleaned_xml_path, cleaned_wrapped_pages, journal_name)
¶
save_sectioning_data(output_json_path, raw_output_path, serial_json, journal_name)
¶
save_translation_data(xml_output_path, translation_data, journal_name)
¶
send_data_for_tx_batch(batch_jsonl_path, section_data_to_send, system_message, max_token_list, journal_name, immediate=False)
¶
Sends data for translation batch or immediate processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
batch_jsonl_path
|
Path
|
Path for the JSONL file to save batch data. |
required |
section_data_to_send
|
List
|
List of section data to translate. |
required |
system_message
|
str
|
System message for the translation process. |
required |
max_token_list
|
List
|
List of max tokens for each section. |
required |
journal_name
|
str
|
Name of the journal being processed. |
required |
immediate
|
bool
|
If True, run immediate chat processing instead of batch. |
False
|
Returns:
| Name | Type | Description |
|---|---|---|
List |
list
|
Translated data from the batch or immediate process. |
setup_logger(log_file_path)
¶
Configures the logger to write to a log file and the console. Adds a custom "PRIORITY_INFO" logging level for important messages.
start_batch_with_retries(jsonl_file, description='', max_retries=MAX_BATCH_RETRIES, retry_delay=BATCH_RETRY_DELAY, poll_interval=10, timeout=3600)
¶
Simulate the legacy batch runner using sequential simple_completion calls.
The parameters mirror the old interface so callers remain unchanged, but the implementation now iterates through the JSONL requests locally.
translate_sections(batch_jsonl_path, system_message, section_contents, section_metadata, journal_name, immediate=False)
¶
build up sections in batches to translate
unwrap_all_lines(pages)
¶
unwrap_lines(text)
¶
Removes angle brackets (< >) from encapsulated lines and merges them into
a newline-separated string.
Parameters:
text (str): The input string with encapsulated lines.
Returns:
str: A newline-separated string with the encapsulation removed.
Example:
>>> merge_encapsulated_lines("<Line 1> <Line 2> <Line 3>")
'Line 1
Line 2
Line 3'
>>> merge_encapsulated_lines("
validate_and_clean_data(data, schema)
¶
Recursively validate and clean AI-generated data to fit the given schema. Any missing fields are filled with defaults, and extra fields are ignored.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
dict
|
The AI-generated data to validate and clean. |
required |
schema
|
dict
|
The schema defining the required structure. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
dict |
dict
|
The cleaned data adhering to the schema. |
validate_and_save_metadata(output_file_path, json_metadata_serial, schema)
¶
Validates and cleans journal data against the schema, then writes it to a JSON file.
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if successfully written to the file, False otherwise. |
wrap_all_lines(pages)
¶
wrap_lines(text)
¶
Encloses each line of the input text with angle brackets.
Args:
text (str): The input string containing lines separated by '
'.
Returns:
str: A string where each line is enclosed in angle brackets.
Example:
>>> enclose_lines("This is a string with
two lines.")
'
logging_config
¶
TNH-Scholar Logging Utilities¶
A production-ready, environment-driven logging system for the TNH-Scholar project. It provides JSON logs in production, color/plain text in development, optional non-blocking queue logging, file rotation, noise suppression for chatty deps, and optional routing of Python warnings into the logging pipeline.
This module is designed for application layer configuration and library layer usage:
- Applications (CLI, Streamlit, FastAPI, notebooks) call :func:
setup_logging. - Libraries / services (e.g., gen_ai_service, IssueHandler) only acquire a
logger via :func:
get_logger(or legacy :func:get_child_logger) and never configure global logging.
Quick start¶
Application entry point (recommended):
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging() # reads env; see variables below
>>> log = get_logger(__name__)
>>> log.info("app started", extra={"service": "gen-ai"})
Jupyter / dev (force color in non-TTY):
>>> import os
>>> os.environ["APP_ENV"] = "dev"
>>> os.environ["LOG_JSON"] = "false"
>>> os.environ["LOG_COLOR"] = "true"] # Jupyter isn't a TTY; force color
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("hello, color")
Library / service modules (do NOT configure logging):
>>> from tnh_scholar.logging_config import get_logger
>>> log = get_logger(__name__)
>>> log.info("library message")
Behavior by environment¶
- dev (default):
- Plain or color text to stdout by default.
- Queue logging disabled by default (synchronous).
- Color auto-detects TTY and Jupyter/IPython (can be forced).
- prod:
- JSON logs to stderr by default (suitable for log shippers).
- Queue logging enabled by default (can be disabled).
Environment variables¶
Most behavior is controlled by environment variables (read when setup_logging()
instantiates :class:LogSettings). Truthy values accept true/1/yes/on
(case-insensitive).
APP_ENV:dev|prod|test(default:dev)LOG_LEVEL: Logging level for the base project logger (default:INFO)LOG_STDOUT: Emit logs to stdout (default:true)LOG_FILE_ENABLE: Emit logs to a file (default:false)LOG_FILE_PATH: File path for logs (default:./logs/main.log)LOG_ROTATE_BYTES: Rotate at N bytes (e.g., 10485760) (default: unset)LOG_ROTATE_WHEN: Timed rotation (e.g.,midnight) (default: unset)LOG_BACKUPS: Number of rotated file backups (default:5)LOG_JSON: Use JSON formatter (recommended in prod) (default:true)LOG_COLOR:true|false|auto(default:auto)LOG_STREAM:stdout|stderr(default:stderr; dev defaults tostdout)LOG_USE_QUEUE: Use QueueHandler/QueueListener (default:true; dev defaults tofalse)LOG_CAPTURE_WARNINGS: Route Python warnings via logging (default:false)LOG_SUPPRESS: Comma-separated list of noisy module names to set to WARNING (default includesurllib3,httpx,openai,uvicorn.*, etc.)
Backward compatibility¶
get_child_logger(name, console=False, separate_file=False)remains available and can attach ad-hoc console/file handlers without reconfiguring the project base logger. When custom handlers are attached, the child’s propagation is turned off to avoid duplicate messages.setup_logging_legacy(...)forwards to :func:setup_loggingand emits a DeprecationWarning to help locate legacy call sites.-
Custom level
PRIORITY_INFO(25) and :meth:logger.priority_infostill exist but are deprecated. Prefer:log.info("message", extra={"priority": "high"})
This keeps level semantics standard and plays better with structured logging.
Queue logging notes¶
- When
LOG_USE_QUEUE=true, the base logger uses a :class:QueueHandler. A :class:QueueListeneris started with sinks mirroring your configured stdout/file handlers. This decouples log emission from I/O to minimize latency. -
In notebooks or during debugging, you may prefer synchronous logs:
os.environ["LOG_USE_QUEUE"] = "false"
Python warnings routing¶
- When
LOG_CAPTURE_WARNINGS=true, Python warnings are captured and logged throughpy.warnings. This module attaches the base logger’s handlers to that logger and disables propagation to avoid duplicate output.
Mixing print() and logging¶
print()writes to stdout; the logger can write to stdout or stderr depending onLOG_STREAMand environment. Ordering is not guaranteed, especially with queue logging enabled. Prefer logging for consistent output.
Minimal examples¶
CLI / entrypoint:
>>> import os
>>> os.environ.setdefault("APP_ENV", "prod")
>>> os.environ.setdefault("LOG_JSON", "true")
>>> from tnh_scholar.logging_config import setup_logging, get_logger
>>> setup_logging()
>>> get_logger(__name__).info("ready")
File logging with rotation:
>>> import os
>>> os.environ.update({
... "LOG_FILE_ENABLE": "true",
... "LOG_FILE_PATH": "./logs/app.log",
... "LOG_ROTATE_BYTES": "10485760", # 10MB
... "LOG_BACKUPS": "7",
... })
>>> setup_logging()
>>> get_logger("smoke").info("to file")
Jupyter with color:
>>> import os
>>> os.environ.update({"APP_ENV": "dev", "LOG_JSON": "false", "LOG_COLOR": "true"})
>>> setup_logging()
>>> get_logger(__name__).info("color in notebook")
Notes¶
- JSON formatting requires
python-json-logger; without it, we fall back to plain/color format automatically. - This module never configures the root logger; it configures the project
base logger (
tnh) so your app can coexist with other libraries cleanly.
BASE_LOG_DIR = Path('./logs')
module-attribute
¶
BASE_LOG_NAME = 'tnh'
module-attribute
¶
DEFAULT_CONSOLE_FORMAT_STRING = LOG_FMT_COLOR
module-attribute
¶
DEFAULT_FILE_FORMAT_STRING = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
module-attribute
¶
DEFAULT_LOG_FILEPATH = Path('main.log')
module-attribute
¶
JsonFormatter = getattr(_pythonjsonlogger_json, 'JsonFormatter', None)
module-attribute
¶
LOG_COLORS = {'DEBUG': 'bold_green', 'INFO': 'cyan', 'PRIORITY_INFO': 'bold_cyan', 'WARNING': 'bold_yellow', 'ERROR': 'bold_red', 'CRITICAL': 'bold_red'}
module-attribute
¶
LOG_FMT_COLOR = '%(asctime)s | %(log_color)s%(levelname)-8s%(reset)s | %(name)s | %(message)s'
module-attribute
¶
LOG_FMT_JSON = '%(asctime)s %(levelname)s %(name)s %(message)s %(process)d %(thread)d %(module)s %(filename)s %(lineno)d'
module-attribute
¶
LOG_FMT_PLAIN = '%(asctime)s | %(levelname)-8s | %(name)s | %(message)s'
module-attribute
¶
MAX_FILE_SIZE = 10 * 1024 * 1024
module-attribute
¶
PRIORITY_INFO_LEVEL = 25
module-attribute
¶
__all__ = ['BASE_LOG_NAME', 'BASE_LOG_DIR', 'DEFAULT_LOG_FILEPATH', 'MAX_FILE_SIZE', 'OMPFilter', 'setup_logging', 'setup_logging_legacy', 'get_logger', 'get_child_logger']
module-attribute
¶
LogSettings
dataclass
¶
Environment-driven logging settings with sensible defaults.
backups = field(default_factory=(lambda: _env_int('LOG_BACKUPS', 5)))
class-attribute
instance-attribute
¶
base_name = field(default_factory=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME)))
class-attribute
instance-attribute
¶
capture_warnings = field(default_factory=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false')))
class-attribute
instance-attribute
¶
colorize = field(default_factory=(lambda: _env_str('LOG_COLOR', 'auto')))
class-attribute
instance-attribute
¶
environment = field(default_factory=(lambda: _env_str('APP_ENV', 'dev')))
class-attribute
instance-attribute
¶
file_path = field(default_factory=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH)))))
class-attribute
instance-attribute
¶
json_format = field(default_factory=(lambda: _env_bool('LOG_JSON', 'true')))
class-attribute
instance-attribute
¶
level = field(default_factory=(lambda: _env_str('LOG_LEVEL', 'INFO')))
class-attribute
instance-attribute
¶
log_stream = field(default_factory=(lambda: _env_str('LOG_STREAM', 'stderr')))
class-attribute
instance-attribute
¶
rotate_bytes = field(default_factory=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None))
class-attribute
instance-attribute
¶
rotate_when = field(default_factory=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None))
class-attribute
instance-attribute
¶
suppress_modules = field(default_factory=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access')))
class-attribute
instance-attribute
¶
to_file = field(default_factory=(lambda: _env_bool('LOG_FILE_ENABLE', 'false')))
class-attribute
instance-attribute
¶
to_stdout = field(default_factory=(lambda: _env_bool('LOG_STDOUT', 'true')))
class-attribute
instance-attribute
¶
use_queue = field(default_factory=(lambda: _env_bool('LOG_USE_QUEUE', 'true')))
class-attribute
instance-attribute
¶
__init__(environment=(lambda: _env_str('APP_ENV', 'dev'))(), base_name=(lambda: _env_str('LOG_BASE', BASE_LOG_NAME))(), level=(lambda: _env_str('LOG_LEVEL', 'INFO'))(), to_stdout=(lambda: _env_bool('LOG_STDOUT', 'true'))(), to_file=(lambda: _env_bool('LOG_FILE_ENABLE', 'false'))(), file_path=(lambda: Path(_env_str('LOG_FILE_PATH', str(BASE_LOG_DIR / DEFAULT_LOG_FILEPATH))))(), rotate_when=(lambda: _env_str('LOG_ROTATE_WHEN', '') or None)(), rotate_bytes=(lambda: _env_int('LOG_ROTATE_BYTES', 0) or None)(), backups=(lambda: _env_int('LOG_BACKUPS', 5))(), json_format=(lambda: _env_bool('LOG_JSON', 'true'))(), colorize=(lambda: _env_str('LOG_COLOR', 'auto'))(), capture_warnings=(lambda: _env_bool('LOG_CAPTURE_WARNINGS', 'false'))(), log_stream=(lambda: _env_str('LOG_STREAM', 'stderr'))(), use_queue=(lambda: _env_bool('LOG_USE_QUEUE', 'true'))(), suppress_modules=(lambda: _env_str('LOG_SUPPRESS', 'urllib3,httpx,openai,botocore,boto3,asyncio,uvicorn,uvicorn.error,uvicorn.access'))())
¶
__post_init__()
¶
is_dev()
¶
selected_stream()
¶
Return the Python stream object to emit logs to (stdout or stderr).
should_color()
¶
LoggingConfigurator
¶
settings = settings or LogSettings()
instance-attribute
¶
__init__(settings=None)
¶
apply_config(config)
¶
apply_legacy_args(*, log_level, log_filepath, max_log_file_size, backup_count, console)
¶
build_config(*, filters, formatters, handlers)
¶
build_filters()
¶
build_formatters()
¶
build_handlers(formatters)
¶
configure(*, legacy_args, suppressed_modules)
¶
select_base_handlers(handlers)
¶
start_queue_listener(handlers)
¶
suppress_noise(modules_override, force=False)
¶
UtcFormatter
¶
get_child_logger(name, console=False, separate_file=False)
¶
Get a child logger that writes logs to a console or a specified file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
The name of the child logger (e.g., module name). |
required |
console
|
bool
|
If True, log to the console. If False, do not log to the console. If None, inherit console behavior from the parent logger. |
False
|
Returns:
| Type | Description |
|---|---|
Logger
|
logging.Logger: Configured child logger. |
get_logger(name)
¶
Preferred helper: returns a namespaced logger under the base project name.
Backwards-compatible with existing call sites that used get_child_logger(name).
priority_info(self, message, *args, **kwargs)
¶
Deprecated: use logger.info(msg, extra={"priority": "high"}) instead.
This custom level (25) was introduced for highlighting important informational
events, but it complicates interoperability with external log shippers and
structured log processing. The recommended migration path is to log at the
standard INFO level with an added extra field indicating priority.
Example
logger.info("Important event", extra={"priority": "high"})
setup_logging(log_level=logging.INFO, log_filepath=DEFAULT_LOG_FILEPATH, max_log_file_size=MAX_FILE_SIZE, backup_count=5, console=True, suppressed_modules=None, *, settings=None)
¶
Initialize project-wide logging using dictConfig, with JSON in prod and colorized/plain text in dev.
Backward compatible with previous signature. Prefer using env vars or pass a LogSettings via the
keyword-only settings parameter.
setup_logging_legacy(*args, **kwargs)
¶
Deprecated: use setup_logging().
This wrapper preserves old call sites during migration. It emits a DeprecationWarning (once per process) and forwards all arguments to the current setup_logging().
metadata
¶
__all__ = ['Frontmatter', 'Metadata', 'ProcessMetadata']
module-attribute
¶
Frontmatter
¶
Handles YAML frontmatter embedding and extraction.
Note: extract is pure (no I/O). extract_from_file performs I/O and should be
treated as adapter-level convenience, not domain-level parsing.
embed(metadata, content)
classmethod
¶
Embed metadata as YAML frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metadata
|
Metadata
|
Dictionary of metadata |
required |
content
|
str
|
Content text |
required |
Returns:
| Type | Description |
|---|---|
str
|
Text with embedded frontmatter |
extract(content)
staticmethod
¶
Extract frontmatter and content from text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
content
|
str
|
Text with optional YAML frontmatter |
required |
Returns:
| Type | Description |
|---|---|
tuple[Metadata, str]
|
Tuple of (metadata object, remaining content) |
extract_from_file(file)
classmethod
¶
Adapter-level convenience wrapper that reads from disk then parses.
generate(metadata)
staticmethod
¶
Metadata
¶
Bases: MutableMapping
Flexible metadata container that behaves like a dict while ensuring JSON serializability. Designed for AI processing pipelines where schema flexibility is prioritized over structure.
process_history
property
¶
Access process history with proper typing.
__delitem__(key)
¶
__get_pydantic_core_schema__(source_type, handler)
classmethod
¶
Defines the Pydantic core schema for the Metadata class.
This method allows Pydantic to validate Metadata objects as dictionaries.
It handles both direct Metadata instances and dictionaries during validation,
providing flexibility for data input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_type
|
Any
|
The source type being validated. |
required |
handler
|
Callable[[Any], CoreSchema]
|
A callable to handle schema generation for other types. |
required |
Returns:
| Type | Description |
|---|---|
CoreSchema
|
A Pydantic core schema that validates either a Metadata instance |
CoreSchema
|
(by converting it to a dictionary) or a standard dictionary. |
__getitem__(key)
¶
__init__(data=None)
¶
__ior__(other)
¶
__iter__()
¶
__len__()
¶
__or__(other)
¶
__repr__()
¶
__ror__(other)
¶
__setitem__(key, value)
¶
Process and set value, ensuring JSON serializability.
__str__()
¶
add_process_info(process_metadata)
¶
Add process metadata to history.
copy()
¶
Create a deep copy of the metadata object.
from_dict(data)
classmethod
¶
Create from a plain dict.
from_fields(data, fields)
classmethod
¶
Create a Metadata object by extracting specified fields from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
dict
|
Source dictionary |
required |
fields
|
list[str]
|
List of field names to extract |
required |
Returns:
| Type | Description |
|---|---|
Metadata
|
New Metadata instance with only specified fields |
from_yaml(yaml_str)
classmethod
¶
Create Metadata instance from YAML string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
yaml_str
|
str
|
YAML formatted string |
required |
Returns:
| Type | Description |
|---|---|
Metadata
|
New Metadata instance |
Raises:
| Type | Description |
|---|---|
YAMLError
|
If YAML parsing fails |
text_embed(content)
¶
to_dict()
¶
Convert to plain dict for JSON serialization.
to_yaml()
¶
Return metadata as YAML formatted string
ProcessMetadata
¶
metadata
¶
JsonValue = Union[str, int, float, bool, list, dict, None]
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
Frontmatter
¶
Handles YAML frontmatter embedding and extraction.
Note: extract is pure (no I/O). extract_from_file performs I/O and should be
treated as adapter-level convenience, not domain-level parsing.
embed(metadata, content)
classmethod
¶
Embed metadata as YAML frontmatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metadata
|
Metadata
|
Dictionary of metadata |
required |
content
|
str
|
Content text |
required |
Returns:
| Type | Description |
|---|---|
str
|
Text with embedded frontmatter |
extract(content)
staticmethod
¶
Extract frontmatter and content from text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
content
|
str
|
Text with optional YAML frontmatter |
required |
Returns:
| Type | Description |
|---|---|
tuple[Metadata, str]
|
Tuple of (metadata object, remaining content) |
extract_from_file(file)
classmethod
¶
Adapter-level convenience wrapper that reads from disk then parses.
generate(metadata)
staticmethod
¶
Metadata
¶
Bases: MutableMapping
Flexible metadata container that behaves like a dict while ensuring JSON serializability. Designed for AI processing pipelines where schema flexibility is prioritized over structure.
process_history
property
¶
Access process history with proper typing.
__delitem__(key)
¶
__get_pydantic_core_schema__(source_type, handler)
classmethod
¶
Defines the Pydantic core schema for the Metadata class.
This method allows Pydantic to validate Metadata objects as dictionaries.
It handles both direct Metadata instances and dictionaries during validation,
providing flexibility for data input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_type
|
Any
|
The source type being validated. |
required |
handler
|
Callable[[Any], CoreSchema]
|
A callable to handle schema generation for other types. |
required |
Returns:
| Type | Description |
|---|---|
CoreSchema
|
A Pydantic core schema that validates either a Metadata instance |
CoreSchema
|
(by converting it to a dictionary) or a standard dictionary. |
__getitem__(key)
¶
__init__(data=None)
¶
__ior__(other)
¶
__iter__()
¶
__len__()
¶
__or__(other)
¶
__repr__()
¶
__ror__(other)
¶
__setitem__(key, value)
¶
Process and set value, ensuring JSON serializability.
__str__()
¶
add_process_info(process_metadata)
¶
Add process metadata to history.
copy()
¶
Create a deep copy of the metadata object.
from_dict(data)
classmethod
¶
Create from a plain dict.
from_fields(data, fields)
classmethod
¶
Create a Metadata object by extracting specified fields from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
dict
|
Source dictionary |
required |
fields
|
list[str]
|
List of field names to extract |
required |
Returns:
| Type | Description |
|---|---|
Metadata
|
New Metadata instance with only specified fields |
from_yaml(yaml_str)
classmethod
¶
Create Metadata instance from YAML string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
yaml_str
|
str
|
YAML formatted string |
required |
Returns:
| Type | Description |
|---|---|
Metadata
|
New Metadata instance |
Raises:
| Type | Description |
|---|---|
YAMLError
|
If YAML parsing fails |
text_embed(content)
¶
to_dict()
¶
Convert to plain dict for JSON serialization.
to_yaml()
¶
Return metadata as YAML formatted string
ProcessMetadata
¶
safe_yaml_load(yaml_str, *, context='unknown')
¶
ocr_processing
¶
__all__ = ['PDFParseWarning', 'annotate_image_with_text', 'build_processed_pdf', 'deserialize_entity_annotations_from_json', 'extract_image_from_page', 'get_page_dimensions', 'load_pdf_pages', 'load_processed_PDF_data', 'make_image_preprocess_mask', 'pil_to_bytes', 'process_page', 'process_single_image', 'save_processed_pdf_data', 'serialize_entity_annotations_to_json', 'start_image_annotator_client']
module-attribute
¶
PDFParseWarning
¶
Bases: Warning
Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.
warn(message)
staticmethod
¶
Display a warning message with custom formatting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
message
|
str
|
The warning message to display. |
required |
annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)
¶
Annotates a PIL image with bounding boxes and text descriptions from OCR results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The input PIL image to annotate. |
required |
text_annotations
|
List[EntityAnnotation]
|
OCR results containing bounding boxes and text. |
required |
annotation_font_path
|
str
|
Path to the font file for text annotations. |
required |
font_size
|
int
|
Font size for text annotations. |
12
|
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The annotated PIL image. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the input image is None. |
IOError
|
If the font file cannot be loaded. |
Exception
|
For any other unexpected errors. |
build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)
¶
Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pdf_path
|
Path
|
Path to the PDF file. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
annotation_font_path
|
Path
|
Path to the font file for annotations. |
DEFAULT_ANNOTATION_FONT_PATH
|
Returns:
| Type | Description |
|---|---|
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]
|
Tuple[List[str], List[List[vision.EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- List of extracted full-page texts (one entry per page).
- List of word locations (list of |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified PDF file does not exist. |
ValueError
|
If the PDF file is invalid or contains no pages. |
Exception
|
For any unexpected errors during processing. |
Example
from pathlib import Path from google.cloud import vision pdf_path = Path("/path/to/example.pdf") font_path = Path("/path/to/fonts/Arial.ttf") client = vision.ImageAnnotatorClient() try: text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf( pdf_path, client, font_path ) print(f"Processed {len(text_pages)} pages successfully!") except Exception as e: print(f"Error processing PDF: {e}")
deserialize_entity_annotations_from_json(data)
¶
Deserializes JSON data into a nested list of EntityAnnotation objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
str
|
The JSON string containing serialized annotations. |
required |
Returns:
| Type | Description |
|---|---|
List[List[EntityAnnotation]]
|
List[List[EntityAnnotation]]: The reconstructed nested list of EntityAnnotation objects. |
extract_image_from_page(page)
¶
Extracts the first image from the given PDF page and returns it as a PIL Image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
The PDF page object. |
required |
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The first image on the page as a Pillow Image object. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no images are found on the page or the image data is incomplete. |
Exception
|
For unexpected errors during image extraction. |
Example
import fitz from PIL import Image doc = fitz.open("/path/to/document.pdf") page = doc.load_page(0) # Load the first page try: image = extract_image_from_page(page) image.show() # Display the image except Exception as e: print(f"Error extracting image: {e}")
get_page_dimensions(page)
¶
Extracts the width and height of a single PDF page in both inches and pixels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
A single PDF page object from PyMuPDF. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
dict |
dict
|
A dictionary containing the width and height of the page in inches and pixels. |
load_pdf_pages(pdf_path)
¶
Opens the PDF document and returns the fitz Document object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pdf_path
|
Path
|
The path to the PDF file. |
required |
Returns:
| Type | Description |
|---|---|
Document
|
fitz.Document: The loaded PDF document. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified file does not exist. |
ValueError
|
If the file is not a valid PDF document. |
Exception
|
For any unexpected error. |
Example
from pathlib import Path pdf_path = Path("/path/to/example.pdf") try: pdf_doc = load_pdf_pages(pdf_path) print(f"PDF contains {pdf_doc.page_count} pages.") except Exception as e: print(f"Error loading PDF: {e}")
load_processed_PDF_data(base_path)
¶
Loads processed PDF data from files using metadata for file references.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Base path where processed assets are stored. |
required |
Returns:
| Type | Description |
|---|---|
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]
|
Tuple[List[str], List[List[EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- Loaded text pages.
- Word locations (list of |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If any required files are missing. |
ValueError
|
If the metadata file is incomplete or invalid. |
make_image_preprocess_mask(mask_height)
¶
Creates a preprocessing function that masks a specified height at the bottom of the image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
mask_height
|
float
|
The proportion of the image height to mask at the bottom (0.0 to 1.0). |
required |
Returns:
| Type | Description |
|---|---|
Callable[[Image, int], Image]
|
Callable[[Image.Image, int], Image.Image]: A preprocessing function that takes an image |
Callable[[Image, int], Image]
|
and page number as input and returns the processed image. |
pil_to_bytes(image, format='PNG')
¶
Converts a Pillow image to raw bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The Pillow image object to convert. |
required |
format
|
str
|
The format to save the image as (e.g., "PNG", "JPEG"). Default is "PNG". |
'PNG'
|
Returns:
| Name | Type | Description |
|---|---|---|
bytes |
bytes
|
The raw bytes of the image. |
process_page(page, client, annotation_font_path, preprocessor=None)
¶
Processes a single PDF page, extracting text, word locations, and annotated images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
The PDF page object. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
preprocessor
|
Callable[[Image, int], Image]
|
Preprocessing function for the image. |
None
|
annotation_font_path
|
str
|
Path to the font file for annotations. |
required |
Returns:
| Type | Description |
|---|---|
Tuple[str, List[EntityAnnotation], Image, Image, dict]
|
Tuple[str, List[vision.EntityAnnotation], Image.Image, Image.Image, dict]: - Full page text (str) - Word locations (List of vision.EntityAnnotation) - Annotated image (Pillow Image object) - Original unprocessed image (Pillow Image object) - Page dimensions (dict) |
process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)
¶
Processes a single image with the Google Vision API and returns text annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The preprocessed Pillow image object. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
feature_type
|
str
|
Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION'). |
DEFAULT_ANNOTATION_METHOD
|
language_hints
|
List
|
Language hints for OCR. |
DEFAULT_ANNOTATION_LANGUAGE_HINTS
|
Returns:
| Type | Description |
|---|---|
Any
|
List[vision.EntityAnnotation]: Text annotations from the Vision API response. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no text is detected. |
save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)
¶
Saves processed PDF data to files for later reloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_dir
|
Path
|
Directory to save the data (as a Path object). |
required |
journal_name
|
str
|
Name for the output directory (usually the PDF name without extension). |
required |
text_pages
|
List[str]
|
Extracted full-page text. |
required |
word_locations
|
List[List[EntityAnnotation]]
|
Word locations and annotations from Vision API. |
required |
annotated_images
|
List[Image]
|
Annotated images with bounding boxes. |
required |
unannotated_images
|
List[Image]
|
Raw unannotated images. |
required |
Returns:
| Type | Description |
|---|---|
None
|
None |
serialize_entity_annotations_to_json(annotations)
¶
Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
annotations
|
List[List[EntityAnnotation]]
|
The nested list of EntityAnnotation objects. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
The serialized data in JSON format as a string. |
start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)
¶
Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
credentials_file
|
str
|
Path to the credentials JSON file. If None, uses the default environment variable. |
None
|
api_endpoint
|
str
|
Custom API endpoint for the Vision API. Default is the global endpoint. |
'vision.googleapis.com'
|
timeout
|
Tuple[int, int]
|
Connection and read timeouts in seconds. Default is (10, 30). |
(10, 30)
|
enable_logging
|
bool
|
Enable detailed logging for debugging. Default is False. |
False
|
Returns:
| Type | Description |
|---|---|
ImageAnnotatorClient
|
vision.ImageAnnotatorClient: Configured Vision API client. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified credentials file is not found. |
Exception
|
For unexpected errors during client setup. |
Example
client = start_image_annotator_client( credentials_file="/path/to/credentials.json", api_endpoint="vision.googleapis.com", timeout=(10, 30), enable_logging=True ) print("Google Vision API client initialized.")
ocr_editor
¶
current_image = st.session_state.current_image
module-attribute
¶
current_page_index = st.session_state.current_page_index
module-attribute
¶
current_text = pages[current_page_index]
module-attribute
¶
edited_text = st.text_area('Edit OCR Text', value=(st.session_state.current_text), key=f'text_area_{st.session_state.current_page_index}', height=400)
module-attribute
¶
image_directory = st.sidebar.text_input('Image Directory', value='./images')
module-attribute
¶
ocr_text_directory = st.sidebar.text_input('OCR Text Directory', value='./ocr_text')
module-attribute
¶
pages = st.session_state.pages
module-attribute
¶
save_path = os.path.join(ocr_text_directory, 'updated_ocr.xml')
module-attribute
¶
tree = st.session_state.tree
module-attribute
¶
uploaded_image_file = st.sidebar.file_uploader('Upload an Image', type=['jpg', 'jpeg', 'png', 'pdf'])
module-attribute
¶
uploaded_text_file = st.sidebar.file_uploader('Upload OCR Text File', type=['xml'])
module-attribute
¶
extract_pages(tree)
¶
Extract page data from the XML tree.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tree
|
ElementTree
|
Parsed XML tree. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
list |
list
|
A list of dictionaries containing 'number' and 'text' for each page. |
load_xml(file_obj)
¶
Load an XML file from a file-like object.
save_xml(tree, file_path)
¶
Save the modified XML tree to a file.
ocr_processing
¶
DEFAULT_ANNOTATION_FONT_PATH = Path('/System/Library/Fonts/Supplemental/Arial.ttf')
module-attribute
¶
DEFAULT_ANNOTATION_FONT_SIZE = 12
module-attribute
¶
DEFAULT_ANNOTATION_LANGUAGE_HINTS = ['vi']
module-attribute
¶
DEFAULT_ANNOTATION_METHOD = 'DOCUMENT_TEXT_DETECTION'
module-attribute
¶
DEFAULT_ANNOTATION_OFFSET = 2
module-attribute
¶
logger = logging.getLogger('ocr_processing')
module-attribute
¶
PDFParseWarning
¶
Bases: Warning
Custom warning class for PDF parsing issues. Encapsulates minimal logic for displaying warnings with a custom format.
warn(message)
staticmethod
¶
Display a warning message with custom formatting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
message
|
str
|
The warning message to display. |
required |
annotate_image_with_text(image, text_annotations, annotation_font_path, font_size=12)
¶
Annotates a PIL image with bounding boxes and text descriptions from OCR results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The input PIL image to annotate. |
required |
text_annotations
|
List[EntityAnnotation]
|
OCR results containing bounding boxes and text. |
required |
annotation_font_path
|
str
|
Path to the font file for text annotations. |
required |
font_size
|
int
|
Font size for text annotations. |
12
|
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The annotated PIL image. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the input image is None. |
IOError
|
If the font file cannot be loaded. |
Exception
|
For any other unexpected errors. |
build_processed_pdf(pdf_path, client, preprocessor=None, annotation_font_path=DEFAULT_ANNOTATION_FONT_PATH)
¶
Processes a PDF document, extracting text, word locations, annotated images, and unannotated images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pdf_path
|
Path
|
Path to the PDF file. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
annotation_font_path
|
Path
|
Path to the font file for annotations. |
DEFAULT_ANNOTATION_FONT_PATH
|
Returns:
| Type | Description |
|---|---|
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]
|
Tuple[List[str], List[List[vision.EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- List of extracted full-page texts (one entry per page).
- List of word locations (list of |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified PDF file does not exist. |
ValueError
|
If the PDF file is invalid or contains no pages. |
Exception
|
For any unexpected errors during processing. |
Example
from pathlib import Path from google.cloud import vision pdf_path = Path("/path/to/example.pdf") font_path = Path("/path/to/fonts/Arial.ttf") client = vision.ImageAnnotatorClient() try: text_pages, word_locations_list, annotated_images, unannotated_images = build_processed_pdf( pdf_path, client, font_path ) print(f"Processed {len(text_pages)} pages successfully!") except Exception as e: print(f"Error processing PDF: {e}")
deserialize_entity_annotations_from_json(data)
¶
Deserializes JSON data into a nested list of EntityAnnotation objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
str
|
The JSON string containing serialized annotations. |
required |
Returns:
| Type | Description |
|---|---|
List[List[EntityAnnotation]]
|
List[List[EntityAnnotation]]: The reconstructed nested list of EntityAnnotation objects. |
extract_image_from_page(page)
¶
Extracts the first image from the given PDF page and returns it as a PIL Image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
The PDF page object. |
required |
Returns:
| Type | Description |
|---|---|
Image
|
Image.Image: The first image on the page as a Pillow Image object. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no images are found on the page or the image data is incomplete. |
Exception
|
For unexpected errors during image extraction. |
Example
import fitz from PIL import Image doc = fitz.open("/path/to/document.pdf") page = doc.load_page(0) # Load the first page try: image = extract_image_from_page(page) image.show() # Display the image except Exception as e: print(f"Error extracting image: {e}")
get_page_dimensions(page)
¶
Extracts the width and height of a single PDF page in both inches and pixels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
A single PDF page object from PyMuPDF. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
dict |
dict
|
A dictionary containing the width and height of the page in inches and pixels. |
load_pdf_pages(pdf_path)
¶
Opens the PDF document and returns the fitz Document object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pdf_path
|
Path
|
The path to the PDF file. |
required |
Returns:
| Type | Description |
|---|---|
Document
|
fitz.Document: The loaded PDF document. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified file does not exist. |
ValueError
|
If the file is not a valid PDF document. |
Exception
|
For any unexpected error. |
Example
from pathlib import Path pdf_path = Path("/path/to/example.pdf") try: pdf_doc = load_pdf_pages(pdf_path) print(f"PDF contains {pdf_doc.page_count} pages.") except Exception as e: print(f"Error loading PDF: {e}")
load_processed_PDF_data(base_path)
¶
Loads processed PDF data from files using metadata for file references.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_path
|
Path
|
Base path where processed assets are stored. |
required |
Returns:
| Type | Description |
|---|---|
Tuple[List[str], List[List[EntityAnnotation]], List[Image], List[Image]]
|
Tuple[List[str], List[List[EntityAnnotation]], List[Image.Image], List[Image.Image]]:
- Loaded text pages.
- Word locations (list of |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If any required files are missing. |
ValueError
|
If the metadata file is incomplete or invalid. |
make_image_preprocess_mask(mask_height)
¶
Creates a preprocessing function that masks a specified height at the bottom of the image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
mask_height
|
float
|
The proportion of the image height to mask at the bottom (0.0 to 1.0). |
required |
Returns:
| Type | Description |
|---|---|
Callable[[Image, int], Image]
|
Callable[[Image.Image, int], Image.Image]: A preprocessing function that takes an image |
Callable[[Image, int], Image]
|
and page number as input and returns the processed image. |
pil_to_bytes(image, format='PNG')
¶
Converts a Pillow image to raw bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The Pillow image object to convert. |
required |
format
|
str
|
The format to save the image as (e.g., "PNG", "JPEG"). Default is "PNG". |
'PNG'
|
Returns:
| Name | Type | Description |
|---|---|---|
bytes |
bytes
|
The raw bytes of the image. |
process_page(page, client, annotation_font_path, preprocessor=None)
¶
Processes a single PDF page, extracting text, word locations, and annotated images.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
page
|
Page
|
The PDF page object. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
preprocessor
|
Callable[[Image, int], Image]
|
Preprocessing function for the image. |
None
|
annotation_font_path
|
str
|
Path to the font file for annotations. |
required |
Returns:
| Type | Description |
|---|---|
Tuple[str, List[EntityAnnotation], Image, Image, dict]
|
Tuple[str, List[vision.EntityAnnotation], Image.Image, Image.Image, dict]: - Full page text (str) - Word locations (List of vision.EntityAnnotation) - Annotated image (Pillow Image object) - Original unprocessed image (Pillow Image object) - Page dimensions (dict) |
process_single_image(image, client, feature_type=DEFAULT_ANNOTATION_METHOD, language_hints=DEFAULT_ANNOTATION_LANGUAGE_HINTS)
¶
Processes a single image with the Google Vision API and returns text annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image
|
Image
|
The preprocessed Pillow image object. |
required |
client
|
ImageAnnotatorClient
|
Google Vision API client for text detection. |
required |
feature_type
|
str
|
Type of text detection to use ('TEXT_DETECTION' or 'DOCUMENT_TEXT_DETECTION'). |
DEFAULT_ANNOTATION_METHOD
|
language_hints
|
List
|
Language hints for OCR. |
DEFAULT_ANNOTATION_LANGUAGE_HINTS
|
Returns:
| Type | Description |
|---|---|
Any
|
List[vision.EntityAnnotation]: Text annotations from the Vision API response. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no text is detected. |
save_processed_pdf_data(output_dir, journal_name, text_pages, word_locations, annotated_images, unannotated_images)
¶
Saves processed PDF data to files for later reloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_dir
|
Path
|
Directory to save the data (as a Path object). |
required |
journal_name
|
str
|
Name for the output directory (usually the PDF name without extension). |
required |
text_pages
|
List[str]
|
Extracted full-page text. |
required |
word_locations
|
List[List[EntityAnnotation]]
|
Word locations and annotations from Vision API. |
required |
annotated_images
|
List[Image]
|
Annotated images with bounding boxes. |
required |
unannotated_images
|
List[Image]
|
Raw unannotated images. |
required |
Returns:
| Type | Description |
|---|---|
None
|
None |
serialize_entity_annotations_to_json(annotations)
¶
Serializes a nested list of EntityAnnotation objects into a JSON-compatible format using Base64 encoding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
annotations
|
List[List[EntityAnnotation]]
|
The nested list of EntityAnnotation objects. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
The serialized data in JSON format as a string. |
start_image_annotator_client(credentials_file=None, api_endpoint='vision.googleapis.com', timeout=(10, 30), enable_logging=False)
¶
Starts and returns a Google Vision API ImageAnnotatorClient with optional configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
credentials_file
|
str
|
Path to the credentials JSON file. If None, uses the default environment variable. |
None
|
api_endpoint
|
str
|
Custom API endpoint for the Vision API. Default is the global endpoint. |
'vision.googleapis.com'
|
timeout
|
Tuple[int, int]
|
Connection and read timeouts in seconds. Default is (10, 30). |
(10, 30)
|
enable_logging
|
bool
|
Enable detailed logging for debugging. Default is False. |
False
|
Returns:
| Type | Description |
|---|---|
ImageAnnotatorClient
|
vision.ImageAnnotatorClient: Configured Vision API client. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the specified credentials file is not found. |
Exception
|
For unexpected errors during client setup. |
Example
client = start_image_annotator_client( credentials_file="/path/to/credentials.json", api_endpoint="vision.googleapis.com", timeout=(10, 30), enable_logging=True ) print("Google Vision API client initialized.")
prompt_system
¶
Prompt system package scaffolding per ADR-PT04.
Modules will provide object-service compliant prompt catalog, rendering, and validation.
adapters
¶
Prompt catalog adapters.
filesystem_catalog_adapter
¶
Filesystem-backed prompt catalog adapter.
frontmatter_fallback
¶
Shared helpers for resilient prompt body extraction.
extract_best_effort_body(content)
¶
Return prompt body even when frontmatter is missing or malformed.
git_catalog_adapter
¶
Git-backed prompt catalog adapter.
config
¶
Configuration models and policies for the prompt system.
policy
¶
Behavior policies controlling prompt rendering and validation.
PromptRenderPolicy
¶
Bases: BaseModel
Policy for prompt rendering precedence and behavior.
allow_undefined_vars = False
class-attribute
instance-attribute
¶merge_strategy = 'override'
class-attribute
instance-attribute
¶policy_version = '1.0'
class-attribute
instance-attribute
¶precedence_order = ['caller_context', 'frontmatter_defaults', 'settings_defaults']
class-attribute
instance-attribute
¶
ValidationPolicy
¶
Bases: BaseModel
Policy controlling validation strictness.
domain
¶
Domain models and protocols for prompt handling.
models
¶
Domain models for the prompt system.
CatalogHealth
¶
Bases: BaseModel
Aggregated prompt catalog health report.
CatalogIssue
¶
CatalogIssueType
¶
InputStrictness
¶
Message
¶
Prompt
¶
PromptArtifactSpec
¶
PromptInputSpec
¶
Bases: BaseModel
Prompt input declaration.
description = None
class-attribute
instance-attribute
¶name
instance-attribute
¶required = False
class-attribute
instance-attribute
¶source = None
class-attribute
instance-attribute
¶strictness = InputStrictness.loose
class-attribute
instance-attribute
¶type = None
class-attribute
instance-attribute
¶
PromptMetadata
¶
Bases: BaseModel
Prompt front matter metadata.
content_flags = Field(default_factory=list)
class-attribute
instance-attribute
¶created_at = None
class-attribute
instance-attribute
¶default_model = None
class-attribute
instance-attribute
¶default_variables = Field(default_factory=dict)
class-attribute
instance-attribute
¶description
instance-attribute
¶input_contract_ref = None
class-attribute
instance-attribute
¶inputs = Field(default_factory=list)
class-attribute
instance-attribute
¶key = ''
class-attribute
instance-attribute
¶model_config = ConfigDict(extra='forbid')
class-attribute
instance-attribute
¶name
instance-attribute
¶optional_variables = Field(default_factory=list)
class-attribute
instance-attribute
¶output_contract = None
class-attribute
instance-attribute
¶output_contract_ref = None
class-attribute
instance-attribute
¶output_mode = None
class-attribute
instance-attribute
¶pii_handling = None
class-attribute
instance-attribute
¶prompt_id = None
class-attribute
instance-attribute
¶required_variables = Field(default_factory=list)
class-attribute
instance-attribute
¶role = None
class-attribute
instance-attribute
¶safety_level = None
class-attribute
instance-attribute
¶schema_version = '1.0'
class-attribute
instance-attribute
¶tags = Field(default_factory=list)
class-attribute
instance-attribute
¶updated_at = None
class-attribute
instance-attribute
¶version
instance-attribute
¶warnings = Field(default_factory=list)
class-attribute
instance-attribute
¶canonical_key()
¶Return canonical key without version suffix.
immutable_ref()
¶Return immutable prompt reference key.v
resolved_output_mode()
¶Return normalized platform output mode.
PromptOutputContract
¶
PromptOutputMode
¶
PromptValidationResult
¶
Bases: BaseModel
Result of prompt validation.
valid is maintained for API ergonomics and derived from errors to keep a
single source of truth.
RenderParams
¶
Bases: BaseModel
Per-call rendering parameters.
RenderedPrompt
¶
mappers
¶
Mappers for translating transport data to domain models and back.
prompt_mapper
¶
Mapper for translating prompt files to domain models.
PromptMapper
¶
Maps transport-layer prompt data into domain objects.
to_domain_prompt(file_content, source_key=None)
¶Map raw file content (including front matter) to a Prompt.
to_file_request(key, base_path)
¶Map prompt key to a filesystem path for transport.
to_key_from_path(path, base_path)
¶Map a prompt file path to canonical key.
Absolute paths are relativized only when they live under base_path.
Relative paths also strip a matching relative base_path prefix, which
keeps tnh-prompts/foo.md and foo.md equivalent for callers such
as --prompt-dir ./tnh-prompts. Paths outside the base are preserved.
service
¶
Prompt system services (rendering, validation, loading).
contract_schema
¶
Prompt contract schema resolution and validation.
SCHEMA_DIRECTORY_PARTS = ('schemas', 'prompt-contracts')
module-attribute
¶
SCHEMA_SUFFIX = '.schema.json'
module-attribute
¶
PromptContractSchemaResolver
¶
Resolve and validate prompt-contract JSON Schema artifacts.
__init__(context)
¶for_prompt_directory(prompts_base)
classmethod
¶Build a resolver using runtime-context discovery for a prompt directory.
resolve(schema_ref)
¶Resolve a schema_ref to the highest-precedence schema file.
resolve_validated(schema_ref)
¶Resolve a schema_ref and confirm the artifact is valid JSON Schema.
search_roots()
¶Return schema search roots in workspace/user/built-in precedence.
validate_instance(resolved, payload)
¶Validate a JSON payload against a resolved schema.
ResolvedPromptContractSchema
¶
format_contract_validation_error(*, schema_ref, error)
¶
Build a user-facing contract validation failure message.
loader
¶
Prompt loader orchestration service.
PromptLoader
¶
Responsible for preparing prompts (parse + validate).
__init__(validator)
¶parse_error_issue(prompt_key, message)
¶Build a fatal issue for unreadable or invalid frontmatter.
validate(prompt)
¶Validate prompt using configured validator.
validation_issues(prompt_key, validation)
¶Convert validation errors into fatal catalog issues.
warning_issues(prompt_key, warnings)
¶Convert non-fatal prompt warnings into catalog issues.
renderer
¶
Prompt rendering service.
PromptRenderer
¶
Bases: PromptRendererPort
Renders prompts using configured policy.
validator
¶
Prompt validation service.
PromptValidator
¶
Bases: PromptValidatorPort
Validates prompt metadata and render parameters.
text_processing
¶
__all__ = ['bracket_lines', 'unbracket_lines', 'lines_from_bracketed_text', 'NumberedText', 'normalize_newlines', 'clean_text']
module-attribute
¶
NumberedText
¶
Immutable container for text documents with numbered lines.
Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.
Immutability Note
NumberedText is designed to be used immutably after construction. While not enforced at runtime (for performance reasons as a low-level container), instances should not be modified after creation. All operations return new data rather than mutating the instance.
Whitespace and Blank Line Handling (Monaco Editor as standard for compatibility): NumberedText follows Monaco Editor's verbatim line and whitespace handling. Monaco Editor: https://microsoft.github.io/monaco-editor/typedoc/interfaces/IRange.html
- Blank lines: Preserved as empty strings in the lines list
- Whitespace: Leading/trailing whitespace preserved (never stripped)
- Line count: Blank lines count as lines (e.g., "a\n\nb" has 3 lines)
- Indexing: 1-based line numbers with inclusive end semantics (Monaco IRange)
Numbered Input Detection:
When input contains line numbers (e.g., "1: foo\n2:\n3: bar"):
- Pattern validation: Only non-blank lines validated for sequential numbering
- Number extraction: Removes number prefix (e.g., "2: ") from all lines
- Blank line handling: After number removal, blank lines become empty strings
- Example: "1: foo\n2:\n3: bar" → lines=[' foo', '', ' bar']
Attributes:
| Name | Type | Description |
|---|---|---|
lines |
List[str]
|
List of text lines (do not modify after construction) |
start |
int
|
Starting line number (do not modify after construction) |
separator |
str
|
Separator between line number and content (do not modify after construction) |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
content
property
¶
Get original text without line numbers.
end
property
¶
lines = []
instance-attribute
¶
numbered_content
property
¶
Get text with line numbers as a string. Equivalent to str(self)
numbered_lines
property
¶
Get list of lines with line numbers included.
Returns:
| Type | Description |
|---|---|
List[str]
|
List[str]: Lines with numbers and separator prefixed |
Examples:
>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
- Unlike str(self), this returns a list rather than joined string
- Maintains consistent formatting with separator
- Useful for processing or displaying individual numbered lines
separator = separator
instance-attribute
¶
size
property
¶
Get the number of lines.
start = start
instance-attribute
¶
LineSegment
dataclass
¶
Represents a segment of lines with start and end indices in 1-based indexing.
The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.
Attributes:
| Name | Type | Description |
|---|---|---|
start |
int
|
Starting line number (inclusive, 1-based) |
end |
int
|
Ending line number (exclusive, 1-based) |
SegmentIterator
¶
Iterator for generating line segments of specified size.
Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.
Attributes:
| Name | Type | Description |
|---|---|---|
total_lines |
Total number of lines in text |
|
segment_size |
Number of lines per segment |
|
start_line |
Starting line number (1-based) |
|
min_segment_size |
Minimum size for the final segment |
min_segment_size = min_segment_size
instance-attribute
¶
num_segments = (remaining_lines + segment_size - 1) // segment_size
instance-attribute
¶
segment_size = segment_size
instance-attribute
¶
start_line = start_line
instance-attribute
¶
total_lines = total_lines
instance-attribute
¶
__init__(total_lines, segment_size, start_line=1, min_segment_size=None)
¶
Initialize the segment iterator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
total_lines
|
int
|
Total number of lines to iterate over |
required |
segment_size
|
int
|
Desired size of each segment |
required |
start_line
|
int
|
First line number (default: 1) |
1
|
min_segment_size
|
Optional[int]
|
Minimum size for final segment (default: None) If specified, the last segment will be merged with the previous one if it would be smaller than this size. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If segment_size < 1 or total_lines < 1 |
ValueError
|
If start_line < 1 (must use 1-based indexing) |
ValueError
|
If min_segment_size >= segment_size |
__iter__()
¶
Iterate over line segments.
Yields:
| Type | Description |
|---|---|
LineSegment
|
LineSegment containing start (inclusive) and end (exclusive) indices |
__getitem__(index)
¶
Get line content by line number (1-based indexing).
__init__(content=None, start=1, separator=':')
¶
Initialize a numbered text document, detecting and preserving existing numbering.
Valid numbered text must have: - Sequential line numbers - Consistent separator character(s) - Every non-empty line must follow the numbering pattern
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
content
|
Optional[str]
|
Initial text content, if any |
None
|
start
|
int
|
Starting line number (used only if content isn't already numbered) |
1
|
separator
|
str
|
Separator between line numbers and content (only if content isn't numbered) |
':'
|
Examples:
__iter__()
¶
Iterate over (line_number, line_content) pairs.
__len__()
¶
Return the number of lines.
__str__()
¶
Return the numbered text representation.
from_file(path, **kwargs)
classmethod
¶
Create a NumberedText instance from a file.
get_coverage_report(section_start_lines)
¶
Return coverage statistics for sections defined by start lines.
get_line(line_num)
¶
Get content of specified line number.
get_lines(start, end)
¶
Deprecated: use get_lines_exclusive; end index remains exclusive.
get_lines_exclusive(start, end)
¶
Get content of line range [start, end) using 1-based line numbers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
start
|
int
|
Inclusive start line (1-based external indexing). |
required |
end
|
int
|
Exclusive end line (1-based; not included), matching Python slicing semantics. |
required |
get_numbered_line(line_num)
¶
Get specified line with line number.
get_numbered_lines(start, end)
¶
Get numbered lines for [start, end) using 1-based numbering.
get_numbered_segment(start, end)
¶
get_segment(start, end)
¶
Return the segment from start line (inclusive) up to end line (inclusive).
This aligns with Monaco's inclusive range semantics. Internally we convert to Python's exclusive upper bound when slicing.
iter_segments(segment_size, min_segment_size=None)
¶
Iterate over segments of the text with specified size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segment_size
|
int
|
Number of lines per segment |
required |
min_segment_size
|
Optional[int]
|
Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size. |
None
|
Yields:
| Type | Description |
|---|---|
LineSegment
|
LineSegment objects containing start and end line numbers |
Example
text = NumberedText("line1\nline2\nline3\nline4\nline5") for segment in text.iter_segments(2): ... print(f"Lines {segment.start}-{segment.end}") Lines 1-3 Lines 3-5 Lines 5-6
save(path, numbered=True)
¶
Save document to file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Output file path |
required |
numbered
|
bool
|
Whether to save with line numbers (default: True) |
True
|
validate_section_boundaries(section_start_lines)
¶
Validate section boundaries for gaps, overlaps, and out-of-bounds errors.
Sections are defined by their start lines; the end of each section is implicit: it ends at the line before the next section starts, with the final section ending at the last line of the text. Validation enforces: - First section starts at self.start - No overlaps (next start must be > previous start) - No gaps (next start must be exactly previous start + 1) - All start lines within [self.start, self.end]
bracket_lines(text, number=False)
¶
Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.
Args:
text (str): The input string containing lines separated by '
'. number (bool): Whether to prepend line numbers to each line.
Returns:
str: A string where each line is enclosed in angle brackets.
Examples:
>>> bracket_lines("This is a string with
two lines.")
'
>>> bracket_lines("This is a string with
two lines.", number=True) '<1:This is a string with> <2: two lines.>'
clean_text(text, newline=False)
¶
Cleans a given text by replacing specific unwanted characters such as tab, and non-breaking spaces with regular spaces.
This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
The text to be cleaned. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
The cleaned text with unwanted characters replaced by spaces. |
Example
text = "This is\n an example\ttext with\xa0extra spaces." clean_text(text) 'This is an example text with extra spaces.'
lines_from_bracketed_text(text, start, end, keep_brackets=False)
¶
Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.
Args:
text (str): The input bracketed text containing lines like <...>.
start (int): The starting line number (1-based).
end (int): The ending line number (1-based).
Returns:
list[str]: The lines from start to end inclusive, with angle brackets removed.
Raises:
FormattingError: If the text contains improperly formatted lines (missing angle brackets).
ValueError: If start or end indices are invalid or out of bounds.
Examples:
>>> text = "<1:Line 1>
<2:Line 2> <3:Line 3>" >>> lines_from_bracketed_text(text, 1, 2) ['Line 1', 'Line 2']
>>> text = "<Line 1>
normalize_newlines(text, spacing=2)
¶
Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.
Parameters:
----------
text : str
The input text containing inconsistent newline spacing.
spacing : int, optional
The number of newlines to insert between lines. Defaults to 2.
Returns:
-------
str
The text with consecutive newlines reduced to the specified number of newlines.
Example:
--------
>>> raw_text = "Heading
Paragraph text 1 Paragraph text 2
" >>> normalize_newlines(raw_text, spacing=2) 'Heading
Paragraph text 1
Paragraph text 2
'
unbracket_lines(text, number=False)
¶
Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.
Args:
text (str): The input string with encapsulated lines.
number (bool): If True, removes line numbers in the format 'digit:'.
Raises a ValueError if `number=True` and a line does not start
with a digit followed by a colon.
Returns:
str: A newline-separated string with the encapsulation removed,
and line numbers stripped if specified.
Examples:
>>> unbracket_lines("<1:Line 1>
<2:Line 2>", number=True) 'Line 1 Line 2'
>>> unbracket_lines("<Line 1>
>>> unbracket_lines("<1Line 1>", number=True)
ValueError: Line does not start with a valid number: '1Line 1'
bracket
¶
FormattingError
¶
Bases: Exception
Custom exception raised for formatting-related errors.
__init__(message='An error occurred due to invalid formatting.')
¶
bracket_all_lines(pages)
¶
bracket_lines(text, number=False)
¶
Encloses each line of the input text with angle brackets.
If number is True, adds a line number followed by a colon `:` and then the line.
Args:
text (str): The input string containing lines separated by '
'. number (bool): Whether to prepend line numbers to each line.
Returns:
str: A string where each line is enclosed in angle brackets.
Examples:
>>> bracket_lines("This is a string with
two lines.")
'
>>> bracket_lines("This is a string with
two lines.", number=True) '<1:This is a string with> <2: two lines.>'
lines_from_bracketed_text(text, start, end, keep_brackets=False)
¶
Extracts lines from bracketed text between the start and end indices, inclusive.
Handles both numbered and non-numbered cases.
Args:
text (str): The input bracketed text containing lines like <...>.
start (int): The starting line number (1-based).
end (int): The ending line number (1-based).
Returns:
list[str]: The lines from start to end inclusive, with angle brackets removed.
Raises:
FormattingError: If the text contains improperly formatted lines (missing angle brackets).
ValueError: If start or end indices are invalid or out of bounds.
Examples:
>>> text = "<1:Line 1>
<2:Line 2> <3:Line 3>" >>> lines_from_bracketed_text(text, 1, 2) ['Line 1', 'Line 2']
>>> text = "<Line 1>
number_lines(text, start=1, separator=': ')
¶
Numbers each line of text with a readable format, including empty lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Input text to be numbered. Can be multi-line. |
required |
start
|
int
|
Starting line number. Defaults to 1. |
1
|
separator
|
str
|
Separator between line number and content. Defaults to ": ". |
': '
|
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
Numbered text where each line starts with "{number}: ". |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> print(number_lines(text))
1: First line
2: Second line
3:
4: Fourth line
>>> print(number_lines(text, start=5, separator=" | "))
5 | First line
6 | Second line
7 |
8 | Fourth line
Notes
- All lines are numbered, including empty lines, to maintain text structure
- Line numbers are aligned through natural string formatting
- Customizable separator allows for different formatting needs
- Can start from any line number for flexibility in text processing
unbracket_all_lines(pages)
¶
unbracket_lines(text, number=False)
¶
Removes angle brackets (< >) from encapsulated lines and optionally removes line numbers.
Args:
text (str): The input string with encapsulated lines.
number (bool): If True, removes line numbers in the format 'digit:'.
Raises a ValueError if `number=True` and a line does not start
with a digit followed by a colon.
Returns:
str: A newline-separated string with the encapsulation removed,
and line numbers stripped if specified.
Examples:
>>> unbracket_lines("<1:Line 1>
<2:Line 2>", number=True) 'Line 1 Line 2'
>>> unbracket_lines("<Line 1>
>>> unbracket_lines("<1Line 1>", number=True)
ValueError: Line does not start with a valid number: '1Line 1'
match_section
¶
MatchObject
¶
Bases: BaseModel
Basic Match Object definition.
SectionConfig
¶
find_keyword(line, words, case_sensitive, decorator)
¶
Check if line matches keyword pattern.
find_markdown_header(line, level)
¶
Check if line matches markdown header pattern.
find_regex(line, pattern)
¶
Check if line matches regex pattern.
find_section_boundaries(text, config)
¶
Find all section boundary line numbers.
numbered_text
¶
NumberedFormat
¶
NumberedText
¶
Immutable container for text documents with numbered lines.
Provides utilities for working with line-numbered text including reading, writing, accessing lines by number, and iterating over numbered lines.
Immutability Note
NumberedText is designed to be used immutably after construction. While not enforced at runtime (for performance reasons as a low-level container), instances should not be modified after creation. All operations return new data rather than mutating the instance.
Whitespace and Blank Line Handling (Monaco Editor as standard for compatibility): NumberedText follows Monaco Editor's verbatim line and whitespace handling. Monaco Editor: https://microsoft.github.io/monaco-editor/typedoc/interfaces/IRange.html
- Blank lines: Preserved as empty strings in the lines list
- Whitespace: Leading/trailing whitespace preserved (never stripped)
- Line count: Blank lines count as lines (e.g., "a\n\nb" has 3 lines)
- Indexing: 1-based line numbers with inclusive end semantics (Monaco IRange)
Numbered Input Detection:
When input contains line numbers (e.g., "1: foo\n2:\n3: bar"):
- Pattern validation: Only non-blank lines validated for sequential numbering
- Number extraction: Removes number prefix (e.g., "2: ") from all lines
- Blank line handling: After number removal, blank lines become empty strings
- Example: "1: foo\n2:\n3: bar" → lines=[' foo', '', ' bar']
Attributes:
| Name | Type | Description |
|---|---|---|
lines |
List[str]
|
List of text lines (do not modify after construction) |
start |
int
|
Starting line number (do not modify after construction) |
separator |
str
|
Separator between line number and content (do not modify after construction) |
Examples:
>>> text = "First line\nSecond line\n\nFourth line"
>>> doc = NumberedText(text)
>>> print(doc)
1: First line
2: Second line
3:
4: Fourth line
content
property
¶
Get original text without line numbers.
end
property
¶
lines = []
instance-attribute
¶
numbered_content
property
¶
Get text with line numbers as a string. Equivalent to str(self)
numbered_lines
property
¶
Get list of lines with line numbers included.
Returns:
| Type | Description |
|---|---|
List[str]
|
List[str]: Lines with numbers and separator prefixed |
Examples:
>>> doc = NumberedText("First line\nSecond line")
>>> doc.numbered_lines
['1: First line', '2: Second line']
Note
- Unlike str(self), this returns a list rather than joined string
- Maintains consistent formatting with separator
- Useful for processing or displaying individual numbered lines
separator = separator
instance-attribute
¶
size
property
¶
Get the number of lines.
start = start
instance-attribute
¶
LineSegment
dataclass
¶
Represents a segment of lines with start and end indices in 1-based indexing.
The segment follows Python range conventions where start is inclusive and end is exclusive. However, indexing is 1-based to match NumberedText.
Attributes:
| Name | Type | Description |
|---|---|---|
start |
int
|
Starting line number (inclusive, 1-based) |
end |
int
|
Ending line number (exclusive, 1-based) |
SegmentIterator
¶
Iterator for generating line segments of specified size.
Produces segments of lines with start/end indices following 1-based indexing. The final segment may be smaller than the specified segment size.
Attributes:
| Name | Type | Description |
|---|---|---|
total_lines |
Total number of lines in text |
|
segment_size |
Number of lines per segment |
|
start_line |
Starting line number (1-based) |
|
min_segment_size |
Minimum size for the final segment |
min_segment_size = min_segment_size
instance-attribute
¶num_segments = (remaining_lines + segment_size - 1) // segment_size
instance-attribute
¶segment_size = segment_size
instance-attribute
¶start_line = start_line
instance-attribute
¶total_lines = total_lines
instance-attribute
¶__init__(total_lines, segment_size, start_line=1, min_segment_size=None)
¶Initialize the segment iterator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
total_lines
|
int
|
Total number of lines to iterate over |
required |
segment_size
|
int
|
Desired size of each segment |
required |
start_line
|
int
|
First line number (default: 1) |
1
|
min_segment_size
|
Optional[int]
|
Minimum size for final segment (default: None) If specified, the last segment will be merged with the previous one if it would be smaller than this size. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If segment_size < 1 or total_lines < 1 |
ValueError
|
If start_line < 1 (must use 1-based indexing) |
ValueError
|
If min_segment_size >= segment_size |
__iter__()
¶Iterate over line segments.
Yields:
| Type | Description |
|---|---|
LineSegment
|
LineSegment containing start (inclusive) and end (exclusive) indices |
__getitem__(index)
¶
Get line content by line number (1-based indexing).
__init__(content=None, start=1, separator=':')
¶
Initialize a numbered text document, detecting and preserving existing numbering.
Valid numbered text must have: - Sequential line numbers - Consistent separator character(s) - Every non-empty line must follow the numbering pattern
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
content
|
Optional[str]
|
Initial text content, if any |
None
|
start
|
int
|
Starting line number (used only if content isn't already numbered) |
1
|
separator
|
str
|
Separator between line numbers and content (only if content isn't numbered) |
':'
|
Examples:
__iter__()
¶
Iterate over (line_number, line_content) pairs.
__len__()
¶
Return the number of lines.
__str__()
¶
Return the numbered text representation.
from_file(path, **kwargs)
classmethod
¶
Create a NumberedText instance from a file.
get_coverage_report(section_start_lines)
¶
Return coverage statistics for sections defined by start lines.
get_line(line_num)
¶
Get content of specified line number.
get_lines(start, end)
¶
Deprecated: use get_lines_exclusive; end index remains exclusive.
get_lines_exclusive(start, end)
¶
Get content of line range [start, end) using 1-based line numbers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
start
|
int
|
Inclusive start line (1-based external indexing). |
required |
end
|
int
|
Exclusive end line (1-based; not included), matching Python slicing semantics. |
required |
get_numbered_line(line_num)
¶
Get specified line with line number.
get_numbered_lines(start, end)
¶
Get numbered lines for [start, end) using 1-based numbering.
get_numbered_segment(start, end)
¶
get_segment(start, end)
¶
Return the segment from start line (inclusive) up to end line (inclusive).
This aligns with Monaco's inclusive range semantics. Internally we convert to Python's exclusive upper bound when slicing.
iter_segments(segment_size, min_segment_size=None)
¶
Iterate over segments of the text with specified size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
segment_size
|
int
|
Number of lines per segment |
required |
min_segment_size
|
Optional[int]
|
Optional minimum size for final segment. If specified, last segment will be merged with previous one if it would be smaller than this size. |
None
|
Yields:
| Type | Description |
|---|---|
LineSegment
|
LineSegment objects containing start and end line numbers |
Example
text = NumberedText("line1\nline2\nline3\nline4\nline5") for segment in text.iter_segments(2): ... print(f"Lines {segment.start}-{segment.end}") Lines 1-3 Lines 3-5 Lines 5-6
save(path, numbered=True)
¶
Save document to file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Output file path |
required |
numbered
|
bool
|
Whether to save with line numbers (default: True) |
True
|
validate_section_boundaries(section_start_lines)
¶
Validate section boundaries for gaps, overlaps, and out-of-bounds errors.
Sections are defined by their start lines; the end of each section is implicit: it ends at the line before the next section starts, with the final section ending at the last line of the text. Validation enforces: - First section starts at self.start - No overlaps (next start must be > previous start) - No gaps (next start must be exactly previous start + 1) - All start lines within [self.start, self.end]
SectionValidationError
¶
Bases: BaseModel
Error found in section boundaries.
Error metadata class following tnh-scholar standards: - Pydantic v2 BaseModel for validation and serialization - Frozen for immutability - Used as data structure returned from validation methods
See: src/tnh_scholar/exceptions.py for exception classes
get_numbered_format(text)
¶
Analyze text to determine if it follows a consistent line numbering format.
Valid formats have: - Sequential numbers starting from some value - Consistent separator character(s) - Every line must follow the format
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
NumberedFormat
|
Tuple of (is_numbered, separator, start_number) |
Examples:
text_processing
¶
clean_text(text, newline=False)
¶
Cleans a given text by replacing specific unwanted characters such as tab, and non-breaking spaces with regular spaces.
This function takes a string as input and applies replacements based on a predefined mapping of characters to replace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
The text to be cleaned. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
The cleaned text with unwanted characters replaced by spaces. |
Example
text = "This is\n an example\ttext with\xa0extra spaces." clean_text(text) 'This is an example text with extra spaces.'
normalize_newlines(text, spacing=2)
¶
Normalize newline blocks in the input text by reducing consecutive newlines
to the specified number of newlines for consistent readability and formatting.
Parameters:
----------
text : str
The input text containing inconsistent newline spacing.
spacing : int, optional
The number of newlines to insert between lines. Defaults to 2.
Returns:
-------
str
The text with consecutive newlines reduced to the specified number of newlines.
Example:
--------
>>> raw_text = "Heading
Paragraph text 1 Paragraph text 2
" >>> normalize_newlines(raw_text, spacing=2) 'Heading
Paragraph text 1
Paragraph text 2
'
tools
¶
utils
¶
__all__ = ['copy_files_with_regex', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_as_str', 'read_str_from_file', 'sanitize_filename', 'to_slug', 'write_str_to_file', 'load_json_into_model', 'load_jsonl_to_dict', 'save_model_to_json', 'get_language_code_from_text', 'get_language_from_code', 'get_language_name_from_text', 'fraction_to_percent', 'ExpectedTimeTQDM', 'TimeProgress', 'TimeMs', 'TNHAudioSegment', 'convert_ms_to_sec', 'convert_sec_to_ms', 'get_user_confirmation', 'check_ocr_env', 'check_openai_env']
module-attribute
¶
ExpectedTimeTQDM
¶
A context manager for a time-based tqdm progress bar with optional delay.
- 'expected_time': number of seconds we anticipate the task might take.
- 'display_interval': how often (seconds) to refresh the bar.
- 'desc': a short description for the bar.
- 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.
If the task finishes before 'delay_start' has elapsed, the bar may never appear.
delay_start = delay_start
instance-attribute
¶
desc = desc
instance-attribute
¶
display_interval = display_interval
instance-attribute
¶
expected_time = round(expected_time)
instance-attribute
¶
__enter__()
¶
__exit__(exc_type, exc_value, traceback)
¶
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)
¶
TNHAudioSegment
¶
raw
property
¶
Access the underlying pydub.AudioSegment if needed.
__add__(other)
¶
__getitem__(key)
¶
__iadd__(other)
¶
__init__(segment)
¶
__len__()
¶
empty()
staticmethod
¶
export(out_f, format, **kwargs)
¶
Wrapper: Export the audio segment to a file-like object or file path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
out_f
|
str | BinaryIO
|
File path or file-like object to write the audio data to. |
required |
format
|
str
|
Audio format (e.g., 'mp3', 'wav'). |
required |
**kwargs
|
Any
|
Additional keyword arguments passed to pydub.AudioSegment.export. |
{}
|
from_file(file, format=None, **kwargs)
staticmethod
¶
Wrapper: Load an audio file into a TNHAudioSegment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
str | Path | BytesIO
|
Path to the audio file. |
required |
format
|
str | None
|
Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it. |
None
|
**kwargs
|
Any
|
Additional keyword arguments passed to pydub.AudioSegment.from_file. |
{}
|
Returns:
| Type | Description |
|---|---|
TNHAudioSegment
|
TNHAudioSegment instance containing the loaded audio. |
silent(duration)
staticmethod
¶
TimeMs
¶
Bases: int
Lightweight representation of a time interval or timestamp in milliseconds. Allows negative values.
TimeProgress
¶
A context manager for a time-based progress display using dots.
The display updates once per second, printing a dot and showing: - Expected time (if provided) - Elapsed time (always displayed)
Example:
import time with ExpectedTimeProgress(expected_time=60, desc="Transcribing..."): ... time.sleep(5) # Simulate work [Expected Time: 1:00, Elapsed Time: 0:05] .....
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected_time
|
Optional[float]
|
Expected time in seconds. Optional. |
None
|
display_interval
|
float
|
How often to print a dot (seconds). |
1.0
|
desc
|
str
|
Description to display alongside the progress. |
''
|
check_ocr_env(output=True)
¶
Check OCR processing requirements.
check_openai_env(output=True)
¶
Check OpenAI API requirements.
convert_ms_to_sec(ms)
¶
Convert time from milliseconds (int) to seconds (float).
convert_sec_to_ms(val)
¶
Convert seconds to milliseconds, rounding to the nearest integer.
copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)
¶
Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_dir
|
Path
|
Path to the source directory to search files in. |
required |
destination_dir
|
Path
|
Path to the destination directory where files will be copied. |
required |
regex_patterns
|
list[str]
|
List of regex patterns to match file names. |
required |
preserve_structure
|
bool
|
Whether to preserve the directory structure. Defaults to True. |
True
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the source directory does not exist or is not a directory. |
Example
copy_files_with_regex( ... source_dir=Path("/path/to/source"), ... destination_dir=Path("/path/to/destination"), ... regex_patterns=[r'.*.txt\(', r'.*\.log\)'], ... preserve_structure=True ... )
ensure_directory_exists(dir_path)
¶
Create directory if it doesn't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dir_path
|
Path
|
Directory path to ensure exists. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if the directory exists or was created successfully, False otherwise. |
ensure_directory_writable(dir_path)
¶
Ensure the directory exists and is writable. Creates the directory if it does not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dir_path
|
Path
|
Directory to verify or create. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the directory cannot be created or is not writable. |
TypeError
|
If the provided path is not a Path instance. |
fraction_to_percent(numerator, denominator)
¶
Convert a fraction to a percentage (0.0 if denominator is zero).
get_language_code_from_text(text)
¶
Detect the language of the provided text using langdetect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
return result 'code' ISO 639-1 for detected language. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If text is empty or invalid |
get_language_from_code(code)
¶
get_language_name_from_text(text)
¶
get_user_confirmation(prompt, default=True)
¶
Prompt the user for a yes/no confirmation with single-character input. Cross-platform implementation. Returns True if 'y' is entered, and False if 'n' Allows for default value if return is entered.
Example usage if get_user_confirmation("Do you want to continue"): print("Continuing...") else: print("Exiting...")
iterate_subdir(directory, recursive=False)
¶
Iterates through subdirectories in the given directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
directory
|
Path
|
The root directory to start the iteration. |
required |
recursive
|
bool
|
If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories. |
False
|
Yields:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
Paths to each subdirectory. |
Example
for subdir in iterate_subdir(Path('/root'), recursive=False): ... print(subdir)
load_json_into_model(file, model)
¶
Loads a JSON file and validates it against a Pydantic model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file. |
required |
model
|
type[BaseModel]
|
The Pydantic model to validate against. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
BaseModel |
BaseModel
|
An instance of the validated Pydantic model. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the file content is invalid JSON or does not match the model. |
Example: class ExampleModel(BaseModel): name: str age: int city: str
if __name__ == "__main__":
json_file = Path("example.json")
try:
data = load_json_into_model(json_file, ExampleModel)
print(data)
except ValueError as e:
print(e)
load_jsonl_to_dict(file_path)
¶
Load a JSONL file into a list of dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the JSONL file. |
required |
Returns:
| Type | Description |
|---|---|
List[Dict]
|
List[Dict]: A list of dictionaries, each representing a line in the JSONL file. |
Example
from pathlib import Path file_path = Path("data.jsonl") data = load_jsonl_to_dict(file_path) print(data) [{'key1': 'value1'}, {'key2': 'value2'}]
path_as_str(path)
¶
read_str_from_file(file_path)
¶
Reads the entire content of a text file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
The path to the text file. |
required |
Returns:
| Type | Description |
|---|---|
str
|
The content of the text file as a single string. |
sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)
¶
Sanitize filename for use unix use.
save_model_to_json(file, model, indent=4, ensure_ascii=False)
¶
Saves a Pydantic model to a JSON file, formatted with indentation for readability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file where the model will be saved. |
required |
model
|
BaseModel
|
The Pydantic model instance to save. |
required |
indent
|
int
|
Number of spaces for JSON indentation. Defaults to 4. |
4
|
ensure_ascii
|
bool
|
Whether to escape non-ASCII characters. Defaults to False. |
False
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the model cannot be serialized to JSON. |
IOError
|
If there is an issue writing to the file. |
Example
class ExampleModel(BaseModel): name: str age: int
if name == "main": model_instance = ExampleModel(name="John", age=30) json_file = Path("example.json") try: save_model_to_json(json_file, model_instance) print(f"Model saved to {json_file}") except (ValueError, IOError) as e: print(e)
to_slug(string)
¶
Slugify a Unicode string.
Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.
Example
slugify("Héllø_Wörld!") 'hello-world'
write_str_to_file(file_path, text, overwrite=False)
¶
Writes text to a file with file locking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
PathLike
|
The path to the file to write. |
required |
text
|
str
|
The text to write to the file. |
required |
overwrite
|
bool
|
Whether to overwrite the file if it exists. |
False
|
Raises:
| Type | Description |
|---|---|
FileExistsError
|
If the file exists and overwrite is False. |
OSError
|
If there's an issue with file locking or writing. |
file_utils
¶
DEFAULT_MAX_FILENAME_LENGTH = 25
module-attribute
¶
PathLike = Union[str, Path]
module-attribute
¶
__all__ = ['DEFAULT_MAX_FILENAME_LENGTH', 'FileExistsWarning', 'ensure_directory_exists', 'ensure_directory_writable', 'iterate_subdir', 'path_source_str', 'copy_files_with_regex', 'read_str_from_file', 'write_str_to_file', 'sanitize_filename', 'to_slug', 'path_as_str']
module-attribute
¶
FileExistsWarning
¶
Bases: UserWarning
copy_files_with_regex(source_dir, destination_dir, regex_patterns, preserve_structure=True)
¶
Copies files from subdirectories one level down in the source directory to the destination directory if they match any regex pattern. Optionally preserves the directory structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_dir
|
Path
|
Path to the source directory to search files in. |
required |
destination_dir
|
Path
|
Path to the destination directory where files will be copied. |
required |
regex_patterns
|
list[str]
|
List of regex patterns to match file names. |
required |
preserve_structure
|
bool
|
Whether to preserve the directory structure. Defaults to True. |
True
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the source directory does not exist or is not a directory. |
Example
copy_files_with_regex( ... source_dir=Path("/path/to/source"), ... destination_dir=Path("/path/to/destination"), ... regex_patterns=[r'.*.txt\(', r'.*\.log\)'], ... preserve_structure=True ... )
ensure_directory_exists(dir_path)
¶
Create directory if it doesn't exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dir_path
|
Path
|
Directory path to ensure exists. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if the directory exists or was created successfully, False otherwise. |
ensure_directory_writable(dir_path)
¶
Ensure the directory exists and is writable. Creates the directory if it does not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dir_path
|
Path
|
Directory to verify or create. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the directory cannot be created or is not writable. |
TypeError
|
If the provided path is not a Path instance. |
iterate_subdir(directory, recursive=False)
¶
Iterates through subdirectories in the given directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
directory
|
Path
|
The root directory to start the iteration. |
required |
recursive
|
bool
|
If True, iterates recursively through all subdirectories. If False, iterates only over the immediate subdirectories. |
False
|
Yields:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
Paths to each subdirectory. |
Example
for subdir in iterate_subdir(Path('/root'), recursive=False): ... print(subdir)
path_as_str(path)
¶
path_source_str(path)
¶
read_str_from_file(file_path)
¶
Reads the entire content of a text file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
The path to the text file. |
required |
Returns:
| Type | Description |
|---|---|
str
|
The content of the text file as a single string. |
sanitize_filename(filename, max_length=DEFAULT_MAX_FILENAME_LENGTH)
¶
Sanitize filename for use unix use.
to_slug(string)
¶
Slugify a Unicode string.
Converts a string to a strict URL-friendly slug format, allowing only lowercase letters, digits, and hyphens.
Example
slugify("Héllø_Wörld!") 'hello-world'
write_str_to_file(file_path, text, overwrite=False)
¶
Writes text to a file with file locking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
PathLike
|
The path to the file to write. |
required |
text
|
str
|
The text to write to the file. |
required |
overwrite
|
bool
|
Whether to overwrite the file if it exists. |
False
|
Raises:
| Type | Description |
|---|---|
FileExistsError
|
If the file exists and overwrite is False. |
OSError
|
If there's an issue with file locking or writing. |
json_utils
¶
format_json(file)
¶
Formats a JSON file with line breaks and indentation for readability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file to be formatted. |
required |
Example
format_json(Path("data.json"))
load_json_into_model(file, model)
¶
Loads a JSON file and validates it against a Pydantic model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file. |
required |
model
|
type[BaseModel]
|
The Pydantic model to validate against. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
BaseModel |
BaseModel
|
An instance of the validated Pydantic model. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the file content is invalid JSON or does not match the model. |
Example: class ExampleModel(BaseModel): name: str age: int city: str
if __name__ == "__main__":
json_file = Path("example.json")
try:
data = load_json_into_model(json_file, ExampleModel)
print(data)
except ValueError as e:
print(e)
load_jsonl_to_dict(file_path)
¶
Load a JSONL file into a list of dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the JSONL file. |
required |
Returns:
| Type | Description |
|---|---|
List[Dict]
|
List[Dict]: A list of dictionaries, each representing a line in the JSONL file. |
Example
from pathlib import Path file_path = Path("data.jsonl") data = load_jsonl_to_dict(file_path) print(data) [{'key1': 'value1'}, {'key2': 'value2'}]
save_model_to_json(file, model, indent=4, ensure_ascii=False)
¶
Saves a Pydantic model to a JSON file, formatted with indentation for readability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file where the model will be saved. |
required |
model
|
BaseModel
|
The Pydantic model instance to save. |
required |
indent
|
int
|
Number of spaces for JSON indentation. Defaults to 4. |
4
|
ensure_ascii
|
bool
|
Whether to escape non-ASCII characters. Defaults to False. |
False
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the model cannot be serialized to JSON. |
IOError
|
If there is an issue writing to the file. |
Example
class ExampleModel(BaseModel): name: str age: int
if name == "main": model_instance = ExampleModel(name="John", age=30) json_file = Path("example.json") try: save_model_to_json(json_file, model_instance) print(f"Model saved to {json_file}") except (ValueError, IOError) as e: print(e)
write_data_to_json_file(file, data, indent=4, ensure_ascii=False)
¶
Writes a dictionary or list as a JSON string to a file, ensuring the parent directory exists, and supports formatting with indentation and ASCII control.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
Path
|
Path to the JSON file where the data will be written. |
required |
data
|
Union[dict, list]
|
The data to write to the file. Typically a dict or list. |
required |
indent
|
int
|
Number of spaces for JSON indentation. Defaults to 4. |
4
|
ensure_ascii
|
bool
|
Whether to escape non-ASCII characters. Defaults to False. |
False
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the data cannot be serialized to JSON. |
IOError
|
If there is an issue writing to the file. |
Example
from pathlib import Path data = {"key": "value"} write_json_str_to_file(Path("output.json"), data, indent=2, ensure_ascii=True)
lang
¶
logger = get_child_logger(__name__)
module-attribute
¶
get_language_code_from_text(text)
¶
Detect the language of the provided text using langdetect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
return result 'code' ISO 639-1 for detected language. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If text is empty or invalid |
get_language_from_code(code)
¶
get_language_name_from_text(text)
¶
math_utils
¶
fraction_to_percent(numerator, denominator)
¶
Convert a fraction to a percentage (0.0 if denominator is zero).
progress_utils
¶
BAR_FORMAT = '{desc}: {percentage:3.0f}%|{bar}| Total: {total_fmt} sec. [elapsed: {elapsed}]'
module-attribute
¶
ExpectedTimeTQDM
¶
A context manager for a time-based tqdm progress bar with optional delay.
- 'expected_time': number of seconds we anticipate the task might take.
- 'display_interval': how often (seconds) to refresh the bar.
- 'desc': a short description for the bar.
- 'delay_start': how many seconds to wait (sleep) before we even create/start the bar.
If the task finishes before 'delay_start' has elapsed, the bar may never appear.
delay_start = delay_start
instance-attribute
¶
desc = desc
instance-attribute
¶
display_interval = display_interval
instance-attribute
¶
expected_time = round(expected_time)
instance-attribute
¶
__enter__()
¶
__exit__(exc_type, exc_value, traceback)
¶
__init__(expected_time, display_interval=0.5, desc='Time-based Progress', delay_start=1.0)
¶
TimeProgress
¶
A context manager for a time-based progress display using dots.
The display updates once per second, printing a dot and showing: - Expected time (if provided) - Elapsed time (always displayed)
Example:
import time with ExpectedTimeProgress(expected_time=60, desc="Transcribing..."): ... time.sleep(5) # Simulate work [Expected Time: 1:00, Elapsed Time: 0:05] .....
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expected_time
|
Optional[float]
|
Expected time in seconds. Optional. |
None
|
display_interval
|
float
|
How often to print a dot (seconds). |
1.0
|
desc
|
str
|
Description to display alongside the progress. |
''
|
timing_utils
¶
tnh_audio_segment
¶
TNHAudioSegment: A typed, minimal wrapper for pydub.AudioSegment.
This class provides a type-safe interface for working with audio segments using pydub, enabling easier composition, slicing, and manipulation of audio data. It exposes common operations such as concatenation, slicing, and length retrieval, while hiding the underlying pydub implementation.
Key features
- Type-annotated methods for static analysis and IDE support
- Static constructors for silent and empty segments
- Operator overloads for concatenation and slicing
- Access to the underlying pydub.AudioSegment via the
rawproperty
Extend this class with additional methods as needed for your audio processing workflows.
TNHAudioSegment
¶
raw
property
¶
Access the underlying pydub.AudioSegment if needed.
__add__(other)
¶
__getitem__(key)
¶
__iadd__(other)
¶
__init__(segment)
¶
__len__()
¶
empty()
staticmethod
¶
export(out_f, format, **kwargs)
¶
Wrapper: Export the audio segment to a file-like object or file path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
out_f
|
str | BinaryIO
|
File path or file-like object to write the audio data to. |
required |
format
|
str
|
Audio format (e.g., 'mp3', 'wav'). |
required |
**kwargs
|
Any
|
Additional keyword arguments passed to pydub.AudioSegment.export. |
{}
|
from_file(file, format=None, **kwargs)
staticmethod
¶
Wrapper: Load an audio file into a TNHAudioSegment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file
|
str | Path | BytesIO
|
Path to the audio file. |
required |
format
|
str | None
|
Optional audio format (e.g., 'mp3', 'wav'). If None, pydub will attempt to infer it. |
None
|
**kwargs
|
Any
|
Additional keyword arguments passed to pydub.AudioSegment.from_file. |
{}
|
Returns:
| Type | Description |
|---|---|
TNHAudioSegment
|
TNHAudioSegment instance containing the loaded audio. |
silent(duration)
staticmethod
¶
user_io_utils
¶
get_single_char(prompt=None)
¶
Get a single character from input, adapting to the execution environment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
prompt
|
Optional[str]
|
Optional prompt to display before getting input |
None
|
Returns:
| Type | Description |
|---|---|
str
|
A single character string from user input |
Note
- In terminal environments, uses raw input mode without requiring Enter
- In Jupyter/IPython, falls back to regular input with message about Enter
get_user_confirmation(prompt, default=True)
¶
Prompt the user for a yes/no confirmation with single-character input. Cross-platform implementation. Returns True if 'y' is entered, and False if 'n' Allows for default value if return is entered.
Example usage if get_user_confirmation("Do you want to continue"): print("Continuing...") else: print("Exiting...")
validate
¶
OCR_ENV_VARS = {'GOOGLE_APPLICATION_CREDENTIALS'}
module-attribute
¶
OPENAI_ENV_VARS = {'OPENAI_API_KEY'}
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
check_env(required_vars, feature='this feature', output=True)
¶
Check environment variables and provide user-friendly error messages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
required_vars
|
Set[str]
|
Set of environment variable names to check |
required |
feature
|
str
|
Description of feature requiring these variables |
'this feature'
|
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if all required variables are set |
check_ocr_env(output=True)
¶
Check OCR processing requirements.
check_openai_env(output=True)
¶
Check OpenAI API requirements.
get_env_message(missing_vars, feature='this feature')
¶
Generate user-friendly environment setup message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
missing_vars
|
List[str]
|
List of missing environment variable names |
required |
feature
|
str
|
Name of feature requiring the variables |
'this feature'
|
Returns:
| Type | Description |
|---|---|
str
|
Formatted error message with setup instructions |
version_check
¶
Version checker package for monitoring package version compatibility.
__all__ = ['PackageVersionChecker', 'VersionCheckerConfig', 'VersionStrategy', 'Result', 'PackageInfo']
module-attribute
¶
PackageInfo
dataclass
¶
Information about a package and its versions.
PackageVersionChecker
¶
Main class for checking package versions against requirements.
Result
dataclass
¶
Result of a version check operation.
diff_details = None
class-attribute
instance-attribute
¶
error = None
class-attribute
instance-attribute
¶
is_compatible
instance-attribute
¶
needs_update
instance-attribute
¶
package_info
instance-attribute
¶
warning_level = None
class-attribute
instance-attribute
¶
__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
¶
get_upgrade_command()
¶
Return pip command to upgrade package.
VersionCheckerConfig
¶
Configuration for version checking behavior.
cache_duration = cache_duration
instance-attribute
¶
fail_on_error = fail_on_error
instance-attribute
¶
network_timeout = network_timeout
instance-attribute
¶
requirement = requirement
instance-attribute
¶
strategy = strategy
instance-attribute
¶
vdiff_fail_matrix = vdiff_fail_matrix
instance-attribute
¶
vdiff_warn_matrix = vdiff_warn_matrix
instance-attribute
¶
__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)
¶
Initialize version checker configuration.
get_required_version()
¶
Get required version as a Version object.
VersionStrategy
¶
Bases: Enum
Enumeration of version checking strategies.
cache
¶
Simple caching mechanism for version information.
VersionCache
¶
Simple time-based cache for version information.
cache = {}
instance-attribute
¶cache_duration = cache_duration
instance-attribute
¶timestamps = {}
instance-attribute
¶__init__(cache_duration=3600)
¶Initialize cache with specified expiration time in seconds.
get(key)
¶Get cached version if still valid.
is_valid(key)
¶Check if cached value is still valid.
set(key, value)
¶Cache version with current timestamp.
checker
¶
Main version checker implementation.
PackageVersionChecker
¶
Main class for checking package versions against requirements.
cli
¶
Command-line interface for version checking (stub for future implementation).
main()
¶
Command-line interface for version checking.
config
¶
Configuration classes for version checking.
VersionCheckerConfig
¶
Configuration for version checking behavior.
cache_duration = cache_duration
instance-attribute
¶fail_on_error = fail_on_error
instance-attribute
¶network_timeout = network_timeout
instance-attribute
¶requirement = requirement
instance-attribute
¶strategy = strategy
instance-attribute
¶vdiff_fail_matrix = vdiff_fail_matrix
instance-attribute
¶vdiff_warn_matrix = vdiff_warn_matrix
instance-attribute
¶__init__(strategy=VersionStrategy.MINIMUM, requirement='', fail_on_error=False, cache_duration=3600, network_timeout=5, vdiff_warn_matrix=None, vdiff_fail_matrix=None)
¶Initialize version checker configuration.
get_required_version()
¶Get required version as a Version object.
VersionStrategy
¶
Bases: Enum
Enumeration of version checking strategies.
models
¶
Data models for version checking results.
PackageInfo
dataclass
¶
Information about a package and its versions.
Result
dataclass
¶
Result of a version check operation.
diff_details = None
class-attribute
instance-attribute
¶error = None
class-attribute
instance-attribute
¶is_compatible
instance-attribute
¶needs_update
instance-attribute
¶package_info
instance-attribute
¶warning_level = None
class-attribute
instance-attribute
¶__init__(is_compatible, needs_update, package_info, error=None, warning_level=None, diff_details=None)
¶get_upgrade_command()
¶Return pip command to upgrade package.
providers
¶
Version provider implementations for retrieving package versions.
StandardVersionProvider
¶
Bases: VersionProvider
Standard implementation of version provider using importlib and PyPI.
cache = cache or VersionCache()
instance-attribute
¶pypi_url_template = 'https://pypi.org/pypi/{package}/json'
instance-attribute
¶timeout = timeout
instance-attribute
¶__init__(cache=None, timeout=5)
¶get_installed_version(package_name)
¶Get installed package version.
get_latest_version(package_name)
¶Get latest available package version from PyPI.
strategies
¶
Version comparison strategies for package version checking.
check_exact_version(installed, required)
¶
Check if installed version exactly matches requirement.
check_minimum_version(installed, required)
¶
Check if installed version meets minimum requirement.
check_version_diff(installed, reference, vdiff_matrix)
¶
Check if version difference is within specified limits.
parse_vdiff_matrix(matrix_str)
¶
Parse a version difference matrix string.
webhook_server
¶
WebhookServer
¶
A generic webhook server that can receive callbacks from external services.
app = self._create_flask_app()
instance-attribute
¶
flask_running = Event()
instance-attribute
¶
flask_server_thread = None
instance-attribute
¶
port = port
instance-attribute
¶
tunnel_process = None
instance-attribute
¶
webhook_data = None
instance-attribute
¶
webhook_received = Condition()
instance-attribute
¶
__init__(port=5050)
¶
Initialize webhook server with configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
port
|
int
|
The port to run the Flask server on |
5050
|
cleanup()
¶
Clean up all resources.
close_tunnel()
¶
Close the tunnel if it's running.
create_tunnel()
¶
Create a public webhook URL using py-localtunnel.
Returns:
| Type | Description |
|---|---|
Optional[str]
|
Optional[str]: The public webhook URL or None if tunnel creation failed |
shutdown_server()
¶
Gracefully shut down the Flask server.
start_server()
¶
Start Flask server in a separate thread.
wait_for_webhook(timeout=120)
¶
Wait for webhook data to be received.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
timeout
|
int
|
Maximum time to wait in seconds |
120
|
Returns:
| Type | Description |
|---|---|
Optional[Dict]
|
Optional[Dict]: The webhook data or None if timed out |
video_processing
¶
__all__ = ['DLPDownloader', 'DownloadError', 'TranscriptError', 'VideoAudio', 'VideoProcessingError', 'VideoResource', 'VideoTranscript', 'YTDownloadService', 'extract_text_from_ttml', 'get_youtube_urls_from_csv']
module-attribute
¶
DLPDownloader
¶
Bases: YTDownloader
yt-dlp based implementation of YouTube content retrieval.
Assures temporary file export is in the form
Renames the export file to be based on title and ID by default, or moves the export file to the specified output file with appropriate extension.
config = config or BASE_YDL_OPTIONS
instance-attribute
¶
__init__(config=None)
¶
get_audio(url, start=None, end=None, output_path=None)
¶
Download audio and get metadata for a YouTube video.
get_default_export_name(url)
¶
Get default export filename for a URL.
get_default_filename_stem(metadata)
¶
Generate the object download filename.
get_metadata(url)
¶
Get metadata for a YouTube video.
get_transcript(url, lang='en', output_path=None)
¶
Downloads video transcript in TTML format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
lang
|
str
|
Language code for transcript (default: "en") |
'en'
|
output_path
|
Optional[Path]
|
Optional output directory (uses current dir if None) |
None
|
Returns:
| Type | Description |
|---|---|
VideoTranscript
|
TranscriptResource containing TTML file path and metadata |
Raises:
| Type | Description |
|---|---|
TranscriptError
|
If no transcript found for specified language |
get_video(url, quality=None, output_path=None)
¶
Download the full video with associated metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
quality
|
Optional[str]
|
yt-dlp format string (default: highest available) |
None
|
output_path
|
Optional[Path]
|
Optional output directory |
None
|
Returns:
| Type | Description |
|---|---|
VideoFile
|
VideoFile containing video file path and metadata |
Raises:
| Type | Description |
|---|---|
VideoDownloadError
|
If download fails |
DownloadError
¶
Bases: VideoProcessingError
Raised for download-related errors.
TranscriptError
¶
Bases: VideoProcessingError
Raised for transcript-related errors.
VideoAudio
dataclass
¶
Bases: VideoResource
VideoProcessingError
¶
Bases: Exception
Base exception for video processing errors.
VideoResource
dataclass
¶
VideoTranscript
dataclass
¶
Bases: VideoResource
YTDownloadService
dataclass
¶
Service wrapper for YouTube download operations.
Notes
Keeps Object-Service protocol alignment; behavior is delegated for now.
downloader
instance-attribute
¶
__init__(downloader)
¶
fetch_audio(url, start=None, end=None, output_path=None)
¶
Fetch audio via the configured downloader.
fetch_metadata(url)
¶
Fetch metadata via the configured downloader.
fetch_transcript(url, lang='en', output_path=None)
¶
Fetch a transcript via the configured downloader.
fetch_video(url, quality=None, output_path=None)
¶
Fetch video via the configured downloader.
extract_text_from_ttml(ttml_path)
¶
Extract plain text content from TTML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ttml_path
|
Path
|
Path to TTML transcript file |
required |
Returns:
| Type | Description |
|---|---|
str
|
Plain text content with one sentence per line |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file doesn't exist or has invalid content |
get_youtube_urls_from_csv(file_path)
¶
Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the CSV file containing YouTube URLs and titles. |
required |
Returns:
| Type | Description |
|---|---|
List[str]
|
List[str]: List of YouTube URLs. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the file does not exist. |
ValueError
|
If the CSV file is improperly formatted. |
ops_check
¶
OpsCheckConfig
dataclass
¶
OpsCheckFailure
dataclass
¶
OpsCheckProgressReporter
¶
Bases: Protocol
Observer for live yt-dlp ops-check progress events.
on_run_finished(report)
¶
Called when the full ops check completes.
on_run_started(total_urls)
¶
Called when the ops check begins.
on_url_failed(index, total_urls, url, reason)
¶
Called when one URL fails.
on_url_started(index, total_urls, url)
¶
Called before validating one URL.
on_url_succeeded(index, total_urls, url)
¶
Called when one URL succeeds.
OpsCheckReport
dataclass
¶
video_processing
¶
video_processing.py
BASE_YDL_OPTIONS = {'quiet': False, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
¶
DEFAULT_AUDIO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True}
module-attribute
¶
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags']
module-attribute
¶
DEFAULT_METADATA_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True}
module-attribute
¶
DEFAULT_TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'skip_download': True, 'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'}
module-attribute
¶
DEFAULT_VIDEO_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestvideo+bestaudio/best', 'merge_output_format': 'mp4', 'noplaylist': True}
module-attribute
¶
TEMP_FILENAME_FORMAT = 'temp_%(id)s'
module-attribute
¶
TEMP_FILENAME_STR = 'temp_{id}'
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
DLPDownloader
¶
Bases: YTDownloader
yt-dlp based implementation of YouTube content retrieval.
Assures temporary file export is in the form
Renames the export file to be based on title and ID by default, or moves the export file to the specified output file with appropriate extension.
config = config or BASE_YDL_OPTIONS
instance-attribute
¶
__init__(config=None)
¶
get_audio(url, start=None, end=None, output_path=None)
¶
Download audio and get metadata for a YouTube video.
get_default_export_name(url)
¶
Get default export filename for a URL.
get_default_filename_stem(metadata)
¶
Generate the object download filename.
get_metadata(url)
¶
Get metadata for a YouTube video.
get_transcript(url, lang='en', output_path=None)
¶
Downloads video transcript in TTML format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
lang
|
str
|
Language code for transcript (default: "en") |
'en'
|
output_path
|
Optional[Path]
|
Optional output directory (uses current dir if None) |
None
|
Returns:
| Type | Description |
|---|---|
VideoTranscript
|
TranscriptResource containing TTML file path and metadata |
Raises:
| Type | Description |
|---|---|
TranscriptError
|
If no transcript found for specified language |
get_video(url, quality=None, output_path=None)
¶
Download the full video with associated metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
quality
|
Optional[str]
|
yt-dlp format string (default: highest available) |
None
|
output_path
|
Optional[Path]
|
Optional output directory |
None
|
Returns:
| Type | Description |
|---|---|
VideoFile
|
VideoFile containing video file path and metadata |
Raises:
| Type | Description |
|---|---|
VideoDownloadError
|
If download fails |
DownloadError
¶
Bases: VideoProcessingError
Raised for download-related errors.
TranscriptError
¶
Bases: VideoProcessingError
Raised for transcript-related errors.
VideoAudio
dataclass
¶
Bases: VideoResource
VideoDownloadError
¶
Bases: VideoProcessingError
Raised for video download-related errors.
VideoFile
dataclass
¶
VideoProcessingError
¶
Bases: Exception
Base exception for video processing errors.
VideoResource
dataclass
¶
VideoTranscript
dataclass
¶
Bases: VideoResource
YTDownloader
¶
Abstract base class for YouTube content retrieval.
get_audio(url, start, end, output_path)
¶
Extract audio with associated metadata.
get_metadata(url)
¶
Retrieve video metadata only.
get_transcript(url, lang='en', output_path=None)
¶
Retrieve video transcript with associated metadata.
get_video(url, quality=None, output_path=None)
¶
Download the full video with associated metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
quality
|
Optional[str]
|
yt-dlp format string (default: highest available) |
None
|
output_path
|
Optional[Path]
|
Optional output directory |
None
|
Returns:
| Type | Description |
|---|---|
VideoFile
|
VideoFile containing video file path and metadata |
Raises:
| Type | Description |
|---|---|
VideoDownloadError
|
If download fails |
extract_text_from_ttml(ttml_path)
¶
Extract plain text content from TTML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ttml_path
|
Path
|
Path to TTML transcript file |
required |
Returns:
| Type | Description |
|---|---|
str
|
Plain text content with one sentence per line |
Raises:
| Type | Description |
|---|---|
ValueError
|
If file doesn't exist or has invalid content |
get_youtube_urls_from_csv(file_path)
¶
Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the CSV file containing YouTube URLs and titles. |
required |
Returns:
| Type | Description |
|---|---|
List[str]
|
List[str]: List of YouTube URLs. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the file does not exist. |
ValueError
|
If the CSV file is improperly formatted. |
video_processing_old1
¶
DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts'
module-attribute
¶
DEFAULT_TRANSCRIPT_OPTIONS = {'skip_download': True, 'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
SubtitleTrack
¶
TranscriptNotFoundError
¶
Bases: Exception
Raised when no transcript is available for the requested language.
language = language
instance-attribute
¶
video_url = video_url
instance-attribute
¶
__init__(video_url, language)
¶
Initialize TranscriptNotFoundError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
video_url
|
str
|
URL of the video where transcript was not found |
required |
language
|
str
|
Language code that was requested |
required |
VideoInfo
¶
download_audio_yt(url, output_dir, start_time=None, prompt_overwrite=True)
¶
Downloads audio from a YouTube video using yt_dlp.YoutubeDL, with an optional start time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
URL of the YouTube video. |
required |
output_dir
|
Path
|
Directory to save the downloaded audio file. |
required |
start_time
|
str
|
Optional start time (e.g., '00:01:30' for 1 minute 30 seconds). |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
Path to the downloaded audio file. |
get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)
¶
Downloads and extracts the transcript for a given YouTube video URL.
Retrieves the transcript file, extracts the text content, and returns the raw text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
The URL of the YouTube video. |
required |
lang
|
str
|
The language code for the transcript (default: 'en'). |
'en'
|
download_dir
|
Path
|
The directory to download the transcript to. |
DEFAULT_TRANSCRIPT_DIR
|
keep_transcript_file
|
bool
|
Whether to keep the downloaded transcript file (default: False). |
False
|
Returns:
| Type | Description |
|---|---|
str
|
The extracted transcript text. |
Raises:
| Type | Description |
|---|---|
TranscriptNotFoundError
|
If no transcript is available in the specified language. |
DownloadError
|
If video info extraction or download fails. |
ValueError
|
If the downloaded transcript file is invalid or empty. |
ParseError
|
If XML parsing of the transcript fails. |
get_transcript_info(video_url, lang='en')
¶
Retrieves the transcript URL for a video in the specified language.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
video_url
|
str
|
The URL of the video |
required |
lang
|
str
|
The desired language code |
'en'
|
Returns:
| Type | Description |
|---|---|
str
|
URL of the transcript |
Raises:
| Type | Description |
|---|---|
TranscriptNotFoundError
|
If no transcript is available in the specified language |
DownloadError
|
If video info extraction fails |
get_video_download_path_yt(output_dir, url)
¶
Extracts the video title using yt-dlp.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
The YouTube URL. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
Path
|
The title of the video. |
get_youtube_urls_from_csv(file_path)
¶
Reads a CSV file containing YouTube URLs and titles, logs the titles, and returns a list of URLs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the CSV file containing YouTube URLs and titles. |
required |
Returns:
| Type | Description |
|---|---|
List[str]
|
List[str]: List of YouTube URLs. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If the file does not exist. |
ValueError
|
If the CSV file is improperly formatted. |
video_processing_old2
¶
AUDIO_DOWNLOAD_OPTIONS = BASE_YDL_OPTIONS | {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'noplaylist': True}
module-attribute
¶
BASE_YDL_OPTIONS = {'quiet': True, 'no_warnings': True, 'extract_flat': True, 'socket_timeout': 30, 'retries': 3, 'ignoreerrors': True, 'logger': logger}
module-attribute
¶
DEFAULT_METADATA_FIELDS = ['id', 'title', 'description', 'duration', 'upload_date', 'uploader', 'channel_url', 'webpage_url', 'original_url', 'channel', 'language', 'categories', 'tags']
module-attribute
¶
DEFAULT_TRANSCRIPT_DIR = Path.home() / '.yt_dlp_transcripts'
module-attribute
¶
TRANSCRIPT_OPTIONS = BASE_YDL_OPTIONS | {'writesubtitles': True, 'writeautomaticsub': True, 'subtitlesformat': 'ttml'}
module-attribute
¶
logger = get_child_logger(__name__)
module-attribute
¶
SubtitleTrack
¶
TranscriptNotFoundError
¶
VideoDownload
dataclass
¶
Bases: VideoMetadata
Result of download operations.
VideoInfo
¶
VideoMetadata
dataclass
¶
VideoTranscript
dataclass
¶
Bases: VideoMetadata
Result of transcript operations.
download_audio_yt(url, output_dir, start_time=None)
¶
Downloads audio from YouTube URL with optional start time.
get_transcript(url, lang='en', download_dir=DEFAULT_TRANSCRIPT_DIR, keep_transcript_file=False)
¶
Downloads and extracts transcript with metadata.
get_video_download_path_yt(output_dir, url)
¶
Get video metadata and expected download path.
get_video_metadata(url)
¶
Get metadata for a YouTube video without downloading content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
url
|
str
|
YouTube video URL |
required |
Returns:
| Type | Description |
|---|---|
VideoMetadata
|
VideoMetadata with only metadata field populated |
Raises:
| Type | Description |
|---|---|
DownloadError
|
If video info extraction fails |
get_youtube_urls_from_csv(file_path)
¶
Reads YouTube URLs from a CSV file containing URLs and titles.
yt_download_service
¶
YTDownloadService
dataclass
¶
Service wrapper for YouTube download operations.
Notes
Keeps Object-Service protocol alignment; behavior is delegated for now.
downloader
instance-attribute
¶
__init__(downloader)
¶
fetch_audio(url, start=None, end=None, output_path=None)
¶
Fetch audio via the configured downloader.
fetch_metadata(url)
¶
Fetch metadata via the configured downloader.
fetch_transcript(url, lang='en', output_path=None)
¶
Fetch a transcript via the configured downloader.
fetch_video(url, quality=None, output_path=None)
¶
Fetch video via the configured downloader.
xml_processing
¶
__all__ = ['FormattingError', 'PagebreakXMLParser', 'join_xml_data_to_doc', 'remove_page_tags', 'save_pages_to_xml', 'split_xml_on_pagebreaks', 'split_xml_pages']
module-attribute
¶
FormattingError
¶
Bases: Exception
Custom exception raised for formatting-related errors.
__init__(message='An error occurred due to invalid formatting.')
¶
PagebreakXMLParser
¶
Parses XML documents split by
cleaned_text = ''
instance-attribute
¶
original_text = text
instance-attribute
¶
pagebreak_tags = []
instance-attribute
¶
pages = []
instance-attribute
¶
__init__(text)
¶
parse(page_groups=None, keep_pagebreaks=True)
¶
Parses the XML and returns a list of page contents, optionally grouped and with pagebreaks retained.
join_xml_data_to_doc(file_path, data, overwrite=False)
¶
Joins a list of XML-tagged data with newlines, wraps it with
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the output file. |
required |
data
|
List[str]
|
List of XML-tagged data strings. |
required |
overwrite
|
bool
|
Whether to overwrite the file if it exists. |
False
|
Raises:
| Type | Description |
|---|---|
FileExistsError
|
If the file exists and overwrite is False. |
ValueError
|
If the data list is empty. |
Example
join_xml_data_to_doc(Path("output.xml"), ["
Data "], overwrite=True)
remove_page_tags(text)
¶
Removes
Parameters:
- text (str): The input text containing
Returns:
- str: The text with
save_pages_to_xml(output_xml_path, text_pages, overwrite=False)
¶
Generates and saves an XML file containing text pages, with a
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_xml_path
|
Path
|
The Path object for the file where the XML file will be saved. |
required |
text_pages
|
List[str]
|
A list of strings, each representing the text content of a page. |
required |
overwrite
|
bool
|
If True, overwrites the file if it exists. Default is False. |
False
|
Returns:
| Type | Description |
|---|---|
None
|
None |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the input list of text_pages is empty or contains invalid types. |
FileExistsError
|
If the file already exists and overwrite is False. |
PermissionError
|
If the file cannot be created due to insufficient permissions. |
OSError
|
For other file I/O-related errors. |
split_xml_on_pagebreaks(text, page_groups=None, keep_pagebreaks=True)
¶
Splits an XML document into individual pages based on
split_xml_pages(text)
¶
Backwards-compatible helper that returns the page contents without pagebreak tags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
XML document string. |
required |
Returns:
| Type | Description |
|---|---|
List[str]
|
List of page strings. |
extract_tags
¶
xml_processing
¶
FormattingError
¶
Bases: Exception
Custom exception raised for formatting-related errors.
__init__(message='An error occurred due to invalid formatting.')
¶
PagebreakXMLParser
¶
Parses XML documents split by
cleaned_text = ''
instance-attribute
¶
original_text = text
instance-attribute
¶
pagebreak_tags = []
instance-attribute
¶
pages = []
instance-attribute
¶
__init__(text)
¶
parse(page_groups=None, keep_pagebreaks=True)
¶
Parses the XML and returns a list of page contents, optionally grouped and with pagebreaks retained.
join_xml_data_to_doc(file_path, data, overwrite=False)
¶
Joins a list of XML-tagged data with newlines, wraps it with
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
Path
|
Path to the output file. |
required |
data
|
List[str]
|
List of XML-tagged data strings. |
required |
overwrite
|
bool
|
Whether to overwrite the file if it exists. |
False
|
Raises:
| Type | Description |
|---|---|
FileExistsError
|
If the file exists and overwrite is False. |
ValueError
|
If the data list is empty. |
Example
join_xml_data_to_doc(Path("output.xml"), ["
Data "], overwrite=True)
remove_page_tags(text)
¶
Removes
Parameters:
- text (str): The input text containing
Returns:
- str: The text with
save_pages_to_xml(output_xml_path, text_pages, overwrite=False)
¶
Generates and saves an XML file containing text pages, with a
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_xml_path
|
Path
|
The Path object for the file where the XML file will be saved. |
required |
text_pages
|
List[str]
|
A list of strings, each representing the text content of a page. |
required |
overwrite
|
bool
|
If True, overwrites the file if it exists. Default is False. |
False
|
Returns:
| Type | Description |
|---|---|
None
|
None |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the input list of text_pages is empty or contains invalid types. |
FileExistsError
|
If the file already exists and overwrite is False. |
PermissionError
|
If the file cannot be created due to insufficient permissions. |
OSError
|
For other file I/O-related errors. |
split_xml_on_pagebreaks(text, page_groups=None, keep_pagebreaks=True)
¶
Splits an XML document into individual pages based on
split_xml_pages(text)
¶
Backwards-compatible helper that returns the page contents without pagebreak tags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
XML document string. |
required |
Returns:
| Type | Description |
|---|---|
List[str]
|
List of page strings. |