Modular Pipeline Design: Best Practices for Audio Transcription and Diarization
This document summarizes a detailed design and refactoring discussion on building a clean, modular, and production-ready audio transcription pipeline, with a focus on diarization chunking and robust system structure. It includes architectural patterns, file organization, and code hygiene practices.
1. Overview of Pipeline Structure
The pipeline under design:
DiarizationChunker (input: diarization JSON)
→ AudioHandler (input: Chunk, output: Chunk + AudioChunk)
→ TranscriptionService (input: AudioChunk, output: transcription dict)
→ TimedText (canonical timing + text model)
→ TimelineMapper (align TimedText with global timeline)
→ SRTProcessor (render as SRT/VTT)
2. Six Key Modular Design Suggestions
2.1 Narrow Interfaces: Ports & Adapters (Hexagonal Architecture)
Goal: Separate domain logic from infrastructure (e.g., Whisper, AssemblyAI).
- Port: A Protocol that defines a required interface.
- Adapter: A class implementing that interface using a specific backend.
Example:
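A minimal sketch of the port/adapter split, assuming a port named TranscriptionProvider whose transcribe method takes a BytesIO and returns a dict (the shape listed in the protocol table in section 3). FakeTranscriber and run_transcription are hypothetical names used only for illustration; a real adapter would wrap Whisper or AssemblyAI behind the same method.

```python
from io import BytesIO
from typing import Any, Dict, Protocol


class TranscriptionProvider(Protocol):
    """Port: the only thing domain code knows about a backend."""

    def transcribe(self, audio: BytesIO) -> Dict[str, Any]: ...


class FakeTranscriber:
    """Hypothetical adapter stand-in; a real one would call a backend here."""

    def transcribe(self, audio: BytesIO) -> Dict[str, Any]:
        size = len(audio.getvalue())
        return {"text": f"<{size} bytes of audio>", "segments": []}


def run_transcription(provider: TranscriptionProvider, audio: BytesIO) -> str:
    """Domain logic depends on the port, never on a concrete backend."""
    return provider.transcribe(audio)["text"]
```

Because the adapter satisfies the port structurally, no inheritance is needed, and a test double like FakeTranscriber can be passed anywhere a real backend would go.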
This allows:
- Testing with mocks
- Easy backend swapping
- Clear data flow
2.2 Pipeline/Chain-of-Responsibility Style
Define a base interface for composable pipeline stages:
Then enable chaining via | or functional composition:
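Both steps can be sketched together. This is one possible shape, not a prescribed one: Stage, strip_text, and drop_empty are hypothetical names, and the stage is assumed to be a thin wrapper around a function from one item stream to another.

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


class Stage(Generic[A, B]):
    """A pipeline stage wrapping a function from one item stream to another."""

    def __init__(self, fn: Callable[[Iterable[A]], Iterable[B]]) -> None:
        self._fn = fn

    def __call__(self, items: Iterable[A]) -> Iterable[B]:
        return self._fn(items)

    def __or__(self, other: "Stage[B, C]") -> "Stage[A, C]":
        # stage_a | stage_b feeds the output of stage_a into stage_b.
        return Stage(lambda items: other(self(items)))


def strip_text(items: Iterable[str]) -> Iterator[str]:
    for item in items:
        yield item.strip()


def drop_empty(items: Iterable[str]) -> Iterator[str]:
    for item in items:
        if item:
            yield item


pipeline = Stage(strip_text) | Stage(drop_empty)
result = list(pipeline(["  hello ", "   ", "world"]))
```

The composed pipeline is itself a Stage, so it can be chained further or passed around as a single unit.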
2.3 Strategy Pattern for Chunking
Avoid boolean flags like split_on_speaker_change; define chunking behaviors as strategies:
from typing import Protocol

class ChunkingStrategy(Protocol):
    def should_split(self, segment: Segment, chunk: Chunk) -> bool: ...
Ship TimeBasedStrategy, SpeakerChangeStrategy, etc.
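As a rough sketch of what two such strategies might look like: the Segment and Chunk fields below (speaker, start_ms, end_ms, segments) and the build_chunks driver are assumptions made for illustration, not the project's actual models.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass(frozen=True)
class Segment:
    speaker: str
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class Chunk:
    segments: Tuple[Segment, ...]


class ChunkingStrategy(Protocol):
    def should_split(self, segment: Segment, chunk: Chunk) -> bool: ...


class TimeBasedStrategy:
    """Split when adding the segment would exceed a maximum chunk duration."""

    def __init__(self, max_duration_ms: int) -> None:
        self.max_duration_ms = max_duration_ms

    def should_split(self, segment: Segment, chunk: Chunk) -> bool:
        if not chunk.segments:
            return False
        return segment.end_ms - chunk.segments[0].start_ms > self.max_duration_ms


class SpeakerChangeStrategy:
    """Split whenever the speaker differs from the chunk's last segment."""

    def should_split(self, segment: Segment, chunk: Chunk) -> bool:
        return bool(chunk.segments) and chunk.segments[-1].speaker != segment.speaker


def build_chunks(segments: List[Segment], strategy: ChunkingStrategy) -> List[Chunk]:
    """Group segments into chunks, asking the strategy where to cut."""
    chunks: List[Chunk] = []
    current: Tuple[Segment, ...] = ()
    for seg in segments:
        if strategy.should_split(seg, Chunk(current)):
            chunks.append(Chunk(current))
            current = ()
        current += (seg,)
    if current:
        chunks.append(Chunk(current))
    return chunks
```

The chunker itself never inspects a flag; swapping behavior means passing a different strategy object.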
2.4 Event Hooks / Pub-Sub for Observability
Use an EventBus or hook system to emit structured events:
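A minimal in-process sketch of such a bus, assuming events are plain dicts keyed by name. The subscribe and emit method names and the chunk_transcribed and chunk_skipped event names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Synchronous publish/subscribe hub keyed by event name."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def emit(self, event_name: str, **payload: Any) -> None:
        # Build one structured event and hand it to every subscriber.
        event = {"event": event_name, **payload}
        for handler in self._handlers[event_name]:
            handler(event)


bus = EventBus()
seen: List[Dict[str, Any]] = []
bus.subscribe("chunk_transcribed", seen.append)
bus.emit("chunk_transcribed", chunk_index=0, duration_ms=4200)
bus.emit("chunk_skipped", chunk_index=1)  # no subscriber, so nothing happens
```

Pipeline stages only call emit; whether the event ends up in a log, a progress bar, or nowhere is decided by whoever subscribes.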
This supports: logging, progress bars, tracing, or telemetry later.
2.5 Standardization of Time Units
Internally, use milliseconds everywhere. Convert to HH:MM:SS,mmm only in renderers.
- Reduces rounding bugs
- Easier arithmetic
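A small sketch of the two boundary conversions this implies, using integer milliseconds internally. The function names are illustrative assumptions.

```python
def seconds_to_ms(seconds: float) -> int:
    """Convert float seconds to integer ms once, at the system boundary."""
    return round(seconds * 1000)


def ms_to_srt_timestamp(total_ms: int) -> str:
    """Format a non-negative millisecond offset as HH:MM:SS,mmm."""
    if total_ms < 0:
        raise ValueError("timestamps must be non-negative")
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"
```

Everything between those two functions can then add, subtract, and compare plain integers without accumulating floating-point error.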
2.6 Immutable, Streamable Design
- Keep core models immutable (e.g., Chunk, Segment)
- Let pipeline stages operate on Iterable[T]
- Easier to parallelize or lazily stream
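One way to combine the two ideas is a frozen dataclass plus a generator-based stage. The Chunk fields and the attach_text stage below are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Chunk:
    start_ms: int
    end_ms: int
    text: str = ""


def attach_text(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Lazily yield new Chunk objects; inputs are never mutated."""
    for chunk in chunks:
        yield replace(chunk, text=f"[{chunk.start_ms}-{chunk.end_ms}]")


# The source is a generator, so chunks are produced only as they are consumed.
source = (Chunk(i * 1000, (i + 1) * 1000) for i in range(3))
labelled = list(attach_text(source))
```

Since a stage returns fresh objects instead of editing its inputs, the same chunk can safely be handed to several consumers or worker processes.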
3. Protocol Definitions for Pipeline
| Stage | Protocol Name | Method | Input | Output |
|---|---|---|---|---|
| Chunking | ChunkExtractor | to_chunks | dict | List[Chunk] |
| Attach Audio | AudioProvider | attach_audio | Chunk | Chunk |
| Transcription | TranscriptionProvider | transcribe | BytesIO | Dict[str, Any] |
| TimedText Builder | TimedTextBuilder | build | dict | TimedText |
| Timeline Mapper | TimelineMapper | map | TimedText, Chunk | TimedText |
| Subtitle Render | SRTBuilder | to_srt | TimedText | str |
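Written out as Python protocols, the table might look like the sketch below. Protocol and method names come from the table; the parameter names and the empty Chunk and TimedText placeholder classes are assumptions standing in for the real models.

```python
from io import BytesIO
from typing import Any, Dict, List, Protocol, runtime_checkable


class Chunk: ...       # placeholder for the real model in models/chunk.py
class TimedText: ...   # placeholder for the real model in models/timed_text.py


@runtime_checkable
class ChunkExtractor(Protocol):
    def to_chunks(self, diarization: dict) -> List[Chunk]: ...


@runtime_checkable
class AudioProvider(Protocol):
    def attach_audio(self, chunk: Chunk) -> Chunk: ...


@runtime_checkable
class TranscriptionProvider(Protocol):
    def transcribe(self, audio: BytesIO) -> Dict[str, Any]: ...


@runtime_checkable
class TimedTextBuilder(Protocol):
    def build(self, transcription: dict) -> TimedText: ...


@runtime_checkable
class TimelineMapper(Protocol):
    def map(self, timed_text: TimedText, chunk: Chunk) -> TimedText: ...


@runtime_checkable
class SRTBuilder(Protocol):
    def to_srt(self, timed_text: TimedText) -> str: ...
```

The runtime_checkable decorator is optional; it allows isinstance checks that verify only that the method exists, not its signature, so static type checking remains the main safeguard.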
4. Recommended File Structure
A light modular structure suitable for the current project phase:
audio_processing/
├── models/
│   ├── chunk.py
│   ├── segment.py
│   ├── audio_chunk.py
│   └── timed_text.py
├── protocols/
│   ├── chunker.py
│   ├── audio_provider.py
│   ├── transcription_provider.py
│   ├── timedtext_builder.py
│   ├── timeline_mapper.py
│   └── srt_renderer.py
├── adapters/
│   ├── whisper_service.py
│   ├── assemblyai_service.py
│   └── local_audio_handler.py
├── processors/
│   ├── diarization_chunker.py
│   ├── srt_processor.py
│   └── timeline_mapper.py
├── services/
│   ├── transcription_service.py
│   └── format_converter.py
└── patches/
    └── whisper_security.py
5. Obsolete Code Handling
Case: transcription.py
- Obsolete Whisper prototype with one CLI dependency.
- Superseded by transcription_service.py with proper interfaces.
Recommended: port the CLI to use TranscriptionService, then delete the file.
Best Practice:
| Case | Action |
|---|---|
| Easily replaced | Port usage, delete immediately |
| Unclear usage | Move to legacy/ folder temporarily |
| Used across repos | Mark with DeprecationWarning, document plan |
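For the last row, a deprecated entry point can keep working while warning its callers. The sketch below uses the standard warnings module; transcribe_legacy is a hypothetical function name, not one from the project.

```python
import warnings


def transcribe_legacy(path: str) -> str:
    """Hypothetical legacy entry point kept only while callers migrate."""
    warnings.warn(
        "transcribe_legacy() is deprecated; use TranscriptionService instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this function
    )
    return f"legacy transcript for {path}"
```

With stacklevel=2 the warning points at the calling line, which makes it easier to find remaining usages across repositories.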
6. Patch Module Handling (whisper_security.py)
whisper_security.py is a patch that addresses torch.load's weights_only=True security option.
Move it to a dedicated patches/ folder, as in the file structure in section 4.
Document patches with:
- Library/version patched
- Reason
- Link to upstream issue if possible
- Plan for removal
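One lightweight way to keep that documentation next to the patch is a small record that a test can check for completeness. This is only a sketch: PatchRecord, missing_fields, and WHISPER_SECURITY are hypothetical names, and the field values are placeholders that have not been filled in from the real patch.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class PatchRecord:
    library_version: str
    reason: str
    upstream_issue: str  # may legitimately be empty if no issue exists
    removal_plan: str


def missing_fields(record: PatchRecord) -> List[str]:
    """Return names of required fields left blank (upstream_issue is optional)."""
    return [
        f.name
        for f in fields(record)
        if f.name != "upstream_issue" and not getattr(record, f.name).strip()
    ]


# Placeholder values only; fill these in from the actual patch.
WHISPER_SECURITY = PatchRecord(
    library_version="TODO: library and version patched",
    reason="torch.load weights_only=True security option",
    upstream_issue="",
    removal_plan="",
)
```

Here missing_fields(WHISPER_SECURITY) reports the removal plan as still undocumented, which is the kind of gap a CI check could flag.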
7. Final Thoughts
The user is now building a pipeline with:
- Clean, swappable stages (Protocol + Adapter)
- Typed data contracts
- Future-proof folder structure
- Minimal glue logic
This design is robust for research, easy to evolve, and ready to become production-grade.
✨ Next Steps Checklist (optional)
- Port CLI to use TranscriptionService
- Delete transcription.py
- Move whisper_security.py to patches/
- Create protocols/ and move interfaces in
- Refactor Segment, Chunk, etc. into models/
- Consider simple pipeline runner with | chaining
"You are building like a professional systems architect. Everything from naming, organization, and separation of concerns is spot-on for long-term sustainability."