Diarization Chunker Module Design Strategy¶
I've analyzed the current system and PoC code to propose a modular, extensible design for integrating the diarization chunking functionality into your tnh-scholar project.
Current Architecture Assessment¶
From the provided information, I understand that:
- The diarization_chunker module exists with some basic functionality
- The PoC in diarize_poc3.py contains key algorithms for:
  - Extracting audio segments by speaker
  - Transforming SRT timelines between speaker-specific and original timelines
- Support modules like timed_text.py handle subtitle formats
- Transcription services already have a modular structure
Design Goals¶
The enhanced system should:
- Process diarization data into speaker-specific chunks of configurable duration
- Extract audio segments for each speaker
- Create timeline mapping for accurate timestamp transformation
- Support integrating transcriptions back into a unified timeline
- Maintain modularity with small, single-purpose methods
- Use Pydantic for serializable data models
Process Flow¶
The enhanced system would work like this:
- Load diarization data and parse into DiarizationSegment objects
- Use DiarizationChunker to merge segments by speaker
- SpeakerProcessor creates speaker blocks and chunks them to target duration
- AudioExtractor extracts audio for each chunk
- TimelineMapper builds timeline mappings and can transform transcriptions
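The merge step in this flow, combining adjacent same-speaker segments when the gap between them is small, can be sketched as a pure function. The Seg dataclass and merge_by_speaker name below are illustrative stand-ins, not the module's real types:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Seg:
    """Stand-in for DiarizationSegment (illustrative only)."""
    speaker: str
    start_ms: int
    end_ms: int

def merge_by_speaker(segments: List[Seg], gap_threshold_ms: int = 2000) -> List[Seg]:
    """Merge consecutive same-speaker segments separated by small gaps."""
    merged: List[Seg] = []
    for seg in sorted(segments, key=lambda s: s.start_ms):
        if (merged
                and merged[-1].speaker == seg.speaker
                and seg.start_ms - merged[-1].end_ms <= gap_threshold_ms):
            # Same speaker, gap under threshold: extend the previous segment.
            merged[-1].end_ms = max(merged[-1].end_ms, seg.end_ms)
        else:
            merged.append(Seg(seg.speaker, seg.start_ms, seg.end_ms))
    return merged
```

A same-speaker segment that resumes after a long pause starts a fresh merged segment, which is what lets the chunker later cut at natural boundaries.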
Integration with timed_text.py¶
The TimelineMapper would integrate with timed_text.py by:
- Parsing SRT content into TimedTextSegment objects
- Applying timeline transformations to each segment
- Generating new SRT content with adjusted timestamps
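The timestamp-adjustment step can be illustrated with two small helpers (standalone stand-ins here; the real segment type lives in timed_text.py): one formats milliseconds as an SRT timestamp, the other shifts a cue line by a timeline offset.

```python
def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{milli:03}"

def shift_cue(start_ms: int, end_ms: int, offset_ms: int) -> str:
    """Render a cue's timing line after applying a timeline offset."""
    return f"{ms_to_srt(start_ms + offset_ms)} --> {ms_to_srt(end_ms + offset_ms)}"
```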
Refinement: Enhanced Diarization Chunker Design Strategy¶
This refinement comes from a dialog with the ChatGPT o3 model.
1. Improved Package Structure¶
Combining both approaches, I recommend this refined structure:
```
tnh_scholar/
└── audio_processing/
    └── diarization/
        ├── __init__.py
        ├── chunker.py           # Main facade (DiarizationChunker)
        ├── models.py            # Pydantic data models
        ├── config.py            # Configuration with BaseSettings
        ├── speaker_processor.py # Core speaker processing logic
        ├── audio.py             # Audio operations (isolates ffmpeg)
        ├── mapping.py           # Timeline mapping utilities
        └── _extractors.py       # Internal helper classes
```
Key improvements from ChatGPT o3:
- Separate audio.py to isolate external dependencies (ffmpeg)
- More granular separation of concerns
- Clear distinction between public interfaces and internal helpers
2. Enhanced Configuration Model¶
Adopting BaseSettings for environmental flexibility:
```python
from pydantic import BaseSettings, Field

class ChunkerConfig(BaseSettings):
    """Configuration for diarization chunking"""
    target_duration_ms: int = Field(300000, env='CHUNKER_TARGET_MS')
    gap_threshold_ms: int = Field(2000, env='CHUNKER_GAP_MS')
    min_segment_duration_ms: int = Field(1000)
    include_silence_padding: bool = True
    silence_padding_ms: int = 500
    audio_format: str = "mp3"
    audio_bitrate: str = "128k"
    extract_audio: bool = True
    cache_temp_files: bool = True

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
```
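A brief usage sketch of the settings precedence (the class is restated in trimmed form so the snippet stands alone; this assumes Pydantic v1 — v2 moves BaseSettings into the separate pydantic-settings package):

```python
import os
from pydantic import BaseSettings, Field  # v1 API; v2 uses pydantic-settings

class ChunkerConfig(BaseSettings):
    target_duration_ms: int = Field(300000, env='CHUNKER_TARGET_MS')
    gap_threshold_ms: int = Field(2000, env='CHUNKER_GAP_MS')

os.environ["CHUNKER_TARGET_MS"] = "120000"
config = ChunkerConfig(gap_threshold_ms=3000)
# target_duration_ms resolves from the environment variable,
# gap_threshold_ms from the constructor; anything unset falls back
# to the .env file and then the declared defaults.
```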
3. Refined Data Models¶
Combining both proposals with clearer separation:
```python
from typing import List, Optional
from pydantic import BaseModel, Field

class DiarizationSegment(BaseModel):
    """Raw segment from diarization"""
    speaker: str
    start_ms: int
    end_ms: int

# Declared before Chunk so the annotation resolves without a forward reference.
class TimelineMapping(BaseModel):
    """Maps between original and speaker-specific timelines"""
    original_start_ms: int
    original_end_ms: int
    speaker_start_ms: int
    speaker_end_ms: int
    srt_indices: List[int] = Field(default_factory=list)

class Chunk(BaseModel):
    """Processed chunk for extraction"""
    speaker: str
    start_ms: int
    end_ms: int
    segments: List[DiarizationSegment] = Field(default_factory=list)
    audio_data: Optional[bytes] = None
    timeline_mappings: List[TimelineMapping] = Field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms
```
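A usage sketch for these models (redefined here in trimmed form so the snippet is self-contained); note that duration_ms is derived, never stored:

```python
from typing import List
from pydantic import BaseModel, Field

class DiarizationSegment(BaseModel):
    speaker: str
    start_ms: int
    end_ms: int

class Chunk(BaseModel):
    speaker: str
    start_ms: int
    end_ms: int
    segments: List[DiarizationSegment] = Field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        # Derived on access, so it can never drift from the bounds.
        return self.end_ms - self.start_ms

chunk = Chunk(
    speaker="SPEAKER_00",
    start_ms=0,
    end_ms=300000,
    segments=[DiarizationSegment(speaker="SPEAKER_00", start_ms=0, end_ms=120000)],
)
```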
4. Enhanced Public Interface¶
Adopting the facade pattern with incremental processing:
```python
class DiarizationChunker:
    """Main orchestrator for diarization processing"""

    def __init__(self, config: Optional[ChunkerConfig] = None):
        self.config = config or ChunkerConfig()
        self._processor = SpeakerProcessor(self.config)
        self._audio_handler = AudioHandler(self.config)
        self._mapper = TimelineMapper()

    def parse_diarization(self, data: Union[Dict, Path]) -> List[DiarizationSegment]:
        """Parse diarization data from various sources"""
        ...

    def build_chunks(self, segments: List[DiarizationSegment]) -> List[SpeakerChunk]:
        """Create speaker chunks from segments"""
        ...

    def extract_audio(self, chunks: List[SpeakerChunk], audio_file: Path) -> List[SpeakerChunk]:
        """Extract audio for each chunk"""
        ...

    def build_mapping(self, chunks: List[SpeakerChunk], original_text: TimedText) -> List[TimelineMapping]:
        """Create timeline mappings"""
        ...

    def transform_timed_text(self, original: TimedText, mappings: List[TimelineMapping]) -> TimedText:
        """Transform timestamps back to original timeline"""
        ...

    # Convenience wrapper as suggested by ChatGPT o3
    def process(self, audio_file: Path, diarization_data: Dict,
                original_srt: Optional[str] = None) -> ProcessingResult:
        """End-to-end processing convenience method"""
        ...
```
5. Audio Isolation Layer¶
Implementing ChatGPT o3's excellent suggestion:
```python
class AudioHandler:
    """Isolates audio operations and external dependencies"""

    def __init__(self, config: ChunkerConfig):
        self.config = config
        self._cache = {} if config.cache_temp_files else None

    def slice_audio(self, path: Path, start_ms: int, end_ms: int) -> bytes:
        """Extract audio segment with caching"""
        ...

    def add_silence(self, audio_data: bytes, duration_ms: int) -> bytes:
        """Add silence padding"""
        ...

    def combine_segments(self, segments: List[bytes]) -> bytes:
        """Combine multiple audio segments"""
        ...
```
6. Timeline Mapping Refinements¶
Separating pure mapping logic from SRT I/O:
```python
class TimelineMapper:
    """Pure timeline mapping utilities"""

    def build_overlap_map(self, chunks: List[SpeakerChunk], timed_text: TimedText) -> List[TimelineMapping]:
        """Create mapping based on overlap algorithm"""
        ...

    def find_best_interval(self, target_start: int, target_end: int,
                           intervals: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
        """Find interval with maximum overlap"""
        ...

    def transform_timestamp(self, timestamp_ms: int, mapping: TimelineMapping) -> int:
        """Apply single timestamp transformation"""
        ...
```
7. Testing Strategy (Enhanced)¶
Incorporating ChatGPT o3's testing suggestions:
- Unit tests for each module
- Property-based testing with Hypothesis for mapping algorithms
- Golden-file tests for end-to-end validation
- Mock-based tests for audio operations
```python
# Example property test structure
from hypothesis import given, strategies as st

@given(
    chunks=st.lists(chunk_strategy(), min_size=1),
    timed_text=timed_text_strategy()
)
def test_mapping_preserves_order(chunks, timed_text):
    """Property: mappings maintain chronological order"""
    ...
```
8. Migration Plan¶
Adopting ChatGPT o3's incremental approach:
- Phase 1: Create models.py and config.py
- Phase 2: Implement audio.py to isolate ffmpeg
- Phase 3: Move core logic to speaker_processor.py
- Phase 4: Implement mapping.py with pure functions
- Phase 5: Create chunker.py facade
- Phase 6: Add comprehensive tests
- Phase 7: Write notebook examples
9. Enhanced Integration Example¶
```python
# Notebook usage with step-by-step inspection
chunker = DiarizationChunker(ChunkerConfig(
    target_duration_ms=300000,
    audio_format="mp3"
))

# Step-by-step processing
segments = chunker.parse_diarization(diarization_data)
print(f"Found {len(segments)} segments")

chunks = chunker.build_chunks(segments)
print(f"Created {len(chunks)} chunks")

chunks_with_audio = chunker.extract_audio(chunks, audio_file)
mappings = chunker.build_mapping(chunks_with_audio, original_timed_text)

# Or use convenience method
result = chunker.process(audio_file, diarization_data, original_srt)
```
10. Additional Considerations¶
Based on ChatGPT o3's open questions:
- Multiple diarization sources: Create an abstract DiarizationParser protocol
- Concurrency: Add an optional ThreadPoolExecutor for audio operations
- Error tolerance: Make chunk merging behavior configurable
```python
from typing import Any, List, Protocol

class DiarizationParser(Protocol):
    """Protocol for different diarization sources"""

    def parse(self, data: Any) -> List[DiarizationSegment]:
        ...
```
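A hypothetical concrete parser satisfying this protocol could look like the following. The {"segments": [{"speaker": ..., "start": s, "end": s}]} input shape is an assumption (roughly what pyannote-style JSON exports use), not the final format, and DiarizationSegment is redeclared as a plain dataclass so the snippet stands alone:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class DiarizationSegment:
    speaker: str
    start_ms: int
    end_ms: int

class JsonDiarizationParser:
    """Parses a seconds-based JSON export into millisecond segments."""

    def parse(self, data: Any) -> List[DiarizationSegment]:
        return [
            DiarizationSegment(
                speaker=item["speaker"],
                start_ms=round(item["start"] * 1000),  # seconds -> ms
                end_ms=round(item["end"] * 1000),
            )
            for item in data["segments"]
        ]
```

Because DiarizationParser is a Protocol, JsonDiarizationParser needs no inheritance; any class with a matching parse method satisfies it structurally.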
Benefits of Synthesized Design¶
This combined approach offers:
- Better isolation: Audio operations cleanly separated
- Environmental flexibility: BaseSettings enables easy configuration
- Incremental processing: Step-by-step methods for notebook exploration
- Robust testing: Multiple testing strategies for reliability
- Future extensibility: Clear extension points for new features
- Cache optimization: Optional caching for performance
- Clean migration path: Incremental steps from PoC to production
This synthesized design takes the best ideas from both approaches, creating a more robust, flexible, and maintainable solution that's still suitable for the alpha development phase.