Language-Aware Chunking Orchestrator Notes¶
Working notes for extending the DiarizationChunker orchestrator with language-aware strategies.
Design Proposal: Language-Aware Chunking Extension for diarization_chunker
- Background
DiarizationChunker currently supports two implicit strategies 1. Contiguous-Time (no speaker split) 2. Speaker-Aware (split at speaker change)
Both rely exclusively on timing / speaker labels and treat language as uniform.
⸻
- Objectives
Goal Why it matters Language-Purity per Chunk Whisper (and most ASR models) transcribe best when each input clip is single-language. Pluggable Strategies Future research (e.g., energy-based chunking, silence detection) should drop in without core refactors. Compact, Single-Purpose Methods Improves testability & maintenance. Config-Driven Users should tune via settings rather than editing code.
⸻
-
High-Level Architecture
+--------------------+audio ----> | DiarizationChunker | (orchestrator, owns config) +---------+----------+ | v +--------------------+ +----------------+ | IChunkingStrategy |<--- dependency delve --| LanguageProbe | +---------+----------+ +----------------+ ^ ^ (Strategy 1) | | | (Strategy N) | +--------------------+ +-------------------+ | DefaultStrategy | | LanguageAwareStrategy | +--------------------+ +-------------------+
• IChunkingStrategy (Protocol / ABC)
class IChunkingStrategy(Protocol): def extract(self, segments: list[DiarizationSegment]) -> list[DiarizationChunk]: ...
• DiarizationChunker becomes a light façade
• Accepts strategy: IChunkingStrategy via constructor or config name.
• Delegates all chunk creation to that strategy—no branching inside.
• LanguageProbe helper
• Stateless (or memoized) service:
• probe(audio: AudioChunk, *, sample_ms: int, tries: int, randomize: bool) -> str
• Wraps Whisper‐tiny / faster language ID.
⸻
- New Strategy: LanguageAwareStrategy
Step Action Notes 1 Group contiguous segments by speaker Reuse existing _gap_time, _speaker_change. 2 Compute speaker-block duration Skip probing if block < min_probe_block_ms. 3 Sample audio from each block Either deterministic (mid-segment) or random.sample. 4 Detect language via LanguageProbe Early-exit if all blocks agree on language. 5 Split chunk if mixed languages Options:• Hard split at first language boundary• Greedy accumulate until mismatch• Recursively bisect noisy blocks 6 Return list[DiarizationChunk] Each chunk carries a language attribute (extend DiarizationChunk).
Config Additions (suggested):
Field Type Default Description lang_probe_sample_ms int 2000 Milliseconds per sample. lang_probe_tries int 3 Number of probes per block. lang_split_enabled bool True Gate for runtime enable/disable. max_speakers_for_probe int 2 Skip whole strategy if count greater. min_probe_block_ms int 15 000 Don’t probe very short blocks.
⸻
- Modifications to Existing Models
Class Change DiarizationChunk Add language: Optional[str] = None ChunkerConfig Extend with new language-probe fields. AudioChunk (if available) Provide lightweight .slice(start_ms, end_ms) for probe sampling.
⸻
- Example (Illustrative Only)
orchestrator¶
chunker = DiarizationChunker( strategy=LanguageAwareStrategy(config=ChunkerConfig(lang_split_enabled=True)) )
chunks = chunker.chunk(diarization_segments) for ch in chunks: print(ch.language, ch.total_duration_sec)
⸻
- Open Questions & Options
Topic Choices & Trade-offs Language detector Whisper-tiny vs. fasttext vs. pycld3. Whisper is heavier but aligned with final ASR. Granularity of split Split at block boundary vs. mid-block split when detection disagrees? Parallelism Probe calls can run concurrently; decide between concurrent.futures or asyncio. Caching Same speaker ID across chunks might reuse earlier language result. Testing Need synthetic mixed-language audio fixtures. Fallback If detector returns "unknown" above threshold, skip splitting to avoid over-fragmentation.
⸻
- Migration Plan
- Introduce interfaces.py with IChunkingStrategy, LanguageDetector.
- Refactor existing logic out of DiarizationChunker into DefaultStrategy.
- Add unit tests ensuring parity with current outputs for default strategy.
- Implement LanguageProbe (start with Whisper CLI call wrapper).
- Develop LanguageAwareStrategy iteratively: • Phase-0: Probe once per speaker, no splitting—just annotate language. • Phase-1: Enable optional split on mismatch.
- Benchmark transcription WER before & after on bilingual test set.
- Update docs & example notebooks.
⸻
- Future Extensions • Energy-AwareStrategy (split on low RMS zones). • Topic-AwareStrategy (use embedding similarity between segments). • Dynamic-Target-Duration (adjust chunk length based on compute budget).
⸻
- Summary
By extracting chunking logic into interchangeable strategies and adding a LanguageAwareStrategy that probes speaker blocks for language shifts, we achieve: • Cleaner separation of concerns • Config-driven experimentation • Improved ASR accuracy on mixed-language recordings • Foundation for future chunking heuristics
⸻