Simplified Language-Aware Chunking Design¶
Language-aware chunking strategy that augments diarization splits with practical language detection heuristics.
Core Problem¶
Current chunking only considers time and speaker boundaries. We need to add language boundary detection to prevent mixed-language chunks that degrade ASR performance.
Core Functionality¶
- Detect language changes within speaker blocks (when speaker has long continuous speech)
- Detect language changes between speakers
- Keep segments together when language is consistent (up to target chunk size)
Basic Protocols¶
class ChunkingStrategy(Protocol):
def chunk(self, segments: List[DiarizationSegment], config: ChunkerConfig) -> List[DiarizationChunk]:
...
class LanguageDetector(Protocol):
def detect(self, audio_chunk: AudioChunk, start_ms: int = 0, duration_ms: int = 5000) -> tuple[str, float]:
"""Returns (language_code, confidence)"""
...
def detect_multiple(self, audio_samples: List[tuple[AudioChunk, int, int]]) -> List[tuple[str, float]]:
"""Concurrent detection for multiple samples"""
...
Simple Configuration¶
@dataclass
class ChunkerConfig:
# Existing fields
target_time: int = 300_000
# Language detection
enable_language_detection: bool = False
min_speaker_duration_for_probe: int = 20_000 # Only probe long speaker segments
sample_duration_ms: int = 5000
confidence_threshold: float = 0.7
High-Level Algorithm¶
Language-Aware Chunking Flow¶
flowchart TD
A[Diarization Segments] --> B[Group by Speaker]
B --> C{Speaker Duration > Threshold?}
C -->|Yes| D[Probe Speaker Language]
C -->|No| E[Use Previous Language]
D --> F[Compare with Previous Language]
E --> F
F --> G{Language Changed?}
G -->|Yes| H[Split Chunk Here]
G -->|No| I[Continue Current Chunk]
H --> J{Chunk Size > Target?}
I --> J
J -->|Yes| K[Finalize Chunk]
J -->|No| L[Add More Segments]
K --> M[Next Segments]
L --> M
M --> N{More Segments?}
N -->|Yes| C
N -->|No| O[Final Chunks]
Core Algorithm Logic¶
- Initialize: Start with first speaker's segments
- For each speaker transition:
- If speaker duration > threshold: probe language
- Compare language with current chunk language
- If different: split chunk, start new chunk
- If same: continue adding to current chunk
- Within speaker: Only probe if speaker has very long continuous segments
- Size check: Split chunk if target size exceeded regardless of language
Simple Implementation Outline¶
class LanguageAwareStrategy:
def __init__(self, detector: LanguageDetector):
self.detector = detector
def chunk(self, segments: List[DiarizationSegment], config: ChunkerConfig) -> List[DiarizationChunk]:
chunks = []
current_chunk_segments = []
current_language = None
speaker_groups = self._group_by_speaker(segments)
for speaker_group in speaker_groups:
# Probe language if speaker duration is significant
if self._should_probe_language(speaker_group, config):
detected_language = self._probe_speaker_language(speaker_group, config)
else:
detected_language = current_language # Inherit previous
# Check for language change
if (current_language is not None and
detected_language != current_language and
current_chunk_segments):
# Language changed - finalize current chunk
chunks.append(self._create_chunk(current_chunk_segments, current_language))
current_chunk_segments = []
# Add speaker segments to current chunk
current_chunk_segments.extend(speaker_group)
current_language = detected_language
# Check size limit
if self._chunk_too_large(current_chunk_segments, config):
chunks.append(self._create_chunk(current_chunk_segments, current_language))
current_chunk_segments = []
# Handle final chunk
if current_chunk_segments:
chunks.append(self._create_chunk(current_chunk_segments, current_language))
return chunks
Concurrency Requirements¶
WhisperService Thread Safety¶
class ConcurrentLanguageDetector:
def __init__(self, max_workers: int = 3):
self.max_workers = max_workers
self._thread_local = threading.local()
def _get_whisper_service(self) -> WhisperTranscriptionService:
"""Thread-local whisper service instance"""
if not hasattr(self._thread_local, 'service'):
self._thread_local.service = WhisperTranscriptionService()
return self._thread_local.service
Key Requirements:
- Thread-local WhisperService instances (avoid shared state)
- Concurrent audio sample processing
- Simple result aggregation
Decision Points¶
When to Probe Language¶
- Always probe: When speaker duration >
min_speaker_duration_for_probe - Never probe: Short speaker segments (inherit previous language)
- Future enhancement: Probe within very long single-speaker segments
Language Change Handling¶
- High confidence change: Split chunk immediately
- Low confidence: Continue current chunk (avoid false splits)
- No previous language: Use detected language as baseline
Simple Data Flow¶
sequenceDiagram
participant C as Chunker
participant D as LanguageDetector
participant W as WhisperService
loop For each speaker group
C->>C: Check if should probe
alt Duration > threshold
C->>D: Probe language
D->>W: Detect language (concurrent)
W->>D: Language + confidence
D->>C: Detection result
else Too short
C->>C: Use previous language
end
C->>C: Compare with current chunk language
alt Language changed
C->>C: Finalize current chunk
C->>C: Start new chunk
else Same language
C->>C: Add to current chunk
end
end
Minimal Extensions¶
Future Hooks¶
- Energy-based detection: Add energy threshold checks before language probing
- Topic awareness: Add semantic similarity checks
- Adaptive thresholds: Adjust confidence thresholds based on audio quality
Configuration Expansion¶
# Future config additions (placeholders)
adaptive_thresholds: bool = False # Future: adjust based on audio quality
probe_long_segments: bool = False # Future: probe within long single-speaker segments
Success Criteria¶
- Functional: Chunks have consistent language when confidence is high
- Performance: Language detection doesn't significantly slow processing
- Simple: Easy to understand and modify algorithm
- Extensible: Clear hooks for future enhancements
This simplified approach focuses on the core language boundary detection while maintaining the extensibility needed for future improvements.