Practical Language-Aware Chunking Design¶
Practical heuristics for detecting language changes during chunking when diarization output is noisy.
Core Problem¶
Current chunking only considers time and speaker boundaries. We need language boundary detection for mixed-language audio, but diarization output is noisy (800+ segments, false speaker detections), which requires practical segment handling.
Real-World Context¶
Typical diarization output:
- 1hr audio with 5 actual speakers → 800+ segments detected
- Diarizer reports 12 speakers (due to noise/audio artifacts)
- Many segments ≤1 second (utterance-based detection)
- Need at least 5 seconds of audio for reliable language detection
Audio scenarios:
- Conversational: Multiple speakers, translation, code-switching
- Formal: Panel discussions, Q&A sessions, prepared statements
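For the sketches later in this document, assume diarization output reduces to records like the following (field names are illustrative, not the diarizer's actual schema):

```python
# Illustrative shape only; keys are assumptions, not the diarizer's real output format
segments = [
    {"start_ms": 0,    "end_ms": 900,  "speaker": "SPEAKER_00"},
    {"start_ms": 950,  "end_ms": 1400, "speaker": "SPEAKER_07"},  # false speaker from an audio artifact
    {"start_ms": 1450, "end_ms": 9200, "speaker": "SPEAKER_00"},
]
```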
Probing Strategies for Testing¶
1. Individual Segment Threshold¶
Strategy: Probe any segment longer than X seconds
- Config: min_segment_duration_for_probe: int = 5000  # 5 seconds
- Use case: Catch longer utterances that might contain language switches
- Pro: Simple, catches obvious language segments
- Con: Misses short code-switching
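A minimal sketch of this filter, assuming the illustrative segment shape above:

```python
MIN_SEGMENT_DURATION_FOR_PROBE_MS = 5000  # X = 5 seconds

def segments_to_probe(segments: list[dict]) -> list[dict]:
    """Keep only the individual segments long enough for a reliable language probe."""
    return [
        seg for seg in segments
        if seg["end_ms"] - seg["start_ms"] >= MIN_SEGMENT_DURATION_FOR_PROBE_MS
    ]
```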
2. Contiguous Speaker Duration¶
Strategy: Probe when a speaker's contiguous segments total more than Y seconds
- Config: min_speaker_duration_for_probe: int = 10000  # 10 seconds
- Logic: Accumulate consecutive segments from the same speaker until interrupted
- Use case: Speaker delivers longer statement, might switch languages mid-way
- Special case: Y=0 means probe every speaker appearance (exhaustive)
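A sketch of the accumulation logic, assuming the illustrative segment shape (Y = 0 would keep every run):

```python
MIN_SPEAKER_DURATION_FOR_PROBE_MS = 10_000  # Y = 10 seconds

def contiguous_runs_to_probe(segments: list[dict]) -> list[list[dict]]:
    """Group consecutive same-speaker segments; keep runs whose total duration crosses Y."""
    runs, current = [], []
    for seg in segments:
        if current and seg["speaker"] != current[-1]["speaker"]:
            runs.append(current)  # run interrupted by a different speaker
            current = []
        current.append(seg)
    if current:
        runs.append(current)

    def total_ms(run: list[dict]) -> int:
        return sum(s["end_ms"] - s["start_ms"] for s in run)

    return [run for run in runs if total_ms(run) >= MIN_SPEAKER_DURATION_FOR_PROBE_MS]
```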
3. Fixed Speaker Languages¶
Strategy: Assign known languages to specific speakers
- Config: fixed_speaker_languages: Dict[str, str] = {"SPEAKER_01": "en", "SPEAKER_02": "vi"}
- Use case: Known multilingual scenarios (interpreter + speaker)
- Pro: Eliminates false language detection
- Con: Requires prior knowledge
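A sketch of the lookup, assuming the detector's result is passed in as `detected`:

```python
FIXED_SPEAKER_LANGUAGES = {"SPEAKER_01": "en", "SPEAKER_02": "vi"}

def resolve_language(speaker: str, detected: str | None) -> str | None:
    """A configured assignment always wins over whatever the detector reported."""
    return FIXED_SPEAKER_LANGUAGES.get(speaker, detected)
```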
4. Contiguous Speaker Consistency¶
Strategy: Keep language fixed for uninterrupted speaker segments
- Config: lock_language_until_interruption: bool = True
- Logic: Once a language is detected for a speaker, maintain it until a different speaker talks
- Use case: Formal presentations where speakers stick to one language per turn
- Pro: Reduces false language switches
- Con: Misses mid-turn language changes
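A sketch of the lock, assuming `detected` maps segment index to a probed language for the segments that were actually probed:

```python
def apply_language_lock(segments: list[dict], detected: dict[int, str]) -> list[str | None]:
    """Carry the first detected language across an uninterrupted same-speaker run."""
    languages, current_speaker, locked = [], None, None
    for i, seg in enumerate(segments):
        if seg["speaker"] != current_speaker:
            current_speaker, locked = seg["speaker"], None  # new turn: release the lock
        if locked is None and i in detected:
            locked = detected[i]  # first detection in the turn locks its language
        languages.append(locked)
    return languages
```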
5. Speaker Transition Probing¶
Strategy: Always probe language when speaker changes
- Config: probe_on_speaker_change: bool = True
- Logic: Check the language of the new speaker (if sufficient audio is available)
- Use case: Detect language switching between speakers
- Pro: Catches most language boundaries
- Con: Expensive with many speaker transitions
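A sketch of transition probing; here "enough audio" is approximated by the new speaker's first segment length, which is an assumption rather than the final rule:

```python
MIN_PROBE_AUDIO_MS = 5000  # assumed floor for a usable probe

def probe_points(segments: list[dict]) -> list[int]:
    """Indices where the speaker changes and the first segment is long enough to probe."""
    points = []
    for i, seg in enumerate(segments):
        speaker_changed = i == 0 or seg["speaker"] != segments[i - 1]["speaker"]
        if speaker_changed and seg["end_ms"] - seg["start_ms"] >= MIN_PROBE_AUDIO_MS:
            points.append(i)
    return points
```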
Algorithm Flow¶
```mermaid
flowchart TD
A[Diarization Segments] --> B[Process Sequentially]
B --> C{Speaker Changed?}
C -->|Yes| D[Apply Speaker Transition Logic]
C -->|No| E[Apply Same-Speaker Logic]
D --> F{Fixed Language for Speaker?}
F -->|Yes| G[Use Fixed Language]
F -->|No| H{Enough Audio to Probe?}
H -->|Yes| I[Probe New Speaker Language]
H -->|No| J[Inherit Previous Language]
E --> K{Strategy Check}
K --> L{Segment > Threshold?}
L -->|Yes| M[Probe Segment Language]
L -->|No| N{Contiguous Duration > Threshold?}
N -->|Yes| O[Probe Accumulated Language]
N -->|No| P[Keep Current Language]
G --> Q[Language Decision Made]
I --> Q
J --> Q
M --> Q
O --> Q
P --> Q
Q --> R{Language Changed?}
R -->|Yes| S[Split Chunk]
R -->|No| T[Continue Chunk]
S --> U[Add to Chunks]
T --> V{Chunk Size OK?}
V -->|No| U
V -->|Yes| W[Continue Processing]
U --> W
W --> X{More Segments?}
X -->|Yes| B
X -->|No| Y[Final Chunks]
```
Configuration Options¶
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LanguageProbeConfig:
    # Individual segment probing
    min_segment_duration_for_probe: int = 5000  # ms

    # Contiguous speaker probing
    min_speaker_duration_for_probe: int = 10000  # ms
    probe_exhaustively: bool = False  # the Y=0 case

    # Fixed assignments
    fixed_speaker_languages: Dict[str, str] = field(default_factory=dict)

    # Consistency rules
    lock_language_until_interruption: bool = False

    # Transition probing
    probe_on_speaker_change: bool = True

    # Detection settings
    confidence_threshold: float = 0.7
    sample_duration_ms: int = 5000
```
Segment Combining Logic¶
Before language detection, combine very short segments to reach the minimum testable duration:
```mermaid
flowchart LR
A[Short Segments] --> B[Combine by Speaker]
B --> C{Combined Duration > Min?}
C -->|Yes| D[Probe Combined Audio]
C -->|No| E[Skip Probing]
D --> F[Language Result]
E --> G[Inherit/Default Language]
```
Combining rules:
- Combine consecutive segments from same speaker
- Stop combining when different speaker appears
- If combined duration still < minimum, skip probing
- Apply probing strategy to combined segment
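A sketch of the combining step, assuming the illustrative segment shape and a 5-second minimum:

```python
MIN_PROBE_DURATION_MS = 5000

def combine_for_probe(segments: list[dict]) -> dict | None:
    """Merge consecutive same-speaker segments until probe-able, stopping at a speaker change.

    Returns the combined span to probe, or None if the run never reaches the minimum.
    """
    if not segments:
        return None
    speaker = segments[0]["speaker"]
    start_ms, end_ms = segments[0]["start_ms"], segments[0]["end_ms"]
    for seg in segments[1:]:
        if seg["speaker"] != speaker:
            break  # a different speaker interrupts the run
        end_ms = seg["end_ms"]
        if end_ms - start_ms >= MIN_PROBE_DURATION_MS:
            break  # long enough to probe; no need to pull in more audio
    if end_ms - start_ms < MIN_PROBE_DURATION_MS:
        return None  # still too short: skip probing and inherit/default
    return {"start_ms": start_ms, "end_ms": end_ms, "speaker": speaker}
```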
Processing Sequence¶
For Each Segment in Timeline¶
1. Check speaker transition: Has the speaker changed from the previous segment?
2. Apply the appropriate strategy:
    - If the speaker changed: Check fixed languages, probe if needed
    - If the same speaker: Apply same-speaker probing rules
3. Language detection decision:
    - Use the fixed language if configured
    - Probe if thresholds are met and audio is sufficient
    - Inherit the previous language otherwise
4. Chunk management:
    - If the language changed (high confidence): Split the chunk
    - If the chunk is too large: Split on size
    - Otherwise: Continue building the chunk
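A sketch tying steps 1-3 together for one segment, assuming a placeholder probe_fn(seg) that returns (language, confidence) or None when there is not enough audio:

```python
def decide_language(seg, prev_seg, prev_language, config, probe_fn):
    """Per-segment language decision following the sequence above."""
    speaker_changed = prev_seg is None or seg["speaker"] != prev_seg["speaker"]

    if speaker_changed:
        # Steps 1-2: a fixed assignment wins; otherwise probe the new speaker if configured
        fixed = config.fixed_speaker_languages.get(seg["speaker"])
        if fixed:
            return fixed
        if config.probe_on_speaker_change:
            result = probe_fn(seg)
            if result and result[1] >= config.confidence_threshold:
                return result[0]
        return prev_language  # not enough audio or low confidence: inherit

    # Same speaker: respect the lock, otherwise probe sufficiently long segments
    if config.lock_language_until_interruption and prev_language:
        return prev_language
    if seg["end_ms"] - seg["start_ms"] >= config.min_segment_duration_for_probe:
        result = probe_fn(seg)
        if result and result[1] >= config.confidence_threshold:
            return result[0]
    return prev_language
```

Chunk management (step 4) then compares the returned language against the current chunk's language and splits on change or on size.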
Strategy Testing Plan¶
Phase 1: Basic Implementation¶
- Implement all 5 strategies as configurable options
- Test with known multilingual audio samples
- Measure language detection accuracy vs. manual annotation
Phase 2: Optimization¶
- Test different threshold values (3s, 5s, 10s, 15s)
- Compare exhaustive vs. selective probing performance
- Evaluate false positive rates for each strategy
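A sketch of the Phase 2 threshold sweep, assuming a hypothetical evaluate(config, annotated_samples) helper that scores a config against the manual annotation:

```python
from dataclasses import replace

THRESHOLDS_MS = [3000, 5000, 10000, 15000]  # 3s, 5s, 10s, 15s

def sweep_segment_threshold(base_config, annotated_samples, evaluate):
    """Run the evaluation once per candidate threshold and collect accuracy per value."""
    results = {}
    for threshold_ms in THRESHOLDS_MS:
        config = replace(base_config, min_segment_duration_for_probe=threshold_ms)
        results[threshold_ms] = evaluate(config, annotated_samples)
    return results
```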
Phase 3: Combination Testing¶
- Test multiple strategies simultaneously
- Find optimal combinations for different audio types
- Measure impact on transcription quality
Expected Outcomes¶
Conversational audio (cooking demo scenario):
- Strategy 2 + 5: Probe contiguous speaker duration + speaker transitions
- Handle 800+ segments efficiently
- Catch translation segments between speakers
Formal audio (panel discussions):
- Strategy 4 + 5: Lock language until interruption + speaker transitions
- Reduce false language switches during long statements
- Probe only at natural speaker boundaries
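Possible starting configurations for the two scenarios, reusing the LanguageProbeConfig defined above (values are starting points, not tuned results):

```python
conversational = LanguageProbeConfig(      # strategies 2 + 5
    min_speaker_duration_for_probe=10_000,
    probe_on_speaker_change=True,
)

formal = LanguageProbeConfig(              # strategies 4 + 5
    lock_language_until_interruption=True,
    probe_on_speaker_change=True,
)
```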
This practical approach handles real diarization noise while providing multiple testable strategies for different audio contexts.