ADR: YouTube Transcript Source Handling in yt-fetch CLI
Status
Proposed (supplements ADR: YouTube Transcript Format Selection)
Context
When requesting transcripts from YouTube videos:
- Videos may have manually uploaded subtitles
- Videos may have auto-generated captions
- Some videos may have both
- Quality and accuracy can vary significantly between sources
Current yt-dlp options:
```python
opts = {
    "writesubtitles": True,     # Get manual subtitles
    "writeautomaticsub": True,  # Get auto-generated captions
    "subtitleslangs": ["en"],   # Language selection
}
```
Decision
- Initially accept both sources (manual and auto-generated) with preference given to manual subtitles when available (yt-dlp's default behavior)
- Flag this as a known limitation/consideration:
  - Source of transcript (manual vs auto) may affect quality
  - No current mechanism to force selection of specific source
  - Transcript source not clearly indicated in output
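One way the transcript source could be surfaced is sketched below. It assumes the metadata dict that yt-dlp returns from `extract_info` lists manual tracks under `"subtitles"` and auto-generated tracks under `"automatic_captions"`, each keyed by language code. The helper name `detect_transcript_source` is hypothetical and not part of yt-fetch or yt-dlp.

```python
from typing import Optional


def detect_transcript_source(info: dict, lang: str = "en") -> Optional[str]:
    """Return "manual", "auto", or None for the requested language.

    Hypothetical helper: assumes `info` follows the yt-dlp metadata layout,
    with "subtitles" and "automatic_captions" keyed by language code.
    """
    # Check manual subtitles first, mirroring the preference in this ADR.
    if lang in (info.get("subtitles") or {}):
        return "manual"
    # Fall back to auto-generated captions.
    if lang in (info.get("automatic_captions") or {}):
        return "auto"
    # Neither source offers this language.
    return None
```

With yt-dlp installed, `info` would typically come from `yt_dlp.YoutubeDL(opts).extract_info(url, download=False)`. The helper only inspects the dict, so it can be exercised without network access.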
Future Considerations
Future versions should consider:
- Adding transcript source metadata
- Option to specify preferred source
- Quality indicators in output
- Logging which source was used
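As a rough illustration of the "preferred source" option, the sketch below maps a hypothetical `source` setting onto the two yt-dlp flags already used in this ADR. The function name `build_subtitle_opts` and the values `"manual"`, `"auto"`, and `"any"` are assumptions made for this example, not an existing yt-fetch interface.

```python
def build_subtitle_opts(source: str = "any", langs=("en",)) -> dict:
    """Build yt-dlp subtitle options for a hypothetical source preference.

    source: "manual", "auto", or "any" (illustrative values for this sketch).
    """
    if source not in {"manual", "auto", "any"}:
        raise ValueError(f"unknown source: {source!r}")
    return {
        # Request manual subtitles unless only auto-generated ones are wanted.
        "writesubtitles": source in {"manual", "any"},
        # Request auto-generated captions unless only manual ones are wanted.
        "writeautomaticsub": source in {"auto", "any"},
        "subtitleslangs": list(langs),
    }
```

Calling `build_subtitle_opts()` with no arguments reproduces the current options, so the existing behavior would remain the default.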
Consequences
Positive
- Simple initial implementation
- Works with all video types
- Maximum transcript availability
Negative
- Uncertain transcript source
- No quality indicators
- May get auto-generated captions when manual subtitles exist
- May get manual subtitles when auto-generated captions are preferred
Notes
This limitation is acceptable for prototyping but should be revisited when:
- Transcript quality becomes critical
- Source attribution needed
- Specific use cases require specific transcript types