ADR-YF01: YouTube Transcript Source Handling¶

Keeps both manual and auto-generated transcripts available while acknowledging source ambiguity in early yt-fetch releases.

Status: Proposed
Date: 2025-01-15

Context¶

When requesting transcripts from YouTube videos:

Videos may have manually uploaded subtitles
Videos may have auto-generated captions
Some videos may have both
Quality and accuracy can vary significantly between sources

Currently yt-dlp options:

```python opts = { "writesubtitles": True, # Get manual subtitles "writeautomaticsub": True, # Get auto-generated captions "subtitleslangs": ["en"] # Language selection }

Decision¶

Initially accept both sources (manual and auto-generated) with preference given to manual subtitles when available (yt-dlp's default behavior)
Flag this as a known limitation/consideration:
Source of transcript (manual vs auto) may affect quality
No current mechanism to force selection of specific source
Transcript source not clearly indicated in output

Future Considerations¶

Future versions should consider: - Adding transcript source metadata - Option to specify preferred source - Quality indicators in output - Logging which source was used

Consequences¶

Positive¶

Simple initial implementation
Works with all video types
Maximum transcript availability

Negative¶

Uncertain transcript source
No quality indicators
May get auto-generated when manual exists
May get manual when auto-generated preferred

Notes¶

This limitation is acceptable for prototyping but should be revisited when: - Transcript quality becomes critical - Source attribution needed - Specific use cases require specific transcript types

Would you like me to explore any specific aspect of this further, or shall we move on to implementation?