# ADR-TR01: AssemblyAI Integration for Transcription Service
Adds AssemblyAI as a modular transcription backend alongside Whisper by defining a shared interface and provider factory.
- Status: Proposed
- Date: 2025-05-01
## Context
The TNH Scholar project currently uses OpenAI's Whisper API for audio transcription services, while using PyAnnote for speaker diarization. We wish to explore alternative transcription providers to potentially improve quality, gain additional features, and create a more modular architecture that allows for multiple transcription backends.
AssemblyAI offers a comprehensive speech-to-text API with advanced features including high-quality transcription, support for multiple languages, and additional capabilities such as entity detection that could be valuable in future development. Integrating AssemblyAI as an alternative transcription backend would provide flexibility while allowing us to continue using PyAnnote for language-independent diarization.
## Decision Drivers
- Need for a pluggable transcription service architecture that supports multiple backends
- Desire to access AssemblyAI's transcription quality and feature set
- Requirement to maintain compatibility with the existing diarization system
- Support for rapid prototyping while enabling future production-quality implementations
- Consistent authentication and configuration management across services
## Proposed Decision

We will create a modular transcription service architecture with AssemblyAI as an additional backend option. The implementation will:

- Define an abstract `TranscriptionService` interface that various providers can implement
- Create concrete implementations for both OpenAI Whisper (current) and AssemblyAI (new)
- Implement a factory pattern to select the appropriate service at runtime
- Support consistent configuration and authentication across services
## Design Details

### Class Structure

```
TranscriptionService (ABC)
├── WhisperTranscriptionService
└── AssemblyAITranscriptionService
```
### Interface Definition

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Optional, Any, BinaryIO, Union


class TranscriptionService(ABC):
    """Abstract base class defining the interface for transcription services."""

    @abstractmethod
    def transcribe(
        self,
        audio_file: Union[Path, BinaryIO],
        options: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """
        Transcribe audio file to text.

        Args:
            audio_file: Path to audio file or file-like object
            options: Provider-specific options for transcription

        Returns:
            Dictionary containing transcription results with standardized keys
            and provider-specific data
        """
        pass

    @abstractmethod
    def get_result(self, job_id: str) -> Dict[str, Any]:
        """
        Get results for an existing transcription job.

        Args:
            job_id: ID of the transcription job

        Returns:
            Dictionary containing transcription results
        """
        pass
```
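As a sanity check on the interface, the sketch below shows a minimal stub provider implementing both methods. The standardized result keys (`text`, `segments`, `provider`, `raw`) are illustrative assumptions, not a finalized schema, and the abstract base is restated in condensed form so the snippet stands alone.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, BinaryIO, Dict, Optional, Union


class TranscriptionService(ABC):
    # Condensed restatement of the interface defined above.
    @abstractmethod
    def transcribe(
        self,
        audio_file: Union[Path, BinaryIO],
        options: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]: ...

    @abstractmethod
    def get_result(self, job_id: str) -> Dict[str, Any]: ...


class StubTranscriptionService(TranscriptionService):
    """Illustrative stub returning canned results in an assumed standardized shape."""

    def transcribe(self, audio_file, options=None):
        name = audio_file.name if isinstance(audio_file, Path) else "<stream>"
        return {
            "text": f"(stub transcript of {name})",  # standardized: full transcript
            "segments": [],                          # standardized: timed segments
            "provider": "stub",                      # standardized: backend name
            "raw": None,                             # provider-specific payload
        }

    def get_result(self, job_id):
        return {"text": "", "segments": [], "provider": "stub", "raw": None}
```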
### Factory Implementation

```python
class TranscriptionServiceFactory:
    """Factory for creating transcription service instances."""

    @staticmethod
    def create_service(
        provider: str = "whisper",
        api_key: Optional[str] = None,
        **kwargs
    ) -> TranscriptionService:
        """
        Create a transcription service instance.

        Args:
            provider: Service provider ("whisper" or "assemblyai")
            api_key: API key for the service
            **kwargs: Additional provider-specific configuration

        Returns:
            TranscriptionService instance
        """
        if provider.lower() == "whisper":
            from .whisper_service import WhisperTranscriptionService
            return WhisperTranscriptionService(api_key=api_key, **kwargs)
        elif provider.lower() == "assemblyai":
            from .assemblyai_service import AssemblyAITranscriptionService
            return AssemblyAITranscriptionService(api_key=api_key, **kwargs)
        else:
            raise ValueError(f"Unsupported transcription provider: {provider}")
```
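A condensed, self-contained version of the same dispatch, using placeholder classes in place of the real implementations so the selection logic can be exercised in isolation:

```python
from typing import Optional


class WhisperTranscriptionService:      # placeholder standing in for the real class
    def __init__(self, api_key: Optional[str] = None, **kwargs):
        self.api_key = api_key


class AssemblyAITranscriptionService:   # placeholder standing in for the real class
    def __init__(self, api_key: Optional[str] = None, **kwargs):
        self.api_key = api_key


def create_service(provider: str = "whisper", api_key: Optional[str] = None, **kwargs):
    """Dispatch on the provider name, mirroring the factory above."""
    services = {
        "whisper": WhisperTranscriptionService,
        "assemblyai": AssemblyAITranscriptionService,
    }
    try:
        return services[provider.lower()](api_key=api_key, **kwargs)
    except KeyError:
        raise ValueError(f"Unsupported transcription provider: {provider}") from None
```

Client code then only calls `create_service("assemblyai", api_key=...)` and receives an object satisfying the shared interface, regardless of backend.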
### Authentication Management

Authentication will be managed using environment variables with consistent naming conventions:

- `OPENAI_API_KEY` - for OpenAI Whisper API (existing)
- `ASSEMBLYAI_API_KEY` - for AssemblyAI API (new)

The system will look for these variables directly or in a `.env` file, with appropriate fallbacks and error messages.
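One possible resolution order — process environment first, then a `.env` file — is sketched below. The helper name, the minimal `.env` parsing, and the error wording are assumptions for illustration (a library such as python-dotenv could replace the hand-rolled parsing).

```python
import os
from pathlib import Path


def resolve_api_key(var_name: str, env_file: Path = Path(".env")) -> str:
    """Look up an API key in the environment, falling back to a .env file."""
    value = os.environ.get(var_name)
    if value:
        return value
    if env_file.is_file():
        # Minimal .env parsing: KEY=VALUE lines, '#' comments ignored.
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, val = line.partition("=")
                if key.strip() == var_name:
                    return val.strip().strip('"').strip("'")
    raise EnvironmentError(
        f"{var_name} not set; export it or add it to {env_file}."
    )
```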
### Configuration Design
Configuration will be managed through a hierarchical approach:
- Default configurations defined at the class level
- Configuration via constructor parameters
- Per-request options for fine-grained control
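The three layers can be combined with a straightforward dict merge, later layers overriding earlier ones. The option names below (`language_code`, `punctuate`) are hypothetical placeholders, not a committed option set:

```python
from typing import Any, Dict, Optional


class AssemblyAITranscriptionService:
    # Layer 1: class-level defaults (hypothetical option names).
    DEFAULT_OPTIONS: Dict[str, Any] = {"language_code": "en", "punctuate": True}

    def __init__(self, api_key: Optional[str] = None, **config: Any):
        self.api_key = api_key
        # Layer 2: constructor configuration overrides the defaults.
        self._config = {**self.DEFAULT_OPTIONS, **config}

    def effective_options(
        self, options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        # Layer 3: per-request options take highest precedence.
        return {**self._config, **(options or {})}
```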
### AssemblyAI Implementation

The AssemblyAI implementation will use the AssemblyAI REST API with the following components:

- File upload endpoint (`https://api.assemblyai.com/v2/upload`)
- Transcription endpoint (`https://api.assemblyai.com/v2/transcript`)
- Polling mechanism for async job completion
- Result standardization for compatibility with existing systems
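The asynchronous flow (upload, create transcript job, poll until a terminal status) hinges on the polling step. A transport-agnostic polling helper is sketched below: the status fetcher is injected so the loop can be tested without network access. In practice the fetcher would GET `https://api.assemblyai.com/v2/transcript/{id}`; AssemblyAI's terminal statuses are `completed` and `error`.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 3.0,
    timeout: float = 600.0,
) -> Dict[str, Any]:
    """Poll a transcript job until it reaches a terminal status.

    `fetch_status` returns the decoded JSON for the job; it is injected
    here so the loop is transport-agnostic and easy to test.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":
            raise RuntimeError(f"Transcription failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Transcription job did not finish in time.")
        time.sleep(interval)
```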
## Consequences

### Advantages
- Modularity: Clear separation of concerns with a pluggable architecture
- Feature Access: Enables use of AssemblyAI's advanced features
- Flexibility: Allows switching between providers without changing client code
- Future-proofing: Architecture supports adding additional providers
- Consistency: Standardized result format regardless of backend
### Disadvantages
- Complexity: Additional abstraction layer increases system complexity
- Integration Effort: Requires implementing and testing a new service
- Maintenance Overhead: Multiple implementations to maintain
- Dependency Management: New external dependency to manage
## Risks and Mitigations
| Risk | Mitigation |
|---|---|
| API compatibility changes | Encapsulate provider-specific code in concrete implementations |
| Authentication failures | Clear error messages and validation for API keys |
| Result format inconsistencies | Standardized result mapping in each implementation |
| Performance differences | Documentation of expected behavior differences |
## Implementation Plan

### Phase 1: Core Interface (Week 1)

- Define the `TranscriptionService` abstract base class
- Create a simple adapter for the existing Whisper implementation
- Implement basic factory pattern
### Phase 2: AssemblyAI Integration (Week 2)

- Implement `AssemblyAITranscriptionService` for basic transcription
- Add authentication and configuration management
- Implement result standardization
### Phase 3: Feature Parity and Testing (Week 3)
- Ensure feature parity with the current Whisper implementation
- Add comprehensive testing
- Update documentation
### Phase 4: Integration with Existing Systems (Week 4)

- Modify `audio_transcribe.py` to use the new abstraction
- Add configuration options for selecting the transcription provider
- Validate integration with the diarization system
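Provider selection in `audio_transcribe.py` could surface as a command-line option; the flag name below is a hypothetical illustration, not a committed interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Transcribe an audio file.")
    parser.add_argument("audio", help="Path to the audio file")
    parser.add_argument(
        "--provider",
        choices=["whisper", "assemblyai"],
        default="whisper",
        help="Transcription backend to use (hypothetical flag)",
    )
    return parser
```

The parsed value would then be handed to `TranscriptionServiceFactory.create_service(args.provider, ...)`, keeping the rest of the pipeline unchanged.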