ADR-TR01: AssemblyAI Integration for Transcription Service

Adds AssemblyAI as a modular transcription backend alongside Whisper by defining a shared interface and provider factory.

Status: Proposed
Date: 2025-05-01

Context

The TNH Scholar project currently uses OpenAI's Whisper API for audio transcription and PyAnnote for speaker diarization. We want to evaluate alternative transcription providers that could improve quality or add features, and to make the architecture modular enough to support multiple transcription backends.

AssemblyAI offers a speech-to-text API with high-quality transcription, support for multiple languages, and additional capabilities, such as entity detection, that could be valuable in future development. Integrating AssemblyAI as an alternative transcription backend would add flexibility while letting us keep PyAnnote for language-independent diarization.

Decision Drivers

Need for a pluggable transcription service architecture that supports multiple backends
Desire to access AssemblyAI's transcription quality and feature set
Requirement to maintain compatibility with the existing diarization system
Support for rapid prototyping while enabling future production-quality implementations
Consistent authentication and configuration management across services

Proposed Decision

We will create a modular transcription service architecture with AssemblyAI as an additional backend option. The implementation will:

Define an abstract TranscriptionService interface that various providers can implement
Create concrete implementations for both OpenAI Whisper (current) and AssemblyAI (new)
Implement a factory pattern to select the appropriate service at runtime
Support consistent configuration and authentication across services

Design Details

Class Structure

TranscriptionService (ABC)
├── WhisperTranscriptionService
└── AssemblyAITranscriptionService

Interface Definition

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Optional, Any, BinaryIO, Union

class TranscriptionService(ABC):
    """Abstract base class defining the interface for transcription services."""

    @abstractmethod
    def transcribe(
        self,
        audio_file: Union[Path, BinaryIO],
        options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """
        Transcribe audio file to text.

        Args:
            audio_file: Path to audio file or file-like object
            options: Provider-specific options for transcription

        Returns:
            Dictionary containing transcription results with standardized keys
            and provider-specific data
        """
        pass

    @abstractmethod
    def get_result(self, job_id: str) -> Dict[str, Any]:
        """
        Get results for an existing transcription job.

        Args:
            job_id: ID of the transcription job

        Returns:
            Dictionary containing transcription results
        """
        pass
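
The docstrings above promise "standardized keys" without naming them. As a rough sketch of what a shared result shape could look like, the function below maps two provider responses onto the same dictionary. The output key names ("text", "language", "words", "provider", "raw") and the function name are assumptions made for illustration, not part of this ADR. The input field names reflect our understanding of the AssemblyAI transcript response and of Whisper's verbose_json output with word timestamps, and should be checked against the current API references.

```python
from typing import Any, Dict, List


def standardize_result(provider: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a provider response onto one shared shape (hypothetical key names)."""
    if provider == "assemblyai":
        # AssemblyAI reports word timings in milliseconds; convert to seconds.
        words: List[Dict[str, Any]] = [
            {"text": w["text"], "start": w["start"] / 1000, "end": w["end"] / 1000}
            for w in raw.get("words") or []
        ]
        language = raw.get("language_code")
    elif provider == "whisper":
        # Whisper's verbose_json output reports word timings in seconds.
        words = [
            {"text": w["word"], "start": w["start"], "end": w["end"]}
            for w in raw.get("words") or []
        ]
        language = raw.get("language")
    else:
        raise ValueError(f"Unsupported transcription provider: {provider}")

    return {
        "text": raw.get("text", ""),
        "language": language,
        "words": words,
        "provider": provider,
        # Keep the untouched response so provider-specific data stays reachable.
        "raw": raw,
    }
```

Keeping the original response under a single "raw" key is one way to satisfy both halves of the docstring: callers that only need text and timings use the shared keys, while callers that want provider-specific features can still reach them.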

Factory Implementation

class TranscriptionServiceFactory:
    """Factory for creating transcription service instances."""

    @staticmethod
    def create_service(
        provider: str = "whisper",
        api_key: Optional[str] = None,
        **kwargs
    ) -> TranscriptionService:
        """
        Create a transcription service instance.

        Args:
            provider: Service provider ("whisper" or "assemblyai")
            api_key: API key for the service
            **kwargs: Additional provider-specific configuration

        Returns:
            TranscriptionService instance
        """
        # Normalize once so that "Whisper" and "whisper" select the same backend.
        name = provider.lower()
        if name == "whisper":
            # Imported here so that only the selected backend's module is loaded.
            from .whisper_service import WhisperTranscriptionService
            return WhisperTranscriptionService(api_key=api_key, **kwargs)
        elif name == "assemblyai":
            from .assemblyai_service import AssemblyAITranscriptionService
            return AssemblyAITranscriptionService(api_key=api_key, **kwargs)
        else:
            raise ValueError(f"Unsupported transcription provider: {provider}")

Authentication Management

Authentication will be managed using environment variables with consistent naming conventions:

OPENAI_API_KEY - for OpenAI Whisper API (existing)
ASSEMBLYAI_API_KEY - for AssemblyAI API (new)

The system will read these variables from the process environment first and fall back to a .env file, and will raise a clear error message when no key is found for the selected provider.
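
The lookup described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: resolve_api_key and read_dotenv are hypothetical helpers, the precedence (explicit argument, then environment, then .env file) is an assumption, and the .env parser handles only simple KEY=VALUE lines. A library such as python-dotenv would be a more robust choice in practice.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Environment variable names taken from the list above.
ENV_VARS = {"whisper": "OPENAI_API_KEY", "assemblyai": "ASSEMBLYAI_API_KEY"}


def read_dotenv(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(
    provider: str,
    api_key: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    dotenv_path: Path = Path(".env"),
) -> str:
    """Return the key from, in order: explicit argument, environment, .env file."""
    name = provider.lower()
    if name not in ENV_VARS:
        raise ValueError(f"Unsupported transcription provider: {provider}")
    var = ENV_VARS[name]
    env = os.environ if environ is None else environ
    key = api_key or env.get(var) or read_dotenv(dotenv_path).get(var)
    if not key:
        raise RuntimeError(
            f"No API key for {provider}: set {var} in the environment or in {dotenv_path}"
        )
    return key
```

Passing the environment in as a parameter keeps the helper easy to test, and naming the missing variable in the error message addresses the "authentication failures" risk listed later in this ADR.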

Configuration Design

Configuration will be layered, with each level overriding the one before it:

Class-level defaults
Constructor parameters
Per-request options, for fine-grained control of a single call
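
Reading the three levels as ordered from lowest to highest precedence, the merge could be sketched as follows. ConfigurableService, DEFAULT_CONFIG, effective_options, and the option names are all hypothetical and exist only to illustrate the layering.

```python
from typing import Any, Dict, Optional


class ConfigurableService:
    """Hypothetical base class showing the three configuration levels."""

    # Level 1: class-level defaults (option names are illustrative only).
    DEFAULT_CONFIG: Dict[str, Any] = {"language": "en", "poll_interval": 3.0}

    def __init__(self, **config: Any) -> None:
        # Level 2: constructor parameters override the class defaults.
        self.config: Dict[str, Any] = {**self.DEFAULT_CONFIG, **config}

    def effective_options(
        self, options: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        # Level 3: per-request options override the rest for this call only.
        return {**self.config, **(options or {})}


service = ConfigurableService(language="vi")
print(service.effective_options({"poll_interval": 1.0}))
# prints {'language': 'vi', 'poll_interval': 1.0}
```

Building a new dictionary on every call means per-request options never leak into later requests, and the class-level defaults are never mutated.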

AssemblyAI Implementation

The AssemblyAI implementation will use the AssemblyAI REST API with the following components:

File upload endpoint (https://api.assemblyai.com/v2/upload)
Transcription endpoint (https://api.assemblyai.com/v2/transcript)
Polling mechanism for async job completion
Result standardization for compatibility with existing systems
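
The upload, submit, and poll sequence listed above could look roughly like the sketch below. It uses only the standard library and makes several assumptions that should be verified against the AssemblyAI API reference: that the API key is sent in the authorization header, that the upload response contains upload_url, that the job response contains id, and that the terminal status values are "completed" and "error". The function names, the injectable send parameter, and the default interval and timeout are hypothetical choices for illustration.

```python
import json
import time
import urllib.request
from pathlib import Path
from typing import Any, Callable, Dict, Optional

BASE_URL = "https://api.assemblyai.com/v2"


def http_json(
    method: str,
    url: str,
    api_key: str,
    body: Optional[bytes] = None,
    content_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Send one request with the API key in the authorization header; parse JSON."""
    headers = {"authorization": api_key}
    if content_type:
        headers["content-type"] = content_type
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def transcribe_file(
    audio_path: Path,
    api_key: str,
    send: Callable[..., Dict[str, Any]] = http_json,
    poll_interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Upload a file, create a transcription job, and poll until it finishes."""
    # 1. Upload the raw audio bytes; the response is assumed to carry upload_url.
    upload = send(
        "POST", f"{BASE_URL}/upload", api_key,
        body=audio_path.read_bytes(), content_type="application/octet-stream",
    )
    # 2. Create the transcription job for the uploaded audio.
    job = send(
        "POST", f"{BASE_URL}/transcript", api_key,
        body=json.dumps({"audio_url": upload["upload_url"]}).encode("utf-8"),
        content_type="application/json",
    )
    # 3. Poll until the job reaches a terminal status or the timeout expires.
    deadline = time.monotonic() + timeout
    while True:
        result = send("GET", f"{BASE_URL}/transcript/{job['id']}", api_key)
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job['id']} did not finish within {timeout} s")
        sleep(poll_interval)
```

Injecting send and sleep lets the polling logic be tested without network access or real delays. The raw result returned here would still need to pass through the result standardization step before reaching callers.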

Consequences

Advantages

Modularity: Clear separation of concerns with a pluggable architecture
Feature Access: Enables use of AssemblyAI's advanced features
Flexibility: Allows switching between providers without changing client code
Future-proofing: Architecture supports adding additional providers
Consistency: Standardized result format regardless of backend

Disadvantages

Complexity: Additional abstraction layer increases system complexity
Integration Effort: Requires implementing and testing a new service
Maintenance Overhead: Multiple implementations to maintain
Dependency Management: New external dependency to manage

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| API compatibility changes | Encapsulate provider-specific code in concrete implementations |
| Authentication failures | Clear error messages and validation for API keys |
| Result format inconsistencies | Standardized result mapping in each implementation |
| Performance differences | Documentation of expected behavior differences |