Skip to content

TNH Scholar Design Principles

Architectural patterns, design philosophy, and system organization principles for TNH Scholar development.

Overview

This document establishes design principles for the TNH Scholar project. While the project is currently in a rapid prototyping phase, these guidelines aim to maintain architectural quality and consistency throughout development. The guide distinguishes between immediate prototyping requirements and standards for later production phases where appropriate.

For code formatting and naming conventions, see Style Guide. For high-level project philosophy and vision, see Project Principles and Conceptual Architecture.

Core Design Philosophy

The TNH Scholar system embraces several key philosophical principles:

  • Evolutionary improvement through self-generated training data
  • Modular design enabling flexible pipeline construction
  • Balance of rapid prototyping with extensible architecture
  • Focus on AI-enhanced content processing and transformation

See Project Philosophy for deeper context.

Fundamental Principles

Single Responsibility

Code should adhere to the principle of single responsibility, with functions and classes focused on one clear task:

  • Functions perform one logical operation
  • Classes manage related state or behavior
  • Modules encapsulate coherent functionality
  • Services handle one domain concern

Composition Over Inheritance

Favor composition over inheritance to build complex behavior from simple, reusable components:

# βœ… Preferred: Composition
class TextPipeline:
    def __init__(
        self,
        punctuator: PunctuationService,
        translator: TranslationService,
        sectioner: SectioningService
    ):
        self.punctuator = punctuator
        self.translator = translator
        self.sectioner = sectioner

# 🚫 Avoid: Deep inheritance hierarchies
class TranslatingPunctuatingSectioningProcessor(
    PunctuationProcessor,
    TranslationProcessor,
    SectioningProcessor
):
    pass

Separation of Concerns

Validation logic should be distinct from mutation or side effects to ensure maintainability:

# βœ… Preferred: Separate concerns
def validate_text_input(text: str) -> None:
    """Validate input only."""
    if not text or not text.strip():
        raise ValueError("Text cannot be empty")

def process_text(text: str) -> ProcessedText:
    """Process after validation."""
    validate_text_input(text)
    return apply_processing(text)

# 🚫 Avoid: Mixed concerns
def process_text(text: str) -> ProcessedText:
    """Validation, mutation, and I/O all mixed."""
    if not text:
        raise ValueError("Empty")
    result = apply_processing(text)
    save_to_database(result)  # Side effect mixed with processing
    return result

Modularity Principles

Design for Modularity

Each module, class, and function should have a single, well-defined responsibility.

Guidelines:

  • Encapsulate related functionality: Group related functions and classes into modules and packages to promote reuse and clarity
  • Minimize coupling: Modules should interact through well-defined interfaces, minimizing dependencies and side effects
  • Limit module size: Aim for modules that are small enough to be easily understood (generally < 300 lines), but large enough to encapsulate a coherent set of functionality
  • Explicit module exports: Use __all__ to define public API of modules where appropriate

Helpers Should Be Small and Composable

Helpers should be small (target 10 lines) and composable, enabling reuse and clarity:

# βœ… Preferred: Small, composable helpers
def extract_language_code(text_object: TextObject) -> str:
    """Extract language code from text object."""
    return text_object.metadata.get("language", "en")

def validate_language_code(code: str) -> None:
    """Validate ISO 639-1 language code."""
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"Invalid language code: {code}")

def get_validated_language(text_object: TextObject) -> str:
    """Get and validate language code."""
    code = extract_language_code(text_object)
    validate_language_code(code)
    return code

# 🚫 Avoid: Large, monolithic helpers
def process_and_validate_language_with_fallback_and_logging(
    text_object: TextObject,
    default: str = "en"
) -> str:
    """Do everything in one place (50+ lines)."""
    # ... lots of mixed logic

Stateful Iteration Loops Use Classes

When iteration logic requires complex state management, use a dedicated class to encapsulate state and dispatch to methods:

# βœ… Preferred: Stateful validation as a class
class _SectionBoundaryValidator:
    """Stateful validator for clean, readable loop logic."""

    def __init__(self, owner: "NumberedText", section_start_lines: List[int]) -> None:
        self.owner = owner
        self.section_start_lines = section_start_lines
        self.errors: List[SectionValidationError] = []
        self.prev_start: Optional[int] = None
        self.first_valid_seen = False

    def run(self) -> List[SectionValidationError]:
        """Main loop stays simple by dispatching to focused methods."""
        if not self.section_start_lines:
            return self.owner._errors_for_no_sections()

        sorted_with_idx = sorted(enumerate(self.section_start_lines), key=lambda t: t[1])

        for section_index, (input_idx, start_line) in enumerate(sorted_with_idx):
            if self.owner._is_out_of_bounds(start_line):
                self.errors.append(
                    self.owner._error_out_of_bounds(section_index, input_idx, start_line)
                )
                continue

            if not self.first_valid_seen:
                self._handle_first(section_index, input_idx, start_line)
                continue

            self._handle_body(section_index, input_idx, start_line)

        return self.errors

    def _handle_first(self, section_index: int, input_idx: int, start_line: int) -> None:
        """Handle first valid section."""
        self.errors.extend(
            self.owner._errors_for_first_section(section_index, input_idx, start_line)
        )
        self.first_valid_seen = True
        self.prev_start = start_line

    def _handle_body(self, section_index: int, input_idx: int, start_line: int) -> None:
        """Handle subsequent sections."""
        assert self.prev_start is not None
        if start_line <= self.prev_start:
            self.errors.append(
                self.owner._error_overlap(section_index, input_idx, self.prev_start, start_line)
            )
        elif start_line > self.prev_start + 1:
            self.errors.append(
                self.owner._error_gap(section_index, input_idx, self.prev_start, start_line)
            )
        self.prev_start = start_line

# Public API stays clean:
def validate_section_boundaries(
    self, section_start_lines: List[int]
) -> List[SectionValidationError]:
    """Validate section boundaries for gaps, overlaps, and out-of-bounds errors."""
    return self._SectionBoundaryValidator(self, section_start_lines).run()

# 🚫 Avoid: Complex state in procedural loop
def validate_section_boundaries(
    self, section_start_lines: List[int]
) -> List[SectionValidationError]:
    """Procedural loop with hard-to-follow state management."""
    errors = []
    prev_start = None
    first_valid_seen = False

    # 50+ lines of nested conditionals, state mutations, and edge cases
    # Hard to test individual logic branches
    # Difficult to understand control flow
    for section_index, (input_idx, start_line) in enumerate(sorted_with_idx):
        if out_of_bounds:
            # complex logic here
            pass
        elif not first_valid_seen:
            # complex logic here
            first_valid_seen = True
            prev_start = start_line
        else:
            # more complex logic here
            prev_start = start_line

    return errors

Benefits:

  • Readability: Main loop logic is clean and straightforward
  • Testability: Each method can be tested independently
  • Maintainability: State is explicit and localized
  • Debuggability: Clear method boundaries for breakpoints and logging

When to use:

  • Validation loops with complex state tracking
  • Multi-pass parsers with lookahead/lookbehind
  • State machines with transitions
  • Iterative algorithms that accumulate results based on conditions

Implementation notes:

  • Use private nested classes (_ValidatorClass) for internal-only validators
  • Keep the main loop in a run() method for clear entry point
  • Dispatch to focused helper methods (_handle_first, _handle_body, etc.)
  • Initialize all state in __init__ for explicitness

Interface Design

Abstract Base Classes and Protocols

All system interfaces must be defined via abstract base classes:

  • Use Protocol for structural typing and interface contracts (no inheritance required)
  • Use ABC only when enforcing init-time invariants or providing shared mixin behavior

Example with Protocol:

from typing import Protocol

class PromptProvider(Protocol):
    """Protocol for prompt providers."""

    def get_prompt(self, name: str) -> Prompt:
        """Retrieve prompt by name."""
        ...

    def list_prompts(self) -> list[str]:
        """List available prompt names."""
        ...

Example with ABC:

from abc import ABC, abstractmethod

class BaseProcessor(ABC):
    """Base processor with shared initialization."""

    def __init__(self, config: ProcessorConfig):
        self.config = config
        self._validate_config()

    @abstractmethod
    def _validate_config(self) -> None:
        """Validate configuration at init time."""
        pass

    @abstractmethod
    def process(self, input_data: Any) -> Any:
        """Process input data."""
        pass

Dependency Injection

Prefer dependency injection over global state:

# βœ… Preferred: Dependency injection
class TextProcessor:
    def __init__(
        self,
        prompt_catalog: PromptCatalog,
        ai_service: GenAIService
    ):
        self.prompt_catalog = prompt_catalog
        self.ai_service = ai_service

# 🚫 Avoid: Global singleton access
class TextProcessor:
    def process(self, text: str) -> str:
        prompt = LocalPromptManager().get_prompt("default")  # Global access
        return global_ai_service.process(text, prompt)  # Global access

Note: During prototyping, singleton access (like LocalPromptManager) is acceptable for rapid development. Plan transition to dependency injection for production (see ADR-PT01 in Historical References below).

Data Architecture

Immutability by Default

Keep data models immutable when possible for safer concurrent code:

from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessingResult:
    """Immutable processing result."""
    content: str
    metadata: Dict[str, Any]
    timestamp: datetime

Strong Type Boundaries

Critical: No literals or untyped structures in application logic:

  • Configuration values come from Settings (pydantic BaseSettings)
  • Dicts are not used in app layers; prefer Pydantic models or dataclasses
  • Enums replace string literals for identifiers
  • Adapters handle dict conversions only at API transport boundaries

See Style Guide: Strong Typing Standards for details.

Choosing Between Pydantic Models and Plain Python Classes

Default to Pydantic v2 BaseModel for domain models, but use plain Python classes when Pydantic introduces significant friction.

When to Use Pydantic BaseModel

Pydantic excels at:

  • Data Transfer Objects (DTOs): Serialization/deserialization for APIs, JSON persistence, configuration
  • Simple domain models: Data containers with straightforward validation
  • Transport layer models: API requests/responses, file format schemas
  • Configuration models: Settings with validation and environment variable parsing

Example - Ideal Pydantic usage:

from pydantic import BaseModel, Field, ConfigDict

class TextObjectInfo(BaseModel):
    """Serializable DTO for TextObject persistence."""
    model_config = ConfigDict(frozen=True)

    source_file: Path | None = None
    language: str
    sections: list[SectionObject]
    metadata: Metadata

    # Clean serialization:
    # info.model_dump_json() β†’ JSON file
    # TextObjectInfo.model_validate_json(json_str) β†’ instance

When NOT to Use Pydantic BaseModel

Use plain Python classes when the domain model has:

  1. Custom iteration semantics that conflict with Pydantic's __iter__
  2. Pydantic's BaseModel.__iter__ yields (field_name, value) for dict-like iteration
  3. Domain-specific iteration (e.g., iterating over sections, lines, entries) requires # type: ignore[override]
  4. This signals a semantic mismatch

  5. Core dependencies that are non-Pydantic types

  6. When using arbitrary_types_allowed=True for most fields
  7. Defeats Pydantic's validation benefits
  8. Example: Model wrapping NumberedText (plain class) and Metadata (MutableMapping)

  9. Complex initialization requiring multiple validation hooks

  10. Needing both @model_validator(mode="before") AND model_post_init
  11. Suggests Pydantic's lifecycle doesn't match domain needs
  12. Plain class with explicit __init__ is clearer

  13. Rich domain behavior beyond data validation

  14. Complex stateful operations
  15. Multiple internal state transitions
  16. Custom protocols that don't align with Pydantic's model lifecycle

Example - When plain class is better:

# βœ… Preferred: Plain Python class for rich domain model
class TextObject:
    """Rich domain model with complex behavior (NOT a Pydantic model)."""

    def __init__(
        self,
        num_text: NumberedText,  # Non-Pydantic type
        language: str | None = None,
        sections: list[SectionObject] | None = None,
        metadata: Metadata | None = None,  # MutableMapping, not Pydantic
    ):
        self.num_text = num_text
        self.language = language or get_language_code_from_text(num_text.content)
        self.sections = sections or []
        self.metadata = metadata or Metadata()

        if self.sections:
            self.validate_sections()  # Complex validation

    def __iter__(self) -> Iterator[SectionEntry]:
        """Domain-specific iteration - clean, no type: ignore needed."""
        for i, section in enumerate(self.sections):
            content = self.num_text.get_segment(section.section_range.start, section.section_range.end)
            yield SectionEntry(number=i + 1, title=section.title, content=content)

    def export_info(self) -> TextObjectInfo:
        """Convert to Pydantic DTO for serialization."""
        return TextObjectInfo(
            source_file=self.source_file,
            language=self.language,
            sections=self.sections,
            metadata=self.metadata
        )

    @classmethod
    def from_info(cls, info: TextObjectInfo, num_text: NumberedText) -> "TextObject":
        """Create from Pydantic DTO."""
        return cls(
            num_text=num_text,
            language=info.language,
            sections=info.sections,
            metadata=info.metadata
        )

# 🚫 Avoid: Forcing Pydantic when it doesn't fit
class TextObject(BaseModel):
    """Pydantic model with significant friction."""
    model_config = ConfigDict(arbitrary_types_allowed=True)  # Red flag

    num_text: NumberedText  # Requires arbitrary_types_allowed
    metadata: Metadata  # Requires coercion in @model_validator + model_post_init

    @model_validator(mode="before")
    @classmethod
    def _coerce_metadata(cls, data: Any) -> Any:
        """Complex pre-validation coercion needed."""
        # ... complex logic

    def model_post_init(self, __context: Any) -> None:
        """More complex post-init coercion needed."""
        # ... more complex logic

    def __iter__(self) -> Iterator[SectionEntry]:  # type: ignore[override]
        """Domain iteration conflicts with Pydantic's __iter__."""
        # ... requires type: ignore

Hybrid Pattern: Separate Domain Model and DTO

Best of both worlds - use plain class for domain logic, Pydantic for serialization:

# Domain model: Plain Python class
class TextObject:
    """Rich domain model with complex behavior."""

    def __init__(self, num_text: NumberedText, ...):
        # ... domain logic

    def __iter__(self) -> Iterator[SectionEntry]:
        # Clean domain-specific iteration
        ...

    def export_info(self) -> TextObjectInfo:
        """Convert to DTO for persistence."""
        return TextObjectInfo(...)

# DTO: Pydantic model for serialization
class TextObjectInfo(BaseModel):
    """Serializable snapshot of TextObject state."""
    source_file: Path | None
    language: str
    sections: list[SectionObject]
    metadata: Metadata

Benefits:

  • Separation of concerns: Domain logic separate from serialization
  • Type safety: Pydantic validates DTOs, domain model handles business logic
  • Clean APIs: Each class has a clear, focused purpose
  • No friction: No # type: ignore, no arbitrary_types_allowed, no dual validators

Decision Checklist

Use Pydantic BaseModel if:

  • βœ… Primary purpose is data validation or serialization
  • βœ… Simple field-based model without complex initialization
  • βœ… All fields are Pydantic-compatible types
  • βœ… No custom iteration semantics needed

Use plain Python class if:

  • βœ… Model has rich domain behavior beyond data validation
  • βœ… Custom __iter__ conflicts with Pydantic's field iteration
  • βœ… Core fields require arbitrary_types_allowed
  • βœ… Initialization needs complex multi-step validation
  • βœ… Model wraps non-Pydantic core types (NumberedText, custom collections)

Consider hybrid pattern if:

  • βœ… Domain model is complex BUT needs serialization
  • βœ… Want clean separation between domain logic and persistence
  • βœ… Multiple serialization formats needed (JSON, YAML, database)

See also:

  • ADR-AT03.3 Addendum 2025-12-14: Real-world example of reverting Pydantic adoption after identifying friction
  • Examples in codebase:
  • Plain classes: src/tnh_scholar/text_processing/numbered_text.py, src/tnh_scholar/metadata/metadata.py, TextObject (after ADR-AT03.3 revision)
  • Pydantic DTOs: src/tnh_scholar/ai_text_processing/text_object.py, src/tnh_scholar/ai_text_processing/text_object.py
  • Pydantic domain models: src/tnh_scholar/ai_text_processing/text_object.py, src/tnh_scholar/gen_ai_service/models/domain.py

Separation of Data and Logic

Separate data representation from business logic where appropriate:

# βœ… Preferred: Data models separate from services
@dataclass
class PromptMetadata:
    """Data model."""
    name: str
    version: str
    variables: list[str]

class PromptService:
    """Business logic."""
    def render_prompt(self, metadata: PromptMetadata, context: Dict) -> str:
        ...

# 🚫 Avoid: Mixed data and complex logic
class Prompt:
    def __init__(self, name: str):
        self.name = name
        # ... lots of fields

    def render_with_ai_fallback_and_caching(self, context: Dict) -> str:
        # ... 100 lines of business logic in data class

Error Handling Philosophy

Explicit Over Implicit

Handle warnings and errors thoughtfully, distinguishing recoverable conditions from critical failures:

  • Use specific exception types (not Exception)
  • Let unknown exceptions propagate
  • Document expected exceptions in docstrings
  • Use custom exceptions for domain-specific errors

Error Classes: Exceptions vs Error Metadata

TNH Scholar uses two distinct patterns for error handling:

1. Exception Classes - For errors that interrupt control flow:

# βœ… Exception classes inherit from TnhScholarError
from tnh_scholar.exceptions import TnhScholarError

class ValidationError(TnhScholarError):
    """Raised when input validation fails."""
    pass

class ConfigurationError(TnhScholarError):
    """Raised when configuration is invalid."""
    pass

2. Error Metadata Classes - For structured error information returned as data:

# βœ… Error metadata classes use Pydantic BaseModel
from pydantic import BaseModel, ConfigDict

class ErrorInfo(BaseModel):
    """Error information returned from validation or processing.

    Not raised as exception - returned as structured data for
    programmatic handling, serialization, and reporting.
    """
    model_config = ConfigDict(frozen=True, extra="forbid")

    error_type: str
    message: str
    context: dict[str, Any] | None = None

# βœ… Example: Validation returns error metadata
def validate_sections(sections: List[Section]) -> List[SectionValidationError]:
    """Returns list of errors found (does not raise exception)."""
    errors = []
    # ... validation logic that appends SectionValidationError instances
    return errors

When to use each pattern:

Pattern Use Case Example
Exception class (inherits TnhScholarError) Fatal errors requiring immediate control flow change raise ConfigurationError("Missing API key")
Error metadata class (Pydantic BaseModel) Validation results, batch error reporting, API responses return [ErrorInfo(type="gap", message="...")]

Key requirements for error metadata classes:

from pydantic import BaseModel, ConfigDict

class MyErrorInfo(BaseModel):
    """Always include these for error metadata classes."""
    model_config = ConfigDict(
        frozen=True,      # Immutability
        extra="forbid"    # Reject unexpected fields
    )

    # Strongly typed fields (no raw dicts/strings)
    error_type: str
    message: str

Benefits:

  • Serialization: Pydantic models serialize to JSON for APIs and logging
  • Validation: Type checking and field validation at construction
  • Immutability: frozen=True prevents accidental mutation
  • Strict schema: extra="forbid" catches schema mismatches

Examples in the codebase:

  • Exception classes: src/tnh_scholar/exceptions.py
  • Error metadata: src/tnh_scholar/gen_ai_service/models/transport.py (ErrorInfo)
  • Error metadata: src/tnh_scholar/text_processing/numbered_text.py (SectionValidationError)

Fail Fast

During prototyping, fail fast to identify issues early:

  • Don't mask exceptions with catch-all handlers
  • Let stack traces propagate for debugging
  • Add TODO comments for future error handling

See Style Guide: Error Handling for implementation details.

Processing Architecture

Pipeline Design

Pipelines compose processors into workflows:

  • Each processor has a single, well-defined transformation
  • Processors are independent and reusable
  • Pipeline orchestration is separate from processing logic
  • Results flow through immutable data structures

Example:

class ProcessingPipeline:
    """Orchestrates content processing workflow."""

    def __init__(
        self,
        processors: list[ContentProcessor],
        training_collector: Optional[TrainingCollector] = None
    ):
        self.processors = processors
        self.collector = training_collector

    def execute(
        self,
        source: ContentSource
    ) -> tuple[ProcessedContent, Optional[TrainingData]]:
        """Execute pipeline with optional training data collection."""
        content = source.load()

        for processor in self.processors:
            content = processor.process(content)

            if self.collector:
                self.collector.collect(processor, content)

        training_data = self.collector.finalize() if self.collector else None
        return content, training_data

Dispatch Patterns

Employ dispatch patterns (prefer match/case when possible) to cleanly separate concerns and improve extensibility:

# βœ… Preferred: Pattern matching for dispatch
def route_processing(task: ProcessingTask) -> ProcessedContent:
    match task.type:
        case TaskType.PUNCTUATE:
            return punctuate_processor.process(task.content)
        case TaskType.TRANSLATE:
            return translation_processor.process(task.content)
        case TaskType.SECTION:
            return sectioning_processor.process(task.content)
        case _:
            raise ValueError(f"Unknown task type: {task.type}")

Text Processing Principles

Explicit Text Handling

Text handling should be explicit and consistent, favoring clarity in encoding and processing:

  • Always specify encoding (UTF-8 default)
  • Handle normalization explicitly (NFC, NFKC, etc.)
  • Document text format assumptions
  • Preserve provenance and metadata

Metadata Preservation

Throughout processing pipelines:

  • Maintain document structure
  • Preserve metadata across transformations
  • Track provenance and processing history
  • Support metadata enrichment

Testing Architecture

Test Organization

Tests follow this structure even during prototyping:

tests/
β”œβ”€β”€ unit/              # Fast, isolated tests
β”‚   β”œβ”€β”€ test_text_processing.py
β”‚   └── test_prompt_catalog.py
β”œβ”€β”€ integration/       # Multi-component tests
β”‚   └── test_full_pipeline.py
└── conftest.py        # Shared fixtures

Test Requirements by Phase

Prototyping Phase:

  • Basic unit tests for core functionality
  • Critical path testing
  • Basic integration tests

Production Phase:

  • Comprehensive unit test coverage (>80%)
  • Full integration test suite
  • Performance testing
  • Edge case handling
  • Mock external services

Performance Principles

Resource Management

Basic guidelines apply across phases:

Memory Management:

  • Stream large files (don't load entirely into memory)
  • Clean up temporary files
  • Monitor memory usage in processing pipelines

Processing Optimization:

  • Batch operations where possible
  • Cache frequently used data (prompts, configurations)
  • Monitor API usage and costs
  • Use async/await for I/O-bound operations

Lazy Evaluation

Defer expensive computations until needed:

# βœ… Preferred: Lazy evaluation
class PromptCatalog:
    def __init__(self, directory: Path):
        self.directory = directory
        self._cache: Dict[str, Prompt] = {}

    def get_prompt(self, name: str) -> Prompt:
        """Load prompt on demand."""
        if name not in self._cache:
            self._cache[name] = self._load_prompt(name)
        return self._cache[name]

# 🚫 Avoid: Eager loading of everything
class PromptCatalog:
    def __init__(self, directory: Path):
        self.prompts = {
            p.name: p for p in self._load_all_prompts(directory)
        }  # Loads hundreds of prompts at startup

Refactoring Triggers

Refactor when:

  • Code becomes difficult to follow
  • Duplication arises (DRY principle)
  • New requirements suggest clearer abstractions
  • Function/module exceeds complexity limits
  • Tests become difficult to write or maintain

Development Phase Considerations

Prototyping Phase Priorities

During prototyping, prioritize:

  • Rapid iteration and experimentation
  • Core functionality over comprehensive error handling
  • Simple pipeline construction
  • Clear component boundaries
  • Basic testing and documentation

Production Phase Requirements

For production, add:

  • Comprehensive error handling
  • Performance optimization
  • Security hardening
  • Full test coverage
  • Complete documentation
  • Monitoring and observability

Future Architecture Considerations

Areas marked for future development:

  • Plugin system architecture
  • Enhanced configuration management
  • Rebuild of ai_text_processing suite with modern patterns
  • Extended API integration (batch processing, alternate model services)
  • Enhanced security features
  • Performance optimization and async processing
  • Extended prompt capabilities
  • Model training and fine-tuning tools

See Future Directions for long-term vision.

Historical References

πŸ“š View superseded design documents (maintainers/contributors) **Note**: These documents are archived and excluded from the published documentation. They provide historical context for the prompt system architecture and terminology migration. - **[ADR-PT01: Pattern Access Strategy](/architecture/prompt-system/archive/adr/adr-pt01-pattern-access-strategy.md)** (2024) *Status*: Superseded by prompt system ADR series (ADR-PT03/ADR-PT04) - **[Core Pattern Architecture](/architecture/prompt-system/archive/core-pattern-architecture.md)** (2024) *Status*: Historical prompt/pattern architecture exploration

References