Skip to content

TextObject System Design Document

Detailed blueprint for the modern TextObject pipeline, outlining segmentation models, metadata, and API surfaces.

1. Overview and Purpose

The TextObject system manages the division of large texts into processable segments while maintaining contextual integrity. It serves two key purposes:

Primary Goal:

  • Enable AI processing of large texts by breaking them into manageable chunks
  • Preserve essential context across segmentation boundaries
  • Provide rich contextual information to compensate for segmentation

Secondary Goal:

  • Maintain structured metadata for human analysis and documentation
  • Support standard metadata practices (Dublin Core)
  • Enable systematic text processing workflows

2. Core Components

2.1 Response Format (API Layer)

The response format is optimized for AI interaction, emphasizing human-readable context:

class LogicalSection(BaseModel):
    """Represents a contextually meaningful segment of a larger text.

    Sections should preserve natural breaks in content (e.g., explicit section markers, topic shifts,
    argument development, narrative progression) while staying within specified size limits 
    in order to create chunks suitable for AI processing."""
    start_line: int = Field(
        ..., 
        description="Starting line number that begins this logical segment"
    )
    title: str = Field(
        ...,
        description="Descriptive title of section's key content"
    )

class TextObjectResponse(BaseModel):
    """Format for dividing large texts into AI-processable segments while
    maintaining broader document context."""
    document_summary: str = Field(
        ...,
        description="Concise, comprehensive overview of the text's content and purpose"
    )
    document_metadata: str = Field(
        ...,
        description="Available Dublin Core standard metadata in human-readable format"
    )
    key_concepts: str = Field(
        ...,
        description="Important terms, ideas, or references that appear throughout the text"
    )
    narrative_context: str = Field(
        ...,
        description="Concise overview of how the text develops or progresses as a whole"
    )
    language: str = Field(..., description="ISO 639-1 language code")
    sections: List[LogicalSection]

Key Design Points:

  • Separates document-level context into distinct conceptual units
  • Uses human-readable format for metadata and context
  • Maintains simple section structure for reliable AI processing

2.2 Internal Representation

The internal system uses a richer structure based on Dublin Core standards:

class TextMetadata(BaseModel):
    """Rich metadata container following Dublin Core standards."""

    # Core Dublin Core elements with validation
    title: str
    creator: List[str]
    subject: List[str]
    description: str
    publisher: Optional[str] = None
    contributor: List[str] = Field(default_factory=list)
    date: Optional[str] = None
    type: str
    format: str
    identifier: Optional[str] = None
    source: Optional[str] = None
    language: str

    # Additional fields
    context: str = ""
    additional_info: Dict[str, Any] = Field(default_factory=dict)

    # Custom fields can be added through additional_info

    class Config:
        """Pydantic model configuration."""
        extra = 'allow'  # Allows additional fields beyond those specified

    def to_dublin_core(self) -> Dict[str, Any]:
        """Extract Dublin Core fields as dictionary."""
        return self.model_dump(
            exclude={'context', 'additional_info'},
            exclude_none=True
        )

3. Key Design Decisions

3.1 Dual-Layer Design

  1. AI Interface Layer (TextObjectResponse)
  2. Optimized for AI processing
  3. Human-readable context
  4. Simplified structure

  5. Internal Layer (TextObject)

  6. Strict validation
  7. Structured metadata
  8. Rich processing capabilities

3.2 Metadata Approach

  1. AI Format
  2. Narrative document summary
  3. Human-readable metadata
  4. Context and key concepts separated
  5. Focus on information relevant for processing

  6. Internal Format

  7. Structured Dublin Core metadata
  8. Additional context storage
  9. Extensible design

3.3 Content Integration

Content management is delegated to NumberedText class:

  • Clean separation of concerns
  • Efficient text storage and access
  • Section-aware interface

4. Existing implementation details

4.1 Section Access

def get_section_content(self, index: int) -> str:
    """Retrieve content for specific section."""
    start = self.sections[index].start_line
    end = (self.sections[index + 1].start_line 
           if index < len(self.sections) - 1 
           else self.total_lines + 1)
    return self.content.get_segment(start, end)

4.2 Validation of TextObject

def _validate(self) -> None:
    """Validate section integrity."""
    if not self.sections:
        raise ValueError("TextObject must have at least one section")

    # Validate section ordering
    for i, section in enumerate(self.sections):
        if section.start_line < 1:
            raise ValueError(f"Section {i}: start line must be >= 1")
        if section.start_line > self.total_lines:
            raise ValueError(f"Section {i}: start line exceeds text length")
        if i > 0 and section.start_line <= self.sections[i-1].start_line:
            raise ValueError(f"Section {i}: non-sequential start line")

6. Future Considerations

  1. Performance Optimization
  2. Index sections for faster access
  3. Optimize metadata string parsing

  4. Extended Functionality

  5. Section manipulation (merge/split)
  6. Advanced metadata querying
  7. Enhanced validation rules

  8. Integration Enhancements

  9. Expanded AI context generation
  10. Bulk processing capabilities

This validation strategy ensures data integrity while providing clear feedback for both programmatic and human review of processing results.