TextObject System Design Document¶
Detailed blueprint for the modern TextObject pipeline, outlining segmentation models, metadata, and API surfaces.
1. Overview and Purpose¶
The TextObject system manages the division of large texts into processable segments while maintaining contextual integrity. It serves two key purposes:
Primary Goal:
- Enable AI processing of large texts by breaking them into manageable chunks
- Preserve essential context across segmentation boundaries
- Provide rich contextual information to compensate for segmentation
Secondary Goal:
- Maintain structured metadata for human analysis and documentation
- Support standard metadata practices (Dublin Core)
- Enable systematic text processing workflows
2. Core Components¶
2.1 Response Format (API Layer)¶
The response format is optimized for AI interaction, emphasizing human-readable context:
class LogicalSection(BaseModel):
"""Represents a contextually meaningful segment of a larger text.
Sections should preserve natural breaks in content (e.g., explicit section markers, topic shifts,
argument development, narrative progression) while staying within specified size limits
in order to create chunks suitable for AI processing."""
start_line: int = Field(
...,
description="Starting line number that begins this logical segment"
)
title: str = Field(
...,
description="Descriptive title of section's key content"
)
class TextObjectResponse(BaseModel):
"""Format for dividing large texts into AI-processable segments while
maintaining broader document context."""
document_summary: str = Field(
...,
description="Concise, comprehensive overview of the text's content and purpose"
)
document_metadata: str = Field(
...,
description="Available Dublin Core standard metadata in human-readable format"
)
key_concepts: str = Field(
...,
description="Important terms, ideas, or references that appear throughout the text"
)
narrative_context: str = Field(
...,
description="Concise overview of how the text develops or progresses as a whole"
)
language: str = Field(..., description="ISO 639-1 language code")
sections: List[LogicalSection]
Key Design Points:
- Separates document-level context into distinct conceptual units
- Uses human-readable format for metadata and context
- Maintains simple section structure for reliable AI processing
2.2 Internal Representation¶
The internal system uses a richer structure based on Dublin Core standards:
class TextMetadata(BaseModel):
"""Rich metadata container following Dublin Core standards."""
# Core Dublin Core elements with validation
title: str
creator: List[str]
subject: List[str]
description: str
publisher: Optional[str] = None
contributor: List[str] = Field(default_factory=list)
date: Optional[str] = None
type: str
format: str
identifier: Optional[str] = None
source: Optional[str] = None
language: str
# Additional fields
context: str = ""
additional_info: Dict[str, Any] = Field(default_factory=dict)
# Custom fields can be added through additional_info
class Config:
"""Pydantic model configuration."""
extra = 'allow' # Allows additional fields beyond those specified
def to_dublin_core(self) -> Dict[str, Any]:
"""Extract Dublin Core fields as dictionary."""
return self.model_dump(
exclude={'context', 'additional_info'},
exclude_none=True
)
3. Key Design Decisions¶
3.1 Dual-Layer Design¶
- AI Interface Layer (TextObjectResponse)
- Optimized for AI processing
- Human-readable context
-
Simplified structure
-
Internal Layer (TextObject)
- Strict validation
- Structured metadata
- Rich processing capabilities
3.2 Metadata Approach¶
- AI Format
- Narrative document summary
- Human-readable metadata
- Context and key concepts separated
-
Focus on information relevant for processing
-
Internal Format
- Structured Dublin Core metadata
- Additional context storage
- Extensible design
3.3 Content Integration¶
Content management is delegated to NumberedText class:
- Clean separation of concerns
- Efficient text storage and access
- Section-aware interface
4. Existing implementation details¶
4.1 Section Access¶
def get_section_content(self, index: int) -> str:
"""Retrieve content for specific section."""
start = self.sections[index].start_line
end = (self.sections[index + 1].start_line
if index < len(self.sections) - 1
else self.total_lines + 1)
return self.content.get_segment(start, end)
4.2 Validation of TextObject¶
def _validate(self) -> None:
"""Validate section integrity."""
if not self.sections:
raise ValueError("TextObject must have at least one section")
# Validate section ordering
for i, section in enumerate(self.sections):
if section.start_line < 1:
raise ValueError(f"Section {i}: start line must be >= 1")
if section.start_line > self.total_lines:
raise ValueError(f"Section {i}: start line exceeds text length")
if i > 0 and section.start_line <= self.sections[i-1].start_line:
raise ValueError(f"Section {i}: non-sequential start line")
6. Future Considerations¶
- Performance Optimization
- Index sections for faster access
-
Optimize metadata string parsing
-
Extended Functionality
- Section manipulation (merge/split)
- Advanced metadata querying
-
Enhanced validation rules
-
Integration Enhancements
- Expanded AI context generation
- Bulk processing capabilities
This validation strategy ensures data integrity while providing clear feedback for both programmatic and human review of processing results.