ADR-AT03.3: TextObject Robustness and Metadata Management¶
Fixes critical metadata propagation bugs, enhances section validation using NumberedText capabilities (ADR-AT03.2), and adds explicit merge strategies to TextObject for reliable text processing workflows.
- Status: Accepted
- Type: Design Detail
- Date: 2025-12-12
- Owner: aaronksolomon
- Author: Aaron Solomon, Claude Sonnet 4.5
- Parent ADR: ADR-AT03: Minimal AI Text Processing Refactor for tnh-gen
- Related ADRs: ADR-AT03.2: NumberedText Validation, ADR-AT03.1: Transition Plan
Implementation Fix Checklist (2025-12-14 update)¶
All core implementation items have been completed. The ADR design is now fully implemented in the codebase.
1. MetadataConflictError Definition¶
- Status: ✅ COMPLETED - Added in `src/tnh_scholar/exceptions.py:53-54`
- Action: Use this for the `FAIL_ON_CONFLICT` merge strategy to surface metadata key collisions.
- Implementation: `src/tnh_scholar/ai_text_processing/text_object.py:390-396`
2. _deep_merge_metadata Must Preserve the Metadata Wrapper¶
- Status: ✅ COMPLETED - Implemented in `src/tnh_scholar/ai_text_processing/text_object.py:383-388`
- Problem: Reassigning a raw dict to `self.metadata` would drop `Metadata` methods (`to_yaml`, `add_process_info`, etc.).
- Solution: Correctly accesses `self.metadata._data` and `new_metadata._data`, then reassigns the merged dict to `self.metadata._data`.
3. List Merge with Unhashable Elements¶
- Status: ✅ COMPLETED - Implemented in `src/tnh_scholar/ai_text_processing/text_object.py:430-431`
- Problem: Deduping with `dict.fromkeys(result[key] + value)` fails on unhashable items (e.g., provenance dicts).
- Solution: Uses a simple append (`result[key] = result[key] + value`) without deduplication.
4. Validation Mode Default¶
- Status: ✅ COMPLETED - Implemented in `src/tnh_scholar/ai_text_processing/text_object.py:458-475`
- Problem: The existing `validate_sections()` only warns; the design requires fail-fast by default.
- Solution: Defaults to `raise_on_error=True` with opt-out via `raise_on_error=False`.
5. Section End-Line Strategy¶
- Status: ✅ COMPLETED - Documented in code comments
- Problem: Validation checks only start lines; callers might assume custom end bounds are honored.
- Solution: Start lines are authoritative; `_build_section_objects` derives end lines (each section ends where the next begins).
6. SectionBoundaryError Exception Hierarchy¶
- Status: ✅ COMPLETED - Implemented in `src/tnh_scholar/ai_text_processing/text_object.py:167-215`
- Problem: Must inherit from `ValidationError` per TNH Scholar exception standards.
- Solution: `SectionBoundaryError(ValidationError)` with structured context including `errors` and `coverage_report`.
7. Pydantic v2 BaseModel Conversion¶
- Status: ✅ COMPLETED - See Addendum 2025-12-14 below
- Problem: TextObject was a plain Python class, violating the ADR-OS01 §1.1 "Strong Typing" requirement.
- Solution: Converted to Pydantic v2 BaseModel with `ConfigDict(arbitrary_types_allowed=True)` and `@model_validator`.
Open Questions (Discovered During Implementation)¶
- Provenance growth bounds: `TextObject.merge_metadata` appends provenance entries without deduplication or size limits. For this interim tnh-gen unblock, provenance is intentionally unbounded. Future maintainers should decide on a policy (e.g., cap entries, drop oldest, or deduplicate by source/strategy/keys) during the planned TextObject rebuild.
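One possible future bounding policy can be sketched in a few lines. This is a hypothetical illustration only, not project code: the function name, the cap value, and the "drop oldest" choice are all assumptions pending the planned rebuild.

```python
MAX_PROVENANCE_ENTRIES = 50  # assumed cap, not a decided value


def bound_provenance(entries: list[dict], max_entries: int = MAX_PROVENANCE_ENTRIES) -> list[dict]:
    """Keep only the most recent provenance entries, dropping the oldest.

    Sketch of the 'cap entries, drop oldest' option from the open question;
    a deduplicate-by-(source, strategy, keys) variant would replace the slice.
    """
    if len(entries) <= max_entries:
        return entries
    return entries[-max_entries:]


entries = [{"source": f"step{i}"} for i in range(60)]
bounded = bound_provenance(entries)
# 50 most recent entries survive; the first 10 are dropped
```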
Context¶
The Problem¶
TextObject (in src/tnh_scholar/ai_text_processing/text_object.py) is the primary container for text content with section organization and metadata tracking. It's used throughout the ai_text_processing module for translation, summarization, and other AI-powered workflows.
Current Pain Points:
- Weak Section Validation: `validate_sections()` (lines 359-372) only logs warnings and doesn't leverage NumberedText's boundary checking capabilities.
- Metadata Merge Ambiguity: `merge_metadata()` (lines 324-352) has an `override` parameter, but the semantics are confusing:
```python
if override:
    self._metadata |= new_metadata  # new overrides existing
else:
    self._metadata = new_metadata | self._metadata  # existing preserved
```
- Which values win in nested dicts?
- How are lists merged?
- When should you use `override=True` vs `override=False`?
- Silent Metadata Loss: No clear strategy for merging complex metadata structures (nested dicts, lists, provenance chains).
- No Provenance Chain: Metadata operations don't track transformation history (who merged what, when).
Current Implementation¶
From text_object.py:359-372:
```python
def validate_sections(self) -> None:
    """Basic validation of section integrity."""
    if not self._sections:
        raise ValueError("No sections set.")
    # Check section ordering and bounds
    for i, section in enumerate(self._sections):
        if section.section_range.start < 1:
            logger.warning(f"Section {i}: start line must be >= 1")
        if section.section_range.start > self.num_text.size:
            logger.warning(f"Section {i}: start line exceeds text length")
        if i > 0 and \
           section.section_range.start <= self._sections[i-1].section_range.start:
            logger.warning(f"Section {i}: non-sequential start line")
```
Problems:
- Only logs warnings (doesn't raise exceptions)
- Doesn't check for gaps or overlaps
- Doesn't leverage NumberedText's validation (ADR-AT03.2)
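For intuition, here is a standalone sketch of the kinds of boundary checks involved. This is illustrative only: the real checks live in NumberedText per ADR-AT03.2, and the function name and return shape here are hypothetical.

```python
def check_start_lines(start_lines: list[int], total_lines: int) -> dict:
    """Illustrative boundary checks over authoritative section start lines.

    In the start-line model, gaps and overlaps manifest as out-of-bounds or
    non-sequential start lines, or as a first section that misses line 1.
    """
    issues: dict = {"out_of_bounds": [], "non_sequential": []}
    for i, start in enumerate(start_lines):
        if start < 1 or start > total_lines:
            issues["out_of_bounds"].append(start)
        if i > 0 and start <= start_lines[i - 1]:
            issues["non_sequential"].append(start)
    # Full coverage requires the first section to begin at line 1
    issues["starts_at_one"] = bool(start_lines) and start_lines[0] == 1
    return issues


report = check_start_lines([1, 5, 3], total_lines=10)
# start line 3 comes after 5, so it is flagged as non-sequential
```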
From text_object.py:324-352:
```python
def merge_metadata(self, new_metadata: Metadata, override=False) -> None:
    """
    Merge new metadata with existing metadata.

    For now, performs simple dict-like union (|=) but can be extended
    to handle more complex merging logic in the future (e.g., merging
    nested structures, handling conflicts, merging arrays).
    """
    if not new_metadata:
        return
    if override:
        self._metadata |= new_metadata  # new overrides existing
    else:
        self._metadata = new_metadata | self._metadata  # existing values preserved
```
Problems:
- `override` semantics unclear for nested structures
- No deep merging (nested dicts are shallow-merged)
- No strategy for lists (append vs replace?)
- No provenance tracking
Design Drivers¶
- Fail Fast: Section validation should raise exceptions, not log warnings
- Leverage NumberedText: Use ADR-AT03.2's validation methods
- Clear Merge Semantics: Explicit strategies for common merge patterns
- Metadata Integrity: Preserve provenance and transformation history
- Foundation for GenAI Integration: Enable reliable metadata from GenAI Service (AT03 Tier 2)
Decision¶
1. Enhanced Section Validation¶
Replace weak warning-based validation with robust boundary checking using NumberedText (ADR-AT03.2):
```python
# text_object.py
from typing import Any, List, Optional


class SectionBoundaryError(Exception):
    """Section boundaries have gaps, overlaps, or out-of-bounds errors."""

    errors: List[Any]  # List of SectionValidationError from NumberedText
    coverage_report: dict[str, Any]

    def __init__(self, errors: List[Any], coverage_report: dict[str, Any]):
        self.errors = errors
        self.coverage_report = coverage_report
        # Build human-readable message
        error_msgs = [e.message for e in errors]
        message = (
            f"Section validation failed with {len(errors)} error(s):\n"
            + "\n".join(f"  - {msg}" for msg in error_msgs)
            + f"\n\nCoverage: {coverage_report['coverage_pct']:.1f}% "
            + f"({coverage_report['covered_lines']}/{coverage_report['total_lines']} lines)"
        )
        if coverage_report['gaps']:
            gap_ranges = [f"{start}-{end}" for start, end in coverage_report['gaps']]
            message += f"\nGaps at lines: {', '.join(gap_ranges)}"
        if coverage_report['overlaps']:
            overlap_count = sum(len(o['lines']) for o in coverage_report['overlaps'])
            message += f"\nOverlapping lines: {overlap_count}"
        super().__init__(message)


class TextObject:
    def validate_sections(self, raise_on_error: bool = True) -> Optional[List[Any]]:
        """Validate section integrity using NumberedText boundary checking.

        Args:
            raise_on_error: If True, raise SectionBoundaryError on validation failure.
                If False, return error list (empty if valid).

        Returns:
            List of validation errors if raise_on_error=False, None otherwise.

        Raises:
            ValueError: If no sections are set
            SectionBoundaryError: If validation fails and raise_on_error=True

        Example:
            >>> obj = TextObject(num_text, sections=[...])
            >>> obj.validate_sections()  # Raises if invalid
            >>> errors = obj.validate_sections(raise_on_error=False)
            >>> if errors:
            ...     print(f"Found {len(errors)} validation errors")
        """
        if not self._sections:
            raise ValueError("No sections set.")

        # Extract start lines from sections
        start_lines = [section.section_range.start for section in self._sections]

        # Use NumberedText validation (from ADR-AT03.2)
        errors = self.num_text.validate_section_boundaries(start_lines)

        if errors:
            if raise_on_error:
                # Get detailed coverage report for diagnostics
                coverage_report = self.num_text.get_coverage_report(start_lines)
                raise SectionBoundaryError(errors, coverage_report)
            else:
                return errors

        return None if raise_on_error else []
```
Design Notes:
- Leverages ADR-AT03.2: Uses NumberedText's `validate_section_boundaries()` and `get_coverage_report()`
- Fail-fast default: Raises an exception by default to catch errors early
- Opt-in non-throwing: Can return an error list for batch validation scenarios
- Rich diagnostics: Exception includes coverage report with gap/overlap details
- Start-line authoritative: Validation only checks start lines because end lines are derived from start lines (each section ends where the next begins, per `_build_section_objects:243-246`). This is the TextObject contract: callers supply start lines; ends are computed automatically.
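The start-line-authoritative contract above can be sketched as a small standalone function. This is a hypothetical helper for illustration; the real derivation lives in `_build_section_objects`, and the half-open range convention here is an assumption.

```python
def derive_section_ranges(start_lines: list[int], total_lines: int) -> list[tuple[int, int]]:
    """Derive half-open (start, end) ranges from authoritative start lines.

    Each section's end is the next section's start; the last section ends
    just past the final line of the text. With this scheme, gaps and
    overlaps between consecutive sections are impossible by construction.
    """
    ends = start_lines[1:] + [total_lines + 1]
    return list(zip(start_lines, ends))


ranges = derive_section_ranges([1, 4, 8], total_lines=10)
# sections cover lines 1-3, 4-7, and 8-10 with no gaps or overlaps
```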
2. Explicit Metadata Merge Strategies¶
Replace the ambiguous `override` parameter with explicit merge strategies:

```python
# text_object.py
from enum import Enum
from typing import Any


class MergeStrategy(Enum):
    """Strategy for merging metadata."""
    PRESERVE = "preserve"          # Keep existing values, ignore conflicts
    UPDATE = "update"              # New values override existing
    DEEP_MERGE = "deep"            # Deep merge nested dicts, append lists
    FAIL_ON_CONFLICT = "fail"      # Raise exception on key conflicts


class TextObject:
    def merge_metadata(
        self,
        new_metadata: Metadata,
        strategy: MergeStrategy = MergeStrategy.PRESERVE
    ) -> None:
        """Merge metadata with explicit strategy.

        Args:
            new_metadata: Metadata to merge
            strategy: Merge strategy (default: PRESERVE)

        Strategies:
            - PRESERVE: Keep existing values, only add new keys
            - UPDATE: New values override existing (shallow)
            - DEEP_MERGE: Recursively merge dicts, append lists
            - FAIL_ON_CONFLICT: Raise exception if keys overlap

        Raises:
            MetadataConflictError: If strategy=FAIL_ON_CONFLICT and keys overlap

        Example:
            >>> obj.merge_metadata(ai_metadata, strategy=MergeStrategy.UPDATE)
            >>> obj.merge_metadata(provenance, strategy=MergeStrategy.DEEP_MERGE)
        """
        if not new_metadata:
            return

        if strategy == MergeStrategy.PRESERVE:
            # Only add keys that don't exist
            for key, value in new_metadata.items():
                if key not in self._metadata:
                    self._metadata[key] = value
        elif strategy == MergeStrategy.UPDATE:
            # Shallow update: new values override existing
            self._metadata.update(new_metadata)
        elif strategy == MergeStrategy.DEEP_MERGE:
            # Deep merge nested structures
            merged_dict = self._deep_merge_metadata(
                self._metadata._data,
                new_metadata._data
            )
            self._metadata._data = merged_dict
        elif strategy == MergeStrategy.FAIL_ON_CONFLICT:
            # Check for conflicts first
            conflicts = set(self._metadata.keys()) & set(new_metadata.keys())
            if conflicts:
                raise MetadataConflictError(
                    f"Metadata key conflicts: {conflicts}"
                )
            self._metadata.update(new_metadata)

        logger.debug(
            f"Merged metadata using {strategy.value} strategy: "
            f"{len(new_metadata)} keys"
        )

    @staticmethod
    def _deep_merge_metadata(base: dict, update: dict) -> dict:
        """Recursively merge metadata dictionaries.

        Rules:
            - Nested dicts: Recursively merge
            - Lists: Append update items to base (no deduplication for unhashable items)
            - Scalars: Update value overrides base value

        Args:
            base: Base metadata dict (raw dict from Metadata._data)
            update: Update metadata dict (raw dict from Metadata._data)

        Returns:
            Merged metadata dict (to be assigned back to Metadata._data)

        Note:
            This is an interim implementation for ADR-AT03.3. List merging is
            kept simple to avoid TypeError with unhashable elements (e.g.,
            provenance dicts). For production use, consider more sophisticated
            merge strategies.
        """
        result = base.copy()
        for key, value in update.items():
            if key not in result:
                # New key: add directly
                result[key] = value
            elif isinstance(result[key], dict) and isinstance(value, dict):
                # Both dicts: recurse
                result[key] = TextObject._deep_merge_metadata(result[key], value)
            elif isinstance(result[key], list) and isinstance(value, list):
                # Both lists: simple append (preserves order, no deduplication)
                # Note: Avoids TypeError with unhashable items like provenance dicts
                result[key] = result[key] + value
            else:
                # Scalar or type mismatch: override with update value
                result[key] = value
        return result
```
Design Notes:
- Explicit strategies: No ambiguity about merge behavior
- Deep merge support: Handles nested dicts and list appending
- Fail-on-conflict option: Enables strict metadata validation
- Backward compatible: Old `merge_metadata(override=True)` → `MergeStrategy.UPDATE`
- Metadata contract: `merge_metadata` expects sanitized `Metadata` instances, not raw dicts. Values pass through `Metadata` processing for JSON-serializable types; callers must not bypass it.
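To make the DEEP_MERGE rules concrete, here is a standalone run of the same algorithm on plain dicts (stand-ins for `Metadata._data`; only the rules stated above are assumed):

```python
def deep_merge(base: dict, update: dict) -> dict:
    """Same rules as _deep_merge_metadata: recurse on nested dicts,
    append lists without deduplication, let update win on scalars."""
    result = base.copy()
    for key, value in update.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        elif isinstance(result[key], list) and isinstance(value, list):
            result[key] = result[key] + value  # append, no deduplication
        else:
            result[key] = value
    return result


base = {"processing": {"stage": "translation"}, "tags": ["dharma"], "version": 1}
update = {"processing": {"model": "gpt-4"}, "tags": ["teaching"], "version": 2}
merged = deep_merge(base, update)
# merged == {"processing": {"stage": "translation", "model": "gpt-4"},
#            "tags": ["dharma", "teaching"], "version": 2}
```

Note that `base` itself is not mutated at the top level (a shallow `copy()` is taken), though nested containers are shared until replaced.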
3. Metadata Provenance Tracking¶
Add provenance chain to track metadata transformations:
```python
# text_object.py
from datetime import datetime
from typing import Optional


class TextObject:
    def merge_metadata(
        self,
        new_metadata: Metadata,
        strategy: MergeStrategy = MergeStrategy.PRESERVE,
        source: Optional[str] = None
    ) -> None:
        """Merge metadata with provenance tracking.

        Args:
            new_metadata: Metadata to merge
            strategy: Merge strategy
            source: Optional source identifier (e.g., 'genai_service', 'translation')
        """
        if not new_metadata:
            return

        # Perform merge (same logic as above)
        # ... (merge logic from section 2)

        # Track provenance
        if source:
            if '_provenance' not in self._metadata:
                self._metadata['_provenance'] = []
            self._metadata['_provenance'].append({
                'timestamp': datetime.utcnow().isoformat(),
                'source': source,
                'strategy': strategy.value,
                'keys_added': list(new_metadata.keys())
            })
```
Design Notes:
- Opt-in provenance: Only tracked if `source` is provided
- Transformation history: Captures when, where, and how metadata was merged
- Foundation for AT03 Tier 2: GenAI Service can tag metadata with `source='genai_service'`
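The provenance record shape can be exercised standalone. This is a sketch with a hypothetical helper name; it also uses timezone-aware UTC, since `datetime.utcnow()` is deprecated as of Python 3.12, which the project may want to adopt.

```python
from datetime import datetime, timezone


def make_provenance_entry(source: str, strategy: str, keys_added: list[str]) -> dict:
    """Build one provenance record with the fields this ADR tracks.

    Hypothetical helper; timezone-aware UTC replaces datetime.utcnow().
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "strategy": strategy,
        "keys_added": keys_added,
    }


metadata: dict = {}
entry = make_provenance_entry("genai_service", "deep", ["model", "tokens"])
metadata.setdefault("_provenance", []).append(entry)
```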
4. Backward Compatibility Helpers¶
Maintain compatibility with existing code during transition:
```python
# text_object.py
import warnings


class TextObject:
    def merge_metadata_legacy(
        self,
        new_metadata: Metadata,
        override: bool = False
    ) -> None:
        """Legacy merge_metadata interface (deprecated).

        DEPRECATED: Use merge_metadata() with explicit MergeStrategy.

        Args:
            new_metadata: Metadata to merge
            override: If True, new values override (UPDATE strategy)
                If False, existing values preserved (PRESERVE strategy)
        """
        warnings.warn(
            "merge_metadata(override=...) is deprecated. "
            "Use merge_metadata(strategy=MergeStrategy.UPDATE) instead.",
            DeprecationWarning,
            stacklevel=2
        )
        strategy = MergeStrategy.UPDATE if override else MergeStrategy.PRESERVE
        self.merge_metadata(new_metadata, strategy=strategy)
```
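Call sites migrating off the legacy signature can confirm the warning fires. A minimal standalone sketch of the same deprecation pattern (the demo function is a stand-in, not project code):

```python
import warnings


def merge_metadata_legacy_demo(override: bool = False) -> str:
    """Stand-in showing the override-to-strategy mapping and the warning."""
    warnings.warn(
        "merge_metadata(override=...) is deprecated. "
        "Use merge_metadata(strategy=MergeStrategy.UPDATE) instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return "update" if override else "preserve"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    strategy = merge_metadata_legacy_demo(override=True)
# strategy == "update"; caught holds one DeprecationWarning
```

`stacklevel=2` points the warning at the caller rather than the wrapper, which is what makes deprecation reports actionable during migration.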
Object-Service Conformance¶
Alignment with ADR-OS01¶
This ADR positions TextObject to align with TNH Scholar's Object-Service Architecture (ADR-OS01), specifically as a domain service / orchestrator component in the ai_text_processing subsystem.
Current State: Domain Model with Service Characteristics¶
TextObject Today (hybrid role):
From ADR-OS01 §3.1 Layer Structure:

| Layer | TextObject's Current Role |
|---|---|
| Domain Models | ✅ Typed container for text + sections + metadata |
| Service (Orchestrator) | ⚠️ Partial: Coordinates NumberedText, Metadata, section management |
| Allowed Dependencies | ✅ Domain models only (NumberedText, Metadata) |
Key Conformance Points (Current):
- Strong Typing (OS01 §1.1): Uses Pydantic models (`LogicalSection`, `Metadata`)
- Config at init (OS01 §4): Constructor accepts domain objects (not runtime config yet)
- Pure domain logic (OS01 §3.2): No direct transport or infrastructure dependencies
- Explicit errors (OS01 §8.7): Introduces `SectionBoundaryError`, `MetadataConflictError`

Partial Conformance Issues:
- ⚠️ No Envelope pattern: Methods return domain objects directly (not `Envelope`)
- ⚠️ No Provenance: Metadata tracked but not with OS01's `Provenance` model
- ⚠️ Mixed concerns: Contains both data model AND orchestration logic
Metadata Management as Internal Mapping¶
The new `merge_metadata()` strategies align with OS01 §8.8 "Internal Layer Adapters":

Metadata Merge as Boundary Mapping:

From OS01 §8.8:

> "Adapters and mappers are not just for translating vendor payloads; they are equally critical for translating between our own internal abstraction layers."
TextObject's Metadata Merging:
```python
# Current: Internal mapping between metadata layers
class TextObject:
    def merge_metadata(
        self,
        new_metadata: Metadata,
        strategy: MergeStrategy = MergeStrategy.PRESERVE,
        source: Optional[str] = None  # Provenance tracking!
    ) -> None:
        """Map metadata across processing boundaries."""
        # 1. Perform semantic translation (deep merge, conflict resolution)
        # 2. Track provenance (when, what, source)
        # 3. Maintain invariants (metadata integrity)
```
This follows OS01's principle:
"Mapping makes these changes explicit... Internal mappers are the right place to inject policy decisions, provenance metadata, or safety checks as data crosses boundaries."
Conformance:
- ✅ Pure mapper logic (OS01 §8.6): `_deep_merge_metadata()` is pure (no I/O)
- ✅ Provenance injection (OS01 §8.8): `source` parameter enables tracking
- ✅ Policy at boundaries (OS01 §4.3): `MergeStrategy` is a domain policy
Future Evolution: Full Service Orchestrator¶
Path to OS01 Compliance (future ADR):
When TextObject becomes a full Service Orchestrator (like GenAIService):
```python
# Future: TextObject as Service Orchestrator (hypothetical)
class TextObjectService:
    """Service for creating and validating text objects."""

    def __init__(
        self,
        settings: Settings,
        validator: SectionValidator | None = None,
        metadata_policy: MetadataPolicy | None = None
    ):
        self.settings = settings
        self._validator = validator or NumberedTextValidator()
        self._policy = metadata_policy or MetadataPolicy()

    def create_from_ai_response(
        self,
        response: AIResponse,
        params: TextObjectParams | None = None
    ) -> Envelope:
        """Create TextObject from AI response with validation.

        Returns:
            Envelope with TextObject result or validation errors
        """
        # 1. Extract numbered text (domain model)
        num_text = NumberedText(response.content)

        # 2. Validate sections (uses ADR-AT03.2)
        start_lines = [s.start_line for s in response.sections]
        errors = num_text.validate_section_boundaries(start_lines)
        if errors:
            return Envelope(
                status="failed",
                error="Section validation failed",
                diagnostics={"validation_errors": [e.message for e in errors]},
                provenance=Provenance(
                    backend="ai_text_processing",
                    params={"response_hash": hash(response)}
                )
            )

        # 3. Build TextObject
        sections = self._build_sections(response.sections, num_text.size)
        text_obj = TextObject(
            num_text=num_text,
            language=response.language,
            sections=sections
        )

        # 4. Merge metadata with policy
        text_obj.merge_metadata(
            Metadata.from_yaml(response.document_metadata),
            strategy=self._policy.merge_strategy,
            source="ai_response"
        )

        # 5. Return envelope
        return Envelope(
            status="succeeded",
            result=text_obj,
            provenance=Provenance(
                backend="ai_text_processing",
                model=response.language,  # Or AI model used
                params={"strategy": self._policy.merge_strategy.value}
            )
        )
```
Full OS01 Conformance:
- ✅ Config at init: `Settings`, `MetadataPolicy` passed at construction
- ✅ Params per call: `params` per request (not global state)
- ✅ Envelope always: Returns `Envelope` with provenance
- ✅ Protocol-based ports: `SectionValidator` protocol for validation
- ✅ Internal mapping: Metadata merging as boundary translation
- ✅ Provenance tracking: `_provenance` chain in metadata
- ✅ Typed errors: `SectionBoundaryError` → `Envelope(status="failed")`
Integration with AT03 Refactor¶
TextObject in AT03 Service Stack:
```
tnh-gen CLI (Application Layer)
  └─ TextProcessor (Orchestrator - AT03 Tier 1+2)
       ├─ GenAIService.generate() → AIResponse
       └─ TextObjectService.create_from_ai_response()  [Future]
            ├─ NumberedText.validate_section_boundaries()  [ADR-AT03.2]
            └─ TextObject.merge_metadata()  [This ADR]

Envelope(
    status="succeeded",
    result=TextObject,
    provenance=Provenance(backend="ai_text_processing", ...)
)
```
Current AT03 Implementation (simpler):

For AT03's 1-2 week timeline, TextObject stays as a domain model with enhanced validation:
- ✅ Enhanced validation (uses NumberedText from AT03.2)
- ✅ Explicit merge strategies (this ADR)
- ✅ Provenance tracking in metadata
- ❌ No Envelope wrapping (deferred to full platform)
- ❌ No Service protocol (deferred to AT04)
TODO Reference¶
This work addresses TODO.md Item #11 ("Improve NumberedText Ergonomics") and TODO.md lines 495-500 (object-service conformance for TextObject and NumberedText).
The metadata merge strategies and provenance tracking lay the groundwork for future full OS01 conformance when TextObject becomes a service orchestrator.
Consequences¶
Positive¶
- Fail-Fast Validation: Section errors caught immediately with detailed diagnostics
- Clear Merge Semantics: No ambiguity about metadata merge behavior
- Deep Merge Support: Handles complex nested metadata structures correctly
- Provenance Tracking: Metadata transformations traceable for debugging and audit
- Foundation for GenAI: Enables reliable metadata from GenAI Service (AT03 Tier 2)
- Better Error Messages: SectionBoundaryError includes coverage report with actionable details
- Backward Compatible: Legacy code continues working during transition
Negative¶
- Breaking Change: `validate_sections()` now raises exceptions by default (opt into the old behavior with `raise_on_error=False`)
- API Complexity: Multiple merge strategies increase the API surface
- Migration Effort: Existing calls to `merge_metadata(override=True)` need updating
- Provenance Overhead: Tracking adds metadata size (minimal impact)
Risks & Mitigations¶
| Risk | Impact | Mitigation |
|---|---|---|
| Breaking existing code | Code relying on warning-only validation breaks | Gradual rollout; legacy helper methods |
| Merge strategy confusion | Users pick wrong strategy | Clear docstrings with examples; default to PRESERVE |
| Performance regression | Deep merge slower than shallow | Benchmark; optimize if needed (unlikely bottleneck) |
| Provenance bloat | Large provenance chains | Make provenance opt-in; add truncation after N entries |
Alternatives Considered¶
Alternative 1: Keep Warning-Only Validation¶
Approach: Leave `validate_sections()` as warnings; don't integrate NumberedText validation.

Rejected:
- Fails late (during content retrieval instead of validation)
- Doesn't leverage ADR-AT03.2's robust boundary checking
- Accumulates technical debt
Alternative 2: Single "Smart" Merge Strategy¶
Approach: Auto-detect merge strategy based on metadata content.
Rejected:
- Magic behavior is hard to debug
- Users lose explicit control
- Ambiguous in edge cases (what if metadata has both dicts and scalars?)
Alternative 3: Immutable Metadata¶
Approach: Make Metadata immutable, return new instance on merge.
Rejected:
- Breaking change to existing code
- Performance overhead (copying large metadata)
- Doesn't align with the current Metadata class design
- Can revisit in a future ADR if needed
Implementation Notes¶
Phase 1: Section Validation Enhancement (Days 1-2)¶
- Add `SectionBoundaryError` exception class
- Update `validate_sections()` to use NumberedText validation
- Add `raise_on_error` parameter for gradual migration
- Unit tests:
  - Valid sections (no errors)
  - Invalid sections (gaps, overlaps, out-of-bounds)
  - Error message formatting
  - Non-throwing mode
Phase 2: Metadata Merge Strategies (Days 3-4)¶
- Add `MergeStrategy` enum
- Implement new `merge_metadata()` with strategy parameter
- Implement `_deep_merge_metadata()` helper
- Add `merge_metadata_legacy()` for backward compatibility
- Unit tests:
  - Each strategy with various metadata structures
  - Deep merge with nested dicts and lists
  - Conflict detection (FAIL_ON_CONFLICT)
  - Provenance tracking
Phase 3: Integration (Day 5)¶
- Update `from_response()` to use new validation (lines 292-322)
- Update `from_info()` to validate sections (lines 399-415)
- Integration tests:
  - AI-generated sections with validation
  - Metadata merging in realistic workflows
  - Provenance chains through multi-step processing
Testing Strategy¶
Unit Tests (`tests/ai_text_processing/test_text_object_robustness.py`):

```python
def test_validate_sections_raises_on_gap():
    """validate_sections raises SectionBoundaryError on gaps."""
    num_text = NumberedText("\n".join(f"line{i}" for i in range(1, 11)))
    sections = [
        SectionObject("Section 1", SectionRange(1, 5), None),
        SectionObject("Section 2", SectionRange(7, 11), None),  # Gap at line 6
    ]
    obj = TextObject(num_text, sections=sections)

    with pytest.raises(SectionBoundaryError) as exc_info:
        obj.validate_sections()

    assert "gap" in str(exc_info.value).lower()
    assert "line 6" in str(exc_info.value) or "lines 5-6" in str(exc_info.value)


def test_merge_metadata_deep_merge_nested_dicts():
    """Deep merge strategy merges nested dicts."""
    obj = TextObject(num_text, metadata=Metadata({
        'processing': {'stage': 'translation', 'model': 'gpt-4'},
        'tags': ['dharma']
    }))
    obj.merge_metadata(
        Metadata({
            'processing': {'version': '1.0'},
            'tags': ['teaching']
        }),
        strategy=MergeStrategy.DEEP_MERGE
    )

    assert obj.metadata['processing'] == {
        'stage': 'translation',
        'model': 'gpt-4',
        'version': '1.0'
    }
    assert set(obj.metadata['tags']) == {'dharma', 'teaching'}


def test_merge_metadata_provenance_tracking():
    """Provenance tracked when source provided."""
    obj = TextObject(num_text)
    obj.merge_metadata(
        Metadata({'model': 'gpt-4'}),
        source='genai_service'
    )

    assert '_provenance' in obj.metadata
    assert len(obj.metadata['_provenance']) == 1
    assert obj.metadata['_provenance'][0]['source'] == 'genai_service'
```
Integration Tests:
- TextObject creation from AIResponse with section validation
- Multi-step metadata merging (provenance chain)
- Error handling in realistic workflows
Migration Path¶
Phase 1: Add New APIs (Fail-Fast by Default)¶
- Add new `merge_metadata()` with `strategy` parameter
- Keep the old `merge_metadata()` signature working via a deprecation wrapper
- Update `validate_sections()` to fail fast by default (`raise_on_error=True`)
- Rationale: This is an interim implementation to unblock tnh-gen, not a final solution. Fail-fast catches errors early during development and prevents silent failures.
Phase 2: Gradual Adoption¶
- Update internal `ai_text_processing` code to use new APIs
- Update tests to expect new behavior
- Add deprecation warnings to old methods
Phase 3: Remove Legacy (Future)¶
- Remove `merge_metadata_legacy()` after 2-3 releases
- Switch `validate_sections()` to throwing by default
Success Criteria¶
This ADR succeeds if:
- Validation robustness: Section boundary errors caught with actionable diagnostics
- Merge clarity: Developers understand which strategy to use for their use case
- Deep merge correctness: Nested structures merged without data loss
- Provenance working: GenAI Service integration can track metadata sources (AT03 Tier 2)
- Tests pass: >95% code coverage for new validation and merge logic
- No regressions: Existing workflows continue working during migration
References¶
Related ADRs¶
- ADR-AT03: Minimal AI Text Processing Refactor - Parent ADR (Tier 0: TextObject robustness)
- ADR-AT03.2: NumberedText Validation - Sibling ADR (validation methods used here)
- ADR-AT03.1: Transition Plan - Implementation timeline
- ADR-AT02: TextObject Architecture - Historical context
Implementation Files¶
- Current: `src/tnh_scholar/ai_text_processing/text_object.py` - `merge_metadata()` and `validate_sections()` implementations
- Dependency: `src/tnh_scholar/text_processing/numbered_text.py` - NumberedText implementation
Approval Path: Architecture review → Implementation → Unit tests → Integration with ADR-AT03.2
This ADR completes the TextObject robustness improvements (AT03 Tier 0) and establishes the foundation for GenAI Service integration (AT03 Tier 2) and prompt system adoption (AT03 Tier 3).
Addendum 2025-12-14: Pydantic v2 BaseModel Requirement¶
Background Context¶
During implementation review, we identified that TextObject violates the ADR-OS01 §1.1 "Strong Typing" principle:
"All payloads, parameters, and responses use Pydantic or dataclass models"
Current State: TextObject is implemented as a plain Python class with manual attribute assignment, while sibling models (LogicalSection, AIResponse, TextObjectInfo) correctly use Pydantic v2 BaseModel.
ADR-OS01 Conformance Requirement (§3.1 Layer Structure):

| Layer | Example Classes | Type Requirement |
|---|---|---|
| Domain Models | `CompletionResult`, `SpeakerBlock` | Pydantic or dataclass |
Addendum Decision¶
Convert TextObject to Pydantic v2 BaseModel to align with TNH Scholar's object-service architecture principles.
Rationale:
- Consistency: All domain models in `ai_text_processing` should use the same pattern
- Type Safety: Pydantic enforces runtime validation of `num_text`, `language`, `sections`, `metadata`
- Serialization: Automatic `.model_dump()` and `.model_dump_json()` support
- Validation: Leverage `@model_validator` for complex initialization logic (auto-detect language, validate sections)
- Future-Proof: Enables easier integration with GenAI Service and prompt system (AT03 Tier 2+3)
Implementation Strategy¶
Challenge: TextObject has complex initialization logic that requires careful migration:
- Auto-detection of `language` from content if not provided
- Automatic section validation on construction
- Support for `NumberedText` (a non-Pydantic type)
Solution: Use Pydantic v2's `model_validator` with `arbitrary_types_allowed`:

```python
from typing import List

from pydantic import BaseModel, ConfigDict, Field, model_validator
from typing_extensions import Self


class TextObject(BaseModel):
    """Domain model for text content with section organization."""

    model_config = ConfigDict(
        arbitrary_types_allowed=True,  # Allow NumberedText
        validate_assignment=True
    )

    num_text: NumberedText
    language: str = ""  # Will be auto-detected if empty
    sections: List[SectionObject] = Field(default_factory=list)
    metadata: Metadata = Field(default_factory=Metadata)

    @model_validator(mode='after')
    def validate_initialization(self) -> Self:
        """Post-initialization validation and defaults."""
        # Auto-detect language if not provided
        if not self.language:
            self.language = get_language_code_from_text(self.num_text.content)
        # Validate sections if provided
        if self.sections:
            self.validate_sections()
        return self
```
Private Attributes: The `_sections` and `_metadata` private attributes become public fields managed by Pydantic. The property accessors remain for backward compatibility.
Backward Compatibility: All existing methods and properties remain unchanged; only the internal class structure changes.
Migration Checklist¶
- Convert `TextObject` class definition to inherit from `BaseModel` - `src/tnh_scholar/ai_text_processing/text_object.py:217`
- Add `model_config = ConfigDict(arbitrary_types_allowed=True)` - `src/tnh_scholar/ai_text_processing/text_object.py:226`
- Convert `__init__` parameters to class fields with defaults - `src/tnh_scholar/ai_text_processing/text_object.py:221-224`
- Move initialization logic to `@model_validator(mode='after')` - `src/tnh_scholar/ai_text_processing/text_object.py:243-254`
- Update property accessors - Removed `_sections` and `_metadata` prefixes; fields now public
- Update `SectionBoundaryError` to inherit from `ValidationError` - `src/tnh_scholar/ai_text_processing/text_object.py:167`
- Verify all tests pass with new Pydantic model - ALL 9 TESTS PASSING
- Update any serialization code to use `.model_dump()` - VERIFIED: No changes needed
Related Artifacts¶
- TNH Scholar Architecture: ADR-OS01: Object-Service Architecture §1.1, §3.1
- Implementation File: `src/tnh_scholar/ai_text_processing/text_object.py:196`
- Sibling Models: `LogicalSection` (line 90), `AIResponse` (line 108), `TextObjectInfo` (line 152) - already using Pydantic v2
Addendum 2025-12-14: Reversal - Plain Python Class for TextObject¶
Discovered Friction with Pydantic Implementation¶
After implementing TextObject as a Pydantic v2 BaseModel (previous addendum), we identified significant friction that suggests Pydantic is not the right fit for this domain model:
**Friction Points Observed:**

1. `__iter__` override with `# type: ignore[override]` (`src/tnh_scholar/ai_text_processing/text_object.py:412`)
    - Pydantic's `BaseModel.__iter__` yields `(field_name, value)` pairs for dict-like iteration
    - TextObject needs domain-specific iteration over `SectionEntry` objects
    - This fundamental semantic conflict requires suppressing type checking
2. Dual validation hooks (`src/tnh_scholar/ai_text_processing/text_object.py:273-297`)
    - Required both `@model_validator(mode="before")` AND `model_post_init`
    - Needed because `Metadata` is a `MutableMapping`, not a Pydantic model
    - Suggests Pydantic's lifecycle doesn't match domain needs
3. `arbitrary_types_allowed=True` (`src/tnh_scholar/ai_text_processing/text_object.py:382`)
    - Required for `NumberedText` (plain class) and `Metadata` (`MutableMapping`)
    - Disables much of Pydantic's validation power for core fields
    - Defeats the primary benefit of using Pydantic
While the implementation works (all 9 tests passing, zero mypy errors), the complexity indicates a fundamental impedance mismatch between Pydantic's data-centric model and TextObject's rich domain behavior.
Design Principle Discovery¶
This friction led to an important design principle clarification documented in Design Principles: Choosing Between Pydantic Models and Plain Python Classes:
**When NOT to use Pydantic BaseModel:**

- Custom iteration semantics that conflict with Pydantic's `__iter__`
- Core dependencies that are non-Pydantic types (requires `arbitrary_types_allowed`)
- Complex initialization requiring multiple validation hooks
- Rich domain behavior beyond data validation
TextObject meets all four criteria for using a plain Python class instead of Pydantic.
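A minimal stdlib-only sketch of the iteration criterion (using hypothetical `Document`/`Section` stand-ins rather than the real classes): a plain class can make `__iter__` yield domain objects directly, where pydantic's `BaseModel.__iter__` would yield `(field_name, value)` pairs.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Section:
    title: str
    content: str

class Document:
    """Plain class: free to define domain-specific iteration semantics."""

    def __init__(self, sections: List[Section]):
        self.sections = sections

    def __iter__(self) -> Iterator[Section]:
        # Yields domain objects, not (field_name, value) pairs.
        yield from self.sections

doc = Document([Section("Intro", "..."), Section("Body", "...")])
titles = [s.title for s in doc]
assert titles == ["Intro", "Body"]
```

No `# type: ignore` is needed because there is no inherited `__iter__` contract to override.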
Reversal Decision¶
Revert TextObject to plain Python class while keeping TextObjectInfo as Pydantic DTO for serialization.
**Rationale:**

- Separation of Concerns: Domain logic (TextObject) is separate from serialization (TextObjectInfo)
- Clean Iteration: No `# type: ignore[override]` needed for domain-specific `__iter__`
- Simple Initialization: Plain `__init__` with explicit validation, no dual validators
- No `arbitrary_types_allowed`: Works naturally with NumberedText and Metadata
- Follows Updated Design Principles: Real-world example of when a plain class is better
- Living Design History: Documents architectural learning for future work
**Hybrid Pattern:**

```python
# Domain model: Plain Python class
class TextObject:
    """Rich domain model with complex behavior (NOT a Pydantic model)."""

    def __init__(
        self,
        num_text: NumberedText,
        language: Optional[str] = None,
        sections: Optional[List[SectionObject]] = None,
        metadata: Optional[Metadata] = None,
    ):
        self.num_text = num_text
        self.language = language or get_language_code_from_text(num_text.content)
        self.sections = sections or []
        self.metadata = metadata or Metadata()
        if self.sections:
            self.validate_sections()

    def __iter__(self) -> Iterator[SectionEntry]:
        """Domain-specific iteration - clean, no type: ignore needed."""
        for i, section in enumerate(self.sections):
            content = self.num_text.get_segment(
                section.section_range.start, section.section_range.end
            )
            yield SectionEntry(number=i + 1, title=section.title, content=content)

    def export_info(self, source_file: Optional[Path] = None) -> TextObjectInfo:
        """Convert to Pydantic DTO for serialization."""
        return TextObjectInfo(
            source_file=source_file,
            language=self.language,
            sections=self.sections,
            metadata=self.metadata,
        )

    @classmethod
    def from_info(
        cls, info: TextObjectInfo, metadata: Metadata, num_text: NumberedText
    ) -> "TextObject":
        """Create from Pydantic DTO."""
        return cls(
            num_text=num_text,
            language=info.language,
            sections=info.sections,
            metadata=metadata,
        )


# DTO: Pydantic model for serialization (unchanged)
class TextObjectInfo(BaseModel):
    """Serializable snapshot of TextObject state."""

    source_file: Optional[Path] = None
    language: str
    sections: List[SectionObject]
    metadata: Metadata

    @model_validator(mode="before")
    @classmethod
    def _coerce_metadata(cls, data: Any) -> Any:
        """Coerce raw dict to Metadata instance before Pydantic validation."""
        if isinstance(data, dict) and "metadata" in data and isinstance(data["metadata"], dict):
            data = {**data, "metadata": Metadata(data["metadata"])}
        return data
```
**Benefits of This Approach:**

- ✅ Clean Domain API: No Pydantic-specific workarounds in TextObject
- ✅ Type Safety: Pydantic validates TextObjectInfo DTOs
- ✅ Serialization: `TextObjectInfo` provides JSON persistence via Pydantic
- ✅ Simplicity: Each class has a clear, focused purpose
- ✅ No Friction: No `# type: ignore`, no `arbitrary_types_allowed`, no dual validators
- ✅ Architectural Learning: Documents when a plain class is better than Pydantic
Reversal Checklist¶
Implementation tasks for code agent:
- Convert `TextObject` from `BaseModel` back to plain Python class
- Remove `model_config = ConfigDict(arbitrary_types_allowed=True)`
- Convert class fields back to `__init__` parameters with type hints
- Replace `@model_validator(mode='after')` with plain initialization logic in `__init__`
- Remove `# type: ignore[override]` from `__iter__` method
- Keep `TextObjectInfo` as Pydantic BaseModel unchanged (DTO for serialization)
- Verify `export_info()` and `from_info()` work correctly with hybrid pattern
- Run full test suite - confirm all 9 tests still passing
- Run mypy - confirm zero type errors (should improve with removal of `type: ignore`)
ADR-OS01 Conformance Note¶
This reversal does NOT violate the ADR-OS01 §1.1 "Strong Typing" requirement. The updated Design Principles document clarifies:

> Default to Pydantic v2 `BaseModel` for domain models, but use plain Python classes when Pydantic introduces significant friction.
TextObject is a rich domain model with behavior beyond data validation. The hybrid pattern (plain class + Pydantic DTO) maintains type safety while avoiding Pydantic's lifecycle friction.
**ADR-OS01 Compliance via Hybrid Pattern:**

- ✅ Type Safety: Full type hints on `TextObject.__init__` and all methods
- ✅ Serialization: `TextObjectInfo` (Pydantic) handles JSON persistence
- ✅ Validation: Explicit validation methods with typed errors
- ✅ Transport: Pydantic DTOs at serialization boundaries
Implementation References¶
- Design Principles Update: Choosing Between Pydantic Models and Plain Python Classes
- Examples in Codebase:
    - Plain classes: `src/tnh_scholar/text_processing/numbered_text.py`, `src/tnh_scholar/metadata/metadata.py`
    - Pydantic DTOs: `src/tnh_scholar/ai_text_processing/text_object.py` (`TextObjectInfo`, `LogicalSection`)
    - Pydantic domain models: `src/tnh_scholar/ai_text_processing/text_object.py` (`AIResponse`), `src/tnh_scholar/gen_ai_service/models/domain.py` (`CompletionResult`)
Lessons Learned¶
Key Insight: Architectural consistency doesn't mean forcing every model into the same pattern. The "Strong Typing" principle is satisfied by:
- Using Pydantic where it excels (DTOs, simple models, serialization)
- Using plain classes where Pydantic creates friction (rich domain models, custom iteration)
- Maintaining type safety through explicit type hints and validation
This addendum provides a living example of architectural decision-making based on real implementation experience, documenting the learning for future development work.
Addendum 2025-12-14: Immutability - Refactor transform() to Return New Instance¶
Inconsistency Discovered¶
After reverting TextObject to a plain Python class, a design inconsistency was identified in the `transform()` method:

**Current Implementation** (`src/tnh_scholar/ai_text_processing/text_object.py:776-809`):

```python
def transform(
    self,
    data_str: Optional[str] = None,
    language: Optional[str] = None,
    metadata: Optional[Metadata] = None,
    process_metadata: Optional[ProcessMetadata] = None,
    sections: Optional[List[SectionObject]] = None,
) -> Self:
    """
    Update TextObject content and metadata **in place** and return self for chaining.
    """
    # Update potentially changed elements
    if data_str:
        self.num_text = NumberedText(data_str)  # MUTATION
    if language:
        self.language = language  # MUTATION
    if metadata:
        self.merge_metadata(metadata)  # MUTATION
    if process_metadata:
        self.metadata.add_process_info(process_metadata)  # MUTATION
    if sections:
        self.sections = sections  # MUTATION
    return self
```
**Problem:** This mutates the TextObject in place, violating the immutability design principle.

**Design Principle** (Design Principles: Immutability by Default):

> Keep data models immutable when possible for safer concurrent code

**NumberedText Precedent** (`src/tnh_scholar/text_processing/numbered_text.py:39-50`):

> Immutable container for text documents with numbered lines. Instances should not be modified after creation. All operations return new data rather than mutating the instance.
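The same return-new-data style can be sketched with a minimal stdlib-only example (`LineText` and `with_content` are hypothetical stand-ins, not the real NumberedText API):

```python
class LineText:
    """Immutable-by-convention container: operations return new instances."""

    def __init__(self, content: str):
        self._content = content
        self._lines = content.splitlines()

    @property
    def content(self) -> str:
        return self._content

    @property
    def size(self) -> int:
        return len(self._lines)

    def with_content(self, new_content: str) -> "LineText":
        # Returns a fresh instance; self is never mutated.
        return LineText(new_content)

a = LineText("one\ntwo")
b = a.with_content("three\nfour\nfive")
assert a.content == "one\ntwo"  # original unchanged
assert b.size == 3
```

Because no method rebinds `self._content` or `self._lines` after `__init__`, any number of derived instances can safely share an original.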
Usage Analysis¶
The `transform()` method is used in 7-8 call sites across the codebase:

- `line_translator.py` (2 calls):
    - Translation: `text.transform(data_str=new_text, language=target_language)`
    - Process tracking: `text.transform(process_metadata=process_metadata)`
- `ai_text_processing.py` (4 calls):
    - Various processing stages adding process metadata
- `sent_split.py` (2 calls):
    - Sentence splitting with new content
Current usage pattern:

```python
# Pattern 1: Mutation with return value ignored
text.transform(process_metadata=process_metadata)
return text  # Callers expect mutation

# Pattern 2: Chained mutation
return text.transform(data_str=new_text, language=target_language)
```
Proposed Refactor¶
Make `transform()` return a new TextObject instance, following immutability principles:

```python
def transform(
    self,
    data_str: Optional[str] = None,
    language: Optional[str] = None,
    metadata: Optional[Metadata] = None,
    process_metadata: Optional[ProcessMetadata] = None,
    sections: Optional[List[SectionObject]] = None,
) -> "TextObject":
    """
    Create a new TextObject with transformed content and metadata.

    Returns a new instance with updated values while preserving the original.
    Follows immutability design principle for safer concurrent processing.

    Args:
        data_str: New text content (creates new NumberedText)
        language: New language code
        metadata: Metadata to merge into new instance
        process_metadata: Process info to add to metadata
        sections: Replacement section list

    Returns:
        New TextObject instance with transformed values

    Example:
        >>> original = TextObject(num_text, language="en")
        >>> translated = original.transform(
        ...     data_str=translated_text,
        ...     language="es"
        ... )
        >>> assert original.language == "en"  # Original unchanged
        >>> assert translated.language == "es"
    """
    # Start with current values
    new_num_text = NumberedText(data_str) if data_str else self.num_text
    new_language = language if language else self.language
    new_sections = sections if sections else self.sections.copy()

    # Copy the top-level metadata dict (shallow) so updates don't leak into self
    new_metadata = Metadata(self.metadata._data.copy())

    # Apply metadata updates to the copy
    if metadata:
        new_metadata.update(metadata)
    if process_metadata:
        new_metadata.add_process_info(process_metadata)

    # Return new instance
    return TextObject(
        num_text=new_num_text,
        language=new_language,
        sections=new_sections,
        metadata=new_metadata,
        validate_on_init=False,  # Skip validation (sections already validated)
    )
```
Migration Impact¶
**Call Site Updates** (7-8 locations):

```python
# Before (mutation):
text.transform(process_metadata=process_metadata)
return text

# After (immutable):
result = text.transform(process_metadata=process_metadata)
return result

# Before (chained mutation):
return text.transform(data_str=new_text, language=target_language)

# After (immutable - no change needed):
return text.transform(data_str=new_text, language=target_language)
```
**Breaking Change:** Code that relies on in-place mutation will need updates, but most call sites already use the return value.
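The failure mode can be sketched with a toy stdlib-only model (`Doc` is hypothetical, not the real class): code that held an alias and relied on in-place mutation silently stops seeing updates once `transform()` returns a new instance.

```python
class Doc:
    """Hypothetical immutable-style model mirroring the refactored transform()."""

    def __init__(self, language: str):
        self.language = language

    def transform(self, language: str) -> "Doc":
        return Doc(language)  # new instance; self untouched

text = Doc("en")
alias = text
updated = text.transform("es")  # old code that discarded the return value now breaks

assert alias.language == "en"   # alias no longer reflects the "update"
assert updated.language == "es"
```

This is why call sites that ignore the return value must be found and rebound during migration, not just left to fail quietly.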
Benefits¶
Immutability advantages:
- Thread Safety: Original TextObject unchanged during transformation
- Debugging: No hidden state mutations to track
- Testing: Easier to verify transformations without side effects
- Design Consistency: Aligns with NumberedText's immutability principle
- Functional Style: Enables composable transformations
Example - Composable Transformations:
```python
# Immutable style enables clear transformation chains
original = TextObject.from_text_file(Path("input.txt"))

translated = original.transform(
    data_str=translate(original.content),
    language="es"
)

sectioned = translated.transform(
    sections=compute_sections(translated.content),
    process_metadata=ProcessMetadata(step="sectioning")
)

# Original and intermediate results preserved
assert original.language == "en"
assert translated.language == "es"
assert len(sectioned.sections) > 0
```
Implementation Checklist¶
- Refactor `transform()` to create and return a new TextObject instance
- Copy metadata to avoid shared mutable state
- Add `validate_on_init=False` parameter to TextObject constructor (optimization)
- Update 7-8 call sites in:
    - `line_translator.py` (2 calls)
    - `ai_text_processing.py` (4 calls)
    - `sent_split.py` (2 calls)
- Update `transform()` docstring with immutability guarantees and example
- Add unit test verifying original instance unchanged after transform
- Run full test suite - confirm all tests passing
- Update AGENTLOG.md with immutability refactor notes
Design Principles Update¶
This refactor reinforces the existing design principle:
> Keep data models immutable when possible for safer concurrent code
Add TextObject as example:
```python
# ✅ Preferred: Immutable transformation
class TextObject:
    def transform(self, ...) -> "TextObject":
        """Return new instance with transformed values."""
        return TextObject(...)  # New instance, original unchanged


# 🚫 Avoid: In-place mutation
class TextObject:
    def transform(self, ...) -> Self:
        """Mutate self and return for chaining."""
        self.language = language  # Mutates original
        return self
```
Rationale¶
Why immutability matters for TextObject:
- Processing Pipelines: TextObject flows through multi-stage pipelines (translate → section → process). Immutability prevents upstream stages from being affected by downstream transformations.
- Provenance Tracking: Metadata tracks processing history. Immutability ensures each transformation creates a new snapshot, preserving the full lineage.
- Concurrent Processing: Future parallelization of section processing requires thread-safe domain models.
- Consistency with Dependencies: NumberedText is immutable; TextObject should follow the same principle.
Alternative Considered: Frozen Dataclass¶
**Approach:** Use `@dataclass(frozen=True)` to enforce immutability at runtime.

**Rejected:**

- Requires all fields to be frozen (including complex `Metadata` and `NumberedText`)
- `Metadata` is intentionally mutable (`MutableMapping`) for flexibility
- Runtime enforcement adds overhead for minimal benefit (convention over enforcement)
- Plain class with immutable methods is sufficient

**Decision:** Keep plain class with immutable methods by convention, documented in docstrings and enforced through code review.
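For contrast, a small stdlib-only sketch of the rejected option (`FrozenDoc` is hypothetical): `@dataclass(frozen=True)` does enforce immutability at runtime by raising `FrozenInstanceError` on field assignment, but the freeze is shallow, so the contents of a nested mutable field (such as a `MutableMapping`-based Metadata) remain mutable anyway.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Dict

@dataclass(frozen=True)
class FrozenDoc:
    """Hypothetical frozen variant of the domain model."""
    language: str
    metadata: Dict[str, str] = field(default_factory=dict)

doc = FrozenDoc(language="en")

raised = False
try:
    doc.language = "es"  # rebinding a field raises at runtime
except FrozenInstanceError:
    raised = True
assert raised

# But nested mutables stay mutable - freezing is shallow:
doc.metadata["step"] = "translate"
assert doc.metadata == {"step": "translate"}
```

Since the field that most needs protection (metadata) would not actually be protected, the runtime enforcement buys little over the documented convention.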
Refactor Success Criteria¶
This refactor succeeds if:
- ✅ `transform()` returns new TextObject instance
- ✅ Original TextObject unchanged after transformation
- ✅ All 7-8 call sites updated and tested
- ✅ Full test suite passes (9 TextObject tests + integration tests)
- ✅ No performance regression (metadata copy overhead acceptable)
- ✅ Design principle reinforced with TextObject example
Implementation Artifacts¶
- NumberedText Immutability: `src/tnh_scholar/text_processing/numbered_text.py:39-50` - "Immutable container... All operations return new data"
- Design Principles: Immutability by Default
- Call Sites:
    - `src/tnh_scholar/ai_text_processing/line_translator.py:286`
    - `src/tnh_scholar/ai_text_processing/ai_text_processing.py` (lines 297, 503, 537, 578)
    - `src/tnh_scholar/cli_tools/sent_split/sent_split.py:85`