# ADR-MD02: Metadata Infrastructure Object-Service Integration

Establishes the metadata system (`tnh_scholar.metadata`) as foundational cross-cutting infrastructure that supports the object-service architecture (ADR-OS01) while maintaining design-principle compliance.

- Status: Accepted
- Date: 2025-12-07
- Authors: Aaron Solomon, Claude Sonnet 4.5
- Related: ADR-MD01, ADR-OS01, ADR-PT04
## Context

### Discovery

During implementation of ADR-PT04 (Prompt System Refactor), we discovered that:

- Metadata already exists: `tnh_scholar.metadata` provides `Metadata`, `Frontmatter`, and `ProcessMetadata`
- Duplication was occurring: services were reimplementing YAML frontmatter parsing
- Broader pattern identified: ALL `.md` files in TNH Scholar (prompts, corpus, derivatives, docs) use metadata frontmatter
- Self-reflexive design: TNH Scholar operates on its own metadata-bearing artifacts
### Architectural Questions

- Is metadata a service? No; it is foundational infrastructure with no external dependencies
- Does it need ports/adapters? No; it consists of pure utility classes used across all layers
- How does it fit the object-service architecture? As a cross-cutting concern, available to all layers
- Should services reuse or reimplement? ALWAYS reuse; metadata is foundational
### Object-Service Compliance Assessment

Current implementation (`src/tnh_scholar/metadata/metadata.py`):

✅ Compliant aspects:

- Strong typing (`Metadata`, `ProcessMetadata` are typed classes)
- Pure functions (no side effects in `Frontmatter.extract()`)
- JSON serialization via `Metadata.to_dict()` (type-safe)
- Type processors for `Path` and `datetime` conversion

❌ Non-compliant aspects:

- No transport/domain separation
- Mixes file I/O with domain logic in `Frontmatter.extract_from_file()`
- Logging in utility code
- No mapper pattern for domain schema validation
## Decision

### 1. Metadata System Role

Metadata is FOUNDATIONAL INFRASTRUCTURE, not a service:

- Available everywhere: all layers (domain, service, adapter, mapper, transport) can import it
- No protocols/ports: pure utility classes with no abstraction needed
- Cross-cutting concern: supports the object-service architecture without being one
- Self-reflexive enabler: the system can reason about its own artifacts
### 2. Integration Patterns with Object-Service Layers

#### Pattern 1: Mappers Use Frontmatter for .md Files

Rule: Never reimplement YAML frontmatter parsing; always use `Frontmatter.extract()`.

```python
# ✅ CORRECT: mappers/prompt_mapper.py
from tnh_scholar.metadata import Frontmatter


class PromptMapper:
    def to_domain_prompt(self, file_content: str) -> Prompt:
        # Use shared infrastructure
        metadata_obj, body = Frontmatter.extract(file_content)
        # Validate against domain schema
        prompt_metadata = PromptMetadata.model_validate(metadata_obj.to_dict())
        return Prompt(metadata=prompt_metadata, template=body)


# ❌ WRONG: Don't reimplement
import re
import yaml

def _parse_frontmatter(content: str) -> dict:
    # Reinventing the wheel
    match = re.match(r'^---\n(.*?)\n---\n', content, re.DOTALL)
    return yaml.safe_load(match[1])  # Duplication!
```
Benefits:

- Consistent frontmatter handling across all `.md` files
- JSON-LD support (ADR-MD01) available when needed
- BOM handling and whitespace normalization already implemented
- Future enhancements benefit all consumers
#### Pattern 2: Domain Models Use Metadata for Flexible Fields

Rule: Replace `Dict[str, Any]` with `Metadata` for type-safe, JSON-serializable metadata storage.

```python
# ✅ CORRECT: domain/models.py
from pydantic import BaseModel

from tnh_scholar.metadata import Metadata


class DocumentResult(BaseModel):
    content: str
    metadata: Metadata  # JSON-serializable, dict-like


class TranslationResult(BaseModel):
    text: str
    source_metadata: Metadata
    output_metadata: Metadata
```
Benefits:

- Type safety (`Metadata` ensures JSON-serializable values)
- Auto-conversion (`Path` → str, `datetime` → ISO format)
- Dict-like interface (familiar `|` and `[]` operators)
- Explicit serialization (`to_dict()`, `to_yaml()`)
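The dict-like interface and auto-conversion above can be sketched with a minimal stand-in (illustrative only; this is not the actual `tnh_scholar.metadata.Metadata` implementation, and the `MetadataSketch` name is invented here):

```python
from datetime import datetime, timezone
from pathlib import Path


class MetadataSketch(dict):
    """Minimal illustration of a JSON-safe, dict-like metadata container."""

    def __setitem__(self, key, value):
        super().__setitem__(key, self._coerce(value))

    @staticmethod
    def _coerce(value):
        # Auto-convert common non-JSON types to JSON-serializable forms.
        if isinstance(value, Path):
            return str(value)
        if isinstance(value, datetime):
            return value.isoformat()
        return value

    def __or__(self, other):
        # Familiar dict-merge operator, preserving coercion.
        merged = MetadataSketch()
        for k, v in {**self, **other}.items():
            merged[k] = v
        return merged

    def to_dict(self) -> dict:
        return dict(self)


m = MetadataSketch()
m["source"] = Path("talk-01.md")                           # stored as "talk-01.md"
m["created"] = datetime(2025, 12, 7, tzinfo=timezone.utc)  # stored as ISO string
merged = m | {"language": "vi"}                            # dict-merge operator
```

The payoff is that `to_dict()` never surprises a JSON encoder: everything was coerced at assignment time.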
#### Pattern 3: Services Track Provenance with ProcessMetadata

Rule: Use `ProcessMetadata` to record transformation steps in multi-stage pipelines.

```python
# ✅ CORRECT: services/translation_service.py
from datetime import datetime

from tnh_scholar.metadata import Metadata, ProcessMetadata


class TranslationService:
    def translate(self, doc: Document) -> Document:
        result = self._translate_content(doc)
        # Track transformation
        result.metadata.add_process_info(
            ProcessMetadata(
                step="translation",
                processor="genai_service",
                tool="gpt-4o",
                source_lang=doc.metadata.get("language"),
                target_lang="en",
                timestamp=datetime.now(),  # Auto-converted to ISO
            )
        )
        return result
```
Benefits:

- Automatic provenance chain (stored in the `tnh_metadata_process` field)
- Supports semantic queries (JSON-LD compatible)
- Reproducibility (tracks exact tools/versions used)
- Self-reflexive operations (the system can analyze its own transformations)
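The provenance-chain idea can be sketched with plain dicts (the field name `tnh_metadata_process` comes from this ADR; the `add_process_info` helper below is an illustrative stand-in, not the real API):

```python
from datetime import datetime, timezone

# Field name taken from this ADR; the record shape is illustrative.
PROCESS_FIELD = "tnh_metadata_process"


def add_process_info(metadata: dict, record: dict) -> dict:
    """Append a provenance record, building an ordered transformation chain."""
    chain = list(metadata.get(PROCESS_FIELD, []))
    chain.append(record)
    return {**metadata, PROCESS_FIELD: chain}


meta = {"language": "vi"}
meta = add_process_info(meta, {
    "step": "transcription",
    "tool": "whisper-1",
    "timestamp": datetime(2025, 12, 7, tzinfo=timezone.utc).isoformat(),
})
meta = add_process_info(meta, {
    "step": "translation",
    "tool": "gpt-4o",
    "timestamp": datetime(2025, 12, 7, 1, tzinfo=timezone.utc).isoformat(),
})

# The chain preserves transformation order for reproducibility.
steps = [r["step"] for r in meta[PROCESS_FIELD]]
```

Because each step appends rather than overwrites, a later stage (or the system itself) can replay or audit the full pipeline from the metadata alone.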
#### Pattern 4: Mappers Separate Infrastructure from Domain Validation

Rule: Mappers use `Frontmatter` (infrastructure), then validate with domain schemas (Pydantic models).

```python
# ✅ CORRECT: Two-step process
from tnh_scholar.metadata import Frontmatter


class CorpusMapper:
    def to_domain_document(self, file_content: str) -> CorpusDocument:
        # Step 1: Infrastructure (frontmatter parsing)
        metadata_obj, body = Frontmatter.extract(file_content)
        # Step 2: Domain validation (Pydantic schema)
        corpus_metadata = CorpusMetadata.model_validate(metadata_obj.to_dict())
        return CorpusDocument(metadata=corpus_metadata, content=body)
```
Why separate?

- Infrastructure concern: YAML parsing, BOM handling, whitespace
- Domain concern: required fields, business rules, semantic validation
- Separation of concerns: the metadata module doesn't know about domain schemas
- Reusability: the same `Frontmatter` code works for prompts, corpus, and derivatives
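The two-step split can be shown self-containedly; this sketch uses a simplified `key: value` parser in place of full YAML and a plain function in place of a Pydantic schema (all names here are illustrative):

```python
import re


def extract_frontmatter(content: str) -> tuple[dict, str]:
    """Step 1 (infrastructure): split frontmatter from body.

    Simplified: parses only flat `key: value` lines, not full YAML.
    """
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", content, re.DOTALL)
    if not match:
        return {}, content
    fields = {}
    for line in match[1].splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, match[2]


def validate_corpus_metadata(raw: dict) -> dict:
    """Step 2 (domain): enforce business rules the parser knows nothing about."""
    missing = {"title", "language"} - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return raw


doc = "---\ntitle: Peace Is Every Step\nlanguage: en\n---\nBody text."
raw, body = extract_frontmatter(doc)   # infrastructure only
meta = validate_corpus_metadata(raw)   # domain rules applied separately
```

Swapping the validator (prompts vs. corpus vs. derivatives) never touches the parsing step, which is exactly the reusability claim above.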
### 3. Object-Service Architecture Modifications

#### Updated Layer Model with Metadata

```
┌────────────────────────────────────────────────────────────┐
│ Foundational Infrastructure (Cross-Cutting)                │
│ • tnh_scholar.metadata (Metadata, Frontmatter, Process)    │
│ • Available to ALL layers below                            │
└────────────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│ Application Layer                           │
│ • CLI, notebooks, web, Streamlit            │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│ Service Layer                               │
│ • Orchestrators (use ProcessMetadata)       │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│ Adapter + Mapper Layer                      │
│ • Mappers use Frontmatter.extract()         │
│ • Adapters use Metadata for flexible data   │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│ Transport Layer                             │
│ • Uses Metadata.to_dict() for JSON          │
└─────────────────────────────────────────────┘
```
#### Metadata as "Horizontal" Infrastructure

Unlike services (which flow vertically: Application → Service → Adapter → Transport), metadata is horizontal: available at every layer.
| Layer | Metadata Usage |
|---|---|
| Application | Display metadata in UIs, format for users |
| Service | Track provenance with `ProcessMetadata` |
| Adapter | Store flexible provider-specific data in `Metadata` |
| Mapper | Parse `.md` files with `Frontmatter.extract()` |
| Transport | Serialize with `Metadata.to_dict()` for JSON |
### 4. Compliance Improvements Needed

#### Issue 1: File I/O Mixed with Domain Logic

Current (`metadata/metadata.py:262-264`):

```python
class Frontmatter:
    @classmethod
    def extract_from_file(cls, file: Path) -> tuple[Metadata, str]:
        text_str = read_str_from_file(file)  # ❌ File I/O in domain utility
        return cls.extract(text_str)
```
Recommendation: Mark as an adapter-level helper, or move to a separate `FrontmatterFileAdapter`:

```python
# Option A: Keep but document as adapter-level
class Frontmatter:
    """Pure frontmatter parsing (no I/O)."""

    @staticmethod
    def extract(content: str) -> tuple[Metadata, str]:
        """Extract frontmatter from string (pure function)."""
        ...

    @classmethod
    def extract_from_file(cls, file: Path) -> tuple[Metadata, str]:
        """ADAPTER-LEVEL: Convenience for file-based workflows.

        Note: This method performs I/O. For pure parsing, use extract().
        Services should inject file content via the transport layer.
        """
        text_str = read_str_from_file(file)
        return cls.extract(text_str)


# Option B: Separate adapter (stricter compliance)
# adapters/frontmatter_file_adapter.py
class FrontmatterFileAdapter:
    """Adapter for reading frontmatter from files."""

    def __init__(self, transport: FileTransport):
        self._transport = transport

    def extract_from_file(self, file: Path) -> tuple[Metadata, str]:
        content = self._transport.read_text(file)
        return Frontmatter.extract(content)
```
Decision: Keep Option A for the rapid-prototype phase and document it as an adapter-level helper. Consider Option B post-1.0 if strict layer separation becomes critical.
#### Issue 2: Logging in Utility Code

Current (`metadata/metadata.py:22-36`):

```python
def safe_yaml_load(yaml_str: str, *, context: str = "unknown") -> dict:
    try:
        data = yaml.safe_load(yaml_str)
        if not isinstance(data, dict):
            logger.warning(...)  # ❌ Side effect in utility function
            return {}
        return data
    except ScannerError as e:
        logger.error(...)  # ❌ Side effect
        return {}
    except yaml.YAMLError as e:
        logger.error(...)  # ❌ Side effect
        return {}
```
Recommendation: Raise typed exceptions; let callers decide the logging strategy:

```python
# metadata/errors.py (new file)
class MetadataError(Exception):
    """Base error for metadata operations."""


class FrontmatterParseError(MetadataError):
    """YAML frontmatter parsing failed."""


class InvalidMetadataError(MetadataError):
    """Metadata is not a valid dict."""


# metadata/metadata.py
def safe_yaml_load(yaml_str: str, *, context: str = "unknown") -> dict:
    """Parse a YAML string to a dict.

    Raises:
        FrontmatterParseError: If YAML parsing fails
        InvalidMetadataError: If the result is not a dict
    """
    try:
        data = yaml.safe_load(yaml_str)
    except yaml.YAMLError as e:  # covers ScannerError, a YAMLError subclass
        raise FrontmatterParseError(
            f"YAML error in [{context}]: {e}"
        ) from e
    if not isinstance(data, dict):
        raise InvalidMetadataError(
            f"YAML in [{context}] is not a dict, got {type(data)}"
        )
    return data
```
Benefits:

- Pure functions (no side effects)
- Callers choose the logging strategy (service layer logs, transport retries, etc.)
- Typed errors enable better error handling
- Testable without mocking loggers
Decision: Implement for 0.2.0; track in TODO as "Metadata error handling improvements".
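A sketch of the caller side shows why pure utilities plus typed errors let each layer pick its own policy (the exception names come from the recommendation above; the service function and stand-in parser are illustrative):

```python
import logging


# Typed error hierarchy, as in the recommendation above.
class MetadataError(Exception):
    """Base error for metadata operations."""


class FrontmatterParseError(MetadataError):
    """YAML frontmatter parsing failed."""


logger = logging.getLogger("prompt_service")


def parse_frontmatter(content: str) -> dict:
    # Stand-in for the pure utility; always fails here for demonstration.
    raise FrontmatterParseError("bad YAML at line 3")


def load_prompt(content: str) -> dict:
    """Service-layer caller: decides the logging/fallback policy itself."""
    try:
        return parse_frontmatter(content)
    except FrontmatterParseError as e:
        # The utility stayed pure; logging happens here, where policy lives.
        logger.warning("Skipping malformed prompt: %s", e)
        return {}


result = load_prompt("---\n: bad\n---\n")
```

A transport-layer caller could instead catch the same typed error and retry or re-raise; the utility itself never has to know.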
### 5. Design Principles Summary

#### When to Use Metadata Infrastructure

| Scenario | Use Metadata? | Pattern |
|---|---|---|
| Parsing `.md` files with frontmatter | ✅ YES | `Frontmatter.extract()` in mappers |
| Flexible metadata storage | ✅ YES | `Metadata` instead of `Dict[str, Any]` |
| Tracking transformation provenance | ✅ YES | `ProcessMetadata` in services |
| Service-to-service data contracts | ❌ NO | Use Pydantic domain models |
| Provider-specific vendor data | ✅ YES | `Metadata` for flexible fields |
| Strict business-rule validation | ❌ NO | Pydantic schemas (validate after `Frontmatter.extract()`) |
#### Metadata vs. Pydantic Models

Use `Metadata` when:

- The schema is flexible (user-supplied, vendor-specific)
- You need dict-like operations (`|`, `[]`, iteration)
- JSON serialization is the primary concern
- You are tracking provenance with `ProcessMetadata`

Use Pydantic models when:

- The schema is well-defined (domain objects)
- You need strict validation (required fields, types)
- You want IDE autocomplete and type checking
- You are encoding business rules
Use both when (common pattern):

```python
from pydantic import BaseModel

from tnh_scholar.metadata import Metadata


class DocumentResult(BaseModel):
    """Domain model with strict validation."""

    id: str
    content: str
    language: str  # Strict field
    # Flexible metadata for extensions
    custom_metadata: Metadata = Metadata()
```
## Consequences

### Positive

- No duplication: services reuse frontmatter parsing instead of reimplementing it
- Consistent behavior: all `.md` files are parsed the same way (prompts, corpus, docs)
- Type safety: `Metadata` ensures JSON-serializable values (no serialization surprises)
- Future-ready: JSON-LD support (ADR-MD01) available when needed
- Provenance tracking: `ProcessMetadata` enables reproducible transformations
- Self-reflexive: the system can operate on its own metadata-bearing artifacts
- Object-service aligned: clear patterns for metadata usage in each layer
### Negative

- Mixed concerns (current): `Frontmatter.extract_from_file()` performs I/O and needs documentation
- Logging side effects (current): `safe_yaml_load()` logs instead of raising exceptions
- Learning curve: developers must understand when to use `Metadata` vs. Pydantic

### Risks

- Temptation to expand: `Metadata` should stay simple; avoid adding service-specific logic
- Over-use: don't use `Metadata` for everything; Pydantic models are better for strict schemas
## Implementation Plan

### Phase 1: Documentation (Immediate - 0.1.4)

- Document metadata role in ADR-OS01 (Section 3.3)
- Create ADR-MD02 (this document)
- Update ADR-PT04 addendum with as-built notes
- Add usage examples to metadata module docstrings

### Phase 2: API Cleanup (0.2.0)

- Document `Frontmatter.extract_from_file()` as an adapter-level helper
- Refactor `safe_yaml_load()` to raise typed exceptions (remove logging)
- Create `metadata/errors.py` with a `MetadataError` hierarchy
- Update tests to catch typed exceptions

### Phase 3: Broader Adoption (0.3.0+)

- Audit codebase for `Dict[str, Any]` usage; replace with `Metadata` where appropriate
- Add `ProcessMetadata` to translation/transcription pipelines
- Enable JSON-LD semantic queries in knowledge base (future)
## Open Questions

- Should `Metadata` support nested validation? Currently shallow; consider recursive validation for nested dicts/lists
- JSON-LD activation? When to fully enable schema.org vocabularies (deferred to knowledge base implementation)
- Metadata versioning? Should `Metadata` track schema versions for migrations?
## References

- ADR-MD01: JSON-LD Metadata Strategy
- ADR-OS01: Object-Service Architecture V3
- ADR-PT04: Prompt System Refactor
- `src/tnh_scholar/metadata/metadata.py`

Approval: Accepted 2025-12-07 (Aaron Solomon)