ADR-MD01: Adoption of JSON-LD for Metadata Management¶

Commits to JSON-LD metadata so provenance, multilingual transformations, and semantic relationships stay queryable.

Status: Proposed
Date: 2025-01-30

Context¶

TNH Scholar needs a robust metadata management system to track content through various processing stages, particularly for multilingual content processing and human-AI collaborative workflows. The system must support both embedded metadata in text files and associated metadata for binary files, while enabling future expansion to database storage.

Two primary approaches were considered:

Simple YAML frontmatter with basic key-value pairs
JSON-LD based metadata using semantic web standards

The initial inclination was toward YAML frontmatter for its simplicity and readability. However, deeper analysis revealed that JSON-LD's semantic capabilities align well with TNH Scholar's document processing and provenance tracking needs.

Decision Drivers¶

Need to track content transformations through multiple processing stages
Requirement to maintain clear provenance for AI-assisted translations
Future requirements for web-based interfaces showing processing history
Importance of standardized metadata for content management
Value of semantic relationships in understanding content connections
Long-term extensibility requirements

Decision¶

We will adopt JSON-LD as TNH Scholar's primary metadata format, implemented through a phased approach:

Phase 1: File-based Storage

Embedded JSON-LD in text files using frontmatter
Sidecar JSON-LD files for binary content
Basic metadata validation and processing through pyld

Phase 2: Enhanced Processing

Expanded semantic relationships
Improved validation
Training data extraction capabilities

Phase 3: Database Integration

Central metadata storage
Unified querying
Maintained backward compatibility with file-based storage
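
The Phase 1 sidecar idea can be sketched as follows. This is only an illustration under assumptions: the `.jsonld` suffix appended to the full file name, and the helper names `sidecar_path`, `write_sidecar`, and `read_sidecar`, are hypothetical and not fixed by this ADR.

```python
import json
from pathlib import Path
from typing import Any, Dict

SIDECAR_SUFFIX = ".jsonld"  # assumed naming convention, not specified by the ADR


def sidecar_path(binary_path: Path) -> Path:
    """Return the sidecar location, e.g. talk.mp3 -> talk.mp3.jsonld."""
    return binary_path.with_name(binary_path.name + SIDECAR_SUFFIX)


def write_sidecar(binary_path: Path, metadata: Dict[str, Any]) -> Path:
    """Write metadata next to the binary file and return the sidecar path."""
    target = sidecar_path(binary_path)
    target.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target


def read_sidecar(binary_path: Path) -> Dict[str, Any]:
    """Load sidecar metadata, or return an empty dict if none exists."""
    target = sidecar_path(binary_path)
    if not target.exists():
        return {}
    return json.loads(target.read_text(encoding="utf-8"))
```

Appending the suffix to the whole file name (rather than replacing the extension) keeps `talk.mp3` and `talk.wav` from sharing one sidecar.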

Technical Implementation¶

Initial implementation will center on a Frontmatter class handling JSON-LD:

from typing import Any, Dict, Tuple


class Frontmatter:
    """Handles JSON-LD frontmatter embedding and extraction."""

    # Vocabulary namespaces used in metadata contexts.
    SCHEMA_ORG = "https://schema.org/"
    DC_CONTEXT = "http://purl.org/dc/elements/1.1/"

    # Interface only: method bodies are not yet implemented.
    @staticmethod
    def extract(content: str) -> Tuple[Dict[str, Any], str]:
        """Extract JSON-LD frontmatter and content from text."""

    @staticmethod
    def embed(metadata: Dict[str, Any], content: str) -> str:
        """Embed metadata as JSON-LD frontmatter."""

    @staticmethod
    def validate_jsonld(metadata: Dict[str, Any]) -> bool:
        """Validate if the metadata is valid JSON-LD."""

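One way the extract/embed pair above might behave is sketched here with the standard library only. The `---` delimiter lines and the function name `looks_like_jsonld` are assumptions made for illustration; the ADR does not specify the frontmatter fence, and real JSON-LD validation would go through pyld rather than the cheap structural check used here.

```python
import json
from typing import Any, Dict, Tuple

DELIMITER = "---"  # assumed fence line; the ADR does not specify one


def embed(metadata: Dict[str, Any], content: str) -> str:
    """Prepend metadata as a JSON block between two delimiter lines."""
    block = json.dumps(metadata, indent=2, ensure_ascii=False)
    return f"{DELIMITER}\n{block}\n{DELIMITER}\n{content}"


def extract(text: str) -> Tuple[Dict[str, Any], str]:
    """Split text into (metadata, body); return ({}, text) if no frontmatter."""
    lines = text.split("\n")
    if lines[0].strip() != DELIMITER:
        return {}, text
    # Look for the closing delimiter after the opening one.
    for i in range(1, len(lines)):
        if lines[i].strip() == DELIMITER:
            metadata = json.loads("\n".join(lines[1:i]))
            body = "\n".join(lines[i + 1:])
            return metadata, body
    return {}, text  # unterminated block: treat everything as content


def looks_like_jsonld(metadata: Dict[str, Any]) -> bool:
    """Structural check only: a dict that declares an @context."""
    return isinstance(metadata, dict) and "@context" in metadata
```

Because `json.dumps` escapes newlines inside string values, no line of the serialized block can equal the delimiter, so the round trip `extract(embed(m, c))` returns the original pair.
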
Example Usage¶

Document processing workflow metadata:

{
  "@context": "https://schema.org/",
  "@type": "Translation",
  "@id": "translation_123_revised",
  "translationOf": {"@id": "transcript_123_raw"},
  "basedOn": {"@id": "translation_123_draft"},
  "sourceLanguage": "vi",
  "targetLanguage": "en",
  "processingStage": "human_revised",
  "revisor": "John Doe",
  "revisionDate": "2024-01-30"
}
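
To illustrate how the `@id` links in a record like this could support provenance queries, the sketch below walks a chain of references through an in-memory collection. The function name `provenance_chain` and the dict-of-records store are hypothetical; a real implementation would depend on the storage phase in use.

```python
from typing import Any, Dict, List


def provenance_chain(
    doc_id: str,
    records: Dict[str, Dict[str, Any]],
    link: str = "basedOn",
) -> List[str]:
    """Follow `link` references from doc_id back to the earliest known record."""
    chain = [doc_id]
    seen = {doc_id}
    current = records.get(doc_id, {})
    while link in current:
        next_id = current[link]["@id"]
        if next_id in seen:  # guard against reference cycles
            break
        chain.append(next_id)
        seen.add(next_id)
        current = records.get(next_id, {})
    return chain


records = {
    "translation_123_revised": {
        "@id": "translation_123_revised",
        "translationOf": {"@id": "transcript_123_raw"},
        "basedOn": {"@id": "translation_123_draft"},
    },
    "translation_123_draft": {
        "@id": "translation_123_draft",
        "translationOf": {"@id": "transcript_123_raw"},
    },
}

revision_history = provenance_chain("translation_123_revised", records)
```

Here `revision_history` lists the revised translation followed by the draft it was based on; passing `link="translationOf"` instead follows the link to the source transcript.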

Consequences¶

Positive¶

Rich semantic relationships between content
Standard vocabularies through schema.org and Dublin Core
Strong support for content provenance tracking
Better foundation for future web interfaces
Industry-standard metadata format
Enhanced machine readability for AI processing

Negative¶

Increased complexity compared to YAML
Steeper learning curve for contributors
Additional dependency on pyld library
More complex validation requirements

Neutral¶

Changes to existing metadata handling required
Need for migration strategy from current formats
Documentation requirements for JSON-LD usage

Alternative Approaches Considered¶

Simple YAML Frontmatter¶

type: translation
id: translation_123
source_id: transcript_123
language: vi
target_language: en

While simpler, this approach lacks semantic richness and standardization.
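
For comparison, a flat record of this kind could be lifted into the JSON-LD shape used in the Example Usage section. The field mapping below is an assumption inferred from the two examples in this ADR, and `flat_to_jsonld` is a hypothetical helper, not part of any planned API.

```python
from typing import Any, Dict

# Assumed renames, inferred from the YAML and JSON-LD examples in this ADR.
FIELD_MAP = {
    "language": "sourceLanguage",
    "target_language": "targetLanguage",
}


def flat_to_jsonld(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a flat key-value record into a JSON-LD style document."""
    doc: Dict[str, Any] = {"@context": "https://schema.org/"}
    if "type" in flat:
        doc["@type"] = str(flat["type"]).capitalize()
    if "id" in flat:
        doc["@id"] = flat["id"]
    if "source_id" in flat:
        # A plain string ID becomes an explicit node reference.
        doc["translationOf"] = {"@id": flat["source_id"]}
    for old_key, new_key in FIELD_MAP.items():
        if old_key in flat:
            doc[new_key] = flat[old_key]
    return doc


converted = flat_to_jsonld(
    {
        "type": "translation",
        "id": "translation_123",
        "source_id": "transcript_123",
        "language": "vi",
        "target_language": "en",
    }
)
```

The main structural difference is that `source_id`, an opaque string in the flat form, becomes a typed reference that tools can follow.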

Custom Metadata Format¶

Creating a custom format was rejected due to:

Reinventing existing solutions
Lack of standardization
Limited tool support

Implementation Strategy¶

1. Initial Phase
   - Implement basic JSON-LD frontmatter handling
   - Focus on core metadata fields
   - Simple validation
2. Enhancement Phase
   - Add semantic relationship support
   - Improve validation
   - Develop migration tools
3. Integration Phase
   - Database integration
   - Advanced querying
   - Web interface support

Notes¶

This decision prioritizes long-term system capabilities over short-term simplicity. The initial complexity investment is justified by:

Enhanced content relationship tracking
Better support for human-AI workflow management
Improved potential for web interface development
Standard compliance for potential interoperability

Related Documents¶

TNH Scholar System Design (system-design.md)
Pattern System Documentation (patterns.md)
Text Processing Documentation

The semantic capabilities of JSON-LD align particularly well with TNH Scholar's vision for cyclical learning and content processing improvements as outlined in the system architecture document.


ADR 002: Metadata Implementation Strategy¶

02-01-2025

Status¶

Accepted

Context¶

TNH Scholar needs a flexible metadata storage solution during its rapid prototyping phase. The system currently uses Dict[str, Any] throughout for metadata storage, but requires a more controlled yet still flexible approach that:

Maintains JSON serializability for AI pipeline integration
Preserves dict-like operations (especially the | operator for combining metadata)
Allows schema flexibility during prototyping
Provides clear extension points for future structure

Two main approaches were considered:

Type alias: Metadata = Dict[str, Any]
Custom class implementing MutableMapping

Decision Drivers¶

Need for JSON serializability in AI pipelines
Heavy use of dict union operations (|) in existing code
Requirement for maximum flexibility during prototyping
Future extensibility requirements
Minimal overhead during development

Decision¶

Implement a custom Metadata class using MutableMapping that provides dict-like behavior while ensuring JSON serializability:

from collections.abc import MutableMapping
from typing import Any, Dict, Optional, Union, Iterator, Mapping

JsonValue = Union[str, int, float, bool, list, dict, None]

class Metadata(MutableMapping):
    """
    Flexible metadata container that behaves exactly like a dict while ensuring
    JSON serializability. Designed for AI processing pipelines where schema
    flexibility is prioritized over structure.
    """
    def __init__(self, data: Optional[Union[Dict[str, JsonValue], 'Metadata']] = None) -> None:
        self._data: Dict[str, JsonValue] = {}
        if data is not None:
            self.update(data._data if isinstance(data, Metadata) else data)

    # [Core implementation as shown above...]

This implementation was chosen over a simple type alias because it provides:

JSON serializability guarantees
Full dict-like behavior including all operators
Clear extension points for future enhancements
Type safety for JSON values
Explicit serialization methods
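
A minimal sketch of what such a class might look like is given here. It is not the project's actual core implementation, which is elided above; the method set and `to_json` name are assumptions. One detail worth noting is that `MutableMapping` supplies `update`, `get`, `items`, and similar mixin methods but not the `|` operator, so the union operators have to be written explicitly.

```python
import json
from collections.abc import Mapping, MutableMapping
from typing import Any, Dict, Iterator, Optional


class Metadata(MutableMapping):
    """Illustrative dict-like container; a sketch, not the real class."""

    def __init__(self, data: Optional[Mapping] = None) -> None:
        self._data: Dict[str, Any] = {}
        if data is not None:
            self.update(data)

    # The five abstract methods MutableMapping requires.
    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    # Union operators are not provided by MutableMapping.
    def __or__(self, other: Any) -> "Metadata":
        if not isinstance(other, Mapping):
            return NotImplemented
        merged = Metadata(self)
        merged.update(other)  # right-hand side wins on key conflicts
        return merged

    def __ror__(self, other: Any) -> "Metadata":
        if not isinstance(other, Mapping):
            return NotImplemented
        merged = Metadata(other)
        merged.update(self)
        return merged

    def __ior__(self, other: Mapping) -> "Metadata":
        self.update(other)
        return self

    def __repr__(self) -> str:
        return f"Metadata({self._data!r})"

    def to_json(self) -> str:
        """Explicit serialization method (name assumed)."""
        return json.dumps(self._data)
```

With this in place, `Metadata({"lang": "vi"}) | {"stage": "draft"}` yields a new `Metadata` and leaves the left operand unchanged, and a plain dict on the left-hand side falls through to `__ror__`.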

Consequences¶

Positive¶

Ensures metadata remains JSON-serializable
Maintains all dict operations including | operator
Makes metadata objects self-identifying
Provides clear path for adding validation/structure later
Explicit serialization methods improve code clarity

Negative¶

Slightly more complex than simple type alias
Need to implement and maintain custom class
Must ensure all dict operations are properly supported
Minor performance overhead compared to raw dict

Neutral¶

Changes required to existing Dict[str, Any] usage
Need to document class behavior and limitations
May need to add features as dict usage patterns emerge

Alternative Approaches Considered¶

Type Alias¶

Metadata = Dict[str, Any]

Rejected because it provides no guarantees about JSON serializability and no extension points for future enhancements.

Pydantic Model¶

Rejected as too structured for current prototyping needs.

attrs/dataclasses¶

Python's built-in dataclass or attrs library
Offers strong typing and validation
Rejected as requiring too rigid structure for current needs

jsonschema/JSON Schema¶

Provides flexible schema validation
Rejected as overkill for current metadata needs
Could be considered for future validation requirements

Existing Metadata Libraries¶

python-metadata: Dedicated metadata handling
metadatastore: Scientific metadata management
dublin-core-metadata: Dublin Core implementation
All rejected as adding unnecessary complexity during prototyping

Related Project Approaches¶

Examined metadata handling in:

Documentation tools (Sphinx, MkDocs, Pelican)
Git metadata systems
Python package metadata

These approaches were found to be either too specialized or too complex for current needs.

Implementation Strategy¶

1. Initial Implementation
   - Create Metadata class with full dict behavior
   - Ensure JSON value type constraints
   - Add basic serialization methods
2. Migration
   - Replace Dict[str, Any] usage with Metadata class
   - Update existing metadata handling code
   - Document any behavioral differences
3. Future Considerations
   - Potential addition of schema validation
   - Integration with Dublin Core standards
   - Enhanced metadata merging strategies

Notes¶

This decision prioritizes flexibility and simplicity during prototyping while ensuring basic guarantees about metadata structure and behavior. The implementation can evolve toward more structured approaches as requirements solidify.

The design specifically supports AI pipeline integration by maintaining JSON compatibility while providing full dict-like operations for easy metadata manipulation.

Related Documents¶

ADR 001: Adoption of JSON-LD for Metadata Management
TNH Scholar System Design (system-design.md)
Pattern System Documentation (patterns.md)


ADR 003: Metadata Validation and Serialization Strategy¶

02-23-2024

Status¶

Proposed - Prototyping Phase

Context¶

The TNH Scholar system needs flexible metadata handling that balances immediate prototyping needs with potential future requirements. The current implementation uses JsonValue typing, which provides basic type safety but has several limitations that need to be documented.

Key Issues:

1. Recursive Validation
   - Current JsonValue validation is shallow
   - Nested dictionaries may contain non-serializable objects
   - List contents are not validated
2. Object Serialization
   - Some objects may have valid serialization methods
   - Current approach limited to primitive JSON types
   - No standard interface for serializable objects
3. Type Processing
   - Current processing happens at initialization
   - No validation on subsequent updates
   - Limited to predefined type processors

Current Implementation (Prototype Phase)¶

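from collections.abc import MutableMapping
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional, Union

JsonValue = Union[str, int, float, bool, list, dict, None]

class Metadata(MutableMapping):
    # Converters applied to values of known non-JSON types.
    _type_processors = {
        Path: lambda p: str(p.resolve()),
        datetime: lambda d: d.isoformat(),
    }

    def __init__(self, data: Optional[Union[Dict[str, Any], 'Metadata']] = None):
        self._data: Dict[str, JsonValue] = {}
        if data is not None:
            raw_data = data._data if isinstance(data, Metadata) else data
            # Values are processed once, here; _process_value and the
            # mapping methods are not shown in this excerpt.
            processed_data = {
                k: self._process_value(v) for k, v in raw_data.items()
            }
            self.update(processed_data)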
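
Since `_process_value` is not shown, the following is one plausible reading of how the type processors could be applied, written as a standalone function for clarity. The names `TYPE_PROCESSORS` and `process_value` are hypothetical. The sketch also makes the shallow-validation limitation concrete: only the top-level value is inspected.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, Dict, Type

# Same converters as the prototype's _type_processors table.
TYPE_PROCESSORS: Dict[Type, Callable[[Any], Any]] = {
    Path: lambda p: str(p.resolve()),
    datetime: lambda d: d.isoformat(),
}


def process_value(value: Any) -> Any:
    """Convert a known non-JSON type to a JSON-friendly value (top level only)."""
    for type_, processor in TYPE_PROCESSORS.items():
        if isinstance(value, type_):
            return processor(value)
    return value  # anything else passes through unchecked


stamp = process_value(datetime(2025, 2, 1, 12, 0))
nested = process_value({"source": Path("talk.mp3")})
```

`stamp` becomes an ISO 8601 string, while the `Path` inside `nested` is left untouched because the function does not recurse into containers.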

Future Considerations¶

Deep Validation

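from typing import Any

def validate_json_value(value: Any, path: str = "") -> bool:
    """Return True if value consists only of JSON types, checked recursively.

    `path` tracks the current location for future error reporting;
    it is threaded through the recursion but not yet reported.
    """
    if isinstance(value, dict):
        return all(
            validate_json_value(v, f"{path}.{k}")
            for k, v in value.items()
        )
    if isinstance(value, list):
        return all(
            validate_json_value(v, f"{path}[{i}]")
            for i, v in enumerate(value)
        )
    return isinstance(value, (str, int, float, bool, type(None)))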
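
A variant that reports where validation fails, rather than returning a single boolean, might look like the sketch below. The function name `find_invalid_paths` and the `$` root marker are hypothetical. It is also stricter than the version above in two assumed ways: it flags non-string dictionary keys and non-finite floats (NaN, infinity), which strict JSON does not allow.

```python
import math
from typing import Any, List


def find_invalid_paths(value: Any, path: str = "$") -> List[str]:
    """Return the paths of all values that are not JSON-serializable."""
    if isinstance(value, dict):
        problems: List[str] = []
        for key, item in value.items():
            if not isinstance(key, str):
                problems.append(f"{path}.{key!r} (non-string key)")
            else:
                problems.extend(find_invalid_paths(item, f"{path}.{key}"))
        return problems
    if isinstance(value, list):
        problems = []
        for index, item in enumerate(value):
            problems.extend(find_invalid_paths(item, f"{path}[{index}]"))
        return problems
    if isinstance(value, float) and not math.isfinite(value):
        return [path]  # NaN and infinity are not valid strict JSON
    if isinstance(value, (str, int, float, bool, type(None))):
        return []
    return [path]  # any other type is treated as non-serializable


issues = find_invalid_paths({"a": [1, {"b": {1, 2}}], "c": "ok"})
```

Here `issues` contains the single path `$.a[1].b`, pointing at the nested set. An empty list means the value passed.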

Serializable Interface

from typing import Protocol, runtime_checkable

@runtime_checkable  # required for the isinstance() check below
class Serializable(Protocol):
    def to_dict(self) -> Dict[str, JsonValue]: ...

class Metadata(MutableMapping):
    def _process_value(self, value: Any) -> JsonValue:
        if isinstance(value, Serializable):
            return value.to_dict()
        # existing processing...

Update Validation

def __setitem__(self, key: str, value: Any) -> None:
    self._data[key] = self._process_value(value)
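
The effect of routing `__setitem__` through value processing can be seen in the small sketch below. `CheckedMetadata` is a hypothetical stand-in with a single datetime converter, not the project's class. Because the `MutableMapping` mixin methods `update` and `setdefault` are implemented in terms of `__setitem__`, overriding that one method covers them as well.

```python
from collections.abc import MutableMapping
from datetime import datetime
from typing import Any, Dict, Iterator


class CheckedMetadata(MutableMapping):
    """Hypothetical variant that processes every value on assignment."""

    def __init__(self, **initial: Any) -> None:
        self._data: Dict[str, Any] = {}
        self.update(initial)  # routed through __setitem__

    def _process_value(self, value: Any) -> Any:
        # Single illustrative converter; the real table may be larger.
        if isinstance(value, datetime):
            return value.isoformat()
        return value

    def __setitem__(self, key: str, value: Any) -> None:
        self._data[key] = self._process_value(value)

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)


meta = CheckedMetadata(created=datetime(2024, 1, 30))
meta["revised"] = datetime(2025, 1, 30)
meta.update({"published": datetime(2025, 2, 1)})
```

All three values end up stored as ISO 8601 strings, whether they arrived through the constructor, direct assignment, or `update`.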

Decision¶

For the prototyping phase:

Keep current shallow validation
Document known limitations
Use type processors for common cases
Accept some type safety compromises for flexibility

Consequences¶

Positive:

Simple, workable implementation for prototyping
Clear path for future enhancement
Basic type safety for common cases
Flexible enough for rapid development

Negative:

Incomplete validation
Potential for invalid nested data
No standardized object serialization
Some type safety compromises

Future Directions¶

1. Validation Options:
   - Full recursive validation
   - Schema-based validation
   - Custom validation rules
2. Serialization Enhancement:
   - Standard serialization protocol
   - Custom serializers registry
   - Validation hooks
3. Type Processing:
   - Extended type processor registry
   - Custom processor registration
   - Update validation

Notes¶

This design purposefully favors flexibility and simplicity during prototyping while documenting paths for future enhancement. The current implementation acknowledges and accepts certain limitations in favor of development velocity.
