# ADR-PV01: Provenance & Tracing Infrastructure Strategy
Establishes provenance and tracing as foundational cross-cutting infrastructure that provides unified patterns for tracking data lineage, request tracing, and operation provenance across all TNH Scholar layers.
- Filename: `adr-pv01-provenance-tracing-strat.md`
- Heading: `# ADR-PV01: Provenance & Tracing Infrastructure Strategy`
- Status: Proposed
- Date: 2025-12-19
- Authors: Aaron Solomon, Claude Sonnet 4.5
- Related ADRs:
  - ADR-MD02: Metadata Infrastructure Object-Service Integration
  - ADR-OS01: Object-Service Architecture V3
## ADR Editing Policy
IMPORTANT: How you edit this ADR depends on its status.
- `proposed` status: ADR is in the design loop. We may rewrite or edit the document as needed to refine the design.
- `accepted`, `wip` status: Coding has begun. NEVER edit the original Context/Decision/Consequences sections. Only append addendums.
Rationale: Once implementation begins, the original decision must be preserved for historical context.
## Context

### Discovery: Fragmented Provenance Concepts
During Sourcery code review of PR #21 (tnh-gen CLI), we identified a correlation ID inconsistency: two separate IDs were being generated for a single CLI invocation—one for CLI output and another for provenance headers in output files. This revealed a deeper architectural gap.
Current state across TNH Scholar:
- GenAI Service: `Provenance` class (provider, model, timestamps, fingerprint)
- CLI Layer: `correlation_id` (request tracing)
- Metadata Infrastructure: `ProcessMetadata` (document processing history)
- Object-Service §12: Mentions "correlation IDs" in provenance.params but no standard model
Problem: We're using different terminologies and models for essentially the same concept—tracking "where things came from"—across different operational scopes. There's no unified guidance on:
- Standard terminology (provenance vs correlation_id vs trace_id vs fingerprint)
- When to use which tracking mechanism
- How to propagate tracking data through call chains
- How to nest/aggregate provenance across multi-stage operations
- Storage and serialization patterns
### Architectural Questions
- Is provenance a service? No—it's foundational infrastructure (like Metadata)
- Does it need ports/adapters? No—pure data models used across all layers
- How does it fit object-service architecture? Cross-cutting concern, available everywhere
- Should we standardize terminology? Yes—prevents confusion and implementation errors
- Do we need layered models? Yes—different scopes need different granularity
### Relationship to Existing Infrastructure
Metadata (ADR-MD02) provides:
- Frontmatter parsing for .md files
- ProcessMetadata for document transformations
- JSON-LD serialization
Provenance/Tracing extends this with:

- Request-level tracing (CLI invocations, API requests)
- Service-level provenance (AI generations, processing operations)
- Transport-level tracking (HTTP request IDs, retry counts)
- Aggregation patterns for multi-stage workflows
## Decision

### 1. Provenance System Role
Provenance/Tracing is FOUNDATIONAL CROSS-CUTTING INFRASTRUCTURE, similar to Metadata:
- Available everywhere: All layers (domain, service, adapter, mapper, transport) can import
- No protocols/ports: Pure data models with no abstraction needed
- Cross-cutting concern: Supports object-service architecture without being a service itself
- Reproducibility enabler: System can recreate results and trace operations
### 2. Unified Terminology
Establish standard terms to replace fragmented naming:
| Term | Scope | Purpose | Example |
|---|---|---|---|
| Trace ID | Request/Invocation | Track single operation end-to-end | CLI command, API request, batch job |
| Correlation ID | Alias for Trace ID | (Legacy term, prefer "trace_id") | Same as trace_id |
| Provenance | Service Operation | Record how result was generated | AI generation, document processing |
| Fingerprint | Content Identity | Content-based hash for reproducibility | Prompt hash, input file hash |
| Lineage | Data Flow | Chain of transformations | Source → derivative artifacts |
| Process Metadata | Document Transformation | Metadata about processing operations | (Existing in metadata infrastructure) |
Migration Path:
- New code: use trace_id
- Existing correlation_id: acceptable as alias, gradually refactor
- Always use provenance for service-level result tracking
### 3. Layered Provenance Model

Define standard data models for different operational scopes:

#### Layer 1: Request Provenance (CLI/API Layer)
Purpose: Track individual operations through the system
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class RequestProvenance:
    """Provenance for CLI commands, API requests, batch jobs."""

    trace_id: str                               # Unique identifier for this operation
    operation: str                              # e.g., "tnh-gen run", "api.generate", "batch.process"
    started_at: datetime
    finished_at: datetime | None = None
    user_context: dict[str, Any] | None = None  # User, environment, etc.
    service_provenance: list["ServiceProvenance"] = field(default_factory=list)
```
Usage:
- Generate trace_id once at operation entry point
- Propagate through all downstream calls
- Include in all outputs (stdout JSON, file headers, logs)
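A minimal sketch of the entry-point pattern under these rules. The uuid-based ID format and the `run_pipeline` helper are illustrative assumptions, not existing APIs:

```python
import uuid
from datetime import datetime, timezone

from tnh_scholar.provenance import RequestProvenance  # proposed module (§7)

def cli_entry(prompt: str) -> None:
    prov = RequestProvenance(
        trace_id=uuid.uuid4().hex,               # generated exactly once, at the entry point
        operation="tnh-gen run summarize-talk",
        started_at=datetime.now(timezone.utc),
    )
    run_pipeline(prompt, provenance=prov)        # the same object flows downstream
    prov.finished_at = datetime.now(timezone.utc)
```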
#### Layer 2: Service Provenance (GenAI/Processing Layer)
Purpose: Record how AI/processing results were generated
```python
@dataclass
class ServiceProvenance:
    """Provenance for AI generations, document processing, transformations."""

    # Required fields precede defaulted fields (dataclass ordering rule)
    provider: str                             # "openai", "anthropic", "pyannote", "docprocessor"
    started_at: datetime
    finished_at: datetime
    model: str | None = None                  # Model name if applicable
    fingerprint: "Fingerprint | None" = None  # Content hash for reproducibility (defined below)
    attempt_count: int = 1
    parameters: dict[str, Any] = field(default_factory=dict)  # Effective params
    policy_version: str | None = None
    transport_provenance: "TransportProvenance | None" = None
```
Usage:

- Created by services (GenAIService, DocumentProcessor, etc.)
- Embedded in result envelopes
- Nested within RequestProvenance for full trace

#### Layer 3: Transport Provenance (HTTP/SDK Layer)
Purpose: Track external system interactions
```python
@dataclass
class TransportProvenance:
    """Provenance for HTTP requests, SDK calls, external APIs."""

    request_id: str | None = None  # Backend-provided request ID
    retry_count: int = 0
    backend_metadata: dict[str, Any] = field(default_factory=dict)
    sdk_version: str | None = None
```
Usage:

- Captured by adapters when calling external systems
- Nested within ServiceProvenance
- Aids debugging and cost tracking

#### Fingerprint Model (Content Identity)
```python
@dataclass
class Fingerprint:
    """Content-based identity for reproducibility."""

    content_hash: str                  # Hash of primary input content
    variables_hash: str | None = None  # Hash of template variables
    algorithm: str = "sha256"          # Hash algorithm used
```
Usage:

- Generate for all deterministic inputs (prompts, templates, source files)
- Enables cache lookups and reproducibility verification
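A sketch of fingerprint generation under this model. The `fingerprint_prompt` helper is hypothetical, and canonical-JSON hashing of variables is one possible convention:

```python
import hashlib
import json

from tnh_scholar.provenance import Fingerprint  # proposed module (§7)

def fingerprint_prompt(prompt: str, variables: dict | None = None) -> Fingerprint:
    content_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    variables_hash = None
    if variables:
        # Canonical JSON so the same variables always hash identically
        canonical = json.dumps(variables, sort_keys=True).encode("utf-8")
        variables_hash = hashlib.sha256(canonical).hexdigest()
    return Fingerprint(content_hash=content_hash, variables_hash=variables_hash)
```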
### 4. Propagation Rules
Rule 1: Single Trace ID Per Operation

- Generate trace_id at the entry point (CLI command, API handler)
- Pass it through all function calls via context or an explicit parameter
- Never regenerate it within the same operation

Rule 2: Nesting for Multi-Stage Operations

- RequestProvenance contains a list of ServiceProvenance
- ServiceProvenance optionally contains TransportProvenance
- Preserve the full chain for debugging and audit

Rule 3: Aggregation for Pipelines

- Multi-stage pipelines maintain a list of ServiceProvenance (one per stage; see the sketch after the diagram below)
- Each stage records its own provenance
- Top-level RequestProvenance aggregates all stages
Rule 4: Propagation Through Object-Service Layers

```text
CLI/Application Layer:
└─ Creates RequestProvenance with trace_id
   │
Service Layer:
└─ Creates ServiceProvenance (embedded in result Envelope)
   │
Adapter Layer:
└─ Propagates trace_id to transport, captures TransportProvenance
   │
Transport Layer:
└─ Includes trace_id in HTTP headers (X-Trace-ID)
```
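A minimal sketch of Rules 2 and 3 using the models above. The two-stage pipeline (transcription, then summarization) and its IDs are illustrative:

```python
from datetime import datetime, timezone

from tnh_scholar.provenance import (  # proposed module (§7)
    RequestProvenance, ServiceProvenance, TransportProvenance,
)

now = datetime.now(timezone.utc)
request = RequestProvenance(
    trace_id="abc123def456",
    operation="tnh-gen run summarize-talk",
    started_at=now,
)

# Stage 1: transcription. The transport layer nests under the service that called it.
request.service_provenance.append(
    ServiceProvenance(
        provider="pyannote", started_at=now, finished_at=now,
        transport_provenance=TransportProvenance(request_id="req_001"),
    )
)
# Stage 2: summarization. Each stage records its own provenance.
request.service_provenance.append(
    ServiceProvenance(provider="openai", model="gpt-4", started_at=now, finished_at=now)
)
```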
### 5. Storage & Serialization Patterns

#### File Headers (Human-Readable)

Use YAML frontmatter with `---` delimiters (consistent with TNH Scholar metadata patterns):
```markdown
---
# Provenance metadata (generated output)
trace_id: abc123def456
operation: tnh-gen run summarize-talk
provider: openai
model: gpt-4-turbo-2024-04-09
fingerprint: sha256:8f3e4d...
generated: 2025-12-19T10:30:45Z
---

[Generated content follows...]
```
Rationale: Maintains consistency with existing TNH Scholar patterns:

- Metadata infrastructure (ADR-MD01/MD02) uses YAML frontmatter
- All `.md` files in the corpus use `---` delimiters
- Enables reuse of `Frontmatter.extract()` utilities
- Machine-parseable and human-readable
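A sketch of what a header serializer could look like; `render_header` is a hypothetical helper, and the field selection (e.g., using `finished_at` as the `generated` timestamp) is illustrative. Real code could reuse `Frontmatter.extract()` to read the same block back:

```python
from tnh_scholar.provenance import RequestProvenance, ServiceProvenance  # proposed module (§7)

def render_header(prov: RequestProvenance, svc: ServiceProvenance) -> str:
    """Render the YAML frontmatter block shown above for a generated file."""
    lines = [
        "---",
        "# Provenance metadata (generated output)",
        f"trace_id: {prov.trace_id}",
        f"operation: {prov.operation}",
        f"provider: {svc.provider}",
        f"model: {svc.model}",
    ]
    if svc.fingerprint:
        lines.append(f"fingerprint: {svc.fingerprint.algorithm}:{svc.fingerprint.content_hash}")
    lines += [f"generated: {svc.finished_at.isoformat()}", "---", ""]
    return "\n".join(lines)
```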
#### JSON Output (Machine-Readable)
```json
{
  "status": "succeeded",
  "result": {...},
  "provenance": {
    "trace_id": "abc123def456",
    "operation": "tnh-gen run summarize-talk",
    "started_at": "2025-12-19T10:30:42Z",
    "finished_at": "2025-12-19T10:30:45Z",
    "service_provenance": [
      {
        "provider": "openai",
        "model": "gpt-4-turbo-2024-04-09",
        "fingerprint": {
          "content_hash": "8f3e4d...",
          "algorithm": "sha256"
        },
        "parameters": {...},
        "attempt_count": 1
      }
    ]
  }
}
```
#### Database Schema (Future)

```sql
CREATE TABLE request_provenance (
    trace_id TEXT PRIMARY KEY,
    operation TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    finished_at TIMESTAMP,
    user_context JSONB
);

CREATE TABLE service_provenance (
    id SERIAL PRIMARY KEY,
    trace_id TEXT REFERENCES request_provenance(trace_id),
    provider TEXT NOT NULL,
    model TEXT,
    fingerprint_hash TEXT,
    started_at TIMESTAMP NOT NULL,
    finished_at TIMESTAMP NOT NULL,
    parameters JSONB,
    policy_version TEXT
);
```
### 6. Integration with Existing Infrastructure

#### Relationship to Metadata (ADR-MD02)
Metadata provides:
- Frontmatter parsing (Frontmatter.extract())
- ProcessMetadata for document transformations
- JSON-LD serialization
Provenance extends:

- Request tracing (trace_id)
- Service result provenance
- Cross-operation aggregation
Integration Pattern:
```python
from datetime import datetime, timezone

from tnh_scholar.metadata import ProcessMetadata
from tnh_scholar.provenance import Fingerprint, RequestProvenance, ServiceProvenance

now = datetime.now(timezone.utc)

# ProcessMetadata records transformation details
process_meta = ProcessMetadata(
    operation="summarize",
    version="1.0",
    parameters={"max_length": 500},
)

# ServiceProvenance records how the AI generated the result
service_prov = ServiceProvenance(
    provider="openai",
    model="gpt-4",
    started_at=now,
    finished_at=now,
    parameters=process_meta.to_dict(),  # Link to ProcessMetadata
    fingerprint=Fingerprint(content_hash="..."),
)

# RequestProvenance tracks the full CLI operation
request_prov = RequestProvenance(
    trace_id="abc123",
    operation="tnh-gen run summarize",
    started_at=now,
    service_provenance=[service_prov],
)
```
#### Relationship to Object-Service Provenance (ADR-OS01 §12)
ADR-OS01 §12.2 states:
> Always record in `provenance.params`:
>
> - Upstream IDs (job_id/request_id) and correlation IDs
ADR-PV01 standardizes this:
- trace_id = standard correlation ID
- ServiceProvenance.parameters = effective params
- TransportProvenance.request_id = upstream IDs
Migration: The existing `Provenance` class in the GenAI service becomes `ServiceProvenance` (backward compatible via an alias), as sketched below.
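A minimal sketch of that alias, assuming the new module layout from §7:

```python
# In the GenAI service: keep the old name importable while new code uses
# ServiceProvenance directly.
from tnh_scholar.provenance import ServiceProvenance

Provenance = ServiceProvenance  # legacy alias; remove once migration completes
```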
### 7. Implementation Location
New module: `src/tnh_scholar/provenance/`

```text
src/tnh_scholar/provenance/
├── __init__.py       # Export public models
├── models.py         # RequestProvenance, ServiceProvenance, etc.
├── fingerprint.py    # Fingerprint generation utilities
└── serialization.py  # JSON, file header, JSON-LD serializers
```
Available to all layers (like metadata):

- Domain: Include in result models
- Service: Create ServiceProvenance
- Adapter: Capture TransportProvenance
- Transport: Propagate trace_id in HTTP headers
- CLI: Generate RequestProvenance
## Consequences

### Positive
- Unified Terminology: Eliminates confusion between correlation_id, trace_id, provenance
- Consistent Tracking: Standard patterns across CLI, service, transport layers
- Debuggability: A single trace_id links logs, errors, and outputs for the same operation
- Reproducibility: Fingerprints + provenance enable result recreation
- Audit Trail: Full lineage from request to result to output file
- Error Prevention: Prevents bugs like duplicate correlation_id generation
- Cross-Layer Visibility: Single trace_id flows through entire operation
- Standards Alignment: Uses industry-standard terms (trace_id from OpenTelemetry/W3C)
### Negative

- Migration Effort: Existing code uses `correlation_id` (low risk: alias acceptable)
- Storage Overhead: Additional metadata in outputs and databases
- Complexity: Developers must understand layered provenance model
- Learning Curve: New team members need to learn provenance patterns
### Trade-offs
- Verbosity vs Clarity: More structured models increase boilerplate but improve clarity
- Performance vs Traceability: Provenance tracking adds minimal overhead for significant debugging value
- Flexibility vs Standards: Opinionated models constrain implementation but ensure consistency
## Alternatives Considered

### Alternative 1: Keep Current Fragmented Approach

Rejected because:

- Continues allowing bugs like duplicate correlation_id generation
- No guidance for developers on when to use which tracking mechanism
- Difficult to trace operations across layers
### Alternative 2: Single Flat Provenance Model

Rejected because:

- Different layers need different granularity
- Doesn't support nesting for multi-stage operations
- Mixes concerns (request tracing vs service provenance)
### Alternative 3: Use OpenTelemetry Directly

Considered but deferred because:

- OpenTelemetry is complex infrastructure (spans, traces, instrumentation)
- Our needs are simpler (data models, not distributed tracing infrastructure)
- Can adopt OpenTelemetry later and map our models to their schema
- Future path: Provenance models are compatible with OpenTelemetry semantic conventions
## Open Questions
Note: These questions will be addressed in supporting decimal ADRs (see Future Extensions below).
- JSON-LD Integration: Should provenance use JSON-LD schema.org vocabulary? (Metadata does)
  - Direction: ADR-PV01.1 will define Schema.org mappings and integration patterns
- Fingerprint Algorithm: Always SHA-256 or configurable per use case?
  - Direction: ADR-PV01.5 will specify a pluggable algorithm interface with SHA-256 as default
- OpenTelemetry Compatibility: How to map internal models to OTEL semantic conventions?
  - Direction: ADR-PV01.3 will provide a compatibility mapping for future instrumentation
- Logging Integration: How to inject trace_id into log messages and structured logging?
  - Direction: ADR-PV01.4 will define context injection and formatter patterns
- Testing Strategy: How to test provenance propagation in integration tests?
  - Direction: ADR-PV01.2 will establish testing patterns and CI integration
- Provenance Compression: For large pipelines, how to avoid verbose provenance chains?
  - Future consideration: May need summarization or sampling strategies
- Database Schema: When to implement the full provenance database?
  - Future consideration: Deferred until usage patterns emerge
- Backward Compatibility: Timeline for migrating existing `correlation_id` to `trace_id`?
  - Decision: Gradual migration; `correlation_id` acceptable as an alias during the transition
## Implementation Plan

### Phase 1: Foundation (Immediate)

- ✅ Create `src/tnh_scholar/provenance/` module
- ✅ Define `RequestProvenance`, `ServiceProvenance`, `TransportProvenance` models
- ✅ Define `Fingerprint` model
- ✅ Add serialization utilities (JSON, file headers)
- ✅ Update documentation with terminology standards
### Phase 2: CLI Integration (Next PR)

- Refactor tnh-gen CLI to use `RequestProvenance` with `trace_id`
- Update file header generation to use standard format
- Add trace_id to all JSON outputs
- Update error handling to include trace_id
### Phase 3: Service Integration (Subsequent PR)

- Alias existing `Provenance` class to `ServiceProvenance`
- Update GenAIService to use new model
- Add fingerprint generation to prompt rendering
- Nest ServiceProvenance within RequestProvenance in CLI
### Phase 4: Adapter & Transport (Future)

- Add `TransportProvenance` capture in OpenAI adapter
- Propagate trace_id in HTTP headers (X-Trace-ID)
- Document adapter patterns for other services
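A sketch of the Phase 4 pattern, assuming an httpx-based transport; the endpoint and header capture are illustrative:

```python
import httpx

from tnh_scholar.provenance import TransportProvenance  # proposed module (§7)

def call_backend(trace_id: str, payload: dict) -> TransportProvenance:
    with httpx.Client() as client:
        response = client.post(
            "https://api.example.com/generate",  # illustrative endpoint
            json=payload,
            headers={"X-Trace-ID": trace_id},    # same ID the CLI generated
        )
    # Capture the backend-provided request ID for debugging and cost tracking
    return TransportProvenance(
        request_id=response.headers.get("x-request-id"),
        backend_metadata={"status_code": response.status_code},
    )
```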
## Future Extensions
The following decimal ADRs will provide detailed implementation guidance for specific aspects of the provenance infrastructure. These extend ADR-PV01 without modifying the core strategy decisions.
### ADR-PV01.1: JSON-LD Provenance Vocabulary
Type: implementation-guide
Status: Planned
Scope:
- Define Schema.org vocabulary mappings for provenance models
- Integration with existing metadata JSON-LD infrastructure (ADR-MD01/MD02)
- Serialization examples and templates
- Benefits: semantic web compatibility, dataset interoperability, advanced querying
Key Decisions:
- Map `RequestProvenance`, `ServiceProvenance`, `TransportProvenance` to Schema.org types
- Use `schema:Provenance`, `schema:Action`, `schema:SoftwareApplication` vocabulary
- Provide bidirectional conversion utilities (internal models ↔ JSON-LD)
### ADR-PV01.2: Provenance Testing Strategy
Type: testing-strategy
Status: Planned
Scope:
- Integration test patterns for trace_id propagation across layers
- Property-based tests for consistency (same trace_id in all outputs)
- Serialization/deserialization round-trip tests
- CI/CD integration and regression prevention
Key Decisions:
- Test trace_id flows: CLI → service → adapter → transport
- Verify provenance nesting in multi-stage operations
- Assert trace_id consistency in stdout JSON, file headers, logs
- Add provenance assertions to existing service tests
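A sketch of such a consistency test; `run_cli` and `read_frontmatter` are hypothetical helpers standing in for the real test harness:

```python
import json

def test_trace_id_consistent_across_outputs(tmp_path):
    """One operation must emit exactly one trace_id, everywhere."""
    result = run_cli(["tnh-gen", "run", "summarize-talk"], output_dir=tmp_path)

    # stdout JSON carries the trace_id in the provenance block
    stdout = json.loads(result.stdout)
    trace_id = stdout["provenance"]["trace_id"]

    # The generated file's YAML frontmatter must carry the same ID
    header = read_frontmatter(tmp_path / "output.md")
    assert header["trace_id"] == trace_id
```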
### ADR-PV01.3: OpenTelemetry Compatibility Mapping
Type: design-detail
Status: Planned
Scope:
- Mapping table: internal provenance models → OpenTelemetry semantic conventions
- Optional instrumentation layer for OTEL export
- Future migration path to full OTEL tracing
- Preserves investment in current models while enabling future observability
Key Decisions:
- Map `trace_id` → OTEL `trace_id` (W3C Trace Context format)
- Map `ServiceProvenance` → OTEL span attributes (provider, model, parameters)
- Map `TransportProvenance.request_id` → OTEL span/event attributes
- Provide a serializer module for OTEL export (opt-in)
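A sketch of an opt-in exporter, assuming the `opentelemetry-api` package; the attribute names are placeholders until ADR-PV01.3 fixes the mapping:

```python
from opentelemetry import trace

from tnh_scholar.provenance import ServiceProvenance  # proposed module (§7)

def export_service_provenance(svc: ServiceProvenance) -> None:
    """Replay a recorded ServiceProvenance as an OTEL span (opt-in)."""
    tracer = trace.get_tracer("tnh_scholar.provenance")
    with tracer.start_as_current_span("service_operation") as span:
        span.set_attribute("gen_ai.provider", svc.provider)  # placeholder name
        if svc.model:
            span.set_attribute("gen_ai.model", svc.model)    # placeholder name
        span.set_attribute("retry.attempt_count", svc.attempt_count)
```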
Benefits:
- Future-proofs provenance infrastructure
- Enables integration with OTEL-compatible tools (Jaeger, Zipkin, DataDog)
- No immediate migration required; gradual adoption path
### ADR-PV01.4: Logging & Observability Integration
Type: implementation-guide
Status: Planned
Scope:
- Context injection patterns for trace_id in log messages
- Structured logging formatter integration (JSON logs)
- Correlation between provenance and application logs
- Log aggregation and querying patterns
Key Decisions:
- Standardized logging context manager (`with trace_context(trace_id)`)
- Custom logging formatter that extracts trace_id from context
- Log message format: `[trace_id=abc123] Operation completed`
- Integration with the Python `logging` module and structured loggers (e.g., structlog)
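A sketch of the proposed context-injection mechanics using `contextvars`; `trace_context` matches the API named above but is not yet implemented:

```python
import logging
from contextlib import contextmanager
from contextvars import ContextVar

_current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

@contextmanager
def trace_context(trace_id: str):
    """Make trace_id available to all log records emitted in this scope."""
    token = _current_trace_id.set(trace_id)
    try:
        yield
    finally:
        _current_trace_id.reset(token)

class TraceIdFilter(logging.Filter):
    """Copies the active trace_id onto every record for use by formatters."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _current_trace_id.get()
        return True

# Formatter pattern: "[trace_id=%(trace_id)s] %(message)s"
```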
Benefits:
- Significantly enhances debugging (grep logs by trace_id)
- Links errors, warnings, and info messages to specific operations
- Enables log-based analysis and alerting
### ADR-PV01.5: Fingerprint Algorithms & Extensibility
Type: design-detail
Status: Planned
Scope:
- Pluggable fingerprint algorithm interface
- Default SHA-256 implementation
- Support for alternative algorithms (e.g., BLAKE3 for performance, MD5 for legacy)
- Algorithm selection based on data type or performance requirements
Key Decisions:
- Define a `FingerprintAlgorithm` protocol with a `hash(content: bytes) -> str` method
- Register algorithms in a factory or registry pattern
- Default to SHA-256 for security and compatibility
- Allow per-operation algorithm override via configuration
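A sketch of the protocol and registry, following the shape named above; the registry layout is illustrative:

```python
import hashlib
from typing import Protocol

class FingerprintAlgorithm(Protocol):
    name: str
    def hash(self, content: bytes) -> str: ...

class Sha256Algorithm:
    """Default algorithm: SHA-256 for security and compatibility."""
    name = "sha256"
    def hash(self, content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

# Registry allows per-operation algorithm override via configuration
_REGISTRY: dict[str, FingerprintAlgorithm] = {"sha256": Sha256Algorithm()}

def get_algorithm(name: str = "sha256") -> FingerprintAlgorithm:
    return _REGISTRY[name]
```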
Benefits:
- Flexibility for different use cases (text, audio, images)
- Performance tuning for large datasets
- Future-proof for new hash algorithms
## As-Built Notes & Addendums

Optional section for post-decision updates.

## References
- W3C Trace Context Specification - Industry standard for trace propagation
- OpenTelemetry Semantic Conventions - Standard attribute naming
- Schema.org Provenance Vocabulary - JSON-LD provenance schema
- PROV-DM: The PROV Data Model - W3C provenance standard