tnh-gen Golden Artifact Preservation Note — May 2026¶
Short operations note from the real-world journal case-study rerun.
Context¶
The journal pipeline golden was first run on an incomplete four-page source. Later review showed the article actually continued onto a fifth page. Rebuilding the source and rerunning the workflow exposed an important weakness in the current testing posture: some of the most useful workflow artifacts were local JSON files that had not been tracked cleanly enough to support later historical comparison.
Main Lesson¶
For tnh-gen golden work, structured outputs are not just transient helper files. They are
part of the evidence trail.
In particular:
- section maps
- reviewed section-map corrections
- vars files that carry section/document context
- preserved comparison runs with different split targets
need to survive later reruns if we want to compare prompt behavior, model behavior, and human review interventions honestly.
Resulting Working Rules¶
- Keep raw model outputs unchanged when they serve as golden evidence.
- Put human-reviewed changes in separately named artifacts such as
_correctedor_edited. - If the underlying source dataset changes materially, preserve the old artifact set and write the new run under a clearly distinct path such as a dated or source-scope-specific directory.
- Track the JSON artifacts that are operationally important to the workflow rather than relying on git history to reconstruct them later.
- Treat assembled user-facing
.mddocuments differently fromtnh-genworkflow artifacts: keep only the current canonical refined assembly, and rely on preserved source, section, clean, and translation artifacts to reconstruct earlier or alternate assemblies when needed.
Why This Matters¶
Without these rules, later prompt or source improvements can erase the exact structured handoff state of earlier runs. That makes it harder to answer questions like:
- Did the model improve, or did the source change?
- Did the section boundary move because of a better prompt, a different target count, or a human review correction?
- What context was actually passed into downstream translation runs?
For real-world golden evaluation, preserving that structure is part of the provenance story, not an optional convenience.