tnh-gen GPT-5 Structured Output Eval — May 2026
Short report from a live golden evaluation of the maintained tnh-gen prompt slice against GPT-5-family models. The evaluation focused on walkthrough viability, bounded human use, artifact-backed journal runs, and whether tnh-gen can perform maintained tasks cleanly enough to support examples and follow-on design.
Scope
Prompts evaluated in this loop:
- `default_section`
- `section_by_break`
- `generate_sections_en`
- `generate_sections_multi_lang`
- `default_clean`
- `translate_section_dt_en`
- `translate_json`
Models evaluated:
- `gpt-5-mini`
- `gpt-5.4`
- `gpt-5` (translation spot check only)
Primary evaluation goals:
- verify that maintained JSON prompts can complete small real runs successfully
- check whether the prompt outputs are usable for walkthrough examples
- compare GPT-5-family behavior where prompt/model/schema changes might improve real outcomes
- note operator friction when a human would realistically try to run these commands
Findings
1. Sectioning is viable on gpt-5.4
The maintained sectioning prompts were structurally successful on GPT-5-family models, but gpt-5-mini still allowed semantic misses on line coverage for some cases.
gpt-5.4 materially improved the sectioning path:
- `generate_sections_en` produced contiguous full-line coverage
- `section_by_break` produced contiguous full-line coverage
- `default_section` and `generate_sections_multi_lang` were also usable
Resulting decision:
- keep the maintained sectioning prompt family on `gpt-5.4` by default for now
2. Prompt wording still matters even with structured output
Line coverage was previously expressed mainly as a prompt instruction; the acceptance surface did not enforce it strongly enough.
Tightening the sectioning prompts to require:
- first section starts at line 1
- each next section starts exactly one line after the previous section ends
- final section ends at the final input line
helped make the successful default-model path more legible and robust.
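The three coverage rules above are mechanical enough to check outside the prompt. The following is a minimal sketch of such a check, not the tnh-gen acceptance code: the `Section` record and its field names (`start_line`, `end_line`) are assumptions made for illustration, and the real section schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Hypothetical section record; field names are illustrative assumptions."""
    title: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def coverage_errors(sections: list[Section], total_lines: int) -> list[str]:
    """Return a list of violations of the three coverage rules (empty if none)."""
    if not sections:
        return ["no sections returned"]

    errors: list[str] = []

    # Rule 1: the first section starts at line 1.
    if sections[0].start_line != 1:
        errors.append(
            f"first section starts at line {sections[0].start_line}, expected 1"
        )

    # Sanity check: no section may end before it starts.
    for section in sections:
        if section.end_line < section.start_line:
            errors.append(f"section {section.title!r} ends before it starts")

    # Rule 2: each next section starts exactly one line after the previous ends.
    for prev, cur in zip(sections, sections[1:]):
        if cur.start_line != prev.end_line + 1:
            errors.append(
                f"section {cur.title!r} starts at line {cur.start_line}, "
                f"expected {prev.end_line + 1}"
            )

    # Rule 3: the final section ends at the final input line.
    if sections[-1].end_line != total_lines:
        errors.append(
            f"final section ends at line {sections[-1].end_line}, "
            f"expected {total_lines}"
        )

    return errors
```

A check of this shape would catch the kind of semantic miss seen on `gpt-5-mini`, where the output is structurally valid but leaves lines uncovered.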
3. translate_json is not a good maintained tnh-gen walkthrough or prompt-catalog example in its current form
translate_json remained semantically unsuccessful across:
- `gpt-5-mini`
- `gpt-5.4`
- `gpt-5`
Observed failure mode:
- the run succeeds structurally
- the model returns the original English JSON payload unchanged
This persisted even after tightening prompt wording.
Interpretation:
- this is not just a model-capacity issue
- this is a poor fit between the prompt goal and the current `tnh-gen` artifact-contract surface
- broad "any JSON" structural acceptance is too weak to prove translation happened
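To illustrate why structural acceptance alone cannot prove translation, the sketch below compares the string values of the input and output payloads path by path. It is an illustrative check only, not part of tnh-gen; the function names are hypothetical, and a real gate would need a policy for strings that legitimately stay the same (identifiers, URLs, proper nouns).

```python
import json


def string_leaves(node, path=()):
    """Yield (path, value) for every string value in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from string_leaves(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from string_leaves(value, path + (index,))
    elif isinstance(node, str):
        yield path, node


def unchanged_fraction(source_text: str, output_text: str) -> float:
    """Fraction of source string values that appear unchanged at the same path."""
    source = dict(string_leaves(json.loads(source_text)))
    output = dict(string_leaves(json.loads(output_text)))
    if not source:
        return 0.0
    same = sum(1 for path, value in source.items() if output.get(path) == value)
    return same / len(source)
```

Under this sketch, the observed failure mode (the original English payload returned unchanged) scores 1.0 even though the output parses as valid JSON.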
Resulting decision:
- remove `translate_json` from the maintained prompt workspace
- if JSON translation is needed later, design a different pathway rather than treating this prompt as a healthy maintained contract
4. Real journal clean and section-translation runs are now credible enough to serve as walkthrough sample artifacts
A real-world sectioning pass was run against tests/golden/journal-pipeline/source_numbered.txt, with the result written as an artifact:
tests/golden/journal-pipeline/walkthrough/clean_translate/sections_gpt54.json
That run produced four usable sections spanning the full 146-line source. Two representative sections were then processed through a file-based clean then translate loop, again writing artifacts at each stage:
- raw extracted section text
- cleaned section text via `default_clean`
- translated section text via `translate_section_dt_en`
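The file-based part of this loop can be sketched as two small helpers: one that slices a section out of the source by line range, and one that writes each stage to its own file. This is a sketch under assumptions, not the script used for the run. The helper names are hypothetical, the 1-based inclusive line range is assumed, and the `section_NN_<stage>.txt` pattern is inferred from the artifact names in this report. The prompt invocations themselves are left out.

```python
from pathlib import Path


def extract_section(source_lines: list[str], start_line: int, end_line: int) -> str:
    """Return the text for the 1-based inclusive range [start_line, end_line]."""
    if not 1 <= start_line <= end_line <= len(source_lines):
        raise ValueError(
            f"invalid range {start_line}-{end_line} for {len(source_lines)} lines"
        )
    return "\n".join(source_lines[start_line - 1:end_line]) + "\n"


def write_stage(out_dir: Path, section_index: int, stage: str, text: str) -> Path:
    """Write one stage artifact (e.g. section_01_cleaned.txt) and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"section_{section_index:02d}_{stage}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

Writing every stage to disk is what makes the chain reviewable: each intermediate file can be diffed against its predecessor or fed into a later prompt iteration.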
Representative outputs:
- tests/golden/journal-pipeline/walkthrough/clean_translate/section_01_cleaned.txt
- tests/golden/journal-pipeline/walkthrough/clean_translate/section_01_translated_dt_en.txt
- tests/golden/journal-pipeline/walkthrough/clean_translate/section_04_cleaned.txt
- tests/golden/journal-pipeline/walkthrough/clean_translate/section_04_translated_dt_en.txt
Observed result:
- the clean stage is doing useful real work on journal prose, not just cosmetic rewriting
- the section translation surface is more plausible for walkthrough use than `default_line_translate`
- the artifact chain is reviewable and reusable for follow-on prompt iteration
Interpretation:
- for this class of journal text, a section-first handoff followed by plain-text clean and plain-text section translation is a more believable human workflow than line-level translation
- `default_line_translate` should be viewed as a narrower control-loop surface, not the main journal walkthrough surface
UI/UX Notes
The human operator path is still awkward for richer sectioning prompts.
Main friction observed:
- multiline structured metadata passed inline through `--var metadata=...`
- repeated flags and long commands for otherwise small live checks
- vars files are much more believable than large inline payloads, but JSON-only `--vars` is still less friendly than it should be for common metadata-heavy workflows
This is usable for bounded engineering evaluation, but it is not an especially friendly human CLI surface for routine operators.
The sectioning prompts and the journal clean→translate slice are now viable enough for walkthrough-oriented use; the remaining UX issue is mostly variable-entry ergonomics rather than core runtime instability.
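As a rough illustration of the vars-file path, multiline metadata can be written to a JSON file once instead of being escaped inline on the command line. The sketch assumes the file is a flat JSON object mapping variable names to values; that shape, the helper name, and the example keys are assumptions rather than documented tnh-gen behavior.

```python
import json
from pathlib import Path


def write_vars_file(path: Path, variables: dict) -> Path:
    """Serialize prompt variables to a JSON file and return the path."""
    # ensure_ascii=False keeps non-English metadata readable in the file.
    text = json.dumps(variables, ensure_ascii=False, indent=2) + "\n"
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical variables; the key names are for illustration only.
example_vars = {
    "metadata": "title: Sample journal issue\nlanguage: vi\nsource: scanned pages",
    "section_count": 4,
}
```

The resulting file would then be passed through `--vars`, which keeps multiline values out of shell quoting altogether.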
Resulting Actions
- Keep maintained sectioning prompts on `gpt-5.4` defaults.
- Keep `default_clean` and the section translation surface on `gpt-5.4` defaults for now.
- Remove `translate_json` from maintained prompts.
- Treat future JSON-translation support as separate design work, likely involving deterministic JSON traversal plus targeted string translation rather than a broad "translate arbitrary JSON" prompt contract.
- Treat file-driven vars and saved output artifacts as the realistic operator path for richer multi-step walkthroughs; revisit broader YAML vars / metadata ergonomics later as separate UX work.
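The "deterministic JSON traversal plus targeted string translation" idea mentioned in the actions above might look roughly like the following. This is a sketch of one possible design, not a committed one: the `translate` callable is a hypothetical stand-in for whatever per-string translation step is eventually chosen, and the decision to leave keys untouched is an assumption.

```python
from typing import Any, Callable


def translate_strings(node: Any, translate: Callable[[str], str]) -> Any:
    """Return a copy of a parsed JSON tree with only string values translated.

    Keys, numbers, booleans, and nulls pass through unchanged, so the output
    keeps the structure of the input by construction.
    """
    if isinstance(node, dict):
        return {key: translate_strings(value, translate) for key, value in node.items()}
    if isinstance(node, list):
        return [translate_strings(item, translate) for item in node]
    if isinstance(node, str):
        return translate(node)
    return node
```

Because the traversal is deterministic, the model only ever sees individual strings, and acceptance can be checked per string rather than against a broad "any JSON" contract.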