ADR-003: Canonical XML AST for English Parsing¶
Status: Accepted
Date: 2025-09-07
Authors: Aaron Solomon
Context¶
... [existing content unchanged] ...
Consequences¶
- Mixed inline styling and nesting are preserved exactly.
- Viewer, search, and downstream NLP can derive their specific projections from one canonical model.
- Future transformations (HTML/Markdown/round-trip) are enabled without revisiting raw XML.
- Slightly higher upfront complexity in parse step; reduced complexity overall by avoiding parallel structures (no EnBlock).
As-Built Addendum (json_builder3.py, 2025-09-14)¶
Scope¶
Documents the implemented behavior of json_builder3.py where it narrows, clarifies, or extends ADR-003.
Confirmed decisions¶
- Dual AID model: Logical sAID (
s:{doc_id}:{i:02d}) and physical pAID (p:{doc_id}:{page}) with a publishedaid_index {section_to_pages, page_to_section}. - EN AST as first-class: Section-scoped English AST serialized to JSON, including
Paragraph,Heading,Author,ListBlock/ListItemBlock,Toc,Raw, and explicitPageBreak. - Notes isolation:
<notes>and<translation-notes>stored in an annotations sidecar keyed by sAID. Notes do not affect pagination or linearized EN text. - Section blocks:
section_blocks[]carryen,vi,en_ast, andpagination {target_pages, status, breaks, method}. Status isexact|incomplete|pending.
Clarifications and simplifications vs prior intent¶
- VI pagination policy (simplified)
- Vietnamese OCR is parsed sequentially starting at page 1.
<pagebreak page="N">attributes are ignored (keep_pagebreaks=False). - Consequence: VI page numbers reflect file order, not embedded attributes. This removes drift caused by bad OCR tags.
- VI section text with hard page separators
- Section-level VI is consolidated with U+000C form feed (
\f) between pages. Newlines remain available for future paragraphing. - Rationale: unambiguous page boundaries inside section text while allowing later paragraph insertion.
- EN pagination threading (explicit, side-effect free)
- Page context (
cur) is threaded through AST emission. Only an explicitPageBreak(page=N)advances context toN+1. Other blocks do not mutate context. - Top-level structure repair (strict)
- A repair pass moves any top-level
pagebreak/notes/translation-notes/otherbetween sections into the preceding section. - Any content before the first
<section>raisesValueError. Multiple consecutive pagebreaks are warned, not fatal. - Images → pages policy
- Page numbers are extracted from filenames; when absent, deterministic fallback numbering is assigned with a warning. Duplicate numbers are an error. URLs are normalized to
/images/{doc_id}/{filename}. - Alignment intent signaling
- If a page belongs to a section (by metadata) and EN is empty, a
Span.align = {"status":"pending","source":"section"}is emitted to flag later reconciliation. - TOC handling
- TOC is parsed into a nested AST and linearizes to indented lines for page text. This preserves structure without inventing semantics.
- Raw passthrough
- Unknown tags are preserved as
Raw{tag, attrs, text}in AST and contribute text to linearized EN if present.
Derived behaviors¶
- Section ordering follows metadata order; titles come only from metadata.
- Bundle title is derived from
doc_idpattern…YYYY…Volwhen possible. - Validation reports: duplicate AIDs, missing images, empty VI/EN, EN pending, missing EN pagebreaks inside targeted ranges, and expected pages not present in spans.
Reasons¶
- Reduce fragility from noisy OCR page markers.
- Make pagination deterministic and auditable.
- Keep notes non-intrusive to alignment.
- Prepare for human-in-the-loop pagination repair by emitting clear signals (
pending,breaksplaceholder,editslog scaffold).
Operational consequences¶
- Rebuilding with the same inputs yields stable page indices even if OCR page attributes change.
- Any future tool that re-numbers VI pages must run before this builder or supply corrected images/metadata.
- UI can rely on
\fwithinSectionBlock.vito segment pages in section view.
Migration notes¶
- Existing consumers should switch to
section_blocks[].vifor section-scoped VI with page splits and prefersection_blocks[].en_astfor rich rendering. - Tools that previously read EN notes from linearized text must read
annotations[sAID]instead. - If prior workflows assumed VI page numbers from OCR attributes, update them to the sequential policy or pre-normalize the XML.
Open questions / future work¶
- Populate
pagination.breaksfrom observed EN pagebreaks to mark in-section splits. - Optional mode to honor VI
<pagebreak page="N">when verified metadata exists. - Emit per-node source spans to enable diff-based reconciliation.
- Define minimal edit ops in
edits[]and wire a reviewer UI.