TNH FAB Design Document
Overview
tnh-fab
is a command-line text processing tool that provides standalone but pipeable text processing operations, with a focus on simplicity and flexibility. It is part of the tnh-scholar
suite of tools.
Core Functionality
- Text punctuation
- Section creation and management
- Translation (line-by-line and block)
- General text processing with patterns
Command Structure
tnh-fab <command> [options] [input_file]
Commands:
- punctuate
: Add punctuation and structure to text
- section
: Create text sections
- translate
: Perform line-by-line translation
- process
: Execute pattern-based text processing (typically outputting XML format)
Global Options
-d, --output-dir DIR Output directory (default: current)
-v, --verbose Detailed logging
-l, --language LANG Source language code of input text (auto-detect if not specified)
Command-Specific Options
Punctuate
-p, --pattern NAME Pattern name for punctuation (uses default if not specified)
-y, --style STYLE Punctuation style (default: APA)
-o, --output FILE Output file (default: stdout or FILE_punct.txt)
Section
-p, --pattern NAME Pattern name for sectioning
-n, --num NUM Target number of sections
-o, --output FILE Output JSON file (default: stdout or FILE_sections.json)
Translate
-p, --pattern NAME Pattern for translation (uses default if not specified)
-t, --template FILE Template values file
-o, --output FILE Output file (default: stdout)
-r, --target LANG The target output language code (default is 'en,' English)
Process
-p, --pattern NAME Pattern for processing (REQUIRED)
-g, --paragraph Use line separated paragraphs as sections.
-s, --sections FILE JSON file containing section data
-t, --template FILE Template values file
-f, --format FORMAT Output format: txt/json/yaml (default: txt)
-o, --output FILE Output file (default: stdout)
Usage Examples
Basic Usage
# Punctuate text
tnh-fab punctuate input.txt
tnh-fab punctuate -l vi input.txt > punctuated.txt
# Create sections
tnh-fab section input.txt -n 5 -o sections.json
cat input.txt | tnh-fab section > sections.json
# Translate text
tnh-fab translate input.txt -p vi_en
cat input.txt | tnh-fab translate -p vi_en > translated.txt
# Process with pattern
tnh-fab process -p format_xml -s sections.json input.txt
cat sections.json | tnh-fab process -p format_xml input.txt
# Process by paragraphs
tnh-fab process -g -p format_xml input.txt
Pipeline Examples
# Punctuate and section
cat input.txt | tnh-fab punctuate | tnh-fab section > sections.json
# Section and process
tnh-fab section input.txt | tnh-fab process -p format_xml > output.xml
# Translate and process
cat input.txt | tnh-fab translate | tnh-fab process -p format_md
Configuration
Directory Structure
~/.config/tnh_scholar/
├── patterns/ # Pattern files
│ ├── punctuate/
│ ├── section/
│ ├── translate/
│ └── process/
└── tnh-fab/
└── settings.yaml # Default configurations
Default Configuration (settings.yaml)
defaults:
punctuate:
pattern: default_punctuate
style: APA
language: auto
section:
pattern: default_section
num_sections: auto
review_count: 3
translate:
pattern: default_translate
process:
format: txt
pattern_path: ~/.config/tnh_scholar/patterns
Input/Output Handling
Input Sources
- File specified as argument
- STDIN (piped input)
- Section data (JSON format)
- Template files (YAML format)
Output Handling
- STDOUT (default for piping)
- Specified output file (-o)
- Default file naming (if no -o): input_stage.ext
- JSON output for sections
Pattern Management
- Uses existing PatternManager class for pattern resolution
- Uses configured Pattern path
Special Notes
- Translation is implemented as a standalone command (
translate
) for line-by-line processing, however can also be accomplished as a process pattern option for section translation - Each command is standalone but designed for pipeline compatibility
- All commands default to STDIN/STDOUT unless specific files are provided
- Section data is always in JSON format for compatibility
---
TNH-FAB PROCESS: detailed Specification
Overview
The process
command applies pattern-based text processing using optional section data. It can receive input from files and/or STDIN, with flexible output options. Typical usage is XML output.
Command Format
tnh-fab process [options] [input_file]
Options
-p, --pattern NAME Pattern name for processing (REQUIRED)
-s, --sections FILE JSON file containing section data
-g, --paragraph Process text by newliine separated paragraphs
-t, --template FILE Template values file (YAML format)
-k, --key-values PAIRS Space-separated key:value pairs (e.g., speaker:"Name" title:"Title")
-f, --format FORMAT Output format: XML/txt (default: XML)
-o, --output FILE Output file (default: stdout)
Input Handling
Input Sources
- Text content can come from:
- File specified as argument
- STDIN
- Section data can come from:
- JSON file specified with -s
- STDIN when paired with input file
Input Scenarios
- Single File Input
bash tnh-fab process -p format_xml input.txt
-
Processes input.txt directly
-
STDIN Only
bash cat input.txt | tnh-fab process -p format_xml cat input.txt | tnh-fab process -g -p format_xml # process by paragraphs
-
Processes text from STDIN
-
File + Sections File
bash tnh-fab process -p format_xml -s sections.json input.txt
-
Processes input.txt using sections from sections.json
-
STDIN + Sections File
bash cat input.txt | tnh-fab process -p format_xml -s sections.json
-
Processes STDIN text using sections from sections.json
-
Section Stream + Input File
bash tnh-fab section input.txt | tnh-fab process -p format_xml input.txt
- Processes input.txt using sections from STDIN
Input Validation
- When sections are provided (via -s or STDIN):
- Validates JSON format matches TextObject schema
- Checks source_file field in TextObject if present
- Warns if source_file doesn't match input file name
- Validates section line ranges against input text
Template Value Handling
Priority Order (highest to lowest)
- Command line key-values (-k)
- Template file values (-t)
- Default values from pattern
Key-Value Format
- Space-separated pairs
- Key and value joined by colon
- Values with spaces must be quoted
- Example:
speaker:"Robert Smith" title:"My Journey"
Output Handling
Output Destinations
- STDOUT (default)
- File specified by -o option
Format Options
- txt (default): Plain text output
- json: JSON formatted output
- yaml: YAML formatted output
Error Handling
Input Errors
- Missing required pattern
- Invalid section JSON format
- Section/input file mismatch
- Missing input when required
Template Errors
- Invalid template file format
- Invalid key-value pair format
- Missing required template values
Usage Examples
# Basic file processing
tnh-fab process -p format_pattern input.txt > output.xml
# Process with template values
tnh-fab process -p format_pattern -k speaker:"Robert Smith" input.txt
# Process with sections file
tnh-fab process -p format_pattern -s sections.json input.txt
# Process STDIN with sections
cat input.txt | tnh-fab process -p format_pattern -s sections.json
# Pipeline from section command
tnh-fab section input.txt | tnh-fab process -p format_pattern input.txt
# Complete example with all options
tnh-fab process -p format_pattern \
-s sections.json \
-t template.yaml \
-k speaker:"Robert Smith" \
-f json \
-o output.json \
input.txt