Best Practices¶
This guide outlines recommended practices for using TNH Scholar effectively.
General Guidelines¶
Text Processing¶
1. Input Preparation¶
- Ensure text files use consistent line endings
- Remove any special formatting or control characters
- Use UTF-8 encoding for all text files
- Consider running
nfmton input files to standardize formatting
2. Language Handling¶
- Specify language codes explicitly when known
- Use ISO 639-1 two-letter codes (e.g., 'en', 'vi')
- Allow auto-detection only for simple cases
3. File Management¶
- Keep original files backed up
- Use descriptive file names
- Maintain consistent directory structure
- Store intermediate results when running long pipelines
Command-Line Tools¶
TNH-Gen¶
1. Prompt Selection¶
- Use default prompts for initial testing
- Create custom prompts for specific needs
- Test prompts with small samples first
- Document prompt modifications
2. Pipeline Design¶
- Break complex processing into steps
- Use intermediate files for long pipelines
- Validate output at each stage
- Consider using
teefor debugging
The recommended pipeline uses tnh-gen for every stage, including the split:
# Step 1: Number raw lines (preprocessing — no model call)
awk '{print NR":"$0}' source.txt > source_numbered.txt
# Step 2: Section — tnh-gen identifies page/article breaks and produces metadata
tnh-gen run --prompt section_by_break \
--input-file source_numbered.txt \
--var source_language=Vietnamese \
--var section_count=4 \
--var metadata='{"title":"...","author":"..."}' \
--output-file sections.json
# Step 3: Clean each section (focused per-page context)
for i in 1 2 3 4; do
tnh-gen run --prompt default_clean_numbered \
--input-file section_${i}_raw.txt \
--var source_language=Vietnamese \
--output-file section_${i}_clean.txt
done
# Step 4: Translate each section with full document context
for i in 1 2 3 4; do
tnh-gen run --prompt default_line_translate \
--input-file section_${i}_clean.txt \
--vars sections.json \
--var source_language=Vietnamese \
--var target_language=English \
--var style=scholarly \
--output-file section_${i}_translated.txt
done
# Step 5: Combine
cat section_{1,2,3,4}_translated.txt > final_translated.txt
Processing section-by-section (rather than one large batch) keeps each model call focused on a single page's worth of content and contains errors locally. The sections.json produced in step 2 carries section titles, key concepts, and a document summary that give the translator full context for every section.
# Never: overwrite your only input copy
tnh-gen run --prompt default_clean --input-file source.txt --output-file source.txt
For the full walkthrough with a real OCR source, extraction helper script, and golden output structure, see the Thầy Edited Journal Text Case Study.
3. Error Handling¶
- Check command exit codes
- Save error output for debugging
- Use verbose mode for troubleshooting
- Keep log files organized
Audio-Transcribe¶
1. Audio Processing¶
- Use appropriate chunk sizes for content
- Consider silence detection for natural breaks
- Monitor transcription quality
- Save intermediate audio chunks
2. YouTube Downloads¶
- Verify video availability before batch processing
- Use CSV files for bulk operations
- Include timestamps when needed
- Save downloaded audio files
Pattern Development¶
1. Pattern Design¶
- Keep patterns focused and single-purpose
- Include clear documentation
- Test with various input types
- Patterns are automatically version controlled using git.
- Use git tools to inspect and manage versions.
2. Template Variables¶
- Use descriptive variable names
- Provide default values when appropriate
- Document required variables
- Test variable combinations
Example pattern structure:
---
description: Example pattern for formatting
version: 1.0
author: TNH Scholar
---
Process this text according to these parameters:
Language: {{ language }}
Style: {{ style_convention }}
Review Count: {{ review_count }}
Additional instructions...
Performance Optimization¶
1. Resource Management¶
- Monitor token usage
- Use appropriate batch sizes
- Consider chunking for large files
- Cache intermediate results
2. API Usage¶
- Implement rate limiting
- Handle API errors gracefully
- Monitor usage and costs
- Use appropriate models for tasks
Testing and Validation¶
1. Input Validation¶
- Test with various file sizes
- Include different languages
- Check character encodings
- Verify line endings
2. Output Validation¶
- Verify file formats
- Check content integrity
- Validate XML structure
- Compare with expected results
3. Process Validation¶
- Monitor system resources
- Track processing time
- Log error conditions
- Document edge cases
Security Considerations¶
1. API Keys¶
- Use environment variables
- Never commit keys to version control
- Rotate keys regularly
- Monitor API usage
2. File Handling¶
- Validate input files
- Use secure file permissions
- Clean up temporary files
- Handle sensitive content appropriately
Documentation¶
1. Code Comments¶
- Document complex logic
- Explain pattern usage
- Note assumptions
- Include examples
2. Process Documentation¶
- Document workflows
- Create usage examples
- Update for changes
- Include troubleshooting guides