Best Practices
This guide outlines recommended practices for using TNH Scholar effectively.
General Guidelines
Text Processing
1. Input Preparation
- Ensure text files use consistent line endings
- Remove any special formatting or control characters
- Use UTF-8 encoding for all text files
- Consider running
nfmt
on input files to standardize formatting
2. Language Handling
- Specify language codes explicitly when known
- Use ISO 639-1 two-letter codes (e.g., 'en', 'vi')
- Allow auto-detection only for simple cases
3. File Management
- Keep original files backed up
- Use descriptive file names
- Maintain consistent directory structure
- Store intermediate results when running long pipelines
Command-Line Tools
TNH-FAB
1. Pattern Selection
- Use default patterns for initial testing
- Create custom patterns for specific needs
- Test patterns with small samples first
- Document pattern modifications
2. Pipeline Design
- Break complex processing into steps
- Use intermediate files for long pipelines
- Validate output at each stage
- Consider using
tee
for debugging
Example of good pipeline practice:
# Good: Save intermediate results
cat input.txt | \
tnh-fab punctuate > punctuated.txt && \
tnh-fab section punctuated.txt > sections.json && \
tnh-fab process -p format_xml -s sections.json punctuated.txt > final.xml
# Not recommended: Direct pipeline without saves
cat input.txt | tnh-fab punctuate | tnh-fab section | tnh-fab process -p format_xml
3. Error Handling
- Check command exit codes
- Save error output for debugging
- Use verbose mode for troubleshooting
- Keep log files organized
Audio-Transcribe
1. Audio Processing
- Use appropriate chunk sizes for content
- Consider silence detection for natural breaks
- Monitor transcription quality
- Save intermediate audio chunks
2. YouTube Downloads
- Verify video availability before batch processing
- Use CSV files for bulk operations
- Include timestamps when needed
- Save downloaded audio files
Pattern Development
1. Pattern Design
- Keep patterns focused and single-purpose
- Include clear documentation
- Test with various input types
- Patterns are automatically version controlled using git.
- Use git tools to inspect and manage versions.
2. Template Variables
- Use descriptive variable names
- Provide default values when appropriate
- Document required variables
- Test variable combinations
Example pattern structure:
---
description: Example pattern for formatting
version: 1.0
author: TNH Scholar
---
Process this text according to these parameters:
Language: {{ language }}
Style: {{ style_convention }}
Review Count: {{ review_count }}
Additional instructions...
Performance Optimization
1. Resource Management
- Monitor token usage
- Use appropriate batch sizes
- Consider chunking for large files
- Cache intermediate results
2. API Usage
- Implement rate limiting
- Handle API errors gracefully
- Monitor usage and costs
- Use appropriate models for tasks
Testing and Validation
1. Input Validation
- Test with various file sizes
- Include different languages
- Check character encodings
- Verify line endings
2. Output Validation
- Verify file formats
- Check content integrity
- Validate XML structure
- Compare with expected results
3. Process Validation
- Monitor system resources
- Track processing time
- Log error conditions
- Document edge cases
Security Considerations
1. API Keys
- Use environment variables
- Never commit keys to version control
- Rotate keys regularly
- Monitor API usage
2. File Handling
- Validate input files
- Use secure file permissions
- Clean up temporary files
- Handle sensitive content appropriately
Documentation
1. Code Comments
- Document complex logic
- Explain pattern usage
- Note assumptions
- Include examples
2. Process Documentation
- Document workflows
- Create usage examples
- Update for changes
- Include troubleshooting guides