TNH Scholar Knowledge Base: Design Document¶
Design document for the TNH Scholar knowledge base and semantic search stack.
Project Overview¶
The TNH Scholar Knowledge Base project aims to create an AI-driven searchable knowledge base for the works of Thich Nhat Hanh, enabling researchers to perform semantic searches across a comprehensive corpus of teachings. The system will support complex queries ranging from thematic exploration to perspective synthesis while maintaining scholarly accuracy and source attribution.
Core Objectives¶
- Primary Goal: Enable semantic search across Thich Nhat Hanh's complete works with accurate source attribution
- Target Users: Monastics and senior community members for research purposes (internal use only)
- Output Focus: Concrete passage references rather than synthesized responses (initially)
- Technical Approach: Leverage off-the-shelf systems and technologies where possible
Query Types to Support¶
The system must handle diverse query patterns:
- Factual Retrieval: "Where is the best description of touching the earth practice?"
- Thematic Exploration: "Find all works referencing climate change and transcending separate self"
- Perspective Analysis: "What does Thay say about soldiers, police officers, and nonviolence?"
- Qualitative Assessment: "What were some of Thay's most difficult/profound/loving moments?"
- Cross-Domain Inquiry: "Does Thay ever talk about physics and dharma teachings?"
- Synthetic Analysis (future): "What would Thay say about the Israel Palestine conflict today?"
Content Scope and Processing¶
Source Materials¶
- Text Sources: Hundreds of texts in Vietnamese and English (predominant languages)
- Audio/Video Sources: Potentially thousands of sources requiring transcription
- Processing Approach: All sources converted to digital text with unified metadata structure
- Transcription Capability: Robust transcription tools already available in the project
Content Processing Pipeline¶
1. Text Normalization¶
- Consistent formatting, punctuation, and character handling
- Language-specific processing for Vietnamese and English
- Special character and encoding standardization
2. Transcription Integration¶
- Audio/video content processed through existing transcription tools
- Quality assurance for transcribed content
- Timestamping preservation for multimedia sources
3. Metadata Enrichment¶
Development of consistent metadata schema including:
Bibliographic Information:
- Source identification and publication details
- Date, author/speaker, and publication context
- Original format and processing information
Content Classification:
- Topics, themes, and spiritual practices
- Target audiences and contexts
- Relationship to other works in corpus
Structural Metadata:
- Position within larger works
- Section and subsection organization
- Cross-references and citations
Technical Metadata:
- Processing timestamps and methods
- Language identification and translation links
- Quality scores and confidence measures
4. Text Segmentation Strategy¶
Key Decision Point: Determining optimal chunk size for retrieval
Options for consideration:
- Paragraph-level chunks: Balance between context and precision
- Section-based chunks: Preserving logical content boundaries
- Sliding window approaches: Overlapping chunks to prevent context loss
- Hierarchical chunking: Multiple granularities (section → paragraph → sentence)
Search and Retrieval Architecture¶
Vector Embeddings: Technical Foundation¶
Vector embeddings convert text into high-dimensional numerical representations where semantically similar content occupies nearby positions in vector space. This enables semantic search beyond keyword matching.
Key Concepts:
- Text converted to vectors (lists of numbers) representing meaning
- Similar concepts have mathematically similar vectors
- Search performed by finding nearest neighbors in vector space
- Typical dimensions: 768-3072 numbers per text segment
Example Flow:
Text: "The practice of mindful breathing"
→ Embedding Model → [0.016, -0.028, 0.044, ..., 0.037]
Embedding Model Options¶
Commercial Solutions¶
- OpenAI text-embedding-3-large: 3,072 dimensions, high quality, API-based
- Cohere Embed: Multilingual support, commercial API
- Azure AI Embeddings: Enterprise integration capabilities
Open Source Solutions¶
- Sentence-BERT (SBERT): Wide model variety, multilingual options available
- E5 Models: Microsoft's efficient embeddings with strong performance
- BGE Models: Multilingual embeddings with strong cross-language capabilities
- Multilingual Models: Specialized for Vietnamese-English bilingual corpus
Key Decision Point: Commercial vs. Open Source Embedding Strategy
- Commercial: Higher quality, faster development, ongoing costs
- Open Source: Full control, customization potential, infrastructure overhead
Search Paradigms¶
1. Vector-Based Semantic Search¶
- Approach: Dense vector embeddings with nearest neighbor search
- Strengths: Captures conceptual relationships, handles paraphrasing
- Limitations: Computationally expensive, may lose precision on technical terms
- Best For: Thematic queries, conceptual exploration
2. Keyword-Based Search¶
- Approach: Traditional BM25/TF-IDF algorithms with boolean operators
- Strengths: Precise terminology matching, computationally efficient
- Limitations: Misses semantic relationships, language-dependent
- Best For: Exact phrase matching, specific terminology
3. Hybrid Search Systems¶
- Approach: Combining vector and keyword search with result fusion
- Strengths: Balances precision and recall, handles diverse query types
- Limitations: Complex implementation, requires careful tuning
- Best For: Production systems serving varied query patterns
Recommended Approach: Start with pure vector search, evolve to hybrid system based on user feedback and query patterns.
Storage and Infrastructure Options¶
Vector Database Solutions¶
Specialized Vector Databases:
- Pinecone: Fully managed, simple API, commercial service
- Weaviate: Open-source, schema support, self-hosted or cloud
- Milvus: High-performance, enterprise features, complex setup
- Qdrant: Fast performance, good filtering, moderate complexity
- Chroma: Simple open-source, ideal for prototypes
- PGVector: PostgreSQL extension, leverages existing database skills
Traditional Search with Vector Extensions:
- Elasticsearch: Mature ecosystem, hybrid search capabilities, complex configuration
- Solr: Similar to Elasticsearch, vector plugins available
Infrastructure Decision Matrix¶
| Solution | Setup Complexity | Performance | Cost Model | Best For |
|---|---|---|---|---|
| Pinecone | Low | High | Usage-based | Rapid prototyping |
| Elasticsearch | Medium | High | Infrastructure | Hybrid search needs |
| Chroma | Low | Medium | Self-hosted | Initial development |
| Weaviate | Medium | High | Flexible | Production deployment |
Advanced Retrieval Techniques¶
Passage Ranking and Re-ranking¶
Methods to improve relevance of returned results:
- Cross-Encoders: Direct query-passage relevance scoring
- LLM-based Reranking: Large language models for relevance judgment
- Learning to Rank: Machine learning approaches using multiple features
- Multi-stage Retrieval: Initial retrieval followed by sophisticated ranking
Query Processing Enhancement¶
- Query Expansion: Adding related terms to improve recall
- Query Rewriting: Reformulating queries for better matching
- Retrieval Augmented Generation for Queries: LLM-generated query improvements
- Intent Classification: Understanding query type to optimize search strategy
Implementation Pathways¶
Fully Commercial Pathway¶
Components:
- Embedding: OpenAI text-embedding-3-large API
- Storage: Pinecone or Azure Cognitive Search
- Query Processing: OpenAI API for query understanding
- Infrastructure: Cloud providers (AWS, Azure, GCP)
Cost Estimate: $500-2000/month depending on usage Timeline: 2-4 weeks for initial prototype Advantages: Rapid development, managed infrastructure, high-quality embeddings Disadvantages: Recurring costs, vendor dependency, data privacy considerations
Fully Open Source Pathway¶
Components:
- Embedding: SBERT, E5, or BGE models (self-hosted)
- Storage: Elasticsearch, Qdrant, or Milvus (self-hosted)
- Query Processing: Open source rerankers, self-hosted LLMs
- Infrastructure: Self-managed servers
Cost Estimate: Primarily hardware and operational overhead
Timeline: 6-12 weeks for initial prototype
Advantages: No recurring API costs, full control, complete customization
Disadvantages: Higher development effort, infrastructure management, technical expertise required
Hybrid Approach (Recommended for Prototyping)¶
Phase 1 - Rapid Prototype:
- Commercial embedding API + simple vector database
- Focus on core functionality validation
- Minimal infrastructure overhead
Phase 2 - Production System:
- Evaluate commercial vs. open source based on Phase 1 learnings
- Implement sophisticated ranking and query processing
- Scale infrastructure based on usage patterns
Evaluation Framework¶
Technical Metrics¶
Relevance Metrics:
- Precision/Recall: Classical search performance measures
- Mean Average Precision (MAP): Standard information retrieval metric
- Normalized Discounted Cumulative Gain (nDCG): Ranking quality assessment
- Mean Reciprocal Rank (MRR): Position of first relevant result
System Performance:
- Query response time and throughput
- Index size and memory requirements
- Computational resource utilization
- System availability and reliability
User-Centered Evaluation¶
Relevance Assessment:
- Human judgment of search result quality
- Task completion rates for researchers
- Side-by-side comparison of different approaches
- Feedback collection and analysis systems
Usage Analytics:
- Query patterns and frequency analysis
- Result click-through and usage patterns
- User satisfaction surveys and interviews
- Feature usage and adoption metrics
Evaluation Data Requirements¶
Test Query Development:
- Representative query sets across all supported types
- Ground truth relevance judgments
- Edge cases and challenging queries
- Multilingual query examples
Benchmark Creation:
- Gold standard query-document pairs
- Cross-validation datasets for model training
- Hold-out test sets for final evaluation
- Adversarial examples for robustness testing
Open Decision Points and Areas for Exploration¶
1. Chunking Strategy Selection¶
Decision Required: Optimal text segment size and boundaries Options:
- Paragraph-level with context preservation
- Section-based following logical boundaries
- Sliding window with overlap
- Hierarchical multi-granularity approach
Exploration Needed:
- Performance testing with different chunk sizes
- User feedback on result granularity preferences
- Context preservation vs. precision trade-offs
2. Multilingual Handling Approach¶
Decision Required: Strategy for Vietnamese-English bilingual corpus Options:
- Separate indices for each language
- Cross-lingual embeddings for unified search
- Translation-based query expansion
- Language-specific optimization
Exploration Needed:
- Cross-language retrieval effectiveness
- Translation quality impact on search results
- User preference for language-specific vs. unified results
3. Metadata Schema Design¶
Decision Required: Comprehensive metadata structure for all content types Critical Elements:
- Standardization across diverse source materials
- Balance between detail and usability
- Automatic vs. manual metadata generation
- Evolution and versioning strategy
Exploration Needed:
- Analysis of existing source material organization
- User requirements for filtering and faceting
- Automated metadata extraction capabilities
4. Commercial vs. Open Source Technology Mix¶
Decision Required: Optimal balance of commercial and open source components Considerations:
- Budget constraints and cost predictability
- Data privacy and control requirements
- Development timeline and resource availability
- Long-term maintenance and scaling needs
Exploration Needed:
- Pilot testing of different technology combinations
- Total cost of ownership analysis
- Performance and quality comparisons
- Risk assessment for vendor dependencies
5. Advanced Feature Development Priority¶
Decision Required: Roadmap for sophisticated features beyond basic retrieval Potential Features:
- Synthetic query answering ("What would Thay say about...")
- Cross-reference and citation analysis
- Temporal analysis of teaching evolution
- Thematic clustering and visualization
- Collaborative annotation and correction systems
Exploration Needed:
- User interviews to prioritize feature importance
- Technical feasibility assessment for advanced features
- Resource requirements for feature development
- Integration complexity with core search functionality
6. Query-Response Pair Integration Strategy¶
Decision Required: Optimal use of existing query-response pair prototype Options:
- Training data for custom embedding fine-tuning
- Evaluation benchmarks for system performance
- Re-ranking model training data
- Query expansion and reformulation examples
Exploration Needed:
- Quality assessment of existing query-response pairs
- Expansion strategies for broader coverage
- Integration methods with chosen search architecture
- Contribution to overall system performance
Implementation Phases¶
Phase 1: Foundation Development (4-6 weeks)¶
Objectives:
- Establish core document processing pipeline
- Implement basic vector search functionality
- Create initial metadata schema
- Develop evaluation framework
Deliverables:
- Document ingestion and processing system
- Basic search interface for internal testing
- Initial performance benchmarks
- Technology stack validation
Phase 2: Enhanced Retrieval (6-8 weeks)¶
Objectives:
- Implement sophisticated ranking and filtering
- Optimize query processing and understanding
- Expand metadata richness and utility
- Integrate user feedback mechanisms
Deliverables:
- Production-quality search system
- Comprehensive evaluation results
- User training materials and documentation
- Performance optimization and scaling plan
Phase 3: Advanced Features (8-12 weeks)¶
Objectives:
- Develop synthetic query capabilities
- Implement collaborative features
- Create analytical and visualization tools
- Establish long-term maintenance procedures
Deliverables:
- Feature-complete knowledge base system
- Advanced query processing capabilities
- Analytics dashboard and reporting tools
- Comprehensive system documentation
Success Criteria¶
Technical Success Metrics¶
- Query response time under 2 seconds for 95% of searches
- Relevance scores above 0.8 for top-3 results on benchmark queries
- System availability above 99% during operating hours
- Support for concurrent users without performance degradation
User Success Metrics¶
- User satisfaction scores above 4.0/5.0 in system evaluation
- Task completion rates above 85% for research queries
- Adoption rate above 90% among target user community
- Positive feedback on system utility and accuracy
Content Success Metrics¶
- Coverage of 95% of available source materials
- Metadata completeness above 90% for all indexed content
- Cross-language retrieval accuracy comparable to single-language performance
- Successful handling of all defined query types
This design document provides a comprehensive framework for developing the TNH Scholar Knowledge Base while highlighting critical decision points that require further investigation and stakeholder input. The modular approach allows for iterative development and refinement based on user feedback and technical validation.