ADR-K01: Preliminary Architectural Strategy for TNH Scholar Knowledge Base¶

Outlines a two-phase managed-to-open knowledge base plan so we can validate retrieval quality before heavy investment.

Status: Proposed
Date: 2025-02-14
Stakeholders: Research tooling team, GenAI service maintainers, metadata working group

Context¶

The research note in docs/research/tnh-scholar-knowledge-vector-search.md captures objectives, content scope, architectural options, and open questions for the TNH Scholar Knowledge Base. The document establishes the target user base, outlines the ingestion pipeline, compares embedding/search/storage technologies, and lists evaluation metrics. However, several foundational choices remain unresolved (chunking strategy, multilingual plan, metadata schema, and the commercial vs. open-source stack). An initial Architectural Decision Record is needed to align stakeholders on a phased approach while these investigations continue.

Decision¶

Adopt a two-phase strategy:

Phase 1 – Learning Prototype
Use managed services for speed: OpenAI text-embedding-3-large, a lightweight vector store (Pinecone or Chroma), and a simple ingestion pipeline leveraging existing transcription tools.
Focus on English content first, using paragraph-level chunks with conservative overlap to validate retrieval quality.
Capture user feedback from monastics/senior researchers to refine query patterns and metadata expectations.
Phase 2 – Production Architecture
Based on Phase 1 metrics, evaluate migration to an open-source or hybrid stack (e.g., BGE/E5 embeddings + Weaviate/Qdrant/Elasticsearch with BM25).
Finalize a bilingual metadata schema and chunking policy, incorporating cross-lingual retrieval requirements.
Introduce advanced ranking (re-rankers or intent-aware routing) only after core precision/recall targets are met.

Rationale¶

The research document identifies rapid prototyping as the recommended path (docs/research/tnh-scholar-knowledge-vector-search.md:200-209). Managed embeddings and hosted vector DBs minimize infrastructure drag while user needs crystallize.
Paragraph-level chunks provide good context/precision balance (docs/research/tnh-scholar-knowledge-vector-search.md:76-85). We can adjust chunking once evaluation data indicates better boundaries.
Focusing on English first limits scope while the multilingual strategy (separate indices vs. cross-lingual embeddings) is still under investigation (docs/research/tnh-scholar-knowledge-vector-search.md:210-248).
A phased migration plan keeps the door open for cost/control optimization once we have empirical data on query mix, cost-per-query, and operational complexity.

Consequences¶

Positive: Enables a demonstrable prototype in weeks, surfaces real user queries, and produces concrete metrics needed for later ADRs.
Neutral/Deferred: Multilingual retrieval, sophisticated reranking, and hybrid search remain research tracks tied to Phase 1 learnings.
Negative: Short-term vendor lock-in (OpenAI + Pinecone) and recurring API costs until the open-source evaluation completes.

Next Steps & Open Questions¶

Chunking Experiments: Run A/B tests across paragraph, sliding-window, and hierarchical strategies to inform ADR-K02.
Metadata Schema Draft: Collaborate with the metadata working group to prototype the bibliographic/content/structural fields listed in the research doc (docs/research/tnh-scholar-knowledge-vector-search.md:48-75).
Multilingual Plan: Prototype cross-language retrieval on a bilingual subset to assess whether cross-lingual embeddings or dual indices perform better.
Cost & Privacy Analysis: Document the operational cost envelope for the managed stack and the data-handling implications before onboarding sensitive transcripts.
Evaluation Harness: Build the test query set, gold judgments, and logging needed to compute precision, recall, MAP, and nDCG as outlined (docs/research/tnh-scholar-knowledge-vector-search.md:214-273).

Approval of this ADR should be revisited once Phase 1 metrics and user feedback reports are available.