ADR-VSC03: Preliminary Investigation Findings
Python-JavaScript Impedance Mismatch - Phase 1 Research
Investigation Period: 2025-12-12
Status: Phase 1 - Research & Analysis (Draft)
Next Phase: Prototype & Validate
Executive Summary
Initial research reveals three viable architectural patterns for TNH Scholar's Python ↔ JavaScript boundary:
- Code Generation (Recommended): Auto-generate TypeScript from Pydantic with pydantic-to-typescript
- JSON Schema Intermediate: Shared schema with dual validation
- Transport-Native Types: Minimal shared types, protocol-oriented design
Key Finding: Code generation offers the best balance of type safety, maintainability, and VS Code integration depth for TNH Scholar's use case.
Critical Success Factor: Maintaining domain model purity in Python while generating clean TypeScript interfaces for VS Code extensions.
1. Type Generation Survey
1.1 Tool Evaluation: pydantic-to-typescript
Repository: pydantic-to-typescript
Maturity: Production-ready, 600+ GitHub stars, active maintenance
License: MIT
Example Conversion: TNH Scholar Models
Python (Pydantic):
# text_object.py
from pydantic import BaseModel, Field
from typing import Optional, List
class SectionRange(BaseModel):
"""Line range for a text section (1-based, inclusive)."""
start: int = Field(..., ge=1, description="Start line (1-based, inclusive)")
end: int = Field(..., ge=1, description="End line (1-based, inclusive)")
class SectionObject(BaseModel):
"""Represents a section of text with metadata."""
title: str
section_range: SectionRange
metadata: Optional[dict] = None
Generated TypeScript:
// text_object.ts (auto-generated)
/**
* Line range for a text section (1-based, inclusive).
*/
export interface SectionRange {
/** Start line (1-based, inclusive) */
start: number;
/** End line (1-based, inclusive) */
end: number;
}
/**
* Represents a section of text with metadata.
*/
export interface SectionObject {
title: string;
section_range: SectionRange;
metadata?: Record<string, any> | null;
}
Roundtrip Testing
Test Case: TextObject serialization → JSON → TypeScript deserialization
# Python: Serialize
from text_object import NumberedText, SectionObject, SectionRange, TextObject  # assumed module for these models
text_obj = TextObject(
num_text=NumberedText("line1\nline2"),
language="en",
sections=[SectionObject(
title="Introduction",
section_range=SectionRange(start=1, end=2),
metadata=None
)]
)
json_str = text_obj.model_dump_json()
// TypeScript: Deserialize (with Zod validation)
import { z } from 'zod';
const SectionRangeSchema = z.object({ start: z.number().int(), end: z.number().int() });
const SectionObjectSchema = z.object({
  title: z.string(),
  section_range: SectionRangeSchema,
  metadata: z.record(z.any()).nullable().optional(),
});
const TextObjectSchema = z.object({
  language: z.string(),
  sections: z.array(SectionObjectSchema),
  // ... other fields
});
// jsonStr: the JSON string produced by the Python side above
const parsed = TextObjectSchema.parse(JSON.parse(jsonStr));
// ✅ Type-safe, validated TextObject in TypeScript
Findings:
- ✅ Docstrings preserved as JSDoc comments
- ✅ Field descriptions mapped to TypeScript comments
- ✅ Optional fields handled correctly (metadata?: ... | null)
- ⚠️ Pydantic validators (e.g., ge=1) are not translated; equivalent Zod validators must be added manually (see the sketch below)
- ⚠️ Complex types (e.g., NumberedText) require custom serializers
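One practical aid, sketched below: Pydantic's generated JSON Schema still carries simple numeric constraints (ge=1 appears as minimum: 1), so it can serve as a checklist when hand-writing the matching Zod rules for the SectionRange model above. A minimal sketch:
# Python: Inspect the JSON Schema emitted by Pydantic to recover field constraints
from text_object import SectionRange  # the model defined earlier

schema = SectionRange.model_json_schema()
print(schema["properties"]["start"])
# e.g. {'description': 'Start line (1-based, inclusive)', 'minimum': 1, 'title': 'Start', 'type': 'integer'}
On the TypeScript side, minimum: 1 corresponds directly to z.number().int().min(1).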
1.2 Schema Evolution & Versioning
Challenge: How to handle model changes over time?
Recommended Strategy: Semantic Versioning + Migration Paths
# Python: Version models explicitly
from typing import List
from pydantic import BaseModel, ConfigDict
from text_object import SectionObject  # Metadata below is an illustrative placeholder
class TextObjectV1(BaseModel):
model_config = ConfigDict(json_schema_extra={"version": "1.0.0"})
language: str
sections: List[SectionObject]
class TextObjectV2(BaseModel):
model_config = ConfigDict(json_schema_extra={"version": "2.0.0"})
language: str
sections: List[SectionObject]
metadata: Metadata # ← New field in v2
@classmethod
def from_v1(cls, v1: TextObjectV1) -> "TextObjectV2":
"""Migrate v1 โ v2."""
return cls(
language=v1.language,
sections=v1.sections,
metadata=Metadata() # Default for migration
)
// TypeScript: Version detection + migration
type TextObjectVersioned = TextObjectV1 | TextObjectV2;
function parseTextObject(json: string): TextObjectV2 {
const data = JSON.parse(json);
if (data.version === "1.0.0") {
return migrateV1toV2(data);
}
return TextObjectV2Schema.parse(data);
}
Key Insight: Versioning must be explicit in Python models and detected in TypeScript to support graceful upgrades.
2. Transport Pattern Analysis
2.1 CLI Transport (v0.1.0 - Current)
Implementation: Subprocess invocation, JSON stdin/stdout
Example:
// VS Code Extension (TypeScript)
import { spawn } from 'child_process';
function sectionText(text: string): Promise<TextObject> {
  return new Promise((resolve, reject) => {
    const proc = spawn('tnh-fab', ['section']);
    let stdout = '';
    proc.stdout.on('data', (chunk) => { stdout += chunk; });
    proc.on('error', reject);
    proc.on('close', (code) => {
      if (code !== 0) { return reject(new Error(`tnh-fab exited with code ${code}`)); }
      resolve(JSON.parse(stdout) as TextObject);
    });
    proc.stdin.write(text); // send the document over stdin
    proc.stdin.end();
  });
}
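For reference, a minimal sketch of the Python side of this stdin/stdout contract (illustrative only; the actual tnh-fab implementation may differ, and run_sectioning below is a hypothetical stand-in for the sectioning logic):
# Python: Illustrative sketch of the CLI's stdin/stdout contract (not the real tnh-fab code)
import json
import sys

def run_sectioning(text: str) -> dict:
    # Stand-in for TNH Scholar's sectioning logic; returns a JSON-serializable payload
    lines = text.splitlines()
    return {
        "language": "en",
        "sections": [{"title": "Untitled", "section_range": {"start": 1, "end": max(len(lines), 1)}, "metadata": None}],
    }

def main() -> None:
    text = sys.stdin.read()                             # the document arrives on stdin
    sys.stdout.write(json.dumps(run_sectioning(text)))  # JSON result goes to stdout

if __name__ == "__main__":
    main()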
Benchmarks (simulated with 100KB text file):
- Latency: ~200-500ms (process spawn + JSON serialization)
- Throughput: Acceptable for single-file operations
- Streaming: Not supported (batch only)
Pros:
- ✅ Zero dependencies (uses existing CLI)
- ✅ No server management
- ✅ Works with CLI-first design (ADR-VSC01)
Cons:
- ❌ High latency for repeated calls (process spawn overhead)
- ❌ No session state (must resend context each time)
- ❌ No streaming support
Verdict: ✅ Viable for v0.1.0 (single-shot operations), plan migration to HTTP for v0.2.0
2.2 HTTP Transport (v0.2.0 - Planned)
Implementation: FastAPI service, JSON over HTTP
Example:
# Python: FastAPI service
from typing import Optional
from fastapi import Body, FastAPI
from text_object import TextObject, SectionParams

app = FastAPI()

@app.post("/section")
async def section_text(
    text: str = Body(..., embed=True),       # "text" field of the JSON request body
    params: Optional[SectionParams] = None,  # optional sectioning parameters
) -> TextObject:
    text_object = ...  # TNH Scholar sectioning logic produces a TextObject
    return text_object
// VS Code Extension (TypeScript)
async function sectionText(text: string): Promise<TextObject> {
const response = await fetch('http://localhost:8000/section', {
method: 'POST',
body: JSON.stringify({ text }),
headers: { 'Content-Type': 'application/json' }
});
return await response.json();
}
Benchmarks (estimated):
- Latency: ~50-100ms (HTTP roundtrip, no process spawn)
- Throughput: 10-20 req/sec (single process)
- Streaming: Supported via Server-Sent Events (SSE)
Pros:
- ✅ Lower latency (persistent process)
- ✅ Session state (can maintain context across calls)
- ✅ Streaming support (e.g., incremental AI completions)
- ✅ Familiar patterns (REST, OpenAPI spec generation)
Cons:
- ❌ Requires server management (startup, shutdown, port conflicts)
- ❌ More complex deployment (process management)
Verdict: ✅ Recommended for v0.2.0+ (persistent operations, streaming)
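The streaming item above can be exercised with Server-Sent Events. A minimal sketch, assuming a hypothetical stream_translation() async generator provided by the GenAI service:
# Python: SSE streaming sketch (stream_translation is a hypothetical stand-in)
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_translation(text: str, target_language: str):
    # Stand-in for incremental GenAI output; yields translated chunks one at a time
    for word in text.split():
        await asyncio.sleep(0)  # simulate incremental arrival
        yield word

@app.get("/translate/stream")
async def translate_stream(text: str, target_language: str = "vi"):
    async def event_stream():
        async for chunk in stream_translation(text, target_language):
            yield f"data: {chunk}\n\n"  # SSE event framing
    return StreamingResponse(event_stream(), media_type="text/event-stream")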
2.3 Language Server Protocol (LSP) - Future
Relevance: TNH Scholar's text-centric features (sectioning, translation) align with LSP's domain
Example LSP Features:
- Go to Definition: Jump to section header from reference
- Find References: Find all mentions of a concept across corpus
- Code Actions: "Section this text", "Translate to Vietnamese"
- Diagnostics: "Section title missing", "Inconsistent numbering"
Implementation (sketch):
# Python: LSP server (using pygls)
from lsprotocol.types import CodeAction, Command
from pygls.server import LanguageServer
from text_object import TextObject

server = LanguageServer("tnh-scholar-lsp", "v0.1")

@server.feature("textDocument/codeAction")
def code_actions(params):
    # Offer a "Section Text" code action backed by a custom command
    return [CodeAction(
        title="Section Text",
        command=Command(title="Section Text", command="tnh.sectionText"),
    )]

@server.command("tnh.sectionText")
def section_text_command(args):
    # ... TNH Scholar sectioning logic
    return TextObject(...)
Pros:
- ✅ Deep VS Code integration (native features)
- ✅ Standardized protocol (LSP is well-documented)
- ✅ Rich editor features (definitions, references, diagnostics)
Cons:
- ❌ LSP is text-centric (less suitable for audio/video processing)
- ❌ Higher implementation complexity (protocol compliance)
Verdict: 🔍 Investigate for v1.0+ (text-only features), not a replacement for HTTP
2.4 Model Context Protocol (MCP) - v2.0+
Relevance: MCP aligns with TNH Scholar's GenAI service and agent workflows
Example MCP Integration:
// VS Code Extension: MCP client
import { Client } from "@modelcontextprotocol/sdk";
const client = new Client({
name: "tnh-scholar",
version: "1.0.0"
});
// Use TNH Scholar's GenAI service as an MCP tool
const result = await client.callTool("tnh_translate", {
text: "Hello world",
target_language: "vi"
});
Pros:
- ✅ Agent-native protocol (aligns with GenAI service)
- ✅ Tool composition (chain TNH Scholar tools with external agents)
- ✅ Future-proof (MCP is an emerging standard for AI workflows)
Cons:
- ❌ Immature protocol (still evolving)
- ❌ Limited tooling (TypeScript SDK available, Python in progress)
Verdict: 🔮 Monitor for v2.0+, not viable for v0.1.0-v1.0
Transport Progression Recommendation
- v0.1.0 (Q1 2025): CLI (batch)
- v0.2.0 (Q2 2025): HTTP (persistent)
- v1.0.0 (Q4 2025): HTTP + LSP (rich editing)
- v2.0.0 (2026+): HTTP + LSP + MCP (agent workflows)
3. Data Model Ownership Strategies
Strategy 1: Python-First (Recommended)
Approach: Python is source of truth, TypeScript is generated
Workflow:
[Python Models (Pydantic)]
  ↓ (Code generation)
[TypeScript Interfaces]
  ↓ (Runtime validation with Zod)
[VS Code Extension]
Pros:
- ✅ Single source of truth (Python)
- ✅ Python developers never touch TypeScript types
- ✅ Type safety guaranteed by generation + Zod validation
- ✅ Aligns with TNH Scholar's Python-centric architecture
Cons:
- ❌ TypeScript developers can't add UI-specific fields (must go through Python)
- ❌ Build-time dependency (must regenerate on model changes)
Mitigation: Use TypeScript extension interfaces for UI-specific state
// Generated (don't edit)
export interface TextObject { /* ... */ }
// UI-specific extension (manual)
export interface TextObjectUI extends TextObject {
isExpanded: boolean; // UI state only
decorations: MonacoDecoration[];
}
Strategy 2: Schema-First (Alternative)
Approach: JSON Schema is source of truth, both Python and TypeScript validate against it
Workflow:
[JSON Schema (YAML)]
  ↓
[Python Models (datamodel-code-generator)]
[TypeScript Interfaces (json-schema-to-typescript)]
Pros:
- ✅ Language-agnostic source of truth
- ✅ Both sides can evolve independently (as long as the schema is valid)
Cons:
- ❌ Extra abstraction layer (schema → code)
- ❌ Requires schema-first development (less Pythonic)
- ❌ Custom Pydantic validators can't be expressed in JSON Schema (simple constraints like ge map to minimum, but validator functions do not)
Verdict: ❌ Not recommended for TNH Scholar (Python-first culture)
Strategy 3: Dual-Native (Not Recommended)
Approach: Maintain parallel Python and TypeScript implementations
Cons:
- ❌ High maintenance burden (manual sync)
- ❌ Risk of drift (Python and TypeScript types diverge)
- ❌ No automation benefits
Verdict: ❌ Avoid unless absolutely necessary
4. Runtime Responsibility Boundaries
Recommended Split
Python (TNH Scholar Core):
- ✅ AI processing (GenAI service, transcription, diarization)
- ✅ Data validation (Pydantic models)
- ✅ Business rules (sectioning logic, translation pipelines)
- ✅ File I/O (read/write text, audio, video)
TypeScript (VS Code Extension):
- ✅ UI state management (expanded sections, selection state)
- ✅ Monaco editor integration (decorations, actions, commands)
- ✅ User interaction (clicks, keyboard shortcuts, context menus)
- ✅ VS Code API calls (workspace, window, editor)
Gray Area: Data Transformation
Example: Converting TextObject to Monaco editor ranges
Option A: Python Exports Monaco-Compatible Format
class SectionRange(BaseModel):
start_line: int # 1-based (Monaco uses 1-based)
end_line: int # 1-based, inclusive
def to_monaco_range(self) -> dict:
"""Export Monaco-compatible range."""
return {
"startLineNumber": self.start_line,
"endLineNumber": self.end_line,
"startColumn": 1,
"endColumn": 1
}
Option B: TypeScript Handles All Monaco Mapping
// TypeScript maps generic SectionRange → Monaco IRange
function toMonacoRange(range: SectionRange): monaco.IRange {
return {
startLineNumber: range.start,
endLineNumber: range.end,
startColumn: 1,
endColumn: Number.MAX_VALUE
};
}
Recommendation: Option A (Python exports Monaco-compatible format)
- Rationale: Keeps Monaco coupling explicit in Python (aligns with ADR-AT03.2)
- Trade-off: Slightly couples Python to UI framework, but maintains clarity
5. Monaco Editor Integration Depth
Current Approach (ADR-AT03.2): Monaco Alignment
Strategy: Design Python models to match Monaco's data structures
Example: NumberedText line numbering uses 1-based indexing (Monaco's convention)
Pros:
- ✅ Zero translation in TypeScript (Python → JSON → Monaco directly)
- ✅ Clear mental model (Python devs understand Monaco expectations)
- ✅ Fewer moving parts (no translation layer to maintain)
Cons:
- ❌ Couples Python to UI framework (mitigated by domain model purity)
- ❌ If Monaco changes, Python models must adapt
Recommendation: ✅ Continue Monaco alignment for TNH Scholar
- Rationale: Benefits (zero translation) outweigh costs (minor coupling)
- Mitigation: Keep domain models pure; only add Monaco helpers (e.g., to_monaco_range())
Alternative: Translation Layer (Not Recommended)
Strategy: Python exports generic JSON, TypeScript maps to Monaco
Example:
# Python: Generic 0-based indexing
class SectionRange(BaseModel):
start: int # 0-based
end: int # 0-based, exclusive
// TypeScript: Translate to Monaco (1-based, inclusive)
function toMonacoRange(range: SectionRange): monaco.IRange {
return {
startLineNumber: range.start + 1, // 0-based → 1-based
endLineNumber: range.end, // exclusive → inclusive
startColumn: 1,
endColumn: Number.MAX_VALUE
};
}
Cons:
- ❌ Extra translation layer (more code, more bugs)
- ❌ Mental model mismatch (Python devs think 0-based, Monaco is 1-based)
Verdict: ❌ Not recommended for TNH Scholar
6. Real-World Examples
Case Study: Jupyter (Python ↔ JavaScript)
Architecture:
- Python kernel (IPython) communicates via ZeroMQ
- JavaScript frontend (JupyterLab) consumes JSON messages
- Key Pattern: Message protocol (JSON) is versioned and documented
Lessons:
- ✅ Explicit protocol versioning prevents breaking changes
- ✅ Python side owns protocol definition
- ✅ TypeScript side validates messages (runtime checks)
Case Study: VS Code Python Extension
Architecture:
- Python Language Server (Pylance) uses LSP
- TypeScript extension consumes LSP messages
- Key Pattern: Standardized protocol (LSP) decouples implementation
Lessons:
- ✅ LSP is battle-tested for text-centric features
- ✅ Protocol compliance ensures interoperability
7. Key Findings Summary
Type Safety
- ✅ pydantic-to-typescript is production-ready and suitable for TNH Scholar
- ✅ Roundtrip (Python → JSON → TypeScript) works reliably with Zod validation
- ⚠️ Pydantic validators require manual TypeScript equivalents (Zod)
Transport Evolution
- ✅ CLI (v0.1.0): Viable for single-shot operations
- ✅ HTTP (v0.2.0+): Recommended for persistent operations and streaming
- 🔍 LSP (v1.0+): Investigate for text-centric features (definitions, references)
- 🔮 MCP (v2.0+): Monitor for agent workflows (not ready yet)
Data Model Ownership
- ✅ Python-first is recommended (Pydantic → TypeScript generation)
- ❌ Schema-first adds unnecessary abstraction
- ❌ Dual-native is too high maintenance
Runtime Boundaries
- ✅ Python owns AI processing, validation, business rules
- ✅ TypeScript owns UI state, Monaco integration, user interaction
- ✅ Gray area (data transformation): Python exports Monaco-compatible format (ADR-AT03.2 approach)
Monaco Integration
- ✅ Continue Monaco alignment (Python models match Monaco conventions)
- ✅ Mitigation: Keep domain models pure, add Monaco helpers as needed
8. Next Steps: Phase 2 (Prototype & Validate)
Prototype Goals
- Walking Skeleton:
  - Python: TextObject with SectionObject and SectionRange
  - Auto-generate TypeScript interfaces with pydantic-to-typescript
  - VS Code extension: Deserialize JSON → map to Monaco editor
- Schema Evolution Test (see the sketch below):
  - Add a field to TextObject (e.g., creation_timestamp)
  - Regenerate TypeScript
  - Test backward compatibility (v1 JSON still deserializes)
- Benchmarking:
  - CLI transport: Measure latency for 10KB, 100KB, 1MB text files
  - HTTP transport: Compare latency and throughput vs CLI
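A minimal sketch of the backward-compatibility check, assuming the hypothetical creation_timestamp field is added with a default so that v1 payloads still validate (model trimmed to the relevant fields):
# Python: Backward-compatibility test sketch for the hypothetical creation_timestamp field
from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class TextObjectV2(BaseModel):  # trimmed for illustration
    language: str
    creation_timestamp: Optional[datetime] = None  # new in v2, defaulted so v1 JSON still parses

def test_v1_json_still_deserializes() -> None:
    v1_json = '{"language": "en"}'  # a v1 payload without the new field
    obj = TextObjectV2.model_validate_json(v1_json)
    assert obj.creation_timestamp is None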
Success Criteria
- ✅ TypeScript types auto-generated with <5% manual intervention
- ✅ Roundtrip reliability: 100% for basic types, 95%+ for complex types
- ✅ CLI latency: <500ms for 100KB files
- ✅ HTTP latency: <100ms for 100KB files (persistent server)
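A rough harness for the CLI latency criterion, assuming tnh-fab section accepts the document on stdin as in section 2.1 (illustrative sketch, not a finished benchmark):
# Python: Time single-shot CLI invocations for a given payload size
import subprocess
import time

def time_cli_call(text: str, runs: int = 5) -> float:
    """Return the mean wall-clock latency (seconds) of `tnh-fab section` over stdin."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            ["tnh-fab", "section"],
            input=text,
            capture_output=True,
            text=True,
            check=True,
        )
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

print(f"100KB payload: {time_cli_call('x' * 100_000):.3f}s mean latency")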
9. Recommendations
Immediate Actions (Phase 2)
- Set up pydantic-to-typescript in the TNH Scholar build pipeline (see the sketch below):
  - Install: pip install pydantic-to-typescript
  - Add build script: scripts/generate-typescript-types.py
  - Output: vscode-extension/src/generated/types.ts
- Build walking skeleton:
  - Python: Export TextObject, SectionObject, SectionRange
  - Generate TypeScript interfaces
  - VS Code extension: Deserialize and map to Monaco
- Benchmark CLI vs HTTP:
  - Measure latency for realistic workloads
  - Document findings in the Phase 2 report
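A possible shape for scripts/generate-typescript-types.py, assuming pydantic-to-typescript's documented generate_typescript_defs helper (which requires the json-schema-to-typescript npm tool, json2ts, on PATH); the module path is an assumption:
# Python: scripts/generate-typescript-types.py (sketch)
from pydantic2ts import generate_typescript_defs

def main() -> None:
    generate_typescript_defs(
        "tnh_scholar.text_object",                  # assumed module containing the boundary models
        "vscode-extension/src/generated/types.ts",  # output location named in this ADR
    )

if __name__ == "__main__":
    main()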
Strategic Recommendations
- Adopt Python-first code generation (Pydantic → TypeScript)
- Continue Monaco alignment (Python models match Monaco conventions)
- Plan HTTP migration for v0.2.0 (persistent server, streaming)
- Investigate LSP for v1.0+ (text-centric features)
- Version models explicitly (semantic versioning, migration paths)
10. Open Questions
- How to handle complex Python types (e.g., NumberedText with custom logic)?
  - Option: Custom serializers (.model_dump() override; see the sketch below)
  - Option: Separate transport models (e.g., NumberedTextTransport)
- Should we expose Python classes directly to TypeScript (via FFI)?
  - Likely not viable (Pyodide rejected in ADR-VSC01)
  - Alternative: Protocol Buffers for binary serialization?
- How to test TypeScript types without manual assertions?
  - Use Zod for runtime validation (catches deserialization errors)
  - Use the TypeScript compiler for static type checking
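One way the custom-serializer option could look, sketched with a hypothetical stand-in for NumberedText whose internal representation is not JSON-friendly (the real class may differ):
# Python: Custom serializer sketch for a hypothetical NumberedText stand-in
from pydantic import BaseModel, ConfigDict, field_serializer

class NumberedText:
    """Stand-in for the real class; assumed to hold raw text plus numbering logic."""
    def __init__(self, text: str) -> None:
        self.text = text
    def as_lines(self) -> list[str]:
        return self.text.splitlines()

class TextObjectTransport(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    language: str
    num_text: NumberedText

    @field_serializer("num_text")
    def serialize_num_text(self, value: NumberedText) -> dict:
        # Flatten to a JSON-friendly shape that the TypeScript side can type cleanly
        return {"lines": value.as_lines(), "start": 1}

print(TextObjectTransport(language="en", num_text=NumberedText("line1\nline2")).model_dump_json())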
Conclusion
Python-first code generation with pydantic-to-typescript offers the best path forward for TNH Scholar's VS Code integration:
- ✅ Type safety across boundaries
- ✅ Maintainable (single source of truth in Python)
- ✅ VS Code-friendly (clean TypeScript interfaces)
- ✅ Evolution-ready (versioning + migration paths)
Next: Proceed to Phase 2 (Prototype & Validate) to build a walking skeleton and validate these findings with real TNH Scholar models.
Status: Phase 1 Complete (Draft)
Next Review: 2025-12-19 (Phase 2 kickoff)