ADR-VSC03: Preliminary Investigation Findings¶
Phase 1 research findings on strategies for mitigating the Python-JavaScript impedance mismatch in TNH Scholar's VS Code integration.
Python-JavaScript Impedance Mismatch - Phase 1 Research¶
Investigation Period: 2025-12-12
Status: Phase 1 - Research & Analysis (Draft)
Next Phase: Prototype & Validate
Executive Summary¶
Initial research reveals three viable architectural patterns for TNH Scholar's Python ↔ JavaScript boundary:
- Code Generation (Recommended): Auto-generate TypeScript from Pydantic with pydantic-to-typescript
- JSON Schema Intermediate: Shared schema with dual validation
- Transport-Native Types: Minimal shared types, protocol-oriented design
Key Finding: Code generation offers the best balance of type safety, maintainability, and VS Code integration depth for TNH Scholar's use case.
Critical Success Factor: Maintaining domain model purity in Python while generating clean TypeScript interfaces for VS Code extensions.
1. Type Generation Survey¶
1.1 Tool Evaluation: pydantic-to-typescript¶
Repository: pydantic-to-typescript
Maturity: Production-ready, 600+ GitHub stars, active maintenance
License: MIT
Example Conversion: TNH Scholar Models¶
Python (Pydantic):
# text_object.py
from pydantic import BaseModel, Field
from typing import Optional, List

class SectionRange(BaseModel):
    """Line range for a text section (1-based, inclusive)."""
    start: int = Field(..., ge=1, description="Start line (1-based, inclusive)")
    end: int = Field(..., ge=1, description="End line (1-based, inclusive)")

class SectionObject(BaseModel):
    """Represents a section of text with metadata."""
    title: str
    section_range: SectionRange
    metadata: Optional[dict] = None
Generated TypeScript:
// text_object.ts (auto-generated)
/**
 * Line range for a text section (1-based, inclusive).
 */
export interface SectionRange {
  /** Start line (1-based, inclusive) */
  start: number;
  /** End line (1-based, inclusive) */
  end: number;
}

/**
 * Represents a section of text with metadata.
 */
export interface SectionObject {
  title: string;
  section_range: SectionRange;
  metadata?: Record<string, any> | null;
}
Roundtrip Testing¶
Test Case: TextObject serialization → JSON → TypeScript deserialization
# Python: Serialize
text_obj = TextObject(
    num_text=NumberedText("line1\nline2"),
    language="en",
    sections=[SectionObject(
        title="Introduction",
        section_range=SectionRange(start=1, end=2),
        metadata=None
    )]
)
json_str = text_obj.model_dump_json()
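// TypeScript: Deserialize (with Zod validation)
import { z } from 'zod';
const SectionRangeSchema = z.object({
  start: z.number().int().min(1), // mirrors Pydantic's ge=1 by hand
  end: z.number().int().min(1),
});
const SectionObjectSchema = z.object({
  title: z.string(),
  section_range: SectionRangeSchema,
  metadata: z.record(z.string(), z.any()).nullable().optional(),
});
const TextObjectSchema = z.object({
  language: z.string(),
  sections: z.array(SectionObjectSchema),
  // ... other fields
});
const parsed = TextObjectSchema.parse(JSON.parse(jsonStr)); // jsonStr: output of model_dump_json()
// ✅ Type-safe, validated TextObject in TypeScript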
Findings:
- ✅ Docstrings preserved as JSDoc comments
- ✅ Field descriptions mapped to TypeScript comments
- ✅ Optional fields handled correctly (metadata?: ... | null)
- ⚠️ Pydantic validators (e.g., ge=1) not translated (must add Zod validators manually)
- ⚠️ Complex types (e.g., NumberedText) require custom serializers
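One way to keep the hand-written Zod validators in step with the Python models is to list the rules that a plain TypeScript interface cannot carry. The sketch below walks a JSON Schema dictionary shaped like the one Pydantic v2 emits for SectionRange, where ge=1 appears as "minimum": 1. The schema literal and the helper name list_constraints are illustrative assumptions, not part of TNH Scholar, and the block uses only the standard library.

```python
"""List JSON Schema constraints that plain TypeScript interfaces cannot express."""

# Keywords that carry validation rules but have no TypeScript type equivalent.
CONSTRAINT_KEYWORDS = (
    "minimum", "maximum", "exclusiveMinimum", "exclusiveMaximum",
    "minLength", "maxLength", "pattern",
)

# Assumed input, shaped like the output of SectionRange.model_json_schema().
SECTION_RANGE_SCHEMA = {
    "title": "SectionRange",
    "type": "object",
    "properties": {
        "start": {"type": "integer", "minimum": 1},
        "end": {"type": "integer", "minimum": 1},
    },
    "required": ["start", "end"],
}


def list_constraints(schema: dict) -> dict:
    """Return {field: {keyword: value}} for every constrained property."""
    found = {}
    for name, prop in schema.get("properties", {}).items():
        rules = {key: prop[key] for key in CONSTRAINT_KEYWORDS if key in prop}
        if rules:
            found[name] = rules
    return found
```

Each reported rule corresponds to one Zod refinement to write by hand; for example, a minimum of 1 on an integer field maps to z.number().int().min(1).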
1.2 Schema Evolution & Versioning¶
Challenge: How to handle model changes over time?
Recommended Strategy: Semantic Versioning + Migration Paths
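# Python: Version models explicitly
from typing import List, Literal
from pydantic import BaseModel, ConfigDict

class TextObjectV1(BaseModel):
    # json_schema_extra only annotates the JSON Schema; the version field
    # below is what actually appears in the serialized payload.
    model_config = ConfigDict(json_schema_extra={"version": "1.0.0"})
    version: Literal["1.0.0"] = "1.0.0"
    language: str
    sections: List[SectionObject]

class TextObjectV2(BaseModel):
    model_config = ConfigDict(json_schema_extra={"version": "2.0.0"})
    version: Literal["2.0.0"] = "2.0.0"
    language: str
    sections: List[SectionObject]
    metadata: Metadata  # ← New field in v2

    @classmethod
    def from_v1(cls, v1: TextObjectV1) -> "TextObjectV2":
        """Migrate v1 → v2."""
        return cls(
            language=v1.language,
            sections=v1.sections,
            metadata=Metadata()  # Default for migration
        )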
// TypeScript: Version detection + migration
type TextObjectVersioned = TextObjectV1 | TextObjectV2;

function parseTextObject(json: string): TextObjectV2 {
  const data = JSON.parse(json);
  if (data.version === "1.0.0") {
    return migrateV1toV2(data);
  }
  return TextObjectV2Schema.parse(data);
}
Key Insight: Versioning must be explicit in Python models and detected in TypeScript to support graceful upgrades.
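The same detect-then-migrate step can run on raw payloads before any model validation, which is useful when stored v1 JSON has to be read back. The following is a minimal sketch using plain dictionaries; it assumes the serialized payload carries a top-level version field, and the names load_text_object, migrate_v1_to_v2, and MIGRATIONS are hypothetical.

```python
"""Detect the payload version and migrate old payloads before validation."""
import json

CURRENT_VERSION = "2.0.0"


def migrate_v1_to_v2(data: dict) -> dict:
    """Add the v2-only metadata field with an empty default."""
    upgraded = dict(data)
    upgraded["version"] = "2.0.0"
    upgraded.setdefault("metadata", {})
    return upgraded


# Each entry upgrades a payload by exactly one version step.
MIGRATIONS = {"1.0.0": migrate_v1_to_v2}


def load_text_object(raw: str) -> dict:
    """Parse JSON and apply migrations until the payload is current."""
    data = json.loads(raw)
    # Assumption: payloads without a version field predate versioning (v1).
    version = data.get("version", "1.0.0")
    while version != CURRENT_VERSION:
        if version not in MIGRATIONS:
            raise ValueError(f"Unsupported TextObject version: {version}")
        data = MIGRATIONS[version](data)
        version = data["version"]
    return data
```

Chaining single-step migrations keeps each function small: supporting a later version means adding one entry to the table rather than rewriting older migrations.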
2. Transport Pattern Analysis¶
2.1 CLI Transport (v0.1.0 - Current)¶
Implementation: Subprocess invocation, JSON stdin/stdout
Example:
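// VS Code Extension (TypeScript)
import { spawn } from 'child_process';
// exec() is callback-based and has no `input` option, so spawn the CLI and write to its stdin
function sectionText(text: string): Promise<TextObject> {
  return new Promise((resolve, reject) => {
    const child = spawn('tnh-fab', ['section']);
    let stdout = '';
    child.stdout.setEncoding('utf-8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.on('error', reject);
    child.on('close', (code) => {
      if (code !== 0) return reject(new Error(`tnh-fab exited with code ${code}`));
      try { resolve(JSON.parse(stdout)); } catch (err) { reject(err); }
    });
    child.stdin.end(text);
  });
}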
Benchmarks (simulated with 100KB text file):
- Latency: ~200-500ms (process spawn + JSON serialization)
- Throughput: Acceptable for single-file operations
- Streaming: Not supported (batch only)
Pros:
- ✅ Zero dependencies (uses existing CLI)
- ✅ No server management
- ✅ Works with CLI-first design (ADR-VSC01)
Cons:
- ❌ High latency for repeated calls (process spawn overhead)
- ❌ No session state (must resend context each time)
- ❌ No streaming support
Verdict: ✅ Viable for v0.1.0 (single-shot operations), plan migration to HTTP for v0.2.0
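On the Python side, the CLI transport depends on a simple contract: read the whole document from stdin, write exactly one JSON object to stdout, and signal failure through the exit code. The sketch below illustrates that contract only. The blank-line sectioning rule, the hard-coded language, and the function names are placeholders, not TNH Scholar's actual logic.

```python
"""Sketch of a stdin → JSON stdout command, the contract the CLI transport relies on."""
import io
import json
import sys


def split_sections(text: str) -> list:
    """Placeholder sectioning: one section per blank-line-separated block."""
    sections, start = [], None
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):  # 1-based line numbers
        if line.strip() and start is None:
            start = number
        elif not line.strip() and start is not None:
            sections.append({"start": start, "end": number - 1})
            start = None
    if start is not None:
        sections.append({"start": start, "end": len(lines)})
    return sections


def run(stdin=sys.stdin, stdout=sys.stdout) -> int:
    """Read the document from stdin and write a single JSON object to stdout."""
    text = stdin.read()
    payload = {"language": "en", "sections": split_sections(text)}  # language is a placeholder
    json.dump(payload, stdout)
    stdout.write("\n")
    return 0  # a non-zero return value would signal failure to the extension
```

A console entry point would call sys.exit(run()). Keeping the streams injectable lets the contract be tested with io.StringIO instead of a real subprocess.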
2.2 HTTP Transport (v0.2.0 - Planned)¶
Implementation: FastAPI service, JSON over HTTP
Example:
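# Python: FastAPI service
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel
from text_object import TextObject, SectionParams

app = FastAPI()

class SectionRequest(BaseModel):
    """JSON request body; a bare `text: str` parameter would be read from the query string."""
    text: str
    params: Optional[SectionParams] = None

@app.post("/section")
async def section_text(request: SectionRequest) -> TextObject:
    # ... TNH Scholar sectioning logic
    return text_object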
// VS Code Extension (TypeScript)
async function sectionText(text: string): Promise<TextObject> {
  const response = await fetch('http://localhost:8000/section', {
    method: 'POST',
    body: JSON.stringify({ text }),
    headers: { 'Content-Type': 'application/json' }
  });
  if (!response.ok) {
    throw new Error(`Sectioning request failed: HTTP ${response.status}`);
  }
  return await response.json();
}
Benchmarks (estimated):
- Latency: ~50-100ms (HTTP roundtrip, no process spawn)
- Throughput: 10-20 req/sec (single process)
- Streaming: Supported via Server-Sent Events (SSE)
Pros:
- ✅ Lower latency (persistent process)
- ✅ Session state (can maintain context across calls)
- ✅ Streaming support (e.g., incremental AI completions)
- ✅ Familiar patterns (REST, OpenAPI spec generation)
Cons:
- ❌ Requires server management (startup, shutdown, port conflicts)
- ❌ More complex deployment (process management)
Verdict: ✅ Recommended for v0.2.0+ (persistent operations, streaming)
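The streaming option rests on the Server-Sent Events wire format: each event is a block of "field: value" lines ended by a blank line. The sketch below shows that framing with the standard library only. The event names section and end and the payload shape are assumptions made for illustration.

```python
"""Format incremental results as Server-Sent Events (SSE) frames."""
import json
from typing import Iterable, Iterator


def sse_event(data: dict, event: str = "message") -> str:
    """Encode one SSE frame: an event name, a JSON data line, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_sections(sections: Iterable[dict]) -> Iterator[str]:
    """Yield one frame per section as it becomes available, then a final marker."""
    for index, section in enumerate(sections):
        yield sse_event({"index": index, "section": section}, event="section")
    yield sse_event({"done": True}, event="end")
```

In a FastAPI service, a generator like this can be returned through StreamingResponse with media_type="text/event-stream", so the extension can render sections as they arrive instead of waiting for the full result.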
2.3 Language Server Protocol (LSP) - Future¶
Relevance: TNH Scholar's text-centric features (sectioning, translation) align with LSP's domain
Example LSP Features:
- Go to Definition: Jump to section header from reference
- Find References: Find all mentions of a concept across corpus
- Code Actions: "Section this text", "Translate to Vietnamese"
- Diagnostics: "Section title missing", "Inconsistent numbering"
Implementation (sketch):
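# Python: LSP server (using pygls 1.x; import paths differ in other major versions)
from lsprotocol.types import CodeAction, Command
from pygls.server import LanguageServer
from text_object import TextObject

server = LanguageServer("tnh-scholar", "v0.1")  # name and version are placeholders

@server.feature("textDocument/codeAction")
def code_actions(params):
    # Offer "Section Text" action; CodeAction.command takes a Command object
    return [CodeAction(
        title="Section Text",
        command=Command(title="Section Text", command="tnh.sectionText"),
    )]

@server.command("tnh.sectionText")
def section_text_command(args):
    # ... TNH Scholar sectioning logic
    return TextObject(...)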
Pros:
- ✅ Deep VS Code integration (native features)
- ✅ Standardized protocol (LSP is well-documented)
- ✅ Rich editor features (definitions, references, diagnostics)
Cons:
- ❌ LSP is text-centric (less suitable for audio/video processing)
- ❌ Higher implementation complexity (protocol compliance)
Verdict: 🔍 Investigate for v1.0+ (text-only features), not a replacement for HTTP
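Part of the protocol-compliance cost is the LSP base protocol itself: every message is a JSON-RPC body preceded by a Content-Length header. Libraries such as pygls handle this framing, so the sketch below is only meant to show what travels over the wire. The file URI and the helper names are hypothetical.

```python
"""Encode and decode LSP base-protocol messages (Content-Length framed JSON-RPC)."""
import json


def encode_message(payload: dict) -> bytes:
    """Prefix the UTF-8 JSON body with a Content-Length header."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(raw: bytes) -> dict:
    """Split header from body and parse exactly Content-Length bytes."""
    header, _, rest = raw.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("Missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# A code-action request as a client might send it (hypothetical file URI).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/codeAction",
    "params": {
        "textDocument": {"uri": "file:///example.txt"},
        "range": {
            "start": {"line": 0, "character": 0},
            "end": {"line": 1, "character": 0},
        },
        "context": {"diagnostics": []},
    },
}
```

Note that LSP positions are zero-based, so 1-based SectionRange values would need converting at this boundary.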
2.4 Model Context Protocol (MCP) - v2.0+¶
Relevance: MCP aligns with TNH Scholar's GenAI service and agent workflows
Example MCP Integration:
// VS Code Extension: MCP client
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
const client = new Client({
  name: "tnh-scholar",
  version: "1.0.0"
});
// Assumes client.connect(transport) has already been awaited
// Use TNH Scholar's GenAI service as an MCP tool
const result = await client.callTool({
  name: "tnh_translate",
  arguments: {
    text: "Hello world",
    target_language: "vi"
  }
});
Pros:
- ✅ Agent-native protocol (aligns with GenAI service)
- ✅ Tool composition (chain TNH Scholar tools with external agents)
- ✅ Future-proof (MCP is emerging standard for AI workflows)
Cons:
- ❌ Immature protocol (still evolving)
- ❌ Limited tooling (official TypeScript and Python SDKs exist, but both are still changing along with the protocol)
Verdict: 🔮 Monitor for v2.0+, not viable for v0.1.0-v1.0
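Underneath the SDKs, MCP tool invocation is a JSON-RPC 2.0 exchange: the client sends a tools/call request and receives a result containing a list of content items. The sketch below builds such a request and reads a text result using the standard library only. The tool name tnh_translate comes from the example above and remains hypothetical, as do the helper names.

```python
"""Build the JSON-RPC 2.0 request that an MCP client sends to invoke a tool."""
import itertools

_request_ids = itertools.count(1)  # request ids must be unique within a session


def tool_call_request(name: str, arguments: dict) -> dict:
    """Return a tools/call request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def first_text(result: dict) -> str:
    """Extract the first text item from a tool result, raising on tool errors."""
    if result.get("isError"):
        raise RuntimeError("Tool reported an error")
    for item in result.get("content", []):
        if item.get("type") == "text":
            return item["text"]
    raise ValueError("Tool result contains no text content")
```

Seeing the message shape makes the evaluation more concrete: exposing a TNH Scholar capability as an MCP tool largely means mapping a name and an arguments dictionary onto an existing Python function.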
Transport Progression Recommendation¶
v0.1.0 (Q1 2025)    v0.2.0 (Q2 2025)    v1.0.0 (Q4 2025)    v2.0.0 (2026+)
CLI              →  HTTP             →  HTTP + LSP       →  HTTP + LSP + MCP
(Batch)             (Persistent)        (Rich editing)      (Agent workflows)
3. Data Model Ownership Strategies¶
Strategy 1: Python-First (Recommended)¶
Approach: Python is source of truth, TypeScript is generated
Workflow:
[Python Models (Pydantic)]
↓ (Code generation)
[TypeScript Interfaces]
↓ (Runtime validation with Zod)
[VS Code Extension]
Pros:
- ✅ Single source of truth (Python)
- ✅ Python developers never touch TypeScript types
- ✅ Type safety guaranteed by generation + Zod validation
- ✅ Aligns with TNH Scholar's Python-centric architecture
Cons:
- ❌ TypeScript developers can't add UI-specific fields (must go through Python)
- ❌ Build-time dependency (must regenerate on model changes)
Mitigation: Use TypeScript extension interfaces for UI-specific state
// Generated (don't edit)
export interface TextObject { /* ... */ }

// UI-specific extension (manual)
export interface TextObjectUI extends TextObject {
  isExpanded: boolean; // UI state only
  decorations: MonacoDecoration[];
}
Strategy 2: Schema-First (Alternative)¶
Approach: JSON Schema is source of truth, both Python and TypeScript validate against it
Workflow:
[JSON Schema (YAML)]
↓
[Python Models (datamodel-code-generator)]
[TypeScript Interfaces (json-schema-to-typescript)]
Pros:
- ✅ Language-agnostic source of truth
- ✅ Both sides can evolve independently (as long as schema is valid)
Cons:
- ❌ Extra abstraction layer (schema → code)
- ❌ Requires schema-first development (less Pythonic)
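- ❌ Custom Pydantic validators (arbitrary Python logic) can't be expressed in JSON Schema; only declarative constraints such as ge=1 (emitted as minimum) carry over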
Verdict: ❌ Not recommended for TNH Scholar (Python-first culture)
Strategy 3: Dual-Native (Not Recommended)¶
Approach: Maintain parallel Python and TypeScript implementations
Cons:
- ❌ High maintenance burden (manual sync)
- ❌ Risk of drift (Python and TypeScript types diverge)
- ❌ No automation benefits
Verdict: ❌ Avoid unless absolutely necessary
4. Runtime Responsibility Boundaries¶
Recommended Split¶
Python (TNH Scholar Core):
- ✅ AI processing (GenAI service, transcription, diarization)
- ✅ Data validation (Pydantic models)
- ✅ Business rules (sectioning logic, translation pipelines)
- ✅ File I/O (read/write text, audio, video)
TypeScript (VS Code Extension):
- ✅ UI state management (expanded sections, selection state)
- ✅ Monaco editor integration (decorations, actions, commands)
- ✅ User interaction (clicks, keyboard shortcuts, context menus)
- ✅ VS Code API calls (workspace, window, editor)
Gray Area: Data Transformation
Example: Converting TextObject to Monaco editor ranges
Option A: Python Exports Monaco-Compatible Format
class SectionRange(BaseModel):
    start_line: int  # 1-based (Monaco uses 1-based)
    end_line: int    # 1-based, inclusive

    def to_monaco_range(self) -> dict:
        """Export Monaco-compatible range."""
        return {
            "startLineNumber": self.start_line,
            "endLineNumber": self.end_line,
            "startColumn": 1,
            "endColumn": 1
        }
Option B: TypeScript Handles All Monaco Mapping
// TypeScript maps generic SectionRange → Monaco IRange
function toMonacoRange(range: SectionRange): monaco.IRange {
  return {
    startLineNumber: range.start,
    endLineNumber: range.end,
    startColumn: 1,
    endColumn: Number.MAX_VALUE
  };
}
Recommendation: Option A (Python exports Monaco-compatible format)
- Rationale: Keeps Monaco coupling explicit in Python (aligns with ADR-AT03.2)
- Trade-off: Slightly couples Python to UI framework, but maintains clarity
5. Monaco Editor Integration Depth¶
Current Approach (ADR-AT03.2): Monaco Alignment¶
Strategy: Design Python models to match Monaco's data structures
Example: NumberedText line numbering uses 1-based indexing (Monaco's convention)
Pros:
- ✅ Zero translation in TypeScript (Python → JSON → Monaco directly)
- ✅ Clear mental model (Python devs understand Monaco expectations)
- ✅ Fewer moving parts (no translation layer to maintain)
Cons:
- ❌ Couples Python to UI framework (mitigated by domain model purity)
- ❌ If Monaco changes, Python models must adapt
Recommendation: ✅ Continue Monaco alignment for TNH Scholar
- Rationale: Benefits (zero translation) outweigh costs (minor coupling)
- Mitigation: Keep domain models pure, only add Monaco helpers (e.g., to_monaco_range())
Alternative: Translation Layer (Not Recommended)¶
Strategy: Python exports generic JSON, TypeScript maps to Monaco
Example:
# Python: Generic 0-based indexing
class SectionRange(BaseModel):
    start: int  # 0-based
    end: int    # 0-based, exclusive

// TypeScript: Translate to Monaco (1-based, inclusive)
function toMonacoRange(range: SectionRange): monaco.IRange {
  return {
    startLineNumber: range.start + 1, // 0→1 based
    endLineNumber: range.end,         // Exclusive→inclusive
    startColumn: 1,
    endColumn: Number.MAX_VALUE
  };
}
Cons:
- ❌ Extra translation layer (more code, more bugs)
- ❌ Mental model mismatch (Python devs think 0-based, Monaco is 1-based)
Verdict: ❌ Not recommended for TNH Scholar
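The off-by-one arithmetic a translation layer has to get right can be written down and checked directly. The sketch below converts between 0-based half-open ranges and 1-based inclusive ranges; the function names are illustrative only.

```python
"""Convert between 0-based half-open ranges and 1-based inclusive ranges."""


def to_one_based_inclusive(start: int, end: int) -> tuple:
    """[start, end) counted from 0 becomes [start + 1, end] counted from 1."""
    if not 0 <= start < end:
        raise ValueError("Expected 0 <= start < end")
    return start + 1, end


def to_zero_based_exclusive(start_line: int, end_line: int) -> tuple:
    """Inverse mapping: [start_line, end_line] from 1 becomes [start_line - 1, end_line) from 0."""
    if not 1 <= start_line <= end_line:
        raise ValueError("Expected 1 <= start_line <= end_line")
    return start_line - 1, end_line
```

The end value is numerically unchanged in both directions because the exclusive-to-inclusive shift (minus one) and the 0-to-1 shift (plus one) cancel out. That is easy to misread as a missing conversion, which is the kind of subtle bug the translation layer invites.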
6. Real-World Examples¶
Case Study: Jupyter (Python ↔ JavaScript)¶
Architecture:
- Python kernel (IPython) communicates via ZeroMQ
- JavaScript frontend (JupyterLab) consumes JSON messages
- Key Pattern: Message protocol (JSON) is versioned and documented
Lessons:
- ✅ Explicit protocol versioning makes breaking changes detectable, so clients can handle them deliberately
- ✅ Python side owns protocol definition
- ✅ TypeScript side validates messages (runtime checks)
Case Study: VS Code Python Extension¶
Architecture:
- Language server for Python (Pylance, built on Pyright) uses LSP
- TypeScript extension consumes LSP messages
- Key Pattern: Standardized protocol (LSP) decouples implementation
Lessons:
- ✅ LSP is battle-tested for text-centric features
- ✅ Protocol compliance ensures interoperability
7. Key Findings Summary¶
Type Safety¶
- ✅ pydantic-to-typescript is production-ready and suitable for TNH Scholar
- ✅ Roundtrip (Python → JSON → TypeScript) works reliably with Zod validation
- ⚠️ Pydantic validators require manual TypeScript equivalents (Zod)
Transport Evolution¶
- ✅ CLI (v0.1.0): Viable for single-shot operations
- ✅ HTTP (v0.2.0+): Recommended for persistent operations and streaming
- 🔍 LSP (v1.0+): Investigate for text-centric features (definitions, references)
- 🔮 MCP (v2.0+): Monitor for agent workflows (not ready yet)
Data Model Ownership¶
- ✅ Python-first is recommended (Pydantic → TypeScript generation)
- ❌ Schema-first adds unnecessary abstraction
- ❌ Dual-native is too high maintenance
Runtime Boundaries¶
- ✅ Python owns AI processing, validation, business rules
- ✅ TypeScript owns UI state, Monaco integration, user interaction
- ✅ Gray area (data transformation): Python exports Monaco-compatible format (ADR-AT03.2 approach)
Monaco Integration¶
- ✅ Continue Monaco alignment (Python models match Monaco conventions)
- ✅ Mitigation: Keep domain models pure, add Monaco helpers as needed
8. Next Steps: Phase 2 (Prototype & Validate)¶
Prototype Goals¶
- Walking Skeleton:
  - Python: TextObject with SectionObject and SectionRange
  - Auto-generate TypeScript interfaces with pydantic-to-typescript
  - VS Code extension: Deserialize JSON → map to Monaco editor
- Schema Evolution Test:
  - Add field to TextObject (e.g., creation_timestamp)
  - Regenerate TypeScript
  - Test backward compatibility (v1 JSON still deserializes)
- Benchmarking:
  - CLI transport: Measure latency for 10KB, 100KB, 1MB text files
  - HTTP transport: Compare latency and throughput vs CLI
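For the benchmarking goal, a small harness that times any callable over the three payload sizes keeps the CLI and HTTP measurements comparable. The sketch below is transport-agnostic and uses only the standard library; the function names, the filler text, and the default of five runs are assumptions.

```python
"""Measure median latency of a callable across payload sizes."""
import statistics
import time
from typing import Callable, Dict, Iterable


def make_payload(size_bytes: int) -> str:
    """Build an ASCII text payload of exactly size_bytes characters from short lines."""
    line = "lorem ipsum dolor sit amet\n"
    repeats = size_bytes // len(line) + 1
    return (line * repeats)[:size_bytes]


def benchmark(
    call: Callable[[str], object],
    sizes: Iterable[int] = (10_000, 100_000, 1_000_000),
    runs: int = 5,
) -> Dict[int, float]:
    """Return {payload size: median latency in milliseconds}."""
    results = {}
    for size in sizes:
        payload = make_payload(size)
        samples = []
        for _ in range(runs):
            started = time.perf_counter()
            call(payload)
            samples.append((time.perf_counter() - started) * 1000)
        results[size] = statistics.median(samples)
    return results
```

Passing one callable that shells out to the CLI and another that posts to the HTTP service would produce directly comparable numbers. The median is used because process-spawn timings tend to have occasional outliers.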
Success Criteria¶
- ✅ TypeScript types auto-generated with <5% manual intervention
- ✅ Roundtrip reliability: 100% for basic types, 95%+ for complex types
- ✅ CLI latency: <500ms for 100KB files
- ✅ HTTP latency: <100ms for 100KB files (persistent server)
9. Recommendations¶
Immediate Actions (Phase 2)¶
- Set up pydantic-to-typescript in TNH Scholar build pipeline
  - Install: pip install pydantic-to-typescript
  - Add build script: scripts/generate-typescript-types.py
  - Output: vscode-extension/src/generated/types.ts
- Build walking skeleton:
  - Python: Export TextObject, SectionObject, SectionRange
  - Generate TypeScript interfaces
  - VS Code extension: Deserialize and map to Monaco
- Benchmark CLI vs HTTP:
  - Measure latency for realistic workloads
  - Document findings in Phase 2 report
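The build script could be little more than a wrapper around the pydantic2ts command that pydantic-to-typescript installs. The following is a minimal sketch, not the project's actual script. The module path tnh_scholar.text_object is an assumption, and the generator additionally needs the json2ts binary from the npm package json-schema-to-typescript on the PATH.

```python
"""Sketch of scripts/generate-typescript-types.py: regenerate types from Pydantic models."""
import shutil
import subprocess
from pathlib import Path

# Assumed module path; the output path is the one proposed in this document.
MODELS_MODULE = "tnh_scholar.text_object"
OUTPUT_FILE = Path("vscode-extension/src/generated/types.ts")


def build_command(module: str, output: Path) -> list:
    """Assemble the pydantic2ts CLI invocation."""
    return ["pydantic2ts", "--module", module, "--output", str(output)]


def generate(module: str = MODELS_MODULE, output: Path = OUTPUT_FILE) -> None:
    """Run the generator, failing loudly if a required tool is missing."""
    for tool in ("pydantic2ts", "json2ts"):  # json2ts ships with json-schema-to-typescript
        if shutil.which(tool) is None:
            raise RuntimeError(f"Required tool not found on PATH: {tool}")
    output.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(module, output), check=True)
```

Running generate() in CI and failing the build when the generated file differs from the committed one is one way to address the "must regenerate on model changes" concern.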
Strategic Recommendations¶
- Adopt Python-first code generation (Pydantic → TypeScript)
- Continue Monaco alignment (Python models match Monaco conventions)
- Plan HTTP migration for v0.2.0 (persistent server, streaming)
- Investigate LSP for v1.0+ (text-centric features)
- Version models explicitly (semantic versioning, migration paths)
10. Open Questions¶
- How to handle complex Python types (e.g., NumberedText with custom logic)?
  - Option: Custom serializers (.model_dump() override)
  - Option: Separate transport models (e.g., NumberedTextTransport)
- Should we expose Python classes directly to TypeScript (via FFI)?
  - Likely not viable (Pyodide rejected in ADR-VSC01)
  - Alternative: Protocol Buffers for binary serialization?
- How to test TypeScript types without manual assertions?
  - Use Zod for runtime validation (catches deserialization errors)
  - Use TypeScript compiler for static type checking
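To make the "separate transport models" option concrete, the sketch below pairs a behavior-rich domain class with a plain data class that is safe to serialize. Both classes are hypothetical stand-ins: the real NumberedText API is not shown in this document, and a dataclass is used here in place of a Pydantic model only to keep the example dependency-free.

```python
"""Sketch of a transport model kept separate from a richer domain class."""
import json
from dataclasses import asdict, dataclass
from typing import List


class NumberedText:
    """Hypothetical stand-in for the domain class: text with 1-based line access."""

    def __init__(self, content: str, start: int = 1):
        self.lines = content.splitlines()
        self.start = start

    def get_line(self, number: int) -> str:
        return self.lines[number - self.start]


@dataclass
class NumberedTextTransport:
    """Plain data only: safe to serialize and to mirror as a TypeScript interface."""
    lines: List[str]
    start: int

    @classmethod
    def from_domain(cls, text: NumberedText) -> "NumberedTextTransport":
        return cls(lines=list(text.lines), start=text.start)

    def to_domain(self) -> NumberedText:
        return NumberedText("\n".join(self.lines), start=self.start)
```

The trade-off is an extra mapping step per complex type in exchange for generated TypeScript that contains only fields and no trace of Python-side behavior.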
Conclusion¶
Python-first code generation with pydantic-to-typescript offers the best path forward for TNH Scholar's VS Code integration:
- ✅ Type safety across boundaries
- ✅ Maintainable (single source of truth in Python)
- ✅ VS Code-friendly (clean TypeScript interfaces)
- ✅ Evolution-ready (versioning + migration paths)
Next: Proceed to Phase 2 (Prototype & Validate) to build a walking skeleton and validate these findings with real TNH Scholar models.
Status: Phase 1 Complete (Draft)
Next Review: 2025-12-19 (Phase 2 kickoff)