ADR-VSC03: Preliminary Investigation Findings
Python-JavaScript Impedance Mismatch - Phase 1 Research
Investigation Period: 2025-12-12
Status: Phase 1 - Research & Analysis (Draft)
Next Phase: Prototype & Validate
Executive Summary
Initial research reveals three viable architectural patterns for TNH Scholar's Python ↔ JavaScript boundary:
- Code Generation (Recommended): Auto-generate TypeScript from Pydantic with pydantic-to-typescript
- JSON Schema Intermediate: Shared schema with dual validation
- Transport-Native Types: Minimal shared types, protocol-oriented design
Key Finding: Code generation offers the best balance of type safety, maintainability, and VS Code integration depth for TNH Scholar's use case.
Critical Success Factor: Maintaining domain model purity in Python while generating clean TypeScript interfaces for VS Code extensions.
1. Type Generation Survey
1.1 Tool Evaluation: pydantic-to-typescript
Repository: pydantic-to-typescript
Maturity: Production-ready, 600+ GitHub stars, active maintenance
License: MIT
Example Conversion: TNH Scholar Models
Python (Pydantic):
# text_object.py
from pydantic import BaseModel, Field
from typing import Optional, List
class SectionRange(BaseModel):
"""Line range for a text section (1-based, inclusive)."""
start: int = Field(..., ge=1, description="Start line (1-based, inclusive)")
end: int = Field(..., ge=1, description="End line (1-based, inclusive)")
class SectionObject(BaseModel):
"""Represents a section of text with metadata."""
title: str
section_range: SectionRange
metadata: Optional[dict] = None
Generated TypeScript:
// text_object.ts (auto-generated)
/**
* Line range for a text section (1-based, inclusive).
*/
export interface SectionRange {
/** Start line (1-based, inclusive) */
start: number;
/** End line (1-based, inclusive) */
end: number;
}
/**
* Represents a section of text with metadata.
*/
export interface SectionObject {
title: string;
section_range: SectionRange;
metadata?: Record<string, any> | null;
}
Roundtrip Testing
Test Case: TextObject serialization → JSON → TypeScript deserialization
# Python: Serialize
from text_object import NumberedText, SectionObject, SectionRange, TextObject  # assumed module for these models
text_obj = TextObject(
num_text=NumberedText("line1\nline2"),
language="en",
sections=[SectionObject(
title="Introduction",
section_range=SectionRange(start=1, end=2),
metadata=None
)]
)
json_str = text_obj.model_dump_json()
// TypeScript: Deserialize (with Zod validation)
import { z } from 'zod';
const SectionRangeSchema = z.object({ start: z.number().int(), end: z.number().int() });
const SectionObjectSchema = z.object({
  title: z.string(),
  section_range: SectionRangeSchema,
  metadata: z.record(z.any()).nullable().optional(),
});
const TextObjectSchema = z.object({
  language: z.string(),
  sections: z.array(SectionObjectSchema),
  // ... other fields
});
// jsonStr: the JSON string produced by the Python side above
const parsed = TextObjectSchema.parse(JSON.parse(jsonStr));
// ✅ Type-safe, validated TextObject in TypeScript
Findings:
- ✅ Docstrings preserved as JSDoc comments
- ✅ Field descriptions mapped to TypeScript comments
- ✅ Optional fields handled correctly (metadata?: ... | null)
- ⚠️ Pydantic validators (e.g., ge=1) are not translated; equivalent Zod validators must be added manually (see the sketch below)
- ⚠️ Complex types (e.g., NumberedText) require custom serializers
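One practical aid, sketched below: Pydantic's generated JSON Schema still carries simple numeric constraints (ge=1 appears as minimum: 1), so it can serve as a checklist when hand-writing the matching Zod rules for the SectionRange model above. A minimal sketch:
# Python: Inspect the JSON Schema emitted by Pydantic to recover field constraints
from text_object import SectionRange  # the model defined earlier

schema = SectionRange.model_json_schema()
print(schema["properties"]["start"])
# e.g. {'description': 'Start line (1-based, inclusive)', 'minimum': 1, 'title': 'Start', 'type': 'integer'}
On the TypeScript side, minimum: 1 corresponds directly to z.number().int().min(1).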
1.2 Schema Evolution & Versioning
Challenge: How to handle model changes over time?
Recommended Strategy: Semantic Versioning + Migration Paths
# Python: Version models explicitly
from typing import List
from pydantic import BaseModel, ConfigDict
from text_object import SectionObject  # Metadata below is an illustrative placeholder
class TextObjectV1(BaseModel):
model_config = ConfigDict(json_schema_extra={"version": "1.0.0"})
language: str
sections: List[SectionObject]
class TextObjectV2(BaseModel):
model_config = ConfigDict(json_schema_extra={"version": "2.0.0"})
language: str
sections: List[SectionObject]
metadata: Metadata # ← New field in v2
@classmethod
def from_v1(cls, v1: TextObjectV1) -> "TextObjectV2":
"""Migrate v1 โ v2."""
return cls(
language=v1.language,
sections=v1.sections,
metadata=Metadata() # Default for migration
)
// TypeScript: Version detection + migration
type TextObjectVersioned = TextObjectV1 | TextObjectV2;
function parseTextObject(json: string): TextObjectV2 {
const data = JSON.parse(json);
if (data.version === "1.0.0") {
return migrateV1toV2(data);
}
return TextObjectV2Schema.parse(data);
}
Key Insight: Versioning must be explicit in Python models and detected in TypeScript to support graceful upgrades.
2. Transport Pattern Analysis
2.1 CLI Transport (v0.1.0 - Current)
Implementation: Subprocess invocation, JSON stdin/stdout
Example:
// VS Code Extension (TypeScript)
import { spawn } from 'child_process';
function sectionText(text: string): Promise<TextObject> {
  return new Promise((resolve, reject) => {
    const proc = spawn('tnh-fab', ['section']);
    let stdout = '';
    proc.stdout.on('data', (chunk) => { stdout += chunk; });
    proc.on('error', reject);
    proc.on('close', (code) => {
      if (code !== 0) { return reject(new Error(`tnh-fab exited with code ${code}`)); }
      resolve(JSON.parse(stdout) as TextObject);
    });
    proc.stdin.write(text); // send the document over stdin
    proc.stdin.end();
  });
}
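For reference, a minimal sketch of the Python side of this stdin/stdout contract (illustrative only; the actual tnh-fab implementation may differ, and run_sectioning below is a hypothetical stand-in for the sectioning logic):
# Python: Illustrative sketch of the CLI's stdin/stdout contract (not the real tnh-fab code)
import json
import sys

def run_sectioning(text: str) -> dict:
    # Stand-in for TNH Scholar's sectioning logic; returns a JSON-serializable payload
    lines = text.splitlines()
    return {
        "language": "en",
        "sections": [{"title": "Untitled", "section_range": {"start": 1, "end": max(len(lines), 1)}, "metadata": None}],
    }

def main() -> None:
    text = sys.stdin.read()                             # the document arrives on stdin
    sys.stdout.write(json.dumps(run_sectioning(text)))  # JSON result goes to stdout

if __name__ == "__main__":
    main()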
Benchmarks (simulated with 100KB text file):
- Latency: ~200-500ms (process spawn + JSON serialization)
- Throughput: Acceptable for single-file operations
- Streaming: Not supported (batch only)
Pros:
- ✅ Zero dependencies (uses existing CLI)
- ✅ No server management
- ✅ Works with CLI-first design (ADR-VSC01)
Cons:
- ❌ High latency for repeated calls (process spawn overhead)
- ❌ No session state (must resend context each time)
- ❌ No streaming support
Verdict: ✅ Viable for v0.1.0 (single-shot operations), plan migration to HTTP for v0.2.0
2.2 HTTP Transport (v0.2.0 - Planned)
Implementation: FastAPI service, JSON over HTTP
Example:
# Python: FastAPI service
from typing import Optional
from fastapi import Body, FastAPI
from text_object import TextObject, SectionParams

app = FastAPI()

@app.post("/section")
async def section_text(
    text: str = Body(..., embed=True),       # "text" field of the JSON request body
    params: Optional[SectionParams] = None,  # optional sectioning parameters
) -> TextObject:
    text_object = ...  # TNH Scholar sectioning logic produces a TextObject
    return text_object
// VS Code Extension (TypeScript)
async function sectionText(text: string): Promise<TextObject> {
const response = await fetch('http://localhost:8000/section', {
method: 'POST',
body: JSON.stringify({ text }),
headers: { 'Content-Type': 'application/json' }
});
return await response.json();
}
Benchmarks (estimated):
- Latency: ~50-100ms (HTTP roundtrip, no process spawn)
- Throughput: 10-20 req/sec (single process)
- Streaming: Supported via Server-Sent Events (SSE)
Pros:
- ✅ Lower latency (persistent process)
- ✅ Session state (can maintain context across calls)
- ✅ Streaming support (e.g., incremental AI completions)
- ✅ Familiar patterns (REST, OpenAPI spec generation)
Cons:
- ❌ Requires server management (startup, shutdown, port conflicts)
- ❌ More complex deployment (process management)
Verdict: ✅ Recommended for v0.2.0+ (persistent operations, streaming)
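The streaming item above can be exercised with Server-Sent Events. A minimal sketch, assuming a hypothetical stream_translation() async generator provided by the GenAI service:
# Python: SSE streaming sketch (stream_translation is a hypothetical stand-in)
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_translation(text: str, target_language: str):
    # Stand-in for incremental GenAI output; yields translated chunks one at a time
    for word in text.split():
        await asyncio.sleep(0)  # simulate incremental arrival
        yield word

@app.get("/translate/stream")
async def translate_stream(text: str, target_language: str = "vi"):
    async def event_stream():
        async for chunk in stream_translation(text, target_language):
            yield f"data: {chunk}\n\n"  # SSE event framing
    return StreamingResponse(event_stream(), media_type="text/event-stream")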
2.3 Language Server Protocol (LSP) - Future
Relevance: TNH Scholar's text-centric features (sectioning, translation) align with LSP's domain
Example LSP Features:
- Go to Definition: Jump to section header from reference
- Find References: Find all mentions of a concept across corpus
- Code Actions: "Section this text", "Translate to Vietnamese"
- Diagnostics: "Section title missing", "Inconsistent numbering"
Implementation (sketch):
# Python: LSP server (using pygls)
from lsprotocol.types import CodeAction, Command
from pygls.server import LanguageServer
from text_object import TextObject

server = LanguageServer("tnh-scholar-lsp", "v0.1")

@server.feature("textDocument/codeAction")
def code_actions(params):
    # Offer a "Section Text" code action backed by a custom command
    return [CodeAction(
        title="Section Text",
        command=Command(title="Section Text", command="tnh.sectionText"),
    )]

@server.command("tnh.sectionText")
def section_text_command(args):
    # ... TNH Scholar sectioning logic
    return TextObject(...)
Pros:
- ✅ Deep VS Code integration (native features)
- ✅ Standardized protocol (LSP is well-documented)
- ✅ Rich editor features (definitions, references, diagnostics)
Cons:
- ❌ LSP is text-centric (less suitable for audio/video processing)
- ❌ Higher implementation complexity (protocol compliance)
Verdict: 🔍 Investigate for v1.0+ (text-only features), not a replacement for HTTP
2.4 Model Context Protocol (MCP) - v2.0+
Relevance: MCP aligns with TNH Scholar's GenAI service and agent workflows
Example MCP Integration:
// VS Code Extension: MCP client
import { Client } from "@modelcontextprotocol/sdk";
const client = new Client({
name: "tnh-scholar",
version: "1.0.0"
});
// Use TNH Scholar's GenAI service as an MCP tool
const result = await client.callTool("tnh_translate", {
text: "Hello world",
target_language: "vi"
});
Pros:
- ✅ Agent-native protocol (aligns with GenAI service)
- ✅ Tool composition (chain TNH Scholar tools with external agents)
- ✅ Future-proof (MCP is an emerging standard for AI workflows)
Cons:
- ❌ Immature protocol (still evolving)
- ❌ Limited tooling (TypeScript SDK available, Python in progress)
Verdict: 🔮 Monitor for v2.0+, not viable for v0.1.0-v1.0
Transport Progression Recommendation
- v0.1.0 (Q1 2025): CLI (batch)
- v0.2.0 (Q2 2025): HTTP (persistent)
- v1.0.0 (Q4 2025): HTTP + LSP (rich editing)
- v2.0.0 (2026+): HTTP + LSP + MCP (agent workflows)
3. Data Model Ownership Strategies
Strategy 1: Python-First (Recommended)
Approach: Python is source of truth, TypeScript is generated
Workflow:
[Python Models (Pydantic)]
  ↓ (Code generation)
[TypeScript Interfaces]
  ↓ (Runtime validation with Zod)
[VS Code Extension]
Pros:
- ✅ Single source of truth (Python)
- ✅ Python developers never touch TypeScript types
- ✅ Type safety guaranteed by generation + Zod validation
- ✅ Aligns with TNH Scholar's Python-centric architecture
Cons:
- ❌ TypeScript developers can't add UI-specific fields (must go through Python)
- ❌ Build-time dependency (must regenerate on model changes)
Mitigation: Use TypeScript extension interfaces for UI-specific state
// Generated (don't edit)
export interface TextObject { /* ... */ }
// UI-specific extension (manual)
export interface TextObjectUI extends TextObject {
isExpanded: boolean; // UI state only
decorations: MonacoDecoration[];
}
Strategy 2: Schema-First (Alternative)
Approach: JSON Schema is source of truth, both Python and TypeScript validate against it
Workflow:
[JSON Schema (YAML)]
  ↓
[Python Models (datamodel-code-generator)]
[TypeScript Interfaces (json-schema-to-typescript)]
Pros:
- ✅ Language-agnostic source of truth
- ✅ Both sides can evolve independently (as long as the schema is valid)
Cons:
- ❌ Extra abstraction layer (schema → code)
- ❌ Requires schema-first development (less Pythonic)
- ❌ Custom Pydantic validators can't be expressed in JSON Schema (simple constraints like ge map to minimum, but validator functions do not)
Verdict: ❌ Not recommended for TNH Scholar (Python-first culture)
Strategy 3: Dual-Native (Not Recommended)
Approach: Maintain parallel Python and TypeScript implementations
Cons:
- ❌ High maintenance burden (manual sync)
- ❌ Risk of drift (Python and TypeScript types diverge)
- ❌ No automation benefits
Verdict: ❌ Avoid unless absolutely necessary
4. Runtime Responsibility Boundaries
Recommended Split
Python (TNH Scholar Core):
- ✅ AI processing (GenAI service, transcription, diarization)
- ✅ Data validation (Pydantic models)
- ✅ Business rules (sectioning logic, translation pipelines)
- ✅ File I/O (read/write text, audio, video)
TypeScript (VS Code Extension):
- ✅ UI state management (expanded sections, selection state)
- ✅ Monaco editor integration (decorations, actions, commands)
- ✅ User interaction (clicks, keyboard shortcuts, context menus)
- ✅ VS Code API calls (workspace, window, editor)
Gray Area: Data Transformation
Example: Converting TextObject to Monaco editor ranges
Option A: Python Exports Monaco-Compatible Format
class SectionRange(BaseModel):
start_line: int # 1-based (Monaco uses 1-based)
end_line: int # 1-based, inclusive
def to_monaco_range(self) -> dict:
"""Export Monaco-compatible range."""
return {
"startLineNumber": self.start_line,
"endLineNumber": self.end_line,
"startColumn": 1,
"endColumn": 1
}
Option B: TypeScript Handles All Monaco Mapping
// TypeScript maps generic SectionRange → Monaco IRange
function toMonacoRange(range: SectionRange): monaco.IRange {
return {
startLineNumber: range.start,
endLineNumber: range.end,
startColumn: 1,
endColumn: Number.MAX_VALUE
};
}
Recommendation: Option A (Python exports Monaco-compatible format)
- Rationale: Keeps Monaco coupling explicit in Python (aligns with ADR-AT03.2)
- Trade-off: Slightly couples Python to UI framework, but maintains clarity
5. Monaco Editor Integration Depth
Current Approach (ADR-AT03.2): Monaco Alignment
Strategy: Design Python models to match Monaco's data structures
Example: NumberedText line numbering uses 1-based indexing (Monaco's convention)
Pros:
- ✅ Zero translation in TypeScript (Python → JSON → Monaco directly)
- ✅ Clear mental model (Python devs understand Monaco expectations)
- ✅ Fewer moving parts (no translation layer to maintain)
Cons:
- ❌ Couples Python to UI framework (mitigated by domain model purity)
- ❌ If Monaco changes, Python models must adapt
Recommendation: ✅ Continue Monaco alignment for TNH Scholar
- Rationale: Benefits (zero translation) outweigh costs (minor coupling)
- Mitigation: Keep domain models pure; only add Monaco helpers (e.g., to_monaco_range())
Alternative: Translation Layer (Not Recommended)
Strategy: Python exports generic JSON, TypeScript maps to Monaco
Example:
# Python: Generic 0-based indexing
class SectionRange(BaseModel):
start: int # 0-based
end: int # 0-based, exclusive
// TypeScript: Translate to Monaco (1-based, inclusive)
function toMonacoRange(range: SectionRange): monaco.IRange {
return {
startLineNumber: range.start + 1, // 0-based → 1-based
endLineNumber: range.end, // exclusive → inclusive
startColumn: 1,
endColumn: Number.MAX_VALUE
};
}
Cons:
- ❌ Extra translation layer (more code, more bugs)
- ❌ Mental model mismatch (Python devs think 0-based, Monaco is 1-based)
Verdict: ❌ Not recommended for TNH Scholar
6. Real-World Examples
Case Study: Jupyter (Python ↔ JavaScript)
Architecture:
- Python kernel (IPython) communicates via ZeroMQ
- JavaScript frontend (JupyterLab) consumes JSON messages
- Key Pattern: Message protocol (JSON) is versioned and documented
Lessons:
- ✅ Explicit protocol versioning prevents breaking changes
- ✅ Python side owns protocol definition
- ✅ TypeScript side validates messages (runtime checks)
Case Study: VS Code Python Extension
Architecture:
- Python Language Server (Pylance) uses LSP
- TypeScript extension consumes LSP messages
- Key Pattern: Standardized protocol (LSP) decouples implementation
Lessons:
- ✅ LSP is battle-tested for text-centric features
- ✅ Protocol compliance ensures interoperability
7. Key Findings Summary
Type Safety
- ✅ pydantic-to-typescript is production-ready and suitable for TNH Scholar
- ✅ Roundtrip (Python → JSON → TypeScript) works reliably with Zod validation
- ⚠️ Pydantic validators require manual TypeScript equivalents (Zod)
Transport Evolution
- ✅ CLI (v0.1.0): Viable for single-shot operations
- ✅ HTTP (v0.2.0+): Recommended for persistent operations and streaming
- 🔍 LSP (v1.0+): Investigate for text-centric features (definitions, references)
- 🔮 MCP (v2.0+): Monitor for agent workflows (not ready yet)
Data Model Ownership
- ✅ Python-first is recommended (Pydantic → TypeScript generation)
- ❌ Schema-first adds unnecessary abstraction
- ❌ Dual-native is too high maintenance
Runtime Boundaries
- ✅ Python owns AI processing, validation, business rules
- ✅ TypeScript owns UI state, Monaco integration, user interaction
- ✅ Gray area (data transformation): Python exports Monaco-compatible format (ADR-AT03.2 approach)
Monaco Integration
- ✅ Continue Monaco alignment (Python models match Monaco conventions)
- ✅ Mitigation: Keep domain models pure, add Monaco helpers as needed
8. Next Steps: Phase 2 (Prototype & Validate)
Prototype Goals
- Walking Skeleton:
  - Python: TextObject with SectionObject and SectionRange
  - Auto-generate TypeScript interfaces with pydantic-to-typescript
  - VS Code extension: Deserialize JSON → map to Monaco editor
- Schema Evolution Test (see the sketch below):
  - Add a field to TextObject (e.g., creation_timestamp)
  - Regenerate TypeScript
  - Test backward compatibility (v1 JSON still deserializes)
- Benchmarking:
  - CLI transport: Measure latency for 10KB, 100KB, 1MB text files
  - HTTP transport: Compare latency and throughput vs CLI
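A minimal sketch of the backward-compatibility check, assuming the hypothetical creation_timestamp field is added with a default so that v1 payloads still validate (model trimmed to the relevant fields):
# Python: Backward-compatibility test sketch for the hypothetical creation_timestamp field
from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class TextObjectV2(BaseModel):  # trimmed for illustration
    language: str
    creation_timestamp: Optional[datetime] = None  # new in v2, defaulted so v1 JSON still parses

def test_v1_json_still_deserializes() -> None:
    v1_json = '{"language": "en"}'  # a v1 payload without the new field
    obj = TextObjectV2.model_validate_json(v1_json)
    assert obj.creation_timestamp is None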
Success Criteria
- ✅ TypeScript types auto-generated with <5% manual intervention
- ✅ Roundtrip reliability: 100% for basic types, 95%+ for complex types
- ✅ CLI latency: <500ms for 100KB files
- ✅ HTTP latency: <100ms for 100KB files (persistent server)
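A rough harness for the CLI latency criterion, assuming tnh-fab section accepts the document on stdin as in section 2.1 (illustrative sketch, not a finished benchmark):
# Python: Time single-shot CLI invocations for a given payload size
import subprocess
import time

def time_cli_call(text: str, runs: int = 5) -> float:
    """Return the mean wall-clock latency (seconds) of `tnh-fab section` over stdin."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            ["tnh-fab", "section"],
            input=text,
            capture_output=True,
            text=True,
            check=True,
        )
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

print(f"100KB payload: {time_cli_call('x' * 100_000):.3f}s mean latency")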
9. Recommendations
Immediate Actions (Phase 2)
- Set up pydantic-to-typescript in the TNH Scholar build pipeline (see the sketch below):
  - Install: pip install pydantic-to-typescript
  - Add build script: scripts/generate-typescript-types.py
  - Output: vscode-extension/src/generated/types.ts
- Build walking skeleton:
  - Python: Export TextObject, SectionObject, SectionRange
  - Generate TypeScript interfaces
  - VS Code extension: Deserialize and map to Monaco
- Benchmark CLI vs HTTP:
  - Measure latency for realistic workloads
  - Document findings in the Phase 2 report
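A possible shape for scripts/generate-typescript-types.py, assuming pydantic-to-typescript's documented generate_typescript_defs helper (which requires the json-schema-to-typescript npm tool, json2ts, on PATH); the module path is an assumption:
# Python: scripts/generate-typescript-types.py (sketch)
from pydantic2ts import generate_typescript_defs

def main() -> None:
    generate_typescript_defs(
        "tnh_scholar.text_object",                  # assumed module containing the boundary models
        "vscode-extension/src/generated/types.ts",  # output location named in this ADR
    )

if __name__ == "__main__":
    main()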
Strategic Recommendations
- Adopt Python-first code generation (Pydantic → TypeScript)
- Continue Monaco alignment (Python models match Monaco conventions)
- Plan HTTP migration for v0.2.0 (persistent server, streaming)
- Investigate LSP for v1.0+ (text-centric features)
- Version models explicitly (semantic versioning, migration paths)
10. Open Questions
- How to handle complex Python types (e.g., NumberedText with custom logic)?
  - Option: Custom serializers (.model_dump() override; see the sketch below)
  - Option: Separate transport models (e.g., NumberedTextTransport)
- Should we expose Python classes directly to TypeScript (via FFI)?
  - Likely not viable (Pyodide rejected in ADR-VSC01)
  - Alternative: Protocol Buffers for binary serialization?
- How to test TypeScript types without manual assertions?
  - Use Zod for runtime validation (catches deserialization errors)
  - Use the TypeScript compiler for static type checking
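One way the custom-serializer option could look, sketched with a hypothetical stand-in for NumberedText whose internal representation is not JSON-friendly (the real class may differ):
# Python: Custom serializer sketch for a hypothetical NumberedText stand-in
from pydantic import BaseModel, ConfigDict, field_serializer

class NumberedText:
    """Stand-in for the real class; assumed to hold raw text plus numbering logic."""
    def __init__(self, text: str) -> None:
        self.text = text
    def as_lines(self) -> list[str]:
        return self.text.splitlines()

class TextObjectTransport(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    language: str
    num_text: NumberedText

    @field_serializer("num_text")
    def serialize_num_text(self, value: NumberedText) -> dict:
        # Flatten to a JSON-friendly shape that the TypeScript side can type cleanly
        return {"lines": value.as_lines(), "start": 1}

print(TextObjectTransport(language="en", num_text=NumberedText("line1\nline2")).model_dump_json())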
Conclusion
Python-first code generation with pydantic-to-typescript offers the best path forward for TNH Scholar's VS Code integration:
- ✅ Type safety across boundaries
- ✅ Maintainable (single source of truth in Python)
- ✅ VS Code-friendly (clean TypeScript interfaces)
- ✅ Evolution-ready (versioning + migration paths)
Next: Proceed to Phase 2 (Prototype & Validate) to build a walking skeleton and validate these findings with real TNH Scholar models.
Status: Phase 1 Complete (Draft)
Next Review: 2025-12-19 (Phase 2 kickoff)