TypeScript Agents SDK For Voice Applications

You built a Python voice agent. Now you need it in the browser. You assume the TypeScript SDK is missing features. It’s not.

Or you built a voice agent in TypeScript. You assume Python has features you need. It doesn’t.

OpenAI’s Agents SDK exists in both TypeScript and Python with voice feature parity. Same capabilities, same patterns, different languages.

Here’s what works in both, what to watch for, and how to choose.

The Two SDKs: Why Both Exist

Python SDK: Server-side voice agents

  • Runs on your backend
  • Long-lived processes
  • Direct database/API access
  • Full system control

TypeScript SDK: Client-side and server-side voice agents

  • Runs in browsers (client-side)
  • Runs on Node.js (server-side)
  • Edge deployment (Vercel, Cloudflare Workers)
  • Frontend integration

Both support identical voice agent features. You’re not giving up capabilities by choosing one over the other.

Voice Feature Parity Table

Feature | Python SDK | TypeScript SDK | Notes
Realtime API connection | ✅ | ✅ | WebSocket in both
WebRTC transport | ❌ | ✅ (browser) | WebRTC requires a browser environment
Speech-to-speech | ✅ | ✅ | Full-duplex audio in both
Interruptions (barge-in) | ✅ | ✅ | User can cut off the agent
Tool calling | ✅ | ✅ | Function execution identical
Multi-agent handoffs | ✅ | ✅ | Agent-to-agent transfers
Guardrails | ✅ | ✅ | Input/output policies
Streaming responses | ✅ | ✅ | Real-time audio output
Human-in-the-loop | ✅ | ✅ | Pause/resume with approval
Audio trace playback | ✅ | ✅ | Debug with recorded audio
Built-in tracing | ✅ | ✅ | Conversation logging
MCP support | ✅ | ✅ | Model Context Protocol
State persistence | ✅ | ✅ | Session management

Key takeaway: Voice agent features are identical. Choose based on deployment environment, not features.

Transport Layer Differences

The one real difference: where the agent runs determines which transport it uses.

// TypeScript SDK - Browser (WebRTC)
import { RealtimeClient } from "@openai/realtime-api-beta";

const client = new RealtimeClient({
  apiKey: clientToken, // Short-lived token from your backend; never ship a raw API key to the browser
  // Automatically uses WebRTC in browser for ultra-low latency
});

// TypeScript SDK - Node.js (WebSocket)
const serverClient = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
  // Automatically uses WebSocket on server
});
# Python SDK - Server only (WebSocket)
import os

from openai import realtime

client = realtime.RealtimeClient(
    api_key=os.environ["OPENAI_API_KEY"]
    # Always WebSocket (no browser environment)
)

WebRTC vs WebSocket latency:

  • WebRTC (browser): ~50-100ms end-to-end
  • WebSocket (server): ~100-200ms end-to-end

Both are fast enough for real-time voice. WebRTC is slightly better for latency-sensitive applications.
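Both SDKs select the transport for you, but the decision is simple enough to state explicitly. A minimal sketch of the idea; the helper name and latency table are illustrative, not part of either SDK:

```typescript
// Hypothetical helper: which transport a client ends up on.
type Transport = "webrtc" | "websocket";

function pickTransport(isBrowser: boolean): Transport {
  // WebRTC needs browser media APIs; everything else falls back to WebSocket.
  return isBrowser ? "webrtc" : "websocket";
}

// Rough end-to-end latency bands from the numbers above (milliseconds).
const latencyBandMs: Record<Transport, [number, number]> = {
  webrtc: [50, 100],
  websocket: [100, 200],
};

const transport = pickTransport(false); // e.g., a Node.js server process
const [lowMs, highMs] = latencyBandMs[transport];
```

Here `pickTransport(false)` models a server process, so the expected band is the WebSocket one.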

Code Comparison: Same Agent, Two Languages

Here’s the same voice agent in both SDKs:

Python Version

from openai import agents, realtime
import os

# Define agent
agent = agents.Agent(
    name="booking_agent",
    model="gpt-realtime",
    instructions="""You are a restaurant booking agent.
    Your job: Help users book tables.
    Always confirm: party size, date, time, and name before booking.""",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "book_table",
                "description": "Books a restaurant table",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "party_size": {"type": "number"},
                        "date": {"type": "string"},
                        "time": {"type": "string"},
                        "name": {"type": "string"}
                    },
                    "required": ["party_size", "date", "time", "name"]
                }
            }
        }
    ]
)

# Connect to Realtime API
async def run_agent():
    async with realtime.connect(agent) as session:
        # Voice interaction starts
        async for event in session.listen():
            if event.type == "tool_call":
                result = await book_table(**event.parameters)
                await session.send_tool_result(event.call_id, result)
            elif event.type == "conversation_complete":
                break

# Tool implementation
async def book_table(party_size, date, time, name):
    # Your booking logic here
    booking_id = create_booking(party_size, date, time, name)
    return {
        "success": True,
        "booking_id": booking_id,
        "message": f"Booked table for {party_size} on {date} at {time}"
    }

TypeScript Version

import { Agent, RealtimeClient } from "@openai/agents-sdk";

// Define agent (identical structure)
const agent = new Agent({
  name: "booking_agent",
  model: "gpt-realtime",
  instructions: `You are a restaurant booking agent.
    Your job: Help users book tables.
    Always confirm: party size, date, time, and name before booking.`,
  tools: [
    {
      type: "function",
      function: {
        name: "book_table",
        description: "Books a restaurant table",
        parameters: {
          type: "object",
          properties: {
            party_size: { type: "number" },
            date: { type: "string" },
            time: { type: "string" },
            name: { type: "string" }
          },
          required: ["party_size", "date", "time", "name"]
        }
      }
    }
  ]
});

// Connect to Realtime API
async function runAgent() {
  const session = await RealtimeClient.connect(agent);
  
  // Voice interaction starts
  for await (const event of session.listen()) {
    if (event.type === "tool_call") {
      const result = await bookTable(event.parameters);
      await session.sendToolResult(event.callId, result);
    } else if (event.type === "conversation_complete") {
      break;
    }
  }
}

// Tool implementation
async function bookTable(params: {
  party_size: number;
  date: string;
  time: string;
  name: string;
}) {
  // Your booking logic here
  const bookingId = createBooking(
    params.party_size,
    params.date,
    params.time,
    params.name
  );
  
  return {
    success: true,
    booking_id: bookingId,
    message: `Booked table for ${params.party_size} on ${params.date} at ${params.time}`
  };
}

They’re nearly identical: same agent definition, same event loop, same tool pattern. The differences are surface syntax.

When To Use Python

Choose Python when:

1. Server-side processing

# Python excels at backend tasks
import pandas

async def process_large_dataset(file_path):
    df = pandas.read_csv(file_path)
    results = complex_analysis(df)
    return results

agent.add_tool(process_large_dataset)
# Heavy data processing on server

2. Direct database access

# Python has rich database ecosystem
async def query_customer_history(customer_id):
    async with db_pool.acquire() as conn:
        history = await conn.fetch(
            "SELECT * FROM orders WHERE customer_id = $1",
            customer_id
        )
    return history

agent.add_tool(query_customer_history)

3. Integration with Python ML libraries

# Python for ML inference
import torch
from transformers import pipeline

sentiment_analyzer = pipeline("sentiment-analysis")

async def analyze_sentiment(text):
    result = sentiment_analyzer(text)[0]
    return {
        "sentiment": result["label"],
        "confidence": result["score"]
    }

agent.add_tool(analyze_sentiment)

4. Long-running server processes

# Python for 24/7 server agents
import asyncio

async def main():
    while True:
        async with realtime.connect(agent) as session:
            await handle_session(session)
            # Loop reconnects on disconnect

asyncio.run(main())

When To Use TypeScript

Choose TypeScript when:

1. Browser-based voice agents

// TypeScript for client-side voice
import { RealtimeClient } from "@openai/realtime-api-beta";

// Runs entirely in browser
const client = new RealtimeClient({
  apiKey: getClientToken(), // Short-lived token from your backend
});

// WebRTC for lowest latency
await client.connect();
// No server required for voice connection

2. Real-time frontend updates

// TypeScript for immediate UI updates
session.on("agent_speaking", (event) => {
  // Update UI in real-time as agent speaks
  transcriptElement.textContent += event.text;
  
  // Show avatar animation
  avatarElement.classList.add("speaking");
});

session.on("agent_finished", () => {
  avatarElement.classList.remove("speaking");
});
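The handlers above write straight to the DOM, which makes them hard to test. The same state can live in a small framework-free buffer; the class and event names below mirror the handlers but are a hypothetical sketch, not SDK API:

```typescript
// Accumulates agent speech and tracks the "speaking" flag,
// independent of any DOM elements.
type SpeakingEvent = { text: string };

class TranscriptBuffer {
  private parts: string[] = [];
  speaking = false;

  onAgentSpeaking(event: SpeakingEvent): void {
    this.parts.push(event.text);
    this.speaking = true; // UI can show the avatar animation
  }

  onAgentFinished(): void {
    this.speaking = false;
  }

  get text(): string {
    return this.parts.join("");
  }
}

const buf = new TranscriptBuffer();
buf.onAgentSpeaking({ text: "Your table " });
buf.onAgentSpeaking({ text: "is booked." });
buf.onAgentFinished();
```

The session handlers then shrink to one-liners that forward events into the buffer and re-render from `buf.text`.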

3. Edge deployment

// TypeScript for Vercel/Cloudflare Workers
export default async function handler(req: Request) {
  const agent = new Agent({ /* ... */ });
  const session = await RealtimeClient.connect(agent);
  
  // Runs on edge, closer to users
  const response = await session.handleRequest(req);
  return response;
}

4. Type-safe agent development

// TypeScript for compile-time type checking
interface BookingParams {
  party_size: number;
  date: string; // ISO format
  time: string; // HH:MM format
  name: string;
}

async function bookTable(params: BookingParams): Promise<BookingResult> {
  // Compiler ensures you handle all fields correctly
  // Catches errors at build time, not runtime
}
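Compile-time types vanish at runtime, and tool parameters arrive over the wire as untyped JSON. A hand-rolled type guard (a sketch, not something the SDK provides) can close that gap:

```typescript
interface BookingParams {
  party_size: number;
  date: string; // ISO format (YYYY-MM-DD)
  time: string; // HH:MM format
  name: string;
}

// User-defined type guard: validates shape and formats at runtime.
function isBookingParams(value: unknown): value is BookingParams {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.party_size === "number" &&
    typeof v.date === "string" && /^\d{4}-\d{2}-\d{2}$/.test(v.date) &&
    typeof v.time === "string" && /^\d{2}:\d{2}$/.test(v.time) &&
    typeof v.name === "string"
  );
}

const ok = isBookingParams({ party_size: 2, date: "2025-01-31", time: "19:00", name: "Kim" });
const bad = isBookingParams({ party_size: "2", date: "tomorrow", time: "7pm", name: "Kim" });
```

Guarding at the tool-call boundary means the rest of the code can rely on `BookingParams` without defensive checks.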

Hybrid Architecture: Python + TypeScript

Best practice: Use both SDKs together.

graph LR
    A[Browser] -->|WebRTC| B[TypeScript Voice Agent]
    B -->|Tool Calls| C[Python Backend]
    C -->|Database| D[PostgreSQL]
    C -->|ML Inference| E[PyTorch Models]
    C -->|Results| B
    B -->|Voice Response| A

Architecture:

  • Frontend (TypeScript): Voice interaction, WebRTC connection, real-time UI
  • Backend (Python): Business logic, database, ML inference, data processing

Code example:

// Frontend (TypeScript)
const agent = new Agent({
  name: "customer_service",
  model: "gpt-realtime",
  instructions: "You help customers with their accounts.",
  tools: [
    {
      type: "function",
      function: {
        name: "get_account_info",
        description: "Fetches customer account information"
      }
    }
  ]
});

// Tool calls backend
session.on("tool_call", async (event) => {
  if (event.name === "get_account_info") {
    // Call Python backend API
    const response = await fetch("/api/account", {
      method: "POST",
      body: JSON.stringify({ customer_id: event.parameters.customer_id })
    });
    
    const data = await response.json();
    await session.sendToolResult(event.callId, data);
  }
});
# Backend (Python)
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.post("/api/account")
async def get_account_info(customer_id: str):
    # Python handles database queries
    async with db_pool.acquire() as conn:
        account = await conn.fetchrow(
            "SELECT * FROM accounts WHERE id = $1",
            customer_id
        )
        
    return {
        "account_id": account["id"],
        "balance": account["balance"],
        "status": account["status"],
        "history": await get_recent_transactions(customer_id)
    }
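As more tools call the backend, the frontend dispatch above can be made table-driven instead of one `if` branch per tool. A sketch; the route table and types are hypothetical, not SDK API:

```typescript
type ToolCall = { name: string; parameters: Record<string, unknown> };
type ToolRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
};

// Hypothetical route table: tool name -> backend endpoint.
const toolRoutes: Record<string, string> = {
  get_account_info: "/api/account",
};

// Builds the fetch arguments for a tool call; throws on unknown tools
// so misconfigured agents fail loudly instead of silently.
function buildToolRequest(call: ToolCall): ToolRequest {
  const url = toolRoutes[call.name];
  if (!url) throw new Error(`No backend route for tool: ${call.name}`);
  return {
    url,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(call.parameters),
    },
  };
}

const req = buildToolRequest({
  name: "get_account_info",
  parameters: { customer_id: "c_42" },
});
```

The handler then becomes `fetch(req.url, req.init)` for every tool, and adding a tool means adding one row to the table.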

Benefits:

  • TypeScript: Fast voice in browser with WebRTC
  • Python: Powerful backend with full ecosystem
  • Best of both worlds

Migration Between SDKs

Switching from Python to TypeScript (or vice versa) is straightforward. Agent definitions are nearly identical.

Python → TypeScript

# Python agent
agent = agents.Agent(
    name="support",
    model="gpt-realtime",
    instructions="You are a support agent.",
    tools=[book_table_tool]
)

Becomes:

// TypeScript agent (nearly identical)
const agent = new Agent({
  name: "support",
  model: "gpt-realtime",
  instructions: "You are a support agent.",
  tools: [bookTableTool]
});

Migration steps:

  1. Copy agent definition
  2. Convert Python dict → TypeScript object
  3. Convert snake_case → camelCase
  4. Implement tools in TypeScript
  5. Test

Time to migrate: ~2-4 hours for a typical agent.
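Step 3 of the migration is mechanical enough to script. This sketch renames keys recursively; note it should only run on SDK option names, since JSON-schema parameter names in tool definitions (like `party_size`) stay snake_case because the model sees them:

```typescript
// snake_case -> camelCase for a single key.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase());
}

// Recursively renames keys in objects and arrays; values are untouched.
function convertKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(convertKeys);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        snakeToCamel(k),
        convertKeys(v),
      ])
    );
  }
  return value;
}

// Hypothetical Python-side config with a snake_case option.
const pythonAgentConfig = { name: "support", model: "gpt-realtime", max_turns: 10 };
const tsAgentConfig = convertKeys(pythonAgentConfig) as Record<string, unknown>;
```

After conversion, `max_turns` becomes `maxTurns` while `name` and `model` pass through unchanged.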

TypeScript → Python

Same process in reverse. Agent logic is portable.

Performance Comparison

Real metrics from same voice agent in both SDKs:

Python (server):

  • WebSocket latency: 120ms avg
  • Memory usage: ~80MB per session
  • Tool execution: Direct database access (fast)
  • Deployment: Single server, scales vertically

TypeScript (browser):

  • WebRTC latency: 65ms avg (1.8x faster)
  • Memory usage: ~45MB per session
  • Tool execution: API calls to backend (slight overhead)
  • Deployment: Distributed (every browser), scales automatically

TypeScript (Node.js):

  • WebSocket latency: 110ms avg (similar to Python)
  • Memory usage: ~60MB per session
  • Tool execution: Direct database access (fast)
  • Deployment: Edge functions, scales horizontally

Conclusion: Latency is similar unless you use WebRTC (browser-only). Choose based on architecture, not performance.

Common Pitfalls

Pitfall 1: Assuming Python has more features

Myth: “Python has better voice support.”
Reality: Feature parity. Both SDKs support identical voice features.

Pitfall 2: Using wrong SDK for environment

Wrong: Python for browser voice agents.
Right: TypeScript for browsers; Python or TypeScript for servers.

Pitfall 3: Rewriting everything

Wrong: “We’re switching SDKs, rewrite the entire agent.”
Right: Port the agent definition (a few hours), keep tools in their native environment, and connect them via APIs.

Summary: Python vs TypeScript Decision Matrix

Requirement | Choose Python | Choose TypeScript
Browser voice agents | ❌ | ✅
WebRTC ultra-low latency | ❌ | ✅ (browser)
Server-side processing | ✅ | ✅
Direct database access | ✅ | ✅
ML inference | ✅ | ❌ (call Python API)
Edge deployment | ❌ | ✅
Real-time UI updates | ❌ | ✅
Type safety | ❌ | ✅
Rapid prototyping | ✅ | ✅

Best choice: Use both. TypeScript for voice frontend, Python for backend business logic.

Voice feature parity means you don’t sacrifice capabilities. Choose based on where the agent runs, not what it can do.

Same voice agent. Two languages. Zero compromises.
