Why Voice Agents Use WebRTC In Browsers

The transport layer isn’t something most developers think about. But when you’re building voice agents, it’s the difference between 50ms and 500ms of latency.

The OpenAI Agents SDK automatically chooses the right transport:

  • WebRTC in browsers (ultra-low latency)
  • WebSocket on servers (simpler, still fast)

You don’t have to decide. The SDK picks the optimal transport based on your environment.

Why Transport Matters For Voice

Voice is real-time. Users expect instant responses. Every millisecond of latency makes conversations feel slower.

The transport layer determines the round-trip time across these steps:

  1. User speaks
  2. Audio reaches server
  3. Agent processes
  4. Response returns
  5. User hears reply

graph LR
    A[User Speaks] -->|Transport| B[Server Processes]
    B -->|Transport| C[User Hears]
    
    style A fill:#e1f5ff
    style C fill:#e1f5ff

Transport accounts for 30-40% of total latency. Choose wrong, and your agent feels sluggish even if processing is fast.

WebRTC: Built For Real-Time

WebRTC (Web Real-Time Communication) was designed for real-time media like video calls. It’s optimized for low-latency, peer-to-peer audio/video.

How WebRTC Works

graph TD
    A[Browser] -->|STUN/TURN| B[NAT Traversal]
    B --> C[Direct P2P Connection]
    C --> D[Server]
    D --> C
    C --> A
    
    style C fill:#d4f1d4

Key features:

  • UDP-based: No TCP handshake overhead
  • NAT traversal: Works behind firewalls
  • Adaptive bitrate: Adjusts to network conditions
  • Jitter buffering: Smooths out packet arrival

Latency profile:

  • Connection setup: 500-1000ms (one-time cost)
  • Per-message: 20-50ms (fast)
  • Audio encoding/decoding: 10-20ms

Total voice turn latency with WebRTC: ~300-500ms

WebRTC In The Browser

Browsers have native WebRTC support:

// SDK handles this automatically, but here's what it does:
const peerConnection = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Create audio track
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioTrack = stream.getAudioTracks()[0];
peerConnection.addTrack(audioTrack, stream);

// Establish connection
const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);

// Exchange ICE candidates
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    // Send candidate to server
  }
};
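The offer and local ICE candidates above are only half the handshake: the browser must also apply the server’s answer and any remote ICE candidates it sends back. A minimal dispatcher for those signaling messages might look like this — the message shapes are assumptions for illustration; the SDK defines its own signaling protocol:

```javascript
// Maps an incoming signaling message to the RTCPeerConnection method
// that should handle it. Message shapes here are illustrative, not the SDK's.
function routeSignal(raw) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case 'answer':
      return { method: 'setRemoteDescription', arg: msg.payload };
    case 'candidate':
      return { method: 'addIceCandidate', arg: msg.payload };
    default:
      return { method: null, arg: null };
  }
}

// Usage (sketch):
//   const { method, arg } = routeSignal(event.data);
//   if (method) await peerConnection[method](arg);
```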

The Agents SDK handles all of this. You just call:

const agent = new Agent({ ... });
await agent.connect(); // SDK uses WebRTC automatically in browser

Why WebRTC For Browsers?

  1. Native support: Every modern browser has WebRTC
  2. Low latency: UDP-based, no TCP overhead
  3. Firewall-friendly: NAT traversal built in
  4. Adaptive: Handles network changes gracefully
  5. Secure: Encrypted by default (DTLS)

WebSocket: Simpler Alternative

WebSocket is a TCP-based protocol that provides full-duplex communication over a single persistent connection, established via an HTTP upgrade.

How WebSocket Works

graph TD
    A[Client] -->|HTTP Upgrade| B[WebSocket Handshake]
    B --> C[Persistent TCP Connection]
    C -->|Binary Messages| D[Server]
    D -->|Binary Messages| C
    C --> A

Key features:

  • TCP-based: Reliable delivery
  • HTTP compatible: Works through proxies
  • Simple API: Easier than WebRTC
  • No NAT issues: Uses HTTP ports

Latency profile:

  • Connection setup: 100-200ms (faster than WebRTC)
  • Per-message: 50-100ms (slower than WebRTC)
  • Audio encoding/decoding: 10-20ms

Total voice turn latency with WebSocket: ~400-600ms

WebSocket On The Server

Server environments don’t have browser APIs. WebSocket is the practical choice:

// SDK handles this automatically
const WebSocket = require('ws');
const ws = new WebSocket('wss://api.openai.com/v1/realtime');

ws.on('open', () => {
  // Send audio chunks
  ws.send(audioBuffer);
});

ws.on('message', (data) => {
  // Receive agent response
  playAudio(data);
});

Again, the SDK abstracts this:

const agent = new Agent({ ... });
await agent.connect(); // SDK uses WebSocket automatically on server

Why WebSocket For Servers?

  1. Simpler: No NAT traversal needed
  2. Reliable: TCP guarantees delivery
  3. HTTP-compatible: Works through corporate proxies
  4. Good enough: 50-100ms latency is acceptable
  5. Universal: Works everywhere

WebRTC vs WebSocket: Comparison

| Feature         | WebRTC           | WebSocket     |
|-----------------|------------------|---------------|
| Protocol        | UDP              | TCP           |
| Latency         | 20-50ms          | 50-100ms      |
| Setup time      | 500-1000ms       | 100-200ms     |
| Complexity      | High             | Low           |
| Browser support | Native           | Native        |
| Server support  | Manual           | Easy          |
| Firewall        | Needs STUN/TURN  | HTTP ports    |
| Reliability     | Best-effort      | Guaranteed    |
| Best for        | Browser voice    | Server voice  |

SDK Transport Selection

The Agents SDK automatically picks the right transport:

// In browser
if (typeof window !== 'undefined' && window.RTCPeerConnection) {
  // Use WebRTC (low latency)
  transport = new WebRTCTransport();
} else {
  // Use WebSocket (simpler)
  transport = new WebSocketTransport();
}

You don’t write conditional logic. The SDK handles it.

Real-World Latency: WebRTC vs WebSocket

Let’s compare actual voice turn latency:

WebRTC (Browser):

  • User speaks: 0ms
  • Audio capture: 20ms
  • Network transport: 30ms
  • Server processing: 200ms
  • Network transport: 30ms
  • Audio playback: 20ms
  • Total: ~300ms

WebSocket (Browser):

  • User speaks: 0ms
  • Audio capture: 20ms
  • Network transport: 60ms
  • Server processing: 200ms
  • Network transport: 60ms
  • Audio playback: 20ms
  • Total: ~360ms

WebSocket (Server):

  • Voice input: 0ms
  • Audio encode: 10ms
  • Network transport: 50ms
  • Server processing: 200ms
  • Network transport: 50ms
  • Audio decode: 10ms
  • Total: ~320ms

WebRTC saves ~60ms vs WebSocket in browsers (300ms vs 360ms). That’s roughly 17% faster.
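The per-turn totals above are just sums of the stage budgets, which makes the saving easy to check:

```javascript
// Sums the per-stage budgets from the comparison above and computes
// the relative saving WebRTC gives over WebSocket in the browser.
function totalLatency(stages) {
  return stages.reduce((sum, ms) => sum + ms, 0);
}

// capture, network, server processing, network, playback
const webrtcBrowser = [20, 30, 200, 30, 20];
const websocketBrowser = [20, 60, 200, 60, 20];

const webrtcTotal = totalLatency(webrtcBrowser);       // 300
const websocketTotal = totalLatency(websocketBrowser); // 360
const savingPct = Math.round((1 - webrtcTotal / websocketTotal) * 100); // ~17
```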

When WebRTC Setup Cost Matters

WebRTC has higher setup cost (500-1000ms). But once connected, it’s faster.

Short conversations (< 5 turns):

  • WebSocket might be faster overall (simpler setup)

Long conversations (> 10 turns):

  • WebRTC wins (lower per-turn latency)

The SDK uses WebRTC for browser-based voice because voice conversations are typically long (10+ turns).
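That break-even point can be sketched with a quick amortization. The setup and per-turn figures below are the midpoints of the ranges quoted earlier; real numbers vary with network conditions:

```javascript
// Total conversation time = one-time setup cost + per-turn latency * turns.
function totalTime(setupMs, perTurnMs, turns) {
  return setupMs + perTurnMs * turns;
}

// Turn count at which transport `a` stops being slower than transport `b`:
// a.setup + a.perTurn * n <= b.setup + b.perTurn * n
function breakEvenTurns(a, b) {
  return Math.ceil((a.setupMs - b.setupMs) / (b.perTurnMs - a.perTurnMs));
}

const webrtc = { setupMs: 750, perTurnMs: 300 };    // higher setup, faster turns
const websocket = { setupMs: 150, perTurnMs: 360 }; // lower setup, slower turns

breakEvenTurns(webrtc, websocket); // 10 turns — matching the 10+ turn guideline
```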

NAT Traversal: Why WebRTC Needs STUN/TURN

Most users are behind NATs (Network Address Translation). WebRTC needs to punch through:

graph TD
    A[Browser Behind NAT] --> B[STUN Server]
    B --> C{Can Direct Connect?}
    C -->|Yes| D[Direct P2P]
    C -->|No| E[TURN Relay]
    E --> F[Server]
    D --> F

  • STUN: Discovers public IP/port
  • TURN: Relays traffic when P2P fails

The SDK handles STUN/TURN configuration automatically.
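If you ever do configure ICE servers yourself, the shape looks like this. The URLs and credentials below are placeholders, not the SDK’s actual infrastructure:

```javascript
// Illustrative ICE configuration for an RTCPeerConnection.
// All server URLs and credentials here are hypothetical.
const iceConfig = {
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' }, // STUN: discover public IP/port
    {
      urls: 'turn:turn.example.com:3478',     // TURN: relay when P2P fails
      username: 'user',
      credential: 'secret'
    }
  ],
  iceTransportPolicy: 'all' // set to 'relay' to force TURN when testing
};
```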

Audio Codec Selection

Different transports support different codecs:

WebRTC:

  • Opus (default, best for voice)
  • G.711 (fallback)
  • Adaptive bitrate

WebSocket:

  • PCM (raw audio)
  • Opus (compressed)
  • Custom codecs

The SDK uses Opus for both (efficient, high quality).

Error Recovery

WebRTC:

  • Packet loss: ~1-2% (UDP is best-effort)
  • Jitter buffer compensates
  • Forward Error Correction (FEC) optional

WebSocket:

  • Packet loss: 0% (TCP retransmits)
  • But retransmission adds latency
  • Less graceful under poor network

For voice, WebRTC’s approach is better: small packet loss is acceptable if latency stays low.
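To see how a jitter buffer trades loss for latency, here is a deliberately minimal sketch: packets arrive out of order over UDP, the buffer releases them in sequence, and a packet that never arrives is skipped rather than waited for. Real jitter buffers (including WebRTC’s) are adaptive and far more sophisticated:

```javascript
// Minimal fixed-behavior jitter buffer: reorders packets by sequence
// number and conceals missing ones instead of blocking on them.
class JitterBuffer {
  constructor() {
    this.packets = new Map(); // seq -> payload
    this.nextSeq = 0;
  }
  push(seq, payload) {
    if (seq >= this.nextSeq) this.packets.set(seq, payload); // drop late packets
  }
  // Release the next packet in order; if it's missing, skip it (treat as lost).
  pop() {
    const payload = this.packets.get(this.nextSeq);
    this.packets.delete(this.nextSeq);
    this.nextSeq += 1;
    return payload ?? null; // null = concealed loss
  }
}
```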

Browser Compatibility

WebRTC:

  • Chrome: ✅ Full support
  • Firefox: ✅ Full support
  • Safari: ✅ Full support (iOS 11+)
  • Edge: ✅ Full support

WebSocket:

  • Chrome: ✅ Full support
  • Firefox: ✅ Full support
  • Safari: ✅ Full support
  • Edge: ✅ Full support

Both are universally supported. WebRTC is the better choice for low-latency voice.

Server Deployment Considerations

WebRTC on server:

  • Requires manual setup (no browser APIs)
  • Need STUN/TURN infrastructure
  • More complex deployment

WebSocket on server:

  • Simple HTTP upgrade
  • No extra infrastructure
  • Easy deployment

This is why the SDK uses WebSocket for server environments.

Measuring Transport Performance

The SDK exposes transport metrics:

agent.on('transport_metrics', (metrics) => {
  console.log({
    protocol: metrics.protocol, // 'webrtc' or 'websocket'
    latency_ms: metrics.averageLatency,
    packet_loss: metrics.packetLossRate,
    jitter_ms: metrics.jitter
  });
});

Track these metrics to ensure voice quality.
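Individual latency samples are noisy, so it helps to smooth them before acting on them. A simple rolling-average helper (the window size of 3 here is arbitrary) could sit inside the metrics handler above:

```javascript
// Rolling average over the last N latency samples — smooths out
// one-off spikes before comparing against an alert threshold.
function rollingAverage(windowSize) {
  const samples = [];
  return (latencyMs) => {
    samples.push(latencyMs);
    if (samples.length > windowSize) samples.shift(); // evict oldest sample
    return samples.reduce((sum, v) => sum + v, 0) / samples.length;
  };
}

const avg = rollingAverage(3);
avg(100); // 100
avg(200); // 150
avg(300); // 200
avg(400); // 300 (window is now [200, 300, 400])
```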

Best Practices

1. Let the SDK choose transport

Don’t override unless you have specific needs:

// Good: SDK decides
const agent = new Agent({ ... });

// Bad: Forcing specific transport
const agent = new Agent({ transport: 'websocket' }); // Usually unnecessary

2. Monitor latency

Set alerts if latency exceeds thresholds:

agent.on('transport_metrics', (metrics) => {
  if (metrics.averageLatency > 500) {
    alert('Voice latency high - check network');
  }
});

3. Handle connection failures gracefully

agent.on('transport_error', async (error) => {
  // Attempt reconnection
  await agent.reconnect();
});
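Reconnecting immediately on every failure can hammer a struggling network. A common refinement is exponential backoff; here is a generic sketch where `connect` is any async function that throws on failure (the delays and attempt cap are arbitrary choices, and in the handler above `connect` would be `() => agent.reconnect()`):

```javascript
// Retries an async connect function with exponentially growing delays.
// Throws the last error if all attempts fail.
async function reconnectWithBackoff(connect, maxAttempts = 5, baseDelayMs = 250) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      const delay = baseDelayMs * 2 ** attempt;   // 250ms, 500ms, 1s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```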

4. Test both transports

Even though SDK chooses automatically, test both in your environment:

# Force WebSocket for testing
TRANSPORT=websocket npm start

# Force WebRTC for testing
TRANSPORT=webrtc npm start

5. Consider network conditions

On poor networks (< 1 Mbps), WebSocket might actually be better (reliable TCP).

Conclusion

Transport layer matters for voice agent latency.

The Agents SDK automatically chooses:

  • WebRTC in browsers (20-50ms per message)
  • WebSocket on servers (50-100ms per message)

You don’t write conditional logic. The SDK picks the optimal transport based on environment.

Result: Voice agents with minimal latency, no transport configuration required.


Implementation Guide:

  1. Let SDK auto-select transport (WebRTC browser, WebSocket server)
  2. Monitor transport metrics for latency/packet loss
  3. Handle transport errors with reconnection logic
  4. Test both transports in your environment
  5. Set latency alerts (> 500ms is problematic)

The SDK handles NAT traversal, codec selection, and error recovery automatically.



Next: Explore how the SDK’s built-in tracing captures full voice conversations with audio playback for debugging.
