Why Voice Agents Use WebRTC In Browsers
- ZH+
- Performance, Architecture
- December 24, 2025
The transport layer isn’t something most developers think about. But when you’re building voice agents, it’s the difference between 50ms and 500ms of latency.
The OpenAI Agents SDK automatically chooses the right transport:
- WebRTC in browsers (ultra-low latency)
- WebSocket on servers (simpler, still fast)
You don’t have to decide. The SDK picks the optimal transport based on your environment.
Why Transport Matters For Voice
Voice is real-time. Users expect instant responses. Every millisecond of latency makes conversations feel slower.
The transport layer determines the round-trip time across these steps:
- User speaks
- Audio reaches server
- Agent processes
- Response returns
- User hears reply
```mermaid
graph LR
    A[User Speaks] -->|Transport| B[Server Processes]
    B -->|Transport| C[User Hears]
    style A fill:#e1f5ff
    style C fill:#e1f5ff
```
Transport accounts for 30-40% of total latency. Choose wrong, and your agent feels sluggish even if processing is fast.
WebRTC: Built For Real-Time
WebRTC (Web Real-Time Communication) was designed for video calls. It’s optimized for low-latency audio and video, originally for peer-to-peer connections; the same machinery works just as well between a browser and a server.
How WebRTC Works
```mermaid
graph TD
    A[Browser] -->|STUN/TURN| B[NAT Traversal]
    B --> C[Direct P2P Connection]
    C --> D[Server]
    D --> C
    C --> A
    style C fill:#d4f1d4
```
Key features:
- UDP-based: No TCP handshake overhead
- NAT traversal: Works behind firewalls
- Adaptive bitrate: Adjusts to network conditions
- Jitter buffering: Smooths out packet arrival
Latency profile:
- Connection setup: 500-1000ms (one-time cost)
- Per-message: 20-50ms (fast)
- Audio encoding/decoding: 10-20ms
Total voice turn latency with WebRTC: ~300-500ms
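The jitter buffering mentioned above can be modeled as a small reordering buffer: packets arrive over UDP out of order, and the buffer releases them in sequence. This is an illustrative sketch, not the SDK's or any browser's actual implementation (real jitter buffers also manage playout timing and packet loss):

```javascript
// Minimal jitter buffer sketch: reorders packets by sequence number.
class JitterBuffer {
  constructor() {
    this.packets = new Map(); // seq -> payload
    this.nextSeq = 0;         // next sequence number to release
  }

  // Store an arriving packet under its sequence number
  push(seq, payload) {
    this.packets.set(seq, payload);
  }

  // Release packets in order; stops at the first gap
  drain() {
    const out = [];
    while (this.packets.has(this.nextSeq)) {
      out.push(this.packets.get(this.nextSeq));
      this.packets.delete(this.nextSeq);
      this.nextSeq += 1;
    }
    return out;
  }
}

// Packets 0 and 2 arrive first; packet 1 fills the gap later
const buffer = new JitterBuffer();
buffer.push(0, 'frame-0');
buffer.push(2, 'frame-2');
console.log(buffer.drain()); // ['frame-0'] -- frame-2 waits for the gap
buffer.push(1, 'frame-1');
console.log(buffer.drain()); // ['frame-1', 'frame-2'] in order
```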
WebRTC In The Browser
Browsers have native WebRTC support:
```javascript
// SDK handles this automatically, but here's roughly what it does:
const peerConnection = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Capture the microphone and add its audio track
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioTrack = stream.getAudioTracks()[0];
peerConnection.addTrack(audioTrack, stream);

// Establish connection via SDP offer/answer
const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);

// Exchange ICE candidates
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    // Send candidate to server
  }
};
```
The Agents SDK handles all of this. You just call:
```javascript
const agent = new Agent({ ... });
await agent.connect(); // SDK uses WebRTC automatically in browser
```
Why WebRTC For Browsers?
- Native support: Every modern browser has WebRTC
- Low latency: UDP-based, no TCP overhead
- Firewall-friendly: NAT traversal built in
- Adaptive: Handles network changes gracefully
- Secure: Encrypted by default (DTLS)
WebSocket: Simpler Alternative
WebSocket is a TCP-based protocol that provides full-duplex communication over a single persistent connection, established through an HTTP upgrade handshake.
How WebSocket Works
```mermaid
graph TD
    A[Client] -->|HTTP Upgrade| B[WebSocket Handshake]
    B --> C[Persistent TCP Connection]
    C -->|Binary Messages| D[Server]
    D -->|Binary Messages| C
    C --> A
```
Key features:
- TCP-based: Reliable delivery
- HTTP compatible: Works through proxies
- Simple API: Easier than WebRTC
- No NAT issues: Uses HTTP ports
Latency profile:
- Connection setup: 100-200ms (faster than WebRTC)
- Per-message: 50-100ms (slower than WebRTC)
- Audio encoding/decoding: 10-20ms
Total voice turn latency with WebSocket: ~400-600ms
WebSocket On The Server
Server environments don’t have browser APIs. WebSocket is the practical choice:
```javascript
// SDK handles this automatically
const WebSocket = require('ws');
const ws = new WebSocket('wss://api.openai.com/v1/realtime');

ws.on('open', () => {
  // Send audio chunks
  ws.send(audioBuffer);
});

ws.on('message', (data) => {
  // Receive agent response
  playAudio(data);
});
```
Again, the SDK abstracts this:
```javascript
const agent = new Agent({ ... });
await agent.connect(); // SDK uses WebSocket automatically on server
```
Why WebSocket For Servers?
- Simpler: No NAT traversal needed
- Reliable: TCP guarantees delivery
- HTTP-compatible: Works through corporate proxies
- Good enough: 50-100ms latency is acceptable
- Universal: Works everywhere
WebRTC vs WebSocket: Comparison
| Feature | WebRTC | WebSocket |
|---|---|---|
| Protocol | UDP | TCP |
| Latency | 20-50ms | 50-100ms |
| Setup time | 500-1000ms | 100-200ms |
| Complexity | High | Low |
| Browser support | Native | Native |
| Server support | Manual | Easy |
| Firewall | Needs STUN/TURN | HTTP ports |
| Reliability | Best-effort | Guaranteed |
| Best for | Browser voice | Server voice |
SDK Transport Selection
The Agents SDK automatically picks the right transport:
```javascript
// Simplified sketch of the SDK's selection logic
let transport;
if (typeof window !== 'undefined' && window.RTCPeerConnection) {
  // In browser: use WebRTC (low latency)
  transport = new WebRTCTransport();
} else {
  // Elsewhere: use WebSocket (simpler)
  transport = new WebSocketTransport();
}
```
You don’t write conditional logic. The SDK handles it.
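The same decision can be expressed as a pure function, which makes the fallback order easy to test. The names below are illustrative, not the SDK's internals:

```javascript
// Illustrative transport picker mirroring the SDK's decision:
// WebRTC when browser APIs exist, WebSocket otherwise.
function pickTransport(env) {
  if (env.hasWindow && env.hasRTCPeerConnection) {
    return 'webrtc'; // low latency in browsers
  }
  return 'websocket'; // simpler, works everywhere
}

console.log(pickTransport({ hasWindow: true, hasRTCPeerConnection: true }));   // 'webrtc'
console.log(pickTransport({ hasWindow: false, hasRTCPeerConnection: false })); // 'websocket'
```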
Real-World Latency: WebRTC vs WebSocket
Let’s compare actual voice turn latency:
WebRTC (Browser):
- User speaks: 0ms
- Audio capture: 20ms
- Network transport: 30ms
- Server processing: 200ms
- Network transport: 30ms
- Audio playback: 20ms
- Total: ~300ms
WebSocket (Browser):
- User speaks: 0ms
- Audio capture: 20ms
- Network transport: 60ms
- Server processing: 200ms
- Network transport: 60ms
- Audio playback: 20ms
- Total: ~360ms
WebSocket (Server):
- Voice input: 0ms
- Audio encode: 10ms
- Network transport: 50ms
- Server processing: 200ms
- Network transport: 50ms
- Audio decode: 10ms
- Total: ~320ms
WebRTC saves ~60ms per turn vs WebSocket in browsers. That’s about 17% faster.
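The budgets above are just sums of the per-stage estimates, which is worth making explicit: the transport stages are the only numbers that change between the two browser scenarios.

```javascript
// Sum a voice-turn latency budget (milliseconds per stage).
// Stage values are the article's estimates, not measurements.
function totalLatency(stages) {
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0);
}

const webrtcBrowser = {
  capture: 20, uplink: 30, processing: 200, downlink: 30, playback: 20
};
const websocketBrowser = {
  capture: 20, uplink: 60, processing: 200, downlink: 60, playback: 20
};

console.log(totalLatency(webrtcBrowser));    // 300
console.log(totalLatency(websocketBrowser)); // 360
```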
When WebRTC Setup Cost Matters
WebRTC has higher setup cost (500-1000ms). But once connected, it’s faster.
Short conversations (< 5 turns):
- WebSocket might be faster overall (simpler setup)
Long conversations (> 10 turns):
- WebRTC wins (lower per-turn latency)
The SDK uses WebRTC for browser-based voice because voice conversations are typically long (10+ turns).
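Using the midpoints of the figures above (~750ms WebRTC setup vs ~150ms WebSocket setup, ~300ms vs ~360ms per turn), the break-even point falls out of simple arithmetic:

```javascript
// Total transport-side cost of a conversation: one-time setup
// plus per-turn latency. Numbers are the article's estimates.
function conversationCost(setupMs, perTurnMs, turns) {
  return setupMs + perTurnMs * turns;
}

// WebRTC: ~750ms setup, ~300ms/turn; WebSocket: ~150ms setup, ~360ms/turn
for (const turns of [3, 20]) {
  const webrtc = conversationCost(750, 300, turns);
  const websocket = conversationCost(150, 360, turns);
  console.log(`${turns} turns:`, webrtc < websocket ? 'WebRTC wins' : 'WebSocket wins');
}

// Break-even: extra setup cost divided by per-turn savings
console.log((750 - 150) / (360 - 300)); // 10 turns
```

With these assumed numbers, the two transports tie at exactly 10 turns, which matches the short-vs-long rule of thumb above.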
NAT Traversal: Why WebRTC Needs STUN/TURN
Most users are behind NATs (Network Address Translation). WebRTC needs to punch through:
```mermaid
graph TD
    A[Browser Behind NAT] --> B[STUN Server]
    B --> C{Can Direct Connect?}
    C -->|Yes| D[Direct P2P]
    C -->|No| E[TURN Relay]
    E --> F[Server]
    D --> F
```
- STUN: Discovers public IP/port
- TURN: Relays traffic when P2P fails
The SDK handles STUN/TURN configuration automatically.
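A typical ICE configuration looks like the sketch below. The STUN entry is Google's public server; the TURN URL and credentials are placeholders for infrastructure the SDK's backend (or your own deployment) would supply:

```javascript
// Illustrative ICE configuration. The TURN entry is a placeholder --
// real deployments provide their own relay server and credentials.
const iceConfig = {
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' }, // discovers public IP/port
    {
      urls: 'turn:turn.example.com:3478',     // relays traffic when P2P fails
      username: 'placeholder-user',
      credential: 'placeholder-secret'
    }
  ]
};

// In a browser this would be passed straight to the peer connection:
// const pc = new RTCPeerConnection(iceConfig);
console.log(iceConfig.iceServers.length); // 2
```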
Audio Codec Selection
Different transports support different codecs:
WebRTC:
- Opus (default, best for voice)
- G.711 (fallback)
- Adaptive bitrate
WebSocket:
- PCM (raw audio)
- Opus (compressed)
- Custom codecs
The SDK uses Opus for both (efficient, high quality).
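Codec preference boils down to reordering a capability list so Opus comes first. The sketch below shows that reordering as plain logic; in a browser, a list like this is what `RTCRtpTransceiver.setCodecPreferences` consumes:

```javascript
// Reorder a codec capability list so preferred MIME types come first.
// Unlisted codecs keep their original relative order (stable sort).
function preferCodecs(codecs, preferredMimeTypes) {
  const rank = (codec) => {
    const i = preferredMimeTypes.indexOf(codec.mimeType);
    return i === -1 ? preferredMimeTypes.length : i;
  };
  return [...codecs].sort((a, b) => rank(a) - rank(b));
}

const available = [
  { mimeType: 'audio/PCMU' }, // G.711 mu-law
  { mimeType: 'audio/opus' },
  { mimeType: 'audio/PCMA' }  // G.711 A-law
];

console.log(preferCodecs(available, ['audio/opus']).map((c) => c.mimeType));
// ['audio/opus', 'audio/PCMU', 'audio/PCMA']
```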
Error Recovery
WebRTC:
- Packet loss: ~1-2% (UDP is best-effort)
- Jitter buffer compensates
- Forward Error Correction (FEC) optional
WebSocket:
- Packet loss: 0% (TCP retransmits)
- But retransmission adds latency
- Less graceful under poor network
For voice, WebRTC’s approach is better: small packet loss is acceptable if latency stays low.
Browser Compatibility
WebRTC:
- Chrome: ✅ Full support
- Firefox: ✅ Full support
- Safari: ✅ Full support (iOS 11+)
- Edge: ✅ Full support
WebSocket:
- Chrome: ✅ Full support
- Firefox: ✅ Full support
- Safari: ✅ Full support
- Edge: ✅ Full support
Both are universally supported. WebRTC is the better choice for low-latency voice.
Server Deployment Considerations
WebRTC on server:
- Requires manual setup (no browser APIs)
- Need STUN/TURN infrastructure
- More complex deployment
WebSocket on server:
- Simple HTTP upgrade
- No extra infrastructure
- Easy deployment
This is why the SDK uses WebSocket for server environments.
Measuring Transport Performance
The SDK exposes transport metrics:
```javascript
agent.on('transport_metrics', (metrics) => {
  console.log({
    protocol: metrics.protocol, // 'webrtc' or 'websocket'
    latency_ms: metrics.averageLatency,
    packet_loss: metrics.packetLossRate,
    jitter_ms: metrics.jitter
  });
});
```
Track these metrics to ensure voice quality.
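A simple health classifier over those metrics might look like this. The thresholds are illustrative starting points, not SDK defaults:

```javascript
// Classify transport health from the metrics payload.
// Thresholds are illustrative, tune them for your application.
function transportHealth(metrics) {
  if (metrics.averageLatency > 500 || metrics.packetLossRate > 0.05) {
    return 'degraded'; // users will notice
  }
  if (metrics.averageLatency > 300 || metrics.jitter > 30) {
    return 'warning';  // watch closely
  }
  return 'healthy';
}

console.log(transportHealth({ averageLatency: 120, packetLossRate: 0.01, jitter: 10 })); // 'healthy'
console.log(transportHealth({ averageLatency: 650, packetLossRate: 0.01, jitter: 10 })); // 'degraded'
```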
Best Practices
1. Let the SDK choose transport
Don’t override unless you have specific needs:
```javascript
// Good: SDK decides
const agent = new Agent({ ... });

// Bad: forcing a specific transport
const agent = new Agent({ transport: 'websocket' }); // Usually unnecessary
```
2. Monitor latency
Set alerts if latency exceeds thresholds:
```javascript
agent.on('transport_metrics', (metrics) => {
  if (metrics.averageLatency > 500) {
    alert('Voice latency high - check network');
  }
});
```
3. Handle connection failures gracefully
```javascript
agent.on('transport_error', async (error) => {
  // Attempt reconnection
  await agent.reconnect();
});
```
4. Test both transports
Even though SDK chooses automatically, test both in your environment:
```bash
# Force WebSocket for testing
TRANSPORT=websocket npm start

# Force WebRTC for testing
TRANSPORT=webrtc npm start
```
5. Consider network conditions
On poor networks (< 1 Mbps), WebSocket might actually be better (reliable TCP).
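That heuristic can be stated directly. The 1 Mbps threshold is the article's rule of thumb, and the function name is illustrative:

```javascript
// Pick a transport given estimated network conditions.
// On very low bandwidth, TCP's reliability can beat UDP's speed.
function transportForNetwork({ bandwidthMbps, inBrowser }) {
  if (!inBrowser) return 'websocket';          // servers always use WebSocket
  if (bandwidthMbps < 1) return 'websocket';   // poor network: prefer reliability
  return 'webrtc';                             // good network: prefer low latency
}

console.log(transportForNetwork({ bandwidthMbps: 10, inBrowser: true }));  // 'webrtc'
console.log(transportForNetwork({ bandwidthMbps: 0.5, inBrowser: true })); // 'websocket'
```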
Conclusion
Transport layer matters for voice agent latency.
The Agents SDK automatically chooses:
- WebRTC in browsers (20-50ms per message)
- WebSocket on servers (50-100ms per message)
You don’t write conditional logic. The SDK picks the optimal transport based on environment.
Result: Voice agents with minimal latency, no transport configuration required.
Implementation Guide:
- Let SDK auto-select transport (WebRTC browser, WebSocket server)
- Monitor transport metrics for latency/packet loss
- Handle transport errors with reconnection logic
- Test both transports in your environment
- Set latency alerts (> 500ms is problematic)
The SDK handles NAT traversal, codec selection, and error recovery automatically.
Next: Explore how the SDK’s built-in tracing captures full voice conversations with audio playback for debugging.