Stream Voice Like You Stream Text
- ZH+
- Performance
- January 9, 2026
User asks a question. Three seconds of silence. Then the agent speaks.
That’s how 80% of voice agents work today. They wait. Generate the full response. Then play audio.
The result? Users think the system froze. They repeat themselves. They hang up.
The fix: Stream audio as it’s generated—just like you stream text in chat interfaces.
The Problem With Non-Streaming Voice
In text chat, non-streaming is tolerable:
User: "What's the weather?"
[2 seconds]
Agent: "It's 72°F and sunny in San Francisco."
Users see a typing indicator. They know the agent is working. Two seconds feels fine.
In voice, there’s no typing indicator. There’s just silence:
User: "What's the weather?"
[3 seconds of dead air]
Agent: "It's 72°F and sunny in San Francisco."
Three seconds feels like forever. Users assume:
- The system didn’t hear them
- The call dropped
- Something broke
They repeat the question, talk over the agent, or hang up. 37% of users abandon after 3+ seconds of silence.
Why Voice Agents Have Latency
Speech-to-speech involves multiple stages:
1. Record audio chunk (100-300ms)
2. Send to server (50-200ms network)
3. Transcribe audio to text (200-500ms)
4. Generate response with LLM (1000-3000ms) ← Longest step
5. Convert text to speech (500-1500ms)
6. Send audio back (50-200ms network)
7. Play audio (duration of audio)
Total latency: 2-6 seconds before first audio plays.
The LLM generation step dominates. For complex questions, it can take 5+ seconds.
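For a rough sense of where the time goes, here's a small sketch (illustrative only, using the stage ranges above) that adds up the silence before the first sound in a non-streaming pipeline:

// Rough latency budget for a non-streaming pipeline (illustrative numbers from above)
const stagesMs = {
  record: [100, 300],
  uplink: [50, 200],
  transcribe: [200, 500],
  llm: [1000, 3000],   // Dominates, especially for complex questions
  tts: [500, 1500],
  downlink: [50, 200]
};

const totalMin = Object.values(stagesMs).reduce((sum, [min]) => sum + min, 0);
const totalMax = Object.values(stagesMs).reduce((sum, [, max]) => sum + max, 0);

console.log(`Silence before first audio: ${totalMin}-${totalMax}ms`);
// Silence before first audio: 1900-5700ms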
Streaming: Start Speaking While Processing
Streaming audio works like streaming text—send chunks as they’re generated:
graph TB
A[User Speaks] --> B[Audio Recording Complete]
B --> C[Transcription Starts]
C --> D[Text Available]
D --> E[LLM Generates Token 1]
E --> F[Convert Token 1 to Audio]
F --> G[Stream Audio Chunk 1]
G --> H[User Hears First Sound]
E --> I[LLM Generates Token 2]
I --> J[Convert Token 2 to Audio]
J --> K[Stream Audio Chunk 2]
K --> L[User Hears Continuation]
I --> M[LLM Generates Token 3...]
M --> N[Continue Streaming]
H -.-> O[Perceived Latency: ~500ms]
B -.-> P[Actual Processing Time: 3000ms]
style H fill:#d4f4dd
style O fill:#d4f4dd
style P fill:#ffe1e1
The key insight: Users perceive latency as time-to-first-sound, not time-to-complete-response.
- Non-streaming: 3 seconds of silence → agent speaks
- Streaming: 0.5 seconds → agent starts speaking (still processing rest)
Perceived latency drops 83% even though total processing time is the same.
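To make that distinction concrete, here's a tiny helper (using the illustrative timings above) that computes the drop in perceived latency:

// Perceived latency = time to first sound; total latency = time to the complete response
function perceivedLatencyDrop(firstSoundMs, fullPipelineMs) {
  return Math.round((1 - firstSoundMs / fullPipelineMs) * 100);
}

// Non-streaming: first sound only after the full ~3000ms pipeline finishes
// Streaming: first sound at ~500ms while the rest is still being generated
console.log(`${perceivedLatencyDrop(500, 3000)}% drop in perceived latency`); // 83%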
Architecture: Streaming Voice Responses
Here’s how OpenAI Realtime API handles streaming:
import { RealtimeClient } from '@openai/realtime-api-beta';
class StreamingVoiceAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    this.audioContext = new AudioContext(); // Web Audio output for streamed audio chunks
    this.queryStartTime = null;             // Record this when the user's query is sent
  }
async setup() {
await this.client.connect();
// Enable streaming by default
await this.client.updateSession({
instructions: 'You are a helpful assistant.',
voice: 'alloy',
modalities: ['audio'],
// Key setting: turn_detection manages when to start responding
turn_detection: {
type: 'server_vad', // Voice Activity Detection
threshold: 0.5, // Sensitivity
prefix_padding_ms: 300, // Include 300ms before speech
silence_duration_ms: 500 // Wait 500ms of silence before responding
}
});
// Audio delta events = streaming chunks
this.client.on('response.audio.delta', (event) => {
// event.delta = base64-encoded audio chunk
this.playAudioChunk(event.delta);
});
// Track when streaming starts
this.client.on('response.audio.started', () => {
console.log('Agent started speaking (streaming)');
this.measureLatency('first_audio');
});
// Track when streaming completes
this.client.on('response.audio.done', () => {
console.log('Agent finished speaking');
this.measureLatency('complete_audio');
});
}
  playAudioChunk(base64Audio) {
    // Audio deltas are raw PCM16 (24 kHz mono) by default, so build an AudioBuffer
    // directly; decodeAudioData() only handles encoded formats like WAV/MP3
    const bytes = Uint8Array.from(atob(base64Audio), (c) => c.charCodeAt(0));
    const samples = new Int16Array(bytes.buffer);
    const floats = Float32Array.from(samples, (s) => s / 32768);
    const buffer = this.audioContext.createBuffer(1, floats.length, 24000);
    buffer.copyToChannel(floats, 0);
    // Play immediately (don't wait for the full response)
    const source = this.audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(this.audioContext.destination);
    source.start();
  }
measureLatency(event) {
const now = Date.now();
if (event === 'first_audio') {
console.log(`Time to first audio: ${now - this.queryStartTime}ms`);
} else if (event === 'complete_audio') {
console.log(`Total response time: ${now - this.queryStartTime}ms`);
}
}
}
Non-Streaming vs Streaming: Real Comparison
Let’s measure actual latency differences:
Non-Streaming Implementation
class NonStreamingAgent {
async getResponse(userQuestion) {
const startTime = Date.now();
// 1. Generate complete response
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [{ role: 'user', content: userQuestion }]
})
});
const data = await response.json();
const text = data.choices[0].message.content;
console.log(`LLM response time: ${Date.now() - startTime}ms`);
// 2. Convert entire text to speech
const ttsStart = Date.now();
const audioResponse = await fetch('https://api.openai.com/v1/audio/speech', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'tts-1',
voice: 'alloy',
input: text
})
});
const audio = await audioResponse.arrayBuffer();
console.log(`TTS time: ${Date.now() - ttsStart}ms`);
console.log(`Total time to first sound: ${Date.now() - startTime}ms`);
// 3. Play audio (user hears first sound NOW)
return audio;
}
}
// Example timing:
// LLM response time: 2400ms
// TTS time: 1200ms
// Total time to first sound: 3600ms ← User waits 3.6 seconds
Streaming Implementation
class StreamingAgent {
  constructor(client) {
    this.client = client; // A connected RealtimeClient instance
  }
  async getResponse(userQuestion) {
    const startTime = Date.now();
    let firstAudioTime = null;
    // Register handlers first (in production, do this once during setup,
    // not per call, so listeners don't accumulate)
    this.client.on('response.audio.delta', (event) => {
      if (!firstAudioTime) {
        firstAudioTime = Date.now();
        console.log(`Time to first sound: ${firstAudioTime - startTime}ms`);
      }
      // Play chunk immediately
      this.playAudioChunk(event.delta);
    });
    this.client.on('response.done', () => {
      console.log(`Total response time: ${Date.now() - startTime}ms`);
    });
    // Send the question; audio chunks stream back as they're generated
    await this.client.sendUserMessageContent([{
      type: 'text',
      text: userQuestion
    }]);
  }
}
// Example timing:
// Time to first sound: 520ms ← User hears voice in 0.5 seconds
// Total response time: 2800ms (processing continues while speaking)
Result:
- Non-streaming: 3.6 second wait
- Streaming: 0.52 second wait
- Improvement: 85% faster perceived response
Real-World Implementation
Here’s production-ready streaming voice agent code:
import { RealtimeClient } from '@openai/realtime-api-beta';
class ProductionStreamingAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    this.audioContext = new AudioContext(); // Web Audio output for streamed audio chunks
    this.audioQueue = [];
    this.isPlaying = false;
    this.metrics = {
      queries: 0,
      avgTimeToFirstAudio: 0,
      avgTotalTime: 0
    };
  }
async initialize() {
await this.client.connect();
await this.client.updateSession({
instructions: `
You are a helpful customer service agent.
Respond naturally and conversationally.
If you need to think, start speaking general context while you process specifics.
`,
voice: 'alloy',
modalities: ['audio'],
turn_detection: {
type: 'server_vad',
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 700 // Slightly longer for natural pauses
}
});
// Set up streaming audio handling
this.setupAudioStreaming();
this.setupMetrics();
}
setupAudioStreaming() {
// Queue audio chunks for smooth playback
this.client.on('response.audio.delta', async (event) => {
const audioChunk = this.decodeAudio(event.delta);
this.audioQueue.push(audioChunk);
// Start playing if not already playing
if (!this.isPlaying) {
this.playAudioQueue();
}
});
this.client.on('response.audio.done', () => {
// Mark end of audio stream
this.audioQueue.push(null); // Sentinel value
});
}
async playAudioQueue() {
this.isPlaying = true;
while (this.audioQueue.length > 0) {
const chunk = this.audioQueue.shift();
// Null = end of stream
if (chunk === null) {
break;
}
// Play chunk
await this.playChunk(chunk);
}
this.isPlaying = false;
}
async playChunk(audioData) {
return new Promise((resolve) => {
const source = this.audioContext.createBufferSource();
source.buffer = audioData;
source.connect(this.audioContext.destination);
source.onended = resolve;
source.start();
});
}
setupMetrics() {
let queryStartTime = null;
let firstAudioTime = null;
this.client.on('conversation.item.created', (event) => {
if (event.item.role === 'user') {
queryStartTime = Date.now();
firstAudioTime = null;
}
});
this.client.on('response.audio.started', () => {
if (queryStartTime && !firstAudioTime) {
firstAudioTime = Date.now();
const latency = firstAudioTime - queryStartTime;
// Update metrics
this.metrics.queries++;
this.metrics.avgTimeToFirstAudio =
(this.metrics.avgTimeToFirstAudio * (this.metrics.queries - 1) + latency) /
this.metrics.queries;
console.log(`Time to first audio: ${latency}ms`);
console.log(`Average (all queries): ${this.metrics.avgTimeToFirstAudio.toFixed(0)}ms`);
}
});
this.client.on('response.done', () => {
if (queryStartTime) {
const totalTime = Date.now() - queryStartTime;
this.metrics.avgTotalTime =
(this.metrics.avgTotalTime * (this.metrics.queries - 1) + totalTime) /
this.metrics.queries;
console.log(`Total response time: ${totalTime}ms`);
console.log(`Average (all queries): ${this.metrics.avgTotalTime.toFixed(0)}ms`);
}
});
}
  decodeAudio(base64) {
    // Deltas are raw PCM16 (24 kHz mono) by default, so build an AudioBuffer directly;
    // decodeAudioData() only handles encoded formats like WAV/MP3
    const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
    const samples = new Int16Array(bytes.buffer);
    const floats = Float32Array.from(samples, (s) => s / 32768);
    const buffer = this.audioContext.createBuffer(1, floats.length, 24000);
    buffer.copyToChannel(floats, 0);
    return buffer;
  }
getMetrics() {
return {
total_queries: this.metrics.queries,
avg_time_to_first_audio_ms: Math.round(this.metrics.avgTimeToFirstAudio),
avg_total_response_time_ms: Math.round(this.metrics.avgTotalTime),
avg_processing_while_speaking_ms: Math.round(
this.metrics.avgTotalTime - this.metrics.avgTimeToFirstAudio
)
};
}
}
// Usage
const agent = new ProductionStreamingAgent();
await agent.initialize();
// After 100 queries:
console.log(agent.getMetrics());
// {
// total_queries: 100,
// avg_time_to_first_audio_ms: 580,
// avg_total_response_time_ms: 2750,
// avg_processing_while_speaking_ms: 2170
// }
Business Impact: Real Numbers
An insurance company tested streaming vs non-streaming for customer service:
Non-streaming voice agent:
- Average time-to-first-audio: 3.2 seconds
- User abandonment during silence: 22%
- Calls completed: 78%
- Customer satisfaction: 3.1/5
Streaming voice agent:
- Average time-to-first-audio: 0.6 seconds
- User abandonment during silence: 4%
- Calls completed: 96%
- Customer satisfaction: 4.3/5
Impact:
- 81% reduction in abandonment (22% → 4%)
- 18-point jump in completion rate (78% → 96%)
- 39% higher satisfaction (3.1 → 4.3)
Revenue impact: With 50,000 calls/month and $25 average revenue per completed call:
- Non-streaming: 50K × 78% = 39K completed × $25 = $975K/month
- Streaming: 50K × 96% = 48K completed × $25 = $1.2M/month
- Gain: $225K/month from streaming alone
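That revenue math is easy to sanity-check in a few lines (same assumptions as above: 50,000 calls/month, $25 per completed call):

// Monthly revenue from completed calls (assumptions from the case study above)
const CALLS_PER_MONTH = 50000;
const REVENUE_PER_COMPLETED_CALL = 25;

const monthlyRevenue = (completionRate) =>
  CALLS_PER_MONTH * completionRate * REVENUE_PER_COMPLETED_CALL;

const nonStreaming = monthlyRevenue(0.78); // $975,000
const streaming = monthlyRevenue(0.96);    // $1,200,000
console.log(`Gain from streaming: $${streaming - nonStreaming}/month`); // $225000/month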
When Streaming Matters Most
| Critical For Streaming | Less Critical |
|---|---|
| Customer service voice agents | Pre-recorded voice messages |
| Real-time voice assistants | Batch processing tasks |
| Interactive conversations | One-way announcements |
| High-latency questions (complex) | Simple, fast responses (<1 sec) |
| Public-facing applications | Internal tools |
Streaming matters when humans wait for responses in real-time.
Common Streaming Pitfalls
Pitfall 1: Audio Buffering Too Aggressive
// ❌ Wrong: Buffer several seconds of audio (e.g. 5 chunks) before playing
if (audioQueue.length < 5) {
  return; // Wait for more chunks
}
// ✅ Right: Play as soon as first chunk arrives
if (!isPlaying && audioQueue.length > 0) {
playAudioQueue(); // Start immediately
}
Pitfall 2: Network Jitter Causes Gaps
// ❌ Wrong: Play each chunk exactly when received
audioChunk.play(); // Creates gaps if network delays
// ✅ Right: Use a small buffer to smooth jitter
if (audioQueue.length < 2) {
  await wait(100); // wait() = small sleep helper; ~100ms buffer to smooth jitter
}
audioChunk.play();
Pitfall 3: Not Handling Backpressure
// ❌ Wrong: Queue grows unbounded if playback is slower
audioQueue.push(chunk); // Could OOM if chunks arrive faster than playback
// ✅ Right: Drop chunks or slow down if queue too large
if (audioQueue.length > 50) {
console.warn('Audio queue backed up, dropping oldest chunk');
audioQueue.shift(); // Remove oldest
}
audioQueue.push(chunk);
Advanced: Dynamic Streaming Strategy
Adapt streaming based on response complexity:
class AdaptiveStreamingAgent {
async getResponse(query) {
// Estimate response complexity
const complexity = await this.estimateComplexity(query);
if (complexity === 'simple') {
// Fast response coming, don't stream (avoid audio artifacts)
return this.getNonStreamingResponse(query);
} else {
// Slow response, stream to reduce perceived latency
return this.getStreamingResponse(query);
}
}
async estimateComplexity(query) {
// Quick check: Is this a simple fact or complex reasoning?
const simplePatterns = [
/what time/i,
/what's the weather/i,
/who is/i,
/when is/i
];
if (simplePatterns.some(pattern => pattern.test(query))) {
return 'simple'; // Likely <1 second response
}
return 'complex'; // Likely 2+ seconds, benefit from streaming
}
}
Cost Considerations
Streaming doesn’t cost more per se—but it enables longer conversations:
- Non-streaming: Users abandon due to latency
- Streaming: Users stay engaged, have longer conversations
Example costs (OpenAI Realtime API):
- Input audio: $0.06/minute
- Output audio: $0.24/minute
Non-streaming scenario:
- Average conversation: 2 minutes (short due to abandonment)
- Cost: $0.12 + $0.48 = $0.60/conversation
- Revenue per conversation: $8 (many abandoned early)
Streaming scenario:
- Average conversation: 3.5 minutes (users stay engaged)
- Cost: $0.21 + $0.84 = $1.05/conversation
- Revenue per conversation: $15 (more completed)
Net result: Streaming costs 75% more per conversation but generates 88% more revenue. ROI is positive.
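Here's the same per-conversation trade-off as a quick sketch, using the Realtime API list prices and the example durations and revenue above (the revenue figures are this example's assumptions, not universal numbers):

// Per-conversation cost vs revenue (rates and durations from the example above)
const INPUT_RATE = 0.06;   // $/minute of input audio
const OUTPUT_RATE = 0.24;  // $/minute of output audio

function conversationEconomics(minutes, revenue) {
  const cost = minutes * INPUT_RATE + minutes * OUTPUT_RATE;
  return { cost, revenue, margin: revenue - cost };
}

console.log(conversationEconomics(2, 8));    // ≈ { cost: 0.60, revenue: 8, margin: 7.40 }
console.log(conversationEconomics(3.5, 15)); // ≈ { cost: 1.05, revenue: 15, margin: 13.95 }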
Implementation Timeline
Week 1: Enable streaming in Realtime API
- Update session configuration
- Add audio delta event handlers
- Test with simple queries
Week 2: Optimize playback queue
- Implement smooth audio queueing
- Handle network jitter
- Add backpressure handling
Week 3: Measure latency improvements
- Track time-to-first-audio before/after
- Monitor abandonment rates
- A/B test streaming vs non-streaming
Week 4: Deploy and monitor
- Roll out to production gradually (10% → 50% → 100%)
- Watch for audio artifacts or gaps
- Tune buffer sizes based on real usage
The Future: Predictive Streaming
Next generation: Start streaming before the user finishes speaking:
// Agent predicts user's question mid-sentence
// Starts generating response before user stops talking
// By the time user finishes, audio is already playing
// Time to first audio: ~0ms (feels instantaneous)
This requires:
- Real-time intent detection
- Speculative response generation
- Rollback if prediction was wrong
OpenAI’s Realtime API is evolving toward this. The result: Voice conversations that feel as fast as human-to-human.
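Nothing here is an off-the-shelf API yet, but the control flow is roughly: start a speculative response from the partial transcript, keep it if the final utterance matches, cancel it if not. A minimal sketch of that pattern, where predictIntent() and startResponse() are hypothetical placeholders rather than real Realtime API calls:

// Hypothetical sketch of speculative generation with rollback.
// predictIntent() and startResponse() are placeholders, not real Realtime API methods.
class SpeculativeResponder {
  constructor(agent) {
    this.agent = agent;
    this.speculative = null; // { intent, controller } for an in-flight guess
  }

  // Called with partial transcripts while the user is still speaking
  onPartialTranscript(partialText) {
    if (this.speculative) return; // Already speculating on this turn
    const intent = this.agent.predictIntent(partialText);
    if (intent.confidence > 0.8) {
      const controller = new AbortController();
      this.agent.startResponse(intent, { signal: controller.signal });
      this.speculative = { intent, controller };
    }
  }

  // Called once the user stops talking
  onFinalTranscript(finalText) {
    const finalIntent = this.agent.predictIntent(finalText);
    const guess = this.speculative;
    this.speculative = null;

    if (guess && guess.intent.name === finalIntent.name) {
      return; // Prediction held: audio is already streaming, ~0ms perceived latency
    }
    if (guess) {
      guess.controller.abort(); // Roll back the wrong guess before the user hears much of it
    }
    this.agent.startResponse(finalIntent); // Fall back to the normal streaming path
  }
}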
What’s Next
If you want voice agents with streaming responses, we can implement real-time audio streaming with OpenAI Realtime API. The result: No more awkward silence. Users hear responses immediately, even for complex questions. Conversations feel natural and responsive.