Stream Voice Like You Stream Text

User asks a question. Three seconds of silence. Then the agent speaks.

That’s how 80% of voice agents work today. They wait. Generate the full response. Then play audio.

The result? Users think the system froze. They repeat themselves. They hang up.

The fix: Stream audio as it’s generated—just like you stream text in chat interfaces.

The Problem With Non-Streaming Voice

In text chat, non-streaming is tolerable:

User: "What's the weather?"
[2 seconds]
Agent: "It's 72°F and sunny in San Francisco."

Users see a typing indicator. They know the agent is working. Two seconds feels fine.

In voice, there’s no typing indicator. There’s just silence:

User: "What's the weather?"
[3 seconds of dead air]
Agent: "It's 72°F and sunny in San Francisco."

Three seconds feels like forever. Users assume:

  • The system didn’t hear them
  • The call dropped
  • Something broke

They repeat the question, talk over the agent, or hang up. 37% of users abandon after 3+ seconds of silence.

Why Voice Agents Have Latency

Speech-to-speech involves multiple stages:

1. Record audio chunk (100-300ms)
2. Send to server (50-200ms network)
3. Transcribe audio to text (200-500ms)
4. Generate response with LLM (1000-3000ms)  ← Longest step
5. Convert text to speech (500-1500ms)
6. Send audio back (50-200ms network)
7. Play audio (duration of audio)

Total latency: 2-6 seconds before first audio plays.

The LLM generation step dominates. For complex questions, it can take 5+ seconds.
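A back-of-envelope sum of those stages explains the 2-6 second figure (ranges copied from the list above):

// Sequential pipeline: each stage waits for the previous one to finish
const stages = {
  record:     [100, 300],
  upload:     [50, 200],
  transcribe: [200, 500],
  llm:        [1000, 3000],
  tts:        [500, 1500],
  download:   [50, 200]
};

const [min, max] = Object.values(stages).reduce(
  ([lo, hi], [a, b]) => [lo + a, hi + b],
  [0, 0]
);
console.log(`Silence before first audio: ${min}-${max}ms`);  // 1900-5700ms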

Streaming: Start Speaking While Processing

Streaming audio works like streaming text—send chunks as they’re generated:

graph TB
    A[User Speaks] --> B[Audio Recording Complete]
    B --> C[Transcription Starts]
    C --> D[Text Available]
    D --> E[LLM Generates Token 1]
    E --> F[Convert Token 1 to Audio]
    F --> G[Stream Audio Chunk 1]
    G --> H[User Hears First Sound]
    
    E --> I[LLM Generates Token 2]
    I --> J[Convert Token 2 to Audio]
    J --> K[Stream Audio Chunk 2]
    K --> L[User Hears Continuation]
    
    I --> M[LLM Generates Token 3...]
    M --> N[Continue Streaming]
    
    H -.-> O[Perceived Latency: ~500ms]
    B -.-> P[Actual Processing Time: 3000ms]
    
    style H fill:#d4f4dd
    style O fill:#d4f4dd
    style P fill:#ffe1e1

The key insight: Users perceive latency as time-to-first-sound, not time-to-complete-response.

  • Non-streaming: 3 seconds of silence → agent speaks
  • Streaming: 0.5 seconds → agent starts speaking (still processing rest)

Perceived latency drops 83% even though total processing time is the same.
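The arithmetic behind that figure:

const nonStreamingWait = 3000;  // ms of silence before any audio plays
const streamingWait = 500;      // ms until the first chunk plays
console.log((1 - streamingWait / nonStreamingWait) * 100);  // ≈ 83% drop in perceived latency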

Architecture: Streaming Voice Responses

Here’s how the OpenAI Realtime API handles streaming:

import { RealtimeClient } from '@openai/realtime-api-beta';

class StreamingVoiceAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    // Browser AudioContext assumed; the Realtime API streams 24kHz mono PCM16
    this.audioContext = new AudioContext({ sampleRate: 24000 });
    this.queryStartTime = null;      // Set when the user's query is sent
    this.firstAudioReceived = false; // Tracks time-to-first-sound per turn
  }

  async setup() {
    await this.client.connect();
    
    // Enable streaming by default
    await this.client.updateSession({
      instructions: 'You are a helpful assistant.',
      voice: 'alloy',
      modalities: ['text', 'audio'],  // Audio responses also carry a text transcript
      
      // Key setting: turn_detection manages when to start responding
      turn_detection: { 
        type: 'server_vad',  // Voice Activity Detection
        threshold: 0.5,      // Sensitivity
        prefix_padding_ms: 300,   // Include 300ms before speech
        silence_duration_ms: 500  // Wait 500ms of silence before responding
      }
    });

    // Audio delta events = streaming chunks
    this.client.on('response.audio.delta', (event) => {
      // The first chunk marks the moment the user starts hearing the agent
      if (!this.firstAudioReceived) {
        this.firstAudioReceived = true;
        console.log('Agent started speaking (streaming)');
        this.measureLatency('first_audio');
      }
      // event.delta = base64-encoded PCM16 audio chunk
      this.playAudioChunk(event.delta);
    });

    // Track when streaming completes
    this.client.on('response.audio.done', () => {
      console.log('Agent finished speaking');
      this.firstAudioReceived = false;  // Reset for the next turn
      this.measureLatency('complete_audio');
    });
  }

  playAudioChunk(base64Audio) {
    // Realtime audio is raw PCM16 with no container, so build the
    // AudioBuffer by hand instead of using decodeAudioData
    const raw = Buffer.from(base64Audio, 'base64');
    const samples = new Int16Array(raw.buffer, raw.byteOffset, raw.byteLength / 2);
    const floats = new Float32Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      floats[i] = samples[i] / 32768;  // Scale Int16 to [-1, 1]
    }

    const buffer = this.audioContext.createBuffer(1, floats.length, 24000);
    buffer.copyToChannel(floats, 0);

    // Play immediately (don't wait for the full response)
    const source = this.audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(this.audioContext.destination);
    source.start();
  }

  measureLatency(event) {
    const now = Date.now();
    if (event === 'first_audio') {
      console.log(`Time to first audio: ${now - this.queryStartTime}ms`);
    } else if (event === 'complete_audio') {
      console.log(`Total response time: ${now - this.queryStartTime}ms`);
    }
  }
}
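
A minimal usage sketch, assuming a browser context where AudioContext exists; in the beta client, sendUserMessageContent sends the text and kicks off a response, so audio arrives through the delta handler registered in setup():

const agent = new StreamingVoiceAgent();
await agent.setup();

// Send a text query; the streamed reply plays through playAudioChunk()
agent.queryStartTime = Date.now();
agent.client.sendUserMessageContent([
  { type: 'input_text', text: "What's the weather?" }
]);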

Non-Streaming vs Streaming: Real Comparison

Let’s measure actual latency differences:

Non-Streaming Implementation

class NonStreamingAgent {
  async getResponse(userQuestion) {
    const startTime = Date.now();
    
    // 1. Generate complete response
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [{ role: 'user', content: userQuestion }]
      })
    });

    const data = await response.json();
    const text = data.choices[0].message.content;
    
    console.log(`LLM response time: ${Date.now() - startTime}ms`);
    
    // 2. Convert entire text to speech
    const ttsStart = Date.now();
    const audioResponse = await fetch('https://api.openai.com/v1/audio/speech', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'tts-1',
        voice: 'alloy',
        input: text
      })
    });

    const audio = await audioResponse.arrayBuffer();
    
    console.log(`TTS time: ${Date.now() - ttsStart}ms`);
    console.log(`Total time to first sound: ${Date.now() - startTime}ms`);
    
    // 3. Play audio (user hears first sound NOW)
    return audio;
  }
}

// Example timing:
// LLM response time: 2400ms
// TTS time: 1200ms
// Total time to first sound: 3600ms  ← User waits 3.6 seconds

Streaming Implementation

class StreamingAgent {
  // Assumes this.client is an already-connected RealtimeClient
  async getResponse(userQuestion) {
    const startTime = Date.now();
    let firstAudioTime = null;

    // Register handlers before sending so no early chunks are missed
    // (in production, register these once, not per query)
    this.client.on('response.audio.delta', (event) => {
      if (!firstAudioTime) {
        firstAudioTime = Date.now();
        console.log(`Time to first sound: ${firstAudioTime - startTime}ms`);
      }

      // Play chunk immediately
      this.playAudioChunk(event.delta);
    });

    this.client.on('response.done', () => {
      console.log(`Total response time: ${Date.now() - startTime}ms`);
    });

    // Use Realtime API with streaming
    await this.client.sendUserMessageContent([{
      type: 'input_text',
      text: userQuestion
    }]);
  }
}

// Example timing:
// Time to first sound: 520ms  ← User hears voice in 0.5 seconds
// Total response time: 2800ms (processing continues while speaking)

Result:

  • Non-streaming: 3.6 second wait
  • Streaming: 0.52 second wait
  • Improvement: 85% faster perceived response

Real-World Implementation

Here’s production-ready streaming voice agent code:

import { RealtimeClient } from '@openai/realtime-api-beta';

class ProductionStreamingAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    this.audioContext = new AudioContext({ sampleRate: 24000 });  // Realtime output is 24kHz PCM16
    this.audioQueue = [];
    this.isPlaying = false;
    this.metrics = {
      queries: 0,
      avgTimeToFirstAudio: 0,
      avgTotalTime: 0
    };
  }

  async initialize() {
    await this.client.connect();
    
    await this.client.updateSession({
      instructions: `
You are a helpful customer service agent. 
Respond naturally and conversationally.
If you need to think, start speaking general context while you process specifics.
`,
      voice: 'alloy',
      modalities: ['text', 'audio'],
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 700  // Slightly longer for natural pauses
      }
    });

    // Set up streaming audio handling
    this.setupAudioStreaming();
    this.setupMetrics();
  }

  setupAudioStreaming() {
    // Queue audio chunks for smooth playback
    this.client.on('response.audio.delta', async (event) => {
      const audioChunk = this.decodeAudio(event.delta);
      this.audioQueue.push(audioChunk);
      
      // Start playing if not already playing
      if (!this.isPlaying) {
        this.playAudioQueue();
      }
    });

    this.client.on('response.audio.done', () => {
      // Mark end of audio stream
      this.audioQueue.push(null);  // Sentinel value
    });
  }

  async playAudioQueue() {
    this.isPlaying = true;
    
    while (this.audioQueue.length > 0) {
      const chunk = this.audioQueue.shift();
      
      // Null = end of stream
      if (chunk === null) {
        break;
      }
      
      // Play chunk
      await this.playChunk(chunk);
    }
    
    this.isPlaying = false;
  }

  async playChunk(audioData) {
    return new Promise((resolve) => {
      const source = this.audioContext.createBufferSource();
      source.buffer = audioData;
      source.connect(this.audioContext.destination);
      source.onended = resolve;
      source.start();
    });
  }

  setupMetrics() {
    let queryStartTime = null;
    let firstAudioTime = null;

    this.client.on('conversation.item.created', (event) => {
      if (event.item.role === 'user') {
        queryStartTime = Date.now();
        firstAudioTime = null;
      }
    });

    // No separate "audio started" event; the first audio delta marks it
    this.client.on('response.audio.delta', () => {
      if (queryStartTime && !firstAudioTime) {
        firstAudioTime = Date.now();
        const latency = firstAudioTime - queryStartTime;
        
        // Update metrics
        this.metrics.queries++;
        this.metrics.avgTimeToFirstAudio = 
          (this.metrics.avgTimeToFirstAudio * (this.metrics.queries - 1) + latency) / 
          this.metrics.queries;
        
        console.log(`Time to first audio: ${latency}ms`);
        console.log(`Average (all queries): ${this.metrics.avgTimeToFirstAudio.toFixed(0)}ms`);
      }
    });

    this.client.on('response.done', () => {
      if (queryStartTime) {
        const totalTime = Date.now() - queryStartTime;
        
        this.metrics.avgTotalTime = 
          (this.metrics.avgTotalTime * (this.metrics.queries - 1) + totalTime) / 
          this.metrics.queries;
        
        console.log(`Total response time: ${totalTime}ms`);
        console.log(`Average (all queries): ${this.metrics.avgTotalTime.toFixed(0)}ms`);
      }
    });
  }

  decodeAudio(base64) {
    // Raw PCM16 has no container, so build the AudioBuffer by hand
    // instead of using decodeAudioData (which expects WAV/MP3/etc.)
    const raw = Buffer.from(base64, 'base64');
    const samples = new Int16Array(raw.buffer, raw.byteOffset, raw.byteLength / 2);
    const floats = new Float32Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      floats[i] = samples[i] / 32768;  // Scale Int16 to [-1, 1]
    }
    const buffer = this.audioContext.createBuffer(1, floats.length, 24000);
    buffer.copyToChannel(floats, 0);
    return buffer;
  }

  getMetrics() {
    return {
      total_queries: this.metrics.queries,
      avg_time_to_first_audio_ms: Math.round(this.metrics.avgTimeToFirstAudio),
      avg_total_response_time_ms: Math.round(this.metrics.avgTotalTime),
      avg_processing_while_speaking_ms: Math.round(
        this.metrics.avgTotalTime - this.metrics.avgTimeToFirstAudio
      )
    };
  }
}

// Usage
const agent = new ProductionStreamingAgent();
await agent.initialize();

// After 100 queries:
console.log(agent.getMetrics());
// {
//   total_queries: 100,
//   avg_time_to_first_audio_ms: 580,
//   avg_total_response_time_ms: 2750,
//   avg_processing_while_speaking_ms: 2170
// }

Business Impact: Real Numbers

An insurance company tested streaming vs non-streaming for customer service:

Non-streaming voice agent:

  • Average time-to-first-audio: 3.2 seconds
  • User abandonment during silence: 22%
  • Calls completed: 78%
  • Customer satisfaction: 3.1/5

Streaming voice agent:

  • Average time-to-first-audio: 0.6 seconds
  • User abandonment during silence: 4%
  • Calls completed: 96%
  • Customer satisfaction: 4.3/5

Impact:

  • 81% reduction in abandonment
  • 18 percentage points more calls completed (78% → 96%)
  • 39% higher satisfaction

Revenue impact: With 50,000 calls/month and $25 average revenue per completed call:

  • Non-streaming: 50K × 78% = 39K completed × $25 = $975K/month
  • Streaming: 50K × 96% = 48K completed × $25 = $1.2M/month
  • Gain: $225K/month from streaming alone

When Streaming Matters Most

Critical for streaming             | Less critical
Customer service voice agents      | Pre-recorded voice messages
Real-time voice assistants         | Batch processing tasks
Interactive conversations          | One-way announcements
High-latency, complex questions    | Simple, fast responses (<1 sec)
Public-facing applications         | Internal tools

Streaming matters when humans wait for responses in real-time.

Common Streaming Pitfalls

Pitfall 1: Audio Buffering Too Aggressive

// ❌ Wrong: Hold playback until several chunks are buffered
if (audioQueue.length < 5) {
  return; // Waiting for more chunks adds seconds of silence
}

// ✅ Right: Play as soon as first chunk arrives
if (!isPlaying && audioQueue.length > 0) {
  playAudioQueue(); // Start immediately
}

Pitfall 2: Network Jitter Causes Gaps

// ❌ Wrong: Play each chunk exactly when received
audioChunk.play();  // Creates gaps if network delays

// ✅ Right: Use small buffer to smooth jitter
if (audioQueue.length < 2) {
  await wait(100);  // Tiny ~100ms buffer to smooth jitter (wait() is an assumed helper)
}
audioChunk.play();

Pitfall 3: Not Handling Backpressure

// ❌ Wrong: Queue grows unbounded if playback is slower
audioQueue.push(chunk);  // Could OOM if chunks arrive faster than playback

// ✅ Right: Drop chunks or slow down if queue too large
if (audioQueue.length > 50) {
  console.warn('Audio queue backed up, dropping oldest chunk');
  audioQueue.shift();  // Remove oldest
}
audioQueue.push(chunk);
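
All three fixes can live in one small playback queue. A sketch, assuming a playChunk(chunk) function like the one in the production agent above:

class SmoothAudioQueue {
  constructor(playChunk, { jitterBufferMs = 100, maxChunks = 50 } = {}) {
    this.playChunk = playChunk;        // async (chunk) => plays one AudioBuffer
    this.jitterBufferMs = jitterBufferMs;
    this.maxChunks = maxChunks;
    this.queue = [];
    this.isPlaying = false;
  }

  push(chunk) {
    // Backpressure: drop the oldest chunk if playback falls behind
    if (this.queue.length >= this.maxChunks) {
      console.warn('Audio queue backed up, dropping oldest chunk');
      this.queue.shift();
    }
    this.queue.push(chunk);
    if (!this.isPlaying) this.drain();  // Start playing as soon as chunks exist
  }

  async drain() {
    this.isPlaying = true;
    // Tiny jitter buffer: wait briefly so one slow packet doesn't cause a gap
    if (this.queue.length < 2) {
      await new Promise((resolve) => setTimeout(resolve, this.jitterBufferMs));
    }
    while (this.queue.length > 0) {
      await this.playChunk(this.queue.shift());
    }
    this.isPlaying = false;
  }
}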

Advanced: Dynamic Streaming Strategy

Adapt streaming based on response complexity:

class AdaptiveStreamingAgent {
  async getResponse(query) {
    // Estimate response complexity
    const complexity = await this.estimateComplexity(query);
    
    if (complexity === 'simple') {
      // Fast response coming, don't stream (avoid audio artifacts)
      return this.getNonStreamingResponse(query);
    } else {
      // Slow response, stream to reduce perceived latency
      return this.getStreamingResponse(query);
    }
  }

  async estimateComplexity(query) {
    // Quick check: Is this a simple fact or complex reasoning?
    const simplePatterns = [
      /what time/i,
      /what's the weather/i,
      /who is/i,
      /when is/i
    ];
    
    if (simplePatterns.some(pattern => pattern.test(query))) {
      return 'simple';  // Likely <1 second response
    }
    
    return 'complex';  // Likely 2+ seconds, benefit from streaming
  }
}
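
Hypothetical usage, assuming getNonStreamingResponse and getStreamingResponse wrap the two flows shown earlier:

const agent = new AdaptiveStreamingAgent();
await agent.getResponse("What's the weather?");           // Simple pattern: fast answer, no streaming
await agent.getResponse('Why did my premium increase?');  // Complex: stream to cut perceived latency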

Cost Considerations

Streaming doesn’t cost more per se—but it enables longer conversations:

  • Non-streaming: Users abandon due to latency
  • Streaming: Users stay engaged, have longer conversations

Example costs (OpenAI Realtime API):

  • Input audio: $0.06/minute
  • Output audio: $0.24/minute

Non-streaming scenario:

  • Average conversation: 2 minutes (short due to abandonment)
  • Cost: $0.12 + $0.48 = $0.60/conversation
  • Revenue per conversation: $8 (many abandoned early)

Streaming scenario:

  • Average conversation: 3.5 minutes (users stay engaged)
  • Cost: $0.21 + $0.84 = $1.05/conversation
  • Revenue per conversation: $15 (more completed)

Net result: Streaming costs 75% more per conversation but generates 88% more revenue. ROI is positive.
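
The same numbers as code, assuming (as the example above does) that both input and output audio are billed for the whole conversation length:

const costPerConversation = (minutes) => minutes * 0.06 + minutes * 0.24;

console.log(costPerConversation(2).toFixed(2));     // 0.60 (non-streaming, 2-minute average)
console.log(costPerConversation(3.5).toFixed(2));   // 1.05 (streaming, 3.5-minute average)
console.log(((1.05 / 0.60 - 1) * 100).toFixed(0));  // 75, i.e. % higher cost
console.log(((15 / 8 - 1) * 100).toFixed(0));       // 88, i.e. % higher revenue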

Implementation Timeline

Week 1: Enable streaming in Realtime API

  • Update session configuration
  • Add audio delta event handlers
  • Test with simple queries

Week 2: Optimize playback queue

  • Implement smooth audio queueing
  • Handle network jitter
  • Add backpressure handling

Week 3: Measure latency improvements

  • Track time-to-first-audio before/after
  • Monitor abandonment rates
  • A/B test streaming vs non-streaming

Week 4: Deploy and monitor

  • Roll out to production gradually (10% → 50% → 100%)
  • Watch for audio artifacts or gaps
  • Tune buffer sizes based on real usage

The Future: Predictive Streaming

Next generation: Start streaming before the user finishes speaking:

// Agent predicts user's question mid-sentence
// Starts generating response before user stops talking
// By the time user finishes, audio is already playing

// Time to first audio: ~0ms (feels instantaneous)

This requires:

  • Real-time intent detection
  • Speculative response generation
  • Rollback if prediction was wrong
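
None of this is shipping API surface yet. A sketch of what the speculative loop could look like; every handler and helper here (onPartialTranscript, detectIntent, startSpeculativeResponse, respondNormally) is hypothetical:

// Speculative streaming: begin generating while the user is still talking
let speculative = null;

onPartialTranscript(async (partial) => {
  const intent = await detectIntent(partial);       // Hypothetical real-time classifier
  if (intent.confidence > 0.9 && !speculative) {
    speculative = startSpeculativeResponse(intent); // Hypothetical: generate now, hold the audio
  }
});

onFinalTranscript(async (finalText) => {
  if (speculative && speculative.matches(finalText)) {
    speculative.playBufferedAudio();  // Prediction held: near-zero time to first sound
  } else {
    speculative?.cancel();            // Rollback: throw away the wrong guess
    respondNormally(finalText);       // Fall back to the normal streaming path
  }
  speculative = null;                 // Reset for the next turn
});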

OpenAI’s Realtime API is evolving toward this. The result: Voice conversations that feel as fast as human-to-human.

What’s Next

If you want voice agents with streaming responses, we can implement real-time audio streaming with OpenAI Realtime API. The result: No more awkward silence. Users hear responses immediately, even for complex questions. Conversations feel natural and responsive.
