Stream Voice Like You Stream Text

User asks a question. Three seconds of silence. Then the agent speaks.

That’s how 80% of voice agents work today. They wait. Generate the full response. Then play audio.

The result? Users think the system froze. They repeat themselves. They hang up.

The fix: Stream audio as it’s generated—just like you stream text in chat interfaces.

The Problem With Non-Streaming Voice

In text chat, non-streaming is tolerable:

User: "What's the weather?"
[2 seconds]
Agent: "It's 72°F and sunny in San Francisco."

Users see a typing indicator. They know the agent is working. Two seconds feels fine.

In voice, there’s no typing indicator. There’s just silence:

User: "What's the weather?"
[3 seconds of dead air]
Agent: "It's 72°F and sunny in San Francisco."

Three seconds feels like forever. Users assume:

  • The system didn’t hear them
  • The call dropped
  • Something broke

They repeat the question, talk over the agent, or hang up. 37% of users abandon after 3+ seconds of silence.

Why Voice Agents Have Latency

Speech-to-speech involves multiple stages:

1. Record audio chunk (100-300ms)
2. Send to server (50-200ms network)
3. Transcribe audio to text (200-500ms)
4. Generate response with LLM (1000-3000ms)  ← Longest step
5. Convert text to speech (500-1500ms)
6. Send audio back (50-200ms network)
7. Play audio (duration of audio)

Total latency: 2-6 seconds before first audio plays.

The LLM generation step dominates. For complex questions, it can take 5+ seconds.
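
To make the budget concrete, here is a rough back-of-the-envelope sketch. The per-stage numbers are illustrative midpoints of the ranges above, not measurements:

// Illustrative per-stage latencies (midpoints of the ranges above, in ms)
const stages = {
  record: 200,       // record audio chunk
  upload: 125,       // send to server
  transcribe: 350,   // speech-to-text
  generate: 2000,    // LLM generates the full response (dominant step)
  tts: 1000,         // text-to-speech for the whole response
  download: 125      // send audio back
};

// Non-streaming: the user waits for every stage before hearing anything
const nonStreamingWait = Object.values(stages).reduce((sum, ms) => sum + ms, 0);
console.log(`Non-streaming time to first sound: ~${nonStreamingWait}ms`);  // ~3800ms

// Streaming: only the stages before the *first* audio chunk sit on the critical
// path; the rest of generation and TTS overlap with playback.
// (The first-token and first-phrase figures below are assumptions.)
const streamingWait = stages.record + stages.upload + stages.transcribe
  + 150   // first few LLM tokens
  + 100   // TTS for the first short phrase
  + stages.download;
console.log(`Streaming time to first sound: ~${streamingWait}ms`);  // ~1050ms

An integrated speech-to-speech model (the approach the Realtime API takes below) collapses the transcribe, generate, and TTS hops further, which is how the ~500-600ms time-to-first-sound figures later in this post become reachable.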

Streaming: Start Speaking While Processing

Streaming audio works like streaming text—send chunks as they’re generated:

graph TB
    A[User Speaks] --> B[Audio Recording Complete]
    B --> C[Transcription Starts]
    C --> D[Text Available]
    D --> E[LLM Generates Token 1]
    E --> F[Convert Token 1 to Audio]
    F --> G[Stream Audio Chunk 1]
    G --> H[User Hears First Sound]
    
    E --> I[LLM Generates Token 2]
    I --> J[Convert Token 2 to Audio]
    J --> K[Stream Audio Chunk 2]
    K --> L[User Hears Continuation]
    
    I --> M[LLM Generates Token 3...]
    M --> N[Continue Streaming]
    
    H -.-> O[Perceived Latency: ~500ms]
    B -.-> P[Actual Processing Time: 3000ms]
    
    style H fill:#d4f4dd
    style O fill:#d4f4dd
    style P fill:#ffe1e1

The key insight: Users perceive latency as time-to-first-sound, not time-to-complete-response.

  • Non-streaming: 3 seconds of silence → agent speaks
  • Streaming: 0.5 seconds → agent starts speaking (still processing rest)

Perceived latency drops 83% even though total processing time is the same.

Architecture: Streaming Voice Responses

Here’s a sketch of how streaming looks with the OpenAI Realtime API:

import { RealtimeClient } from '@openai/realtime-api-beta';

class StreamingVoiceAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    this.audioContext = new AudioContext();  // Web Audio output for playback
    this.queryStartTime = null;              // set when the user's query is sent
  }

  async setup() {
    await this.client.connect();
    
    // Enable streaming by default
    await this.client.updateSession({
      instructions: 'You are a helpful assistant.',
      voice: 'alloy',
      modalities: ['audio', 'text'],
      
      // Key setting: turn_detection manages when to start responding
      turn_detection: { 
        type: 'server_vad',  // Voice Activity Detection
        threshold: 0.5,      // Sensitivity
        prefix_padding_ms: 300,   // Include 300ms before speech
        silence_duration_ms: 500  // Wait 500ms of silence before responding
      }
    });

    // Audio delta events = streaming chunks.
    // (There is no separate "audio started" event; the first delta of a
    // response is the moment the agent starts speaking.)
    let awaitingFirstChunk = true;
    this.client.on('response.audio.delta', (event) => {
      if (awaitingFirstChunk) {
        awaitingFirstChunk = false;
        console.log('Agent started speaking (streaming)');
        this.measureLatency('first_audio');
      }
      // event.delta = base64-encoded audio chunk
      this.playAudioChunk(event.delta);
    });

    // Track when streaming completes
    this.client.on('response.audio.done', () => {
      awaitingFirstChunk = true;  // reset for the next response
      console.log('Agent finished speaking');
      this.measureLatency('complete_audio');
    });
  }

  playAudioChunk(base64Audio) {
    // Realtime API audio deltas are raw 16-bit PCM (24kHz mono), not a
    // container format, so build an AudioBuffer directly rather than
    // calling decodeAudioData.
    const bytes = Buffer.from(base64Audio, 'base64');
    const sampleCount = Math.floor(bytes.length / 2);

    const buffer = this.audioContext.createBuffer(1, sampleCount, 24000);
    const channel = buffer.getChannelData(0);
    for (let i = 0; i < sampleCount; i++) {
      channel[i] = bytes.readInt16LE(i * 2) / 32768;  // PCM16 -> float in [-1, 1]
    }

    // Play immediately (don't wait for the full response)
    const source = this.audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(this.audioContext.destination);
    source.start();
  }

  measureLatency(event) {
    const now = Date.now();
    if (event === 'first_audio') {
      console.log(`Time to first audio: ${now - this.queryStartTime}ms`);
    } else if (event === 'complete_audio') {
      console.log(`Total response time: ${now - this.queryStartTime}ms`);
    }
  }
}

Non-Streaming vs Streaming: Real Comparison

Let’s measure actual latency differences:

Non-Streaming Implementation

class NonStreamingAgent {
  async getResponse(userQuestion) {
    const startTime = Date.now();
    
    // 1. Generate complete response
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [{ role: 'user', content: userQuestion }]
      })
    });

    const data = await response.json();
    const text = data.choices[0].message.content;
    
    console.log(`LLM response time: ${Date.now() - startTime}ms`);
    
    // 2. Convert entire text to speech
    const ttsStart = Date.now();
    const audioResponse = await fetch('https://api.openai.com/v1/audio/speech', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'tts-1',
        voice: 'alloy',
        input: text
      })
    });

    const audio = await audioResponse.arrayBuffer();
    
    console.log(`TTS time: ${Date.now() - ttsStart}ms`);
    console.log(`Total time to first sound: ${Date.now() - startTime}ms`);
    
    // 3. Play audio (user hears first sound NOW)
    return audio;
  }
}

// Example timing:
// LLM response time: 2400ms
// TTS time: 1200ms
// Total time to first sound: 3600ms  ← User waits 3.6 seconds

Streaming Implementation

class StreamingAgent {
  async getResponse(userQuestion) {
    const startTime = Date.now();
    let firstAudioTime = null;

    // Register handlers before sending so the earliest chunks aren't missed
    this.client.on('response.audio.delta', (event) => {
      if (!firstAudioTime) {
        firstAudioTime = Date.now();
        console.log(`Time to first sound: ${firstAudioTime - startTime}ms`);
      }

      // Play chunk immediately
      this.playAudioChunk(event.delta);
    });

    this.client.on('response.done', () => {
      console.log(`Total response time: ${Date.now() - startTime}ms`);
    });

    // Use Realtime API with streaming
    await this.client.sendUserMessageContent([{
      type: 'input_text',
      text: userQuestion
    }]);
  }
}

// Example timing:
// Time to first sound: 520ms  ← User hears voice in 0.5 seconds
// Total response time: 2800ms (processing continues while speaking)

Result:

  • Non-streaming: 3.6 second wait
  • Streaming: 0.52 second wait
  • Improvement: 85% faster perceived response

Real-World Implementation

Here’s a fuller streaming voice agent, with audio queueing and latency metrics:

import { RealtimeClient } from '@openai/realtime-api-beta';

class ProductionStreamingAgent {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
    this.audioContext = new AudioContext();  // Web Audio output for playback
    this.audioQueue = [];
    this.isPlaying = false;
    this.metrics = {
      queries: 0,
      avgTimeToFirstAudio: 0,
      avgTotalTime: 0
    };
  }

  async initialize() {
    await this.client.connect();
    
    await this.client.updateSession({
      instructions: `
You are a helpful customer service agent. 
Respond naturally and conversationally.
If you need to think, start speaking general context while you process specifics.
`,
      voice: 'alloy',
      modalities: ['audio', 'text'],
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 700  // Slightly longer for natural pauses
      }
    });

    // Set up streaming audio handling
    this.setupAudioStreaming();
    this.setupMetrics();
  }

  setupAudioStreaming() {
    // Queue audio chunks for smooth playback
    this.client.on('response.audio.delta', async (event) => {
      const audioChunk = this.decodeAudio(event.delta);
      this.audioQueue.push(audioChunk);
      
      // Start playing if not already playing
      if (!this.isPlaying) {
        this.playAudioQueue();
      }
    });

    this.client.on('response.audio.done', () => {
      // Mark end of audio stream
      this.audioQueue.push(null);  // Sentinel value
    });
  }

  async playAudioQueue() {
    this.isPlaying = true;
    
    while (this.audioQueue.length > 0) {
      const chunk = this.audioQueue.shift();
      
      // Null = end of stream
      if (chunk === null) {
        break;
      }
      
      // Play chunk
      await this.playChunk(chunk);
    }
    
    this.isPlaying = false;
  }

  async playChunk(audioData) {
    return new Promise((resolve) => {
      const source = this.audioContext.createBufferSource();
      source.buffer = audioData;
      source.connect(this.audioContext.destination);
      source.onended = resolve;
      source.start();
    });
  }

  setupMetrics() {
    let queryStartTime = null;
    let firstAudioTime = null;

    this.client.on('conversation.item.created', (event) => {
      if (event.item.role === 'user') {
        queryStartTime = Date.now();
        firstAudioTime = null;
      }
    });

    // The first audio delta of a response marks the moment the agent starts
    // speaking (there is no separate "audio started" event)
    this.client.on('response.audio.delta', () => {
      if (queryStartTime && !firstAudioTime) {
        firstAudioTime = Date.now();
        const latency = firstAudioTime - queryStartTime;
        
        // Update metrics
        this.metrics.queries++;
        this.metrics.avgTimeToFirstAudio = 
          (this.metrics.avgTimeToFirstAudio * (this.metrics.queries - 1) + latency) / 
          this.metrics.queries;
        
        console.log(`Time to first audio: ${latency}ms`);
        console.log(`Average (all queries): ${this.metrics.avgTimeToFirstAudio.toFixed(0)}ms`);
      }
    });

    this.client.on('response.done', () => {
      if (queryStartTime) {
        const totalTime = Date.now() - queryStartTime;
        
        this.metrics.avgTotalTime = 
          (this.metrics.avgTotalTime * (this.metrics.queries - 1) + totalTime) / 
          this.metrics.queries;
        
        console.log(`Total response time: ${totalTime}ms`);
        console.log(`Average (all queries): ${this.metrics.avgTotalTime.toFixed(0)}ms`);
      }
    });
  }

  decodeAudio(base64) {
    // Raw PCM16 (24kHz mono) -> AudioBuffer (decodeAudioData can't parse headerless PCM)
    const bytes = Buffer.from(base64, 'base64');
    const buffer = this.audioContext.createBuffer(1, Math.floor(bytes.length / 2), 24000);
    const channel = buffer.getChannelData(0);
    for (let i = 0; i < channel.length; i++) {
      channel[i] = bytes.readInt16LE(i * 2) / 32768;  // PCM16 -> float in [-1, 1]
    }
    return buffer;
  }

  getMetrics() {
    return {
      total_queries: this.metrics.queries,
      avg_time_to_first_audio_ms: Math.round(this.metrics.avgTimeToFirstAudio),
      avg_total_response_time_ms: Math.round(this.metrics.avgTotalTime),
      avg_processing_while_speaking_ms: Math.round(
        this.metrics.avgTotalTime - this.metrics.avgTimeToFirstAudio
      )
    };
  }
}

// Usage
const agent = new ProductionStreamingAgent();
await agent.initialize();

// After 100 queries:
console.log(agent.getMetrics());
// {
//   total_queries: 100,
//   avg_time_to_first_audio_ms: 580,
//   avg_total_response_time_ms: 2750,
//   avg_processing_while_speaking_ms: 2170
// }

Business Impact: Real Numbers

An insurance company tested streaming vs non-streaming for customer service:

Non-streaming voice agent:

  • Average time-to-first-audio: 3.2 seconds
  • User abandonment during silence: 22%
  • Calls completed: 78%
  • Customer satisfaction: 3.1/5

Streaming voice agent:

  • Average time-to-first-audio: 0.6 seconds
  • User abandonment during silence: 4%
  • Calls completed: 96%
  • Customer satisfaction: 4.3/5

Impact:

  • 81% reduction in abandonment
  • 18 percentage points more calls completed (78% → 96%)
  • 39% higher satisfaction

Revenue impact: With 50,000 calls/month and $25 average revenue per completed call:

  • Non-streaming: 50K × 78% = 39K completed × $25 = $975K/month
  • Streaming: 50K × 96% = 48K completed × $25 = $1.2M/month
  • Gain: $225K/month from streaming alone
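
If you want to plug in your own call volume, completion rates, and per-call revenue, here is the same math as a small sketch (the figures below are the ones from this example):

// Revenue impact of completion-rate changes (example numbers from above)
function monthlyRevenue(calls, completionRate, revenuePerCall) {
  return calls * completionRate * revenuePerCall;
}

const nonStreaming = monthlyRevenue(50_000, 0.78, 25);  // $975,000
const streaming = monthlyRevenue(50_000, 0.96, 25);     // $1,200,000
console.log(`Gain from streaming: $${(streaming - nonStreaming).toLocaleString()}/month`);  // $225,000/month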

When Streaming Matters Most

Critical for streaming              Less critical
Customer service voice agents       Pre-recorded voice messages
Real-time voice assistants          Batch processing tasks
Interactive conversations           One-way announcements
High-latency questions (complex)    Simple, fast responses (<1 sec)
Public-facing applications          Internal tools

Streaming matters when humans wait for responses in real-time.
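
If you want that table as a default policy in code, a minimal sketch might look like this (the flags and the 1-second threshold are assumptions, not an official taxonomy):

// Rough default: stream whenever a human is waiting in real time,
// or whenever responses routinely take longer than ~1 second to produce.
function shouldStream({ interactive, publicFacing, expectedResponseMs }) {
  if (!interactive) return false;  // announcements, batch jobs, pre-recorded messages
  if (expectedResponseMs < 1000 && !publicFacing) return false;  // fast internal tools can skip it
  return true;
}

shouldStream({ interactive: true, publicFacing: true, expectedResponseMs: 2500 });   // true
shouldStream({ interactive: false, publicFacing: false, expectedResponseMs: 300 });  // false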

Common Streaming Pitfalls

Pitfall 1: Audio Buffering Too Aggressive

// ❌ Wrong: Buffer 5 seconds before playing
if (audioQueue.length < 5) {
  return; // Wait for more chunks
}

// ✅ Right: Play as soon as first chunk arrives
if (!isPlaying && audioQueue.length > 0) {
  playAudioQueue(); // Start immediately
}

Pitfall 2: Network Jitter Causes Gaps

// ❌ Wrong: Play each chunk exactly when received
audioChunk.play();  // Creates gaps if network delays

// ✅ Right: Use small buffer to smooth jitter
if (audioQueue.length < 2) {
  await wait(100);  // tiny ~100ms buffer to smooth jitter
}
audioChunk.play();

Pitfall 3: Not Handling Backpressure

// ❌ Wrong: Queue grows unbounded if playback is slower
audioQueue.push(chunk);  // Could OOM if chunks arrive faster than playback

// ✅ Right: Drop chunks or slow down if queue too large
if (audioQueue.length > 50) {
  console.warn('Audio queue backed up, dropping oldest chunk');
  audioQueue.shift();  // Remove oldest
}
audioQueue.push(chunk);

Advanced: Dynamic Streaming Strategy

Adapt streaming based on response complexity:

class AdaptiveStreamingAgent {
  async getResponse(query) {
    // Estimate response complexity
    const complexity = await this.estimateComplexity(query);
    
    if (complexity === 'simple') {
      // Fast response coming, don't stream (avoid audio artifacts)
      return this.getNonStreamingResponse(query);
    } else {
      // Slow response, stream to reduce perceived latency
      return this.getStreamingResponse(query);
    }
  }

  async estimateComplexity(query) {
    // Quick check: Is this a simple fact or complex reasoning?
    const simplePatterns = [
      /what time/i,
      /what's the weather/i,
      /who is/i,
      /when is/i
    ];
    
    if (simplePatterns.some(pattern => pattern.test(query))) {
      return 'simple';  // Likely <1 second response
    }
    
    return 'complex';  // Likely 2+ seconds, benefit from streaming
  }
}

Cost Considerations

Streaming doesn’t cost more per se—but it enables longer conversations:

  • Non-streaming: Users abandon due to latency
  • Streaming: Users stay engaged, have longer conversations

Example costs (OpenAI Realtime API):

  • Input audio: $0.06/minute
  • Output audio: $0.24/minute

Non-streaming scenario:

  • Average conversation: 2 minutes (short due to abandonment)
  • Cost: $0.12 + $0.48 = $0.60/conversation
  • Revenue per conversation: $8 (many abandoned early)

Streaming scenario:

  • Average conversation: 3.5 minutes (users stay engaged)
  • Cost: $0.21 + $0.84 = $1.05/conversation
  • Revenue per conversation: $15 (more completed)

Net result: Streaming costs 75% more per conversation but generates 88% more revenue. ROI is positive.
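
Here is that math as a sketch, using the Realtime API per-minute prices and the example conversation lengths above (it keeps the same simplifying assumption that input and output audio both run for the full conversation):

// Cost vs revenue per conversation (prices and durations from the example above)
const PRICE_PER_MIN = { input: 0.06, output: 0.24 };

function conversationEconomics(minutes, revenue) {
  const cost = minutes * (PRICE_PER_MIN.input + PRICE_PER_MIN.output);
  return { cost, revenue, margin: revenue - cost };
}

const nonStreaming = conversationEconomics(2.0, 8);   // cost $0.60, margin $7.40
const streaming = conversationEconomics(3.5, 15);     // cost $1.05, margin $13.95
console.log(`Extra margin per conversation: $${(streaming.margin - nonStreaming.margin).toFixed(2)}`);  // $6.55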

Implementation Timeline

Week 1: Enable streaming in Realtime API

  • Update session configuration
  • Add audio delta event handlers
  • Test with simple queries

Week 2: Optimize playback queue

  • Implement smooth audio queueing
  • Handle network jitter
  • Add backpressure handling

Week 3: Measure latency improvements

  • Track time-to-first-audio before/after
  • Monitor abandonment rates
  • A/B test streaming vs non-streaming

Week 4: Deploy and monitor

  • Roll out to production gradually (10% → 50% → 100%; see the rollout-gate sketch after this list)
  • Watch for audio artifacts or gaps
  • Tune buffer sizes based on real usage
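
For Week 4's gradual rollout, a minimal percentage gate might look like the sketch below. The hashing scheme and the rollout constant are assumptions, not a specific feature-flag product:

import { createHash } from 'crypto';

// Deterministically bucket each caller so they get a consistent experience
// while the rollout percentage ramps 10% -> 50% -> 100%.
function inStreamingRollout(callerId, rolloutPercent) {
  const hash = createHash('sha256').update(callerId).digest();
  const bucket = hash.readUInt32BE(0) % 100;  // 0-99
  return bucket < rolloutPercent;
}

const ROLLOUT_PERCENT = 10;  // bump to 50, then 100, as latency and abandonment metrics hold up

// At call setup (illustrative):
//   const useStreaming = inStreamingRollout(callerId, ROLLOUT_PERCENT);
//   useStreaming ? startStreamingAgent() : startNonStreamingAgent();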

The Future: Predictive Streaming

Next generation: Start streaming before the user finishes speaking:

// Agent predicts user's question mid-sentence
// Starts generating response before user stops talking
// By the time user finishes, audio is already playing

// Time to first audio: ~0ms (feels instantaneous)

This requires three pieces, sketched below:

  • Real-time intent detection
  • Speculative response generation
  • Rollback if prediction was wrong
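
As a thought experiment, a heavily simplified sketch of the speculate-then-rollback loop might look like this. The detectIntent, startResponse, and cancelResponse helpers are hypothetical placeholders for the three pieces above:

// Hypothetical sketch: start generating while the user is still talking, then
// keep or discard the speculative response once the final transcript arrives.
async function speculativeRespond(
  partialTranscript,
  finalTranscriptPromise,
  { detectIntent, startResponse, cancelResponse }  // caller-supplied placeholders
) {
  const predictedIntent = await detectIntent(partialTranscript);  // real-time intent detection
  const speculative = startResponse(predictedIntent);             // speculative generation (cancellable)

  const finalTranscript = await finalTranscriptPromise;           // user finishes speaking
  const actualIntent = await detectIntent(finalTranscript);

  if (actualIntent === predictedIntent) {
    return speculative;               // audio is already streaming: near-zero perceived latency
  }
  await cancelResponse(speculative);  // rollback: the prediction was wrong
  return startResponse(actualIntent); // fall back to the normal streaming path
}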

OpenAI’s Realtime API is evolving toward this. The result: Voice conversations that feel as fast as human-to-human.

What’s Next

If you want voice agents with streaming responses, we can implement real-time audio streaming with OpenAI Realtime API. The result: No more awkward silence. Users hear responses immediately, even for complex questions. Conversations feel natural and responsive.
