Voice Agents That Don't Break When You Interrupt
Natural conversations don’t follow strict turn-taking rules. People interrupt. They finish each other’s sentences. They talk over each other.
Most voice systems fail at this completely. Try interrupting Alexa mid-sentence, or talking while a phone tree is reading out options: the system either ignores you or derails the whole interaction.
Real-time speech-to-speech agents can handle barge-in gracefully—without losing context.
The Turn-Taking Problem
Traditional voice systems operate like walkie-talkies: only one person can talk at a time.
Here’s what happens in a typical interaction:
- System starts speaking
- Audio output blocks audio input
- User tries to interrupt → nothing happens
- User waits for system to finish
- System finally stops → “I didn’t catch that, please repeat”
This isn’t how humans talk. When your colleague is explaining something and you suddenly understand, you jump in with “Oh! So it’s like…” They stop, acknowledge, and adjust.
Voice agents need to do the same.
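The failure sequence above comes down to a single gate: while the speaker is active, microphone input is discarded. Here is a minimal sketch of that half-duplex behavior (the frame shape is illustrative, not from any real audio API):

```javascript
// Half-duplex gate: microphone frames that arrive while the system is
// speaking are silently dropped, which is why interrupting "does nothing".
function halfDuplexGate(frames) {
  const heard = [];
  for (const frame of frames) {
    if (frame.systemSpeaking) continue; // input blocked during output
    heard.push(frame.text);
  }
  return heard;
}
```

Anything the user says while `systemSpeaking` is true never reaches the recognizer, so the only possible reply afterwards is "I didn't catch that."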
How Speech-To-Speech Handles Interruptions
OpenAI’s Realtime API uses full-duplex audio—both parties can speak simultaneously.
When a user interrupts:
- System detects overlapping speech instantly
- Stops its own output mid-sentence
- Captures the user’s input
- Decides whether to respond or continue
```mermaid
graph TD
  A[Agent speaking] --> B{User starts talking}
  B --> C[Detect barge-in]
  C --> D[Stop agent audio]
  D --> E[Capture user input]
  E --> F{Interruption type?}
  F -->|Question| G[Answer immediately]
  F -->|Correction| H[Acknowledge + adjust]
  F -->|Impatience| I[Summarize + move on]
  G --> J[Resume or redirect]
  H --> J
  I --> J
```
Real Implementation: Barge-In Detection
Here’s how to build interruption handling:
```javascript
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-realtime',
});

let agentSpeaking = false;
let lastUtteranceContext = '';
let userBuffer = '';

// When the agent starts speaking
client.on('response.audio_transcript.delta', (event) => {
  agentSpeaking = true;
  lastUtteranceContext += event.delta; // Track what the agent was saying
});

// When the agent finishes
client.on('response.audio_transcript.done', () => {
  agentSpeaking = false;
  lastUtteranceContext = ''; // Clear context
});

// Detect a user interruption. Transcription deltas arrive in small
// chunks, so accumulate them instead of testing a single delta's length.
client.on('conversation.item.input_audio_transcription.delta', (event) => {
  userBuffer += event.delta;
  if (agentSpeaking && userBuffer.length > 10) {
    // User is talking while the agent is speaking = interruption
    handleInterruption(userBuffer, lastUtteranceContext);
    userBuffer = '';
  }
});
```
```javascript
async function handleInterruption(userInput, agentContext) {
  // Stop the current agent output
  await client.cancelResponse();

  // Analyze the interruption type
  const interruptionType = classifyInterruption(userInput, agentContext);

  switch (interruptionType) {
    case 'question':
      // User has a clarifying question
      await client.sendText({
        text: userInput,
        instructions: "Answer this question, then ask if they want me to continue where I left off."
      });
      break;

    case 'correction':
      // User is correcting information
      await client.sendText({
        text: "Got it. " + userInput,
        instructions: "Acknowledge the correction and incorporate it into the response."
      });
      break;

    case 'impatience':
      // User wants to skip ahead
      await client.sendText({
        text: "I'll get to the point.",
        instructions: "Summarize the key information quickly."
      });
      break;

    case 'agreement':
      // User is signaling understanding
      await client.sendText({
        text: "Great, moving on.",
        instructions: "Acknowledge and continue to next topic."
      });
      break;
  }
}
```
```javascript
function classifyInterruption(userText, agentContext) {
  const input = userText.toLowerCase();
  // Match whole words so short keywords like "no" don't match inside
  // words like "know" or "note"
  const has = (...phrases) =>
    phrases.some((p) => new RegExp(`\\b${p}\\b`).test(input));

  // Question indicators
  if (has('wait', 'what', 'why', 'how')) {
    return 'question';
  }
  // Correction indicators
  if (has('actually', 'no', "that's wrong")) {
    return 'correction';
  }
  // Impatience indicators
  if (has('skip', 'get to the point', 'bottom line')) {
    return 'impatience';
  }
  // Agreement indicators
  if (has('got it', 'yeah', 'okay')) {
    return 'agreement';
  }
  return 'question'; // Default to question
}
```
Context Recovery: The Critical Piece
Handling the interruption is only half the battle. You also need to recover context.
Bad approach:
User interrupts → Agent answers → Agent starts over from the beginning
Good approach:
User interrupts → Agent answers → Agent asks “Should I continue where I left off, or is that enough?”
```javascript
async function recoverContext(lastContext) {
  await client.sendText({
    text: "Should I continue explaining, or do you have everything you need?",
    instructions: `You were previously saying: "${lastContext}". If user wants more, continue from that point. If user is satisfied, move on.`
  });
}
```
Business Impact: Conversation Efficiency
A healthcare provider implemented barge-in handling for appointment scheduling:
Before (rigid turn-taking):
- Average call duration: 3.2 minutes
- 42% of users interrupted but system didn’t respond
- 28% repeat rate (“I already said that…”)
After (with interruption handling):
- Average call duration: 2.1 minutes (34% faster)
- 89% successful interruption recognition
- 12% repeat rate (57% reduction)
Why it worked: Users could correct mistakes immediately, skip unnecessary information, and ask questions without waiting. Conversations felt natural.
The Subtlety of Acknowledgment
When you interrupt someone, they usually acknowledge with a micro-response:
- “Go ahead”
- “Yeah?”
- “Mm-hmm”
Voice agents should do the same. Even a brief “Hold on” or “Let me answer that” signals that the interruption was heard.
Without acknowledgment:
Agent: “First, you’ll need to—”
User: “Wait, what about—”
Agent: [continues] “—verify your email address…”
User: [frustrated] “Are you even listening?”
With acknowledgment:
Agent: “First, you’ll need to—”
User: “Wait, what about—”
Agent: [stops] “Go ahead”
User: “What about my password?”
Agent: “Good question. You can reset that after…”
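One way to wire this in is a small lookup from the interruption type (as returned by `classifyInterruption`) to a micro-acknowledgment, spoken immediately after cancelling the agent's audio. The phrase table below is illustrative; pick wording that fits your agent's voice:

```javascript
// Micro-acknowledgments by interruption type, spoken right after the
// agent's audio is cancelled so the user knows they were heard.
const ACK_PHRASES = {
  question: 'Go ahead',
  correction: 'Oh, let me fix that',
  impatience: 'Sure, the short version',
  agreement: 'Great',
};

function acknowledgment(interruptionType) {
  return ACK_PHRASES[interruptionType] ?? 'Go ahead'; // safe default
}
```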
Implementation Checklist
Want to add interruption handling? Here’s what you need:
Technical:
- Full-duplex audio (simultaneous input/output)
- Barge-in detection (user speaking while agent speaks)
- Response cancellation (stop agent mid-sentence)
- Context preservation (remember what agent was saying)
- Interruption classification (question, correction, impatience)
Design:
- Acknowledgment phrases (“Go ahead”, “Hold on”)
- Context recovery prompts (“Should I continue?”)
- Skip-ahead patterns (summarize on demand)
- Correction flows (acknowledge + adjust)
Testing:
- Test with impatient users
- Test with clarifying questions mid-explanation
- Test with simultaneous speech (user and agent overlap)
- Measure context preservation accuracy
Edge Cases To Handle
1. False Positive Barge-Ins
Background voices, coughs, or “uh-huh” sounds might trigger interruption detection. Use voice activity detection (VAD) thresholds to filter out short sounds.
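One simple guard, sketched below with illustrative thresholds: require sustained voiced audio before declaring a barge-in, so a cough or a short backchannel never reaches the trigger.

```javascript
// Treat audio as a barge-in only after ~300 ms of sustained voiced audio
// (30 consecutive 10 ms frames). Short sounds like coughs or "uh-huh"
// never reach the threshold. Both constants need tuning per deployment.
const ENERGY_THRESHOLD = 0.02; // RMS energy; depends on microphone gain
const MIN_VOICED_FRAMES = 30;  // ~300 ms at 10 ms frames

function createBargeInDetector() {
  let voicedRun = 0;
  return function onFrame(rmsEnergy) {
    voicedRun = rmsEnergy > ENERGY_THRESHOLD ? voicedRun + 1 : 0;
    return voicedRun >= MIN_VOICED_FRAMES; // true => treat as interruption
  };
}
```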
2. Partial Overlaps
Sometimes users start talking just as the agent finishes. Don’t treat this as interruption—it’s natural turn-taking.
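A grace window around the agent's end-of-speech handles this: speech that starts within the window counts as a new turn, not a barge-in. The 500 ms value below is a starting point, not a recommendation:

```javascript
// If the user starts talking within a short grace window of the agent
// finishing, treat it as normal turn-taking rather than a barge-in.
const GRACE_MS = 500; // illustrative; tune against real conversations

function isBargeIn(userSpeechStartMs, agentSpeechEndMs) {
  // Agent still talking (no end time yet) => genuine interruption
  if (agentSpeechEndMs === null) return true;
  // Speech that begins near the agent's last word is a new turn
  return userSpeechStartMs < agentSpeechEndMs - GRACE_MS;
}
```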
3. Multiple Interruptions
If a user interrupts multiple times quickly, they might be very confused. After 2-3 interruptions in 30 seconds, offer a different explanation style or human handoff.
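A sliding-window counter makes the escalation rule concrete, using the 30-second window and three-interruption threshold from the guideline above:

```javascript
// Track interruption timestamps in a sliding window; after the third
// interruption within 30 seconds, escalate (different explanation style
// or human handoff).
const WINDOW_MS = 30_000;
const ESCALATE_AFTER = 3;

function createInterruptionTracker() {
  const timestamps = [];
  return function recordInterruption(nowMs) {
    timestamps.push(nowMs);
    // Drop interruptions that have fallen out of the window
    while (timestamps.length && nowMs - timestamps[0] > WINDOW_MS) {
      timestamps.shift();
    }
    return timestamps.length >= ESCALATE_AFTER; // true => escalate
  };
}
```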
4. Silent Interruptions
User might be thinking/processing while agent is speaking. Don’t require verbal interruption—add “Does this make sense?” checkpoints every 20-30 seconds.
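A minimal checkpoint timer, assuming the caller tracks when the last checkpoint (or user utterance) occurred:

```javascript
// Offer a "Does this make sense?" checkpoint after ~25 seconds of
// uninterrupted agent speech (within the 20-30 second range above).
const CHECKPOINT_INTERVAL_MS = 25_000;

function shouldCheckpoint(lastCheckpointMs, nowMs, agentSpeaking) {
  return agentSpeaking && nowMs - lastCheckpointMs >= CHECKPOINT_INTERVAL_MS;
}
```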
The Natural Conversation Feel
Here’s what makes barge-in handling successful: it removes the “I’m talking to a robot” friction.
When you can interrupt naturally, ask questions mid-explanation, and correct mistakes immediately, the system starts to feel responsive.
This doesn’t mean chaos. The agent still guides the conversation. But it allows for the natural back-and-forth that makes spoken communication effective.
Want to build this? Check out OpenAI’s Realtime API documentation for full-duplex audio patterns and interruption handling examples.
Ready to add barge-in support? Start with simple interruption detection. Add acknowledgment responses. Build context recovery. Test with real users who interrupt naturally.
The goal isn’t perfect interruption handling—it’s making users feel heard when they need to jump in.