Voice Agents That Don't Break When You Interrupt
Natural conversations don’t follow strict turn-taking rules. People interrupt. They finish each other’s sentences. They talk over each other.
Most voice systems fail at this completely. Try interrupting Alexa mid-sentence, or talking while a phone tree is reading out options: the system either ignores you or derails the whole interaction.
Real-time speech-to-speech agents can handle barge-in gracefully—without losing context.
The Turn-Taking Problem
Traditional voice systems operate like walkie-talkies: only one person can talk at a time.
Here’s what happens in a typical interaction:
- System starts speaking
- Audio output blocks audio input
- User tries to interrupt → nothing happens
- User waits for system to finish
- System finally stops → “I didn’t catch that, please repeat”
This isn’t how humans talk. When your colleague is explaining something and you suddenly understand, you jump in with “Oh! So it’s like…” They stop, acknowledge, and adjust.
Voice agents need to do the same.
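The failure sequence above comes down to a single gate: while the speaker is active, microphone input is discarded. Here is a minimal sketch of that half-duplex behavior (the frame shape is illustrative, not from any real audio API):

```javascript
// Half-duplex gate: microphone frames that arrive while the system is
// speaking are silently dropped, which is why interrupting "does nothing".
function halfDuplexGate(frames) {
  const heard = [];
  for (const frame of frames) {
    if (frame.systemSpeaking) continue; // input blocked during output
    heard.push(frame.text);
  }
  return heard;
}
```

Anything the user says while `systemSpeaking` is true never reaches the recognizer, so the only possible reply afterwards is "I didn't catch that."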
How Speech-To-Speech Handles Interruptions
OpenAI’s Realtime API uses full-duplex audio—both parties can speak simultaneously.
When a user interrupts:
- System detects overlapping speech instantly
- Stops its own output mid-sentence
- Captures the user’s input
- Decides whether to respond or continue
```mermaid
graph TD
  A[Agent speaking] --> B{User starts talking}
  B --> C[Detect barge-in]
  C --> D[Stop agent audio]
  D --> E[Capture user input]
  E --> F{Interruption type?}
  F -->|Question| G[Answer immediately]
  F -->|Correction| H[Acknowledge + adjust]
  F -->|Impatience| I[Summarize + move on]
  G --> J[Resume or redirect]
  H --> J
  I --> J
```
Real Implementation: Barge-In Detection
Here’s how to build interruption handling:
```javascript
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-realtime',
});

let agentSpeaking = false;
let lastUtteranceContext = '';
let userBuffer = '';

// When the agent starts speaking
client.on('response.audio_transcript.delta', (event) => {
  agentSpeaking = true;
  lastUtteranceContext += event.delta; // Track what the agent was saying
});

// When the agent finishes
client.on('response.audio_transcript.done', () => {
  agentSpeaking = false;
  lastUtteranceContext = ''; // Clear context
});

// Detect a user interruption. Transcription deltas arrive in small
// chunks, so accumulate them instead of testing a single delta's length.
client.on('conversation.item.input_audio_transcription.delta', (event) => {
  userBuffer += event.delta;
  if (agentSpeaking && userBuffer.length > 10) {
    // User is talking while the agent is speaking = interruption
    handleInterruption(userBuffer, lastUtteranceContext);
    userBuffer = '';
  }
});
```
```javascript
async function handleInterruption(userInput, agentContext) {
  // Stop the current agent output
  await client.cancelResponse();

  // Analyze the interruption type
  const interruptionType = classifyInterruption(userInput, agentContext);

  switch (interruptionType) {
    case 'question':
      // User has a clarifying question
      await client.sendText({
        text: userInput,
        instructions: "Answer this question, then ask if they want me to continue where I left off."
      });
      break;

    case 'correction':
      // User is correcting information
      await client.sendText({
        text: "Got it. " + userInput,
        instructions: "Acknowledge the correction and incorporate it into the response."
      });
      break;

    case 'impatience':
      // User wants to skip ahead
      await client.sendText({
        text: "I'll get to the point.",
        instructions: "Summarize the key information quickly."
      });
      break;

    case 'agreement':
      // User is signaling understanding
      await client.sendText({
        text: "Great, moving on.",
        instructions: "Acknowledge and continue to next topic."
      });
      break;
  }
}
```
```javascript
function classifyInterruption(userText, agentContext) {
  const input = userText.toLowerCase();
  // Match whole words so short keywords like "no" don't match inside
  // words like "know" or "note"
  const has = (...phrases) =>
    phrases.some((p) => new RegExp(`\\b${p}\\b`).test(input));

  // Question indicators
  if (has('wait', 'what', 'why', 'how')) {
    return 'question';
  }
  // Correction indicators
  if (has('actually', 'no', "that's wrong")) {
    return 'correction';
  }
  // Impatience indicators
  if (has('skip', 'get to the point', 'bottom line')) {
    return 'impatience';
  }
  // Agreement indicators
  if (has('got it', 'yeah', 'okay')) {
    return 'agreement';
  }
  return 'question'; // Default to question
}
```
Context Recovery: The Critical Piece
Handling the interruption is only half the battle. You also need to recover context.
Bad approach:
User interrupts → Agent answers → Agent starts over from the beginning
Good approach:
User interrupts → Agent answers → Agent asks “Should I continue where I left off, or is that enough?”
```javascript
async function recoverContext(lastContext) {
  await client.sendText({
    text: "Should I continue explaining, or do you have everything you need?",
    instructions: `You were previously saying: "${lastContext}". If user wants more, continue from that point. If user is satisfied, move on.`
  });
}
```
Business Impact: Conversation Efficiency
A healthcare provider implemented barge-in handling for appointment scheduling:
Before (rigid turn-taking):
- Average call duration: 3.2 minutes
- 42% of users interrupted but system didn’t respond
- 28% repeat rate (“I already said that…”)
After (with interruption handling):
- Average call duration: 2.1 minutes (34% faster)
- 89% successful interruption recognition
- 12% repeat rate (57% reduction)
Why it worked: Users could correct mistakes immediately, skip unnecessary information, and ask questions without waiting. Conversations felt natural.
The Subtlety of Acknowledgment
When you interrupt someone, they usually acknowledge with a micro-response:
- “Go ahead”
- “Yeah?”
- “Mm-hmm”
Voice agents should do the same. Even a brief “Hold on” or “Let me answer that” signals that the interruption was heard.
Without acknowledgment:
Agent: “First, you’ll need to—”
User: “Wait, what about—”
Agent: [continues] “—verify your email address…”
User: [frustrated] “Are you even listening?”
With acknowledgment:
Agent: “First, you’ll need to—”
User: “Wait, what about—”
Agent: [stops] “Go ahead”
User: “What about my password?”
Agent: “Good question. You can reset that after…”
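One way to wire this in is a small lookup from the interruption type (as returned by `classifyInterruption`) to a micro-acknowledgment, spoken immediately after cancelling the agent's audio. The phrase table below is illustrative; pick wording that fits your agent's voice:

```javascript
// Micro-acknowledgments by interruption type, spoken right after the
// agent's audio is cancelled so the user knows they were heard.
const ACK_PHRASES = {
  question: 'Go ahead',
  correction: 'Oh, let me fix that',
  impatience: 'Sure, the short version',
  agreement: 'Great',
};

function acknowledgment(interruptionType) {
  return ACK_PHRASES[interruptionType] ?? 'Go ahead'; // safe default
}
```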
Implementation Checklist
Want to add interruption handling? Here’s what you need:
Technical:
- Full-duplex audio (simultaneous input/output)
- Barge-in detection (user speaking while agent speaks)
- Response cancellation (stop agent mid-sentence)
- Context preservation (remember what agent was saying)
- Interruption classification (question, correction, impatience)
Design:
- Acknowledgment phrases (“Go ahead”, “Hold on”)
- Context recovery prompts (“Should I continue?”)
- Skip-ahead patterns (summarize on demand)
- Correction flows (acknowledge + adjust)
Testing:
- Test with impatient users
- Test with clarifying questions mid-explanation
- Test with simultaneous speech (user and agent overlap)
- Measure context preservation accuracy
Edge Cases To Handle
1. False Positive Barge-Ins
Background voices, coughs, or “uh-huh” sounds might trigger interruption detection. Use voice activity detection (VAD) thresholds to filter out short sounds.
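One simple guard, sketched below with illustrative thresholds: require sustained voiced audio before declaring a barge-in, so a cough or a short backchannel never reaches the trigger.

```javascript
// Treat audio as a barge-in only after ~300 ms of sustained voiced audio
// (30 consecutive 10 ms frames). Short sounds like coughs or "uh-huh"
// never reach the threshold. Both constants need tuning per deployment.
const ENERGY_THRESHOLD = 0.02; // RMS energy; depends on microphone gain
const MIN_VOICED_FRAMES = 30;  // ~300 ms at 10 ms frames

function createBargeInDetector() {
  let voicedRun = 0;
  return function onFrame(rmsEnergy) {
    voicedRun = rmsEnergy > ENERGY_THRESHOLD ? voicedRun + 1 : 0;
    return voicedRun >= MIN_VOICED_FRAMES; // true => treat as interruption
  };
}
```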
2. Partial Overlaps
Sometimes users start talking just as the agent finishes. Don’t treat this as interruption—it’s natural turn-taking.
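A grace window around the agent's end-of-speech handles this: speech that starts within the window counts as a new turn, not a barge-in. The 500 ms value below is a starting point, not a recommendation:

```javascript
// If the user starts talking within a short grace window of the agent
// finishing, treat it as normal turn-taking rather than a barge-in.
const GRACE_MS = 500; // illustrative; tune against real conversations

function isBargeIn(userSpeechStartMs, agentSpeechEndMs) {
  // Agent still talking (no end time yet) => genuine interruption
  if (agentSpeechEndMs === null) return true;
  // Speech that begins near the agent's last word is a new turn
  return userSpeechStartMs < agentSpeechEndMs - GRACE_MS;
}
```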
3. Multiple Interruptions
If a user interrupts multiple times quickly, they might be very confused. After 2-3 interruptions in 30 seconds, offer a different explanation style or human handoff.
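A sliding-window counter makes the escalation rule concrete, using the 30-second window and three-interruption threshold from the guideline above:

```javascript
// Track interruption timestamps in a sliding window; after the third
// interruption within 30 seconds, escalate (different explanation style
// or human handoff).
const WINDOW_MS = 30_000;
const ESCALATE_AFTER = 3;

function createInterruptionTracker() {
  const timestamps = [];
  return function recordInterruption(nowMs) {
    timestamps.push(nowMs);
    // Drop interruptions that have fallen out of the window
    while (timestamps.length && nowMs - timestamps[0] > WINDOW_MS) {
      timestamps.shift();
    }
    return timestamps.length >= ESCALATE_AFTER; // true => escalate
  };
}
```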
4. Silent Interruptions
User might be thinking/processing while agent is speaking. Don’t require verbal interruption—add “Does this make sense?” checkpoints every 20-30 seconds.
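A minimal checkpoint timer, assuming the caller tracks when the last checkpoint (or user utterance) occurred:

```javascript
// Offer a "Does this make sense?" checkpoint after ~25 seconds of
// uninterrupted agent speech (within the 20-30 second range above).
const CHECKPOINT_INTERVAL_MS = 25_000;

function shouldCheckpoint(lastCheckpointMs, nowMs, agentSpeaking) {
  return agentSpeaking && nowMs - lastCheckpointMs >= CHECKPOINT_INTERVAL_MS;
}
```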
The Natural Conversation Feel
Here’s what makes barge-in handling successful: it removes the “I’m talking to a robot” friction.
When you can interrupt naturally, ask questions mid-explanation, and correct mistakes immediately, the system starts to feel responsive.
This doesn’t mean chaos. The agent still guides the conversation. But it allows for the natural back-and-forth that makes spoken communication effective.
Want to build this? Check out OpenAI’s Realtime API documentation for full-duplex audio patterns and interruption handling examples.
Ready to add barge-in support? Start with simple interruption detection. Add acknowledgment responses. Build context recovery. Test with real users who interrupt naturally.
The goal isn’t perfect interruption handling—it’s making users feel heard when they need to jump in.