Voice Agents That Let You Think
Most voice systems panic during silence. “Are you still there?” after 3 seconds. Or worse, they hang up. Good voice agents respect natural pauses—people need time to think, look things up, or consult others.
The Problem With Aggressive Timeouts
Traditional IVR systems time out quickly:
- 3 seconds: “I didn’t hear you, please try again”
- 5 seconds: “If you’re still there, press 1”
- 10 seconds: Call disconnected
This creates anxiety. Users feel rushed. They can’t:
- Think through options
- Check a document
- Ask someone nearby
- Look up information
The result? Frustrated users, abandoned calls, and complaints about “pushy” systems.
Why Silence Happens (And Why That’s OK)
Users pause for legitimate reasons:
Thinking:
Agent: "Would you like the red model or the blue model?"
User: [5 seconds of silence, considering options]
User: "I think... blue."
Looking things up:
Agent: "What's your account number?"
User: [8 seconds of silence, finding statement]
User: "Okay, it's 4-7-2-9..."
Consulting others:
Agent: "What date works for the appointment?"
User: [whispers to partner off-mic for 10 seconds]
User: "Tuesday at 2pm"
Distracted:
Agent: "Can I help with anything else?"
User: [15 seconds - kid crying in background]
User: "Sorry, one second... okay, yes..."
All of these are normal. The agent should wait patiently, not nag.
Architecture: Adaptive Silence Detection
graph TD
A[Agent Asks Question] --> B[Wait for Response]
B --> C{Silence Duration}
C -->|0-5 seconds| D[Wait Silently]
C -->|5-10 seconds| E{Context: Complex Question?}
E -->|Yes| F[Continue Waiting]
E -->|No| G[Gentle Prompt: "Take your time"]
C -->|10-20 seconds| H[Agent: "I'm still here when you're ready"]
C -->|20-30 seconds| I{Background Noise Detected?}
I -->|Yes, user still present| J[Continue Waiting]
I -->|No noise, line silent| K[Agent: "Let me know if you need help"]
C -->|30+ seconds| L[Agent: "Would you like me to call back?"]
The agent adjusts behavior based on:
- Question complexity (open-ended vs yes/no)
- Background audio (is the user still there?)
- Conversation history (user’s previous response speed)
- Silence duration (graduated responses, not binary timeout)
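The decision graph above can be sketched as a small pure function. This is an illustrative sketch only: the action names and thresholds are made up to mirror the diagram, not part of any API.

```python
def silence_action(duration_s: float, complexity: str, background_noise: bool) -> str:
    """Map silence duration plus context to a graduated response.

    Thresholds follow the flowchart: wait silently at first, nudge gently
    later, and only offer a callback after an extended quiet stretch.
    """
    if duration_s < 5:
        return "wait"
    if duration_s < 10:
        # Complex questions earn extra quiet patience before any prompt
        return "wait" if complexity == "complex" else "prompt:take_your_time"
    if duration_s < 20:
        return "prompt:still_here"
    if duration_s < 30:
        # Background noise means the user is present, just busy
        return "wait" if background_noise else "prompt:let_me_know"
    return "prompt:offer_callback"
```

Keeping this logic in one pure function makes the graduated policy easy to unit-test independently of any audio plumbing.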
Real-World Example: Complex Decision
Bad silence handling:
Agent: "Which plan works for you: Basic at $10/month,
Standard at $25/month, or Premium at $50/month?"
[3 seconds of silence]
Agent: "I didn't hear you. Let me repeat that.
Basic at $10, Standard..."
User: [frustrated] "I'm THINKING!"
Good silence handling:
Agent: "Which plan works for you: Basic at $10/month,
Standard at $25/month, or Premium at $50/month?"
[5 seconds of silence]
[Agent waits patiently]
[10 seconds]
Agent: [softly] "Take your time comparing."
[15 seconds more]
User: "I think Standard makes sense."
Agent: "Great choice. Let me set that up."
The agent doesn’t panic. The user feels respected, not rushed.
Implementation: Adaptive Timeouts With OpenAI Realtime
Here’s how to implement intelligent silence handling:
import { RealtimeClient } from '@openai/realtime-api-beta';

const client = new RealtimeClient({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'gpt-realtime'
});

await client.connect();

client.updateSession({
  voice: 'alloy',
  instructions: `You are a patient, respectful voice assistant.

SILENCE HANDLING RULES:
1. For simple yes/no questions: wait 10 seconds before prompting
2. For complex decisions: wait 20 seconds before a gentle prompt
3. NEVER say "Are you still there?" before 30 seconds
4. If the user is clearly present (background noise), wait indefinitely
5. Use gentle language: "Take your time", "I'm here when you're ready"
6. NEVER repeat the same prompt twice in a row
7. After 30+ seconds of total silence, offer a callback instead of hanging up

Graduated responses:
- 10s: "Take your time"
- 20s: "I'm still here whenever you're ready"
- 30s: "Would you like me to call you back when you've decided?"

Be patient. Silence is not failure.`
});
// Track silence duration with context awareness.
// Note: 'audio_received', 'silence_detected', and 'agent_message_sent' here
// are events emitted by your own audio/VAD layer, not built-in
// RealtimeClient events.
let silenceStart = null;
let lastQuestionComplexity = 'simple'; // or 'complex', 'open_ended'
let backgroundNoiseDetected = false;

client.on('audio_received', () => {
  // Reset the silence timer when the user speaks
  silenceStart = null;
});

client.on('silence_detected', async () => {
  if (!silenceStart) {
    silenceStart = Date.now();
  }
  const silenceDuration = Date.now() - silenceStart;

  // Check for background noise (user still present)
  backgroundNoiseDetected = await detectBackgroundNoise(audioStream);

  // Graduated responses based on context. Each check uses a window
  // (e.g. 10-15s) so the prompt fires once even though this handler
  // runs repeatedly -- an exact equality check on a millisecond
  // timestamp would almost never match.
  if (silenceDuration > 10000 && silenceDuration < 15000 &&
      lastQuestionComplexity === 'simple') {
    client.sendUserMessageContent([{
      type: 'input_text',
      text: "Take your time."
    }]);
  }

  if (silenceDuration > 20000 && silenceDuration < 25000 &&
      lastQuestionComplexity === 'complex') {
    client.sendUserMessageContent([{
      type: 'input_text',
      text: "I'm still here when you're ready."
    }]);
  }

  if (silenceDuration > 30000 && silenceDuration < 35000) {
    if (backgroundNoiseDetected) {
      // User is still there, just busy
      client.sendUserMessageContent([{
        type: 'input_text',
        text: "I can tell you're still there. Take all the time you need."
      }]);
    } else {
      // Line is silent, offer a callback
      client.sendUserMessageContent([{
        type: 'input_text',
        text: "Would you like me to call you back when you're ready? " +
              "Or I can stay on the line."
      }]);
    }
  }

  // Hard timeout at 60 seconds (if no response to the callback offer)
  if (silenceDuration > 60000 && !backgroundNoiseDetected) {
    client.sendUserMessageContent([{
      type: 'input_text',
      text: "I'll hang up for now. Feel free to call back anytime."
    }]);
    client.disconnect();
  }
});

// Track question complexity with a crude keyword heuristic
client.on('agent_message_sent', (message) => {
  if (message.includes('which') || message.includes('what would')) {
    lastQuestionComplexity = 'complex';
  } else if (message.includes('yes or no') || message.includes('confirm')) {
    lastQuestionComplexity = 'simple';
  } else {
    lastQuestionComplexity = 'open_ended';
  }
});
Background Noise Detection
async function detectBackgroundNoise(audioStream) {
  // analyzeAudio is a placeholder for your own audio-analysis pipeline
  const audioAnalysis = await analyzeAudio(audioStream);

  // Look for:
  // - Ambient noise (traffic, TV, office sounds)
  // - Movement sounds (rustling, footsteps)
  // - Other voices (user talking to someone else)
  // - Any audio that indicates the user is still present
  return audioAnalysis.backgroundActivity > 0.3; // tune this threshold
}
Python Implementation
import json
from datetime import datetime

class SilenceHandler:
    def __init__(self, ws):
        # ws: an open async WebSocket connection to the Realtime API
        self.ws = ws
        self.silence_start = None
        self.question_complexity = 'simple'
        self.prompted_at = set()  # track which nudges we've already sent

    async def _say(self, text):
        """Inject an assistant message.

        With the OpenAI Realtime API, responses are sent through
        conversation.item.create events, not direct session methods.
        """
        await self.ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "assistant",
                "content": [{"type": "text", "text": text}]
            }
        }))

    async def handle_silence(self, audio_stream):
        """Adaptive silence handling based on context."""
        if not self.silence_start:
            self.silence_start = datetime.now()
        duration = (datetime.now() - self.silence_start).total_seconds()

        # Detect whether the user is still present
        background_noise = await self.detect_background_activity(audio_stream)

        # Simple question: prompt at 10s
        if (duration > 10 and self.question_complexity == 'simple'
                and 10 not in self.prompted_at):
            await self._say("Take your time.")
            self.prompted_at.add(10)

        # Complex question: prompt at 20s
        if (duration > 20 and self.question_complexity == 'complex'
                and 20 not in self.prompted_at):
            await self._say("I'm here when you're ready.")
            self.prompted_at.add(20)

        # Long silence: check presence
        if duration > 30 and 30 not in self.prompted_at:
            if background_noise:
                await self._say("I can tell you're still there. No rush.")
            else:
                await self._say("Would you like me to call back later? "
                                "Or I can stay on the line.")
            self.prompted_at.add(30)

        # Hard timeout
        if duration > 60 and not background_noise:
            await self._say("I'll hang up for now. Call back anytime!")
            await self.ws.close()

    def reset(self):
        """Reset when the user speaks."""
        self.silence_start = None
        self.prompted_at.clear()

    async def detect_background_activity(self, audio_stream):
        """Detect non-speech audio indicating user presence."""
        # analyze_audio_features is a placeholder for your audio pipeline
        analysis = await analyze_audio_features(audio_stream)
        return analysis['ambient_noise_level'] > 0.3
User Experience Patterns
1. Match Prompt To Question Type
Simple yes/no:
Agent: “Would you like to proceed?” [waits 10s]
Complex decision:
Agent: “Which of these three options works best?” [waits 20s]
Open-ended:
Agent: “Tell me what happened.” [waits indefinitely, user controls pace]
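A keyword heuristic like the JavaScript handler earlier can route each question type to its timeout. This Python sketch is purely illustrative (the keyword lists are assumptions); a production system might instead ask the model to classify its own question.

```python
def classify_question(text: str) -> str:
    """Crude keyword heuristic mapping a question to a silence budget."""
    t = text.lower()
    if any(kw in t for kw in ("which", "what would", "compare")):
        return "complex"      # decision among options: wait ~20s
    if any(kw in t for kw in ("yes or no", "confirm", "would you like to proceed")):
        return "simple"       # quick confirmation: wait ~10s
    return "open_ended"       # user controls the pace: wait indefinitely
```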
2. Use Gentle Language
Bad:
“I didn’t hear you.” “Please respond.” “Are you still there?”
Good:
“Take your time.” “I’m here when you’re ready.” “No rush.”
3. Don’t Repeat The Same Prompt
Bad:
[10s silence]
Agent: "I'm waiting for your response"
[10s more]
Agent: "I'm waiting for your response"
Good:
[10s silence]
Agent: "Take your time"
[10s more]
Agent: "I'm still here"
[10s more]
Agent: "Let me know if you need help deciding"
Each prompt adds information or offers help—never identical repetition.
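One simple way to guarantee non-repetition is an explicit prompt ladder advanced by a counter. A minimal sketch, using the wording from the example above:

```python
# Each nudge differs from the last and escalates toward offering help.
PROMPT_LADDER = [
    "Take your time",
    "I'm still here",
    "Let me know if you need help deciding",
]

def next_prompt(prompts_sent: int):
    """Return the next distinct nudge, or None once the ladder is
    exhausted (at which point offer a callback instead of repeating)."""
    if prompts_sent < len(PROMPT_LADDER):
        return PROMPT_LADDER[prompts_sent]
    return None
```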
4. Offer Callbacks For Extended Pauses
After 30+ seconds:
Agent: "Seems like you might need to look something up.
Would you like me to call you back in 5 minutes?
Or I'm happy to wait."
This shows respect for the user’s time.
Business Impact
Call completion rate:
- Before adaptive silence: 68% completion (many abandoned during pauses)
- After adaptive silence: 82% completion
- +14 percentage points in completed transactions
User satisfaction:
- “Felt rushed” complaints dropped from 23% to 4%
- “Patient and respectful” feedback increased by 40%
- Average satisfaction score: 4.2 → 4.7 out of 5
Call duration:
- Slightly longer calls (+15 seconds avg)
- But higher success rate makes up for it
- Net positive: more successful outcomes per hour
Edge Cases To Handle
1. User Steps Away
If 30+ seconds with no background noise:
Agent: "It sounds like you might have stepped away.
I'll hang up for now, but feel free to call back anytime."
2. User Talking To Someone Else
If background voices detected:
Agent: [continues waiting silently until user addresses agent again]
Don’t interrupt conversations happening off-mic.
3. User Drops Call Accidentally
If sudden silence (no audio at all, not even ambient):
Agent: [wait 5 seconds]
Agent: "Hello? I think we might have been disconnected.
If you're still there, say something."
[Wait 10 more seconds]
[Hang up gracefully]
4. User Multitasking
If periodic background noise (typing, movement):
Agent: [waits indefinitely until user speaks]
User is clearly present, just busy.
When To Use Aggressive Timeouts
Some situations do need faster timeouts:
Security contexts:
Agent: "For security, I need your PIN within 10 seconds."
[If no response, timeout for safety]
Emergency services:
Agent: "Are you in immediate danger? Please respond."
[If no response, escalate to human operator]
Time-sensitive offers:
Agent: "This price expires in 30 seconds. Shall I book it?"
[Timeout makes sense here]
But for normal interactions? Be patient.
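These exceptions fit naturally into a per-context policy table, so the patient defaults stay the norm and the strict timeouts stay the exception. The contexts and numbers below are illustrative placeholders, not recommendations:

```python
# Hypothetical policy table: security and emergency flows override
# the patient defaults; "general" has no hard timeout at all.
TIMEOUT_POLICY = {
    "general":       {"first_nudge_s": 10, "hard_timeout_s": None},
    "security":      {"first_nudge_s": 5,  "hard_timeout_s": 10},
    "emergency":     {"first_nudge_s": 3,  "hard_timeout_s": 5},
    "limited_offer": {"first_nudge_s": 10, "hard_timeout_s": 30},
}

def hard_timeout(context: str):
    """Look up the hard timeout for a context, defaulting to patient."""
    return TIMEOUT_POLICY.get(context, TIMEOUT_POLICY["general"])["hard_timeout_s"]
```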
Next Steps
If you want to add intelligent silence handling to your voice agents:
- Audit your current timeouts (are they too aggressive?)
- Categorize question types (simple, complex, open-ended)
- Implement graduated responses (don’t nag at fixed intervals)
- Add background noise detection (is user still present?)
- Test with real users (what feels natural?)
- Monitor abandonment rates (before/after changes)
- Collect qualitative feedback (do users feel rushed?)
Silence isn’t a problem to solve—it’s a natural part of conversation. Good voice agents respect it instead of panicking. The goal is to feel patient and present, not robotic and demanding.
Want to add adaptive silence handling to your voice application? We can help you implement context-aware timeouts that respect natural conversational pauses while staying responsive.