Design Tools For Voice, Not Text

Your voice agent makes 8 tool calls to book a flight. Eight.

Search availability. Filter by price. Sort by duration. Paginate results. Select option. Add to cart. Check out. Confirm.

The user spoke once. The agent spoke eight times. The conversation took 4 minutes.

The problem: Your tools were designed for text agents. Voice agents need different abstractions.

Why Text-Agent Tools Break Voice Conversations

Text agents can afford granularity:

Agent: I found 47 flights. Let me sort by price...
       [tool_call: sort_flights("price")]
       Okay, sorted. Now filtering for direct flights...
       [tool_call: filter_flights("direct")]
       Great. Showing top 5 results...
       [tool_call: paginate(page=1, size=5)]

Users tolerate this in chat. They don’t tolerate it in voice.

In voice:

  • Every tool call adds 1-2 seconds of latency
  • Users hear the agent “thinking” between each call
  • Multi-step workflows feel sluggish
  • Conversations become transactional, not conversational

The insight: Voice agents need tools that match how humans think—not how databases work.

The Difference: Low-Level vs High-Level Tools

Low-Level Tools (Designed For Text)

// Tool 1: Search
async function searchFlights(origin, destination, date) {
  return await db.flights.find({ origin, destination, date });
}

// Tool 2: Filter
async function filterFlights(results, criteria) {
  return results.filter(f => meetsCriteria(f, criteria));
}

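// Tool 3: Sort (copies the array so the caller's results are not reordered in place)
async function sortFlights(results, sortBy) {
  return [...results].sort((a, b) => compare(a, b, sortBy));
}

// Tool 4: Paginate (pages are 1-based, so page 1 returns the first `size` results)
async function paginateFlights(results, page, size) {
  return results.slice((page - 1) * size, page * size);
}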

// Tool 5: Select
async function selectFlight(flightId) {
  return await db.flights.findById(flightId);
}

Voice agent behavior:

User: "Find me a flight to Chicago tomorrow"

Agent: "Searching flights..."
        [tool_call: searchFlights]
        "Found 47 options. Let me filter for morning departures..."
        [tool_call: filterFlights]
        "Okay, 12 morning flights. Sorting by price..."
        [tool_call: sortFlights]
        "Got it. Here are the top 3..."
        [tool_call: paginateFlights]
        
Total: 4 tool calls, about 8 seconds, and the user heard four separate status updates

High-Level Tools (Designed For Voice)

// Single tool: Find best match
async function findBestFlight(criteria) {
  // Encapsulates: search, filter, sort, rank, select
  const flights = await db.flights.find({
    origin: criteria.origin,
    destination: criteria.destination,
    date: criteria.date
  });
  
  const filtered = flights.filter(f => 
    matchesPreferences(f, criteria.preferences)
  );
  
  const ranked = rankByRelevance(filtered, criteria);
  
  return {
    best_match: ranked[0],
    alternatives: ranked.slice(1, 3),
    why_best: explainRanking(ranked[0], criteria)
  };
}

Voice agent behavior:

User: "Find me a flight to Chicago tomorrow"

Agent: "Looking for morning flights to Chicago..."
        [tool_call: findBestFlight]
        "I found a United flight at 8:15 AM for $220. 
         It's direct and arrives by 10:30. Want this one?"
        
Total: 1 tool call, 2 seconds, natural conversation

Time saved: 75%. Conversation quality: dramatically better.
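The 75% figure follows directly from the per-call latency. Here is a rough back-of-the-envelope sketch; the function name is made up for illustration, and the 2 seconds per call is an assumption taken from the upper end of the 1-2 second range mentioned earlier, not a measurement.

```javascript
// Rough estimate of the waiting time that tool calls add to one voice turn.
// secondsPerCall is an assumed average, not a measured value.
function estimateToolLatency(toolCalls, secondsPerCall = 2) {
  return toolCalls * secondsPerCall;
}

const granular = estimateToolLatency(4); // search, filter, sort, paginate
const combined = estimateToolLatency(1); // one high-level tool
const saved = (granular - combined) / granular;

console.log(`${granular}s vs ${combined}s (${Math.round(saved * 100)}% less waiting)`);
// prints "8s vs 2s (75% less waiting)"
```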

Architecture: Voice-First Tool Design

Here’s how to structure tools for voice agents:

graph TB
    A[User Intent] --> B{Tool Design}
    
    B -->|Low-Level| C[Multiple Tool Calls]
    C --> D[search]
    C --> E[filter]
    C --> F[sort]
    C --> G[paginate]
    C --> H[select]
    D --> I[Agent Speaks Between Each]
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[8 seconds, 4 interruptions]
    
    B -->|High-Level| K[Single Tool Call]
    K --> L[findBestMatch]
    L --> M[Internal: search → filter → sort → rank]
    M --> N[Agent Speaks Once]
    N --> O[2 seconds, 1 turn]
    
    J --> P[User Experience: Slow]
    O --> Q[User Experience: Fast]
    
    style A fill:#e1f5ff
    style K fill:#d4f4dd
    style L fill:#d4f4dd
    style C fill:#ffe1e1
    style J fill:#ffe1e1
    style Q fill:#d4f4dd

The pattern: Encapsulate workflows, not database operations.
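One way to apply this pattern without rewriting existing helpers is to chain them behind a single handler. The sketch below is only an illustration: composeTool, the FLIGHTS array, and the maxStops argument are hypothetical names, and the in-memory array stands in for a real database.

```javascript
// Wrap a sequence of internal steps into a single tool handler.
// Each step receives the previous step's output plus the original arguments.
function composeTool(steps) {
  return async function handler(args) {
    let value;
    for (const step of steps) {
      value = await step(value, args);
    }
    return value;
  };
}

// Hypothetical in-memory data standing in for a real flight database.
const FLIGHTS = [
  { id: 'A1', price: 310, stops: 0 },
  { id: 'B2', price: 220, stops: 0 },
  { id: 'C3', price: 180, stops: 1 },
];

const findBestMatch = composeTool([
  async () => FLIGHTS,                                              // search
  (flights, args) => flights.filter(f => f.stops <= args.maxStops), // filter
  (flights) => [...flights].sort((a, b) => a.price - b.price),      // sort
  (flights) => ({                                                   // select
    best_match: flights[0] ?? null,
    alternatives: flights.slice(1, 3),
  }),
]);

findBestMatch({ maxStops: 0 }).then(result => {
  console.log(result.best_match.id); // prints "B2"
});
```

The agent sees one tool and one result; the four steps stay an implementation detail that can change without touching the tool schema.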

Implementation: Voice-Optimized Tools

Here’s how to refactor tools for the OpenAI Realtime API:

Before: Text-Agent Tools

const tools = [
  {
    type: 'function',
    name: 'search_products',
    description: 'Search product catalog',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string' },
        category: { type: 'string' }
      }
    }
  },
  {
    type: 'function',
    name: 'filter_products',
    description: 'Filter product list by criteria',
    parameters: {
      type: 'object',
      properties: {
        products: { type: 'array' },
        max_price: { type: 'number' },
        min_rating: { type: 'number' }
      }
    }
  },
  {
    type: 'function',
    name: 'sort_products',
    description: 'Sort product list',
    parameters: {
      type: 'object',
      properties: {
        products: { type: 'array' },
        sort_by: { type: 'string', enum: ['price', 'rating', 'popularity'] }
      }
    }
  }
];

// Agent makes 3+ tool calls for simple request

After: Voice-Agent Tools

const tools = [
  {
    type: 'function',
    name: 'find_product_recommendation',
    description: `Find the best product match for user needs. 
                  Handles search, filtering, sorting, and ranking internally.
                  Returns: best match + 2 alternatives + explanation.`,
    parameters: {
      type: 'object',
      properties: {
        user_need: { 
          type: 'string',
          description: 'What the user is looking for, in their own words'
        },
        constraints: {
          type: 'object',
          properties: {
            max_price: { type: 'number' },
            category: { type: 'string' },
            required_features: { type: 'array', items: { type: 'string' } }
          }
        },
        preferences: {
          type: 'object',
          properties: {
            prioritize: { 
              type: 'string', 
              enum: ['price', 'quality', 'speed', 'popularity'],
              description: 'What matters most to the user'
            }
          }
        }
      },
      required: ['user_need']
    }
  }
];

// Agent makes 1 tool call, gets complete answer
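Note that this schema nests constraints and preferences, while the handler in the implementation that follows reads flat fields such as max_price and prioritize. If you keep the nested schema, a small adapter can bridge the two. This is a sketch under that assumption; normalizeRecommendationArgs is a hypothetical helper, and the 'quality' default mirrors the default used by the ranking code.

```javascript
// Flatten the nested tool arguments into the flat shape a handler might expect.
// Missing optional sections fall back to empty objects.
function normalizeRecommendationArgs(args) {
  if (!args || typeof args.user_need !== 'string' || args.user_need.trim() === '') {
    throw new Error('user_need is required');
  }
  const constraints = args.constraints ?? {};
  const preferences = args.preferences ?? {};
  return {
    user_need: args.user_need.trim(),
    max_price: constraints.max_price,
    category: constraints.category,
    required_features: constraints.required_features ?? [],
    prioritize: preferences.prioritize ?? 'quality',
  };
}

console.log(normalizeRecommendationArgs({
  user_need: 'laptop for video editing',
  constraints: { max_price: 2000 },
}));
```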

Implementation

import { RealtimeClient } from '@openai/realtime-api-beta';

class VoiceOptimizedTools {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
  }

  async setupVoiceAgent() {
    await this.client.connect();

    // Register the high-level tool together with its handler.
    // addTool() comes from the beta reference client: when the model calls the
    // tool, the client runs the handler, sends the result back as the
    // function call output, and asks the model to respond. Check this against
    // the client version you have installed.
    this.client.addTool(
      {
        name: 'find_product_recommendation',
        description: `Find best product for user needs. Encapsulates: 
                      search, filter, rank, compare. Returns ready-to-speak 
                      recommendation with explanation.`,
        parameters: {
          type: 'object',
          properties: {
            user_need: { type: 'string' },
            max_price: { type: 'number' },
            category: { type: 'string' },
            prioritize: { 
              type: 'string', 
              enum: ['price', 'quality', 'speed'] 
            }
          },
          required: ['user_need']
        }
      },
      async (params) => this.findProductRecommendation(params)
    );

    this.client.updateSession({
      instructions: `
You are a helpful shopping assistant. When users describe what they need,
use find_product_recommendation ONCE to get a complete answer. Don't make
multiple tool calls - the tool handles everything internally.

After getting the recommendation, present it conversationally:
"I found a great option for you: [product]. It's [why it's good]. 
 I also have [alternative 1] and [alternative 2] if you want to compare."
`,
      voice: 'alloy',
      // The beta Realtime API accepts ['text'] or ['text', 'audio'], not audio alone
      modalities: ['text', 'audio'],
      // Transcription is off by default; enable it so user speech can be logged
      input_audio_transcription: { model: 'whisper-1' }
    });

    // Log what the user said. Raw server events arrive through 'realtime.event'.
    this.client.on('realtime.event', ({ source, event }) => {
      if (
        source === 'server' &&
        event.type === 'conversation.item.input_audio_transcription.completed'
      ) {
        console.log('User said:', event.transcript);
      }
    });
  }

  async findProductRecommendation(params) {
    // HIGH-LEVEL TOOL: Encapsulates entire workflow
    
    // Step 1: Search (internal, user doesn't hear this)
    const allProducts = await this.searchProducts(params.user_need, params.category);
    
    // Step 2: Filter (internal)
    const filtered = this.filterProducts(allProducts, {
      max_price: params.max_price,
      min_rating: 4.0  // default quality threshold
    });
    
    // Step 3: Rank (internal)
    const ranked = this.rankProducts(filtered, params.prioritize || 'quality');
    
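    // Step 4: Select best + alternatives
    if (ranked.length === 0) {
      // Handle the empty case here so the agent still gets something it can say
      return {
        recommendation: null,
        alternatives: [],
        search_summary: `Found ${allProducts.length} products, but none matched the constraints`
      };
    }
    const best = ranked[0];
    const alternatives = ranked.slice(1, 3);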
    
    // Step 5: Generate explanation
    const explanation = this.explainRecommendation(best, params);
    
    // Return everything agent needs to speak naturally
    return {
      recommendation: {
        name: best.name,
        price: best.price,
        rating: best.rating,
        key_features: best.features.slice(0, 3),
        why_recommended: explanation
      },
      alternatives: alternatives.map(p => ({
        name: p.name,
        price: p.price,
        key_difference: this.compareToRecommendation(p, best)
      })),
      search_summary: `Found ${allProducts.length} products, narrowed to ${filtered.length} matches`
    };
  }

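  async searchProducts(query, category) {
    // Your actual search logic. `db` stands in for your own MongoDB-style data layer.
    const filter = { $text: { $search: query } };
    if (category) filter.category = category;  // only constrain category when one was given
    return await db.products.find(filter).limit(100);
  }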

  filterProducts(products, constraints) {
    return products.filter(p => 
      (!constraints.max_price || p.price <= constraints.max_price) &&
      (!constraints.min_rating || p.rating >= constraints.min_rating)
    );
  }

  rankProducts(products, prioritize) {
    const scoreFunctions = {
      price: (p) => 1 / p.price,  // lower price scores higher
      quality: (p) => p.rating * p.review_count,
      speed: (p) => p.shipping_days < 2 ? 10 : 1,
      popularity: (p) => 1 / p.sales_rank  // assumes rank 1 is the best seller
    };
    
    const scoreFunc = scoreFunctions[prioritize] || scoreFunctions.quality;
    
    return products
      .map(p => ({ ...p, score: scoreFunc(p) }))
      .sort((a, b) => b.score - a.score);
  }

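  explainRecommendation(product, params) {
    const reasons = [];
    // Match the default used by rankProducts when no priority was given
    const prioritize = params.prioritize || 'quality';
    
    if (prioritize === 'price') {
      reasons.push(`best value at $${product.price}`);
    } else if (prioritize === 'quality') {
      reasons.push(`highly rated (${product.rating} stars from ${product.review_count} reviews)`);
    }
    
    if (product.features.some(f => params.user_need.toLowerCase().includes(f.toLowerCase()))) {
      reasons.push(`has the features you mentioned`);
    }
    
    // Never return an empty string: the agent reads this text aloud
    return reasons.length > 0 ? reasons.join(', ') : 'closest match to what you described';
  }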

  compareToRecommendation(alternative, best) {
    if (alternative.price < best.price * 0.8) {
      return `much cheaper at $${alternative.price}`;
    } else if (alternative.rating > best.rating) {
      return `higher rated (${alternative.rating} stars)`;
    } else {
      return `different feature set`;
    }
  }
}

// Usage
const voiceTools = new VoiceOptimizedTools();
await voiceTools.setupVoiceAgent();

// User: "I need a laptop for video editing under $2000"
// Agent makes 1 tool call, gets complete recommendation, speaks naturally

Real-World Results

A retail company refactored their voice shopping assistant:

Before (low-level tools):

  • Average conversation: 12 turns
  • Average time: 4.5 minutes
  • Tool calls per session: 8.3
  • User satisfaction: 3.2/5
  • “Agent feels slow”: 67% of feedback

After (high-level tools):

  • Average conversation: 5 turns
  • Average time: 2.1 minutes
  • Tool calls per session: 2.1
  • User satisfaction: 4.6/5
  • “Agent feels slow”: 12% of feedback

Impact:

  • 53% faster conversations
  • 75% fewer tool calls
  • 44% improvement in satisfaction
  • $180K saved annually (less compute time)

Design Patterns For Voice-First Tools

Pattern 1: Task-Based Not Operation-Based

// ❌ Operation-based (text agent style)
await searchUsers();
await filterByRole();
await sortByActivity();
await selectTop5();

// ✅ Task-based (voice agent style)
await findRelevantTeamMembers({ task: 'code review', skills: ['TypeScript'] });

Pattern 2: Return Speaking-Ready Data

// ❌ Returns raw data
{
  results: [...],
  total: 47,
  page: 1
}

// ✅ Returns presentation-ready data
{
  top_match: { name: "...", why: "..." },
  alternatives: [ ... ],
  summary: "Found 3 great options out of 47 total",
  next_question: "Would you like to hear more about the top choice?"
}
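A small formatter can produce this shape from raw search results. This is a sketch only: toSpeakingReady is a hypothetical helper, and it assumes each result already carries a name and a short why string.

```javascript
// Convert raw, ranked results into a payload the agent can read out almost verbatim.
function toSpeakingReady(results, total) {
  if (results.length === 0) {
    return {
      top_match: null,
      alternatives: [],
      summary: `No matches out of ${total} total`,
      next_question: 'Would you like to change your search?',
    };
  }
  const [top, ...rest] = results;
  const alternatives = rest.slice(0, 2); // keep the spoken list short
  const shown = 1 + alternatives.length;
  return {
    top_match: { name: top.name, why: top.why },
    alternatives: alternatives.map(r => ({ name: r.name, why: r.why })),
    summary: `Found ${shown} ${shown === 1 ? 'option' : 'options'} out of ${total} total`,
    next_question: 'Would you like to hear more about the top choice?',
  };
}

const payload = toSpeakingReady(
  [
    { name: 'Headphones A', why: 'best battery life' },
    { name: 'Headphones B', why: 'lowest price' },
    { name: 'Headphones C', why: 'best noise cancelling' },
    { name: 'Headphones D', why: 'lightest' },
  ],
  47
);
console.log(payload.summary); // prints "Found 3 options out of 47 total"
```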

Pattern 3: Include Context For Follow-Ups

// ❌ Agent forgets what it found
{
  result: { id: 123, name: "Product A" }
}

// ✅ Agent remembers for follow-up questions
{
  result: { id: 123, name: "Product A" },
  context: {
    search_query: "wireless headphones under $200",
    alternatives_ids: [124, 125],
    why_chosen: "best battery life in price range"
  },
  follow_up_suggestions: [
    "Check shipping time",
    "Compare to alternatives",
    "Add to cart"
  ]
}
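On the server side, the same context can be kept per session so that a follow-up such as "tell me about the second one" resolves without a new search. The class below is a minimal sketch; FollowUpContext and its method names are assumptions for illustration, and it expects the result shape shown above.

```javascript
// Keep the last tool result per session so follow-ups can be resolved
// from what was already found.
class FollowUpContext {
  constructor() {
    this.lastBySession = new Map();
  }

  remember(sessionId, toolResult) {
    this.lastBySession.set(sessionId, toolResult);
  }

  // Position 1 is the main result; 2 and up index into alternatives_ids.
  resolve(sessionId, position) {
    const last = this.lastBySession.get(sessionId);
    if (!last) return null;
    if (position === 1) return last.result.id;
    return last.context.alternatives_ids[position - 2] ?? null;
  }
}

const followUps = new FollowUpContext();
followUps.remember('session-1', {
  result: { id: 123, name: 'Product A' },
  context: { alternatives_ids: [124, 125] },
});
console.log(followUps.resolve('session-1', 2)); // prints 124
```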

Pattern 4: Anticipate Next Steps

// ❌ Requires separate tool call for each action
await getProduct(id);
await checkInventory(id);
await getShipping(id);

// ✅ Returns everything user likely needs next
async function getProductDetails(id) {
  return {
    product: { ... },
    in_stock: true,
    ships_in: "2 days",
    related_products: [...],
    can_add_to_cart: true
  };
}
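Bundling the lookups does not have to make the tool slower, because independent lookups can run concurrently. A minimal sketch with Promise.all follows; the three lookup functions are hypothetical stubs standing in for real product, inventory, and shipping services.

```javascript
// Hypothetical lookups standing in for real services.
const getProduct = async (id) => ({ id, name: 'Product A' });
const checkInventory = async (id) => ({ in_stock: true });
const getShipping = async (id) => ({ ships_in: '2 days' });

// One tool call fans out to all three lookups at the same time, so the
// total wait is roughly the slowest lookup rather than the sum of all three.
async function getProductDetails(id) {
  const [product, inventory, shipping] = await Promise.all([
    getProduct(id),
    checkInventory(id),
    getShipping(id),
  ]);
  return {
    product,
    in_stock: inventory.in_stock,
    ships_in: shipping.ships_in,
    can_add_to_cart: inventory.in_stock,
  };
}

getProductDetails(123).then(details => console.log(details.ships_in)); // prints "2 days"
```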

Tool Guidelines For Voice Agents

Do                                      | Don’t
Encapsulate workflows                   | Expose database operations
Return explanation text                 | Return raw IDs or codes
Handle edge cases internally            | Force agent to handle errors
Anticipate follow-up needs              | Require multiple calls for related data
Include why you returned this result    | Just return data without context
Make tools match human thinking         | Make tools match database schema
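"Handle edge cases internally" can be as simple as wrapping each handler so a failure turns into something the agent can say. The wrapper below is a sketch; withSpokenFallback and the ok and speak fields are hypothetical names, not part of any API.

```javascript
// Wrap a tool handler so failures come back as a speakable message
// instead of an exception the agent has to interpret.
function withSpokenFallback(handler, fallbackMessage) {
  return async function safeHandler(args) {
    try {
      const result = await handler(args);
      return { ok: true, ...result };
    } catch (err) {
      // In real code, log err here; the agent only needs the spoken fallback.
      return { ok: false, speak: fallbackMessage };
    }
  };
}

const safeSearch = withSpokenFallback(
  async () => { throw new Error('database unavailable'); },
  "I'm having trouble searching right now. Want me to try again?"
);

safeSearch({ user_need: 'laptop' }).then(result => console.log(result.speak));
```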

Implementation Timeline

Week 1: Audit existing tools

  • List all tool calls made in typical conversations
  • Identify sequential patterns (search → filter → sort)
  • Find tools that require 3+ calls to complete a task
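The audit can be partly automated if you log tool calls per session. The sketch below counts recurring runs of consecutive calls; findSequentialPatterns and the sample sessions are made up for illustration, and your own logs will have a different shape.

```javascript
// Count how often each run of consecutive tool calls appears across sessions.
// Frequent runs are candidates for a single high-level tool.
function findSequentialPatterns(sessions, length = 3) {
  const counts = new Map();
  for (const calls of sessions) {
    for (let i = 0; i + length <= calls.length; i++) {
      const key = calls.slice(i, i + length).join(' → ');
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  // Most frequent pattern first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sessions = [
  ['search', 'filter', 'sort', 'paginate'],
  ['search', 'filter', 'sort'],
  ['search', 'select'],
];

const patterns = findSequentialPatterns(sessions);
console.log(patterns[0]); // prints [ 'search → filter → sort', 2 ]
```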

Week 2: Design high-level replacements

  • Group related operations into single tools
  • Add explanation fields to responses
  • Include context for follow-ups

Week 3: Test with voice agent

  • Measure conversation length before/after
  • Count tool calls per session
  • Gather user feedback on speed
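For the before/after comparison, a percent-change helper is usually enough. The sketch below uses the figures from the retail example earlier in this post as sample input; compareMetrics and the metric names are hypothetical.

```javascript
// Percent change from before to after, rounded to the nearest whole percent.
function percentChange(before, after) {
  return Math.round(((after - before) / before) * 100);
}

// Compare two metric snapshots that share the same keys.
function compareMetrics(before, after) {
  const report = {};
  for (const key of Object.keys(before)) {
    report[key] = percentChange(before[key], after[key]);
  }
  return report;
}

const before = { minutes: 4.5, toolCalls: 8.3, satisfaction: 3.2 };
const after = { minutes: 2.1, toolCalls: 2.1, satisfaction: 4.6 };

console.log(compareMetrics(before, after));
// prints { minutes: -53, toolCalls: -75, satisfaction: 44 }
```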

Week 4: Optimize and deploy

  • Refine tool descriptions for better agent understanding
  • Add caching for repeated queries
  • Monitor latency and adjust
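Caching for repeated queries can start as a thin wrapper around the tool handler. This is a minimal in-memory sketch with a time-to-live; withCache is a hypothetical helper, the 60-second default is arbitrary, and the injectable clock exists only to make it testable.

```javascript
// Cache handler results by their arguments for a limited time.
// Note: JSON.stringify keys depend on property order, so { a, b } and { b, a }
// are cached separately in this simple version.
function withCache(handler, ttlMs = 60_000, now = () => Date.now()) {
  const cache = new Map();
  return async function cachedHandler(args) {
    const key = JSON.stringify(args);
    const hit = cache.get(key);
    if (hit && now() - hit.at < ttlMs) {
      return hit.value; // fresh enough: skip the expensive call
    }
    const value = await handler(args);
    cache.set(key, { at: now(), value });
    return value;
  };
}

let lookups = 0;
const cachedSearch = withCache(async (args) => {
  lookups += 1; // stands in for a slow database query
  return { query: args.user_need, lookups };
});

cachedSearch({ user_need: 'laptop' })
  .then(() => cachedSearch({ user_need: 'laptop' }))
  .then(() => console.log(lookups)); // prints 1
```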

Cost Impact

Higher-level tools reduce costs:

Realtime API pricing:

  • Input audio: $0.06/minute
  • Output audio: $0.24/minute
  • Average conversation: 3 minutes = $0.90

Reducing tool calls:

  • 8 tool calls → 2 tool calls = 75% fewer tool round trips
  • 4.5 minute conversation → 2.1 minutes = 53% shorter
  • Cost per conversation, applying that 53% reduction to the $0.90 average: $0.90 → $0.42 = $0.48 saved

At 10,000 conversations/month: roughly $4,800/month savings
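The arithmetic behind these numbers can be reproduced in a few lines. This sketch makes two assumptions: both audio rates are billed for the full length of the conversation (an upper bound), and the 53% reduction is applied to the 3-minute, $0.90 baseline. Unrounded, the monthly savings come to about $4,770, which rounds to the $4,800 figure.

```javascript
// Per-minute audio rates quoted above (USD).
const INPUT_PER_MIN = 0.06;
const OUTPUT_PER_MIN = 0.24;

// Assumes input and output audio are both billed for the whole conversation.
function conversationCost(minutes) {
  return minutes * (INPUT_PER_MIN + OUTPUT_PER_MIN);
}

const baseline = conversationCost(3);        // about $0.90
const reduced = baseline * (1 - 0.53);       // about $0.42 after a 53% reduction
const monthlySavings = (baseline - reduced) * 10000;

console.log(baseline.toFixed(2), reduced.toFixed(2), Math.round(monthlySavings));
// prints "0.90 0.42 4770"
```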

Plus: Better user experience leads to higher completion rates (more revenue).

When To Use High-Level Tools

Use High-Level Tools When               | Use Low-Level Tools When
Voice conversations                     | Text-based chat
Multi-step workflows are common         | Operations are truly independent
Speed matters more than flexibility     | Users need granular control
Agent decides the workflow              | User directs each step explicitly

Most voice agents should use high-level tools. Low-level tools make sense for power users who want control—not typical voice interactions.

What’s Next

Voice-optimized tools evolve toward:

  • Adaptive complexity: Tool adjusts based on user expertise
  • Streaming responses: Tool returns partial results as they’re ready
  • Learning from usage: Tools refine based on which results users actually use

The end state: Tools that match the pace of human speech, not database queries.

If you want voice agents with optimized tool design, we can refactor your function calls for voice-first interactions. The result: faster conversations, fewer turns, better user experience.
