Design Tools For Voice, Not Text

Tool design
January 6, 2026

Your voice agent makes 8 tool calls to book a flight. Eight.

Search availability. Filter by price. Sort by duration. Paginate results. Select option. Add to cart. Check out. Confirm.

The user spoke once. The agent spoke eight times. The conversation took 4 minutes.

The problem: Your tools were designed for text agents. Voice agents need different abstractions.

Why Text-Agent Tools Break Voice Conversations

Text agents can afford granularity:

Agent: I found 47 flights. Let me sort by price...
       [tool_call: sort_flights("price")]
       Okay, sorted. Now filtering for direct flights...
       [tool_call: filter_flights("direct")]
       Great. Showing top 5 results...
       [tool_call: paginate(page=1, size=5)]

Users tolerate this in chat. They don’t tolerate it in voice.

In voice:

Every tool call adds 1-2 seconds of latency
Users hear the agent “thinking” between each call
Multi-step workflows feel sluggish
Conversations become transactional, not conversational

The insight: Voice agents need tools that match how humans think—not how databases work.

The Difference: Low-Level vs High-Level Tools

Low-Level Tools (Designed For Text)

// Tool 1: Search
async function searchFlights(origin, destination, date) {
  return await db.flights.find({ origin, destination, date });
}

// Tool 2: Filter
async function filterFlights(results, criteria) {
  return results.filter(f => meetsCriteria(f, criteria));
}

// Tool 3: Sort
async function sortFlights(results, sortBy) {
  return results.sort((a, b) => compare(a, b, sortBy));
}

// Tool 4: Paginate
async function paginateFlights(results, page, size) {
  return results.slice(page * size, (page + 1) * size);
}

// Tool 5: Select
async function selectFlight(flightId) {
  return await db.flights.findById(flightId);
}

Voice agent behavior:

User: "Find me a flight to Chicago tomorrow"

Agent: "Searching flights..."
        [tool_call: searchFlights]
        "Found 47 options. Let me filter for morning departures..."
        [tool_call: filterFlights]
        "Okay, 12 morning flights. Sorting by price..."
        [tool_call: sortFlights]
        "Got it. Here are the top 3..."
        [tool_call: paginateFlights]
        
Total: 4 tool calls, 8 seconds, user heard "let me..." 4 times

High-Level Tools (Designed For Voice)

// Single tool: Find best match
async function findBestFlight(criteria) {
  // Encapsulates: search, filter, sort, rank, select
  const flights = await db.flights.find({
    origin: criteria.origin,
    destination: criteria.destination,
    date: criteria.date
  });
  
  const filtered = flights.filter(f => 
    matchesPreferences(f, criteria.preferences)
  );
  
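  const ranked = rankByRelevance(filtered, criteria);

  // Nothing matched: return an empty result the agent can still talk about
  if (ranked.length === 0) {
    return { best_match: null, alternatives: [], why_best: null };
  }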
  
  return {
    best_match: ranked[0],
    alternatives: ranked.slice(1, 3),
    why_best: explainRanking(ranked[0], criteria)
  };
}

Voice agent behavior:

User: "Find me a flight to Chicago tomorrow"

Agent: "Looking for morning flights to Chicago..."
        [tool_call: findBestFlight]
        "I found a United flight at 8:15 AM for $220. 
         It's direct and arrives by 10:30. Want this one?"
        
Total: 1 tool call, 2 seconds, natural conversation

Time saved: 75%. Conversation quality: dramatically better.
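
The totals in the two transcripts follow from a simple additive model: each sequential tool call adds its own round trip. A minimal sketch, assuming about 2 seconds per call (the upper end of the 1-2 second range mentioned earlier); `estimateToolLatency` is a hypothetical helper, not part of any SDK:

```javascript
// Rough model: sequential tool calls add up, because the agent waits for
// each result (and usually speaks) before issuing the next call.
// SECONDS_PER_CALL is an assumption taken from the 1-2 second range above.
const SECONDS_PER_CALL = 2;

function estimateToolLatency(callCount, secondsPerCall = SECONDS_PER_CALL) {
  return callCount * secondsPerCall;
}

const lowLevel = estimateToolLatency(4);  // search, filter, sort, paginate
const highLevel = estimateToolLatency(1); // findBestFlight
const saved = (lowLevel - highLevel) / lowLevel;

console.log(`${lowLevel}s vs ${highLevel}s, ${Math.round(saved * 100)}% saved`);
// prints "8s vs 2s, 75% saved"
```

Real latency varies per tool, so treat this as a way to reason about call counts rather than a measurement.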

Architecture: Voice-First Tool Design

Here’s how to structure tools for voice agents:

graph TB
    A[User Intent] --> B{Tool Design}
    
    B -->|Low-Level| C[Multiple Tool Calls]
    C --> D[search]
    C --> E[filter]
    C --> F[sort]
    C --> G[paginate]
    C --> H[select]
    D --> I[Agent Speaks Between Each]
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[8 seconds, 4 interruptions]
    
    B -->|High-Level| K[Single Tool Call]
    K --> L[findBestMatch]
    L --> M[Internal: search → filter → sort → rank]
    M --> N[Agent Speaks Once]
    N --> O[2 seconds, 1 turn]
    
    J --> P[User Experience: Slow]
    O --> Q[User Experience: Fast]
    
    style A fill:#e1f5ff
    style K fill:#d4f4dd
    style L fill:#d4f4dd
    style C fill:#ffe1e1
    style J fill:#ffe1e1
    style Q fill:#d4f4dd

The pattern: Encapsulate workflows, not database operations.

Implementation: Voice-Optimized Tools

Here’s how to refactor tools for the OpenAI Realtime API:

Before: Text-Agent Tools

const tools = [
  {
    type: 'function',
    name: 'search_products',
    description: 'Search product catalog',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string' },
        category: { type: 'string' }
      }
    }
  },
  {
    type: 'function',
    name: 'filter_products',
    description: 'Filter product list by criteria',
    parameters: {
      type: 'object',
      properties: {
        products: { type: 'array' },
        max_price: { type: 'number' },
        min_rating: { type: 'number' }
      }
    }
  },
  {
    type: 'function',
    name: 'sort_products',
    description: 'Sort product list',
    parameters: {
      type: 'object',
      properties: {
        products: { type: 'array' },
        sort_by: { type: 'string', enum: ['price', 'rating', 'popularity'] }
      }
    }
  }
];

// Agent makes 3+ tool calls for a simple request

After: Voice-Agent Tools

const tools = [
  {
    type: 'function',
    name: 'find_product_recommendation',
    description: `Find the best product match for user needs. 
                  Handles search, filtering, sorting, and ranking internally.
                  Returns: best match + 2 alternatives + explanation.`,
    parameters: {
      type: 'object',
      properties: {
        user_need: { 
          type: 'string',
          description: 'What the user is looking for, in their own words'
        },
        constraints: {
          type: 'object',
          properties: {
            max_price: { type: 'number' },
            category: { type: 'string' },
            required_features: { type: 'array', items: { type: 'string' } }
          }
        },
        preferences: {
          type: 'object',
          properties: {
            prioritize: { 
              type: 'string', 
              enum: ['price', 'quality', 'speed', 'popularity'],
              description: 'What matters most to the user'
            }
          }
        }
      },
      required: ['user_need']
    }
  }
];

// Agent makes 1 tool call, gets complete answer

Implementation

import { RealtimeClient } from '@openai/realtime-api-beta';

class VoiceOptimizedTools {
  constructor() {
    this.client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY });
  }

  async setupVoiceAgent() {
    await this.client.connect();
    
    // Session settings. The beta Realtime API expects text alongside audio
    // in modalities, and input transcription must be enabled to log user speech.
    this.client.updateSession({
      instructions: `
You are a helpful shopping assistant. When users describe what they need,
use find_product_recommendation ONCE to get a complete answer. Don't make
multiple tool calls - the tool handles everything internally.

After getting the recommendation, present it conversationally:
"I found a great option for you: [product]. It's [why it's good]. 
 I also have [alternative 1] and [alternative 2] if you want to compare."
`,
      voice: 'alloy',
      modalities: ['text', 'audio'],
      input_audio_transcription: { model: 'whisper-1' }
    });

    // Register the high-level tool together with its handler. The client
    // parses the arguments, runs the handler, and sends the result back to the model.
    this.client.addTool(
      {
        name: 'find_product_recommendation',
        description: `Find best product for user needs. Encapsulates: 
                      search, filter, rank, compare. Returns ready-to-speak 
                      recommendation with explanation.`,
        parameters: {
          type: 'object',
          properties: {
            user_need: { type: 'string' },
            max_price: { type: 'number' },
            category: { type: 'string' },
            prioritize: { 
              type: 'string', 
              enum: ['price', 'quality', 'speed'] 
            }
          },
          required: ['user_need']
        }
      },
      (params) => this.findProductRecommendation(params)
    );

    // Log user speech. Raw server events are exposed on client.realtime
    // with a 'server.' prefix.
    this.client.realtime.on(
      'server.conversation.item.input_audio_transcription.completed',
      (event) => {
        console.log('User said:', event.transcript);
      }
    );
  }

  async findProductRecommendation(params) {
    // HIGH-LEVEL TOOL: Encapsulates entire workflow
    
    // Step 1: Search (internal, user doesn't hear this)
    const allProducts = await this.searchProducts(params.user_need, params.category);
    
    // Step 2: Filter (internal)
    const filtered = this.filterProducts(allProducts, {
      max_price: params.max_price,
      min_rating: 4.0  // default quality threshold
    });
    
    // Step 3: Rank (internal)
    const ranked = this.rankProducts(filtered, params.prioritize || 'quality');
    
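    // Step 4: Select best + alternatives (handle "nothing matched" internally)
    if (ranked.length === 0) {
      return {
        recommendation: null,
        alternatives: [],
        search_summary: `Found ${allProducts.length} products, but none matched the constraints`
      };
    }
    const best = ranked[0];
    const alternatives = ranked.slice(1, 3);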
    
    // Step 5: Generate explanation
    const explanation = this.explainRecommendation(best, params);
    
    // Return everything agent needs to speak naturally
    return {
      recommendation: {
        name: best.name,
        price: best.price,
        rating: best.rating,
        key_features: best.features.slice(0, 3),
        why_recommended: explanation
      },
      alternatives: alternatives.map(p => ({
        name: p.name,
        price: p.price,
        key_difference: this.compareToRecommendation(p, best)
      })),
      search_summary: `Found ${allProducts.length} products, narrowed to ${filtered.length} matches`
    };
  }

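  async searchProducts(query, category) {
    // Your actual search logic (db is your own data layer)
    const filter = { $text: { $search: query } };
    // Only filter by category when the caller supplied one
    if (category) filter.category = category;
    return await db.products.find(filter).limit(100);
  }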

  filterProducts(products, constraints) {
    return products.filter(p => 
      (!constraints.max_price || p.price <= constraints.max_price) &&
      (!constraints.min_rating || p.rating >= constraints.min_rating)
    );
  }

  rankProducts(products, prioritize) {
    const scoreFunctions = {
      price: (p) => 1 / p.price,  // lower is better
      quality: (p) => p.rating * p.review_count,
      speed: (p) => p.shipping_days < 2 ? 10 : 1,
      popularity: (p) => 1 / p.sales_rank  // assumes rank 1 = best seller, so lower is better
    };
    
    const scoreFunc = scoreFunctions[prioritize] || scoreFunctions.quality;
    
    return products
      .map(p => ({ ...p, score: scoreFunc(p) }))
      .sort((a, b) => b.score - a.score);
  }

  explainRecommendation(product, params) {
    const reasons = [];
    
    if (params.prioritize === 'price') {
      reasons.push(`best value at $${product.price}`);
    } else if (params.prioritize === 'quality') {
      reasons.push(`highly rated (${product.rating} stars from ${product.review_count} reviews)`);
    }
    
    if (product.features.some(f => params.user_need.toLowerCase().includes(f.toLowerCase()))) {
      reasons.push(`has the features you mentioned`);
    }
    
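    // Fall back to a generic reason so the agent always has something to say
    return reasons.join(', ') || 'a strong overall match for your request';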
  }

  compareToRecommendation(alternative, best) {
    if (alternative.price < best.price * 0.8) {
      return `much cheaper at $${alternative.price}`;
    } else if (alternative.rating > best.rating) {
      return `higher rated (${alternative.rating} stars)`;
    } else {
      return `different feature set`;
    }
  }
}

// Usage
const voiceTools = new VoiceOptimizedTools();
await voiceTools.setupVoiceAgent();

// User: "I need a laptop for video editing under $2000"
// Agent makes 1 tool call, gets complete recommendation, speaks naturally

Real-World Results

A retail company refactored their voice shopping assistant:

Before (low-level tools):

Average conversation: 12 turns
Average time: 4.5 minutes
Tool calls per session: 8.3
User satisfaction: 3.2/5
“Agent feels slow”: 67% of feedback

After (high-level tools):

Average conversation: 5 turns
Average time: 2.1 minutes
Tool calls per session: 2.1
User satisfaction: 4.6/5
“Agent feels slow”: 12% of feedback

Impact:

53% faster conversations
75% fewer tool calls
44% improvement in satisfaction
$180K saved annually (less compute time)
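
The first three percentages can be recomputed from the before/after figures. A small sketch of that arithmetic (`percentChange` is a hypothetical helper; the $180K figure is the company's own and is not derived here):

```javascript
// Relative change between a "before" and "after" value, as a whole percent.
function percentChange(before, after) {
  return Math.round((Math.abs(after - before) / before) * 100);
}

const impact = {
  fasterConversations: percentChange(4.5, 2.1), // minutes per conversation
  fewerToolCalls: percentChange(8.3, 2.1),      // tool calls per session
  satisfactionGain: percentChange(3.2, 4.6),    // rating out of 5
};

console.log(impact);
// prints { fasterConversations: 53, fewerToolCalls: 75, satisfactionGain: 44 }
```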

Design Patterns For Voice-First Tools

Pattern 1: Task-Based Not Operation-Based

// ❌ Operation-based (text agent style)
await searchUsers();
await filterByRole();
await sortByActivity();
await selectTop5();

// ✅ Task-based (voice agent style)
await findRelevantTeamMembers({ task: 'code review', skills: ['TypeScript'] });

Pattern 2: Return Speaking-Ready Data

// ❌ Returns raw data
{
  results: [...],
  total: 47,
  page: 1
}

// ✅ Returns presentation-ready data
{
  top_match: { name: "...", why: "..." },
  alternatives: [ ... ],
  summary: "Found 3 great options out of 47 total",
  next_question: "Would you like to hear more about the top choice?"
}
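
One way to produce that shape is a small formatting step at the end of the tool. The sketch below is illustrative only: `toSpeakingReady` and the `reason` field on each ranked item are assumptions, not part of any SDK.

```javascript
// Turn a ranked result list into a payload the agent can read aloud.
// Field names mirror the presentation-ready shape above; `reason` is a
// hypothetical field produced by your own ranking step.
function toSpeakingReady(rankedResults, total) {
  if (rankedResults.length === 0) {
    return {
      top_match: null,
      alternatives: [],
      summary: `I couldn't find a match among ${total} options`,
      next_question: 'Would you like to loosen any of the requirements?',
    };
  }
  const [top, ...rest] = rankedResults;
  const alternatives = rest.slice(0, 2);
  const shown = 1 + alternatives.length;
  return {
    top_match: { name: top.name, why: top.reason },
    alternatives: alternatives.map((r) => ({ name: r.name, why: r.reason })),
    summary: `Found ${shown} great option${shown === 1 ? '' : 's'} out of ${total} total`,
    next_question: 'Would you like to hear more about the top choice?',
  };
}

const payload = toSpeakingReady(
  [
    { name: 'Headphones A', reason: 'best battery life' },
    { name: 'Headphones B', reason: 'lowest price' },
    { name: 'Headphones C', reason: 'lightest' },
    { name: 'Headphones D', reason: 'most reviews' },
  ],
  47
);
console.log(payload.summary); // prints "Found 3 great options out of 47 total"
```

Capping the alternatives at two keeps the spoken answer short; a listener cannot scan a list the way a reader can.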

Pattern 3: Include Context For Follow-Ups

// ❌ Agent forgets what it found
{
  result: { id: 123, name: "Product A" }
}

// ✅ Agent remembers for follow-up questions
{
  result: { id: 123, name: "Product A" },
  context: {
    search_query: "wireless headphones under $200",
    alternatives_ids: [124, 125],
    why_chosen: "best battery life in price range"
  },
  follow_up_suggestions: [
    "Check shipping time",
    "Compare to alternatives",
    "Add to cart"
  ]
}
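
The stored context pays off when the user asks a follow-up such as "how does it compare to the others?". A minimal sketch, assuming an in-memory `catalog` and two hypothetical helpers (`rememberResult`, `compareToAlternatives`) that stand in for your own session state:

```javascript
// Keep the context from the last tool result so a follow-up needs no new
// search. `catalog` is a hypothetical in-memory stand-in for a product store.
const catalog = new Map([
  [123, { id: 123, name: 'Product A', price: 180 }],
  [124, { id: 124, name: 'Product B', price: 150 }],
  [125, { id: 125, name: 'Product C', price: 199 }],
]);

let lastContext = null;

function rememberResult(toolResult) {
  lastContext = { result_id: toolResult.result.id, ...toolResult.context };
}

function compareToAlternatives() {
  if (!lastContext) return { error: 'Nothing to compare yet' };
  const chosen = catalog.get(lastContext.result_id);
  return lastContext.alternatives_ids.map((id) => {
    const alt = catalog.get(id);
    const diff = alt.price - chosen.price;
    return {
      name: alt.name,
      price_difference: diff < 0 ? `$${-diff} cheaper` : `$${diff} more`,
    };
  });
}

rememberResult({
  result: { id: 123, name: 'Product A' },
  context: {
    search_query: 'wireless headphones under $200',
    alternatives_ids: [124, 125],
    why_chosen: 'best battery life in price range',
  },
});

const comparison = compareToAlternatives();
console.log(comparison);
```

In a real agent the context would live in per-session state rather than a module-level variable.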

Pattern 4: Anticipate Next Steps

// ❌ Requires separate tool call for each action
await getProduct(id);
await checkInventory(id);
await getShipping(id);

// ✅ Returns everything user likely needs next
async function getProductDetails(id) {
  return {
    product: { ... },
    in_stock: true,
    ships_in: "2 days",
    related_products: [...],
    can_add_to_cart: true
  };
}

Tool Guidelines For Voice Agents

Do	Don’t
Encapsulate workflows	Expose database operations
Return explanation text	Return raw IDs or codes
Handle edge cases internally	Force agent to handle errors
Anticipate follow-up needs	Require multiple calls for related data
Include why you returned this result	Just return data without context
Make tools match human thinking	Make tools match database schema
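
"Handle edge cases internally" can be as simple as wrapping each handler so failures come back as something the agent can say. A sketch under stated assumptions: the error codes, messages, and the `withSpokenErrors` wrapper are all hypothetical.

```javascript
// Map internal failures to a payload the agent can speak, instead of
// leaking error codes or stack traces into the conversation.
// The error codes and messages here are hypothetical examples.
const SPOKEN_ERRORS = {
  NO_RESULTS: "I couldn't find anything that matches. Want to widen the search?",
  TIMEOUT: 'That is taking longer than expected. Should I try again?',
};

function toSpokenError(error) {
  return {
    ok: false,
    say: SPOKEN_ERRORS[error.code] || 'Something went wrong on my end. Could you repeat that?',
  };
}

// Wrap a tool handler so every failure comes back in speakable form.
function withSpokenErrors(handler) {
  return async (params) => {
    try {
      return { ok: true, ...(await handler(params)) };
    } catch (error) {
      return toSpokenError(error);
    }
  };
}

// Usage: a handler that fails with a known code
const safeFind = withSpokenErrors(async () => {
  const error = new Error('no rows');
  error.code = 'NO_RESULTS';
  throw error;
});
safeFind({ user_need: 'laptop' }).then((result) => console.log(result.say));
```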

Implementation Timeline

Week 1: Audit existing tools

List all tool calls made in typical conversations
Identify sequential patterns (search → filter → sort)
Find tools that require 3+ calls to complete a task
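
The audit can start from whatever tool-call logs you already have. The sketch below assumes a flat log of `{ sessionId, tool }` entries and treats each session as one task, which is a simplification; `auditToolCalls` is a hypothetical helper.

```javascript
// Audit sketch: count which tool-call sequences occur in session logs and
// keep only sessions that needed `minChain` or more calls.
// The log shape ({ sessionId, tool }) is an assumption; adapt it to your tracing.
function auditToolCalls(log, minChain = 3) {
  const bySession = new Map();
  for (const { sessionId, tool } of log) {
    if (!bySession.has(sessionId)) bySession.set(sessionId, []);
    bySession.get(sessionId).push(tool);
  }

  const sequenceCounts = new Map();
  for (const tools of bySession.values()) {
    if (tools.length < minChain) continue;
    const key = tools.join(' → ');
    sequenceCounts.set(key, (sequenceCounts.get(key) || 0) + 1);
  }

  // Most frequent chains first: these are the best candidates to merge.
  return [...sequenceCounts.entries()]
    .map(([sequence, sessions]) => ({ sequence, sessions }))
    .sort((a, b) => b.sessions - a.sessions);
}

const sampleLog = [
  { sessionId: 's1', tool: 'search' },
  { sessionId: 's1', tool: 'filter' },
  { sessionId: 's1', tool: 'sort' },
  { sessionId: 's2', tool: 'search' },
  { sessionId: 's2', tool: 'filter' },
  { sessionId: 's2', tool: 'sort' },
  { sessionId: 's3', tool: 'get_order' },
  { sessionId: 's4', tool: 'search' },
  { sessionId: 's4', tool: 'filter' },
  { sessionId: 's4', tool: 'sort' },
  { sessionId: 's4', tool: 'paginate' },
];

const chains = auditToolCalls(sampleLog);
console.log(chains[0]); // prints { sequence: 'search → filter → sort', sessions: 2 }
```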

Week 2: Design high-level replacements

Group related operations into single tools
Add explanation fields to responses
Include context for follow-ups

Week 3: Test with voice agent

Measure conversation length before/after
Count tool calls per session
Gather user feedback on speed

Week 4: Optimize and deploy

Refine tool descriptions for better agent understanding
Add caching for repeated queries
Monitor latency and adjust
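
Caching repeated queries can be a thin wrapper around the tool handler. A minimal sketch, assuming arguments serialize stably with `JSON.stringify` (key order matters) and that a fixed time-to-live is acceptable; `withCache` is a hypothetical helper:

```javascript
// TTL cache for tool handlers: identical arguments within `ttlMs` reuse the
// earlier result (or in-flight promise) instead of re-running the tool.
// `now` is injectable so expiry can be checked without waiting.
function withCache(handler, { ttlMs = 60000, now = Date.now } = {}) {
  const entries = new Map();
  return (params) => {
    const key = JSON.stringify(params);
    const hit = entries.get(key);
    if (hit && now() - hit.storedAt < ttlMs) return hit.value;

    const value = handler(params);
    entries.set(key, { value, storedAt: now() });
    // Do not keep failed async results around
    Promise.resolve(value).catch(() => entries.delete(key));
    return value;
  };
}

// Usage with a fake clock and a call counter
let calls = 0;
let clock = 0;
const cachedSearch = withCache(
  (params) => {
    calls += 1;
    return `results for ${params.query}`;
  },
  { ttlMs: 1000, now: () => clock }
);

cachedSearch({ query: 'laptop' });
cachedSearch({ query: 'laptop' }); // served from cache
clock = 1500;                      // entry has expired
cachedSearch({ query: 'laptop' }); // handler runs again
console.log(calls); // prints 2
```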

Cost Impact

Higher-level tools reduce costs:

Realtime API pricing:

Input audio: $0.06/minute
Output audio: $0.24/minute
Average conversation: 3 minutes × ($0.06 + $0.24) = $0.90

Reducing tool calls:

8 tool calls → 2 tool calls = 75% fewer calls to wait on
4.5 minute conversation → 2.1 minutes = 53% shorter
Cost per conversation (3-minute average, 53% shorter): $0.90 → $0.42 = $0.48 saved

At 10,000 conversations/month: $4,800/month savings
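
The savings figure can be reproduced with a few lines of arithmetic. This sketch assumes both audio directions are billed for the full conversation and that the 53% reduction applies to the 3-minute average; `costPerConversation` is a hypothetical helper.

```javascript
// Per-minute audio prices quoted above; both directions are assumed to be
// billed for the full length of the conversation.
const INPUT_PER_MIN = 0.06;
const OUTPUT_PER_MIN = 0.24;

function costPerConversation(minutes) {
  return minutes * (INPUT_PER_MIN + OUTPUT_PER_MIN);
}

const reduction = 1 - 2.1 / 4.5;       // about 53% shorter
const before = costPerConversation(3); // 3-minute average
const after = before * (1 - reduction);
const monthlySavings = (before - after) * 10000;

console.log(before.toFixed(2), after.toFixed(2), Math.round(monthlySavings));
// prints "0.90 0.42 4800"
```

Plugging in your own durations and volumes is usually more informative than the headline number.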

Plus: Better user experience leads to higher completion rates (more revenue).

When To Use High-Level Tools

Use High-Level Tools When	Use Low-Level Tools When
Voice conversations	Text-based chat
Multi-step workflows are common	Operations are truly independent
Speed matters more than flexibility	Users need granular control
Agent decides the workflow	User directs each step explicitly

Most voice agents should use high-level tools. Low-level tools make sense for power users who want control—not typical voice interactions.

What’s Next

Voice-optimized tools evolve toward:

Adaptive complexity: Tool adjusts based on user expertise
Streaming responses: Tool returns partial results as they’re ready
Learning from usage: Tools refine based on which results users actually use

The end state: Tools that match the pace of human speech, not database queries.

If you want voice agents with optimized tool design, we can refactor your function calls for voice-first interactions. The result: faster conversations, fewer turns, better user experience.