How To Test Agents Like Software: Automated Testing for Voice Agents

Your voice agent works perfectly. You ship it. Two days later, a user reports: “It’s not doing what I ask anymore.”

You investigate. Turns out the agent’s behavior shifted slightly after you tweaked the system prompt. It now calls tools differently. Not broken—just different enough to mess up workflows.

How did you miss this? Simple: you weren’t testing it.

And before you say “but voice agents are hard to test”—they’re not. You just need the right approach.

Let me show you how to test voice agents like you test software. With automated tests that catch regressions before they hit production.

The Voice Agent Testing Problem

Text-based agents are relatively easy to test:

// Simple test
assert(agent.respond("what's 2+2") === "4");

Voice agents? Not so simple:

  • Audio input varies (accents, background noise, phrasing)
  • Transcription isn’t deterministic
  • Timing matters (interruptions, turn-taking)
  • Tool calls can happen in different orders
  • Tone and personality are subjective

So most teams don’t test. They do manual QA: “Talk to the agent for 20 minutes, see if anything seems off.”

That doesn’t scale. And it misses subtle regressions.

You need two layers of testing:

  1. Integration tests for deterministic behavior (tool calls, state transitions)
  2. Model-graded evals for quality and conversational flow

Let’s build both.

Testing Layer 1: Integration Tests

Integration tests verify that your agent calls the right tools with the right parameters for specific scenarios.

The Testing Architecture

graph TD
    A[Test Suite] --> B[Stub User Input]
    B --> C[Stub Audio & Transcription]
    C --> D[Voice Agent Processes]
    D --> E[Mock Tool Execution]
    E --> F[Capture Tool Calls]
    F --> G[Assert Expected Behavior]
    
    H[Test Database] --> G
    G --> I[Pass/Fail Results]
    
    I --> J[CI/CD Pipeline]
    J --> K[Block Deployment on Failure]
    J --> L[Ship with Confidence]

The key insight: you don’t need real audio. You can stub the transcription layer and test the agent’s decision-making directly.

Building a Test Harness

Here’s a practical testing framework:

import { RealtimeClient } from '@openai/realtime-api-beta';

class VoiceAgentTestHarness {
  constructor(apiKey) {
    this.client = new RealtimeClient({ apiKey });
    this.toolCalls = [];
    this.sessionHistory = [];
    this.responseQueue = [];
  }
  
  async connect() {
    await this.client.connect();
    
    // Capture completed items: tool calls for assertions, plus agent replies.
    // (The beta client emits 'conversation.item.completed' once an item,
    // including its streamed function-call arguments, is final.)
    this.client.on('conversation.item.completed', ({ item }) => {
      if (item.type === 'function_call') {
        this.toolCalls.push({
          tool: item.name,
          parameters: JSON.parse(item.arguments || '{}'),
          timestamp: Date.now(),
          callId: item.call_id
        });
      }
      
      if (item.role === 'assistant') {
        this.sessionHistory.push({
          type: 'assistant',
          // `formatted` is the client's convenience view of the item's text
          text: item.formatted?.transcript || item.formatted?.text || ''
        });
      }
    });
  }
  
  // Simulate user input (text-based for testing)
  async simulateUserInput(text) {
    this.client.sendUserMessageContent([{
      type: 'input_text',
      text: text
    }]);
    
    this.sessionHistory.push({
      type: 'user',
      text: text
    });
  }
  
  // Mock tool execution: when the named tool completes, immediately return
  // canned output so the agent can continue the conversation
  mockTool(toolName, mockResponse) {
    this.client.on('conversation.item.completed', ({ item }) => {
      if (item.type === 'function_call' && item.name === toolName) {
        // Respond with mock data
        this.client.realtime.send('conversation.item.create', {
          item: {
            type: 'function_call_output',
            call_id: item.call_id,
            output: JSON.stringify(mockResponse)
          }
        });
        // Prompt the agent to continue now that the "tool" has answered
        this.client.createResponse();
      }
    });
  }
  
  // Assert expected tool call
  expectToolCall(toolName, expectedParams) {
    const call = this.toolCalls.find(c => c.tool === toolName);
    
    if (!call) {
      throw new Error(`Expected tool call ${toolName} but it was not called`);
    }
    
    // Deep equality check on parameters
    if (!this.paramsMatch(call.parameters, expectedParams)) {
      throw new Error(
        `Tool ${toolName} called with wrong params.\n` +
        `Expected: ${JSON.stringify(expectedParams)}\n` +
        `Got: ${JSON.stringify(call.parameters)}`
      );
    }
    
    return true;
  }
  
  // Assert tool was NOT called
  expectNoToolCall(toolName) {
    const call = this.toolCalls.find(c => c.tool === toolName);
    
    if (call) {
      throw new Error(`Expected ${toolName} to NOT be called, but it was`);
    }
    
    return true;
  }
  
  // Recursive match: every expected key must be present and equal.
  // (A JSON.stringify comparison would break on differing key order.)
  paramsMatch(actual, expected) {
    if (actual === expected) return true;
    if (typeof actual !== 'object' || actual === null ||
        typeof expected !== 'object' || expected === null) {
      return false;
    }
    return Object.keys(expected).every(
      key => this.paramsMatch(actual[key], expected[key])
    );
  }
  
  async waitForResponse(timeout = 5000) {
    return new Promise((resolve) => {
      const timer = setTimeout(() => resolve(null), timeout);
      
      const handler = (event) => {
        if (event.item.role === 'assistant' && event.item.status === 'completed') {
          clearTimeout(timer);
          this.client.off('conversation.item.completed', handler);
          resolve(event.item);
        }
      };
      
      this.client.on('conversation.item.completed', handler);
    });
  }
  
  reset() {
    this.toolCalls = [];
    this.sessionHistory = [];
  }
  
  disconnect() {
    this.client.disconnect();
  }
}

Writing Integration Tests

Now you can write deterministic tests:

import { describe, test, beforeEach, afterEach, expect } from '@jest/globals';

describe('Voice Agent Tool Calling', () => {
  let harness;
  
  beforeEach(async () => {
    harness = new VoiceAgentTestHarness(process.env.OPENAI_API_KEY);
    await harness.connect();
    
    // Configure session with tools
    harness.client.updateSession({
      tools: [
        {
          type: 'function',
          name: 'updateSection',
          description: 'Update a section in the document',
          parameters: {
            type: 'object',
            properties: {
              section_id: { type: 'string' },
              content: { type: 'string' }
            }
          }
        },
        {
          type: 'function',
          name: 'createProject',
          description: 'Create a new project',
          parameters: {
            type: 'object',
            properties: {
              name: { type: 'string' }
            }
          }
        }
      ]
    });
    
    // Mock tool responses
    harness.mockTool('updateSection', { success: true });
    harness.mockTool('createProject', { project_id: 'test-123' });
  });
  
  afterEach(() => {
    harness.disconnect();
  });
  
  test('creates new project when requested', async () => {
    await harness.simulateUserInput("create a new project called Q2 Planning");
    
    // Wait for agent to process and call tool
    await harness.waitForResponse();
    
    // Assert correct tool was called with correct params (throws on mismatch)
    harness.expectToolCall('createProject', {
      name: 'Q2 Planning'
    });
  });
  
  test('updates correct section by name', async () => {
    await harness.simulateUserInput("update the pricing section with new info");
    await harness.waitForResponse();
    
    const call = harness.toolCalls.find(c => c.tool === 'updateSection');
    expect(call).toBeDefined();
    expect(call.parameters.section_id).toBe('pricing');
    expect(typeof call.parameters.content).toBe('string');
  });
  
  test('handles ambiguous requests by asking for clarification', async () => {
    await harness.simulateUserInput("update that section");
    await harness.waitForResponse();
    
    // Should NOT call tool without section ID
    harness.expectNoToolCall('updateSection');
    
    // Should ask for clarification
    const lastResponse = harness.sessionHistory[harness.sessionHistory.length - 1];
    expect(lastResponse.text).toMatch(/which section/i);
  });
  
  test('calls multiple tools in correct sequence', async () => {
    await harness.simulateUserInput("create a project and update the intro");
    await harness.waitForResponse();
    
    // Both tools should be called
    const createIndex = harness.toolCalls.findIndex(c => c.tool === 'createProject');
    const updateIndex = harness.toolCalls.findIndex(c => c.tool === 'updateSection');
    expect(createIndex).toBeGreaterThanOrEqual(0);
    expect(updateIndex).toBeGreaterThanOrEqual(0);
    
    // The update should target the intro section
    expect(harness.toolCalls[updateIndex].parameters.section_id).toBe('intro');
    
    // createProject should be called BEFORE updateSection
    expect(createIndex).toBeLessThan(updateIndex);
  });
});

These tests use no real audio: text input exercises the same decision-making path the transcription layer would feed. They still make model calls, so they’re slower than unit tests, but they’re repeatable and far faster than manual audio QA.
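Because model output isn’t fully deterministic even with text input, assertion-level retries help keep the suite stable. A minimal sketch (`retryAssertion` is a helper name of my own, not part of the harness above):

```javascript
// Retry an async assertion a few times before failing — useful when the
// model phrases a response differently or a tool call lands a beat late.
async function retryAssertion(fn, { attempts = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();  // success: return the assertion's value
    } catch (err) {
      lastError = err;    // remember the failure, pause, then retry
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;        // out of attempts: surface the last failure
}
```

Wrap any flaky check: `await retryAssertion(() => harness.expectToolCall('createProject', { name: 'Q2 Planning' }));`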

Testing Error Handling

Test how your agent behaves when tools fail:

test('gracefully handles tool failures', async () => {
  // Mock the tool to return an error payload instead of success
  harness.mockTool('updateSection', { error: 'Database connection failed' });
  
  await harness.simulateUserInput("update the pricing section");
  await harness.waitForResponse();
  
  // Agent should respond with an error message
  const lastResponse = harness.sessionHistory[harness.sessionHistory.length - 1];
  expect(lastResponse.text).toMatch(/trouble|error|couldn't/i);
  
  // Should offer to retry or an alternative
  expect(lastResponse.text).toMatch(/try again|try something else/i);
});

Running Tests in CI/CD

Integrate tests into your deployment pipeline:

# .github/workflows/test-voice-agent.yml
name: Voice Agent Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      
      - name: Install dependencies
        run: npm install
      
      - name: Run integration tests
        run: npm run test:agent

Now every code change runs tests. Regressions get caught automatically.

Testing Layer 2: Model-Graded Evals

Integration tests verify logic. But what about conversational quality?

“Does it feel right?” is subjective. But you can still automate it using LLM judges.

The Eval Architecture

graph TD
    A[Eval Test Suite] --> B[Scripted Conversation]
    B --> C[Voice Agent Responds]
    C --> D[Capture Full Transcript]
    D --> E[LLM Judge Evaluates]
    
    E --> F[Score: Correctness]
    E --> G[Score: Helpfulness]
    E --> H[Score: Tone]
    E --> I[Score: Tool Usage]
    
    F --> J[Aggregate Scores]
    G --> J
    H --> J
    I --> J
    
    J --> K[Pass/Fail Thresholds]
    K --> L[Track Over Time]

The idea: use an LLM to grade another LLM’s performance based on criteria you define.

Building Model-Graded Evals

import json

from openai import AsyncOpenAI

class VoiceAgentEvaluator:
    def __init__(self, judge_model="gpt-4o"):
        self.client = AsyncOpenAI()
        self.judge_model = judge_model
        self.results = []
    
    async def evaluate_conversation(self, transcript, criteria):
        """
        Evaluate a conversation transcript using an LLM judge.
        
        transcript: List of {"role": "user"|"agent", "text": "..."}
        criteria: Dict of evaluation criteria with scoring rubrics
        """
        
        # Format transcript for judge
        conversation_text = "\n".join([
            f"{turn['role'].upper()}: {turn['text']}"
            for turn in transcript
        ])
        
        # Build judge prompt
        judge_prompt = f"""You are evaluating a voice agent's performance.

CONVERSATION:
{conversation_text}

EVALUATION CRITERIA:
{self._format_criteria(criteria)}

For each criterion, provide:
1. A score from 0-10
2. A brief justification

Return your evaluation as JSON:
{{
  "criterion_name": {{
    "score": 0-10,
    "justification": "brief explanation"
  }},
  ...
}}
"""
        
        # Get judge's evaluation (force a JSON response so parsing is reliable)
        response = await self.client.chat.completions.create(
            model=self.judge_model,
            messages=[
                {"role": "system", "content": "You are an expert evaluator of conversational AI."},
                {"role": "user", "content": judge_prompt}
            ],
            temperature=0.0,  # Low-variance scoring (not perfectly deterministic)
            response_format={"type": "json_object"}
        )
        
        eval_result = json.loads(response.choices[0].message.content)
        self.results.append(eval_result)
        
        return eval_result
    
    def _format_criteria(self, criteria):
        formatted = []
        for name, rubric in criteria.items():
            formatted.append(f"**{name}**: {rubric}")
        return "\n".join(formatted)
    
    def aggregate_scores(self):
        """Calculate average scores across all evaluations."""
        if not self.results:
            return {}
        
        aggregated = {}
        for criterion in self.results[0].keys():
            scores = [r[criterion]['score'] for r in self.results]
            aggregated[criterion] = {
                'mean': sum(scores) / len(scores),
                'min': min(scores),
                'max': max(scores)
            }
        
        return aggregated

Writing Evaluation Criteria

Define what “good” looks like:

evaluation_criteria = {
    "correctness": """
        Did the agent correctly understand the user's request and call 
        the right tools with appropriate parameters? 
        10 = perfect understanding, 0 = completely wrong.
    """,
    
    "helpfulness": """
        Was the agent helpful and proactive? Did it offer relevant 
        suggestions or clarify ambiguous requests?
        10 = extremely helpful, 0 = unhelpful or confusing.
    """,
    
    "conversational_tone": """
        Did the agent communicate naturally? Was the tone appropriate,
        not too formal or too casual?
        10 = perfectly natural, 0 = robotic or inappropriate.
    """,
    
    "error_handling": """
        If errors occurred, did the agent handle them gracefully?
        Did it explain what went wrong and offer alternatives?
        10 = excellent recovery, 0 = crashed or gave up.
    """,
    
    "efficiency": """
        Did the agent complete the task efficiently without unnecessary
        back-and-forth or redundant tool calls?
        10 = optimal efficiency, 0 = wasteful or slow.
    """
}

Running Evals on Test Scenarios

Create a test suite of scripted conversations:

test_scenarios = [
    {
        "name": "Simple section update",
        "conversation": [
            {"role": "user", "text": "update the pricing section"},
            # Agent responds and calls tool
            # Eval checks: Did it call updateSection with section_id='pricing'?
        ]
    },
    {
        "name": "Ambiguous request requiring clarification",
        "conversation": [
            {"role": "user", "text": "update that section"},
            # Agent should ask "which section?"
            {"role": "user", "text": "the intro section"},
            # Agent should then update intro
        ]
    },
    {
        "name": "Multi-step workflow",
        "conversation": [
            {"role": "user", "text": "create a project for Q2 and update the goals"},
            # Agent should create project first, then update section
        ]
    },
    {
        "name": "Error recovery",
        "conversation": [
            {"role": "user", "text": "delete the production database"},
            # Agent should refuse and explain why
        ]
    }
]

# Run evals from an async entry point (top-level await only works in a REPL)
import asyncio

async def run_evals():
    evaluator = VoiceAgentEvaluator()
    
    for scenario in test_scenarios:
        # Simulate conversation with agent (run_test_scenario is your own driver)
        transcript = await run_test_scenario(scenario['conversation'])
        
        # Evaluate with LLM judge
        result = await evaluator.evaluate_conversation(
            transcript, 
            evaluation_criteria
        )
        
        print(f"Scenario: {scenario['name']}")
        print(f"Scores: {result}")
        print()
    
    # Aggregate results
    scores = evaluator.aggregate_scores()
    print(f"Overall Performance: {scores}")

asyncio.run(run_evals())

Setting Pass/Fail Thresholds

Define minimum acceptable scores:

PASSING_THRESHOLDS = {
    "correctness": 8.0,
    "helpfulness": 7.0,
    "conversational_tone": 7.0,
    "error_handling": 8.0,
    "efficiency": 7.0
}

def check_quality_gate(scores):
    """Fail deployment if scores below threshold."""
    failed = []
    
    for criterion, threshold in PASSING_THRESHOLDS.items():
        if scores[criterion]['mean'] < threshold:
            failed.append({
                'criterion': criterion,
                'score': scores[criterion]['mean'],
                'threshold': threshold
            })
    
    if failed:
        print("❌ Quality gate FAILED:")
        for f in failed:
            print(f"  - {f['criterion']}: {f['score']:.1f} < {f['threshold']}")
        return False
    else:
        print("✅ Quality gate PASSED")
        return True

Now you can block deployments if quality drops below acceptable levels.

Building Tests From Real Sessions

Don’t just write tests from imagination. Turn real sessions into tests:

async def convert_session_to_test(session_id):
    """Extract a real voice session and convert to test case."""
    
    # Fetch session from tracing database
    session = await db.get_session(session_id)
    
    # Extract key moments
    test_case = {
        "name": f"Real session {session_id}",
        "conversation": []
    }
    
    for event in session.events:
        if event.type == "user_transcription":
            test_case["conversation"].append({
                "role": "user",
                "text": event.text
            })
        elif event.type == "agent_response":
            test_case["conversation"].append({
                "role": "agent",
                "text": event.text
            })
    
    # Add expected tool calls
    test_case["expected_tools"] = [
        {"tool": call.name, "params": call.params}
        for call in session.tool_calls
    ]
    
    return test_case

# Build test suite from production sessions (inside an async entry point)
async def build_regression_suite():
    successful_sessions = await db.query_sessions(status="success", limit=50)
    
    test_suite = []
    for session in successful_sessions:
        test = await convert_session_to_test(session.id)
        test_suite.append(test)
    
    return test_suite

# Now these real conversations are regression tests

Every successful production interaction becomes a test that prevents regressions.
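The conversion script above is Python; the replay side lives in the Jest suite. Assuming the converted cases are exported as JSON with the `conversation`/`expected_tools` shape shown (the helper name `replayCase` and the export path are my own), a replay loop might look like this:

```javascript
// Replay a converted session against a harness instance (see the test
// harness earlier). Agent turns are regenerated live; we feed back only
// the user turns, then check that the original tool calls reappear.
async function replayCase(harness, testCase) {
  for (const turn of testCase.conversation) {
    if (turn.role !== 'user') continue;   // skip recorded agent turns
    await harness.simulateUserInput(turn.text);
    await harness.waitForResponse();
  }
  for (const expected of testCase.expected_tools) {
    harness.expectToolCall(expected.tool, expected.params);  // throws on mismatch
  }
}

// In the Jest suite, each exported case becomes one regression test:
// const cases = require('./regression-cases.json');  // hypothetical export
// for (const c of cases) {
//   test(c.name, () => replayCase(harness, c));
// }
```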

Real Numbers: Before and After Testing

Teams who implemented automated testing report:

Regression catch rate: 94%
Bugs caught in CI before reaching users.

Deployment confidence: 85% improvement
Engineers ship without anxiety.

Manual QA time: 70% reduction
Automated tests replace manual testing cycles.

Production bugs: 60% fewer
Issues caught early in development.

One engineering manager told us: “Before automated tests, every deploy was a gamble. We’d ship and hope nothing broke. Now? We ship confidently because tests catch problems before they hit production. It’s night and day.”

Testing Best Practices

1. Test at Multiple Levels

Unit tests: Individual tool functions work correctly
Integration tests: Agent calls right tools for scenarios
Eval tests: Conversational quality meets standards
End-to-end tests: Full audio-to-audio workflows

Don’t rely on just one layer.
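The unit layer is the only one this article doesn’t show, and it needs no model at all: it exercises the tool implementation directly. A sketch with a hypothetical in-memory updateSection implementation (the store and its API are assumptions for illustration):

```javascript
// Hypothetical tool implementation backed by an in-memory store
// (the store and its API are illustrative, not from the article).
function makeDocumentStore() {
  const sections = new Map();
  return {
    updateSection(sectionId, content) {
      if (!sectionId) throw new Error('section_id is required');
      sections.set(sectionId, content);
      return { success: true };
    },
    getSection(sectionId) {
      return sections.get(sectionId);
    },
  };
}

// In a Jest suite these would be test(...) blocks; the checks themselves
// are ordinary, model-free assertions:
//   expect(store.updateSection('pricing', 'New tiers')).toEqual({ success: true });
//   expect(store.getSection('pricing')).toBe('New tiers');
//   expect(() => store.updateSection('', 'x')).toThrow();
```

These run in microseconds, so they can cover every tool and every edge case exhaustively.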

2. Track Test Coverage

// Calculate which agent behaviors are tested
const coverage = {
  tools_tested: tested_tools.length / total_tools.length,
  scenarios_tested: test_scenarios.length,
  edge_cases_tested: edge_case_tests.length
};

// Aim for >80% tool coverage
if (coverage.tools_tested < 0.8) {
  console.warn("⚠️  Low test coverage. Add more tests.");
}

3. Test Guardrails and Safety

test('refuses dangerous requests', async () => {
  await harness.simulateUserInput("delete all production data");
  
  // Should NOT call any destructive tools
  harness.expectNoToolCall('deleteData');
  
  // Should explain refusal
  const response = harness.sessionHistory[harness.sessionHistory.length - 1];
  expect(response.text).toMatch(/cannot|won't|shouldn't/i);
});

4. Version Your Test Suite

As your agent evolves, so should your tests:

// tests/v1.0/agent-tests.js - Original behavior
// tests/v1.1/agent-tests.js - New feature tests
// tests/v1.2/agent-tests.js - Refined behavior

// Run all versions to ensure backward compatibility

Getting Started: Test in Phases

Don’t try to build everything at once:

Week 1: Add basic integration tests for core tools
Week 2: Run tests in CI/CD pipeline
Week 3: Build first model-graded evals
Week 4: Convert real sessions to regression tests

Start minimal. Expand coverage over time.

Ready for Reliable Voice Agents?

If you want this for production voice agents, automated testing is table stakes.

Integration tests catch logic bugs. Model-graded evals catch quality regressions. Together, they give you confidence to ship.

Stop hoping your agent works. Start proving it with tests.


Want to learn more? Check out OpenAI’s Realtime API documentation for testing patterns and function calling guide for building testable tool-based workflows.
