
You build a voice agent, test it with your own voice in a quiet room, and it sounds great. Then it hits users and you discover the agent loses track of domain policies, stumbles on interruptions, and fails transfers. τ-Voice is the benchmark that surfaces these failures before they reach users.

TL;DR: τ-Voice tests voice agents on real customer-service tasks (retail, airline, telecom) under realistic audio conditions. The best systems score only 38% on task completion, roughly half what equivalent text agents achieve. Failures cluster around transcription errors during authentication, hallucinated completions, and poor accent handling. This post explains what the benchmark measures and gives you a checklist to pressure-test your own agent before shipping.



What existing voice agent benchmarks miss

Standard ASR and TTS benchmarks measure one thing: the quality of a single transformation. Word Error Rate tells you how accurately speech becomes text. MOS scores tell you how natural synthesized speech sounds. Neither tells you whether your agent can book a flight.

Voice agent failures rarely happen at the transcription layer alone. They happen where transcription quality, reasoning quality, tool use, and conversational dynamics all collide at once. A voice agent that transcribes perfectly can still hallucinate a policy, mis-route a transfer, or talk over a user who is correcting a mistake.

Before τ-Voice, the closest evaluation options were text-based task benchmarks (like τ²-bench) that stripped out audio entirely, or turn-taking benchmarks (like Full-Duplex-Bench) that measured conversational dynamics on synthetic tasks with no real tool calls. Neither captures what actually breaks in production, which is the combination of all three failure sources running simultaneously.

τ-Voice is the first benchmark to put grounded task completion, realistic audio, and full-duplex interaction into a single evaluation. It runs simulated conversations for 278 tasks across three customer-service domains, with noise, accents, and interruptions enabled by default.

The number to know: The best text agent on equivalent tasks scores 85% task completion. The best voice agent scores 51% under clean audio and 38% under realistic conditions. That is a 47-point gap under the conditions your actual users are calling from.


What τ-Voice actually measures

The benchmark covers 278 tasks across three customer-service domains: retail (114 tasks: returns, exchanges, cancellations, order modifications), airline (50 tasks: flight changes, cancellations, seat upgrades), and telecom (114 tasks: plan changes, billing inquiries, activations).

Each task ends in a verifiable database state change. The agent either processed the return correctly or it didn’t. No partial credit for sounding confident.

The evaluation runs two passes: clean (standard audio, neutral accent) and realistic (background noise at 15 dB SNR, burst noise, frame drops modeled on Gilbert-Elliott channel statistics at ~2%, dynamic muffling, and seven simulated personas with diverse accents). This separates capability from robustness. The gap between the two passes tells you how much your agent relies on ideal conditions.
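
To make the realistic pass concrete, here is a minimal sketch of a Gilbert-Elliott frame-drop model in Python. The two-state transition and drop probabilities below are illustrative assumptions tuned to land near a 2% overall drop rate; they are not the benchmark's actual parameters.

import random

def gilbert_elliott_drops(n_frames, p_good_to_bad=0.01, p_bad_to_good=0.25,
                          drop_in_good=0.001, drop_in_bad=0.5, seed=0):
    # Two-state Markov channel: drops are rare in the Good state and bursty in the Bad state.
    rng = random.Random(seed)
    state_bad = False
    drops = []
    for _ in range(n_frames):
        if state_bad:
            if rng.random() < p_bad_to_good:
                state_bad = False
        elif rng.random() < p_good_to_bad:
            state_bad = True
        p_drop = drop_in_bad if state_bad else drop_in_good
        drops.append(rng.random() < p_drop)
    return drops

drops = gilbert_elliott_drops(50_000)
print(f"overall frame-drop rate: {sum(drops) / len(drops):.2%}")  # lands near 2% with these settings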

τ-Voice also tracks four interaction-quality dimensions:

Dimension        What it measures                                               Direction
─────────────────────────────────────────────────────────────────────────────────────────────
Latency          Time from end of user turn to start of agent response          Lower is better
Responsiveness   Share of user turns that receive an agent response             Higher is better
Interrupt rate   How often the agent talks over the user mid-utterance          Lower is better
Selectivity      How often the agent correctly ignores backchannels and filler  Higher is better

These four dimensions sit in tension. An agent optimized for low latency tends to respond before users finish, which drives up interrupt rate. An agent that maximizes selectivity may miss genuine starts and drop responsiveness. τ-Voice makes this tradeoff visible rather than collapsing it into a single score.

The voice user simulator operates on discrete 200ms ticks, checking every two seconds whether to interrupt or backchannel. It injects out-of-turn speech (“hold on,” coughs, sneezes) at roughly 0.7 events per minute. The tick-based architecture makes runs reproducible, which matters when comparing agent versions over time.
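
As a rough illustration (not τ-Voice's actual code), a tick-based simulator can be made reproducible simply by seeding its random draws. The decision probabilities below are made up for the sketch; only the 200ms tick, 2-second decision interval, and 0.7 events-per-minute rate come from the benchmark description.

import random

TICK_S = 0.2            # 200 ms simulator tick
DECISION_TICKS = 10     # re-evaluate interrupt/backchannel every 2 s (10 ticks)
OUT_OF_TURN_PER_MIN = 0.7

def simulate_user(duration_s=120, seed=7):
    """Toy tick loop: a seeded RNG gives an identical event schedule on every run."""
    rng = random.Random(seed)
    events = []
    for i in range(round(duration_s / TICK_S)):
        t = round(i * TICK_S, 1)
        # Out-of-turn speech ("hold on", a cough) drawn at ~0.7 events per minute.
        if rng.random() < OUT_OF_TURN_PER_MIN / 60 * TICK_S:
            events.append((t, "out_of_turn"))
        # Every 2 s, decide whether to interrupt, backchannel, or stay quiet.
        # These probabilities are invented for illustration.
        if i % DECISION_TICKS == 0:
            roll = rng.random()
            if roll < 0.05:
                events.append((t, "interrupt"))
            elif roll < 0.20:
                events.append((t, "backchannel"))
    return events

print(simulate_user()[:5])  # identical output on every run with the same seed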


Where current leading voice agents fail

Three production voice agents were evaluated: OpenAI gpt-realtime-1.5 (released February 2026), Google gemini-live-2.5-flash-native-audio (December 2025), and xAI grok-voice-agent (December 2025).

Task completion (Pass@1)

Agent                                        Clean   Realistic   Relative drop
──────────────────────────────────────────────────────────────────────────────
xAI grok-voice-agent                           51%       38%         -24%
OpenAI gpt-realtime-1.5                        49%       35%         -28%
Google gemini-live-2.5-flash-native-audio      31%       26%         -17%
Text SOTA (non-reasoning)                      85%       85%          n/a

Interaction quality

Metric             OpenAI    Google    xAI
─────────────────────────────────────────
Latency (s)         0.90      —        —
Responsiveness      100%      69%      —
Selectivity           6%      —       57%
Interrupt rate        —       21%     84%

The failure analysis of 91 failed simulations found that 79-90% of failures came from agent behavior, not simulator artifacts.

Logical errors ran 13-16 per cohort. Agents fail at reasoning even when transcription succeeds: misapplied policies, skipped tool calls, confirmed actions they hadn’t completed. Transcription errors ran 10-16 per cohort and concentrated at authentication. When the agent mishears a name or email address, everything downstream fails because it cannot pull the correct account. Hallucination showed up repeatedly: agents verbally confirming task completion without making tool calls, with the database showing no change.

The accent vulnerability numbers are the sharpest single finding. xAI drops 18 percentage points in task completion when users have non-native accents. Google drops 2. That gap is not a minor tuning difference. It is the difference between serving accented callers roughly as well as everyone else and failing about one in five of them purely because of how they sound.

The interaction quality tradeoffs tell a different story for each provider. OpenAI gets 100% responsiveness at 0.90s latency but only 6% selectivity, meaning it treats almost every filler word as a new turn and fragments the conversation constantly. xAI reaches 57% selectivity but interrupts on 84% of turns. Google has the lowest interrupt rate (21%) but misses more than one in four user turns entirely.

No agent in this cohort is simultaneously fast, responsive, selective, and non-interruptive. That is not a maturity gap. It is a fundamental tradeoff the field has not resolved, and τ-Voice is the first tool that lets you see exactly where a given agent sits on those axes.


A voice agent testing checklist you can apply before shipping

The τ-Voice methodology translates directly into a pre-launch testing protocol. You don’t need to run the full benchmark to apply its logic.

1. Test authentication with real ambiguous names and emails

Authentication is the highest-risk transcription point. Feed your agent names with common pronunciation ambiguities (Stefan vs. Steven, Nguyen, O’Brien), email addresses with unusual domains, and confirmation codes with similar-sounding characters (B vs. P, M vs. N). Measure what share of sessions fail at account lookup versus later steps. If failures concentrate here, the problem is transcription, not reasoning.
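
A minimal harness sketch for this step. run_session() is a hypothetical stand-in for however you drive a scripted call against your agent; the only assumption is that it reports which stage a session failed at. The caller list is made-up test data.

# Hypothetical test data: names and emails with common transcription ambiguities.
AMBIGUOUS_CALLERS = [
    ("Stefan Weber",     "stefan.weber@posteo.net"),
    ("Steven Weber",     "s.weber@fastmail.fm"),
    ("Linh Nguyen",      "linh.nguyen@hey.com"),
    ("Siobhan O'Brien",  "sobrien@proton.me"),
]

def audit_auth_failures(run_session):
    """Tally whether scripted sessions die at account lookup or at a later step.
    run_session() is assumed to return {"failed_stage": "account_lookup" | "later" | None}."""
    lookup_failures = later_failures = passes = 0
    for name, email in AMBIGUOUS_CALLERS:
        result = run_session(name=name, email=email)
        if result["failed_stage"] == "account_lookup":
            lookup_failures += 1
        elif result["failed_stage"] is not None:
            later_failures += 1
        else:
            passes += 1
    total = len(AMBIGUOUS_CALLERS)
    print(f"failed at lookup: {lookup_failures}/{total}, "
          f"failed later: {later_failures}/{total}, passed: {passes}/{total}")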

2. Run with background noise before you run without it

A quiet-room test tells you nothing about production. Test with SNR around 15 dB. If task completion drops more than 20 points relative to your clean baseline, your audio pipeline is brittle. Fix that before tuning prompts.
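
A sketch of the noise-mixing step itself, assuming mono float arrays at the same sample rate (plain numpy, nothing agent-specific):

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    mixed = speech + scaled_noise
    return np.clip(mixed, -1.0, 1.0)                  # keep within float audio range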

3. Test with three accent profiles

Native speaker, moderate non-native accent, strong non-native accent. The gap between your clean and accented scores is your accent penalty. An 18-point penalty means you are failing one in five users from non-native-speaking backgrounds by default.

4. Measure selectivity

Inject filler turns (“uh-huh,” “okay,” “right”) at natural conversation points. Count how often your agent treats them as new intents. Below 50%, your agent will fragment conversations and generate spurious tool calls.
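
A rough way to score selectivity from session logs, assuming you record each turn along with whether the agent responded or issued tool calls. The field names below are made up for the sketch.

FILLERS = {"uh-huh", "okay", "right", "mm-hmm"}

def selectivity(turns):
    """Share of filler turns the agent correctly ignored.
    `turns` is assumed to be a list of dicts like
    {"user_text": "uh-huh", "agent_responded": False, "tool_calls": 0}."""
    filler_turns = [t for t in turns if t["user_text"].strip().lower() in FILLERS]
    if not filler_turns:
        return None
    ignored = [t for t in filler_turns
               if not t["agent_responded"] and t["tool_calls"] == 0]
    return len(ignored) / len(filler_turns)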

5. Watch for interruptions on correction turns

Users correct themselves mid-sentence constantly: “I want to change the flight, actually cancel it.” Does your agent wait for the correction or respond to the initial fragment? A high interrupt rate on correction turns means users spend energy fighting the agent rather than completing their task.

6. Verify tool calls happened

For every session where the agent says a task is done, check the database. τ-Voice found agents that verbally confirmed completions without making tool calls. A post-session audit catches this class of hallucination before users see it.
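
The audit can be a straightforward diff between what the agent claimed and what the database recorded. The session and database shapes below are illustrative assumptions about your own logging, not part of τ-Voice.

def find_hallucinated_completions(sessions, db):
    """Flag sessions where the agent said the task was done but no matching record changed.
    Each session is assumed to look like:
    {"id": "s42", "agent_claimed_done": True,
     "expected_change": ("orders", "ord_981", "status", "returned")}"""
    flagged = []
    for s in sessions:
        if not s["agent_claimed_done"]:
            continue
        table, row_id, field, expected = s["expected_change"]
        actual = db[table][row_id][field]     # however you read your state store
        if actual != expected:
            flagged.append((s["id"], field, expected, actual))
    return flagged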

7. Stress-test multi-step policy sequencing

Real customer-service tasks have ordering requirements: verify identity before discussing account details, confirm cancellation before processing a refund. Build a small set of tasks with two or three required policy steps in sequence. Logical errors (13-16 per cohort in τ-Voice) often show up as policy shortcuts taken under conversational pressure.
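
One way to check ordering from the tool-call log, assuming each session records the tool names the agent actually invoked, in order. The required sequences below are illustrative, not a real policy set.

REQUIRED_ORDER = {
    "process_refund":  ["verify_identity", "confirm_cancellation", "process_refund"],
    "discuss_account": ["verify_identity", "discuss_account"],
}

def ordering_violations(tool_call_log):
    """Return final steps whose required predecessors were skipped or out of order.
    `tool_call_log` is a list of tool names in the order the agent called them."""
    violations = []
    for final_step, required in REQUIRED_ORDER.items():
        if final_step not in tool_call_log:
            continue
        positions = [tool_call_log.index(step) if step in tool_call_log else None
                     for step in required]
        if None in positions or positions != sorted(positions):
            violations.append(final_step)
    return violations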


Benchmark coverage comparison

Capability                   ASR bench   Turn-taking   τ-Voice
────────────────────────────────────────────────────────────────
Transcription accuracy          yes          no           yes
Turn-taking dynamics            no           yes          yes
Grounded task completion        no           no           yes
Policy adherence                no           no           yes
Realistic audio conditions      no           partial      yes
Accent diversity                partial      no           yes
Verifiable database state       no           no           yes

FAQ

What is τ-Voice? τ-Voice (arXiv:2603.13686) evaluates full-duplex voice agents on grounded real-world tasks across retail, airline, and telecom domains. It combines verifiable task completion, realistic audio, and simultaneous speech handling in a single framework. It extends τ²-bench, Sierra’s earlier text-based agent benchmark.

How do current voice agents score? OpenAI gpt-realtime-1.5, Google gemini-live-2.5-flash-native-audio, and xAI grok-voice-agent score 31-51% under clean conditions and 26-38% under realistic audio. The best text agent scores 85% on identical tasks. No voice agent in this cohort reaches even 50% of text capability under realistic conditions.

What does τ-Voice measure that ASR benchmarks don’t? ASR benchmarks measure transcription accuracy on a single audio segment. τ-Voice measures whether the agent completes a real task across a full multi-turn voice conversation, with noise, accents, interruptions, and backchannels in the mix. It is the difference between measuring a component and measuring the system.

Why does authentication fail so often? Authentication requires users to speak names, email addresses, or confirmation codes. A single transcription error cascades: the agent cannot pull the account, so every subsequent tool call fails. τ-Voice found 10-16 transcription errors per 91 failed simulations concentrated at this stage. This is why authentication is the right place to start stress-testing.

What is the voice gap? The performance difference between equivalent text and voice agents on the same tasks. On τ-Voice, the best voice agent (xAI, 38% realistic) retains roughly 45% of text SOTA capability (85%). Under clean conditions the retention is closer to 60%. The voice gap is the central thing τ-Voice is designed to track over time as the field improves.



Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch