τ-Voice benchmark: what full-duplex voice agents actually get wrong
You build a voice agent, test it with your own voice in a quiet room, and it sounds great. Then it hits users and you discover the agent loses track of domain policies, stumbles on interruptions, and fails transfers. τ-Voice is the benchmark that surfaces these failures before they reach users.
TL;DR: τ-Voice tests voice agents on real customer-service tasks (retail, airline, telecom) under realistic audio conditions. The best systems score only 38% on task completion, roughly half what equivalent text agents achieve. Failures concentrate on transcription errors during authentication, hallucination, and poor accent handling. This post explains what the benchmark measures and gives you a checklist to pressure-test your own agent before shipping.

What existing voice agent benchmarks miss
Standard ASR and TTS benchmarks measure one thing: the quality of a single transformation. Word Error Rate tells you how accurately speech becomes text. MOS scores tell you how natural synthesized speech sounds. Neither tells you whether your agent can book a flight.
Voice agent failures don’t happen at the transcription layer. They happen where transcription quality, reasoning quality, tool use, and conversational dynamics all collide at once. A voice agent that transcribes perfectly can still hallucinate a policy, mis-route a transfer, or talk over a user who is correcting a mistake.
Before τ-Voice, the closest evaluation options were text-based task benchmarks (like τ²-bench) that stripped out audio entirely, or turn-taking benchmarks (like Full-Duplex-Bench) that measured conversational dynamics on synthetic tasks with no real tool calls. Neither captures what actually breaks in production, which is the combination of all three failure sources running simultaneously.
τ-Voice is the first benchmark to put grounded task completion, realistic audio, and full-duplex interaction into a single evaluation. It runs 278 tasks across three customer-service domains, with noise, accents, and interruptions by default.
The number to know: The best text agent on equivalent tasks scores 85% task completion. The best voice agent scores 51% under clean audio and 38% under realistic conditions. That is a 47-point gap under the conditions your actual users are calling from.
What τ-Voice actually measures
The benchmark covers 278 tasks across three customer-service domains: retail (114 tasks: returns, exchanges, cancellations, order modifications), airline (50 tasks: flight changes, cancellations, seat upgrades), and telecom (114 tasks: plan changes, billing inquiries, activations).
Each task ends in a verifiable database state change. The agent either processed the return correctly or it didn’t. No partial credit for sounding confident.
The evaluation runs two passes: clean (standard audio, neutral accent) and realistic (background noise at 15 dB SNR, burst noise, frame drops modeled on Gilbert-Elliott channel statistics at ~2%, dynamic muffling, and seven simulated personas with diverse accents). This separates capability from robustness. The gap between the two passes tells you how much your agent relies on ideal conditions.
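The Gilbert-Elliott channel mentioned above is a standard two-state Markov loss model: frames pass in the "good" state and drop in the "bad" state, so losses arrive in bursts rather than uniformly. Here is a minimal sketch of such a channel; the transition probabilities are illustrative values I chose to yield roughly 2% steady-state loss, not parameters taken from the benchmark.

```python
import random

def gilbert_elliott_drops(n_frames, p_gb=0.002, p_bg=0.098, seed=0):
    """Simulate a two-state Gilbert-Elliott loss channel.

    Frames pass in the 'good' state and drop in the 'bad' state.
    Steady-state loss rate = p_gb / (p_gb + p_bg)  (2% with these
    illustrative defaults). Because the bad state persists, drops
    cluster into bursts instead of being spread uniformly.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    bad = False
    drops = []
    for _ in range(n_frames):
        if bad:
            bad = rng.random() >= p_bg  # stay bad unless we recover
        else:
            bad = rng.random() < p_gb   # occasionally enter a burst
        drops.append(bad)
    return drops

drops = gilbert_elliott_drops(200_000)
loss_rate = sum(drops) / len(drops)
print(f"simulated loss rate: {loss_rate:.3%}")  # close to 2%
```

Burstiness is the point: a uniform 2% loss barely dents ASR, while a 10-frame burst can erase an entire spoken name.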
τ-Voice also tracks four interaction-quality dimensions:
| Dimension | What it measures | Direction |
|---|---|---|
| Latency | Time from end of user turn to start of agent response | Lower |
| Responsiveness | Share of user turns that receive an agent response | Higher |
| Interrupt rate | How often the agent talks over the user mid-utterance | Lower |
| Selectivity | How often the agent correctly ignores backchannels and filler | Higher |
These four dimensions sit in tension. An agent optimized for low latency tends to respond before users finish, which drives up interrupt rate. An agent that maximizes selectivity may miss genuine starts and drop responsiveness. τ-Voice makes this tradeoff visible rather than collapsing it into a single score.
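To make the tension concrete, here is a sketch of computing all four dimensions from a per-turn session log. The `Turn` schema is my own illustrative structure, not the benchmark's format; the point is that each metric is a simple ratio over the same log, so improving one often visibly degrades another.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Turn:
    start: float                        # user turn start, seconds
    end: float                          # user turn end, seconds
    is_filler: bool                     # backchannel like "uh-huh"
    agent_responded: bool
    agent_response_start: float | None  # None if no response
    agent_spoke_over_user: bool         # overlap mid-utterance

def interaction_metrics(turns):
    """Compute the four interaction-quality dimensions from a turn log."""
    substantive = [t for t in turns if not t.is_filler]
    fillers = [t for t in turns if t.is_filler]
    latencies = [t.agent_response_start - t.end
                 for t in substantive
                 if t.agent_responded and t.agent_response_start is not None]
    return {
        # time from end of user turn to start of agent response (lower is better)
        "latency_s": sum(latencies) / len(latencies) if latencies else None,
        # share of substantive user turns that got a response (higher is better)
        "responsiveness": sum(t.agent_responded for t in substantive) / len(substantive),
        # share of turns where the agent talked over the user (lower is better)
        "interrupt_rate": sum(t.agent_spoke_over_user for t in turns) / len(turns),
        # share of fillers correctly ignored (higher is better)
        "selectivity": sum(not t.agent_responded for t in fillers) / len(fillers)
                       if fillers else None,
    }

example = [
    Turn(0.0, 2.0, False, True, 2.8, False),   # normal exchange
    Turn(3.0, 3.5, True, False, None, False),  # filler, correctly ignored
    Turn(4.0, 6.0, False, True, 6.4, True),    # fast reply, but talked over user
]
m = interaction_metrics(example)
print(m)
```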
The voice user simulator runs on 200ms discrete ticks, and every two seconds it decides whether to interrupt or backchannel. It injects out-of-turn speech (“hold on,” coughs, sneezes) at roughly 0.7 events per minute. The tick-based architecture makes runs reproducible, which matters when comparing agent versions over time.
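The value of the tick-based design is easy to see in a sketch. The scheduler below is my own illustration of the idea, not the benchmark's implementation: events are drawn per tick with a probability derived from the 0.7 events/minute rate, and a seeded RNG means two agent versions face an identical interruption pattern.

```python
import random

TICK_S = 0.2                               # 200 ms discrete ticks
EVENTS_PER_MIN = 0.7                       # out-of-turn event rate
P_EVENT = EVENTS_PER_MIN * TICK_S / 60.0   # per-tick probability

def inject_out_of_turn(duration_s, seed=42):
    """Schedule out-of-turn events ("hold on", coughs, sneezes) on a
    discrete tick grid. Seeding the RNG makes every run reproducible,
    so A/B comparisons of agent versions see the same interruptions."""
    rng = random.Random(seed)
    events = []
    for i in range(int(duration_s / TICK_S)):
        if rng.random() < P_EVENT:
            kind = rng.choice(["hold_on", "cough", "sneeze"])
            events.append((round(i * TICK_S, 1), kind))
    return events

# A 10-minute call should average ~7 out-of-turn events.
events = inject_out_of_turn(600)
print(len(events), events[:3])
```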
Where current leading voice agents fail
Three production voice agents were evaluated: OpenAI gpt-realtime-1.5 (released February 2026), Google gemini-live-2.5-flash-native-audio (December 2025), and xAI grok-voice-agent (December 2025).
Task completion (Pass@1)
| Agent | Clean | Realistic | Drop |
|---|---|---|---|
| xAI grok-voice-agent | 51% | 38% | -13pp |
| OpenAI gpt-realtime-1.5 | 49% | 35% | -14pp |
| Google gemini-live-2.5-flash-native-audio | 31% | 26% | -5pp |
| Text SOTA (non-reasoning) | 85% | 85% | n/a |
Interaction quality
| Metric | OpenAI | Google | xAI |
|---|---|---|---|
| Latency (s) | 0.90 | — | — |
| Responsiveness | 100% | 69% | — |
| Selectivity | 6% | — | 57% |
| Interrupt rate | — | 21% | 84% |
The failure analysis of 91 failed simulations found that 79-90% of failures came from agent behavior, not simulator artifacts.
Logical errors ran 13-16 per cohort. Agents fail at reasoning even when transcription succeeds: misapplied policies, skipped tool calls, confirmed actions they hadn’t completed. Transcription errors ran 10-16 per cohort and concentrated at authentication. When the agent mishears a name or email address, everything downstream fails because it cannot pull the correct account. Hallucination showed up repeatedly: agents verbally confirming task completion without making tool calls, with the database showing no change.
The accent vulnerability numbers are the sharpest single finding. xAI drops 18 percentage points in task completion when users have non-native accents. Google drops 2. That gap is not a minor tuning difference. It is the difference between working for nearly all of your users and working for roughly 80% of them.
The interaction quality tradeoffs tell a different story for each provider. OpenAI gets 100% responsiveness at 0.90s latency but only 6% selectivity, meaning it treats almost every filler word as a new turn and fragments the conversation constantly. xAI reaches 57% selectivity but interrupts on 84% of turns. Google has the lowest interrupt rate (21%) but misses more than one in four user turns entirely.
No agent in this cohort is simultaneously fast, responsive, selective, and non-interruptive. That is not a maturity gap. It is a fundamental tradeoff the field has not resolved, and τ-Voice is the first tool that lets you see exactly where a given agent sits on those axes.
A voice agent testing checklist you can apply before shipping
The τ-Voice methodology translates directly into a pre-launch testing protocol. You don’t need to run the full benchmark to apply its logic.
1. Test authentication with real ambiguous names and emails
Authentication is the highest-risk transcription point. Feed your agent names with common pronunciation ambiguities (Stefan vs. Steven, Nguyen, O’Brien), email addresses with unusual domains, and confirmation codes with similar-sounding characters (B vs. P, M vs. N). Measure what share of sessions fail at account lookup versus later steps. If failures concentrate here, the problem is transcription, not reasoning.
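A cheap way to run this measurement is to tag each failed session by whether the account was ever resolved. The session schema below is illustrative (your logs will differ), but the split it produces tells you immediately whether to spend effort on the audio pipeline or the reasoning layer.

```python
def failure_stage_breakdown(sessions):
    """Split failed sessions into account-lookup failures vs later
    failures. Each session is an illustrative dict:
        {"failed": bool, "account_resolved": bool}
    If most failures happen before the account is resolved, the
    bottleneck is transcription at authentication, not reasoning."""
    failed = [s for s in sessions if s["failed"]]
    if not failed:
        return {"auth_share": 0.0, "later_share": 0.0}
    auth = sum(not s["account_resolved"] for s in failed)
    return {
        "auth_share": auth / len(failed),
        "later_share": (len(failed) - auth) / len(failed),
    }

sessions = [
    {"failed": True,  "account_resolved": False},  # misheard "Nguyen"
    {"failed": True,  "account_resolved": False},  # "B" heard as "P"
    {"failed": True,  "account_resolved": True},   # policy error later on
    {"failed": False, "account_resolved": True},
]
breakdown = failure_stage_breakdown(sessions)
print(breakdown)
```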
2. Run with background noise before you run without it
A quiet-room test tells you nothing about production. Test with SNR around 15 dB. If task completion drops more than 20 points relative to your clean baseline, your audio pipeline is brittle. Fix that before tuning prompts.
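Mixing noise at a controlled SNR is a one-liner of signal math: since SNR_dB = 10·log10(P_speech / P_noise), you scale the noise so its power equals P_speech / 10^(SNR/10). A minimal pure-Python sketch, assuming mono float samples (synthetic sines stand in for real speech and noise recordings):

```python
import math

def scale_noise_for_snr(speech, noise, target_snr_db=15.0):
    """Scale a noise signal so mixing it with speech hits the target SNR.
    SNR_dB = 10*log10(P_speech / P_noise), so the noise power must be
    brought to P_speech / 10^(SNR_dB / 10)."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    target_p_noise = p_speech / (10 ** (target_snr_db / 10.0))
    gain = math.sqrt(target_p_noise / p_noise)
    return [gain * x for x in noise]

# Synthetic stand-ins: a 220 Hz "speech" tone and a high-frequency "noise"
# tone, one second each at 16 kHz, mixed at 15 dB SNR.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [math.sin(2 * math.pi * 3731 * t / 16000 + 0.7) for t in range(16000)]
scaled = scale_noise_for_snr(speech, noise, 15.0)
mixed = [s + n for s, n in zip(speech, scaled)]

p_s = sum(x * x for x in speech) / len(speech)
p_n = sum(x * x for x in scaled) / len(scaled)
achieved_db = 10 * math.log10(p_s / p_n)
print(f"achieved SNR: {achieved_db:.1f} dB")
```

Swap the sine stand-ins for real recordings (cafe, call-center, street) and replay your test suite through the mix.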
3. Test with three accent profiles
Native speaker, moderate non-native accent, strong non-native accent. The gap between your clean and accented scores is your accent penalty. An 18-point penalty means you are failing one in five users from non-native-speaking backgrounds by default.
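The penalty itself is just the gap between your native-speaker baseline and your worst accented profile, in percentage points. A tiny helper makes it a trackable number per release (the scores below are illustrative, not benchmark data):

```python
def accent_penalty(scores):
    """Accent penalty in percentage points: task completion with
    native-speaker audio minus the worst accented profile.
    `scores` maps profile name -> task-completion rate (0..1)."""
    baseline = scores["native"]
    worst = min(v for k, v in scores.items() if k != "native")
    return round((baseline - worst) * 100, 1)

# Illustrative numbers only:
penalty = accent_penalty({"native": 0.51, "moderate": 0.46, "strong": 0.33})
print(penalty)
```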
4. Measure selectivity
Inject filler turns (“uh-huh,” “okay,” “right”) at natural conversation points. Count how often your agent treats them as new intents. Below 50%, your agent will fragment conversations and generate spurious tool calls.
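Counting this from a session log is straightforward. The sketch below assumes a simple (utterance, agent_responded) log format, which is my own illustration; a selectivity score is the share of injected fillers the agent correctly let pass.

```python
FILLERS = {"uh-huh", "okay", "right", "mm-hmm", "yeah"}

def selectivity(turn_log):
    """Share of filler turns the agent correctly ignored.
    `turn_log` is a list of (user_utterance, agent_responded) pairs;
    the schema is illustrative."""
    filler_turns = [(u, r) for u, r in turn_log
                    if u.lower().strip(".!") in FILLERS]
    if not filler_turns:
        return None
    ignored = sum(not responded for _, responded in filler_turns)
    return ignored / len(filler_turns)

log = [
    ("I need to return order 4411", True),
    ("uh-huh", False),   # correctly ignored
    ("okay", True),      # wrongly treated as a new intent
    ("right", False),    # correctly ignored
    ("and cancel the other one too", True),
]
score = selectivity(log)
print(f"selectivity: {score:.0%}")
```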
5. Watch for interruptions on correction turns
Users correct themselves mid-sentence constantly: “I want to change the flight, actually cancel it.” Does your agent wait for the correction or respond to the initial fragment? A high interrupt rate on correction turns means users spend energy fighting the agent rather than completing their task.
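You can flag this pattern automatically by scanning for correction markers and comparing timestamps. Both the marker list and the turn schema below are assumptions for illustration; the check is simply "the turn contains a correction and the agent started speaking before the user finished."

```python
def interrupted_corrections(turns):
    """Flag user turns containing a mid-sentence correction where the
    agent started speaking before the user finished. Each turn is an
    illustrative dict: user text, user end time, agent response start."""
    markers = ("actually", "no wait", "i mean", "scratch that")
    flagged = []
    for t in turns:
        has_correction = any(m in t["user_text"].lower() for m in markers)
        agent_cut_in = (t["agent_start"] is not None
                        and t["agent_start"] < t["user_end"])
        if has_correction and agent_cut_in:
            flagged.append(t["user_text"])
    return flagged

turns = [
    {"user_text": "Change the flight, actually cancel it",
     "user_end": 4.2, "agent_start": 2.9},   # agent answered the fragment
    {"user_text": "Cancel my subscription",
     "user_end": 2.0, "agent_start": 2.6},   # waited, no correction anyway
]
flagged = interrupted_corrections(turns)
print(flagged)
```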
6. Verify tool calls happened
For every session where the agent says a task is done, check the database. τ-Voice found agents that verbally confirmed completions without making tool calls. A post-session audit catches this class of hallucination before users see it.
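The audit reduces to a join between "what the agent claimed" and "what the tool log shows." The session schema here is illustrative; in practice you would compare against actual database state, but even an empty tool-call log next to a verbal confirmation is enough to flag the hallucination class described above.

```python
def audit_claimed_completions(sessions):
    """Flag sessions where the agent verbally claimed completion but no
    tool call appears in the log. Illustrative schema:
        {"id": str, "agent_claimed_done": bool, "tool_calls": [str]}
    A claim with an empty tool-call log is the failure class observed
    in τ-Voice: confident confirmation, unchanged database."""
    return [s["id"] for s in sessions
            if s["agent_claimed_done"] and not s["tool_calls"]]

sessions = [
    {"id": "s1", "agent_claimed_done": True,  "tool_calls": ["process_return"]},
    {"id": "s2", "agent_claimed_done": True,  "tool_calls": []},  # hallucinated
    {"id": "s3", "agent_claimed_done": False, "tool_calls": []},
]
flagged_sessions = audit_claimed_completions(sessions)
print(flagged_sessions)
```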
7. Stress-test multi-step policy sequencing
Real customer-service tasks have ordering requirements: verify identity before discussing account details, confirm cancellation before processing a refund. Build a small set of tasks with two or three required policy steps in sequence. Logical errors (13-16 per cohort in τ-Voice) often show up as policy shortcuts taken under conversational pressure.
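Ordering checks are easy to automate against a session's tool-call sequence: walk the required steps and confirm each appears after the previous one. The tool names below are hypothetical examples, not a real API.

```python
def policy_order_violations(tool_calls, required_order):
    """Check that required policy steps appear in the prescribed order.
    `tool_calls` is the sequence of tool names from a session log;
    `required_order` lists steps that must occur, in order. Returns
    the steps that were skipped or executed out of sequence."""
    violations = []
    cursor = 0
    for step in required_order:
        try:
            # each step must appear at or after the previous step's position
            cursor = tool_calls.index(step, cursor) + 1
        except ValueError:
            violations.append(step)
    return violations

# Hypothetical example: refund issued before cancellation was confirmed.
calls = ["verify_identity", "issue_refund", "confirm_cancellation"]
required = ["verify_identity", "confirm_cancellation", "issue_refund"]
violations = policy_order_violations(calls, required)
print(violations)
```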
Benchmark coverage comparison
| Capability | ASR bench | Turn-taking | τ-Voice |
|---|---|---|---|
| Transcription accuracy | yes | no | yes |
| Turn-taking dynamics | no | yes | yes |
| Grounded task completion | no | no | yes |
| Policy adherence | no | no | yes |
| Realistic audio conditions | no | partial | yes |
| Accent diversity | partial | no | yes |
| Verifiable database state | no | no | yes |
FAQ
What is τ-Voice? τ-Voice (arXiv:2603.13686) evaluates full-duplex voice agents on grounded real-world tasks across retail, airline, and telecom domains. It combines verifiable task completion, realistic audio, and simultaneous speech handling in a single framework. It extends τ²-bench, Sierra’s earlier text-based agent benchmark.
How do current voice agents score? OpenAI gpt-realtime-1.5, Google gemini-live-2.5-flash-native-audio, and xAI grok-voice-agent score 31-51% under clean conditions and 26-38% under realistic audio. The best text agent scores 85% on identical tasks. No voice agent in this cohort reaches even 50% of text capability under realistic conditions.
What does τ-Voice measure that ASR benchmarks don’t? ASR benchmarks measure transcription accuracy on a single audio segment. τ-Voice measures whether the agent completes a real task across a full multi-turn voice conversation, with noise, accents, interruptions, and backchannels in the mix. That is the difference between measuring a component and measuring the system.
Why does authentication fail so often? Authentication requires users to speak names, email addresses, or confirmation codes. A single transcription error cascades: the agent cannot pull the account, so every subsequent tool call fails. τ-Voice found 10-16 transcription errors per 91 failed simulations concentrated at this stage. This is why authentication is the right place to start stress-testing.
What is the voice gap? The performance difference between equivalent text and voice agents on the same tasks. On τ-Voice, the best voice agent (xAI, 38% realistic) retains roughly 45% of text SOTA capability (85%). Under clean conditions the retention is closer to 60%. The voice gap is the central thing τ-Voice is designed to track over time as the field improves.
Related posts:
- Voice agent architecture: building blocks for production systems
- Voice agent frameworks: LiveKit, Pipecat, and when to use each
- Agent evaluation frameworks: how to measure what your agent actually does
Sources:
- τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains (arXiv:2603.13686)
- τ³-Bench: Advancing agent evaluation to knowledge and voice (Sierra AI)
- Full-Duplex-Bench: A Benchmark to Evaluate Full-Duplex Spoken Dialogue Models on Turn-taking Capabilities
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch