ASR is solved (on benchmarks): the real-world gap every voice agent team hits
TL;DR: Standard ASR benchmarks test clean, read speech in studio conditions. Voice agents operate on noisy phone channels, disfluency-laden conversation, and domain-specific vocabulary — none of which benchmarks measure. The Back to Basics paper (arXiv:2603.25727) quantifies the gap across seven production ASR systems: robustness does not transfer across languages, environments, or acoustic conditions. Understanding the five structural gaps between benchmark and production is the first step to building voice agents that actually work.

Whisper Large-V3 scores around 2.5% Word Error Rate on LibriSpeech clean (per the original OpenAI paper; third-party evaluations with standard normalization report 2.7%). NVIDIA’s Parakeet-TDT-0.6B-v2 scores 1.69%. These numbers circulate in procurement decisions, integration announcements, and architecture reviews. They feel like permission to stop worrying about speech recognition.
Then your voice agent ships. Users call from cars, offices, noisy restaurants. They say “um” and restart sentences. They use your product’s actual name, which the model has never seen. And your word error rate lands somewhere between 30% and 50% on a hard Monday. Not because the model is bad. Because the benchmark was measuring something else entirely.
What LibriSpeech actually tests
LibriSpeech is a 1,000-hour corpus of audiobook recordings, narrated by volunteers in reasonably quiet conditions, in standard American and British English. The speakers are reading. There are no disfluencies, no background noise, no phone channel degradation, no accents outside the corpus’s demographic range.
It is an excellent benchmark for what it measures: how accurately a model transcribes prepared, articulate speech in a controlled acoustic environment. That is a real use case. It is not a voice agent.
The gap isn’t a criticism of LibriSpeech. It was designed for controlled comparison between models, and it works well for that purpose. The problem is how the scores travel. A vendor quotes 2.5% WER (or 2.7% depending on normalization methodology) and the number lands in an engineering decision with none of its context. “Near-human accuracy” becomes the implicit conclusion. The benchmark’s scope evaporates.
Here is what production conditions actually look like:
| Condition | Benchmark assumption | Production reality | WER impact |
|---|---|---|---|
| Audio quality | 16 kHz studio microphone | 8 kHz phone channel, Bluetooth, varying mics | +5–25 pp |
| Background noise | Near-silent | Streets, offices, cars, hospitals | +5–30 pp |
| Speech style | Read, fluent | Conversational, disfluent | +0.7–1.0 pp |
| Vocabulary | General English | Product names, domain jargon | +6–12 pp (SER) |
| Processing mode | Batch, full audio context | Streaming, 200–300ms window | +10–17 pp |
| Speaker range | Standard accents | Global accents, code-switching | +5–15 pp |
Each factor degrades accuracy independently. In a voice agent call, several apply simultaneously. Deepgram’s 2026 buyer’s guide documents 7.5x–16x degradation from benchmark to production in realistic scenarios.
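To make the compounding concrete, here is a rough back-of-envelope sketch — illustrative only, not from the paper — that stacks the mid-range impact of each condition in the table on top of a benchmark score. The midpoints are assumptions for illustration; real degradation is not strictly additive.

```python
# Rough, illustrative back-of-envelope: stack mid-range WER impacts from the
# table above on a benchmark score. Degradation is not strictly additive in
# practice; this only shows how quickly the accuracy budget evaporates.
benchmark_wer = 2.5  # LibriSpeech clean, percent

# Assumed midpoints of the table ranges (percentage points).
impacts = {
    "phone channel / mic variance": 15.0,
    "background noise": 17.5,
    "conversational disfluency": 0.85,
    "streaming constraint": 13.5,
}

running = benchmark_wer
for condition, pp in impacts.items():
    running += pp
    print(f"{condition:<30} +{pp:>5.1f} pp -> ~{running:.1f}% WER")
# Lands in the 30-50% region the production anecdotes describe.
```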
The five gaps the Back to Basics paper names
The March 2026 paper “Back to Basics: Revisiting ASR in the Age of Voice Agents” (arXiv:2603.25727, Boson AI) evaluates seven widely-used ASR systems using WildASR, a new multilingual diagnostic benchmark built from real human speech. WildASR tests robustness along three independent axes: environmental degradation, demographic shift, and linguistic diversity.
The findings are worth sitting with. Model robustness does not transfer across languages or conditions. A model that handles noisy English well degrades unpredictably in French or Mandarin under the same acoustic stress. There is no safe assumption that a model’s English benchmark score predicts its behavior in another language, another accent, or another noise environment.
The paper identifies five structural gaps between what benchmarks test and what voice agents need:
Gap 1: Disfluency handling. Studies find filler words constitute 6–10% of words in spontaneous speech, and roughly one in three utterances contains at least one filler, repetition, false start, or revision. LibriSpeech has essentially none. Models trained predominantly on read speech handle disfluencies by deleting them, converting them to repeated words, or substituting semantically plausible alternatives. “I want to um — schedule my return” becomes “I want to schedule my return” on a good day, and “I want to order my return” on a bad one. The direct WER impact from disfluencies alone is modest, but the downstream intent corruption is significant.
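A quick way to see why WER alone understates this: score the “good day” and “bad day” transcripts from the example above against what was actually spoken. Both edits are cheap in WER terms; only one preserves intent. A minimal sketch using the open-source jiwer package (assumed installed):

```python
# pip install jiwer  -- standard WER tooling
import jiwer

spoken = "i want to um schedule my return"     # what the user actually said
good_day = "i want to schedule my return"      # filler deleted, intent intact
bad_day = "i want to order my return"          # filler deleted AND verb substituted

for name, hyp in [("good day", good_day), ("bad day", bad_day)]:
    print(name, f"WER = {jiwer.wer(spoken, hyp):.2f}")

# good day: WER ≈ 0.14 (one deletion out of seven reference words)
# bad day:  WER ≈ 0.29 -- still looks tolerable on paper, but the intent flipped
```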
Gap 2: Channel degradation. A customer calling from a car on a phone network sends audio sampled at 8 kHz rather than the 16 kHz standard for modern ASR. In real clinical telephony deployments, models like IndicWav2Vec reach over 40% WER (versus their clean-audio benchmarks, arXiv:2512.16401). A contact center voice agent handling 10,000 calls per day will see this condition on a meaningful fraction of them.
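One cheap way to approximate the phone-channel condition on your own test audio, before you have real call recordings: band-limit to the roughly 300–3400 Hz telephony passband and round-trip through 8 kHz. A minimal NumPy/SciPy sketch (the file paths are placeholders; assumes 16 kHz mono input):

```python
# pip install numpy scipy soundfile
import soundfile as sf
from scipy.signal import butter, sosfilt, resample_poly

audio, sr = sf.read("clean_16k.wav")   # placeholder path; expects 16 kHz mono
assert sr == 16_000

# Approximate the telephony passband (~300-3400 Hz) with a Butterworth bandpass.
sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
narrowband = sosfilt(sos, audio)

# Round-trip through 8 kHz to discard the high band, then back to 16 kHz so the
# clip can still be fed to an ASR model that expects 16 kHz input.
down = resample_poly(narrowband, up=1, down=2)   # 16 kHz -> 8 kHz
phone_like = resample_poly(down, up=2, down=1)   # 8 kHz -> 16 kHz, band already lost

sf.write("phone_like_16k.wav", phone_like, sr)
```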
Gap 3: Domain vocabulary. ASR training corpora skew heavily toward general English text. Your product name, your plan tiers, your competitors’ names, your industry’s jargon: all out-of-vocabulary. Deepgram’s 2026 SER guide reports that Slot Error Rate (accuracy on the specific named entities that downstream tasks depend on) exceeds WER by 6–12 percentage points. For a voice agent processing return requests, the flight number matters more than the filler words.
Gap 4: The streaming tax. Voice agents cannot wait for users to finish sentences before transcribing. Streaming ASR must commit to transcriptions within a 200–300ms window (the perceptual threshold beyond which users unconsciously sense gaps in the conversation, per AssemblyAI 2026). Batch ASR, given the full audio, achieves 10–17 percentage points better WER than streaming at equivalent latency settings. That accuracy gap is permanent; it’s the price of real-time operation.
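What the streaming constraint means mechanically: instead of handing the model the full utterance, you feed it fixed windows of a few hundred milliseconds and accept whatever it commits to. A minimal chunking sketch — the streaming recognizer itself is left as a placeholder, since every vendor’s client API differs:

```python
import numpy as np

def stream_chunks(audio: np.ndarray, sample_rate: int, window_ms: int = 300):
    """Yield fixed-size windows the way a streaming ASR client would send them."""
    hop = int(sample_rate * window_ms / 1000)
    for start in range(0, len(audio), hop):
        yield audio[start:start + hop]

# Usage sketch: 'recognizer' is a placeholder for whatever streaming client you use.
# for chunk in stream_chunks(audio, 16_000, window_ms=300):
#     partial = recognizer.feed(chunk)   # hypothetical streaming API
```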
Gap 5: Hallucination under degraded audio. This is the least discussed and most dangerous gap. When audio is partially degraded or contains near-silence, models do not return low-confidence outputs. They generate plausible transcriptions that were never spoken. “Cancel my order” can become “Schedule my order” under acoustic pressure. Both are grammatical. Only one matches user intent. The SHALLOW benchmark (arXiv:2510.16567) categorizes these hallucinations across lexical, phonetic, morphological, and semantic dimensions. WildASR’s core finding is that this failure mode is unpredictable: it varies by language, environment, and demographic in ways that English benchmark performance cannot anticipate.
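A cheap regression test for this failure mode: feed the model near-silence and fail the build if it invents content. The transcribe() callable below is a placeholder for your actual ASR client; the assertion threshold is an assumption you should tune.

```python
import numpy as np

def near_silence(seconds: float = 3.0, sample_rate: int = 16_000) -> np.ndarray:
    """Very low-level noise: the kind of frame that tends to trigger hallucination."""
    rng = np.random.default_rng(0)
    return rng.normal(0.0, 1e-4, int(seconds * sample_rate)).astype(np.float32)

def test_no_hallucination_on_silence(transcribe):
    """'transcribe' is a placeholder: audio array in, transcript string out."""
    text = transcribe(near_silence())
    # On near-silence the model should return nothing (or next to nothing),
    # not a grammatical sentence that was never spoken.
    assert len(text.strip().split()) <= 1, f"hallucinated on silence: {text!r}"
```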
```mermaid
flowchart LR
    A["LibriSpeech benchmark\n(~2.5% WER)"] --> B["Whisper Large-V3\nin lab"]
    C["Production call\n(phone channel)"] --> D["8 kHz audio\n+ channel noise"]
    D --> E["Domain vocabulary\ngap"]
    E --> F["Streaming latency\nconstraint"]
    F --> G["Voice agent\nWER: 30–50%"]
    B -.->|"7.5x–16x\ndegradation"| G
    style A fill:#e8f5e9
    style G fill:#ffebee
    style B fill:#e8f5e9
```
Why WER is the wrong primary metric for voice agents
Word Error Rate measures the edit distance between the ASR transcript and a reference transcription, normalized by reference length. It treats every word as equally important. For voice agents, this is wrong in two directions.
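Formally, with S substitutions, D deletions, and I insertions against an N-word reference:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

Every error contributes the same 1/N, whether the word was a filler or an account number — which is exactly the problem.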
First, filler words matter less than content words. Getting “um” wrong is not the same as getting “account number” wrong. A WER of 8% could mean the model is flawlessly capturing all named entities and occasionally mangling hesitations, or it could mean it’s misreading critical values reliably. WER cannot tell the difference.
Second, Slot Error Rate is the metric that actually predicts downstream task success. SER measures whether the model correctly transcribed the tokens that matter: names, numbers, dates, addresses, product identifiers. A contact center voice agent cares whether it captured “return order 7721-B,” not whether it handled the surrounding conversation fluently. Deepgram’s 2026 developer guide on SER documents the 6–12 percentage point gap between WER and SER in production domains.
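There is no single canonical SER implementation, but a minimal version is easy to sketch: extract the slot values you care about from the reference, and count how many survive transcription intact. Everything below (the slot list, the helper name, the matching strategy) is illustrative, not a standard API.

```python
def slot_error_rate(ref_slots: list[str], hyp_text: str) -> float:
    """Fraction of reference slots that do not appear intact in the hypothesis.

    Deliberately simple: exact substring matching after lowercasing. Production
    versions usually normalize numbers, dates, and spellings before matching.
    """
    if not ref_slots:
        return 0.0
    hyp = hyp_text.lower()
    missed = sum(1 for slot in ref_slots if slot.lower() not in hyp)
    return missed / len(ref_slots)

# Illustrative example: the order ID is the slot that drives the downstream tool call.
reference_slots = ["7721-B", "return"]
hypothesis = "i would like to return order seventy seven twenty one bee"
print(slot_error_rate(reference_slots, hypothesis))  # 0.5 -- the ID was lost
```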
The 300ms latency wall compounds this. Beyond 300ms between user speech and agent response, users unconsciously begin to sense gaps in the conversation (Cresta, 2026). Beyond 500ms, they consciously notice. Satisfaction drops. Beyond one second, abandonment spikes. A voice agent optimizing for low latency (as all production voice agents must) is permanently operating in the streaming accuracy regime, not the batch regime.
A useful rule of thumb: if your total voice agent round-trip budget is 800ms and LLM inference takes 400ms, your ASR layer has roughly 150ms once text-to-speech and network overhead take their share. At that latency, you are not running Whisper Large-V3 in batch mode. You are running a streaming endpoint at a fraction of the model’s maximum capability.
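The arithmetic is simple enough to keep in a config and assert against in CI. The component figures below are the example numbers from this section plus assumed TTS and network allowances, not universal constants.

```python
# Round-trip budget sketch using the example figures above (not universal constants).
ROUND_TRIP_BUDGET_MS = 800

budget_ms = {
    "llm_inference": 400,
    "tts_first_byte": 150,       # assumed; depends on your TTS vendor
    "network_and_overhead": 100, # assumed
}

asr_budget = ROUND_TRIP_BUDGET_MS - sum(budget_ms.values())
print(f"ASR budget: {asr_budget} ms")   # ~150 ms -> streaming regime, not batch
assert asr_budget >= 100, "no headroom left for ASR at all"
```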
What WildASR reveals about model selection
The WildASR benchmark’s most actionable finding is the robustness transfer problem. Teams often evaluate a single ASR model on English benchmarks and assume multilingual performance scales predictably. It does not.
WildASR tests models independently along three axes. A model robust to environmental degradation in English may fail catastrophically under the same conditions in Mandarin. A model that handles diverse accents in English may be brittle to demographic shift in Spanish. The axes are not correlated. Each requires independent evaluation.
For teams building multilingual voice agents, or voice agents serving diverse user populations in any single language, this has direct practical implications:
- Evaluate each language independently on realistic samples from your actual user base
- Test each degradation axis (noise, channel, accent, disfluency) separately before combining
- Do not assume English robustness predicts other-language robustness
- Use WER only as a secondary signal; measure SER and task completion as primary signals
The paper’s authors note: “Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation.”
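In practice that means an evaluation grid, not a single score. A minimal sketch with jiwer; the language list, condition labels, and the load_pairs stub are placeholders for your own test sets:

```python
# pip install jiwer
import jiwer

LANGUAGES = ["en", "fr", "zh"]                                       # your deployment languages
CONDITIONS = ["clean", "phone_channel", "10dB_noise", "disfluent"]   # degradation axes

def load_pairs(language: str, condition: str):
    """Placeholder: return (reference, hypothesis) pairs for one test-set cell."""
    # Swap this stub for your own recordings + transcripts per language/condition.
    return [("cancel my order seven seven two one",
             "schedule my order seven seven two one")]

grid = {}
for lang in LANGUAGES:
    for cond in CONDITIONS:
        refs, hyps = zip(*load_pairs(lang, cond))
        grid[(lang, cond)] = jiwer.wer(list(refs), list(hyps))

# Report every cell: the WildASR finding is that no cell predicts another.
for (lang, cond), wer in sorted(grid.items()):
    print(f"{lang:>3} | {cond:<14} | WER {wer:.1%}")
```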
How to actually evaluate ASR for your voice agent
ASR evaluation for voice agents requires a different protocol than standard benchmark comparison. Here is a practical starting checklist:
ASR production readiness checklist

Channel:
- [ ] Test on your actual delivery channel (phone at 8 kHz, WebRTC, mobile app mic)
- [ ] Test at 10 dB SNR and 5 dB SNR for worst-case calls
- [ ] Include 2% packet loss simulation for VoIP deployments (see the mixing sketch after this checklist)

Vocabulary:
- [ ] Build a domain-specific test set with your product names, plan names, jargon
- [ ] Measure Slot Error Rate, not just WER, on named entity tokens
- [ ] Test your word-boosting / custom vocabulary implementation

Speech style:
- [ ] Include at least 20% disfluent samples (fillers, false starts, repetitions)
- [ ] Include non-native speaker samples at your expected accent distribution
- [ ] Test at your streaming latency target (not batch)

Latency:
- [ ] Measure WER at your actual latency constraint (e.g., 150ms for an 800ms budget)
- [ ] Confirm streaming endpoint, not batch, is what gets evaluated
- [ ] Test under concurrent load conditions (WER often degrades under load)

Failure mode coverage:
- [ ] Test silent frames and near-silence for hallucination
- [ ] Test domain-critical pairs: cancel/schedule, return/retain, yes/no
- [ ] Monitor SER continuously in production, not just WER
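To exercise the channel items above without waiting for real bad calls, you can synthesize them. A minimal NumPy sketch that mixes recorded noise at a target SNR and zeros out a fraction of 20 ms frames to mimic VoIP packet loss; the file paths, mono assumption, and 20 ms frame size are assumptions:

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    noise = np.resize(noise, speech.shape)          # assumes mono arrays
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def drop_packets(audio: np.ndarray, sample_rate: int, loss: float = 0.02,
                 frame_ms: int = 20) -> np.ndarray:
    """Zero out randomly chosen frames to mimic VoIP packet loss (no concealment)."""
    out = audio.copy()
    frame = int(sample_rate * frame_ms / 1000)
    rng = np.random.default_rng(0)
    for start in range(0, len(out), frame):
        if rng.random() < loss:
            out[start:start + frame] = 0.0
    return out

speech, sr = sf.read("call_sample_16k.wav")   # placeholder paths
noise, _ = sf.read("office_noise_16k.wav")
degraded = drop_packets(mix_at_snr(speech, noise, snr_db=10), sr, loss=0.02)
sf.write("degraded_10db_2pct_loss.wav", degraded, sr)
```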
The acoustic conditions your users call from are the benchmark that matters. LibriSpeech tells you the ceiling. Your production test set tells you where you’ll actually land.
For teams using streaming ASR architectures, the latency-accuracy tradeoff is built into the pipeline design. There is no avoiding it, only making it explicit. For the full picture of where ASR failure compounds with turn-taking and agent reasoning failures, the τ-Voice benchmark analysis documents how these errors cascade through a complete voice agent conversation. The τ-Voice results (best voice agent at 38% task completion under realistic conditions, upper bound of a 26–38% range across tested systems) are partly an ASR story. The transcription failures at authentication cascade into every subsequent tool call.
Key takeaways
- Whisper Large-V3’s 2.7% WER on LibriSpeech does not predict your production accuracy. Deepgram documents 7.5x–16x degradation from benchmark to production in realistic conditions.
- The five structural gaps are: disfluency handling, channel degradation, domain vocabulary, the streaming latency tax, and hallucination under degraded audio. Each adds independently to production WER.
- Slot Error Rate, not WER, is the right primary metric for voice agents. SER exceeds WER by 6–12 percentage points in domain-specific deployments (Deepgram, 2026).
- The WildASR paper’s key finding: model robustness does not transfer across languages or conditions. Evaluate each independently.
- Streaming ASR permanently costs 10–17 percentage points of accuracy compared to batch. At a 150ms ASR budget inside an 800ms round-trip, you are not running in the benchmark regime.
- Build your own production test set. Include phone-channel audio, disfluent samples, domain vocabulary, and your actual accent distribution. Then run it at your latency constraint.
FAQ
Why does ASR perform so much worse in production than benchmarks suggest? ASR benchmarks like LibriSpeech test clean, read speech recorded in studio conditions. Production voice agents encounter phone-channel audio (8 kHz vs 16 kHz), background noise at 10–15 dB SNR, natural disfluencies (um, uh, false starts), and domain-specific vocabulary absent from training data. Each factor adds 5–25 percentage points to WER independently. Combined, they produce 7.5x–16x degradation from benchmark to production (Deepgram, 2026).
What is the streaming accuracy tax in ASR? Streaming ASR must return transcriptions within 200–300ms (before the user finishes speaking), so the model cannot use full audio context. Batch ASR with complete audio access achieves 10–17 percentage points lower WER than streaming at equivalent latency settings. For voice agents, streaming is unavoidable, so you permanently give up this accuracy margin.
What is WildASR and how does it differ from LibriSpeech? WildASR (arXiv:2603.25727, Boson AI, 2026) is a multilingual diagnostic benchmark sourced from real human speech that independently measures robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. LibriSpeech uses read audiobooks in clean studio conditions. WildASR found that model robustness does not transfer across languages or conditions. A model robust in English degrades unpredictably in other languages under the same acoustic stress.
What is Slot Error Rate and why does it matter more than WER for voice agents? Slot Error Rate (SER) measures transcription accuracy specifically on the named entities that matter to downstream tasks: product names, account numbers, dates, addresses. SER consistently exceeds WER by 6–12 percentage points (Deepgram, 2026) because domain vocabulary is underrepresented in ASR training data. A voice agent booking a flight can tolerate filler words being transcribed imperfectly; it cannot tolerate the flight number being wrong.
How should teams evaluate ASR for voice agent production readiness? Use domain-specific test sets with real recordings from your target channel (phone, web, mobile). Measure Slot Error Rate alongside WER, test at your target streaming latency, include disfluency-heavy samples and non-native accents, and simulate your noisiest expected conditions. Benchmark WER on LibriSpeech predicts almost nothing about whether your voice agent will work.
Further reading
- Back to Basics: Revisiting ASR in the Age of Voice Agents — the paper this post is built on (arXiv:2603.25727, Boson AI, March 2026)
- ASR Buyer’s Guide: From Benchmarks to Production Tests — Deepgram’s practitioner guide on what to measure beyond WER
- Slot Error Rate: A Developer’s Guide to ASR Accuracy — why SER matters more than WER for domain-specific agents
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch