WildASR: why your ASR benchmarks are contaminated with synthetic speech

TL;DR — Major ASR benchmarks contain TTS-generated speech, inflating reported accuracy. WildASR (arXiv 2603.25727) is the first evaluation dataset built entirely from real human speech, with no TTS augmentation. When SOTA models are tested on uncontaminated data, the gap between benchmark claims and real-world performance widens. This extends the earlier ASR benchmark gap analysis by supplying the mechanism: contamination, not just domain mismatch.
The benchmark numbers behind your deployment decision are wrong
Your voice agent team picked an ASR model. You compared word error rates across LibriSpeech, Common Voice, and FLEURS. Model A achieved 3.2% WER. Model B achieved 4.1%. You picked Model A. In production, Model A delivers 12-15% WER on real customer calls.
The ASR benchmark gap post documented this pattern: production WER consistently exceeds benchmark WER by 3-5x. The explanation offered was domain mismatch — benchmarks use clean, scripted audio while production has noise, accents, and disfluencies.
WildASR (arXiv 2603.25727, March 2026) identifies a deeper problem: the benchmarks themselves are contaminated with TTS-generated speech. Models trained on these benchmarks are not just missing domain-specific patterns — they are actively optimized for synthetic audio characteristics that do not exist in real human speech.
How TTS contamination enters benchmarks
Modern ASR datasets are built through a pipeline: collect text, generate audio, transcribe, verify. The economics push this pipeline toward synthetic audio: recording and transcribing real human speech costs $20-50 per audio hour, while generating synthetic speech from text costs effectively nothing at scale.
The contamination path:
- Dataset builders generate “augmentation” audio using TTS systems to increase dataset size
- TTS-generated audio has artificial characteristics: perfectly timed pauses, no disfluencies, consistent prosody, uniform recording quality
- Models trained on this data learn to exploit these artificial patterns for higher accuracy
- When evaluated on test sets from the same pipeline, these artificial patterns persist in the test data
- The reported WER reflects performance on partially synthetic test data, not real human speech
WildASR’s contribution: a dataset built entirely from real human speech, with no TTS augmentation at any stage. The audio contains the characteristics TTS cannot replicate: natural disfluencies (um, uh, false starts), overlapping talk, irregular prosody, ambient noise variation, and accent diversity.
What happens when SOTA models meet real speech
When models that claim SOTA word error rates on standard benchmarks are evaluated on WildASR, two patterns emerge:
Pattern 1: Universal WER degradation. Every model performs worse on WildASR than on standard benchmarks. This is expected — real speech is harder than clean benchmarks. But the magnitude of degradation varies by model, revealing which models have overfit to synthetic patterns and which generalize better.
Pattern 2: Rank order changes. The model that ranks first on LibriSpeech does not always rank first on WildASR. Some models that score well on clean benchmarks degrade more sharply on real speech than competitors with lower benchmark scores. This means benchmark rankings are partially misleading for production deployment decisions.
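To make Pattern 2 measurable, you can compute a rank correlation between the two leaderboards. Here is a minimal sketch using scipy; the model names and WER values are hypothetical placeholders, not numbers from the paper:

```python
from scipy.stats import kendalltau

# Hypothetical WERs (%) for three candidate models; lower is better.
benchmark_wer = {"model_a": 3.2, "model_b": 4.1, "model_c": 4.5}
wildasr_wer = {"model_a": 11.8, "model_b": 9.3, "model_c": 10.1}

models = sorted(benchmark_wer)  # fixed order so the two score lists align
tau, _ = kendalltau(
    [benchmark_wer[m] for m in models],
    [wildasr_wer[m] for m in models],
)
# tau = 1.0 would mean benchmark rank order fully predicts WildASR rank
# order; here tau is negative because the best benchmark model degrades most.
print(f"Kendall tau between rankings: {tau:.2f}")
```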
| Evaluation scenario | What it reveals |
|---|---|
| Standard benchmark WER | How well the model handles clean, partially synthetic audio |
| WildASR WER | How well the model handles real human speech |
| Gap between the two | How much the model relies on synthetic audio patterns |
A model with a small gap between benchmark and WildASR performance is more production-ready than a model with lower benchmark WER but a larger gap. The gap is your contamination exposure.
What this means for the new model announcements
Microsoft launched MAI-Transcribe-1 in April 2026, claiming it beats Whisper large-v3 across 25 language benchmarks with 3.8% WER on FLEURS and 2.5x faster inference. Cohere released Transcribe with 2 billion parameters and open-source weights for 14 languages. Deepgram’s Nova-3 continues as the production default for many voice agent teams.
WildASR raises a question for each: are the claimed improvements real-speech improvements or synthetic-benchmark improvements?
This is not an accusation of dishonesty. The benchmarks these claims are based on are the accepted standard. The problem is that the standard itself is contaminated. A model that improves FLEURS WER by 0.5 points might have achieved that improvement by better recognizing TTS prosody patterns — an improvement that transfers poorly to production, where prosody is irregular.
The honest evaluation for any new ASR model release: run it on your actual production audio. Not a standard benchmark. Not WildASR (though that is better). Your actual calls, your actual speakers, your actual noise environment. The gap between benchmark claims and your production WER is the only number that matters for your deployment.
How to evaluate ASR models for production voice agents
Build a production test set. Record 100-200 representative calls from your actual pipeline. Transcribe them manually, using a second annotator to adjudicate disputed segments. This is your ground truth. Run every candidate model against this set and compare WER.
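Here is a minimal sketch of the comparison step using the open-source jiwer library. The transcript strings are illustrative; in practice the references come from your annotators and the hypotheses from each candidate model:

```python
import jiwer

# Ground truth from manual transcription of real production calls, paired
# with one candidate model's output for the same audio.
references = [
    "um so i was calling about the uh the invoice from last month",
    "yeah can you can you hear me okay",
]
hypotheses = [
    "so i was calling about the invoice from last month",
    "yeah can you hear me okay",
]

def normalize(text: str) -> str:
    # Light normalization so WER reflects recognition quality, not casing
    # or whitespace. Keep fillers: dropping "um"/"uh" is exactly the kind
    # of error a contamination-exposed model makes.
    return " ".join(text.lower().split())

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"Production test set WER: {wer:.1%}")
```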
Measure contamination exposure. Run each candidate model on both a standard benchmark and WildASR. The delta between the two scores is the model’s contamination exposure — how much of its benchmark performance depends on synthetic audio patterns.
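As a sketch of this bookkeeping, with placeholder scores you would replace with your own measurements:

```python
# Hypothetical WER measurements (%) per candidate; lower is better.
candidates = {
    "model_a": {"benchmark": 3.2, "wildasr": 11.8},
    "model_b": {"benchmark": 4.1, "wildasr": 9.3},
}

for name, wer in candidates.items():
    exposure = wer["wildasr"] - wer["benchmark"]
    print(f"{name}: contamination exposure = {exposure:.1f} WER points")
# model_a: 8.6 points, model_b: 5.2 points. model_b leans less on
# synthetic audio patterns despite its worse benchmark score.
```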
Test disfluency handling explicitly. Record samples with common production speech patterns: false starts, filler words (um, uh, like), self-corrections, overlapping speakers. Models with high contamination exposure systematically fail on these because TTS does not generate them.
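A direct probe is to measure how many filler tokens in the reference survive into the hypothesis. A rough sketch, with an illustrative filler list and example transcripts:

```python
from collections import Counter

FILLERS = {"um", "uh", "like", "er"}

def filler_recall(reference: str, hypothesis: str) -> float:
    """Fraction of filler tokens in the reference that the model kept."""
    ref_fillers = [t for t in reference.lower().split() if t in FILLERS]
    if not ref_fillers:
        return 1.0
    hyp_counts = Counter(t for t in hypothesis.lower().split() if t in FILLERS)
    kept = 0
    for t in ref_fillers:
        if hyp_counts[t] > 0:
            hyp_counts[t] -= 1
            kept += 1
    return kept / len(ref_fillers)

# A contamination-exposed model tends to "clean up" speech it never heard.
print(filler_recall(
    "um so i uh wanted to ask about like the refund",
    "so i wanted to ask about the refund",
))  # 0.0 -- every filler token was silently dropped
```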
Weight production WER over benchmark WER. When comparing models, your production test set WER is the decision metric. Benchmark WER is background context. A model with 5% benchmark WER and 14% production WER is worse for your deployment than a model with 7% benchmark WER and 10% production WER.
Re-evaluate after model updates. ASR providers update models regularly. Each update can shift the contamination exposure. Re-run your production test set after every model version change.
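One lightweight way to enforce this is a CI regression test that fails when a provider-side update pushes production WER past your budget. A sketch, assuming a `production_eval.json` file regenerated by rerunning the current model version on your test set (the filename and budget are placeholders):

```python
import json

import jiwer

WER_BUDGET = 0.12  # maximum acceptable WER on the production test set

def test_production_wer_within_budget():
    # Assumed format: {"references": [...], "hypotheses": [...]}, with the
    # two lists aligned call-for-call.
    with open("production_eval.json") as f:
        data = json.load(f)
    wer = jiwer.wer(data["references"], data["hypotheses"])
    assert wer <= WER_BUDGET, f"WER regressed to {wer:.1%} after model update"
```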
Key takeaways
- Major ASR benchmarks contain TTS-generated speech, causing models trained on them to optimize for synthetic audio patterns rather than real human speech (arXiv 2603.25727)
- WildASR is the first evaluation dataset built entirely from real human speech, with no TTS augmentation
- Models that rank first on standard benchmarks do not always rank first on WildASR — benchmark rankings are partially misleading for production decisions
- The gap between a model’s benchmark WER and its WildASR WER is its contamination exposure — how much performance depends on synthetic audio patterns
- New model announcements (MAI-Transcribe-1, Cohere Transcribe, Deepgram Nova-3) should be evaluated on production audio, not the contaminated standard benchmarks they claim SOTA on
- Build a production test set from your actual calls and use that as the decision metric, treating benchmark WER as background context only
FAQ
What is benchmark contamination in ASR? TTS-generated speech in training and evaluation datasets causes models to optimize for synthetic audio characteristics: perfect timing, no disfluencies, consistent prosody, uniform quality. Reported WER on contaminated benchmarks overstates performance on real human speech.
What is WildASR? WildASR (arXiv 2603.25727) is an evaluation dataset of real human speech with no TTS augmentation. It captures natural disfluencies, accents, overlapping talk, and ambient noise that synthetic benchmarks miss. It provides a contamination-free baseline for model evaluation.
Does this mean Whisper and MAI-Transcribe-1 benchmark numbers are fake? No. The benchmark numbers are real measurements on real test sets. The issue is that the test sets contain synthetic speech that inflates accuracy relative to real-world performance. The models perform exactly as measured on those benchmarks — the benchmarks just do not represent production conditions as well as claimed.
How do I build a production ASR test set? Record 100-200 representative calls from your pipeline. Transcribe manually with 2+ annotators for quality. Use this as ground truth to evaluate any ASR model for your specific deployment. Update the test set quarterly as your caller demographics or audio environment changes.
Further reading
- ASR is solved (on benchmarks): the real-world gap every voice agent team hits — the broader benchmark-production gap
- Voice agent architecture — where ASR fits in the agent pipeline
- Speech quality monitoring — production audio quality measurement
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch