WildASR: why your ASR benchmarks are contaminated with synthetic speech

TL;DR — Major ASR benchmarks contain TTS-generated speech, inflating reported accuracy. WildASR (arXiv 2603.25727) is the first evaluation dataset built entirely from real human speech, with no TTS augmentation. When SOTA models are tested on uncontaminated data, the gap between benchmark claims and real-world performance widens. This extends the earlier ASR benchmark gap analysis by supplying the mechanism: contamination, not just domain mismatch.
The benchmark numbers behind your deployment decision are wrong
Your voice agent team picked an ASR model. You compared word error rates across LibriSpeech, Common Voice, and FLEURS. Model A achieved 3.2% WER. Model B achieved 4.1%. You picked Model A. In production, Model A delivers 12-15% WER on real customer calls.
The ASR benchmark gap post documented this pattern: production WER consistently exceeds benchmark WER by 3-5x. The explanation offered was domain mismatch — benchmarks use clean, scripted audio while production has noise, accents, and disfluencies.
WildASR (arXiv 2603.25727, March 2026) identifies a deeper problem: the benchmarks themselves are contaminated with TTS-generated speech. Models trained on these benchmarks are not just missing domain-specific patterns — they are actively optimized for synthetic audio characteristics that do not exist in real human speech.
How TTS contamination enters benchmarks
Modern ASR datasets are built through a pipeline: collect text, generate audio, transcribe, verify. The economics push this pipeline toward synthetic audio: recording and transcribing real human speech costs $20-50 per audio hour, while generating synthetic speech from text costs effectively nothing at scale.
The contamination path:
- Dataset builders generate “augmentation” audio using TTS systems to increase dataset size
- TTS-generated audio has artificial characteristics: perfectly timed pauses, no disfluencies, consistent prosody, uniform recording quality
- Models trained on this data learn to exploit these artificial patterns for higher accuracy
- When evaluated on test sets from the same pipeline, these artificial patterns persist in the test data
- The reported WER reflects performance on partially synthetic test data, not real human speech
WildASR’s contribution: a dataset built entirely from real human speech, with no TTS augmentation at any stage. The audio contains the characteristics TTS cannot replicate: natural disfluencies (um, uh, false starts), overlapping talk, irregular prosody, ambient noise variation, and accent diversity.
What happens when SOTA models meet real speech
When models that claim SOTA word error rates on standard benchmarks are evaluated on WildASR, two patterns emerge:
Pattern 1: Universal WER degradation. Every model performs worse on WildASR than on standard benchmarks. This is expected — real speech is harder than clean benchmarks. But the magnitude of degradation varies by model, revealing which models have overfit to synthetic patterns and which generalize better.
Pattern 2: Rank order changes. The model that ranks first on LibriSpeech does not always rank first on WildASR. Some models that score well on clean benchmarks degrade more sharply on real speech than competitors with lower benchmark scores. This means benchmark rankings are partially misleading for production deployment decisions.
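To make Pattern 2 measurable, you can compute a rank correlation between the two leaderboards. Here is a minimal sketch using scipy; the model names and WER values are hypothetical placeholders, not numbers from the paper:

```python
from scipy.stats import kendalltau

# Hypothetical WERs (%) for three candidate models; lower is better.
benchmark_wer = {"model_a": 3.2, "model_b": 4.1, "model_c": 4.5}
wildasr_wer = {"model_a": 11.8, "model_b": 9.3, "model_c": 10.1}

models = sorted(benchmark_wer)  # fixed order so the two score lists align
tau, _ = kendalltau(
    [benchmark_wer[m] for m in models],
    [wildasr_wer[m] for m in models],
)
# tau = 1.0 would mean benchmark rank order fully predicts WildASR rank
# order; here tau is negative because the best benchmark model degrades most.
print(f"Kendall tau between rankings: {tau:.2f}")
```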
| Evaluation scenario | What it reveals |
|---|---|
| Standard benchmark WER | How well the model handles clean, partially synthetic audio |
| WildASR WER | How well the model handles real human speech |
| Gap between the two | How much the model relies on synthetic audio patterns |
A model with a small gap between benchmark and WildASR performance is more production-ready than a model with lower benchmark WER but a larger gap. The gap is your contamination exposure.
What this means for the new model announcements
Microsoft launched MAI-Transcribe-1 in April 2026, claiming it beats Whisper large-v3 across 25 language benchmarks with 3.8% WER on FLEURS and 2.5x faster inference. Cohere released Transcribe with 2 billion parameters and open-source weights for 14 languages. Deepgram’s Nova-3 continues as the production default for many voice agent teams.
WildASR raises a question for each: are the claimed improvements real-speech improvements or synthetic-benchmark improvements?
This is not an accusation of dishonesty. The benchmarks these claims are based on are the accepted standard. The problem is that the standard itself is contaminated. A model that improves FLEURS WER by 0.5 points might have achieved that improvement by better recognizing TTS prosody patterns — an improvement that transfers poorly to production, where prosody is irregular.
The honest evaluation for any new ASR model release: run it on your actual production audio. Not a standard benchmark. Not WildASR (though that is better). Your actual calls, your actual speakers, your actual noise environment. The gap between benchmark claims and your production WER is the only number that matters for your deployment.
How to evaluate ASR models for production voice agents
Build a production test set. Record 100-200 representative calls from your actual pipeline. Transcribe them manually, using a second annotator to adjudicate disputed segments. This is your ground truth. Run every candidate model against this set and compare WER.
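Here is a minimal sketch of the comparison step using the open-source jiwer library. The transcript strings are illustrative; in practice the references come from your annotators and the hypotheses from each candidate model:

```python
import jiwer

# Ground truth from manual transcription of real production calls, paired
# with one candidate model's output for the same audio.
references = [
    "um so i was calling about the uh the invoice from last month",
    "yeah can you can you hear me okay",
]
hypotheses = [
    "so i was calling about the invoice from last month",
    "yeah can you hear me okay",
]

def normalize(text: str) -> str:
    # Light normalization so WER reflects recognition quality, not casing
    # or whitespace. Keep fillers: dropping "um"/"uh" is exactly the kind
    # of error a contamination-exposed model makes.
    return " ".join(text.lower().split())

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"Production test set WER: {wer:.1%}")
```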
Measure contamination exposure. Run each candidate model on both a standard benchmark and WildASR. The delta between the two scores is the model’s contamination exposure — how much of its benchmark performance depends on synthetic audio patterns.
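As a sketch of this bookkeeping, with placeholder scores you would replace with your own measurements:

```python
# Hypothetical WER measurements (%) per candidate; lower is better.
candidates = {
    "model_a": {"benchmark": 3.2, "wildasr": 11.8},
    "model_b": {"benchmark": 4.1, "wildasr": 9.3},
}

for name, wer in candidates.items():
    exposure = wer["wildasr"] - wer["benchmark"]
    print(f"{name}: contamination exposure = {exposure:.1f} WER points")
# model_a: 8.6 points, model_b: 5.2 points. model_b leans less on
# synthetic audio patterns despite its worse benchmark score.
```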
Test disfluency handling explicitly. Record samples with common production speech patterns: false starts, filler words (um, uh, like), self-corrections, overlapping speakers. Models with high contamination exposure systematically fail on these because TTS does not generate them.
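A direct probe is to measure how many filler tokens in the reference survive into the hypothesis. A rough sketch, with an illustrative filler list and example transcripts:

```python
from collections import Counter

FILLERS = {"um", "uh", "like", "er"}

def filler_recall(reference: str, hypothesis: str) -> float:
    """Fraction of filler tokens in the reference that the model kept."""
    ref_fillers = [t for t in reference.lower().split() if t in FILLERS]
    if not ref_fillers:
        return 1.0
    hyp_counts = Counter(t for t in hypothesis.lower().split() if t in FILLERS)
    kept = 0
    for t in ref_fillers:
        if hyp_counts[t] > 0:
            hyp_counts[t] -= 1
            kept += 1
    return kept / len(ref_fillers)

# A contamination-exposed model tends to "clean up" speech it never heard.
print(filler_recall(
    "um so i uh wanted to ask about like the refund",
    "so i wanted to ask about the refund",
))  # 0.0 -- every filler token was silently dropped
```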
Weight production WER over benchmark WER. When comparing models, your production test set WER is the decision metric. Benchmark WER is background context. A model with 5% benchmark WER and 14% production WER is worse for your deployment than a model with 7% benchmark WER and 10% production WER.
Re-evaluate after model updates. ASR providers update models regularly. Each update can shift the contamination exposure. Re-run your production test set after every model version change.
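One lightweight way to enforce this is a CI regression test that fails when a provider-side update pushes production WER past your budget. A sketch, assuming a `production_eval.json` file regenerated by rerunning the current model version on your test set (the filename and budget are placeholders):

```python
import json

import jiwer

WER_BUDGET = 0.12  # maximum acceptable WER on the production test set

def test_production_wer_within_budget():
    # Assumed format: {"references": [...], "hypotheses": [...]}, with the
    # two lists aligned call-for-call.
    with open("production_eval.json") as f:
        data = json.load(f)
    wer = jiwer.wer(data["references"], data["hypotheses"])
    assert wer <= WER_BUDGET, f"WER regressed to {wer:.1%} after model update"
```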
Key takeaways
- Major ASR benchmarks contain TTS-generated speech, causing models trained on them to optimize for synthetic audio patterns rather than real human speech (arXiv 2603.25727)
- WildASR is the first evaluation dataset built entirely from real human speech, with no TTS augmentation
- Models that rank first on standard benchmarks do not always rank first on WildASR — benchmark rankings are partially misleading for production decisions
- The gap between a model’s benchmark WER and its WildASR WER is its contamination exposure — how much performance depends on synthetic audio patterns
- New model announcements (MAI-Transcribe-1, Cohere Transcribe, Deepgram Nova-3) should be evaluated on production audio, not the contaminated standard benchmarks they claim SOTA on
- Build a production test set from your actual calls and use that as the decision metric, treating benchmark WER as background context only
FAQ
What is benchmark contamination in ASR? TTS-generated speech in training and evaluation datasets causes models to optimize for synthetic audio characteristics: perfect timing, no disfluencies, consistent prosody, uniform quality. Reported WER on contaminated benchmarks overstates performance on real human speech.
What is WildASR? WildASR (arXiv 2603.25727) is an evaluation dataset of real human speech with no TTS augmentation. It captures natural disfluencies, accents, overlapping talk, and ambient noise that synthetic benchmarks miss. It provides a contamination-free baseline for model evaluation.
Does this mean Whisper and MAI-Transcribe-1 benchmark numbers are fake? No. The benchmark numbers are real measurements on real test sets. The issue is that the test sets contain synthetic speech that inflates accuracy relative to real-world performance. The models perform exactly as measured on those benchmarks — the benchmarks just do not represent production conditions as well as claimed.
How do I build a production ASR test set? Record 100-200 representative calls from your pipeline. Transcribe manually with 2+ annotators for quality. Use this as ground truth to evaluate any ASR model for your specific deployment. Update the test set quarterly as your caller demographics or audio environment changes.
Further reading
- ASR is solved (on benchmarks): the real-world gap every voice agent team hits — the broader benchmark-production gap
- Voice agent architecture — where ASR fits in the agent pipeline
- Speech quality monitoring — production audio quality measurement
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch