Streaming ASR Architecture
Why batch ASR won’t work for voice assistants, and how streaming models transcribe speech as you speak in under 200ms.
Technical deep-dives into speech and audio ML systems, from research foundations to production deployment. Covers ASR, TTS, speaker recognition, and conversational AI with both theoretical insights and practical engineering.
From fundamentals to the production frontier:
Each post includes:
Automatic Speech Recognition (ASR):
Speech Classification:
Privacy & On-Device:
Audio Processing:
Real-time Processing:
Text-to-Speech (TTS):
Speaker Technology:
Speech Separation:
Pipeline Optimization:
Conversational AI:
Security & Abuse:
Model Optimization:
Speech MLOps:
Adaptive Systems:
Speech Model Design:
Transfer Learning & Domain Adaptation:
Model Deployment:
Search & Retrieval:
Pipeline Orchestration:
Below you’ll find all speech technology posts in chronological order:
The sub-100ms TTS race: TADA, VoXtream2, and why latency is the new quality
Long-form TTS is broken: how TADA and Borderless are fixing it
MamTra: the Mamba-Transformer hybrid that cuts TTS VRAM by 34%
τ-Voice benchmark: what full-duplex voice agents actually get wrong
ASR is solved (on benchmarks): the real-world gap every voice agent team hits
Llasa: what happens when you apply inference-time compute to speech synthesis
Microsoft VibeVoice: how ultra-low frame rate tokenization solved 90-minute multi-speaker TTS
WildASR: why your ASR benchmarks are contaminated with synthetic speech
Content created with the assistance of large language models and reviewed for technical accuracy.
How voice assistants recognize “turn on the lights” from raw audio in under 100ms without full ASR transcription.
How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.
How voice assistants and video conferencing apps detect when you’re speaking vs silence, the critical first step in every speech pipeline.
How voice assistants recognize who’s speaking, the biometric authentication powering “Hey Alexa” and personalized experiences.
From text to natural speech: understanding modern neural TTS architectures that power Alexa, Google Assistant, and Siri.
Clean audio is the foundation of robust speech systems – master preprocessing pipelines that handle real-world noise and variability.
Build real-time speech processing pipelines that handle audio streams with minimal latency for live transcription and voice interfaces.
Build lightweight models that detect specific keywords in audio streams with minimal latency and power consumption for voice interfaces.
Build systems that enhance voice quality by removing noise, improving intelligibility, and optimizing audio for speech applications.
Separate overlapping speakers with 99%+ accuracy: Deep learning solves the cocktail party problem for meeting transcription and voice assistants.
Build production multi-speaker ASR systems: Combine speech recognition, speaker diarization, and overlap handling for real-world conversations.
Optimize speech pipeline throughput by allocating compute to bottleneck stages using greedy resource management.
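The greedy idea behind that post can be sketched in a few lines: repeatedly give the next worker to whichever stage currently has the lowest throughput. The stage names and per-worker rates below are hypothetical, purely for illustration:

```python
def allocate_greedy(stages, total_workers):
    """Greedily assign workers to pipeline stages: each additional
    worker goes to the stage with the lowest current throughput
    (the bottleneck). `stages` maps name -> items/sec per worker."""
    workers = {name: 1 for name in stages}  # every stage needs at least one worker
    for _ in range(total_workers - len(stages)):
        # Current stage throughput = number of workers * per-worker rate.
        bottleneck = min(stages, key=lambda s: workers[s] * stages[s])
        workers[bottleneck] += 1
    return workers

# Hypothetical per-worker throughputs for a three-stage speech pipeline.
stages = {"vad": 50.0, "asr": 5.0, "punctuation": 20.0}
print(allocate_greedy(stages, 10))  # -> {'vad': 1, 'asr': 7, 'punctuation': 2}
```

Because pipeline throughput is capped by the slowest stage, pouring workers into the bottleneck first is the optimal greedy move at every step.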
Build production speech systems that combine multiple ASR/TTS models using backtracking-based selection strategies to achieve state-of-the-art accuracy.
Build production speaker diarization systems that cluster audio segments by speaker using embedding-based similarity and hash-based grouping.
Build production audio segmentation systems that detect boundaries in real-time using interval merging and temporal processing, the same principles from merg...
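The interval-merging core of that post fits in a short sketch: sort segments by start time and fuse any pair separated by less than a silence threshold. The timestamps and the 0.2 s gap below are made-up example values:

```python
def merge_segments(segments, gap=0.2):
    """Merge speech segments (start, end) in seconds whose silence
    gap is at most `gap`, yielding the final segment boundaries."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous segment
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]

# Hypothetical VAD output with a short pause inside one utterance.
print(merge_segments([(0.0, 1.4), (1.5, 2.9), (3.6, 4.2)]))
# -> [(0.0, 2.9), (3.6, 4.2)]: the 0.1s pause is merged, the 0.7s gap is kept
```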
Design distributed training pipelines for large-scale speech models that efficiently handle hundreds of thousands of hours of sequential audio data.
Use audio augmentation techniques to make speech models robust to noise, accents, channels, and real-world conditions, built on the same matrix/tensor transf...
Design experiment management systems tailored for speech research, tracking audio data, models, metrics, and multi-dimensional experiments at scale.
Design adaptive speech models that adjust in real-time to speakers, accents, noise, and domains, using the same greedy adaptation strategy as Jump Game and o...
Design neural architecture search systems for speech models that automatically discover optimal ASR/TTS architectures, using dynamic programming and path opt...
Strategies for building profitable speech recognition systems by optimizing the entire pipeline from signal processing to hardware.
Implementing the core decoding logic of modern Speech Recognition systems, handling alignment, blanks, and language models.
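The alignment-and-blank handling mentioned above is the CTC collapse rule: merge repeated frame labels, then drop blanks. A minimal greedy (best-path) decoder, with a hypothetical five-symbol vocabulary, looks like this; beam search with a language model builds on the same collapse step:

```python
def ctc_greedy_decode(frame_ids, blank=0, id_to_char=None):
    """Collapse a frame-level CTC alignment into text: merge
    consecutive repeats, then remove blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return "".join(id_to_char[i] for i in out) if id_to_char else out

# Hypothetical vocabulary; id 0 is the CTC blank.
vocab = {1: "c", 2: "a", 3: "t", 4: " "}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax ids
print(ctc_greedy_decode(frames, blank=0, id_to_char=vocab))  # -> "cat"
```

Note why the blank matters: without it, the repeated `1, 1` in `c-c` could not be distinguished from a genuine double letter after the merge step.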
The breakthrough that allows us to treat audio like text, enabling GPT-style models for speech.
How do we know if the audio sounds “good” without asking a human?
Real-time ASR is hard. Offline ASR is big.
Goodbye HMMs. Goodbye Phonemes. Goodbye Lexicons. We are teaching the machine to Listen, Attend, and Spell.
“Play Call Me Maybe”. Did you mean the song, the video, or the contact named ‘Maybe’?
“From broad categories to fine-grained speech understanding.”
“Building recommendation and moderation systems for voice-based social platforms.”
“Orchestrating complex speech processing pipelines from audio ingestion to final output.”
“Finding ‘Jon’ when the user types ‘John’, or ‘Symphony’ when they say ‘Simfoni’.”
“Deploying speech models close to users for low-latency voice experiences.”
“The brain of a task-oriented dialogue system: remembering what the user wants.”
“Knowing when to listen and when to stop.”
“One model to rule them all: ASR, Translation, and Understanding.”
“From waveforms to words, and back again.”
“Tuning speech models for peak performance.”
“Giving machines a voice.”
“Hey Siri, Alexa, OK Google: The gateway to voice AI.”
“Who spoke when? The art of untangling voices.”
“Turning acoustic probabilities into coherent text.”
“Extracting clear speech from the noise of the real world.”
“Speaking with someone else’s voice.”
“Teaching machines to hear feelings.”
“If you know how to pronounce ‘P’ in English, you’re 90% of the way to pronouncing ‘P’ in Portuguese.”
“A model that runs in a Jupyter notebook is an experiment. A model that runs on an iPhone is a product.”
“Spelling is irrelevant. Sound is everything.”
“Garbage in, Garbage out. Silence in, Hallucination out.”
“The model knows ‘Apple’ the fruit. It needs to learn ‘Apple’ the stock ticker.”
“Speech is biometric. Treat every waveform like a password; design systems that learn without listening.”
“If ASR is the brain, anomaly detection is the nervous system: it tells you when the audio reality changed.”
“If you don’t validate audio, you’ll debug ‘model regressions’ that are really microphone bugs.”
“Acoustic pattern matching is search, except your ‘strings’ are waveforms and your distance metric is learned.”
“Speech models are uniquely sensitive to temporal resolution. Neural Architecture Search (NAS) is the science of finding the perfect balance between time, fr...
“A speech model that doesn’t adapt is like a listener who doesn’t pay attention to who is speaking. Voice adaptation is about moving from ‘Universal Speech’ ...
“Scaling image models is about pixels; scaling speech models is about time. You cannot batch the past, and you cannot predict the future, you must process th...
“A voice assistant is more than a speech recognizer attached to a search engine. It is a stateful entity that must navigate the social nuances of human turn-...
“Hand-crafting speech architectures is reaching its limits. For the next generation of voice assistants, we don’t build the model, we define the search space...
“Speech models are computationally the most expensive per byte of input. Multi-tier caching is the only way to scale voice assistants to millions of users wi...
“If your voicebot can take actions, it’s an internet-facing production system; treat every utterance like untrusted input from an adversary.”
“Every month, your TTS vendor sends an invoice measured in characters. The same characters you could process on a $619 GPU.”
“Your TTS vendor’s latency number is a lie. Here’s how to read the fine print.”
“TTS demos always use one sentence. Ask yourself why.”
The moment a voice agent’s TTS model causes an OOM on the GPU that was running fine yesterday — because the conversation got longer, because you added a new ...
You build a voice agent, test it with your own voice in a quiet room, and it sounds great. Then it hits users and you discover the agent loses track of domai...
Voice cloning used to be a data problem. Record 30 minutes of audio. Maybe an hour. Feed it to a fine-tuning pipeline. Wait. That was the standard recipe in ...
TL;DR: Standard ASR benchmarks test clean, read speech in studio conditions. Voice agents operate on noisy phone channels, disfluency-laden conversation, and...
TL;DR: Llasa (arXiv:2502.04128, HKUST, February 2025) applies inference-time compute scaling to text-to-speech: instead of always taking the single most like...
TL;DR: VibeVoice (Microsoft, MIT license) generates up to 90 minutes of multi-speaker audio with 4 distinct voices, achieving MOS 3.76 on the 7B model and 1....
TL;DR: Major ASR benchmarks contain TTS-generated speech, inflating reported accuracy. WildASR (arXiv:2603.25727) is the first dataset built entirely from ...