Streaming ASR Architecture
Why batch ASR won’t work for voice assistants, and how streaming models transcribe speech as you speak in under 200ms.
Technical deep-dives into speech and audio ML systems, from research foundations to production deployment. Covers ASR, TTS, speaker recognition, and conversational AI with both theoretical insights and practical engineering.
From fundamentals to the production frontier:
Each post includes:
Automatic Speech Recognition (ASR):
Speech Classification:
Privacy & On-Device:
Audio Processing:
Real-time Processing:
Text-to-Speech (TTS):
Speaker Technology:
Speech Separation:
Pipeline Optimization:
Conversational AI:
Security & Abuse:
Model Optimization:
Speech MLOps:
Adaptive Systems:
Speech Model Design:
Transfer Learning & Domain Adaptation:
Model Deployment:
Search & Retrieval:
Pipeline Orchestration:
Below you’ll find all speech technology posts in chronological order:
The sub-100ms TTS race: TADA, VoXtream2, and why latency is the new quality
Long-form TTS is broken: how TADA and Borderless are fixing it
MamTra: the Mamba-Transformer hybrid that cuts TTS VRAM by 34%
τ-Voice benchmark: what full-duplex voice agents actually get wrong
ASR is solved (on benchmarks): the real-world gap every voice agent team hits
Llasa: what happens when you apply inference-time compute to speech synthesis
Microsoft VibeVoice: how ultra-low frame rate tokenization solved 90-minute multi-speaker TTS
WildASR: why your ASR benchmarks are contaminated with synthetic speech
Content created with the assistance of large language models and reviewed for technical accuracy.
How voice assistants recognize “turn on the lights” from raw audio in under 100ms without full ASR transcription.
How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.
How voice assistants and video conferencing apps detect when you’re speaking vs silence, the critical first step in every speech pipeline.
How voice assistants recognize who’s speaking, the biometric authentication powering “Hey Alexa” and personalized experiences.
From text to natural speech: understanding modern neural TTS architectures that power Alexa, Google Assistant, and Siri.
Clean audio is the foundation of robust speech systems – master preprocessing pipelines that handle real-world noise and variability.
Build real-time speech processing pipelines that handle audio streams with minimal latency for live transcription and voice interfaces.
Build lightweight models that detect specific keywords in audio streams with minimal latency and power consumption for voice interfaces.
Build systems that enhance voice quality by removing noise, improving intelligibility, and optimizing audio for speech applications.
Separate overlapping speakers with 99%+ accuracy: Deep learning solves the cocktail party problem for meeting transcription and voice assistants.
Build production multi-speaker ASR systems: Combine speech recognition, speaker diarization, and overlap handling for real-world conversations.
Optimize speech pipeline throughput by allocating compute to bottleneck stages using greedy resource management.
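The greedy idea behind that post can be sketched in a few lines: repeatedly give the next worker to whichever stage currently has the lowest throughput. The stage names and per-worker rates below are hypothetical, purely for illustration:

```python
def allocate_greedy(stages, total_workers):
    """Greedily assign workers to pipeline stages: each additional
    worker goes to the stage with the lowest current throughput
    (the bottleneck). `stages` maps name -> items/sec per worker."""
    workers = {name: 1 for name in stages}  # every stage needs at least one worker
    for _ in range(total_workers - len(stages)):
        # Current stage throughput = number of workers * per-worker rate.
        bottleneck = min(stages, key=lambda s: workers[s] * stages[s])
        workers[bottleneck] += 1
    return workers

# Hypothetical per-worker throughputs for a three-stage speech pipeline.
stages = {"vad": 50.0, "asr": 5.0, "punctuation": 20.0}
print(allocate_greedy(stages, 10))  # -> {'vad': 1, 'asr': 7, 'punctuation': 2}
```

Because pipeline throughput is capped by the slowest stage, pouring workers into the bottleneck first is the optimal greedy move at every step.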
Build production speech systems that combine multiple ASR/TTS models using backtracking-based selection strategies to achieve state-of-the-art accuracy.
Build production speaker diarization systems that cluster audio segments by speaker using embedding-based similarity and hash-based grouping.
Build production audio segmentation systems that detect boundaries in real-time using interval merging and temporal processing, the same principles from merg...
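The interval-merging core of that post fits in a short sketch: sort segments by start time and fuse any pair separated by less than a silence threshold. The timestamps and the 0.2 s gap below are made-up example values:

```python
def merge_segments(segments, gap=0.2):
    """Merge speech segments (start, end) in seconds whose silence
    gap is at most `gap`, yielding the final segment boundaries."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous segment
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]

# Hypothetical VAD output with a short pause inside one utterance.
print(merge_segments([(0.0, 1.4), (1.5, 2.9), (3.6, 4.2)]))
# -> [(0.0, 2.9), (3.6, 4.2)]: the 0.1s pause is merged, the 0.7s gap is kept
```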
Design distributed training pipelines for large-scale speech models that efficiently handle hundreds of thousands of hours of sequential audio data.
Use audio augmentation techniques to make speech models robust to noise, accents, channels, and real-world conditions, built on the same matrix/tensor transf...
Design experiment management systems tailored for speech research, tracking audio data, models, metrics, and multi-dimensional experiments at scale.
Design adaptive speech models that adjust in real-time to speakers, accents, noise, and domains, using the same greedy adaptation strategy as Jump Game and o...
Design neural architecture search systems for speech models that automatically discover optimal ASR/TTS architectures, using dynamic programming and path opt...
Strategies for building profitable speech recognition systems by optimizing the entire pipeline from signal processing to hardware.
Implementing the core decoding logic of modern Speech Recognition systems, handling alignment, blanks, and language models.
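The alignment-and-blank handling mentioned above is the CTC collapse rule: merge repeated frame labels, then drop blanks. A minimal greedy (best-path) decoder, with a hypothetical five-symbol vocabulary, looks like this; beam search with a language model builds on the same collapse step:

```python
def ctc_greedy_decode(frame_ids, blank=0, id_to_char=None):
    """Collapse a frame-level CTC alignment into text: merge
    consecutive repeats, then remove blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return "".join(id_to_char[i] for i in out) if id_to_char else out

# Hypothetical vocabulary; id 0 is the CTC blank.
vocab = {1: "c", 2: "a", 3: "t", 4: " "}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax ids
print(ctc_greedy_decode(frames, blank=0, id_to_char=vocab))  # -> "cat"
```

Note why the blank matters: without it, the repeated `1, 1` in `c-c` could not be distinguished from a genuine double letter after the merge step.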
The breakthrough that allows us to treat audio like text, enabling GPT-style models for speech.
How do we know if the audio sounds “good” without asking a human?
Real-time ASR is hard. Offline ASR is big.
Goodbye HMMs. Goodbye Phonemes. Goodbye Lexicons. We are teaching the machine to Listen, Attend, and Spell.
“Play Call Me Maybe”. Did you mean the song, the video, or the contact named ‘Maybe’?
“From broad categories to fine-grained speech understanding.”
“Building recommendation and moderation systems for voice-based social platforms.”
“Orchestrating complex speech processing pipelines from audio ingestion to final output.”
“Finding ‘Jon’ when the user types ‘John’, or ‘Symphony’ when they say ‘Simfoni’.”
“Deploying speech models close to users for low-latency voice experiences.”
“The brain of a task-oriented dialogue system: remembering what the user wants.”
“Knowing when to listen and when to stop.”
“One model to rule them all: ASR, Translation, and Understanding.”
“From waveforms to words, and back again.”
“Tuning speech models for peak performance.”
“Giving machines a voice.”
“Hey Siri, Alexa, OK Google: The gateway to voice AI.”
“Who spoke when? The art of untangling voices.”
“Turning acoustic probabilities into coherent text.”
“Extracting clear speech from the noise of the real world.”
“Speaking with someone else’s voice.”
“Teaching machines to hear feelings.”
“If you know how to pronounce ‘P’ in English, you’re 90% of the way to pronouncing ‘P’ in Portuguese.”
“A model that runs in a Jupyter notebook is an experiment. A model that runs on an iPhone is a product.”
“Spelling is irrelevant. Sound is everything.”
“Garbage in, Garbage out. Silence in, Hallucination out.”
“The model knows ‘Apple’ the fruit. It needs to learn ‘Apple’ the stock ticker.”
“Speech is biometric. Treat every waveform like a password; design systems that learn without listening.”
“If ASR is the brain, anomaly detection is the nervous system: it tells you when the audio reality changed.”
“If you don’t validate audio, you’ll debug ‘model regressions’ that are really microphone bugs.”
“Acoustic pattern matching is search, except your ‘strings’ are waveforms and your distance metric is learned.”
“Speech models are uniquely sensitive to temporal resolution. Neural Architecture Search (NAS) is the science of finding the perfect balance between time, fr...
“A speech model that doesn’t adapt is like a listener who doesn’t pay attention to who is speaking. Voice adaptation is about moving from ‘Universal Speech’ ...
“Scaling image models is about pixels; scaling speech models is about time. You cannot batch the past, and you cannot predict the future, you must process th...
“A voice assistant is more than a speech recognizer attached to a search engine. It is a stateful entity that must navigate the social nuances of human turn-...
“Hand-crafting speech architectures is reaching its limits. For the next generation of voice assistants, we don’t build the model, we define the search space...
“Speech models are computationally the most expensive per byte of input. Multi-tier caching is the only way to scale voice assistants to millions of users wi...
“If your voicebot can take actions, it’s an internet-facing production system; treat every utterance like untrusted input from an adversary.”
“Every month, your TTS vendor sends an invoice measured in characters. The same characters you could process on a $619 GPU.”
“Your TTS vendor’s latency number is a lie. Here’s how to read the fine print.”
“TTS demos always use one sentence. Ask yourself why.”
The moment a voice agent’s TTS model causes an OOM on the GPU that was running fine yesterday — because the conversation got longer, because you added a new ...
You build a voice agent, test it with your own voice in a quiet room, and it sounds great. Then it hits users and you discover the agent loses track of domai...
Voice cloning used to be a data problem. Record 30 minutes of audio. Maybe an hour. Feed it to a fine-tuning pipeline. Wait. That was the standard recipe in ...
TL;DR: Standard ASR benchmarks test clean, read speech in studio conditions. Voice agents operate on noisy phone channels, disfluency-laden conversation, and...
TL;DR: Llasa (arXiv:2502.04128, HKUST, February 2025) applies inference-time compute scaling to text-to-speech: instead of always taking the single most like...
TL;DR: VibeVoice (Microsoft, MIT license) generates up to 90 minutes of multi-speaker audio with 4 distinct voices, achieving MOS 3.76 on the 7B model and 1....
TL;DR: Major ASR benchmarks contain TTS-generated speech, inflating reported accuracy. WildASR (arXiv:2603.25727) is the first dataset built entirely from ...