
“Garbage in, Garbage out. Silence in, Hallucination out.”

TL;DR

A modern voice system is a cascade of tightly coupled models – VAD, speaker diarization, and ASR – that share time alignment and propagate errors downstream. Audio is grouped into “turns” using VAD triggers, then processed by diarization and ASR in parallel before merging results. Critical production considerations include 200ms audio padding before VAD boundaries, async parallel execution to meet latency budgets, and speculative execution with correction events for streaming. This is the orchestration layer that ties together ASR decoding, speech enhancement, and speaker diarization into a working product.

[Image: a row of precision dominoes arranged in a curved chain on a dark surface]

1. Problem Statement

A modern “Voice Assistant” is not one model. It is a Cascade of Models.

  1. VAD: Is someone speaking?
  2. Diarization: Who is speaking?
  3. ASR: What did they say?
  4. NLP/Entity: What does it mean?

The Problem: These models depend on each other. If VAD cuts off the first 50ms of a word, ASR fails. If ASR makes a typo, NLP fails. How do we orchestrate this pipeline efficiently?


2. Fundamentals: The Tightly Coupled Chain

Unlike generic microservices where Service A calls Service B via HTTP, Speech pipelines share Time Alignment.

  • VAD output: User spoke [0.5s - 2.5s].
  • Diarization output: Speaker A at [0.4s - 2.6s].
  • The timestamps must match perfectly.
  • Error Propagation: A 10% error in VAD can cause a 50% error in Diarization (if it misses the speaker change).
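To make the shared time alignment concrete, here is a minimal sketch (the `overlap` and `align` helpers are illustrative, not a library API) that attributes a VAD turn to the diarization segment with the largest temporal overlap:

```python
def overlap(a, b):
    """Length of temporal overlap between two (start, end) spans, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align(vad_turn, diar_segments):
    """Pick the diarization segment that overlaps the VAD turn the most."""
    return max(diar_segments, key=lambda d: overlap(vad_turn, d["span"]))

vad_turn = (0.5, 2.5)
diar = [
    {"speaker": "A", "span": (0.4, 2.6)},
    {"speaker": "B", "span": (2.6, 4.0)},
]
best = align(vad_turn, diar)
# Speaker A covers the full 2.0s of the turn; Speaker B covers none of it.
```

If the timestamps drift apart (VAD says 0.5s, diarization says 0.9s), the overlap shrinks and the wrong speaker can win, which is exactly how a small VAD error becomes a large diarization error.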

3. High-Level Architecture

We can view this as a Stream Processing DAG.

graph LR
 A[Microphone Stream] --> B{VAD}
 B -- Silence --> C[Discard / Noise Profile Update]
 B -- Speech --> D[Buffer]
 D --> E["Speaker ID (Embedding)"]
 D --> F["ASR (Transcription)"]
 E --> G[Meeting Transcript Builder]
 F --> G

4. Component Deep-Dives

4.1 Voice Activity Detection (VAD)

The Gatekeeper.

  • Role: Save compute by ignoring silence. Prevent ASR from hallucinating on noise.
  • Latency: Must be < 10ms.
  • Models: Silero VAD (RNN), WebRTC VAD (GMM).
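To show what the gatekeeper does without pulling in Silero, here is a toy energy-threshold VAD. Real VADs (Silero, WebRTC) model spectral features, not raw energy; `frame_energy` and `is_speech` are illustrative names only:

```python
import math

def frame_energy(samples):
    """Root-mean-square energy of one audio frame (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples, threshold=0.05):
    """Crude gate: pass the frame downstream only if it is loud enough."""
    return frame_energy(samples) > threshold

silence = [0.001] * 160   # 10ms of near-silence at 16 kHz
tone = [0.3] * 160        # 10ms of loud signal
```

Even this toy version shows the cost argument: the check is a few arithmetic ops per frame, while everything behind the gate (ASR, diarization) is orders of magnitude more expensive.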

4.2 Speaker Diarization

The labeler.

  • Role: Assign “Speaker 1” vs “Speaker 2”.
  • Complexity: Clustering speaker embeddings is O(N^2) in the number of segments, which makes streaming hard.
  • Solution: “Online Diarization” keeps a centroid for active speakers and assigns new frames to nearest centroid.
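A minimal sketch of that centroid idea, assuming embeddings arrive as plain float vectors. The `OnlineDiarizer` class and its 0.7 threshold are illustrative, not a production recipe:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

class OnlineDiarizer:
    """Keep one centroid per active speaker; assign each new embedding to the
    nearest centroid, or spawn a new speaker when similarity is too low."""
    def __init__(self, threshold=0.7):
        self.centroids = []   # one running-mean embedding per speaker
        self.counts = []
        self.threshold = threshold

    def assign(self, emb):
        if self.centroids:
            best = max(range(len(self.centroids)),
                       key=lambda i: cosine(emb, self.centroids[i]))
            if cosine(emb, self.centroids[best]) >= self.threshold:
                n = self.counts[best]
                # Running-mean update keeps the centroid stable over time.
                self.centroids[best] = [(c * n + e) / (n + 1)
                                        for c, e in zip(self.centroids[best], emb)]
                self.counts[best] += 1
                return best
        self.centroids.append(list(emb))   # new speaker
        self.counts.append(1)
        return len(self.centroids) - 1
```

Assigning a frame to the nearest centroid is O(number of speakers), which is why this stays cheap enough for streaming where full clustering does not.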

4.3 ASR

The Heavy Lifter.

  • Takes the buffered audio from VAD.
  • Returns text + timestamps.

5. Data Flow: The “Turn” Concept

Processing audio byte-by-byte is inefficient for ASR. We group audio into Turns (Utterances).

  1. VAD detects Speech Start.
  2. Pipeline starts Accumulating audio into RAM.
  3. VAD detects Speech End (trailing silence > 500ms).
  4. Trigger: Send accumulated buffer to Diarization and ASR in parallel.
  5. Merge: Combine Speaker=John and Text="Hello".
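The turn logic above is a small state machine. A sketch, with the `TurnDetector` name and frame sizes as assumptions for illustration:

```python
class TurnDetector:
    """Close a turn only after trailing silence exceeds min_silence_ms,
    so short pauses inside a sentence don't split the utterance."""
    def __init__(self, frame_ms=20, min_silence_ms=500):
        self.silence_frames_needed = min_silence_ms // frame_ms
        self.buffer = []
        self.trailing_silence = 0

    def feed(self, frame, is_speech):
        if is_speech:
            self.buffer.append(frame)       # step 2: accumulate into RAM
            self.trailing_silence = 0
        elif self.buffer:
            self.trailing_silence += 1
            if self.trailing_silence >= self.silence_frames_needed:
                turn, self.buffer = self.buffer, []   # step 3: speech end
                self.trailing_silence = 0
                return turn   # step 4: hand to diarization + ASR in parallel
        return None
```

Note that the turn only closes when the silence counter crosses the threshold, not on the first quiet frame; otherwise every breath would end the turn.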

6. Model Selection & Trade-offs

Stage       | Model Option A (Fast)      | Model Option B (Accurate) | Selection Logic
----------- | -------------------------- | ------------------------- | ---------------
VAD         | WebRTC (GMM, CPU)          | Silero (NN)               | Use Silero. Accuracy is vital; the cost is low.
Diarization | Speaker Embedding (ResNet) | End-to-End (EEND)         | Use Embedding. EEND is too slow for real-time.
ASR         | Whisper-Tiny               | Whisper-Large             | Use Tiny for streaming, Large for final correction.

7. Implementation: The Pipeline Class

import torch

class SpeechPipeline:
    def __init__(self, frame_ms=30, min_silence_ms=500):
        self.vad_model = load_silero_vad()   # pseudocode: e.g. Silero via torch.hub
        self.asr_model = load_whisper()      # pseudocode: e.g. openai-whisper
        self.buffer = []
        # End the turn only after 500ms of trailing silence (Section 5),
        # not on the first silent frame.
        self.silence_frames = 0
        self.silence_frames_needed = min_silence_ms // frame_ms

    def process_frame(self, audio_chunk):
        # 1. Filter Silence
        speech_prob = self.vad_model(audio_chunk, 16000)

        if speech_prob > 0.5:
            self.buffer.append(audio_chunk)
            self.silence_frames = 0

        elif len(self.buffer) > 0:
            self.silence_frames += 1
            if self.silence_frames < self.silence_frames_needed:
                return  # short pause inside a sentence, keep waiting

            # Trailing silence exceeded the threshold -> End of Turn
            full_audio = torch.cat(self.buffer)
            self.buffer = []          # Reset for the next turn
            self.silence_frames = 0

            # 2. Trigger ASR (The Dependency)
            text = self.asr_model.transcribe(full_audio)

            print(f"Turn Complete: {text}")

8. Streaming Implications

In a true streaming pipeline, we cannot wait for “End of Turn”. We use Speculative Execution.

  1. ASR runs continuously on partial buffer: H -> He -> Hel -> Hello.
  2. Diarization runs every 1 second: Speaker A.
  3. Correction: If Diarization changes its mind (Speaker A -> Speaker B), we send a “Correction Event” to the UI to overwrite the previous line.
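A toy version of that correction flow, with hypothetical event tuples standing in for real UI messages:

```python
def stream_with_corrections(updates):
    """Yield ('partial' | 'correction', speaker, text) events.
    A speaker flip on an already-displayed line becomes a correction event
    that tells the UI to overwrite the previous line."""
    last_speaker = None
    for speaker, text in updates:
        if last_speaker is not None and speaker != last_speaker:
            yield ("correction", speaker, text)
        else:
            yield ("partial", speaker, text)
        last_speaker = speaker

events = list(stream_with_corrections([
    ("A", "H"), ("A", "He"), ("A", "Hello"),
    ("B", "Hello"),   # diarization changed its mind about the speaker
]))
```

The design trade-off is classic eventual consistency: the UI shows something fast, at the cost of occasionally having to retract it.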

9. Quality Metrics

  • DER (Diarization Error Rate): False Alarm + Missed Detection + Confusion.
  • CpWER (Concatenated Minimum-Permutation WER): WER computed per speaker and then combined, which reveals whether the model is biased against a particular speaker.
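A frame-level DER sketch. This ignores the scoring collar and the optimal speaker-permutation mapping a real DER scorer applies, and assumes aligned label sequences with None marking silence:

```python
def der(reference, hypothesis):
    """Frame-level DER = (false alarm + missed + confusion) / reference speech.
    Assumes at least one reference speech frame."""
    fa = miss = conf = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                miss += 1        # missed detection
            elif hyp != ref:
                conf += 1        # speaker confusion
        elif hyp is not None:
            fa += 1              # false alarm on a silent frame

    return (fa + miss + conf) / speech

ref = ["A", "A", "A", "B", "B", None]
hyp = ["A", "A", "B", "B", None, "A"]
# 1 confusion + 1 miss + 1 false alarm over 5 speech frames = 0.6
```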

10. Common Failure Modes

  1. The “Schrödinger’s Word”: A word at the boundary of a VAD cut.
    • User: “Important.”
    • VAD cuts at 0.1s.
    • Audio: “…portant.”
    • ASR: “Portent.”
    • Fix: Padding. Always keep 200ms of history before the VAD trigger.
  2. Overlapping Speech: Two people talk at once.
    • Standard ASR fails.
    • Standard Diarization fails.
    • Fix: Source Separation models (rarely used in production due to cost).
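The padding fix from failure mode 1 can be implemented as a short history buffer that gets prepended when VAD first fires. `PaddedBuffer` and the 10-frame history (~200ms at 20ms frames) are illustrative:

```python
from collections import deque

class PaddedBuffer:
    """Keep the last `history` frames of pre-speech audio so the turn
    starts ~200ms before the VAD trigger, not exactly at it."""
    def __init__(self, history=10):          # 10 x 20ms frames = 200ms
        self.history = deque(maxlen=history)
        self.turn = None

    def feed(self, frame, is_speech):
        if is_speech:
            if self.turn is None:
                self.turn = list(self.history)   # prepend the padding
            self.turn.append(frame)
        else:
            self.history.append(frame)           # silence refills the history
```

With this in place, even if VAD fires 100ms late on “Important”, the “Im-” onset is already sitting in the history buffer and reaches ASR.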

11. State-of-the-Art

Joint Models (Transducers). Instead of VAD -> ASR -> NLP, train one massive Transformer: Input: Audio. Output: <speaker:1> Hello <speaker:2> Hi there <sentiment:pos>. This removes the pipeline latency but makes modular upgrades impossible.


12. Key Takeaways

  1. VAD is critical: It is the “Trigger” for the whole DAG. If it’s flawed, the system is flawed.
  2. Padding saves lives: Never feed exact VAD boundaries to ASR. Add context.
  3. Latency Budget: If Total Latency limit is 500ms, and ASR takes 400ms, VAD+Diarization must happen in 100ms.
  4. Async Design: Run ASR and Diarization in parallel threads, not sequential, to minimize wall-clock time.
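Takeaway 4 in code: a sketch of running ASR and diarization in parallel on a completed turn with a thread pool. `fake_asr` and `fake_diarization` are stand-ins that just sleep for the duration of a model call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_asr(audio):
    time.sleep(0.1)            # stand-in for a ~100ms ASR call
    return "hello"

def fake_diarization(audio):
    time.sleep(0.1)            # stand-in for a ~100ms embedding + match
    return "Speaker A"

def process_turn(audio):
    # Launch both models at once; wall-clock time is roughly
    # max(ASR, diarization), not their sum.
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr = pool.submit(fake_asr, audio)
        diar = pool.submit(fake_diarization, audio)
        return diar.result(), asr.result()

start = time.perf_counter()
speaker, text = process_turn(b"...")
elapsed = time.perf_counter() - start
```

Sequentially this would cost ~200ms; in parallel it costs ~100ms, which is often the difference between fitting the latency budget and blowing it.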

FAQ

Why is VAD the most critical component in a speech pipeline?

VAD acts as the trigger for the entire processing DAG. It determines when to start buffering audio, when to invoke ASR and diarization, and when to finalize a turn. If VAD cuts off the beginning of a word, ASR receives truncated audio and produces errors. If it misses a speaker change boundary, diarization assigns the wrong speaker. A 10% VAD error rate can cascade to 50% downstream errors because every subsequent model depends on correct speech boundaries.

What is the ‘turn’ concept in speech pipeline processing?

A turn is a buffered utterance between a speech-start event and a speech-end event detected by VAD. Instead of processing audio frame-by-frame (which is inefficient for ASR), the pipeline accumulates audio into RAM during speech, then triggers parallel ASR and diarization when trailing silence exceeds a threshold (typically 500ms). The results are merged to produce an attributed transcript segment like “Speaker A: Hello.”

How do you prevent VAD boundary cuts from causing ASR errors?

Always include a padding buffer of at least 200ms of audio history before the VAD speech-start trigger point. This prevents the “Schrödinger’s Word” problem where VAD detects speech slightly late, cutting the onset of the first word. Without padding, “Important” might be truncated to “…portant” and transcribed as “Portent.” The padding ensures ASR receives the complete utterance.

How do streaming speech pipelines handle corrections in real-time?

Streaming pipelines use speculative execution: ASR runs continuously on a growing partial buffer, producing intermediate results (H, He, Hel, Hello). Diarization runs in parallel every 1 second, assigning speaker labels. If diarization changes its mind about speaker assignment, the system sends a correction event to the UI to overwrite the previously displayed line. This provides a responsive user experience with eventual consistency.


Originally published at: arunbaby.com/speech-tech/0049-speech-pipeline-dependencies

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch