Voice Activity Detection (VAD)
How voice assistants and video conferencing apps detect when you’re speaking vs silence, the critical first step in every speech pipeline.
TL;DR
VAD is the gatekeeper of all speech systems, classifying audio frames as speech or non-speech in under 5ms. Energy-based approaches work in quiet environments but fail in noise; WebRTC VAD is the production standard used by billions of users, balancing speed and robustness with configurable aggressiveness levels. ML-based CNN+LSTM models achieve the best accuracy at the cost of higher latency. Two-pass architectures (fast WebRTC pass + accurate ML refinement) offer the best of both worlds. Critical production details include 200-500ms padding to avoid clipping speech boundaries and adaptive thresholding for varying noise conditions. VAD feeds directly into streaming ASR pipelines and uses audio features like energy, zero-crossing rate, and mel-spectrograms.

Introduction
Voice Activity Detection (VAD) is the task of determining which parts of an audio stream contain speech vs non-speech (silence, background noise, music).
VAD is the gatekeeper of speech systems:
- Triggers when to start listening (wake word detection)
- Determines when utterance ends (endpoint detection)
- Saves compute by only processing speech frames
- Improves bandwidth by only transmitting speech
Why it matters:
- Power efficiency: Voice assistants sleep until speech detected
- Latency: Know when user finished speaking → respond faster
- Bandwidth: Transmit only speech frames in VoIP
- Accuracy: Reduce false alarms in ASR systems
What you’ll learn:
- Energy-based VAD (simple, fast)
- WebRTC VAD (production standard)
- ML-based VAD (state-of-the-art)
- Real-time streaming implementation
- Production deployment considerations
Problem Definition
Design a real-time voice activity detection system.
Functional Requirements
- Detection
- Classify each audio frame as speech or non-speech
- Handle noisy environments
- Detect speech from multiple speakers
- Endpoint Detection
- Determine start of speech
- Determine end of speech
- Handle pauses within utterances
- Real-time Processing
- Process audio frames as they arrive
- Minimal buffering
- Low latency
Non-Functional Requirements
- Latency
- Frame-level detection: < 5ms
- Endpoint detection: < 100ms after speech ends
- Accuracy
- True positive rate > 95% (detect speech)
- False positive rate < 5% (mistake noise for speech)
- Robustness
- Work in SNR (Signal-to-Noise Ratio) down to 0 dB (see the SNR sketch after this list)
- Handle various noise types (music, traffic, crowds)
- Adapt to different speakers
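To make the 0 dB target concrete: SNR in dB is 10 * log10(P_signal / P_noise), so 0 dB means the noise is as loud as the speech. A minimal sketch (assuming 1-D NumPy float arrays) for measuring SNR and for synthesizing test audio at a chosen SNR:

import numpy as np

def compute_snr_db(signal, noise):
    """SNR in dB = 10 * log10(P_signal / P_noise)"""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10  # Guard against divide-by-zero
    return 10 * np.log10(p_signal / p_noise)

def mix_at_snr(signal, noise, target_snr_db):
    """Scale noise so the mixture sits at the target SNR, then mix"""
    noise = noise[:len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_signal / (p_noise * 10 ** (target_snr_db / 10)))
    return signal + scale * noise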
Approach 1: Energy-Based VAD
Simplest approach: Speech has higher energy than silence.
Implementation
import numpy as np
import librosa
class EnergyVAD:
"""
Energy-based Voice Activity Detection
Pros: Simple, fast, no training required
Cons: Sensitive to noise, poor in low SNR
"""
def __init__(
self,
sr=16000,
frame_length_ms=20,
hop_length_ms=10,
energy_threshold=0.01
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.energy_threshold = energy_threshold
def compute_energy(self, frame):
"""
Compute frame energy (RMS)
Energy = sqrt(mean(x^2))
"""
return np.sqrt(np.mean(frame ** 2))
def detect(self, audio):
"""
Detect speech frames
Args:
audio: Audio signal
Returns:
List of booleans (True = speech, False = non-speech)
"""
# Frame the audio
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
# Compute energy per frame
energies = np.array([self.compute_energy(frame) for frame in frames.T])
# Threshold
is_speech = energies > self.energy_threshold
return is_speech
def get_speech_segments(self, audio):
"""
Get speech segments (start, end) in seconds
Returns:
List of (start_time, end_time) tuples
"""
is_speech = self.detect(audio)
segments = []
in_speech = False
start_frame = 0
for i, speech in enumerate(is_speech):
if speech and not in_speech:
# Speech started
start_frame = i
in_speech = True
elif not speech and in_speech:
# Speech ended
end_frame = i
in_speech = False
# Convert frames to time
start_time = start_frame * self.hop_length / self.sr
end_time = end_frame * self.hop_length / self.sr
segments.append((start_time, end_time))
# Handle case where audio ends during speech
if in_speech:
end_time = len(is_speech) * self.hop_length / self.sr
start_time = start_frame * self.hop_length / self.sr
segments.append((start_time, end_time))
return segments
# Usage
vad = EnergyVAD(energy_threshold=0.01)
# Load audio
audio, sr = librosa.load('speech_with_silence.wav', sr=16000)
# Detect speech
is_speech = vad.detect(audio)
print(f"Speech frames: {is_speech.sum()} / {len(is_speech)}")
# Get segments
segments = vad.get_speech_segments(audio)
for start, end in segments:
print(f"Speech from {start:.2f}s to {end:.2f}s")
Adaptive Thresholding
Fixed thresholds fail in varying noise conditions. Use adaptive thresholds.
class AdaptiveEnergyVAD(EnergyVAD):
"""
Energy VAD with adaptive threshold
Threshold adapts to background noise level
"""
def __init__(self, sr=16000, frame_length_ms=20, hop_length_ms=10):
super().__init__(sr, frame_length_ms, hop_length_ms)
self.noise_energy = 0.001 # Initial estimate
self.alpha = 0.95 # Smoothing factor
def detect(self, audio):
"""Detect with adaptive threshold"""
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
is_speech = []
for frame in frames.T:
energy = self.compute_energy(frame)
# Adaptive threshold: 3x noise energy
threshold = 3.0 * self.noise_energy
            if energy > threshold:
                # Likely speech: do not let it inflate the noise floor
                is_speech.append(True)
            else:
                # Likely noise/silence
                is_speech.append(False)
                # Update noise estimate during silence only
                self.noise_energy = self.alpha * self.noise_energy + (1 - self.alpha) * energy
return np.array(is_speech)
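A quick usage sketch (assuming the same 16 kHz test file as above; the threshold now tracks the noise floor instead of being fixed):

# Usage
adaptive_vad = AdaptiveEnergyVAD(sr=16000)
audio, sr = librosa.load('speech_with_silence.wav', sr=16000)
is_speech = adaptive_vad.detect(audio)
print(f"Speech frames: {is_speech.sum()} / {len(is_speech)}")
print(f"Final noise-energy estimate: {adaptive_vad.noise_energy:.5f}")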
Approach 2: Zero-Crossing Rate + Energy
Combine energy with zero-crossing rate for better accuracy.
Implementation
class ZCR_Energy_VAD:
"""
VAD using Energy + Zero-Crossing Rate
Intuition:
- Speech: Low ZCR (voiced sounds), moderate to high energy
- Noise: High ZCR (unvoiced), varying energy
- Silence: Low energy
"""
def __init__(
self,
sr=16000,
frame_length_ms=20,
hop_length_ms=10,
energy_threshold=0.01,
zcr_threshold=0.1
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.energy_threshold = energy_threshold
self.zcr_threshold = zcr_threshold
def compute_zcr(self, frame):
"""
Compute zero-crossing rate
ZCR = # of times signal crosses zero / # samples
"""
signs = np.sign(frame)
zcr = np.mean(np.abs(np.diff(signs))) / 2
return zcr
def detect(self, audio):
"""
Detect using both energy and ZCR
"""
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
is_speech = []
for frame in frames.T:
energy = np.sqrt(np.mean(frame ** 2))
zcr = self.compute_zcr(frame)
# Decision logic
if energy > self.energy_threshold:
# High energy: could be speech or noise
if zcr < self.zcr_threshold:
# Low ZCR → likely speech (voiced)
is_speech.append(True)
else:
# High ZCR → likely noise
is_speech.append(False)
else:
# Low energy → silence
is_speech.append(False)
return np.array(is_speech)
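One caveat: unvoiced fricatives (/s/, /f/) also have high ZCR, so this rule can miss them; the padding discussed later recovers most of those frames in practice. Usage mirrors the energy-only detector (same assumed test file):

# Usage
zcr_vad = ZCR_Energy_VAD(energy_threshold=0.01, zcr_threshold=0.1)
audio, sr = librosa.load('speech_with_silence.wav', sr=16000)
is_speech = zcr_vad.detect(audio)
print(f"Voiced speech frames: {is_speech.sum()} / {len(is_speech)}")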
Approach 3: WebRTC VAD
Industry-standard VAD used in Chrome, Skype, etc.
Using WebRTC VAD
# WebRTC VAD requires: pip install webrtcvad
import webrtcvad
import struct
class WebRTCVAD:
"""
WebRTC Voice Activity Detector
Pros:
- Production-tested (billions of users)
- Fast, CPU-efficient
- Robust to noise
Cons:
- Only works with specific sample rates (8/16/32/48 kHz)
- Fixed frame sizes (10/20/30 ms)
"""
def __init__(self, sr=16000, frame_duration_ms=30, aggressiveness=3):
"""
Args:
sr: Sample rate (must be 8000, 16000, 32000, or 48000)
frame_duration_ms: Frame duration (10, 20, or 30 ms)
aggressiveness: 0-3 (0=least aggressive, 3=most aggressive)
- Higher = more likely to classify as non-speech
- Use 3 for noisy environments
"""
if sr not in [8000, 16000, 32000, 48000]:
raise ValueError("Sample rate must be 8000, 16000, 32000, or 48000")
if frame_duration_ms not in [10, 20, 30]:
raise ValueError("Frame duration must be 10, 20, or 30 ms")
self.sr = sr
self.frame_duration_ms = frame_duration_ms
self.frame_length = int(sr * frame_duration_ms / 1000)
# Create VAD instance
self.vad = webrtcvad.Vad(aggressiveness)
def detect(self, audio):
"""
Detect speech in audio
Args:
audio: numpy array of int16 samples
Returns:
List of booleans (True = speech)
"""
# Convert float to int16 if needed (clip to avoid overflow)
if audio.dtype == np.float32 or audio.dtype == np.float64:
audio = np.clip(audio, -1.0, 1.0)
audio = (audio * 32767).astype(np.int16)
# Frame audio
num_frames = len(audio) // self.frame_length
is_speech = []
for i in range(num_frames):
start = i * self.frame_length
end = start + self.frame_length
frame = audio[start:end]
# Convert to bytes
frame_bytes = struct.pack('%dh' % len(frame), *frame)
# Detect
speech = self.vad.is_speech(frame_bytes, self.sr)
is_speech.append(speech)
return np.array(is_speech)
def get_speech_timestamps(self, audio):
"""
Get speech timestamps
Returns:
List of (start_time, end_time) in seconds
"""
is_speech = self.detect(audio)
segments = []
in_speech = False
start_frame = 0
for i, speech in enumerate(is_speech):
if speech and not in_speech:
start_frame = i
in_speech = True
elif not speech and in_speech:
in_speech = False
start_time = start_frame * self.frame_length / self.sr
end_time = i * self.frame_length / self.sr
segments.append((start_time, end_time))
if in_speech:
start_time = start_frame * self.frame_length / self.sr
end_time = len(is_speech) * self.frame_length / self.sr
segments.append((start_time, end_time))
return segments
# Usage
vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
audio, sr = librosa.load('audio.wav', sr=16000)
segments = vad.get_speech_timestamps(audio)
print("Speech segments:")
for start, end in segments:
print(f" {start:.2f}s - {end:.2f}s")
Approach 4: ML-Based VAD
Use neural networks for state-of-the-art performance.
CNN-based VAD
import torch
import torch.nn as nn
class CNNVAD(nn.Module):
"""
CNN-based Voice Activity Detector
Input: Mel-spectrogram (time, freq)
Output: Speech probability per frame
"""
def __init__(self, n_mels=40):
super().__init__()
# CNN layers
self.conv1 = nn.Sequential(
nn.Conv2d(1, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv2 = nn.Sequential(
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# LSTM for temporal modeling
self.lstm = nn.LSTM(
input_size=64 * (n_mels // 4),
hidden_size=128,
num_layers=2,
batch_first=True,
bidirectional=True
)
# Classification head
self.fc = nn.Linear(256, 1) # Binary classification
self.sigmoid = nn.Sigmoid()
def forward(self, x):
"""
Forward pass
Args:
x: (batch, 1, time, n_mels)
Returns:
Speech probabilities: (batch, time)
"""
# CNN
x = self.conv1(x) # (batch, 32, time/2, n_mels/2)
x = self.conv2(x) # (batch, 64, time/4, n_mels/4)
# Reshape for LSTM
batch, channels, time, freq = x.size()
x = x.permute(0, 2, 1, 3) # (batch, time, channels, freq)
x = x.reshape(batch, time, channels * freq)
# LSTM
x, _ = self.lstm(x) # (batch, time, 256)
# Classification
x = self.fc(x) # (batch, time, 1)
x = self.sigmoid(x) # (batch, time, 1)
        return x.squeeze(-1)  # (batch, time/4): pooling reduced time resolution 4x
# Usage
model = CNNVAD(n_mels=40)
# Example input: mel-spectrogram
mel_spec = torch.randn(1, 1, 100, 40) # (batch=1, channels=1, time=100, mels=40)
# Predict
speech_prob = model(mel_spec)  # (1, 25): one probability per pooled frame (time/4)
is_speech = speech_prob > 0.5  # Threshold at 0.5
print(f"Speech probability shape: {speech_prob.shape}")
print(f"Detected speech in {is_speech.sum().item()} / {is_speech.size(1)} frames")
Training ML VAD
class VADTrainer:
"""
Train VAD model
"""
def __init__(self, model, device='cuda'):
self.model = model.to(device)
self.device = device
self.criterion = nn.BCELoss()
self.optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
def train_epoch(self, train_loader):
"""Train for one epoch"""
self.model.train()
total_loss = 0
for mel_specs, labels in train_loader:
mel_specs = mel_specs.to(self.device)
labels = labels.to(self.device)
# Forward
predictions = self.model(mel_specs)
loss = self.criterion(predictions, labels)
# Backward
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
total_loss += loss.item()
return total_loss / len(train_loader)
def evaluate(self, val_loader):
"""Evaluate model"""
self.model.eval()
correct = 0
total = 0
with torch.no_grad():
for mel_specs, labels in val_loader:
mel_specs = mel_specs.to(self.device)
labels = labels.to(self.device)
predictions = self.model(mel_specs)
predicted = (predictions > 0.5).float()
correct += (predicted == labels).sum().item()
total += labels.numel()
accuracy = correct / total
return accuracy
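The trainer expects a loader yielding (mel_specs, labels) pairs. A minimal wiring sketch with synthetic stand-in data (shapes are illustrative; note the labels must be at the model's pooled resolution, time/4, to match the predictions):

from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 64 clips of 100 mel frames x 40 mels
mel_specs = torch.randn(64, 1, 100, 40)
labels = (torch.rand(64, 25) > 0.5).float()  # Frame labels at time/4 resolution

loader = DataLoader(TensorDataset(mel_specs, labels), batch_size=8, shuffle=True)

trainer = VADTrainer(CNNVAD(n_mels=40), device='cpu')
for epoch in range(3):
    loss = trainer.train_epoch(loader)
    print(f"Epoch {epoch}: loss = {loss:.4f}")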
Real-Time Streaming VAD
Process audio as it arrives (streaming).
Streaming Implementation
from collections import deque
import numpy as np
import struct
class StreamingVAD:
"""
Real-time VAD for streaming audio
Use case: Voice assistants, VoIP, live transcription
"""
def __init__(
self,
sr=16000,
frame_duration_ms=30,
aggressiveness=3,
speech_pad_ms=300
):
self.sr = sr
self.frame_duration_ms = frame_duration_ms
self.frame_length = int(sr * frame_duration_ms / 1000)
self.speech_pad_ms = speech_pad_ms
self.speech_pad_frames = int(speech_pad_ms / frame_duration_ms)
# WebRTC VAD
self.vad = webrtcvad.Vad(aggressiveness)
# State
        self.buffer = deque(maxlen=10000)  # Audio buffer (~0.6s at 16kHz; oldest samples drop when full)
self.speech_frames = 0 # Consecutive speech frames
self.silence_frames = 0 # Consecutive silence frames
self.in_speech = False
# Store speech segments
self.current_speech = []
def add_audio(self, audio_chunk):
"""
Add audio chunk to buffer
Args:
audio_chunk: New audio samples (int16)
"""
self.buffer.extend(audio_chunk)
def process_frame(self):
"""
Process one frame from buffer
Returns:
(is_speech, speech_ended, speech_audio)
"""
if len(self.buffer) < self.frame_length:
return None, False, None
# Extract frame
frame = np.array([self.buffer.popleft() for _ in range(self.frame_length)])
# Convert to bytes
frame_bytes = struct.pack('%dh' % len(frame), *frame)
# Detect
is_speech = self.vad.is_speech(frame_bytes, self.sr)
# Update state
if is_speech:
self.speech_frames += 1
self.silence_frames = 0
if not self.in_speech:
# Speech just started
self.in_speech = True
self.current_speech = []
# Add to current speech
self.current_speech.extend(frame)
else:
self.silence_frames += 1
self.speech_frames = 0
if self.in_speech:
# Add padding
self.current_speech.extend(frame)
# Check if speech ended
if self.silence_frames >= self.speech_pad_frames:
# Speech ended
self.in_speech = False
speech_audio = np.array(self.current_speech)
self.current_speech = []
return False, True, speech_audio
return is_speech, False, None
def process_stream(self):
"""
Process all buffered audio
Yields speech segments as they complete
"""
while len(self.buffer) >= self.frame_length:
is_speech, speech_ended, speech_audio = self.process_frame()
if speech_ended:
yield speech_audio
# Usage
streaming_vad = StreamingVAD(sr=16000, frame_duration_ms=30)
# Simulate streaming (process chunks as they arrive)
chunk_size = 480 # 30ms at 16kHz
for chunk_start in range(0, len(audio), chunk_size):
    chunk = audio[chunk_start:chunk_start + chunk_size]
    # Add to buffer (scale float [-1, 1] to int16 first; a bare astype would truncate to zeros)
    streaming_vad.add_audio((np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16))
# Process
for speech_segment in streaming_vad.process_stream():
print(f"Speech segment detected: {len(speech_segment)} samples")
# Send to ASR, save, etc.
Production Considerations
Hangover and Padding
Add padding before/after speech to avoid cutting off words: hangover keeps the speech state alive briefly after silence (as the streaming VAD above does), while padding widens the detected boundaries.
class VADWithPadding:
"""
VAD with pre/post padding
"""
    def __init__(
        self,
        vad,
        pre_pad_ms=200,
        post_pad_ms=500,
        sr=16000,
        frame_duration_ms=30
    ):
        self.vad = vad
        self.pre_pad_frames = int(pre_pad_ms / frame_duration_ms)
        self.post_pad_frames = int(post_pad_ms / frame_duration_ms)
        self.sr = sr
def detect_with_padding(self, audio):
"""
Detect speech with padding
"""
is_speech = self.vad.detect(audio)
# Add pre-padding
padded = np.copy(is_speech)
for i in range(len(is_speech)):
if is_speech[i]:
# Mark previous frames as speech
start = max(0, i - self.pre_pad_frames)
padded[start:i] = True
# Add post-padding
for i in range(len(is_speech)):
if is_speech[i]:
# Mark following frames as speech
end = min(len(is_speech), i + self.post_pad_frames)
padded[i:end] = True
return padded
Performance Optimization
import time
class OptimizedVAD:
"""
Optimized VAD for production
"""
def __init__(self, vad_impl):
self.vad = vad_impl
self.stats = {
'total_frames': 0,
'speech_frames': 0,
'processing_time': 0
}
def detect_with_stats(self, audio):
"""Detect with performance tracking"""
start = time.perf_counter()
is_speech = self.vad.detect(audio)
end = time.perf_counter()
# Update stats
self.stats['total_frames'] += len(is_speech)
self.stats['speech_frames'] += is_speech.sum()
self.stats['processing_time'] += (end - start)
return is_speech
def get_stats(self):
"""Get performance statistics"""
if self.stats['total_frames'] == 0:
return None
speech_ratio = self.stats['speech_frames'] / self.stats['total_frames']
avg_time_per_frame = self.stats['processing_time'] / self.stats['total_frames']
return {
'speech_ratio': speech_ratio,
'avg_latency_ms': avg_time_per_frame * 1000,
'total_frames': self.stats['total_frames'],
'speech_frames': self.stats['speech_frames']
}
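A usage sketch, wrapping the energy detector from earlier (any object exposing .detect(audio) works):

# Usage
opt_vad = OptimizedVAD(EnergyVAD(energy_threshold=0.01))
audio, sr = librosa.load('speech_with_silence.wav', sr=16000)
opt_vad.detect_with_stats(audio)
stats = opt_vad.get_stats()
print(f"Speech ratio: {stats['speech_ratio']:.2f}, avg latency: {stats['avg_latency_ms']:.3f}ms")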
Integration with ASR Pipeline
VAD as the first stage in speech recognition systems.
End-to-End Pipeline
class SpeechPipeline:
"""
Complete speech recognition pipeline with VAD
Pipeline: Audio → VAD → ASR → Text
"""
def __init__(self):
# VAD
self.vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
# Placeholder for ASR model
self.asr_model = None # Would be actual ASR model
# Buffering
self.min_speech_duration = 0.5 # seconds
self.max_speech_duration = 10.0 # seconds
def process_audio_file(self, audio_path):
"""
Process audio file end-to-end
Returns:
List of transcriptions
"""
# Load audio
import librosa
audio, sr = librosa.load(audio_path, sr=16000)
# Run VAD
speech_segments = self.vad.get_speech_timestamps(audio)
# Filter by duration
valid_segments = [
(start, end) for start, end in speech_segments
if (end - start) >= self.min_speech_duration and
(end - start) <= self.max_speech_duration
]
transcriptions = []
for start, end in valid_segments:
# Extract speech segment
start_sample = int(start * sr)
end_sample = int(end * sr)
speech_audio = audio[start_sample:end_sample]
# Run ASR (placeholder)
# transcript = self.asr_model.transcribe(speech_audio)
transcript = f"[Speech from {start:.2f}s to {end:.2f}s]"
transcriptions.append({
'start': start,
'end': end,
'duration': end - start,
'text': transcript
})
return transcriptions
def process_streaming(self, audio_stream):
"""
Process streaming audio
Yields transcriptions as speech segments complete
"""
streaming_vad = StreamingVAD(sr=16000, frame_duration_ms=30)
for chunk in audio_stream:
streaming_vad.add_audio(chunk)
for speech_segment in streaming_vad.process_stream():
# Run ASR on completed segment
# transcript = self.asr_model.transcribe(speech_segment)
transcript = "[Speech detected]"
yield {
'audio': speech_segment,
'text': transcript,
'timestamp': time.time()
}
# Usage
pipeline = SpeechPipeline()
# Process file
transcriptions = pipeline.process_audio_file('conversation.wav')
for t in transcriptions:
print(f"{t['start']:.2f}s - {t['end']:.2f}s: {t['text']}")
Two-Pass VAD for Higher Accuracy
Run a fast, aggressive VAD first, then refine the candidate segments with an ML model.
class TwoPassVAD:
"""
Two-pass VAD for improved accuracy
Pass 1: Fast WebRTC VAD (aggressive) → candidate segments
Pass 2: ML VAD (accurate) → final segments
"""
def __init__(self):
# Fast pass: WebRTC VAD (aggressive)
self.fast_vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
        # Accurate pass: ML VAD (in practice, load trained weights here)
        self.ml_vad = CNNVAD(n_mels=40)
        self.ml_vad.eval()
def detect(self, audio):
"""
Two-pass detection
Returns:
Refined speech segments
"""
# Pass 1: Fast VAD to get candidate regions
candidate_segments = self.fast_vad.get_speech_timestamps(audio)
# Pass 2: ML VAD to refine each candidate
refined_segments = []
for start, end in candidate_segments:
# Extract segment
start_sample = int(start * 16000)
end_sample = int(end * 16000)
segment_audio = audio[start_sample:end_sample]
# Run ML VAD on segment
# Convert to mel-spectrogram
import librosa
mel_spec = librosa.feature.melspectrogram(
y=segment_audio,
sr=16000,
n_mels=40
)
            # ML model prediction (note the transpose: CNNVAD expects (batch, 1, time, mels))
            # mel_tensor = torch.from_numpy(mel_spec.T).float().unsqueeze(0).unsqueeze(0)
            # with torch.no_grad():
            #     predictions = self.ml_vad(mel_tensor)
            # is_speech_frames = predictions > 0.5
            # Placeholder: keep the candidate until the ML pass is wired in
            refined_segments.append((start, end))
return refined_segments
Comparison of VAD Methods
| Method | Pros | Cons | Use Case |
|---|---|---|---|
| Energy-based | Simple, fast, no training | Poor in noise | Quiet environments |
| ZCR + Energy | Better than energy alone | Still noise-sensitive | Moderate noise |
| WebRTC VAD | Fast, robust, production-tested | Fixed aggressiveness | Real-time apps, VoIP |
| ML-based (CNN) | Best accuracy, adaptable | Requires training, slower | High-noise, accuracy-critical |
| ML-based (RNN) | Temporal modeling | Higher latency | Offline processing |
| Hybrid (2-pass) | Balance speed/accuracy | More complex | Production ASR |
Production Deployment
Latency Budgets
For real-time applications:
Voice Assistant Latency Budget:
┌─────────────────────────────────────┐
│ VAD Detection: 5-10ms │
│ Endpoint Detection: 100-200ms │
│ ASR Processing: 500-1000ms │
│ NLU + Dialog: 100-200ms │
│ TTS Generation: 200-500ms │
├─────────────────────────────────────┤
│ Total: ~1-2 seconds│
└─────────────────────────────────────┘
VAD must be fast to keep overall latency low!
Resource Usage
import psutil
import time
class VADProfiler:
"""
Profile VAD performance
"""
def __init__(self, vad):
self.vad = vad
def profile(self, audio, num_runs=100):
"""
Benchmark VAD
Returns:
Performance metrics
"""
latencies = []
# Warm-up
for _ in range(10):
self.vad.detect(audio)
# Measure
        process = psutil.Process()
        process.cpu_percent()  # Prime the counter; the next call measures this interval
        memory_before = process.memory_info().rss / 1024 / 1024  # MB
for _ in range(num_runs):
start = time.perf_counter()
result = self.vad.detect(audio)
end = time.perf_counter()
latencies.append((end - start) * 1000) # ms
        cpu_usage_pct = process.cpu_percent()  # Average CPU % over the measurement loop
memory_after = process.memory_info().rss / 1024 / 1024 # MB
return {
'mean_latency_ms': np.mean(latencies),
'p50_latency_ms': np.percentile(latencies, 50),
'p95_latency_ms': np.percentile(latencies, 95),
'p99_latency_ms': np.percentile(latencies, 99),
'throughput_fps': 1000 / np.mean(latencies),
            'cpu_usage_pct': cpu_usage_pct,
'memory_mb': memory_after - memory_before
}
# Usage
profiler = VADProfiler(WebRTCVAD())
audio, sr = librosa.load('test.wav', sr=16000, duration=10.0)
metrics = profiler.profile(audio)
print(f"Mean latency: {metrics['mean_latency_ms']:.2f}ms")
print(f"P95 latency: {metrics['p95_latency_ms']:.2f}ms")
print(f"Throughput: {metrics['throughput_fps']:.0f} frames/sec")
print(f"CPU usage: {metrics['cpu_usage_pct']:.1f}%")
print(f"Memory: {metrics['memory_mb']:.1f} MB")
Mobile/Edge Deployment
Optimize VAD for on-device deployment.
class MobileOptimizedVAD:
"""
VAD optimized for mobile devices
Quantized model, reduced precision, smaller memory footprint
"""
def __init__(self):
# Use int8 quantization for mobile
import torch
self.model = CNNVAD(n_mels=40)
# Quantize model
# Dynamic quantization applies to Linear/LSTM; Conv2d not supported
        self.model = torch.quantization.quantize_dynamic(
            self.model,
            {torch.nn.Linear, torch.nn.LSTM},
            dtype=torch.qint8
        )
self.model.eval()
    def detect_efficient(self, audio, sr=16000):
        """
        Efficient detection with reduced memory
        Process in chunks to reduce peak memory
        """
        import librosa
        import torch
        chunk_size = sr  # 1 second chunks
        results = []
        for i in range(0, len(audio), chunk_size):
            chunk = audio[i:i + chunk_size]
            if len(chunk) < 2048:  # Skip fragments too short for the STFT
                break
            # Features for this chunk only (keeps peak memory low)
            mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=40)
            x = torch.from_numpy(mel.T).float().unsqueeze(0).unsqueeze(0)  # (1, 1, time, mels)
            with torch.no_grad():
                probs = self.model(x)  # (1, time/4)
            results.extend((probs.squeeze(0) > 0.5).tolist())
        return results
Monitoring & Debugging
VAD Quality Metrics
class VADEvaluator:
"""
Evaluate VAD performance
Metrics:
- Precision: % of detected speech that is actual speech
- Recall: % of actual speech that was detected
- F1 score
- False alarm rate
- Miss rate
"""
def __init__(self):
pass
def evaluate(
self,
predictions: np.ndarray,
ground_truth: np.ndarray
) -> dict:
"""
Compute VAD metrics
Args:
predictions: Binary array (1=speech, 0=non-speech)
ground_truth: Ground truth labels
Returns:
Dictionary of metrics
"""
# True positives, false positives, etc.
tp = np.sum((predictions == 1) & (ground_truth == 1))
fp = np.sum((predictions == 1) & (ground_truth == 0))
tn = np.sum((predictions == 0) & (ground_truth == 0))
fn = np.sum((predictions == 0) & (ground_truth == 1))
# Metrics
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
false_alarm_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
miss_rate = fn / (fn + tp) if (fn + tp) > 0 else 0
return {
'precision': precision,
'recall': recall,
'f1_score': f1,
'accuracy': accuracy,
'false_alarm_rate': false_alarm_rate,
'miss_rate': miss_rate,
'tp': int(tp),
'fp': int(fp),
'tn': int(tn),
'fn': int(fn)
}
# Usage
evaluator = VADEvaluator()
# Load ground truth
# ground_truth = load_annotations('test_audio.txt')
# Run VAD
vad = WebRTCVAD()
# predictions = vad.detect(audio)
# Evaluate
# metrics = evaluator.evaluate(predictions, ground_truth)
# print(f"Precision: {metrics['precision']:.3f}")
# print(f"Recall: {metrics['recall']:.3f}")
# print(f"F1 Score: {metrics['f1_score']:.3f}")
# print(f"False Alarm Rate: {metrics['false_alarm_rate']:.3f}")
Debugging Common Issues
Issue 1: Clipping Speech Beginnings
# Solution: Increase pre-padding
vad_with_padding = VADWithPadding(
vad=WebRTCVAD(),
pre_pad_ms=300, # Increase from 200ms
post_pad_ms=500
)
Issue 2: False Positives from Music
# Solution: Use ML VAD or add music classifier
class MusicFilteredVAD:
"""
VAD with music filtering
"""
def __init__(self, vad, music_classifier):
self.vad = vad
self.music_classifier = music_classifier
def detect(self, audio):
"""Detect speech, filtering out music"""
# Run VAD
speech_frames = self.vad.detect(audio)
# Filter music
is_music = self.music_classifier.predict(audio)
# Combine
is_speech = speech_frames & (~is_music)
return is_speech
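The music_classifier above is a placeholder. As a toy stand-in (an assumption, not a production model), the sketch below uses spectral flatness, which tends to be steadier for music than for speech, and emits one boolean per 30ms frame so its output aligns with the WebRTC VAD frames:

import numpy as np
import librosa

class SpectralFlatnessMusicHeuristic:
    """Toy music detector: illustrative only, not production-grade"""
    def __init__(self, sr=16000, frame_length=480, var_threshold=0.01):
        self.sr = sr
        self.frame_length = frame_length    # 30ms at 16kHz, matching the VAD frames
        self.var_threshold = var_threshold  # Tuning knob; this value is an assumption
    def predict(self, audio):
        n_frames = len(audio) // self.frame_length
        is_music = np.zeros(n_frames, dtype=bool)
        for i in range(n_frames):
            frame = audio[i * self.frame_length:(i + 1) * self.frame_length]
            flatness = librosa.feature.spectral_flatness(y=frame, n_fft=256, hop_length=64)
            # Heuristic: steadier flatness -> more music-like
            is_music[i] = np.var(flatness) < self.var_threshold
        return is_music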
Issue 3: High CPU Usage
# Solution: Downsample audio or use simpler VAD
class DownsampledVAD:
"""
VAD with audio downsampling for efficiency
"""
def __init__(self, target_sr=8000):
self.target_sr = target_sr
self.vad = WebRTCVAD(sr=8000) # 8kHz instead of 16kHz
def detect(self, audio, original_sr=16000):
"""Detect with downsampling"""
# Downsample
import librosa
audio_downsampled = librosa.resample(
audio,
orig_sr=original_sr,
target_sr=self.target_sr
)
# Run VAD on downsampled audio
return self.vad.detect(audio_downsampled)
Advanced Techniques
Noise-Robust VAD
Use spectral subtraction for noise reduction before VAD.
class NoiseRobustVAD:
"""
VAD with noise reduction preprocessing
"""
def __init__(self, vad):
self.vad = vad
def spectral_subtraction(self, audio, noise_profile):
"""
Simple spectral subtraction
Args:
audio: Input audio
noise_profile: Estimated noise spectrum
Returns:
Denoised audio
"""
import librosa
# STFT
D = librosa.stft(audio)
magnitude = np.abs(D)
phase = np.angle(D)
# Subtract noise
magnitude_clean = np.maximum(magnitude - noise_profile, 0)
# Reconstruct
D_clean = magnitude_clean * np.exp(1j * phase)
audio_clean = librosa.istft(D_clean)
return audio_clean
def detect_with_denoising(self, audio):
"""Detect speech after denoising"""
# Estimate noise from first 0.5 seconds
noise_segment = audio[:8000] # 0.5s at 16kHz
import librosa
noise_spectrum = np.abs(librosa.stft(noise_segment))
noise_profile = np.median(noise_spectrum, axis=1, keepdims=True)
# Denoise
audio_clean = self.spectral_subtraction(audio, noise_profile)
# Run VAD on clean audio
return self.vad.detect(audio_clean)
Multi-Condition Training Data
For ML-based VAD, train on diverse conditions.
class DataAugmentationForVAD:
"""
Augment training data for robust VAD
"""
    def augment(self, clean_speech):
        """
        Create augmented samples
        Augmentations:
        - Add various noise types
        - Vary SNR levels (down to the 0 dB robustness target)
        - Apply room reverberation
        - Change speaker characteristics
        """
        augmented = []
        # 1. White noise at several SNR levels
        # (mix_at_snr is the helper from the SNR sketch in the requirements section)
        for snr_db in [20, 10, 5, 0]:
            noise = np.random.randn(len(clean_speech))
            augmented.append(mix_at_snr(clean_speech, noise, snr_db))
        # 2. Add babble noise (simulated)
        # babble = load_babble_noise()
        # augmented.append(mix_at_snr(clean_speech, babble, 5))
        # 3. Apply reverberation
        # reverb = apply_reverb(clean_speech)
        # augmented.append(reverb)
        return augmented
Real-World Deployment Examples
Zoom/Video Conferencing
Requirements:
- Ultra-low latency (< 10ms)
- Adaptive to varying network conditions
- Handle overlapping speech (multiple speakers)
Solution:
- WebRTC VAD for speed
- Adaptive aggressiveness based on network conditions (see the sketch after this list)
- Per-speaker VAD in multi-party calls
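A minimal sketch of the adaptive-aggressiveness idea (the packet-loss thresholds are assumptions; real conferencing stacks derive this signal from their congestion controller):

import webrtcvad

def pick_aggressiveness(packet_loss_pct):
    """Map network health to a WebRTC VAD aggressiveness level (0-3)"""
    if packet_loss_pct > 5.0:
        return 3  # Congested: transmit only clearly-voiced frames
    if packet_loss_pct > 1.0:
        return 2
    return 1      # Healthy network: be permissive to avoid clipping speech

vad = webrtcvad.Vad()
vad.set_mode(pick_aggressiveness(packet_loss_pct=2.5))  # Re-run when conditions change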
Smart Speakers (Alexa, Google Home)
Requirements:
- Always-on (low power)
- Far-field audio (echoes, reverberation)
- Wake word detection + VAD
Solution:
- Two-stage: Wake word detector → VAD → ASR
- On-device VAD (WebRTC or lightweight ML)
- Cloud-based refinement for difficult cases
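A sketch of the two-stage gating (the wake-word detector here is a hypothetical stub exposing .triggered(chunk); real devices use a dedicated low-power keyword spotter):

class WakeWordGatedVAD:
    """Run VAD/ASR only after the wake word fires (hypothetical sketch)"""
    def __init__(self, wake_word_detector, streaming_vad, listen_window_s=8.0):
        self.wake = wake_word_detector   # Stub: exposes .triggered(chunk) -> bool
        self.vad = streaming_vad         # StreamingVAD from earlier
        self.listen_window_s = listen_window_s
        self.listening_until = 0.0
    def on_chunk(self, chunk, now):
        """Yields completed speech segments; sleeps when not listening"""
        if self.wake.triggered(chunk):
            # Wake word heard: open (or extend) the listening window
            self.listening_until = now + self.listen_window_s
        if now >= self.listening_until:
            return  # Asleep: skip VAD entirely to save power
        self.vad.add_audio(chunk)
        yield from self.vad.process_stream()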
Call Centers
Requirements:
- High accuracy (for analytics)
- Speaker diarization integration
- Post-processing acceptable
Solution:
- ML-based VAD with large models
- Two-pass processing
- Combined with speaker diarization
Key Takeaways
✅ Energy + ZCR provides a simple baseline VAD
✅ WebRTC VAD is the production standard: fast, robust, widely deployed
✅ ML-based VAD achieves the best accuracy in noisy conditions
✅ Two-pass VAD balances speed and accuracy for production
✅ Streaming processing enables real-time applications
✅ Padding (200-500ms) is critical to avoid cutting off speech
✅ Adaptive thresholds handle varying noise levels
✅ Frame size tradeoff: smaller = lower latency, larger = better accuracy
✅ Quantization & optimization are essential for mobile/edge deployment
✅ Monitor precision/recall in production to catch degradation
✅ Integration with ASR requires careful endpoint detection logic
✅ Noise robustness comes via preprocessing or multi-condition training
FAQ
Q: What is Voice Activity Detection and why does it matter?
A: Voice Activity Detection (VAD) determines which parts of an audio stream contain speech versus silence or background noise. It matters because it saves 50-70% of compute by skipping non-speech frames, determines when utterances start and end for faster response times in streaming ASR, and reduces bandwidth by only transmitting speech in VoIP applications.
Q: Which VAD method should I use in production?
A: WebRTC VAD is the recommended production choice for most real-time applications. It is battle-tested across billions of users in Chrome and Skype, runs in under 5ms per frame, and handles noise well. For high-noise environments where accuracy is critical, use a two-pass approach with WebRTC as the fast first pass and an ML model (CNN+LSTM on mel-spectrogram features) for refinement.
Q: How do I prevent VAD from cutting off the beginning or end of speech?
A: Add pre-padding of 200-300ms before detected speech starts and post-padding of 300-500ms after detected speech ends. Use hysteresis by continuing to classify frames as speech for a buffer period after silence is detected, preventing premature utterance boundaries during natural pauses. This is especially important for downstream speaker recognition, which needs complete utterances.
Originally published at: arunbaby.com/speech-tech/0004-voice-activity-detection