Voice Activity Detection (VAD)
How voice assistants and video conferencing apps detect when you’re speaking versus silent: the critical first step in every speech pipeline.
Introduction
Voice Activity Detection (VAD) is the task of determining which parts of an audio stream contain speech and which contain non-speech (silence, background noise, music).
VAD is the gatekeeper of speech systems:
- Decides when to start listening (in tandem with wake word detection)
- Determines when an utterance ends (endpoint detection)
- Saves compute by processing only speech frames
- Saves bandwidth by transmitting only speech frames
Why it matters:
- Power efficiency: Voice assistants sleep until speech detected
- Latency: Know when user finished speaking → respond faster
- Bandwidth: Transmit only speech frames in VoIP
- Accuracy: Reduce false alarms in ASR systems
What you’ll learn:
- Energy-based VAD (simple, fast)
- WebRTC VAD (production standard)
- ML-based VAD (state-of-the-art)
- Real-time streaming implementation
- Production deployment considerations
Problem Definition
Design a real-time voice activity detection system.
Functional Requirements
- Detection
- Classify each audio frame as speech or non-speech
- Handle noisy environments
- Detect speech from multiple speakers
- Endpoint Detection
- Determine start of speech
- Determine end of speech
- Handle pauses within utterances
- Real-time Processing
- Process audio frames as they arrive
- Minimal buffering
- Low latency
Non-Functional Requirements
- Latency
- Frame-level detection: < 5ms
- Endpoint detection: < 100ms after speech ends
- Accuracy
- True positive rate > 95% (detect speech)
- False positive rate < 5% (mistake noise for speech)
- Robustness
- Operate at SNRs (signal-to-noise ratio) down to 0 dB (see the mixing sketch after this list)
- Handle various noise types (music, traffic, crowds)
- Adapt to different speakers
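To verify that 0 dB requirement, it helps to synthesize test clips at a known SNR. A minimal sketch for mixing noise into clean speech at a target level (mix_at_snr is a hypothetical helper, reused in the augmentation section later):

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then mix."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power required to hit the target SNR
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# e.g. a 0 dB test clip: mix_at_snr(clean, np.random.randn(len(clean)), snr_db=0)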
Approach 1: Energy-Based VAD
Simplest approach: Speech has higher energy than silence.
Implementation
import numpy as np
import librosa
class EnergyVAD:
"""
Energy-based Voice Activity Detection
Pros: Simple, fast, no training required
Cons: Sensitive to noise, poor in low SNR
"""
def __init__(
self,
sr=16000,
frame_length_ms=20,
hop_length_ms=10,
energy_threshold=0.01
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.energy_threshold = energy_threshold
def compute_energy(self, frame):
"""
Compute frame energy (RMS)
Energy = sqrt(mean(x^2))
"""
return np.sqrt(np.mean(frame ** 2))
def detect(self, audio):
"""
Detect speech frames
Args:
audio: Audio signal
Returns:
List of booleans (True = speech, False = non-speech)
"""
# Frame the audio
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
# Compute energy per frame
energies = np.array([self.compute_energy(frame) for frame in frames.T])
# Threshold
is_speech = energies > self.energy_threshold
return is_speech
def get_speech_segments(self, audio):
"""
Get speech segments (start, end) in seconds
Returns:
List of (start_time, end_time) tuples
"""
is_speech = self.detect(audio)
segments = []
in_speech = False
start_frame = 0
for i, speech in enumerate(is_speech):
if speech and not in_speech:
# Speech started
start_frame = i
in_speech = True
elif not speech and in_speech:
# Speech ended
end_frame = i
in_speech = False
# Convert frames to time
start_time = start_frame * self.hop_length / self.sr
end_time = end_frame * self.hop_length / self.sr
segments.append((start_time, end_time))
# Handle case where audio ends during speech
if in_speech:
end_time = len(is_speech) * self.hop_length / self.sr
start_time = start_frame * self.hop_length / self.sr
segments.append((start_time, end_time))
return segments
# Usage
vad = EnergyVAD(energy_threshold=0.01)
# Load audio
audio, sr = librosa.load('speech_with_silence.wav', sr=16000)
# Detect speech
is_speech = vad.detect(audio)
print(f"Speech frames: {is_speech.sum()} / {len(is_speech)}")
# Get segments
segments = vad.get_speech_segments(audio)
for start, end in segments:
print(f"Speech from {start:.2f}s to {end:.2f}s")
Adaptive Thresholding
Fixed thresholds fail in varying noise conditions. Use adaptive thresholds.
class AdaptiveEnergyVAD(EnergyVAD):
"""
Energy VAD with adaptive threshold
Threshold adapts to background noise level
"""
def __init__(self, sr=16000, frame_length_ms=20, hop_length_ms=10):
super().__init__(sr, frame_length_ms, hop_length_ms)
self.noise_energy = 0.001 # Initial estimate
self.alpha = 0.95 # Smoothing factor
def detect(self, audio):
"""Detect with adaptive threshold"""
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
is_speech = []
for frame in frames.T:
energy = self.compute_energy(frame)
# Adaptive threshold: 3x noise energy
threshold = 3.0 * self.noise_energy
if energy > threshold:
# Likely speech
is_speech.append(True)
else:
# Likely noise/silence
is_speech.append(False)
                # Update the noise estimate only during silence, so speech
                # energy does not inflate the noise floor
                self.noise_energy = self.alpha * self.noise_energy + (1 - self.alpha) * energy
return np.array(is_speech)
Approach 2: Zero-Crossing Rate + Energy
Combine energy with zero-crossing rate for better accuracy.
Implementation
class ZCR_Energy_VAD:
"""
VAD using Energy + Zero-Crossing Rate
    Intuition:
    - Voiced speech: low ZCR, moderate to high energy
    - Broadband noise: high ZCR, varying energy
    - Silence: low energy
    (Caveat: unvoiced speech such as fricatives also has high ZCR,
    so this heuristic can miss it)
"""
def __init__(
self,
sr=16000,
frame_length_ms=20,
hop_length_ms=10,
energy_threshold=0.01,
zcr_threshold=0.1
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.energy_threshold = energy_threshold
self.zcr_threshold = zcr_threshold
def compute_zcr(self, frame):
"""
Compute zero-crossing rate
ZCR = # of times signal crosses zero / # samples
"""
signs = np.sign(frame)
zcr = np.mean(np.abs(np.diff(signs))) / 2
return zcr
def detect(self, audio):
"""
Detect using both energy and ZCR
"""
frames = librosa.util.frame(
audio,
frame_length=self.frame_length,
hop_length=self.hop_length
)
is_speech = []
for frame in frames.T:
energy = np.sqrt(np.mean(frame ** 2))
zcr = self.compute_zcr(frame)
# Decision logic
if energy > self.energy_threshold:
# High energy: could be speech or noise
if zcr < self.zcr_threshold:
# Low ZCR → likely speech (voiced)
is_speech.append(True)
else:
# High ZCR → likely noise
is_speech.append(False)
else:
# Low energy → silence
is_speech.append(False)
return np.array(is_speech)
Approach 3: WebRTC VAD
Industry-standard VAD used in Chrome, Skype, etc.
Using WebRTC VAD
# WebRTC VAD requires: pip install webrtcvad
import webrtcvad
import struct
class WebRTCVAD:
"""
WebRTC Voice Activity Detector
Pros:
- Production-tested (billions of users)
- Fast, CPU-efficient
- Robust to noise
Cons:
- Only works with specific sample rates (8/16/32/48 kHz)
- Fixed frame sizes (10/20/30 ms)
"""
def __init__(self, sr=16000, frame_duration_ms=30, aggressiveness=3):
"""
Args:
sr: Sample rate (must be 8000, 16000, 32000, or 48000)
frame_duration_ms: Frame duration (10, 20, or 30 ms)
aggressiveness: 0-3 (0=least aggressive, 3=most aggressive)
- Higher = more likely to classify as non-speech
- Use 3 for noisy environments
"""
if sr not in [8000, 16000, 32000, 48000]:
raise ValueError("Sample rate must be 8000, 16000, 32000, or 48000")
if frame_duration_ms not in [10, 20, 30]:
raise ValueError("Frame duration must be 10, 20, or 30 ms")
self.sr = sr
self.frame_duration_ms = frame_duration_ms
self.frame_length = int(sr * frame_duration_ms / 1000)
# Create VAD instance
self.vad = webrtcvad.Vad(aggressiveness)
def detect(self, audio):
"""
Detect speech in audio
Args:
audio: numpy array of int16 samples
Returns:
List of booleans (True = speech)
"""
# Convert float to int16 if needed (clip to avoid overflow)
if audio.dtype == np.float32 or audio.dtype == np.float64:
audio = np.clip(audio, -1.0, 1.0)
audio = (audio * 32767).astype(np.int16)
# Frame audio
num_frames = len(audio) // self.frame_length
is_speech = []
for i in range(num_frames):
start = i * self.frame_length
end = start + self.frame_length
frame = audio[start:end]
# Convert to bytes
frame_bytes = struct.pack('%dh' % len(frame), *frame)
# Detect
speech = self.vad.is_speech(frame_bytes, self.sr)
is_speech.append(speech)
return np.array(is_speech)
def get_speech_timestamps(self, audio):
"""
Get speech timestamps
Returns:
List of (start_time, end_time) in seconds
"""
is_speech = self.detect(audio)
segments = []
in_speech = False
start_frame = 0
for i, speech in enumerate(is_speech):
if speech and not in_speech:
start_frame = i
in_speech = True
elif not speech and in_speech:
in_speech = False
start_time = start_frame * self.frame_length / self.sr
end_time = i * self.frame_length / self.sr
segments.append((start_time, end_time))
if in_speech:
start_time = start_frame * self.frame_length / self.sr
end_time = len(is_speech) * self.frame_length / self.sr
segments.append((start_time, end_time))
return segments
# Usage
vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
audio, sr = librosa.load('audio.wav', sr=16000)
segments = vad.get_speech_timestamps(audio)
print("Speech segments:")
for start, end in segments:
print(f" {start:.2f}s - {end:.2f}s")
Approach 4: ML-Based VAD
Use neural networks for state-of-the-art performance.
CNN-based VAD
import torch
import torch.nn as nn
class CNNVAD(nn.Module):
"""
CNN-based Voice Activity Detector
    Input: Mel-spectrogram (time, freq)
    Output: Speech probability per frame (time axis downsampled 4x by pooling)
"""
def __init__(self, n_mels=40):
super().__init__()
# CNN layers
self.conv1 = nn.Sequential(
nn.Conv2d(1, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv2 = nn.Sequential(
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# LSTM for temporal modeling
self.lstm = nn.LSTM(
input_size=64 * (n_mels // 4),
hidden_size=128,
num_layers=2,
batch_first=True,
bidirectional=True
)
# Classification head
self.fc = nn.Linear(256, 1) # Binary classification
self.sigmoid = nn.Sigmoid()
def forward(self, x):
"""
Forward pass
Args:
x: (batch, 1, time, n_mels)
        Returns:
            Speech probabilities: (batch, time // 4)
"""
# CNN
x = self.conv1(x) # (batch, 32, time/2, n_mels/2)
x = self.conv2(x) # (batch, 64, time/4, n_mels/4)
# Reshape for LSTM
batch, channels, time, freq = x.size()
x = x.permute(0, 2, 1, 3) # (batch, time, channels, freq)
x = x.reshape(batch, time, channels * freq)
# LSTM
x, _ = self.lstm(x) # (batch, time, 256)
# Classification
x = self.fc(x) # (batch, time, 1)
x = self.sigmoid(x) # (batch, time, 1)
return x.squeeze(-1) # (batch, time)
# Usage
model = CNNVAD(n_mels=40)
# Example input: mel-spectrogram
mel_spec = torch.randn(1, 1, 100, 40) # (batch=1, channels=1, time=100, mels=40)
# Predict
speech_prob = model(mel_spec)  # (1, 25): the pooling layers downsample time 100 -> 25
is_speech = speech_prob > 0.5 # Threshold at 0.5
print(f"Speech probability shape: {speech_prob.shape}")
print(f"Detected speech in {is_speech.sum().item()} / {is_speech.size(1)} frames")
Training ML VAD
class VADTrainer:
"""
Train VAD model
"""
def __init__(self, model, device='cuda'):
self.model = model.to(device)
self.device = device
self.criterion = nn.BCELoss()
self.optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
def train_epoch(self, train_loader):
"""Train for one epoch"""
self.model.train()
total_loss = 0
for mel_specs, labels in train_loader:
mel_specs = mel_specs.to(self.device)
labels = labels.to(self.device)
# Forward
predictions = self.model(mel_specs)
loss = self.criterion(predictions, labels)
# Backward
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
total_loss += loss.item()
return total_loss / len(train_loader)
def evaluate(self, val_loader):
"""Evaluate model"""
self.model.eval()
correct = 0
total = 0
with torch.no_grad():
for mel_specs, labels in val_loader:
mel_specs = mel_specs.to(self.device)
labels = labels.to(self.device)
predictions = self.model(mel_specs)
predicted = (predictions > 0.5).float()
correct += (predicted == labels).sum().item()
total += labels.numel()
accuracy = correct / total
return accuracy
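A minimal smoke test of the trainer on synthetic tensors (shapes follow the CNNVAD above; real training would use a labeled speech corpus, which this sketch does not assume):

from torch.utils.data import DataLoader, TensorDataset

# Synthetic batch: 32 clips, 100 mel frames x 40 bins each.
# The CNN pools time by 4, so frame labels have 100 // 4 = 25 entries.
mel_specs = torch.randn(32, 1, 100, 40)
labels = torch.randint(0, 2, (32, 25)).float()

loader = DataLoader(TensorDataset(mel_specs, labels), batch_size=8)
trainer = VADTrainer(CNNVAD(n_mels=40), device='cpu')
for epoch in range(3):
    loss = trainer.train_epoch(loader)
    print(f"epoch {epoch}: loss={loss:.4f}")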
Real-Time Streaming VAD
Process audio as it arrives (streaming).
Streaming Implementation
from collections import deque
import struct

import numpy as np
import webrtcvad
class StreamingVAD:
"""
Real-time VAD for streaming audio
Use case: Voice assistants, VoIP, live transcription
"""
def __init__(
self,
sr=16000,
frame_duration_ms=30,
aggressiveness=3,
speech_pad_ms=300
):
self.sr = sr
self.frame_duration_ms = frame_duration_ms
self.frame_length = int(sr * frame_duration_ms / 1000)
self.speech_pad_ms = speech_pad_ms
self.speech_pad_frames = int(speech_pad_ms / frame_duration_ms)
# WebRTC VAD
self.vad = webrtcvad.Vad(aggressiveness)
# State
        self.buffer = deque(maxlen=10000)  # ~0.6 s at 16 kHz; drain promptly or samples drop
self.speech_frames = 0 # Consecutive speech frames
self.silence_frames = 0 # Consecutive silence frames
self.in_speech = False
# Store speech segments
self.current_speech = []
def add_audio(self, audio_chunk):
"""
Add audio chunk to buffer
Args:
audio_chunk: New audio samples (int16)
"""
self.buffer.extend(audio_chunk)
def process_frame(self):
"""
Process one frame from buffer
Returns:
(is_speech, speech_ended, speech_audio)
"""
if len(self.buffer) < self.frame_length:
return None, False, None
# Extract frame
frame = np.array([self.buffer.popleft() for _ in range(self.frame_length)])
# Convert to bytes
frame_bytes = struct.pack('%dh' % len(frame), *frame)
# Detect
is_speech = self.vad.is_speech(frame_bytes, self.sr)
# Update state
if is_speech:
self.speech_frames += 1
self.silence_frames = 0
if not self.in_speech:
# Speech just started
self.in_speech = True
self.current_speech = []
# Add to current speech
self.current_speech.extend(frame)
else:
self.silence_frames += 1
self.speech_frames = 0
if self.in_speech:
# Add padding
self.current_speech.extend(frame)
# Check if speech ended
if self.silence_frames >= self.speech_pad_frames:
# Speech ended
self.in_speech = False
speech_audio = np.array(self.current_speech)
self.current_speech = []
return False, True, speech_audio
return is_speech, False, None
def process_stream(self):
"""
Process all buffered audio
Yields speech segments as they complete
"""
while len(self.buffer) >= self.frame_length:
is_speech, speech_ended, speech_audio = self.process_frame()
if speech_ended:
yield speech_audio
# Usage
streaming_vad = StreamingVAD(sr=16000, frame_duration_ms=30)
# Simulate streaming (process chunks as they arrive)
chunk_size = 480 # 30ms at 16kHz
for chunk_start in range(0, len(audio), chunk_size):
chunk = audio[chunk_start:chunk_start + chunk_size]
    # Convert float [-1, 1] to int16 PCM before buffering; a bare
    # astype(np.int16) would truncate every float sample to zero
    pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
    streaming_vad.add_audio(pcm)
# Process
for speech_segment in streaming_vad.process_stream():
print(f"Speech segment detected: {len(speech_segment)} samples")
# Send to ASR, save, etc.
Production Considerations
Hangover and Padding
Add padding before/after speech to avoid cutting off words.
class VADWithPadding:
"""
VAD with pre/post padding
"""
def __init__(
self,
vad,
pre_pad_ms=200,
post_pad_ms=500,
sr=16000
):
self.vad = vad
self.pre_pad_frames = int(pre_pad_ms / 30) # Assuming 30ms frames
self.post_pad_frames = int(post_pad_ms / 30)
self.sr = sr
def detect_with_padding(self, audio):
"""
Detect speech with padding
"""
is_speech = self.vad.detect(audio)
# Add pre-padding
padded = np.copy(is_speech)
for i in range(len(is_speech)):
if is_speech[i]:
# Mark previous frames as speech
start = max(0, i - self.pre_pad_frames)
padded[start:i] = True
# Add post-padding
for i in range(len(is_speech)):
if is_speech[i]:
# Mark following frames as speech
end = min(len(is_speech), i + self.post_pad_frames)
padded[i:end] = True
return padded
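The two loops above rescan the mask for every frame; for long recordings the same dilation can be done in one pass with a convolution. A vectorized sketch (pad_speech_mask is a hypothetical standalone helper, equivalent to marking [i - pre, i + post] around every speech frame i):

import numpy as np

def pad_speech_mask(is_speech, pre_pad_frames, post_pad_frames):
    """Dilate a boolean speech mask with pre/post context in one pass."""
    kernel = np.ones(pre_pad_frames + post_pad_frames + 1)
    # Full convolution counts speech frames inside each window;
    # any nonzero count means the frame should be marked as speech
    counts = np.convolve(is_speech.astype(float), kernel, mode='full')
    return counts[pre_pad_frames:pre_pad_frames + len(is_speech)] > 0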
Performance Optimization
import time
class OptimizedVAD:
"""
Optimized VAD for production
"""
def __init__(self, vad_impl):
self.vad = vad_impl
self.stats = {
'total_frames': 0,
'speech_frames': 0,
'processing_time': 0
}
def detect_with_stats(self, audio):
"""Detect with performance tracking"""
start = time.perf_counter()
is_speech = self.vad.detect(audio)
end = time.perf_counter()
# Update stats
self.stats['total_frames'] += len(is_speech)
self.stats['speech_frames'] += is_speech.sum()
self.stats['processing_time'] += (end - start)
return is_speech
def get_stats(self):
"""Get performance statistics"""
if self.stats['total_frames'] == 0:
return None
speech_ratio = self.stats['speech_frames'] / self.stats['total_frames']
avg_time_per_frame = self.stats['processing_time'] / self.stats['total_frames']
return {
'speech_ratio': speech_ratio,
'avg_latency_ms': avg_time_per_frame * 1000,
'total_frames': self.stats['total_frames'],
'speech_frames': self.stats['speech_frames']
}
Integration with ASR Pipeline
VAD as the first stage in speech recognition systems.
End-to-End Pipeline
import time

class SpeechPipeline:
"""
Complete speech recognition pipeline with VAD
Pipeline: Audio → VAD → ASR → Text
"""
def __init__(self):
# VAD
self.vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
# Placeholder for ASR model
self.asr_model = None # Would be actual ASR model
# Buffering
self.min_speech_duration = 0.5 # seconds
self.max_speech_duration = 10.0 # seconds
def process_audio_file(self, audio_path):
"""
Process audio file end-to-end
Returns:
List of transcriptions
"""
# Load audio
import librosa
audio, sr = librosa.load(audio_path, sr=16000)
# Run VAD
speech_segments = self.vad.get_speech_timestamps(audio)
# Filter by duration
valid_segments = [
(start, end) for start, end in speech_segments
if (end - start) >= self.min_speech_duration and
(end - start) <= self.max_speech_duration
]
transcriptions = []
for start, end in valid_segments:
# Extract speech segment
start_sample = int(start * sr)
end_sample = int(end * sr)
speech_audio = audio[start_sample:end_sample]
# Run ASR (placeholder)
# transcript = self.asr_model.transcribe(speech_audio)
transcript = f"[Speech from {start:.2f}s to {end:.2f}s]"
transcriptions.append({
'start': start,
'end': end,
'duration': end - start,
'text': transcript
})
return transcriptions
def process_streaming(self, audio_stream):
"""
Process streaming audio
Yields transcriptions as speech segments complete
"""
streaming_vad = StreamingVAD(sr=16000, frame_duration_ms=30)
for chunk in audio_stream:
streaming_vad.add_audio(chunk)
for speech_segment in streaming_vad.process_stream():
# Run ASR on completed segment
# transcript = self.asr_model.transcribe(speech_segment)
transcript = "[Speech detected]"
yield {
'audio': speech_segment,
'text': transcript,
'timestamp': time.time()
}
# Usage
pipeline = SpeechPipeline()
# Process file
transcriptions = pipeline.process_audio_file('conversation.wav')
for t in transcriptions:
print(f"{t['start']:.2f}s - {t['end']:.2f}s: {t['text']}")
Double-Pass VAD for Higher Accuracy
Use aggressive VAD first, then refine with ML model.
class TwoPassVAD:
"""
Two-pass VAD for improved accuracy
Pass 1: Fast WebRTC VAD (aggressive) → candidate segments
Pass 2: ML VAD (accurate) → final segments
"""
def __init__(self):
# Fast pass: WebRTC VAD (aggressive)
self.fast_vad = WebRTCVAD(sr=16000, frame_duration_ms=30, aggressiveness=3)
# Accurate pass: ML VAD
self.ml_vad = CNNVAD(n_mels=40)
self.ml_vad.eval()
def detect(self, audio):
"""
Two-pass detection
Returns:
Refined speech segments
"""
# Pass 1: Fast VAD to get candidate regions
candidate_segments = self.fast_vad.get_speech_timestamps(audio)
# Pass 2: ML VAD to refine each candidate
refined_segments = []
for start, end in candidate_segments:
# Extract segment
start_sample = int(start * 16000)
end_sample = int(end * 16000)
segment_audio = audio[start_sample:end_sample]
# Run ML VAD on segment
# Convert to mel-spectrogram
import librosa
mel_spec = librosa.feature.melspectrogram(
y=segment_audio,
sr=16000,
n_mels=40
)
# ML model prediction
# mel_tensor = torch.from_numpy(mel_spec).unsqueeze(0).unsqueeze(0)
# with torch.no_grad():
# predictions = self.ml_vad(mel_tensor)
# is_speech_frames = predictions > 0.5
            # Placeholder: accept whatever the fast VAD flagged
            # (see the refine_segment sketch after this class)
            refined_segments.append((start, end))
return refined_segments
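For completeness, here is a sketch of what the commented-out second pass could look like (refine_segment is a hypothetical helper; it assumes a trained CNNVAD and keeps a segment when its mean frame-level speech probability clears a threshold):

import librosa
import torch

def refine_segment(ml_vad, segment_audio, sr=16000, keep_threshold=0.5):
    """Second pass: keep a candidate segment only if the ML VAD agrees."""
    mel = librosa.feature.melspectrogram(y=segment_audio, sr=sr, n_mels=40)
    # Model expects (batch, 1, time, n_mels); melspectrogram returns (n_mels, time)
    mel_tensor = torch.from_numpy(mel.T).float().unsqueeze(0).unsqueeze(0)
    if mel_tensor.size(2) < 4:
        return True  # too short for the 4x time pooling; keep the fast VAD's decision
    with torch.no_grad():
        probs = ml_vad(mel_tensor)  # (1, time // 4)
    return probs.mean().item() > keep_threshold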
Comparison of VAD Methods
| Method | Pros | Cons | Use Case |
|---|---|---|---|
| Energy-based | Simple, fast, no training | Poor in noise | Quiet environments |
| ZCR + Energy | Better than energy alone | Still noise-sensitive | Moderate noise |
| WebRTC VAD | Fast, robust, production-tested | Fixed sample rates and frame sizes | Real-time apps, VoIP |
| ML-based (CNN) | Best accuracy, adaptable | Requires training, slower | High-noise, accuracy-critical |
| ML-based (RNN) | Temporal modeling | Higher latency | Offline processing |
| Hybrid (2-pass) | Balances speed and accuracy | More complex | Production ASR |
Production Deployment
Latency Budgets
For real-time applications:
Voice Assistant Latency Budget:
┌─────────────────────────────────────┐
│ VAD Detection: 5-10ms │
│ Endpoint Detection: 100-200ms │
│ ASR Processing: 500-1000ms │
│ NLU + Dialog: 100-200ms │
│ TTS Generation: 200-500ms │
├─────────────────────────────────────┤
│ Total: ~1-2 seconds│
└─────────────────────────────────────┘
VAD must be fast to keep overall latency low!
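A quick way to check whether a VAD fits such a budget is its real-time factor, RTF = processing time / audio duration. A sketch that works with any of the detect()-style VADs above:

import time

def real_time_factor(vad, audio, sr=16000):
    """RTF < 1.0 means faster than real time; a VAD should be well below 0.1."""
    start = time.perf_counter()
    vad.detect(audio)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sr)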
Resource Usage
import psutil
import time
class VADProfiler:
"""
Profile VAD performance
"""
def __init__(self, vad):
self.vad = vad
def profile(self, audio, num_runs=100):
"""
Benchmark VAD
Returns:
Performance metrics
"""
latencies = []
# Warm-up
for _ in range(10):
self.vad.detect(audio)
# Measure
        process = psutil.Process()
        process.cpu_percent()  # prime the counter; the next call reports this interval
memory_before = process.memory_info().rss / 1024 / 1024 # MB
for _ in range(num_runs):
start = time.perf_counter()
result = self.vad.detect(audio)
end = time.perf_counter()
latencies.append((end - start) * 1000) # ms
        cpu_usage_pct = process.cpu_percent()  # CPU % over the measurement window
memory_after = process.memory_info().rss / 1024 / 1024 # MB
return {
'mean_latency_ms': np.mean(latencies),
'p50_latency_ms': np.percentile(latencies, 50),
'p95_latency_ms': np.percentile(latencies, 95),
'p99_latency_ms': np.percentile(latencies, 99),
'throughput_fps': 1000 / np.mean(latencies),
            'cpu_usage_pct': cpu_usage_pct,
'memory_mb': memory_after - memory_before
}
# Usage
profiler = VADProfiler(WebRTCVAD())
audio, sr = librosa.load('test.wav', sr=16000, duration=10.0)
metrics = profiler.profile(audio)
print(f"Mean latency: {metrics['mean_latency_ms']:.2f}ms")
print(f"P95 latency: {metrics['p95_latency_ms']:.2f}ms")
print(f"Throughput: {metrics['throughput_fps']:.0f} frames/sec")
print(f"CPU usage: {metrics['cpu_usage_pct']:.1f}%")
print(f"Memory: {metrics['memory_mb']:.1f} MB")
Mobile/Edge Deployment
Optimize VAD for on-device deployment.
class MobileOptimizedVAD:
"""
VAD optimized for mobile devices
Quantized model, reduced precision, smaller memory footprint
"""
def __init__(self):
# Use int8 quantization for mobile
import torch
self.model = CNNVAD(n_mels=40)
# Quantize model
# Dynamic quantization applies to Linear/LSTM; Conv2d not supported
self.model = torch.quantization.quantize_dynamic(
self.model,
{torch.nn.Linear},
dtype=torch.qint8
)
self.model.eval()
    def detect_efficient(self, audio, sr=16000):
        """
        Efficient detection with reduced memory
        Process in 1-second chunks to bound peak memory
        """
        import librosa
        import torch
        chunk_size = sr  # 1 second chunks
        results = []
        with torch.no_grad():
            for i in range(0, len(audio), chunk_size):
                chunk = audio[i:i + chunk_size]
                if len(chunk) < 2048:
                    break  # too short to frame reliably
                mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=40)
                # Model expects (batch=1, channels=1, time, n_mels)
                mel_tensor = torch.from_numpy(mel.T).float().unsqueeze(0).unsqueeze(0)
                probs = self.model(mel_tensor)  # (1, time // 4)
                results.extend((probs.squeeze(0) > 0.5).tolist())
        return results
Monitoring & Debugging
VAD Quality Metrics
class VADEvaluator:
"""
Evaluate VAD performance
Metrics:
- Precision: % of detected speech that is actual speech
- Recall: % of actual speech that was detected
- F1 score
- False alarm rate
- Miss rate
"""
def __init__(self):
pass
def evaluate(
self,
predictions: np.ndarray,
ground_truth: np.ndarray
) -> dict:
"""
Compute VAD metrics
Args:
predictions: Binary array (1=speech, 0=non-speech)
ground_truth: Ground truth labels
Returns:
Dictionary of metrics
"""
# True positives, false positives, etc.
tp = np.sum((predictions == 1) & (ground_truth == 1))
fp = np.sum((predictions == 1) & (ground_truth == 0))
tn = np.sum((predictions == 0) & (ground_truth == 0))
fn = np.sum((predictions == 0) & (ground_truth == 1))
# Metrics
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
false_alarm_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
miss_rate = fn / (fn + tp) if (fn + tp) > 0 else 0
return {
'precision': precision,
'recall': recall,
'f1_score': f1,
'accuracy': accuracy,
'false_alarm_rate': false_alarm_rate,
'miss_rate': miss_rate,
'tp': int(tp),
'fp': int(fp),
'tn': int(tn),
'fn': int(fn)
}
# Usage
evaluator = VADEvaluator()
# Load ground truth
# ground_truth = load_annotations('test_audio.txt')
# Run VAD
vad = WebRTCVAD()
# predictions = vad.detect(audio)
# Evaluate
# metrics = evaluator.evaluate(predictions, ground_truth)
# print(f"Precision: {metrics['precision']:.3f}")
# print(f"Recall: {metrics['recall']:.3f}")
# print(f"F1 Score: {metrics['f1_score']:.3f}")
# print(f"False Alarm Rate: {metrics['false_alarm_rate']:.3f}")
Debugging Common Issues
Issue 1: Clipping Speech Beginnings
# Solution: Increase pre-padding
vad_with_padding = VADWithPadding(
vad=WebRTCVAD(),
pre_pad_ms=300, # Increase from 200ms
post_pad_ms=500
)
Issue 2: False Positives from Music
# Solution: Use ML VAD or add music classifier
class MusicFilteredVAD:
"""
VAD with music filtering
"""
def __init__(self, vad, music_classifier):
self.vad = vad
self.music_classifier = music_classifier
def detect(self, audio):
"""Detect speech, filtering out music"""
# Run VAD
speech_frames = self.vad.detect(audio)
        # Filter music: assumes the classifier returns a per-frame boolean
        # array aligned with the VAD's output frames
        is_music = self.music_classifier.predict(audio)
# Combine
is_speech = speech_frames & (~is_music)
return is_speech
Issue 3: High CPU Usage
# Solution: Downsample audio or use simpler VAD
class DownsampledVAD:
"""
VAD with audio downsampling for efficiency
"""
def __init__(self, target_sr=8000):
self.target_sr = target_sr
self.vad = WebRTCVAD(sr=8000) # 8kHz instead of 16kHz
def detect(self, audio, original_sr=16000):
"""Detect with downsampling"""
# Downsample
import librosa
audio_downsampled = librosa.resample(
audio,
orig_sr=original_sr,
target_sr=self.target_sr
)
# Run VAD on downsampled audio
return self.vad.detect(audio_downsampled)
Advanced Techniques
Noise-Robust VAD
Use spectral subtraction for noise reduction before VAD.
class NoiseRobustVAD:
"""
VAD with noise reduction preprocessing
"""
def __init__(self, vad):
self.vad = vad
def spectral_subtraction(self, audio, noise_profile):
"""
Simple spectral subtraction
Args:
audio: Input audio
noise_profile: Estimated noise spectrum
Returns:
Denoised audio
"""
import librosa
# STFT
D = librosa.stft(audio)
magnitude = np.abs(D)
phase = np.angle(D)
# Subtract noise
magnitude_clean = np.maximum(magnitude - noise_profile, 0)
# Reconstruct
D_clean = magnitude_clean * np.exp(1j * phase)
audio_clean = librosa.istft(D_clean)
return audio_clean
def detect_with_denoising(self, audio):
"""Detect speech after denoising"""
# Estimate noise from first 0.5 seconds
noise_segment = audio[:8000] # 0.5s at 16kHz
import librosa
noise_spectrum = np.abs(librosa.stft(noise_segment))
noise_profile = np.median(noise_spectrum, axis=1, keepdims=True)
# Denoise
audio_clean = self.spectral_subtraction(audio, noise_profile)
# Run VAD on clean audio
return self.vad.detect(audio_clean)
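Usage sketch (the wrapped VAD only needs a detect() that accepts float audio, e.g. the EnergyVAD from earlier; the filename is a placeholder):

nr_vad = NoiseRobustVAD(EnergyVAD(energy_threshold=0.01))
audio, sr = librosa.load('noisy_speech.wav', sr=16000)
is_speech = nr_vad.detect_with_denoising(audio)
print(f"Speech frames after denoising: {is_speech.sum()} / {len(is_speech)}")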
Multi-Condition Training Data
For ML-based VAD, train on diverse conditions.
class DataAugmentationForVAD:
"""
Augment training data for robust VAD
"""
def augment(self, clean_speech):
"""
Create augmented samples
Augmentations:
- Add various noise types
- Vary SNR levels
- Apply room reverberation
- Change speaker characteristics
"""
        augmented = []
        # 1. White noise at several SNR levels (reusing the mix_at_snr
        #    helper sketched in the requirements section)
        for snr_db in [20, 10, 5, 0]:
            noise = np.random.randn(len(clean_speech))
            augmented.append(mix_at_snr(clean_speech, noise, snr_db))
# 2. Add babble noise (simulated)
# babble = load_babble_noise()
# augmented.append(clean_speech + babble)
# 3. Apply reverberation
# reverb = apply_reverb(clean_speech)
# augmented.append(reverb)
return augmented
Real-World Deployment Examples
Zoom/Video Conferencing
Requirements:
- Ultra-low latency (< 10ms)
- Adaptive to varying network conditions
- Handle overlapping speech (multiple speakers)
Solution:
- WebRTC VAD for speed
- Adaptive aggressiveness based on network bandwidth (sketched below)
- Per-speaker VAD in multi-party calls
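A hedged sketch of the adaptive-aggressiveness idea (the packet-loss thresholds are illustrative assumptions, not values from any particular product):

import webrtcvad

def pick_aggressiveness(packet_loss_pct):
    """Heuristic: under degraded network conditions, save bandwidth by
    being more aggressive about declaring frames non-speech."""
    if packet_loss_pct > 5.0:
        return 3
    if packet_loss_pct > 1.0:
        return 2
    return 1

vad = webrtcvad.Vad(pick_aggressiveness(packet_loss_pct=2.5))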
Smart Speakers (Alexa, Google Home)
Requirements:
- Always-on (low power)
- Far-field audio (echoes, reverberation)
- Wake word detection + VAD
Solution:
- Two-stage: Wake word detector → VAD → ASR
- On-device VAD (WebRTC or lightweight ML)
- Cloud-based refinement for difficult cases
Call Centers
Requirements:
- High accuracy (for analytics)
- Speaker diarization integration
- Post-processing acceptable
Solution:
- ML-based VAD with large models
- Two-pass processing
- Combined with speaker diarization
Key Takeaways
✅ Energy + ZCR provides simple baseline VAD
✅ WebRTC VAD is production-standard, fast, robust, widely deployed
✅ ML-based VAD achieves best accuracy in noisy conditions
✅ Two-pass VAD balances speed and accuracy for production
✅ Streaming processing enables real-time applications
✅ Padding is critical to avoid cutting off speech (200-500ms)
✅ Adaptive thresholds handle varying noise levels
✅ Frame size tradeoff: Smaller = lower latency, larger = better accuracy
✅ Quantization & optimization essential for mobile/edge deployment
✅ Monitor precision/recall in production to catch degradation
✅ Integration with ASR requires careful endpoint detection logic
✅ Noise robustness via preprocessing or multi-condition training
Originally published at: arunbaby.com/speech-tech/0004-voice-activity-detection
If you found this helpful, consider sharing it with others who might benefit.