Speech Command Classification
How voice assistants recognize “turn on the lights” from raw audio in under 100ms without full ASR transcription.
TL;DR
Direct audio-to-intent classification skips full ASR transcription entirely, classifying voice commands from mel-spectrogram features in 45ms (5x faster than ASR+NLU). A CNN model achieves 93% accuracy at 2MB/15ms on-device, while knowledge distillation from a Transformer teacher pushes this to 95% without increasing model size. Unknown command handling combines an explicit “unknown” training class, confidence thresholding, and entropy-based out-of-distribution detection. Production deployments use hybrid edge-cloud routing where high-confidence commands execute locally and uncertain queries fall back to cloud streaming ASR.

Introduction
When you say “Alexa, turn off the lights” or “Hey Google, set a timer,” your voice assistant doesn’t actually transcribe your speech to text first. Instead, it uses a direct audio-to-intent classification system that’s:
- Faster than ASR + NLU (50-100ms vs 200-500ms)
- Smaller (< 10MB models vs 100MB+)
- Able to run offline (on-device inference)
- More privacy-preserving (no text sent to cloud)
This approach is perfect for a limited vocabulary of commands (30-100 commands) where you care more about speed and privacy than open-ended understanding.
What you’ll learn:
- Why direct audio→intent beats ASR→NLU for commands
- Audio feature extraction (MFCCs, mel-spectrograms)
- Model architectures (CNN, RNN, Attention)
- Training strategies and data augmentation
- On-device deployment and optimization
- Unknown command handling (OOD detection)
- Real-world examples from Google, Amazon, Apple
Problem Definition
Design a speech command classification system for a voice assistant that:
Functional Requirements
- Multi-class Classification
- 30-50 predefined commands
- Examples: “lights on”, “volume up”, “play music”, “stop timer”
- Support synonyms and variations
- Unknown Detection
- Detect and reject out-of-vocabulary audio
- Handle background conversation
- Distinguish commands from non-commands
- Multi-language Support
- 5+ languages initially
- Shared model or separate models per language
- Context Awareness
- Optional: Use device state as context
- Example: “turn it off” depends on what’s currently on
Non-Functional Requirements
- Latency
- End-to-end < 100ms
- Includes audio buffering, processing, inference
- Model Constraints
- Model size < 10MB (on-device)
- RAM usage < 50MB during inference
- CPU-only (no GPU on most devices)
- Accuracy
- > 95% on target commands (clean audio)
- > 90% on noisy audio
- < 5% false positive rate
- Throughput
- 1000 QPS per server (cloud)
- Single inference on device
Why Not ASR + NLU?
Traditional Pipeline
Audio → ASR → Text → NLU → Intent
"lights on" → ASR (200ms) → "lights on" → NLU (50ms) → {action: "lights", state: "on"}
Total latency: 250ms
Direct Classification
Audio → Audio Features → CNN → Intent
"lights on" → Mel-spec (5ms) → CNN (40ms) → {action: "lights", state: "on"}
Total latency: 45ms
Advantages:
- ✅ 5x faster (45ms vs 250ms)
- ✅ 10x smaller model (5MB vs 50MB)
- ✅ Works offline
- ✅ More private (no text)
- ✅ Fewer points of failure
Disadvantages:
- ❌ Limited vocabulary (30-50 commands vs unlimited)
- ❌ Less flexible (new commands need retraining)
- ❌ Can’t handle complex queries (“turn on the lights in the living room at 8pm”)
When to use each:
- Direct classification: Simple commands, latency-critical, on-device
- ASR + NLU: Complex queries, unlimited vocabulary, cloud-based
Architecture
Audio Input (1-2 seconds @ 16kHz)
↓
Audio Preprocessing
├─ Resampling (if needed)
├─ Padding/Trimming to fixed length
└─ Normalization
↓
Feature Extraction
├─ MFCCs (40 coefficients)
or
├─ Mel-Spectrogram (40 bins)
↓
Neural Network
├─ CNN (fastest, on-device)
or
├─ RNN (better temporal modeling)
or
├─ Attention (best accuracy, slower)
↓
Softmax Layer (31 classes)
├─ 30 command classes
└─ 1 unknown class
↓
Post-processing
├─ Confidence thresholding
├─ Unknown detection
└─ Output filtering
↓
Prediction: {command: "lights_on", confidence: 0.94}
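Tying the diagram together, here is a minimal end-to-end inference sketch. It assumes the helpers built up in the sections below (preprocess_audio, normalize_audio, extract_mel_spectrogram, the CNN model) and a command_classes list whose last entry is "unknown":
import numpy as np
import torch

def classify_command(audio: np.ndarray, model, command_classes, threshold=0.7):
    """End-to-end: raw waveform -> {command, confidence}."""
    audio = preprocess_audio(audio)            # pad/trim to 1s (Component 1)
    audio = normalize_audio(audio)
    features = extract_mel_spectrogram(audio)  # (time, n_mels) (Component 2)
    x = torch.tensor(features, dtype=torch.float32)
    x = x.unsqueeze(0).unsqueeze(0)            # (1, 1, time, n_mels) for the CNN
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    conf, idx = torch.max(probs, 0)
    if conf < threshold:                       # post-processing (Component 4)
        return {'command': 'unknown', 'confidence': float(conf)}
    return {'command': command_classes[idx], 'confidence': float(conf)}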
Component 1: Audio Preprocessing
Fixed-Length Input
Problem: Audio clips have variable duration (0.5s - 3s)
Solution: Standardize to fixed length (e.g., 1 second)
import numpy as np

def preprocess_audio(audio: np.ndarray, sr=16000, target_duration=1.0):
"""
Ensure all audio clips are same length
Args:
audio: Audio waveform
sr: Sample rate
target_duration: Target duration in seconds
Returns:
Processed audio of length sr * target_duration
"""
target_length = int(sr * target_duration)
# Pad if too short
if len(audio) < target_length:
pad_length = target_length - len(audio)
audio = np.pad(audio, (0, pad_length), mode='constant')
# Trim if too long
elif len(audio) > target_length:
# Take central portion
start = (len(audio) - target_length) // 2
audio = audio[start:start + target_length]
return audio
Why fixed length?
- Neural networks expect fixed-size inputs
- Enables batching during training
- Simplifies model architecture
Alternative: Variable-length with padding
def pad_sequence(audios: list, sr=16000):
"""
Pad multiple audio clips to longest length
Used during batched inference
"""
max_length = max(len(a) for a in audios)
padded = []
masks = []
for audio in audios:
pad_length = max_length - len(audio)
padded_audio = np.pad(audio, (0, pad_length))
mask = np.ones(len(audio)).tolist() + [0] * pad_length
padded.append(padded_audio)
masks.append(mask)
return np.array(padded), np.array(masks)
Normalization
def normalize_audio(audio: np.ndarray) -> np.ndarray:
"""
Normalize audio to [-1, 1] range
Improves model convergence and generalization
"""
# Peak normalization
max_val = np.max(np.abs(audio))
if max_val > 0:
audio = audio / max_val
return audio
def normalize_rms(audio: np.ndarray, target_rms=0.1) -> np.ndarray:
"""
Normalize by RMS (root mean square) energy
Better for handling volume variations
"""
current_rms = np.sqrt(np.mean(audio ** 2))
if current_rms > 0:
audio = audio * (target_rms / current_rms)
return audio
Component 2: Feature Extraction
Option 1: MFCCs (Mel-Frequency Cepstral Coefficients)
MFCCs capture the spectral envelope of speech, which is important for phonetic content.
import librosa
def extract_mfcc(audio, sr=16000, n_mfcc=40, n_fft=512, hop_length=160):
"""
Extract MFCC features
Args:
audio: Waveform
sr: Sample rate (Hz)
n_mfcc: Number of MFCC coefficients
n_fft: FFT window size
hop_length: Hop length between frames (10ms at 16kHz)
Returns:
MFCCs: (n_mfcc, time_steps)
"""
# Compute MFCCs
mfccs = librosa.feature.mfcc(
y=audio,
sr=sr,
n_mfcc=n_mfcc,
n_fft=n_fft,
hop_length=hop_length,
n_mels=40, # Number of mel bands
fmin=20, # Minimum frequency
fmax=sr//2 # Maximum frequency (Nyquist)
)
# Add delta (velocity) and delta-delta (acceleration)
delta = librosa.feature.delta(mfccs)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack all features
features = np.vstack([mfccs, delta, delta2]) # (120, time)
return features.T # (time, 120)
Why delta features?
- MFCCs: Spectral shape (what phonemes)
- Delta: How spectral shape is changing (dynamics)
- Delta-delta: Rate of change (acceleration)
Together they capture both static and dynamic characteristics of speech.
Option 2: Mel-Spectrogram
Mel-spectrograms preserve more temporal resolution than MFCCs.
def extract_mel_spectrogram(audio, sr=16000, n_mels=40, n_fft=512, hop_length=160):
"""
Extract log mel-spectrogram
Returns:
Log mel-spectrogram: (time, n_mels)
"""
# Compute mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=sr,
n_fft=n_fft,
hop_length=hop_length,
n_mels=n_mels,
fmin=20,
fmax=sr//2
)
# Convert to log scale (dB)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel.T # (time, n_mels)
MFCCs vs Mel-Spectrogram:
| Feature | MFCCs | Mel-Spectrogram |
|---|---|---|
| Size | (time, 13-40) | (time, 40-80) |
| Information | Spectral envelope | Full spectrum |
| Works better with | Small models | CNNs (image-like) |
| Training time | Faster | Slower |
| Accuracy | Slightly lower | Slightly higher |
Recommendation: Use mel-spectrograms with CNNs for best accuracy.
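As a quick sanity check of the feature geometry (a sketch using the extractors above on dummy audio): 1 second at 16kHz with hop_length=160 yields about 101 frames.
import numpy as np

audio = np.random.randn(16000).astype(np.float32)  # 1s of dummy audio @ 16kHz
mel = extract_mel_spectrogram(audio)               # ~101 frames with hop_length=160
mfcc = extract_mfcc(audio)
print(mel.shape)   # (101, 40): time x mel bins
print(mfcc.shape)  # (101, 120): time x (40 MFCC + 40 delta + 40 delta-delta)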
Component 3: Model Architectures
Architecture 1: CNN (Fastest for On-Device)
import torch
import torch.nn as nn
class CommandCNN(nn.Module):
"""
CNN for audio command classification
Treats mel-spectrogram as 2D image
"""
def __init__(self, num_classes=31, input_channels=1):
super().__init__()
# Convolutional layers
self.conv1 = nn.Sequential(
nn.Conv2d(input_channels, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv2 = nn.Sequential(
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv3 = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# Global average pooling (instead of fully-connected)
self.gap = nn.AdaptiveAvgPool2d((1, 1))
# Classification head
self.classifier = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(128, num_classes)
)
def forward(self, x):
# x: (batch, 1, time, freq)
x = self.conv1(x) # → (batch, 32, time/2, freq/2)
x = self.conv2(x) # → (batch, 64, time/4, freq/4)
x = self.conv3(x) # → (batch, 128, time/8, freq/8)
x = self.gap(x) # → (batch, 128, 1, 1)
x = x.view(x.size(0), -1) # → (batch, 128)
x = self.classifier(x) # → (batch, num_classes)
return x
# Model size: ~2MB
# Inference time (CPU): 15ms
# Accuracy: ~93%
Why CNNs work for audio:
- Local patterns: Phonemes have localized frequency patterns
- Translation invariance: Command can start at different times
- Parameter sharing: Same filters across time/frequency
- Efficient: Mostly matrix operations, highly optimized
Architecture 2: RNN (Better Temporal Modeling)
class CommandRNN(nn.Module):
"""
RNN for command classification
Better at capturing temporal dependencies
"""
def __init__(self, input_dim=40, hidden_dim=128, num_layers=2, num_classes=31):
super().__init__()
# LSTM layers
self.lstm = nn.LSTM(
input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
bidirectional=True,
dropout=0.2
)
# Attention mechanism (optional)
self.attention = nn.Linear(hidden_dim * 2, 1)
# Classification head
self.classifier = nn.Linear(hidden_dim * 2, num_classes)
def forward(self, x):
# x: (batch, time, features)
# LSTM
lstm_out, _ = self.lstm(x) # → (batch, time, hidden*2)
# Attention pooling (instead of taking last time step)
attention_weights = torch.softmax(
self.attention(lstm_out), # → (batch, time, 1)
dim=1
)
# Weighted sum
context = torch.sum(attention_weights * lstm_out, dim=1) # → (batch, hidden*2)
# Classify
logits = self.classifier(context) # → (batch, num_classes)
return logits
# Model size: ~5MB
# Inference time (CPU): 30ms
# Accuracy: ~95%
Architecture 3: Attention-Based (Best Accuracy)
class CommandTransformer(nn.Module):
"""
Transformer for command classification
Best accuracy but slower inference
"""
def __init__(self, input_dim=40, d_model=128, nhead=4, num_layers=2, num_classes=31):
super().__init__()
# Input projection
self.embedding = nn.Linear(input_dim, d_model)
# Positional encoding
self.pos_encoder = PositionalEncoding(d_model)
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=nhead,
dim_feedforward=d_model * 4,
dropout=0.1
)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
# Classification head
self.classifier = nn.Linear(d_model, num_classes)
def forward(self, x):
# x: (batch, time, features)
# Project to d_model
x = self.embedding(x) # → (batch, time, d_model)
# Add positional encoding
x = self.pos_encoder(x)
# Transformer expects (time, batch, d_model)
x = x.transpose(0, 1)
x = self.transformer(x)
x = x.transpose(0, 1)
# Average pool over time
x = x.mean(dim=1) # → (batch, d_model)
# Classify
logits = self.classifier(x) # → (batch, num_classes)
return logits
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.register_buffer('pe', pe.unsqueeze(0))
def forward(self, x):
return x + self.pe[:, :x.size(1), :]
# Model size: ~8MB
# Inference time (CPU): 50ms
# Accuracy: ~97%
Model Comparison
| Model | Params | Size | CPU Latency | GPU Latency | Accuracy | Best For |
|---|---|---|---|---|---|---|
| CNN | 500K | 2MB | 15ms | 3ms | 93% | Mobile devices |
| RNN | 1.2M | 5MB | 30ms | 5ms | 95% | Balanced |
| Transformer | 2M | 8MB | 50ms | 8ms | 97% | Cloud/high-end |
Production choice: CNN for on-device, RNN for cloud
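These numbers are hardware-dependent, so measure on your own target. A minimal benchmarking sketch (timings and parameter counts will vary with the exact configuration):
import time
import torch

model = CommandCNN(num_classes=31).eval()
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e3:.0f}K")

x = torch.randn(1, 1, 100, 40)  # (batch, channel, time, freq)
with torch.no_grad():
    for _ in range(10):          # warm-up
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    ms = (time.perf_counter() - start) / 100 * 1000
print(f"CPU latency: {ms:.1f} ms/inference")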
Training Strategy
Data Collection
Per command, need:
- 1000-5000 examples
- 100+ speakers (diversity)
- Both genders, various ages
- Different accents
- Background noise variations
- Different recording devices
Example dataset structure:
data/
├── lights_on/
│ ├── speaker001_01.wav
│ ├── speaker001_02.wav
│ ├── speaker002_01.wav
│ └── ...
├── lights_off/
│ └── ...
├── volume_up/
│ └── ...
└── unknown/
├── random_speech/
├── music/
├── noise/
└── silence/
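A minimal PyTorch Dataset over this layout (a sketch: class names come from the top-level folder names, preprocess_audio is the helper from Component 1, and SpeechCommandsDataset is a hypothetical name):
from pathlib import Path
import librosa
import torch
from torch.utils.data import Dataset

class SpeechCommandsDataset(Dataset):
    def __init__(self, root: str, sr: int = 16000):
        self.sr = sr
        self.classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
        self.samples = [
            (wav, idx)
            for idx, cls in enumerate(self.classes)
            for wav in Path(root, cls).rglob('*.wav')  # rglob also covers unknown/ subfolders
        ]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, i):
        path, label = self.samples[i]
        audio, _ = librosa.load(path, sr=self.sr)
        audio = preprocess_audio(audio, sr=self.sr)  # pad/trim to fixed 1s
        return torch.from_numpy(audio), label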
Data Augmentation
Critical for robustness! Augment during training:
import random
def augment_audio(audio, sr=16000):
"""
Apply random augmentation
Each training example augmented differently
"""
augmentations = [
add_noise,
time_shift,
time_stretch,
pitch_shift,
add_reverb
]
# Apply 1-3 random augmentations
num_augs = random.randint(1, 3)
selected = random.sample(augmentations, num_augs)
for aug_fn in selected:
audio = aug_fn(audio, sr)
return audio
def add_noise(audio, sr, snr_db=None):
    """Add background noise at a target SNR"""
    if snr_db is None:
        snr_db = random.uniform(5, 20)  # Default args are evaluated once, so sample per call here
    # Load random noise sample (load_random_noise_sample is an assumed helper)
    noise = load_random_noise_sample(len(audio))
    # Calculate noise power for target SNR
    audio_power = np.mean(audio ** 2)
    noise_power = audio_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(noise_power / np.mean(noise ** 2))
    return audio + noise_scaled
def time_shift(audio, sr, shift_max=0.1):
"""Shift audio in time (simulates different reaction times)"""
shift = int(sr * shift_max * (random.random() - 0.5))
return np.roll(audio, shift)
def time_stretch(audio, sr, rate=None):
    """Change speed without changing pitch"""
    if rate is None:
        rate = random.uniform(0.9, 1.1)  # Sample per call, not at definition time
    return librosa.effects.time_stretch(audio, rate=rate)
def pitch_shift(audio, sr, n_steps=None):
    """Shift pitch (simulates different speakers)"""
    if n_steps is None:
        n_steps = random.randint(-2, 2)  # Sample per call
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)
def add_reverb(audio, sr):
    """Add room reverb (simulates different environments)"""
    # Convolve with an impulse response (generate_simple_reverb is an assumed helper)
    impulse_response = generate_simple_reverb(sr)
    return np.convolve(audio, impulse_response, mode='same')
Impact: 2-3x effective dataset size; typically a 10-20% relative accuracy improvement on noisy audio
Training Loop
def train_command_classifier(
model,
train_loader,
val_loader,
epochs=100,
lr=0.001
):
"""
Train speech command classifier
"""
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='max',   # Step on validation accuracy
    factor=0.5,
    patience=5
)
best_val_acc = 0.0
for epoch in range(epochs):
# Training
model.train()
train_loss = 0
train_correct = 0
train_total = 0
for batch_idx, (audio, labels) in enumerate(train_loader):
# Extract features
features = extract_features_batch(audio, sr=16000)
features = torch.tensor(features, dtype=torch.float32)
# Add channel dimension for CNN
if len(features.shape) == 3:
features = features.unsqueeze(1) # (batch, 1, time, freq)
labels = torch.tensor(labels, dtype=torch.long)
# Forward
outputs = model(features)
loss = criterion(outputs, labels)
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Track accuracy
_, predicted = torch.max(outputs, 1)
train_correct += (predicted == labels).sum().item()
train_total += labels.size(0)
train_loss += loss.item()
train_acc = train_correct / train_total
avg_loss = train_loss / len(train_loader)
# Validation
val_acc = validate(model, val_loader)
# Learning rate scheduling
scheduler.step(val_acc)
# Save best model
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_model.pth')
print(f"✓ New best model: {val_acc:.4f}")
print(f"Epoch {epoch+1}/{epochs}: "
f"Loss={avg_loss:.4f}, "
f"Train Acc={train_acc:.4f}, "
f"Val Acc={val_acc:.4f}")
return model
def validate(model, val_loader):
"""Evaluate on validation set"""
model.eval()
correct = 0
total = 0
with torch.no_grad():
for audio, labels in val_loader:
features = extract_features_batch(audio)
features = torch.tensor(features).unsqueeze(1)
labels = torch.tensor(labels)
outputs = model(features)
_, predicted = torch.max(outputs, 1)
correct += (predicted == labels).sum().item()
total += labels.size(0)
return correct / total
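The loop above references extract_features_batch, which was not defined; a minimal sketch (assuming all clips were already padded/trimmed to the same length):
import numpy as np

def extract_features_batch(audio_batch, sr=16000):
    """Stack log mel-spectrograms for a batch of equal-length waveforms."""
    features = [extract_mel_spectrogram(np.asarray(a), sr=sr) for a in audio_batch]
    return np.stack(features)  # (batch, time, n_mels)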
Component 4: Handling Unknown Commands
Strategy 1: Add “Unknown” Class
# Training data
command_classes = [
"lights_on", "lights_off", "volume_up", "volume_down",
"play_music", "stop", "pause", "next", "previous",
# ... 30 total commands
]
# Collect negative examples
unknown_class = [
"random_speech", # Conversations
"music", # Background music
"noise", # Environmental sounds
"silence" # No speech
]
# Labels: 0-29 for commands, 30 for unknown
all_classes = command_classes + ["unknown"]
Collecting unknown data:
# Record actual user interactions
# Label anything that's NOT a command as "unknown"
unknown_samples = []
for audio in production_audio_stream:
if not is_valid_command(audio):
unknown_samples.append(audio)
if len(unknown_samples) >= 10000:
# Add to training set
augment_and_save(unknown_samples, label="unknown")
Strategy 2: Confidence Thresholding
def predict_with_threshold(model, audio, threshold=0.7):
"""
Reject low-confidence predictions as unknown
"""
# Extract features
features = extract_mel_spectrogram(audio)
features = torch.tensor(features).unsqueeze(0).unsqueeze(0)
# Predict
with torch.no_grad():
logits = model(features)
probs = torch.softmax(logits, dim=1)[0]
# Get top prediction
max_prob, predicted_class = torch.max(probs, 0)
# Threshold check
if max_prob < threshold:
return "unknown", float(max_prob)
return command_classes[predicted_class], float(max_prob)
Strategy 3: Out-of-Distribution (OOD) Detection
def detect_ood_with_entropy(probs):
"""
High entropy = model is uncertain = likely OOD
"""
entropy = -torch.sum(probs * torch.log(probs + 1e-10))
# Calibrate threshold on validation set
# In-distribution: entropy ~0.5
# Out-of-distribution: entropy > 2.0
if entropy > 2.0:
return True # OOD
return False
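The 2.0 cutoff above is illustrative; calibrate it on held-out data. A sketch, assuming val_probs holds softmax outputs for known in-distribution validation clips:
import torch

def calibrate_entropy_threshold(val_probs, percentile=95):
    """Pick the cutoff so ~5% of in-distribution clips would be rejected."""
    entropies = torch.stack([
        -torch.sum(p * torch.log(p + 1e-10)) for p in val_probs
    ])
    return torch.quantile(entropies, percentile / 100.0).item()

# usage: entropy > calibrate_entropy_threshold(val_probs) -> treat as OOD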
def detect_ood_with_mahalanobis(features, class_means, class_covariances):
"""
Mahalanobis distance to class centroids
Far from all classes = likely OOD
"""
min_distance = float('inf')
for class_idx in range(len(class_means)):
mean = class_means[class_idx]
cov = class_covariances[class_idx]
# Mahalanobis distance
diff = features - mean
distance = np.sqrt(diff.T @ np.linalg.inv(cov) @ diff)
min_distance = min(min_distance, distance)
# Threshold: 3-sigma rule
if min_distance > 3.0:
return True # OOD
return False
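The class means and covariances must be fit beforehand from training data; a sketch, assuming features_by_class holds penultimate-layer embeddings grouped per command class:
import numpy as np

def fit_class_statistics(features_by_class):
    """features_by_class: list of (n_samples, dim) arrays, one per class."""
    class_means, class_covariances = [], []
    for feats in features_by_class:
        class_means.append(feats.mean(axis=0))
        # Regularize so the covariance stays invertible with few samples
        cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
        class_covariances.append(cov)
    return class_means, class_covariances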
Model Optimization for Edge Deployment
Quantization
# Post-training quantization (dynamic quantization targets Linear; Conv2d not supported)
model_fp32 = CommandCNN(num_classes=31)
model_fp32.load_state_dict(torch.load('model.pth'))
model_fp32.eval()
# Dynamic quantization (Linear layers)
model_int8 = torch.quantization.quantize_dynamic(
model_fp32,
{torch.nn.Linear},
dtype=torch.qint8
)
# Save
torch.save(model_int8.state_dict(), 'model_int8.pth')
# Results (typical on CPU with CNN head including Linear):
# - Model size: 2MB → ~1.2MB (1.6x smaller)
# - Inference: 15ms → ~10-12ms (1.3-1.5x faster)
# - Accuracy: ~93.2% → ~93.0% (≤0.2% drop)
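Because dynamic quantization leaves Conv2d in FP32, static post-training quantization (which does quantize convolutions) gets closer to the often-cited ~4x compression. A sketch using PyTorch FX graph mode (APIs vary across PyTorch versions; calibration_loader is an assumed loader of representative feature batches):
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping('fbgemm')  # x86 backend
example_input = torch.randn(1, 1, 100, 40)
prepared = prepare_fx(model_fp32, qconfig_mapping, example_inputs=(example_input,))
# Calibrate observers with a few hundred representative batches
with torch.no_grad():
    for features, _ in calibration_loader:
        prepared(features)
model_int8_static = convert_fx(prepared)  # Conv2d + Linear now int8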
Pruning
import torch.nn.utils.prune as prune
def prune_model(model, amount=0.3):
"""
Remove 30% of weights with lowest magnitude
"""
for name, module in model.named_modules():
if isinstance(module, (nn.Conv2d, nn.Linear)):
prune.l1_unstructured(module, name='weight', amount=amount)
return model
# Results with 30% pruning:
# - Model size: 2MB → 1.4MB
# - Inference: 15ms → 12ms
# - Accuracy: 93.2% → 92.7%
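Note that l1_unstructured only attaches a mask, so the size/latency gains above require baking the masks in (and serving weights in a sparsity-aware format). A short sketch to make pruning permanent:
def finalize_pruning(model):
    """Fold weight_orig * mask back into a plain weight tensor."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.remove(module, 'weight')
    return model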
Knowledge Distillation
def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
"""
Train small student to mimic large teacher
Args:
temperature: Soften probability distributions
alpha: Weight between soft and hard targets
"""
# Soft targets from teacher
soft_targets = torch.softmax(teacher_logits / temperature, dim=1)
soft_prob = torch.log_softmax(student_logits / temperature, dim=1)
soft_loss = -torch.sum(soft_targets * soft_prob) / soft_prob.size()[0]
soft_loss = soft_loss * (temperature ** 2)
# Hard targets (ground truth)
hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
# Combine
return alpha * soft_loss + (1 - alpha) * hard_loss
# Train student
teacher = CommandTransformer(num_classes=31)  # 8MB, 97% accuracy
student = CommandCNN(num_classes=31)          # 2MB, 93% accuracy
teacher.eval()  # Freeze teacher weights
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for features, labels in train_loader:
    # Note: the Transformer teacher and CNN student expect differently shaped
    # inputs in practice; shape adaptation is omitted here for brevity.
    # Teacher predictions (frozen)
    with torch.no_grad():
        teacher_logits = teacher(features)
    # Student predictions
    student_logits = student(features)
    # Distillation loss
    loss = distillation_loss(student_logits, teacher_logits, labels)
    # Optimize student
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Result: Student achieves 95% (vs 93% without distillation)
On-Device Deployment
Export to Mobile Formats
TensorFlow Lite (Android):
import tensorflow as tf
# Convert PyTorch to TensorFlow (via ONNX)
# 1. Export PyTorch to ONNX
dummy_input = torch.randn(1, 1, 100, 40)  # (batch, channel, time, freq)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output']
)
# 2. Convert ONNX to TF
import onnx
from onnx_tf.backend import prepare
onnx_model = onnx.load("model.onnx")
tf_model = prepare(onnx_model)
tf_model.export_graph("model_tf")
# 3. Convert TF to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('command_classifier.tflite', 'wb') as f:
f.write(tflite_model)
Core ML (iOS):
import coremltools as ct
# Trace PyTorch model
example_input = torch.randn(1, 1, 100, 40)
traced_model = torch.jit.trace(model, example_input)
# Convert to Core ML
coreml_model = ct.convert(
traced_model,
inputs=[ct.TensorType(name="audio", shape=(1, 1, 100, 40))],
outputs=[ct.TensorType(name="logits")]
)
# Add metadata
coreml_model.author = "Arun Baby"
coreml_model.short_description = "Speech command classifier"
coreml_model.version = "1.0"
# Save
coreml_model.save("CommandClassifier.mlmodel")
Mobile Inference Code
Android (Kotlin):
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder
class CommandClassifier(private val context: Context) {
private lateinit var interpreter: Interpreter
init {
// Load model
val model = loadModelFile("command_classifier.tflite")
interpreter = Interpreter(model)
}
fun classify(audio: FloatArray): Pair<String, Float> {
// Extract features
val features = extractMelSpectrogram(audio)
// Prepare input
val inputBuffer = ByteBuffer.allocateDirect(4 * features.size)
inputBuffer.order(ByteOrder.nativeOrder())
features.forEach { inputBuffer.putFloat(it) }
// Prepare output
val output = Array(1) { FloatArray(31) }
// Run inference
interpreter.run(inputBuffer, output)
// Get top prediction
val probabilities = output[0]
val maxIndex = probabilities.indices.maxByOrNull { probabilities[it] } ?: 0
val confidence = probabilities[maxIndex]
return Pair(commandNames[maxIndex], confidence)
}
}
iOS (Swift):
import CoreML
class CommandClassifier {
private var model: CommandClassifierModel!
init() {
model = try! CommandClassifierModel(configuration: MLModelConfiguration())
}
func classify(audio: [Float]) -> (command: String, confidence: Double) {
// Extract features
let features = extractMelSpectrogram(audio)
// Create MLMultiArray
let input = try! MLMultiArray(shape: [1, 1, 100, 40], dataType: .float32)
for i in 0..<features.count {
input[i] = NSNumber(value: features[i])
}
// Run inference
let output = try! model.prediction(audio: input)
        // Get top prediction (MLMultiArray has no argmax, so scan manually)
        let probabilities = output.logits
        var maxIndex = 0
        var maxValue = probabilities[0].doubleValue
        for i in 1..<probabilities.count {
            let v = probabilities[i].doubleValue
            if v > maxValue {
                maxValue = v
                maxIndex = i
            }
        }
        return (commandNames[maxIndex], maxValue)
}
}
Monitoring & Evaluation
Metrics Dashboard
from dataclasses import dataclass
from typing import List
@dataclass
class ClassificationMetrics:
"""Per-class metrics"""
precision: float
recall: float
f1_score: float
support: int # Number of samples
def compute_metrics(y_true: List[int], y_pred: List[int], num_classes: int):
"""
Compute detailed metrics per class
"""
from sklearn.metrics import classification_report, confusion_matrix
# Per-class metrics
report = classification_report(y_true, y_pred, output_dict=True)
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Identify problematic classes
for i in range(num_classes):
if report[str(i)]['f1-score'] < 0.85:
print(f"⚠️ Class {i} ({command_names[i]}) has low F1: {report[str(i)]['f1-score']:.3f}")
# Find most confused class
confused_with = cm[i].argmax()
if confused_with != i:
print(f" Most confused with class {confused_with} ({command_names[confused_with]})")
return report, cm
Online Monitoring
class OnlineMetricsTracker:
"""
Track metrics in production
"""
def __init__(self):
self.predictions = []
self.confidences = []
self.latencies = []
def record(self, prediction: int, confidence: float, latency_ms: float):
"""Record single prediction"""
self.predictions.append(prediction)
self.confidences.append(confidence)
self.latencies.append(latency_ms)
def get_stats(self, last_n=1000):
"""Get recent statistics"""
recent_preds = self.predictions[-last_n:]
recent_confs = self.confidences[-last_n:]
recent_lats = self.latencies[-last_n:]
# Class distribution
from collections import Counter
class_dist = Counter(recent_preds)
return {
'total_predictions': len(recent_preds),
'class_distribution': dict(class_dist),
'avg_confidence': np.mean(recent_confs),
'low_confidence_rate': sum(c < 0.7 for c in recent_confs) / len(recent_confs),
'p50_latency': np.percentile(recent_lats, 50),
'p95_latency': np.percentile(recent_lats, 95),
'p99_latency': np.percentile(recent_lats, 99)
}
Multi-Language Support
Approach 1: Separate Models per Language
Pros:
- Best accuracy per language
- Language-specific optimizations
- Easier to add new languages
Cons:
- Multiple models to maintain
- Higher storage footprint
- Language detection needed first
class MultilingualClassifier:
"""
Separate model per language
"""
def __init__(self):
self.models = {
'en': load_model('command_en.pth'),
'es': load_model('command_es.pth'),
'fr': load_model('command_fr.pth'),
'de': load_model('command_de.pth'),
'ja': load_model('command_ja.pth')
}
self.language_detector = load_model('lang_detect.pth')
def predict(self, audio):
# Detect language first
language = self.language_detector.predict(audio)
# Use language-specific model
model = self.models[language]
prediction = model.predict(audio)
return prediction, language
Storage requirement: 5 languages × 2MB = 10MB
Approach 2: Multilingual Shared Model
Training strategy:
def train_multilingual_model():
"""
Single model trained on all languages
Add language ID as auxiliary input
"""
model = MultilingualCommandCNN(
num_classes=30,
num_languages=5
)
# Training data from all languages
for audio, command_label, lang_id in train_loader:
features = extract_features(audio)
# Forward pass with language embedding
command_pred = model(features, lang_id)
# Loss
loss = criterion(command_pred, command_label)
loss.backward()
optimizer.step()
return model
Model architecture:
class MultilingualCommandCNN(nn.Module):
"""
Shared model with language embeddings
"""
def __init__(self, num_classes=30, num_languages=5, embedding_dim=16):
super().__init__()
# Language embedding
self.lang_embedding = nn.Embedding(num_languages, embedding_dim)
# Shared CNN backbone
self.cnn = CommandCNN(num_classes=128) # Feature extractor
# Language-conditioned classifier
self.classifier = nn.Linear(128 + embedding_dim, num_classes)
def forward(self, audio_features, language_id):
# CNN features
cnn_features = self.cnn(audio_features) # (batch, 128)
# Language embedding
lang_emb = self.lang_embedding(language_id) # (batch, 16)
# Concatenate
combined = torch.cat([cnn_features, lang_emb], dim=1) # (batch, 144)
# Classify
logits = self.classifier(combined) # (batch, num_classes)
return logits
Pros:
- Single model (2-3MB)
- Shared representations across languages
- Transfer learning for low-resource languages
Cons:
- Slightly lower accuracy per language
- All languages must use same command set
Failure Cases & Mitigation
Common Failure Modes
1. Background Speech/TV
Problem: Model activates on TV dialogue or background conversation
Mitigation:
def detect_background_speech(audio, sr=16000):
"""
Detect if audio is from TV/background vs direct user speech
Features:
- Energy envelope variation (TV more consistent)
- Reverb characteristics (TV more reverberant)
- Spectral rolloff (TV often compressed)
"""
# Energy variation
frame_energy = librosa.feature.rms(y=audio)[0]
energy_std = np.std(frame_energy)
# TV has lower energy variation
if energy_std < 0.01:
return True # Likely background
# Spectral centroid (TV often band-limited)
spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]
avg_centroid = np.mean(spectral_centroid)
if avg_centroid < 1000: # Hz
return True # Likely background
return False
Additional strategy: Use speaker verification to check if it’s the registered user
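A minimal sketch of that check, assuming a speaker_embedding(audio) extractor (e.g., a small d-vector model) and an embedding enrolled at device setup:
import numpy as np

def is_registered_user(audio, enrolled_embedding, threshold=0.75):
    """Cosine similarity between the enrolled voice profile and this clip."""
    emb = speaker_embedding(audio)  # assumed embedding extractor
    cos = np.dot(emb, enrolled_embedding) / (
        np.linalg.norm(emb) * np.linalg.norm(enrolled_embedding) + 1e-10
    )
    return cos >= threshold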
2. Accented Speech
Problem: Model trained on standard accent performs poorly on regional accents
Mitigation:
# Data collection strategy
accent_distribution = {
'general_american': 0.3,
'british': 0.15,
'australian': 0.1,
'indian': 0.15,
'southern_us': 0.1,
'canadian': 0.1,
'other': 0.1
}
# Ensure balanced training data
for accent, proportion in accent_distribution.items():
required_samples = total_samples * proportion
collect_samples(accent, required_samples)
# Use accent-aware data augmentation
def accent_aware_augmentation(audio, sr, accent_type):
    """Apply accent-specific augmentations"""
    if accent_type == 'indian':
        # Indian English: stronger pitch variation
        audio = pitch_shift(audio, sr, n_steps=random.randint(-3, 3))
    elif accent_type == 'southern_us':
        # Southern US: slower speech
        audio = time_stretch(audio, sr, rate=random.uniform(0.85, 1.0))
    return audio
3. Noisy Environments
Problem: Model degrades in cafes, cars, streets
Mitigation:
def enhance_audio_for_inference(audio, sr=16000):
"""
Lightweight denoising for inference
Must be < 5ms to maintain latency budget
"""
# Spectral gating (simple but effective)
stft = librosa.stft(audio)
magnitude = np.abs(stft)
# Estimate noise floor (first 100ms)
noise_frames = magnitude[:, :10]
noise_threshold = np.mean(noise_frames, axis=1, keepdims=True) * 1.5
# Gate
mask = magnitude > noise_threshold
stft_denoised = stft * mask
# Inverse STFT
audio_denoised = librosa.istft(stft_denoised)
return audio_denoised
Better approach: Train with noisy data
# Use diverse noise types during training
noise_types = [
'cafe_ambiance',
'car_interior',
'street_traffic',
'office_chatter',
'home_appliances',
'rain',
'wind'
]
for audio, label in train_loader:
    # Add random noise of a random type before feature extraction
    noise_type = random.choice(noise_types)
    noisy_audio = add_noise_by_type(audio, noise_type, snr_db=random.uniform(5, 20))  # assumed helper
    features = extract_mel_spectrogram(noisy_audio)
    # ... continue with the standard training step on (features, label)
4. Similar Sounding Commands
Problem: “lights on” vs “lights off”, “volume up” vs “volume down”
Mitigation:
# Use contrastive learning during training
def contrastive_loss(anchor, positive, negative, margin=1.0):
"""
Pull together similar commands, push apart confusable ones
"""
pos_distance = torch.norm(anchor - positive, dim=1)
neg_distance = torch.norm(anchor - negative, dim=1)
loss = torch.relu(pos_distance - neg_distance + margin)
return loss.mean()
# Identify confusable pairs
confusable_pairs = [
('lights_on', 'lights_off'),
('volume_up', 'volume_down'),
('next', 'previous'),
('play', 'pause')
]
# During training
for audio, label in train_loader:
features = model.extract_features(audio)
# For confusable commands, add contrastive loss
if label in confusable_commands:
opposite_label = get_opposite_command(label)
opposite_audio = sample_from_class(opposite_label)
opposite_features = model.extract_features(opposite_audio)
total_loss = classification_loss + 0.2 * contrastive_loss(
features,
features, # Anchor to itself
opposite_features
)
Production Deployment Architecture
Edge Deployment (Smart Speaker)
┌─────────────────────────────────────────┐
│ Smart Speaker Device │
├─────────────────────────────────────────┤
│ │
│ Microphone Array │
│ ↓ │
│ Beamforming (5ms) │
│ ↓ │
│ Wake Word Detection (10ms) │
│ ↓ │
│ [If wake word detected] │
│ ↓ │
│ Audio Buffer (1 second) │
│ ↓ │
│ Feature Extraction (5ms) │
│ ↓ │
│ Command CNN Inference (15ms) │
│ ↓ │
│ ┌──────────────┐ │
│ │ Confidence │ │
│ │ > 0.85? │ │
│ └──────┬───────┘ │
│ │ │
│ Yes │ No │
│ ↓ │
│ Execute Command Send to Cloud ASR │
│ │
└─────────────────────────────────────────┘
Total latency (on-device): < 40ms
Power consumption: < 100mW during inference
Hybrid Edge-Cloud Architecture
class HybridCommandClassifier:
"""
Intelligent routing between edge and cloud
"""
def __init__(self):
self.edge_model = load_edge_model() # Small CNN
self.cloud_client = CloudASRClient()
# Common commands handled on-device
self.edge_commands = {
'lights_on', 'lights_off',
'volume_up', 'volume_down',
'play', 'pause', 'stop',
'next', 'previous'
}
async def classify(self, audio):
# Try edge first
edge_pred, edge_conf = self.edge_model.predict(audio)
# High confidence + known command → use edge
if edge_conf > 0.85 and edge_pred in self.edge_commands:
return {
'command': edge_pred,
'confidence': edge_conf,
'source': 'edge',
'latency_ms': 35
}
# Otherwise → cloud ASR
cloud_result = await self.cloud_client.recognize(audio)
return {
'command': cloud_result['text'],
'confidence': cloud_result['confidence'],
'source': 'cloud',
'latency_ms': 250
}
Benefits:
- ✅ 90% of commands handled on-device (< 50ms)
- ✅ 10% fall back to cloud for complex queries
- ✅ Privacy for common commands
- ✅ Graceful degradation if network unavailable
A/B Testing & Gradual Rollout
Experiment Framework
class ModelExperiment:
"""
A/B test new model versions
"""
def __init__(self, control_model, treatment_model, treatment_percentage=10):
self.control = control_model
self.treatment = treatment_model
self.treatment_pct = treatment_percentage
def predict(self, audio, user_id):
# Deterministic assignment based on user_id
bucket = hash(user_id) % 100
if bucket < self.treatment_pct:
# Treatment group
pred, conf = self.treatment.predict(audio)
variant = 'treatment'
else:
# Control group
pred, conf = self.control.predict(audio)
variant = 'control'
# Log for analysis
self.log_prediction(user_id, variant, pred, conf)
return pred, conf
def log_prediction(self, user_id, variant, prediction, confidence):
"""Log to analytics system"""
event = {
'user_id': user_id,
'timestamp': time.time(),
'variant': variant,
'prediction': prediction,
'confidence': confidence
}
analytics_logger.log(event)
Metrics to Track
def compute_experiment_metrics(control_group, treatment_group):
"""
Compare model versions
"""
metrics = {}
# Accuracy (if ground truth available)
if has_ground_truth:
metrics['accuracy_control'] = compute_accuracy(control_group)
metrics['accuracy_treatment'] = compute_accuracy(treatment_group)
# Confidence distribution
metrics['avg_confidence_control'] = np.mean([x['confidence'] for x in control_group])
metrics['avg_confidence_treatment'] = np.mean([x['confidence'] for x in treatment_group])
# Latency
metrics['p95_latency_control'] = np.percentile([x['latency'] for x in control_group], 95)
metrics['p95_latency_treatment'] = np.percentile([x['latency'] for x in treatment_group], 95)
# User engagement (proxy for accuracy)
metrics['retry_rate_control'] = compute_retry_rate(control_group)
metrics['retry_rate_treatment'] = compute_retry_rate(treatment_group)
# Statistical significance
from scipy.stats import ttest_ind
control_success = [x['success'] for x in control_group]
treatment_success = [x['success'] for x in treatment_group]
t_stat, p_value = ttest_ind(control_success, treatment_success)
metrics['p_value'] = p_value
metrics['is_significant'] = p_value < 0.05
return metrics
Real-World Examples
Google Assistant
“Hey Google” Wake Word:
- Always-on detection using tiny model (< 1MB)
- Runs on low-power co-processor (DSP)
- < 10ms latency, ~0.5mW power
- ~ 99.5% accuracy on target phrase
- Personalized over time with on-device learning
Command Classification:
- Separate model for common commands (~30 commands)
- Fallback to full ASR for complex queries
- On-device for privacy (no audio sent to cloud)
- Multi-language support (40+ languages)
Architecture:
Microphone → Beamformer → Wake Word → Command CNN → Execute
↓
(if low conf)
↓
Cloud ASR
Amazon Alexa
“Alexa” Wake Word:
- Multi-stage cascade:
- Stage 1: Energy detector (< 1ms, filters silence)
- Stage 2: Keyword spotter (< 10ms, CNN)
- Stage 3: Full verification (< 50ms, larger model)
- Reduces false positives by 10x
- Power-efficient (only stage 3 uses main CPU)
Custom Skills:
- Slot-filling approach for structured commands
- Template: “play {song} by {artist}”
- Combined classification + entity extraction
- ~100K custom skills available
Deployment:
- Edge: Wake word + simple commands
- Cloud: Everything else (200ms latency acceptable)
Apple Siri
“Hey Siri” Detection:
- Neural network on Neural Engine (dedicated ML chip)
- Personalized to user’s voice during setup
- Continuously adapts to voice changes
- < 50ms latency
- Works offline (completely on-device)
- Power: < 1mW in always-listening mode
Privacy Design:
- Audio never sent to cloud without explicit activation
- Voice profile stored locally (encrypted)
- Random identifier (not tied to Apple ID)
Technical Details:
- Uses LSTM for temporal modeling
- Trained on millions of “Hey Siri” variations
- Negative examples: TV shows, movies, other voices
Key Takeaways
- ✅ Direct audio→intent classification is faster than ASR→NLU for limited command sets
- ✅ CNNs on mel-spectrograms work excellently on-device
- ✅ Data augmentation (noise, time shift, pitch) is critical for robustness
- ✅ Explicit unknown-class handling prevents false activations
- ✅ Quantization can reach roughly 4x compression (full int8) with < 1% accuracy loss
- ✅ Threshold tuning balances precision/recall for business needs
Further Reading
Tools:
- TensorFlow Lite
- Core ML
- Librosa - Audio processing
FAQ
Q: Why use direct audio classification instead of ASR plus NLU for voice commands? A: Direct audio-to-intent classification is 5x faster (45ms vs 250ms), uses 10x smaller models (under 10MB vs over 100MB), works offline on-device, and is more privacy-preserving since no text is sent to the cloud. It is ideal for a limited vocabulary of 30-100 commands. For complex open-ended queries, fall back to full streaming ASR.
Q: Which model architecture is best for on-device speech command classification? A: CNNs on mel-spectrograms are best for on-device deployment, achieving 93% accuracy with only 2MB model size and 15ms CPU inference time. With knowledge distillation from a larger Transformer teacher, accuracy improves to 95% without increasing model size.
Q: How do you handle unknown or out-of-vocabulary commands? A: Three strategies work together: training an explicit “unknown” class with diverse negative examples (random speech, music, noise), applying confidence thresholding to reject predictions below 0.7 probability, and using entropy-based out-of-distribution detection where high entropy indicates the model is uncertain about the input. See also VAD for filtering non-speech audio before classification.
Originally published at: arunbaby.com/speech-tech/0002-speech-classification