Speech Command Classification
How voice assistants recognize “turn on the lights” from raw audio in under 100ms without full ASR transcription.
TL;DR
Direct audio-to-intent classification skips full ASR transcription entirely, classifying voice commands from mel-spectrogram features in 45ms (5x faster than ASR+NLU). A CNN model achieves 93% accuracy at 2MB/15ms on-device, while knowledge distillation from a Transformer teacher pushes this to 95% without increasing model size. Unknown command handling combines an explicit “unknown” training class, confidence thresholding, and entropy-based out-of-distribution detection. Production deployments use hybrid edge-cloud routing where high-confidence commands execute locally and uncertain queries fall back to cloud streaming ASR.

Introduction
When you say “Alexa, turn off the lights” or “Hey Google, set a timer,” your voice assistant doesn’t actually transcribe your speech to text first. Instead, it uses a direct audio-to-intent classification system that’s:
- Faster than ASR + NLU (50-100ms vs 200-500ms)
- Smaller (< 10MB models vs 100MB+)
- Able to run offline (on-device inference)
- More privacy-preserving (no text sent to cloud)
This approach is perfect for a limited vocabulary of commands (30-100 commands) where you care more about speed and privacy than open-ended understanding.
What you’ll learn:
- Why direct audio→intent beats ASR→NLU for commands
- Audio feature extraction (MFCCs, mel-spectrograms)
- Model architectures (CNN, RNN, Attention)
- Training strategies and data augmentation
- On-device deployment and optimization
- Unknown command handling (OOD detection)
- Real-world examples from Google, Amazon, Apple
Problem Definition
Design a speech command classification system for a voice assistant that:
Functional Requirements
- Multi-class Classification
- 30-50 predefined commands
- Examples: “lights on”, “volume up”, “play music”, “stop timer”
- Support synonyms and variations
- Unknown Detection
- Detect and reject out-of-vocabulary audio
- Handle background conversation
- Distinguish commands from non-commands
- Multi-language Support
- 5+ languages initially
- Shared model or separate models per language
- Context Awareness
- Optional: Use device state as context
- Example: “turn it off” depends on what’s currently on
Non-Functional Requirements
- Latency
- End-to-end < 100ms
- Includes audio buffering, processing, inference
- Model Constraints
- Model size < 10MB (on-device)
- RAM usage < 50MB during inference
- CPU-only (no GPU on most devices)
- Accuracy
- > 95% on target commands (clean audio)
- > 90% on noisy audio
- < 5% false positive rate
- Throughput
- 1000 QPS per server (cloud)
- Single inference on device
Why Not ASR + NLU?
Traditional Pipeline
Audio → ASR → Text → NLU → Intent
"lights on" → ASR (200ms) → "lights on" → NLU (50ms) → {action: "lights", state: "on"}
Total latency: 250ms
Direct Classification
Audio → Audio Features → CNN → Intent
"lights on" → Mel-spec (5ms) → CNN (40ms) → {action: "lights", state: "on"}
Total latency: 45ms
Advantages:
- ✅ 5x faster (45ms vs 250ms)
- ✅ 10x smaller model (5MB vs 50MB)
- ✅ Works offline
- ✅ More private (no text)
- ✅ Fewer points of failure
Disadvantages:
- ❌ Limited vocabulary (30-50 commands vs unlimited)
- ❌ Less flexible (new commands need retraining)
- ❌ Can’t handle complex queries (“turn on the lights in the living room at 8pm”)
When to use each:
- Direct classification: Simple commands, latency-critical, on-device
- ASR + NLU: Complex queries, unlimited vocabulary, cloud-based
Architecture
Audio Input (1-2 seconds @ 16kHz)
↓
Audio Preprocessing
├─ Resampling (if needed)
├─ Padding/Trimming to fixed length
└─ Normalization
↓
Feature Extraction
├─ MFCCs (40 coefficients)
or
├─ Mel-Spectrogram (40 bins)
↓
Neural Network
├─ CNN (fastest, on-device)
or
├─ RNN (better temporal modeling)
or
├─ Attention (best accuracy, slower)
↓
Softmax Layer (31 classes)
├─ 30 command classes
└─ 1 unknown class
↓
Post-processing
├─ Confidence thresholding
├─ Unknown detection
└─ Output filtering
↓
Prediction: {command: "lights_on", confidence: 0.94}
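Tying the diagram together, here is a minimal end-to-end inference sketch. It assumes the helpers built up in the sections below (preprocess_audio, normalize_audio, extract_mel_spectrogram, the CNN model) and a command_classes list whose last entry is "unknown":
import numpy as np
import torch

def classify_command(audio: np.ndarray, model, command_classes, threshold=0.7):
    """End-to-end: raw waveform -> {command, confidence}."""
    audio = preprocess_audio(audio)            # pad/trim to 1s (Component 1)
    audio = normalize_audio(audio)
    features = extract_mel_spectrogram(audio)  # (time, n_mels) (Component 2)
    x = torch.tensor(features, dtype=torch.float32)
    x = x.unsqueeze(0).unsqueeze(0)            # (1, 1, time, n_mels) for the CNN
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    conf, idx = torch.max(probs, 0)
    if conf < threshold:                       # post-processing (Component 4)
        return {'command': 'unknown', 'confidence': float(conf)}
    return {'command': command_classes[idx], 'confidence': float(conf)}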
Component 1: Audio Preprocessing
Fixed-Length Input
Problem: Audio clips have variable duration (0.5s - 3s)
Solution: Standardize to fixed length (e.g., 1 second)
import numpy as np

def preprocess_audio(audio: np.ndarray, sr=16000, target_duration=1.0):
"""
Ensure all audio clips are same length
Args:
audio: Audio waveform
sr: Sample rate
target_duration: Target duration in seconds
Returns:
Processed audio of length sr * target_duration
"""
target_length = int(sr * target_duration)
# Pad if too short
if len(audio) < target_length:
pad_length = target_length - len(audio)
audio = np.pad(audio, (0, pad_length), mode='constant')
# Trim if too long
elif len(audio) > target_length:
# Take central portion
start = (len(audio) - target_length) // 2
audio = audio[start:start + target_length]
return audio
Why fixed length?
- Neural networks expect fixed-size inputs
- Enables batching during training
- Simplifies model architecture
Alternative: Variable-length with padding
def pad_sequence(audios: list, sr=16000):
"""
Pad multiple audio clips to longest length
Used during batched inference
"""
max_length = max(len(a) for a in audios)
padded = []
masks = []
for audio in audios:
pad_length = max_length - len(audio)
padded_audio = np.pad(audio, (0, pad_length))
mask = np.ones(len(audio)).tolist() + [0] * pad_length
padded.append(padded_audio)
masks.append(mask)
return np.array(padded), np.array(masks)
Normalization
def normalize_audio(audio: np.ndarray) -> np.ndarray:
"""
Normalize audio to [-1, 1] range
Improves model convergence and generalization
"""
# Peak normalization
max_val = np.max(np.abs(audio))
if max_val > 0:
audio = audio / max_val
return audio
def normalize_rms(audio: np.ndarray, target_rms=0.1) -> np.ndarray:
"""
Normalize by RMS (root mean square) energy
Better for handling volume variations
"""
current_rms = np.sqrt(np.mean(audio ** 2))
if current_rms > 0:
audio = audio * (target_rms / current_rms)
return audio
Component 2: Feature Extraction
Option 1: MFCCs (Mel-Frequency Cepstral Coefficients)
MFCCs capture the spectral envelope of speech, which is important for phonetic content.
import librosa
def extract_mfcc(audio, sr=16000, n_mfcc=40, n_fft=512, hop_length=160):
"""
Extract MFCC features
Args:
audio: Waveform
sr: Sample rate (Hz)
n_mfcc: Number of MFCC coefficients
n_fft: FFT window size
hop_length: Hop length between frames (10ms at 16kHz)
Returns:
MFCCs: (n_mfcc, time_steps)
"""
# Compute MFCCs
mfccs = librosa.feature.mfcc(
y=audio,
sr=sr,
n_mfcc=n_mfcc,
n_fft=n_fft,
hop_length=hop_length,
n_mels=40, # Number of mel bands
fmin=20, # Minimum frequency
fmax=sr//2 # Maximum frequency (Nyquist)
)
# Add delta (velocity) and delta-delta (acceleration)
delta = librosa.feature.delta(mfccs)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack all features
features = np.vstack([mfccs, delta, delta2]) # (120, time)
return features.T # (time, 120)
Why delta features?
- MFCCs: Spectral shape (what phonemes)
- Delta: How spectral shape is changing (dynamics)
- Delta-delta: Rate of change (acceleration)
Together they capture both static and dynamic characteristics of speech.
Option 2: Mel-Spectrogram
Mel-spectrograms preserve more temporal resolution than MFCCs.
def extract_mel_spectrogram(audio, sr=16000, n_mels=40, n_fft=512, hop_length=160):
"""
Extract log mel-spectrogram
Returns:
Log mel-spectrogram: (time, n_mels)
"""
# Compute mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=sr,
n_fft=n_fft,
hop_length=hop_length,
n_mels=n_mels,
fmin=20,
fmax=sr//2
)
# Convert to log scale (dB)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel.T # (time, n_mels)
MFCCs vs Mel-Spectrogram:
| Feature | MFCCs | Mel-Spectrogram |
|---|---|---|
| Size | (time, 13-40) | (time, 40-80) |
| Information | Spectral envelope | Full spectrum |
| Works better with | Small models | CNNs (image-like) |
| Training time | Faster | Slower |
| Accuracy | Slightly lower | Slightly higher |
Recommendation: Use mel-spectrograms with CNNs for best accuracy.
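As a quick sanity check of the feature geometry (a sketch using the extractors above on dummy audio): 1 second at 16kHz with hop_length=160 yields about 101 frames.
import numpy as np

audio = np.random.randn(16000).astype(np.float32)  # 1s of dummy audio @ 16kHz
mel = extract_mel_spectrogram(audio)               # ~101 frames with hop_length=160
mfcc = extract_mfcc(audio)
print(mel.shape)   # (101, 40): time x mel bins
print(mfcc.shape)  # (101, 120): time x (40 MFCC + 40 delta + 40 delta-delta)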
Component 3: Model Architectures
Architecture 1: CNN (Fastest for On-Device)
import torch
import torch.nn as nn
class CommandCNN(nn.Module):
"""
CNN for audio command classification
Treats mel-spectrogram as 2D image
"""
def __init__(self, num_classes=31, input_channels=1):
super().__init__()
# Convolutional layers
self.conv1 = nn.Sequential(
nn.Conv2d(input_channels, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv2 = nn.Sequential(
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
self.conv3 = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2, 2)
)
# Global average pooling (instead of fully-connected)
self.gap = nn.AdaptiveAvgPool2d((1, 1))
# Classification head
self.classifier = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(128, num_classes)
)
def forward(self, x):
# x: (batch, 1, time, freq)
x = self.conv1(x) # → (batch, 32, time/2, freq/2)
x = self.conv2(x) # → (batch, 64, time/4, freq/4)
x = self.conv3(x) # → (batch, 128, time/8, freq/8)
x = self.gap(x) # → (batch, 128, 1, 1)
x = x.view(x.size(0), -1) # → (batch, 128)
x = self.classifier(x) # → (batch, num_classes)
return x
# Model size: ~2MB
# Inference time (CPU): 15ms
# Accuracy: ~93%
Why CNNs work for audio:
- Local patterns: Phonemes have localized frequency patterns
- Translation invariance: Command can start at different times
- Parameter sharing: Same filters across time/frequency
- Efficient: Mostly matrix operations, highly optimized
Architecture 2: RNN (Better Temporal Modeling)
class CommandRNN(nn.Module):
"""
RNN for command classification
Better at capturing temporal dependencies
"""
def __init__(self, input_dim=40, hidden_dim=128, num_layers=2, num_classes=31):
super().__init__()
# LSTM layers
self.lstm = nn.LSTM(
input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
bidirectional=True,
dropout=0.2
)
# Attention mechanism (optional)
self.attention = nn.Linear(hidden_dim * 2, 1)
# Classification head
self.classifier = nn.Linear(hidden_dim * 2, num_classes)
def forward(self, x):
# x: (batch, time, features)
# LSTM
lstm_out, _ = self.lstm(x) # → (batch, time, hidden*2)
# Attention pooling (instead of taking last time step)
attention_weights = torch.softmax(
self.attention(lstm_out), # → (batch, time, 1)
dim=1
)
# Weighted sum
context = torch.sum(attention_weights * lstm_out, dim=1) # → (batch, hidden*2)
# Classify
logits = self.classifier(context) # → (batch, num_classes)
return logits
# Model size: ~5MB
# Inference time (CPU): 30ms
# Accuracy: ~95%
Architecture 3: Attention-Based (Best Accuracy)
class CommandTransformer(nn.Module):
"""
Transformer for command classification
Best accuracy but slower inference
"""
def __init__(self, input_dim=40, d_model=128, nhead=4, num_layers=2, num_classes=31):
super().__init__()
# Input projection
self.embedding = nn.Linear(input_dim, d_model)
# Positional encoding
self.pos_encoder = PositionalEncoding(d_model)
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=nhead,
dim_feedforward=d_model * 4,
dropout=0.1
)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
# Classification head
self.classifier = nn.Linear(d_model, num_classes)
def forward(self, x):
# x: (batch, time, features)
# Project to d_model
x = self.embedding(x) # → (batch, time, d_model)
# Add positional encoding
x = self.pos_encoder(x)
# Transformer expects (time, batch, d_model)
x = x.transpose(0, 1)
x = self.transformer(x)
x = x.transpose(0, 1)
# Average pool over time
x = x.mean(dim=1) # → (batch, d_model)
# Classify
logits = self.classifier(x) # → (batch, num_classes)
return logits
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.register_buffer('pe', pe.unsqueeze(0))
def forward(self, x):
return x + self.pe[:, :x.size(1), :]
# Model size: ~8MB
# Inference time (CPU): 50ms
# Accuracy: ~97%
Model Comparison
| Model | Params | Size | CPU Latency | GPU Latency | Accuracy | Best For |
|---|---|---|---|---|---|---|
| CNN | 500K | 2MB | 15ms | 3ms | 93% | Mobile devices |
| RNN | 1.2M | 5MB | 30ms | 5ms | 95% | Balanced |
| Transformer | 2M | 8MB | 50ms | 8ms | 97% | Cloud/high-end |
Production choice: CNN for on-device, RNN for cloud
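These numbers are hardware-dependent, so measure on your own target. A minimal benchmarking sketch (timings and parameter counts will vary with the exact configuration):
import time
import torch

model = CommandCNN(num_classes=31).eval()
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e3:.0f}K")

x = torch.randn(1, 1, 100, 40)  # (batch, channel, time, freq)
with torch.no_grad():
    for _ in range(10):          # warm-up
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    ms = (time.perf_counter() - start) / 100 * 1000
print(f"CPU latency: {ms:.1f} ms/inference")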
Training Strategy
Data Collection
Per command, need:
- 1000-5000 examples
- 100+ speakers (diversity)
- Both genders, various ages
- Different accents
- Background noise variations
- Different recording devices
Example dataset structure:
data/
├── lights_on/
│ ├── speaker001_01.wav
│ ├── speaker001_02.wav
│ ├── speaker002_01.wav
│ └── ...
├── lights_off/
│ └── ...
├── volume_up/
│ └── ...
└── unknown/
├── random_speech/
├── music/
├── noise/
└── silence/
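A minimal PyTorch Dataset over this layout (a sketch: class names come from the top-level folder names, preprocess_audio is the helper from Component 1, and SpeechCommandsDataset is a hypothetical name):
from pathlib import Path
import librosa
import torch
from torch.utils.data import Dataset

class SpeechCommandsDataset(Dataset):
    def __init__(self, root: str, sr: int = 16000):
        self.sr = sr
        self.classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
        self.samples = [
            (wav, idx)
            for idx, cls in enumerate(self.classes)
            for wav in Path(root, cls).rglob('*.wav')  # rglob also covers unknown/ subfolders
        ]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, i):
        path, label = self.samples[i]
        audio, _ = librosa.load(path, sr=self.sr)
        audio = preprocess_audio(audio, sr=self.sr)  # pad/trim to fixed 1s
        return torch.from_numpy(audio), label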
Data Augmentation
Critical for robustness! Augment during training:
import random
def augment_audio(audio, sr=16000):
"""
Apply random augmentation
Each training example augmented differently
"""
augmentations = [
add_noise,
time_shift,
time_stretch,
pitch_shift,
add_reverb
]
# Apply 1-3 random augmentations
num_augs = random.randint(1, 3)
selected = random.sample(augmentations, num_augs)
for aug_fn in selected:
audio = aug_fn(audio, sr)
return audio
def add_noise(audio, sr, snr_db=None):
    """Add background noise at a target SNR"""
    if snr_db is None:
        snr_db = random.uniform(5, 20)  # Default args are evaluated once, so sample per call here
    # Load random noise sample (load_random_noise_sample is an assumed helper)
    noise = load_random_noise_sample(len(audio))
    # Calculate noise power for target SNR
    audio_power = np.mean(audio ** 2)
    noise_power = audio_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(noise_power / np.mean(noise ** 2))
    return audio + noise_scaled
def time_shift(audio, sr, shift_max=0.1):
"""Shift audio in time (simulates different reaction times)"""
shift = int(sr * shift_max * (random.random() - 0.5))
return np.roll(audio, shift)
def time_stretch(audio, sr, rate=None):
    """Change speed without changing pitch"""
    if rate is None:
        rate = random.uniform(0.9, 1.1)  # Sample per call, not at definition time
    return librosa.effects.time_stretch(audio, rate=rate)
def pitch_shift(audio, sr, n_steps=None):
    """Shift pitch (simulates different speakers)"""
    if n_steps is None:
        n_steps = random.randint(-2, 2)  # Sample per call
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)
def add_reverb(audio, sr):
    """Add room reverb (simulates different environments)"""
    # Convolve with an impulse response (generate_simple_reverb is an assumed helper)
    impulse_response = generate_simple_reverb(sr)
    return np.convolve(audio, impulse_response, mode='same')
Impact: 2-3x effective dataset size; typically a 10-20% relative accuracy improvement on noisy audio
Training Loop
def train_command_classifier(
model,
train_loader,
val_loader,
epochs=100,
lr=0.001
):
"""
Train speech command classifier
"""
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='max',   # Step on validation accuracy
    factor=0.5,
    patience=5
)
best_val_acc = 0.0
for epoch in range(epochs):
# Training
model.train()
train_loss = 0
train_correct = 0
train_total = 0
for batch_idx, (audio, labels) in enumerate(train_loader):
# Extract features
features = extract_features_batch(audio, sr=16000)
features = torch.tensor(features, dtype=torch.float32)
# Add channel dimension for CNN
if len(features.shape) == 3:
features = features.unsqueeze(1) # (batch, 1, time, freq)
labels = torch.tensor(labels, dtype=torch.long)
# Forward
outputs = model(features)
loss = criterion(outputs, labels)
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Track accuracy
_, predicted = torch.max(outputs, 1)
train_correct += (predicted == labels).sum().item()
train_total += labels.size(0)
train_loss += loss.item()
train_acc = train_correct / train_total
avg_loss = train_loss / len(train_loader)
# Validation
val_acc = validate(model, val_loader)
# Learning rate scheduling
scheduler.step(val_acc)
# Save best model
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_model.pth')
print(f"✓ New best model: {val_acc:.4f}")
print(f"Epoch {epoch+1}/{epochs}: "
f"Loss={avg_loss:.4f}, "
f"Train Acc={train_acc:.4f}, "
f"Val Acc={val_acc:.4f}")
return model
def validate(model, val_loader):
"""Evaluate on validation set"""
model.eval()
correct = 0
total = 0
with torch.no_grad():
for audio, labels in val_loader:
features = extract_features_batch(audio)
features = torch.tensor(features).unsqueeze(1)
labels = torch.tensor(labels)
outputs = model(features)
_, predicted = torch.max(outputs, 1)
correct += (predicted == labels).sum().item()
total += labels.size(0)
return correct / total
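The loop above references extract_features_batch, which was not defined; a minimal sketch (assuming all clips were already padded/trimmed to the same length):
import numpy as np

def extract_features_batch(audio_batch, sr=16000):
    """Stack log mel-spectrograms for a batch of equal-length waveforms."""
    features = [extract_mel_spectrogram(np.asarray(a), sr=sr) for a in audio_batch]
    return np.stack(features)  # (batch, time, n_mels)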
Component 4: Handling Unknown Commands
Strategy 1: Add “Unknown” Class
# Training data
command_classes = [
"lights_on", "lights_off", "volume_up", "volume_down",
"play_music", "stop", "pause", "next", "previous",
# ... 30 total commands
]
# Collect negative examples
unknown_class = [
"random_speech", # Conversations
"music", # Background music
"noise", # Environmental sounds
"silence" # No speech
]
# Labels: 0-29 for commands, 30 for unknown
all_classes = command_classes + ["unknown"]
Collecting unknown data:
# Record actual user interactions
# Label anything that's NOT a command as "unknown"
unknown_samples = []
for audio in production_audio_stream:
if not is_valid_command(audio):
unknown_samples.append(audio)
if len(unknown_samples) >= 10000:
# Add to training set
augment_and_save(unknown_samples, label="unknown")
Strategy 2: Confidence Thresholding
def predict_with_threshold(model, audio, threshold=0.7):
"""
Reject low-confidence predictions as unknown
"""
# Extract features
features = extract_mel_spectrogram(audio)
features = torch.tensor(features).unsqueeze(0).unsqueeze(0)
# Predict
with torch.no_grad():
logits = model(features)
probs = torch.softmax(logits, dim=1)[0]
# Get top prediction
max_prob, predicted_class = torch.max(probs, 0)
# Threshold check
if max_prob < threshold:
return "unknown", float(max_prob)
return command_classes[predicted_class], float(max_prob)
Strategy 3: Out-of-Distribution (OOD) Detection
def detect_ood_with_entropy(probs):
"""
High entropy = model is uncertain = likely OOD
"""
entropy = -torch.sum(probs * torch.log(probs + 1e-10))
# Calibrate threshold on validation set
# In-distribution: entropy ~0.5
# Out-of-distribution: entropy > 2.0
if entropy > 2.0:
return True # OOD
return False
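The 2.0 cutoff above is illustrative; calibrate it on held-out data. A sketch, assuming val_probs holds softmax outputs for known in-distribution validation clips:
import torch

def calibrate_entropy_threshold(val_probs, percentile=95):
    """Pick the cutoff so ~5% of in-distribution clips would be rejected."""
    entropies = torch.stack([
        -torch.sum(p * torch.log(p + 1e-10)) for p in val_probs
    ])
    return torch.quantile(entropies, percentile / 100.0).item()

# usage: entropy > calibrate_entropy_threshold(val_probs) -> treat as OOD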
def detect_ood_with_mahalanobis(features, class_means, class_covariances):
"""
Mahalanobis distance to class centroids
Far from all classes = likely OOD
"""
min_distance = float('inf')
for class_idx in range(len(class_means)):
mean = class_means[class_idx]
cov = class_covariances[class_idx]
# Mahalanobis distance
diff = features - mean
distance = np.sqrt(diff.T @ np.linalg.inv(cov) @ diff)
min_distance = min(min_distance, distance)
# Threshold: 3-sigma rule
if min_distance > 3.0:
return True # OOD
return False
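The class means and covariances must be fit beforehand from training data; a sketch, assuming features_by_class holds penultimate-layer embeddings grouped per command class:
import numpy as np

def fit_class_statistics(features_by_class):
    """features_by_class: list of (n_samples, dim) arrays, one per class."""
    class_means, class_covariances = [], []
    for feats in features_by_class:
        class_means.append(feats.mean(axis=0))
        # Regularize so the covariance stays invertible with few samples
        cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
        class_covariances.append(cov)
    return class_means, class_covariances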
Model Optimization for Edge Deployment
Quantization
# Post-training quantization (dynamic quantization targets Linear; Conv2d not supported)
model_fp32 = CommandCNN(num_classes=31)
model_fp32.load_state_dict(torch.load('model.pth'))
model_fp32.eval()
# Dynamic quantization (Linear layers)
model_int8 = torch.quantization.quantize_dynamic(
model_fp32,
{torch.nn.Linear},
dtype=torch.qint8
)
# Save
torch.save(model_int8.state_dict(), 'model_int8.pth')
# Results (typical on CPU with CNN head including Linear):
# - Model size: 2MB → ~1.2MB (1.6x smaller)
# - Inference: 15ms → ~10-12ms (1.3-1.5x faster)
# - Accuracy: ~93.2% → ~93.0% (≤0.2% drop)
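Because dynamic quantization leaves Conv2d in FP32, static post-training quantization (which does quantize convolutions) gets closer to the often-cited ~4x compression. A sketch using PyTorch FX graph mode (APIs vary across PyTorch versions; calibration_loader is an assumed loader of representative feature batches):
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping('fbgemm')  # x86 backend
example_input = torch.randn(1, 1, 100, 40)
prepared = prepare_fx(model_fp32, qconfig_mapping, example_inputs=(example_input,))
# Calibrate observers with a few hundred representative batches
with torch.no_grad():
    for features, _ in calibration_loader:
        prepared(features)
model_int8_static = convert_fx(prepared)  # Conv2d + Linear now int8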
Pruning
import torch.nn.utils.prune as prune
def prune_model(model, amount=0.3):
"""
Remove 30% of weights with lowest magnitude
"""
for name, module in model.named_modules():
if isinstance(module, (nn.Conv2d, nn.Linear)):
prune.l1_unstructured(module, name='weight', amount=amount)
return model
# Results with 30% pruning:
# - Model size: 2MB → 1.4MB
# - Inference: 15ms → 12ms
# - Accuracy: 93.2% → 92.7%
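Note that l1_unstructured only attaches a mask, so the size/latency gains above require baking the masks in (and serving weights in a sparsity-aware format). A short sketch to make pruning permanent:
def finalize_pruning(model):
    """Fold weight_orig * mask back into a plain weight tensor."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.remove(module, 'weight')
    return model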
Knowledge Distillation
def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
"""
Train small student to mimic large teacher
Args:
temperature: Soften probability distributions
alpha: Weight between soft and hard targets
"""
# Soft targets from teacher
soft_targets = torch.softmax(teacher_logits / temperature, dim=1)
soft_prob = torch.log_softmax(student_logits / temperature, dim=1)
soft_loss = -torch.sum(soft_targets * soft_prob) / soft_prob.size()[0]
soft_loss = soft_loss * (temperature ** 2)
# Hard targets (ground truth)
hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
# Combine
return alpha * soft_loss + (1 - alpha) * hard_loss
# Train student
teacher = CommandTransformer(num_classes=31)  # 8MB, 97% accuracy
student = CommandCNN(num_classes=31)          # 2MB, 93% accuracy
teacher.eval()  # Freeze teacher weights
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for features, labels in train_loader:
    # Note: the Transformer teacher and CNN student expect differently shaped
    # inputs in practice; shape adaptation is omitted here for brevity.
    # Teacher predictions (frozen)
    with torch.no_grad():
        teacher_logits = teacher(features)
    # Student predictions
    student_logits = student(features)
    # Distillation loss
    loss = distillation_loss(student_logits, teacher_logits, labels)
    # Optimize student
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Result: Student achieves 95% (vs 93% without distillation)
On-Device Deployment
Export to Mobile Formats
TensorFlow Lite (Android):
import tensorflow as tf
# Convert PyTorch to TensorFlow (via ONNX)
# 1. Export PyTorch to ONNX
dummy_input = torch.randn(1, 1, 100, 40)  # (batch, channel, time, freq)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output']
)
# 2. Convert ONNX to TF
import onnx
from onnx_tf.backend import prepare
onnx_model = onnx.load("model.onnx")
tf_model = prepare(onnx_model)
tf_model.export_graph("model_tf")
# 3. Convert TF to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('command_classifier.tflite', 'wb') as f:
f.write(tflite_model)
Core ML (iOS):
import coremltools as ct
# Trace PyTorch model
example_input = torch.randn(1, 1, 100, 40)
traced_model = torch.jit.trace(model, example_input)
# Convert to Core ML
coreml_model = ct.convert(
traced_model,
inputs=[ct.TensorType(name="audio", shape=(1, 1, 100, 40))],
outputs=[ct.TensorType(name="logits")]
)
# Add metadata
coreml_model.author = "Arun Baby"
coreml_model.short_description = "Speech command classifier"
coreml_model.version = "1.0"
# Save
coreml_model.save("CommandClassifier.mlmodel")
Mobile Inference Code
Android (Kotlin):
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder
class CommandClassifier(private val context: Context) {
private lateinit var interpreter: Interpreter
init {
// Load model
val model = loadModelFile("command_classifier.tflite")
interpreter = Interpreter(model)
}
fun classify(audio: FloatArray): Pair<String, Float> {
// Extract features
val features = extractMelSpectrogram(audio)
// Prepare input
val inputBuffer = ByteBuffer.allocateDirect(4 * features.size)
inputBuffer.order(ByteOrder.nativeOrder())
features.forEach { inputBuffer.putFloat(it) }
// Prepare output
val output = Array(1) { FloatArray(31) }
// Run inference
interpreter.run(inputBuffer, output)
// Get top prediction
val probabilities = output[0]
val maxIndex = probabilities.indices.maxByOrNull { probabilities[it] } ?: 0
val confidence = probabilities[maxIndex]
return Pair(commandNames[maxIndex], confidence)
}
}
iOS (Swift):
import CoreML
class CommandClassifier {
private var model: CommandClassifierModel!
init() {
model = try! CommandClassifierModel(configuration: MLModelConfiguration())
}
func classify(audio: [Float]) -> (command: String, confidence: Double) {
// Extract features
let features = extractMelSpectrogram(audio)
// Create MLMultiArray
let input = try! MLMultiArray(shape: [1, 1, 100, 40], dataType: .float32)
for i in 0..<features.count {
input[i] = NSNumber(value: features[i])
}
// Run inference
let output = try! model.prediction(audio: input)
        // Get top prediction (MLMultiArray has no argmax, so scan manually)
        let probabilities = output.logits
        var maxIndex = 0
        var maxValue = probabilities[0].doubleValue
        for i in 1..<probabilities.count {
            let v = probabilities[i].doubleValue
            if v > maxValue {
                maxValue = v
                maxIndex = i
            }
        }
        return (commandNames[maxIndex], maxValue)
}
}
Monitoring & Evaluation
Metrics Dashboard
from dataclasses import dataclass
from typing import List
@dataclass
class ClassificationMetrics:
"""Per-class metrics"""
precision: float
recall: float
f1_score: float
support: int # Number of samples
def compute_metrics(y_true: List[int], y_pred: List[int], num_classes: int):
"""
Compute detailed metrics per class
"""
from sklearn.metrics import classification_report, confusion_matrix
# Per-class metrics
report = classification_report(y_true, y_pred, output_dict=True)
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Identify problematic classes
for i in range(num_classes):
if report[str(i)]['f1-score'] < 0.85:
print(f"⚠️ Class {i} ({command_names[i]}) has low F1: {report[str(i)]['f1-score']:.3f}")
# Find most confused class
confused_with = cm[i].argmax()
if confused_with != i:
print(f" Most confused with class {confused_with} ({command_names[confused_with]})")
return report, cm
Online Monitoring
class OnlineMetricsTracker:
"""
Track metrics in production
"""
def __init__(self):
self.predictions = []
self.confidences = []
self.latencies = []
def record(self, prediction: int, confidence: float, latency_ms: float):
"""Record single prediction"""
self.predictions.append(prediction)
self.confidences.append(confidence)
self.latencies.append(latency_ms)
def get_stats(self, last_n=1000):
"""Get recent statistics"""
recent_preds = self.predictions[-last_n:]
recent_confs = self.confidences[-last_n:]
recent_lats = self.latencies[-last_n:]
# Class distribution
from collections import Counter
class_dist = Counter(recent_preds)
return {
'total_predictions': len(recent_preds),
'class_distribution': dict(class_dist),
'avg_confidence': np.mean(recent_confs),
'low_confidence_rate': sum(c < 0.7 for c in recent_confs) / len(recent_confs),
'p50_latency': np.percentile(recent_lats, 50),
'p95_latency': np.percentile(recent_lats, 95),
'p99_latency': np.percentile(recent_lats, 99)
}
Multi-Language Support
Approach 1: Separate Models per Language
Pros:
- Best accuracy per language
- Language-specific optimizations
- Easier to add new languages
Cons:
- Multiple models to maintain
- Higher storage footprint
- Language detection needed first
class MultilingualClassifier:
"""
Separate model per language
"""
def __init__(self):
self.models = {
'en': load_model('command_en.pth'),
'es': load_model('command_es.pth'),
'fr': load_model('command_fr.pth'),
'de': load_model('command_de.pth'),
'ja': load_model('command_ja.pth')
}
self.language_detector = load_model('lang_detect.pth')
def predict(self, audio):
# Detect language first
language = self.language_detector.predict(audio)
# Use language-specific model
model = self.models[language]
prediction = model.predict(audio)
return prediction, language
Storage requirement: 5 languages × 2MB = 10MB
Approach 2: Multilingual Shared Model
Training strategy:
def train_multilingual_model():
"""
Single model trained on all languages
Add language ID as auxiliary input
"""
model = MultilingualCommandCNN(
num_classes=30,
num_languages=5
)
# Training data from all languages
for audio, command_label, lang_id in train_loader:
features = extract_features(audio)
# Forward pass with language embedding
command_pred = model(features, lang_id)
# Loss
loss = criterion(command_pred, command_label)
loss.backward()
optimizer.step()
return model
Model architecture:
class MultilingualCommandCNN(nn.Module):
"""
Shared model with language embeddings
"""
def __init__(self, num_classes=30, num_languages=5, embedding_dim=16):
super().__init__()
# Language embedding
self.lang_embedding = nn.Embedding(num_languages, embedding_dim)
# Shared CNN backbone
self.cnn = CommandCNN(num_classes=128) # Feature extractor
# Language-conditioned classifier
self.classifier = nn.Linear(128 + embedding_dim, num_classes)
def forward(self, audio_features, language_id):
# CNN features
cnn_features = self.cnn(audio_features) # (batch, 128)
# Language embedding
lang_emb = self.lang_embedding(language_id) # (batch, 16)
# Concatenate
combined = torch.cat([cnn_features, lang_emb], dim=1) # (batch, 144)
# Classify
logits = self.classifier(combined) # (batch, num_classes)
return logits
Pros:
- Single model (2-3MB)
- Shared representations across languages
- Transfer learning for low-resource languages
Cons:
- Slightly lower accuracy per language
- All languages must use same command set
Failure Cases & Mitigation
Common Failure Modes
1. Background Speech/TV
Problem: Model activates on TV dialogue or background conversation
Mitigation:
def detect_background_speech(audio, sr=16000):
"""
Detect if audio is from TV/background vs direct user speech
Features:
- Energy envelope variation (TV more consistent)
- Reverb characteristics (TV more reverberant)
- Spectral rolloff (TV often compressed)
"""
# Energy variation
frame_energy = librosa.feature.rms(y=audio)[0]
energy_std = np.std(frame_energy)
# TV has lower energy variation
if energy_std < 0.01:
return True # Likely background
# Spectral centroid (TV often band-limited)
spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]
avg_centroid = np.mean(spectral_centroid)
if avg_centroid < 1000: # Hz
return True # Likely background
return False
Additional strategy: Use speaker verification to check if it’s the registered user
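A minimal sketch of that check, assuming a speaker_embedding(audio) extractor (e.g., a small d-vector model) and an embedding enrolled at device setup:
import numpy as np

def is_registered_user(audio, enrolled_embedding, threshold=0.75):
    """Cosine similarity between the enrolled voice profile and this clip."""
    emb = speaker_embedding(audio)  # assumed embedding extractor
    cos = np.dot(emb, enrolled_embedding) / (
        np.linalg.norm(emb) * np.linalg.norm(enrolled_embedding) + 1e-10
    )
    return cos >= threshold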
2. Accented Speech
Problem: Model trained on standard accent performs poorly on regional accents
Mitigation:
# Data collection strategy
accent_distribution = {
'general_american': 0.3,
'british': 0.15,
'australian': 0.1,
'indian': 0.15,
'southern_us': 0.1,
'canadian': 0.1,
'other': 0.1
}
# Ensure balanced training data
for accent, proportion in accent_distribution.items():
required_samples = total_samples * proportion
collect_samples(accent, required_samples)
# Use accent-aware data augmentation
def accent_aware_augmentation(audio, sr, accent_type):
    """Apply accent-specific augmentations"""
    if accent_type == 'indian':
        # Indian English: stronger pitch variation
        audio = pitch_shift(audio, sr, n_steps=random.randint(-3, 3))
    elif accent_type == 'southern_us':
        # Southern US: slower speech
        audio = time_stretch(audio, sr, rate=random.uniform(0.85, 1.0))
    return audio
3. Noisy Environments
Problem: Model degrades in cafes, cars, streets
Mitigation:
def enhance_audio_for_inference(audio, sr=16000):
"""
Lightweight denoising for inference
Must be < 5ms to maintain latency budget
"""
# Spectral gating (simple but effective)
stft = librosa.stft(audio)
magnitude = np.abs(stft)
# Estimate noise floor (first 100ms)
noise_frames = magnitude[:, :10]
noise_threshold = np.mean(noise_frames, axis=1, keepdims=True) * 1.5
# Gate
mask = magnitude > noise_threshold
stft_denoised = stft * mask
# Inverse STFT
audio_denoised = librosa.istft(stft_denoised)
return audio_denoised
Better approach: Train with noisy data
# Use diverse noise types during training
noise_types = [
'cafe_ambiance',
'car_interior',
'street_traffic',
'office_chatter',
'home_appliances',
'rain',
'wind'
]
for audio, label in train_loader:
    # Add random noise of a random type before feature extraction
    noise_type = random.choice(noise_types)
    noisy_audio = add_noise_by_type(audio, noise_type, snr_db=random.uniform(5, 20))  # assumed helper
    features = extract_mel_spectrogram(noisy_audio)
    # ... continue with the standard training step on (features, label)
4. Similar Sounding Commands
Problem: “lights on” vs “lights off”, “volume up” vs “volume down”
Mitigation:
# Use contrastive learning during training
def contrastive_loss(anchor, positive, negative, margin=1.0):
"""
Pull together similar commands, push apart confusable ones
"""
pos_distance = torch.norm(anchor - positive, dim=1)
neg_distance = torch.norm(anchor - negative, dim=1)
loss = torch.relu(pos_distance - neg_distance + margin)
return loss.mean()
# Identify confusable pairs
confusable_pairs = [
('lights_on', 'lights_off'),
('volume_up', 'volume_down'),
('next', 'previous'),
('play', 'pause')
]
# During training
for audio, label in train_loader:
features = model.extract_features(audio)
# For confusable commands, add contrastive loss
if label in confusable_commands:
opposite_label = get_opposite_command(label)
opposite_audio = sample_from_class(opposite_label)
opposite_features = model.extract_features(opposite_audio)
total_loss = classification_loss + 0.2 * contrastive_loss(
features,
features, # Anchor to itself
opposite_features
)
Production Deployment Architecture
Edge Deployment (Smart Speaker)
┌─────────────────────────────────────────┐
│ Smart Speaker Device │
├─────────────────────────────────────────┤
│ │
│ Microphone Array │
│ ↓ │
│ Beamforming (5ms) │
│ ↓ │
│ Wake Word Detection (10ms) │
│ ↓ │
│ [If wake word detected] │
│ ↓ │
│ Audio Buffer (1 second) │
│ ↓ │
│ Feature Extraction (5ms) │
│ ↓ │
│ Command CNN Inference (15ms) │
│ ↓ │
│ ┌──────────────┐ │
│ │ Confidence │ │
│ │ > 0.85? │ │
│ └──────┬───────┘ │
│ │ │
│ Yes │ No │
│ ↓ │
│ Execute Command Send to Cloud ASR │
│ │
└─────────────────────────────────────────┘
Total latency (on-device): < 40ms
Power consumption: < 100mW during inference
Hybrid Edge-Cloud Architecture
class HybridCommandClassifier:
"""
Intelligent routing between edge and cloud
"""
def __init__(self):
self.edge_model = load_edge_model() # Small CNN
self.cloud_client = CloudASRClient()
# Common commands handled on-device
self.edge_commands = {
'lights_on', 'lights_off',
'volume_up', 'volume_down',
'play', 'pause', 'stop',
'next', 'previous'
}
async def classify(self, audio):
# Try edge first
edge_pred, edge_conf = self.edge_model.predict(audio)
# High confidence + known command → use edge
if edge_conf > 0.85 and edge_pred in self.edge_commands:
return {
'command': edge_pred,
'confidence': edge_conf,
'source': 'edge',
'latency_ms': 35
}
# Otherwise → cloud ASR
cloud_result = await self.cloud_client.recognize(audio)
return {
'command': cloud_result['text'],
'confidence': cloud_result['confidence'],
'source': 'cloud',
'latency_ms': 250
}
Benefits:
- ✅ 90% of commands handled on-device (< 50ms)
- ✅ 10% fall back to cloud for complex queries
- ✅ Privacy for common commands
- ✅ Graceful degradation if network unavailable
A/B Testing & Gradual Rollout
Experiment Framework
class ModelExperiment:
"""
A/B test new model versions
"""
def __init__(self, control_model, treatment_model, treatment_percentage=10):
self.control = control_model
self.treatment = treatment_model
self.treatment_pct = treatment_percentage
def predict(self, audio, user_id):
# Deterministic assignment based on user_id
bucket = hash(user_id) % 100
if bucket < self.treatment_pct:
# Treatment group
pred, conf = self.treatment.predict(audio)
variant = 'treatment'
else:
# Control group
pred, conf = self.control.predict(audio)
variant = 'control'
# Log for analysis
self.log_prediction(user_id, variant, pred, conf)
return pred, conf
def log_prediction(self, user_id, variant, prediction, confidence):
"""Log to analytics system"""
event = {
'user_id': user_id,
'timestamp': time.time(),
'variant': variant,
'prediction': prediction,
'confidence': confidence
}
analytics_logger.log(event)
Metrics to Track
def compute_experiment_metrics(control_group, treatment_group):
"""
Compare model versions
"""
metrics = {}
# Accuracy (if ground truth available)
if has_ground_truth:
metrics['accuracy_control'] = compute_accuracy(control_group)
metrics['accuracy_treatment'] = compute_accuracy(treatment_group)
# Confidence distribution
metrics['avg_confidence_control'] = np.mean([x['confidence'] for x in control_group])
metrics['avg_confidence_treatment'] = np.mean([x['confidence'] for x in treatment_group])
# Latency
metrics['p95_latency_control'] = np.percentile([x['latency'] for x in control_group], 95)
metrics['p95_latency_treatment'] = np.percentile([x['latency'] for x in treatment_group], 95)
# User engagement (proxy for accuracy)
metrics['retry_rate_control'] = compute_retry_rate(control_group)
metrics['retry_rate_treatment'] = compute_retry_rate(treatment_group)
# Statistical significance
from scipy.stats import ttest_ind
control_success = [x['success'] for x in control_group]
treatment_success = [x['success'] for x in treatment_group]
t_stat, p_value = ttest_ind(control_success, treatment_success)
metrics['p_value'] = p_value
metrics['is_significant'] = p_value < 0.05
return metrics
Real-World Examples
Google Assistant
“Hey Google” Wake Word:
- Always-on detection using tiny model (< 1MB)
- Runs on low-power co-processor (DSP)
- < 10ms latency, ~0.5mW power
- ~ 99.5% accuracy on target phrase
- Personalized over time with on-device learning
Command Classification:
- Separate model for common commands (~30 commands)
- Fallback to full ASR for complex queries
- On-device for privacy (no audio sent to cloud)
- Multi-language support (40+ languages)
Architecture:
Microphone → Beamformer → Wake Word → Command CNN → Execute
↓
(if low conf)
↓
Cloud ASR
Amazon Alexa
“Alexa” Wake Word:
- Multi-stage cascade:
- Stage 1: Energy detector (< 1ms, filters silence)
- Stage 2: Keyword spotter (< 10ms, CNN)
- Stage 3: Full verification (< 50ms, larger model)
- Reduces false positives by 10x
- Power-efficient (only stage 3 uses main CPU)
Custom Skills:
- Slot-filling approach for structured commands
- Template: “play {song} by {artist}”
- Combined classification + entity extraction
- ~100K custom skills available
Deployment:
- Edge: Wake word + simple commands
- Cloud: Everything else (200ms latency acceptable)
Apple Siri
“Hey Siri” Detection:
- Neural network on Neural Engine (dedicated ML chip)
- Personalized to user’s voice during setup
- Continuously adapts to voice changes
- < 50ms latency
- Works offline (completely on-device)
- Power: < 1mW in always-listening mode
Privacy Design:
- Audio never sent to cloud without explicit activation
- Voice profile stored locally (encrypted)
- Random identifier (not tied to Apple ID)
Technical Details:
- Uses LSTM for temporal modeling
- Trained on millions of “Hey Siri” variations
- Negative examples: TV shows, movies, other voices
Key Takeaways
- ✅ Direct audio→intent classification is faster than ASR→NLU for limited command sets
- ✅ CNNs on mel-spectrograms work excellently on-device
- ✅ Data augmentation (noise, time shift, pitch) is critical for robustness
- ✅ Explicit unknown-class handling prevents false activations
- ✅ Quantization can reach roughly 4x compression (full int8) with < 1% accuracy loss
- ✅ Threshold tuning balances precision/recall for business needs
Further Reading
Tools:
- TensorFlow Lite
- Core ML
- Librosa - Audio processing
FAQ
Q: Why use direct audio classification instead of ASR plus NLU for voice commands? A: Direct audio-to-intent classification is 5x faster (45ms vs 250ms), uses 10x smaller models (under 10MB vs over 100MB), works offline on-device, and is more privacy-preserving since no text is sent to the cloud. It is ideal for a limited vocabulary of 30-100 commands. For complex open-ended queries, fall back to full streaming ASR.
Q: Which model architecture is best for on-device speech command classification? A: CNNs on mel-spectrograms are best for on-device deployment, achieving 93% accuracy with only 2MB model size and 15ms CPU inference time. With knowledge distillation from a larger Transformer teacher, accuracy improves to 95% without increasing model size.
Q: How do you handle unknown or out-of-vocabulary commands? A: Three strategies work together: training an explicit “unknown” class with diverse negative examples (random speech, music, noise), applying confidence thresholding to reject predictions below 0.7 probability, and using entropy-based out-of-distribution detection where high entropy indicates the model is uncertain about the input. See also VAD for filtering non-speech audio before classification.
Originally published at: arunbaby.com/speech-tech/0002-speech-classification