Audio Feature Extraction for Speech ML
How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.
TL;DR
Audio feature extraction transforms high-dimensional raw waveforms (16,000 samples/sec) into compact ML-ready representations. MFCCs (40 coefficients per frame) capture spectral envelopes for traditional ASR, mel-spectrograms (80 bins) preserve full spectral detail for CNNs, and wav2vec2 embeddings (768 dimensions) provide 15-25% accuracy gains via transfer learning. Delta features capture temporal dynamics, SpecAugment provides 10-20% WER reduction through feature-space augmentation, and streaming circular buffers enable real-time extraction under 10ms. These features feed directly into streaming ASR, speech classification, VAD, and speaker recognition systems.

Introduction
Raw audio waveforms are high-dimensional, noisy, and difficult for ML models to learn from directly. Feature extraction transforms audio into compact, informative representations that:
- Capture important speech characteristics
- Reduce dimensionality (16kHz audio = 16,000 samples/sec → ~100 frames/sec of ~40 coefficients)
- Provide invariance to irrelevant variations (volume, recording device)
- Enable efficient model training
Why it matters:
- Improves accuracy: Good features → better models
- Reduces compute: Lower dimensionality = faster training/inference
- Enables transfer learning: Pre-extracted features work across tasks
- Production efficiency: Feature extraction can be cached
What you’ll learn:
- Core audio features (MFCCs, spectrograms, mel-scale)
- Time-domain vs frequency-domain features
- Production-grade extraction pipelines
- Optimization for real-time processing
- Feature engineering for speech tasks
Problem Definition
Design a feature extraction pipeline for speech ML systems.
Functional Requirements
- Feature Types
- Time-domain features (energy, zero-crossing rate)
- Frequency-domain features (spectrograms, MFCCs)
- Temporal features (deltas, delta-deltas)
- Learned features (embeddings)
- Input Handling (see the sketch after this list)
- Support multiple sample rates (8kHz, 16kHz, 48kHz)
- Handle variable-length audio
- Process both mono and stereo
- Support batch processing
- Output Format
- Fixed-size feature vectors
- Variable-length sequences
- 2D/3D tensors for neural networks
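To make the Input Handling requirements concrete, here is a minimal sketch (the helper name `standardize_input` is ours, not from any library) that downmixes stereo to mono and resamples to a target rate with librosa:

import librosa
import numpy as np

def standardize_input(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample so downstream extractors see one format."""
    if audio.ndim == 2:  # (channels, samples), e.g., from librosa.load(mono=False)
        audio = librosa.to_mono(audio)
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    return audio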
Non-Functional Requirements
- Performance
- Real-time: Extract features < 10ms for 1 sec audio
- Batch: Process 10K files/hour on single machine
- Memory: < 100MB RAM for streaming
- Quality
- Robust to noise
- Consistent across devices
- Reproducible (deterministic)
- Flexibility
- Configurable parameters
- Support multiple backends (librosa, torchaudio)
- Easy to extend with new features
Audio Basics
Waveform Representation
import numpy as np
import librosa
import matplotlib.pyplot as plt
# Load audio
audio, sr = librosa.load('speech.wav', sr=16000)
print(f"Sample rate: {sr} Hz")
print(f"Duration: {len(audio) / sr:.2f} seconds")
print(f"Shape: {audio.shape}")
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")
# Visualize waveform
plt.figure(figsize=(12, 4))
time = np.arange(len(audio)) / sr
plt.plot(time, audio)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio Waveform')
plt.show()
Key properties:
- Sample rate (sr): Samples per second (e.g., 16000 Hz = 16000 samples/sec)
- Duration: len(audio) / sr seconds
- Amplitude: Typically normalized to [-1, 1]
Feature 1: Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs are the most widely used features in speech recognition.
Why MFCCs?
- Mimic human hearing: Use mel scale (perceptual frequency scale; see the conversion sketch below)
- Compact: Represent spectral envelope with 13-40 coefficients
- Robust: Less sensitive to pitch variations
- Proven: Gold standard for ASR for decades
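The mel scale itself is a one-line formula. A quick sketch of the HTK-style conversion (note: librosa defaults to the slightly different Slaney variant):

import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel conversion: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000))  # ~1000 mels
print(hz_to_mel(8000))  # ~2840 mels - high frequencies get compressed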
How MFCCs Work
Audio Waveform
↓
1. Pre-emphasis (boost high frequencies)
↓
2. Frame the signal (25ms windows, 10ms hop)
↓
3. Apply window function (Hamming)
↓
4. FFT (Fast Fourier Transform)
↓
5. Mel filterbank (map to mel scale)
↓
6. Log (compress dynamic range)
↓
7. DCT (Discrete Cosine Transform)
↓
MFCCs (13-40 coefficients per frame)
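Before the librosa one-liner below, here is a step-by-step sketch of the same pipeline in NumPy/librosa. The 0.97 pre-emphasis coefficient is a common convention (an assumption here, not a universal constant), and the frame/filterbank parameters mirror the extractor class that follows:

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_scratch(audio, sr=16000, n_mfcc=13):
    # 1. Pre-emphasis: boost high frequencies (y[t] = x[t] - 0.97*x[t-1])
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # 2-4. Frame (25 ms window, 10 ms hop), Hamming window, FFT -> power spectrum
    stft = librosa.stft(emphasized, n_fft=512, hop_length=160,
                        win_length=400, window='hamming')
    power = np.abs(stft) ** 2
    # 5. Mel filterbank maps (1 + n_fft//2) linear bins down to 40 mel bins
    mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=40, fmin=20, fmax=8000)
    mel_energies = mel_fb @ power
    # 6. Log compresses the dynamic range
    log_mel = np.log(mel_energies + 1e-10)
    # 7. DCT decorrelates; keep the first n_mfcc coefficients
    return dct(log_mel, axis=0, norm='ortho')[:n_mfcc]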
Implementation
import librosa
import numpy as np
class MFCCExtractor:
"""
Extract MFCC features from audio
Standard configuration for speech recognition
"""
def __init__(
self,
sr=16000,
n_mfcc=40,
n_fft=512,
hop_length=160, # 10ms at 16kHz
n_mels=40,
fmin=20,
fmax=8000
):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs
Args:
audio: Audio waveform (1D array)
Returns:
MFCCs: (n_mfcc, time_steps)
"""
# Extract MFCCs
mfccs = librosa.feature.mfcc(
y=audio,
sr=self.sr,
n_mfcc=self.n_mfcc,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
return mfccs # Shape: (n_mfcc, time)
def extract_with_deltas(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs + deltas + delta-deltas
Deltas capture temporal dynamics
Returns:
Features: (n_mfcc * 3, time_steps)
"""
# MFCCs
mfccs = self.extract(audio)
# Delta (first derivative)
delta = librosa.feature.delta(mfccs)
# Delta-delta (second derivative)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack
features = np.vstack([mfccs, delta, delta2]) # (120, time)
return features
# Usage
extractor = MFCCExtractor()
mfccs = extractor.extract(audio)
print(f"MFCCs shape: {mfccs.shape}") # (40, time_steps)
# With deltas
features = extractor.extract_with_deltas(audio)
print(f"MFCCs+deltas shape: {features.shape}") # (120, time_steps)
Visualizing MFCCs
import matplotlib.pyplot as plt
def plot_mfccs(mfccs, sr, hop_length):
"""Visualize MFCC features"""
plt.figure(figsize=(12, 6))
# Convert frame indices to time
times = librosa.frames_to_time(
np.arange(mfccs.shape[1]),
sr=sr,
hop_length=hop_length
)
plt.imshow(
mfccs,
aspect='auto',
origin='lower',
extent=[times[0], times[-1], 0, mfccs.shape[0]],
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.xlabel('Time (s)')
plt.ylabel('MFCC Coefficient')
plt.title('MFCC Features')
plt.tight_layout()
plt.show()
plot_mfccs(mfccs, sr=16000, hop_length=160)
Feature 2: Mel-Spectrograms
Mel-spectrograms preserve more spectral detail than MFCCs.
What is a Spectrogram?
A spectrogram shows how the frequency content of a signal changes over time.
- X-axis: Time
- Y-axis: Frequency
- Color: Magnitude (energy)
Mel-Spectrogram vs MFCC
| Aspect | Mel-Spectrogram | MFCC |
|---|---|---|
| Dimensions | (n_mels, time) | (n_mfcc, time) |
| Information | Full spectrum | Spectral envelope |
| Size | 40-128 bins | 13-40 coefficients |
| Use case | CNNs, deep learning | Traditional ASR |
| Spectral detail | Higher (all mel bins kept) | Lower (DCT keeps only the envelope) |
Implementation
class MelSpectrogramExtractor:
"""
Extract log mel-spectrogram features
Popular for deep learning models (CNNs, Transformers)
"""
def __init__(
self,
sr=16000,
n_fft=512,
hop_length=160,
n_mels=80,
fmin=0,
fmax=8000
):
self.sr = sr
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log mel-spectrogram
Returns:
Log mel-spectrogram: (n_mels, time_steps)
"""
# Compute mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
# Convert to log scale (dB)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel # Shape: (n_mels, time)
def extract_normalized(self, audio: np.ndarray) -> np.ndarray:
"""
Extract and normalize to [0, 1]
Better for neural networks
"""
log_mel = self.extract(audio)
# Normalize to [0, 1]
log_mel_norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
return log_mel_norm
# Usage
mel_extractor = MelSpectrogramExtractor(n_mels=80)
mel_spec = mel_extractor.extract(audio)
print(f"Mel-spectrogram shape: {mel_spec.shape}") # (80, time_steps)
Visualizing Mel-Spectrogram
def plot_mel_spectrogram(mel_spec, sr, hop_length):
"""Visualize mel-spectrogram"""
plt.figure(figsize=(12, 6))
librosa.display.specshow(
mel_spec,
sr=sr,
hop_length=hop_length,
x_axis='time',
y_axis='mel',
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-Spectrogram')
plt.tight_layout()
plt.show()
plot_mel_spectrogram(mel_spec, sr=16000, hop_length=160)
Feature 3: Raw Spectrograms (STFT)
The Short-Time Fourier Transform (STFT) retains the full linear-frequency resolution of the FFT.
Implementation
class STFTExtractor:
"""
Extract raw STFT features
Used when you need full frequency resolution
"""
def __init__(
self,
n_fft=512,
hop_length=160,
win_length=400
):
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract magnitude spectrogram
Returns:
Spectrogram: (n_fft//2 + 1, time_steps)
"""
# Compute STFT
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
# Get magnitude
magnitude = np.abs(stft)
# Convert to dB
magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
return magnitude_db # Shape: (n_fft//2 + 1, time)
def extract_with_phase(self, audio: np.ndarray):
"""
Extract magnitude and phase
Phase information useful for reconstruction
"""
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
magnitude = np.abs(stft)
phase = np.angle(stft)
return magnitude, phase
# Usage
stft_extractor = STFTExtractor()
spectrogram = stft_extractor.extract(audio)
print(f"Spectrogram shape: {spectrogram.shape}") # (257, time_steps)
Feature 4: Time-Domain Features
Simple but effective features computed directly from waveform.
Implementation
class TimeDomainExtractor:
"""
Extract time-domain features
Fast to compute, useful for simple tasks
"""
def extract_energy(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Frame-wise energy (RMS)
Captures loudness/volume over time
"""
energy = librosa.feature.rms(
y=audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return energy
def extract_zero_crossing_rate(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Zero-crossing rate
Measures how often signal crosses zero
High ZCR → noisy/unvoiced
Low ZCR → tonal/voiced
"""
zcr = librosa.feature.zero_crossing_rate(
audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return zcr
def extract_all(self, audio: np.ndarray):
"""Extract all time-domain features"""
energy = self.extract_energy(audio)
zcr = self.extract_zero_crossing_rate(audio)
# Stack features
features = np.vstack([energy, zcr]) # (2, time)
return features
# Usage
time_extractor = TimeDomainExtractor()
time_features = time_extractor.extract_all(audio)
print(f"Time-domain features shape: {time_features.shape}") # (2, time_steps)
Feature 5: Pitch & Formants
Pitch and formants are linguistic features important for speech.
Pitch Extraction
class PitchExtractor:
"""
Extract fundamental frequency (F0)
Important for:
- Speaker recognition
- Emotion detection
- Prosody modeling
"""
def __init__(self, sr=16000, fmin=80, fmax=400):
self.sr = sr
        self.fmin = fmin  # Lower bound (~low end of male pitch)
        self.fmax = fmax  # Upper bound (~high end of female pitch)
def extract_f0(self, audio: np.ndarray, hop_length=160):
"""
Extract pitch (fundamental frequency)
Returns:
f0: Pitch values (Hz) per frame
voiced_flag: Boolean array (voiced vs unvoiced)
"""
        # Extract pitch using the pYIN algorithm.
        # (librosa.yin returns an estimate for every frame, so `f0 > 0` would
        # mark everything as voiced; pyin returns an explicit voiced flag.)
        f0, voiced_flag, voiced_prob = librosa.pyin(
            audio,
            fmin=self.fmin,
            fmax=self.fmax,
            sr=self.sr,
            hop_length=hop_length
        )
        # Unvoiced frames come back as NaN; zero them so downstream code
        # can mask with voiced_flag instead
        f0 = np.nan_to_num(f0)
        return f0, voiced_flag
def extract_pitch_features(self, audio: np.ndarray):
"""
Extract pitch statistics
Useful for speaker/emotion recognition
"""
f0, voiced = self.extract_f0(audio)
# Statistics on voiced frames
voiced_f0 = f0[voiced]
if len(voiced_f0) > 0:
features = {
'mean_pitch': np.mean(voiced_f0),
'std_pitch': np.std(voiced_f0),
'min_pitch': np.min(voiced_f0),
'max_pitch': np.max(voiced_f0),
'pitch_range': np.max(voiced_f0) - np.min(voiced_f0),
'voiced_ratio': np.sum(voiced) / len(voiced)
}
else:
features = {k: 0.0 for k in ['mean_pitch', 'std_pitch', 'min_pitch', 'max_pitch', 'pitch_range', 'voiced_ratio']}
return features
# Usage
pitch_extractor = PitchExtractor()
f0, voiced = pitch_extractor.extract_f0(audio)
print(f"Pitch shape: {f0.shape}")
pitch_stats = pitch_extractor.extract_pitch_features(audio)
print(f"Pitch statistics: {pitch_stats}")
Production Feature Pipeline
Combine all features into a unified pipeline.
Unified Feature Extractor
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class FeatureConfig:
"""Configuration for feature extraction"""
sr: int = 16000
    feature_types: Optional[List[str]] = None  # e.g., ['mfcc', 'mel', 'pitch']
# MFCC config
n_mfcc: int = 40
# Mel-spectrogram config
n_mels: int = 80
# Common config
n_fft: int = 512
hop_length: int = 160 # 10ms
# Normalization
normalize: bool = True
def __post_init__(self):
if self.feature_types is None:
self.feature_types = ['mfcc']
class AudioFeatureExtractor:
"""
Production-grade audio feature extractor
Supports multiple feature types, caching, and batch processing
"""
def __init__(self, config: FeatureConfig):
self.config = config
# Initialize extractors
self.mfcc_extractor = MFCCExtractor(
sr=config.sr,
n_mfcc=config.n_mfcc,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.mel_extractor = MelSpectrogramExtractor(
sr=config.sr,
n_mels=config.n_mels,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.pitch_extractor = PitchExtractor(sr=config.sr)
self.time_extractor = TimeDomainExtractor()
def extract(self, audio: np.ndarray) -> Dict[str, np.ndarray]:
"""
Extract features based on config
Args:
audio: Audio waveform
Returns:
Dictionary of features
"""
features = {}
if 'mfcc' in self.config.feature_types:
mfccs = self.mfcc_extractor.extract_with_deltas(audio)
if self.config.normalize:
mfccs = self._normalize(mfccs)
features['mfcc'] = mfccs
if 'mel' in self.config.feature_types:
mel = self.mel_extractor.extract(audio)
if self.config.normalize:
mel = self._normalize(mel)
features['mel'] = mel
if 'pitch' in self.config.feature_types:
f0, voiced = self.pitch_extractor.extract_f0(audio, hop_length=self.config.hop_length)
features['pitch'] = f0
features['voiced'] = voiced.astype(np.float32)
if 'time' in self.config.feature_types:
time_feats = self.time_extractor.extract_all(audio)
if self.config.normalize:
time_feats = self._normalize(time_feats)
features['time'] = time_feats
return features
def _normalize(self, features: np.ndarray) -> np.ndarray:
"""
Normalize features (mean=0, std=1) per coefficient
"""
mean = np.mean(features, axis=1, keepdims=True)
std = np.std(features, axis=1, keepdims=True) + 1e-8
normalized = (features - mean) / std
return normalized
def extract_from_file(self, audio_path: str) -> Dict[str, np.ndarray]:
"""
Extract features from audio file
"""
audio, sr = librosa.load(audio_path, sr=self.config.sr)
return self.extract(audio)
def extract_batch(self, audio_list: List[np.ndarray]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from batch of audio
"""
return [self.extract(audio) for audio in audio_list]
def save_config(self, path: str):
"""Save feature extraction config"""
with open(path, 'w') as f:
json.dump(self.config.__dict__, f, indent=2)
@staticmethod
def load_config(path: str) -> FeatureConfig:
"""Load feature extraction config"""
with open(path, 'r') as f:
config_dict = json.load(f)
return FeatureConfig(**config_dict)
# Usage
config = FeatureConfig(
feature_types=['mfcc', 'mel', 'pitch'],
n_mfcc=40,
n_mels=80,
normalize=True
)
extractor = AudioFeatureExtractor(config)
# Extract features
features = extractor.extract(audio)
print("Extracted features:", features.keys())
for name, feat in features.items():
print(f" {name}: {feat.shape}")
# Save config for reproducibility
extractor.save_config('feature_config.json')
Handling Variable-Length Audio
Audio clips vary in duration, but most models expect fixed-size inputs, so the pipeline must handle variable length explicitly.
Strategy 1: Padding/Truncation
class VariableLengthHandler:
"""
Handle variable-length audio
"""
def pad_or_truncate(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Pad or truncate features to fixed length
Args:
features: (n_features, time)
target_length: Target time dimension
Returns:
Fixed-length features: (n_features, target_length)
"""
current_length = features.shape[1]
if current_length < target_length:
# Pad with zeros
pad_width = ((0, 0), (0, target_length - current_length))
features = np.pad(features, pad_width, mode='constant')
elif current_length > target_length:
# Truncate (take first target_length frames)
features = features[:, :target_length]
return features
def create_mask(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Create attention mask for padded features
Returns:
Mask: (target_length,) - 1 for real frames, 0 for padding
"""
current_length = features.shape[1]
mask = np.zeros(target_length)
mask[:min(current_length, target_length)] = 1
return mask
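Usage (assuming the mfccs array from the extractor earlier; the 1000-frame target is arbitrary - 10 seconds at a 10 ms hop):

handler = VariableLengthHandler()

fixed = handler.pad_or_truncate(mfccs, target_length=1000)  # (40, 1000)
mask = handler.create_mask(mfccs, target_length=1000)       # (1000,)
print(f"Real frames: {int(mask.sum())}")  # frames that aren't padding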
Strategy 2: Temporal Pooling
class TemporalPooler:
"""
Pool variable-length features to fixed size
"""
def mean_pool(self, features: np.ndarray) -> np.ndarray:
"""
Average pool over time
Args:
features: (n_features, time)
Returns:
Pooled: (n_features,)
"""
return np.mean(features, axis=1)
def max_pool(self, features: np.ndarray) -> np.ndarray:
"""Max pool over time"""
return np.max(features, axis=1)
def stats_pool(self, features: np.ndarray) -> np.ndarray:
"""
Statistical pooling: mean + std
Returns:
Pooled: (n_features * 2,)
"""
mean = np.mean(features, axis=1)
std = np.std(features, axis=1)
return np.concatenate([mean, std])
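Usage - pooling collapses any-length features into one fixed vector per clip. Statistical pooling in particular is the standard front-end in x-vector-style speaker embedding systems:

pooler = TemporalPooler()

mean_vec = pooler.mean_pool(mfccs)    # (40,)
stats_vec = pooler.stats_pool(mfccs)  # (80,) - mean and std per coefficient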
Real-Time Feature Extraction
Streaming applications need incremental feature extraction rather than whole-file processing.
Streaming Feature Extractor
from collections import deque
from typing import Optional
class StreamingFeatureExtractor:
"""
Extract features from streaming audio
Use case: Real-time ASR, voice assistants
"""
def __init__(
self,
sr=16000,
frame_length_ms=25,
hop_length_ms=10,
buffer_duration_ms=500
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.buffer_length = int(sr * buffer_duration_ms / 1000)
# Circular buffer for audio
self.buffer = deque(maxlen=self.buffer_length)
# Feature extractor
self.extractor = MFCCExtractor(
sr=sr,
hop_length=self.hop_length
)
def add_audio_chunk(self, audio_chunk: np.ndarray):
"""
Add new audio chunk to buffer
Args:
audio_chunk: New audio samples
"""
self.buffer.extend(audio_chunk)
def extract_latest(self) -> Optional[np.ndarray]:
"""
Extract features from current buffer
Returns:
Features or None if buffer too small
"""
if len(self.buffer) < self.frame_length:
return None
# Convert buffer to array
audio = np.array(self.buffer)
# Extract features
features = self.extractor.extract(audio)
return features
def reset(self):
"""Clear buffer"""
self.buffer.clear()
# Usage
streaming_extractor = StreamingFeatureExtractor()
# Simulate streaming (100ms chunks)
chunk_size = 1600 # 100ms at 16kHz
for i in range(0, len(audio), chunk_size):
chunk = audio[i:i+chunk_size]
# Add to buffer
streaming_extractor.add_audio_chunk(chunk)
# Extract features
features = streaming_extractor.extract_latest()
if features is not None:
print(f"Chunk {i//chunk_size}: features shape = {features.shape}")
# Process features (send to model, etc.)
Performance Optimization
1. Caching Features
import os
import pickle
import hashlib
class CachedFeatureExtractor:
"""
Cache extracted features to disk
Avoid re-extracting for same audio
"""
def __init__(self, extractor: AudioFeatureExtractor, cache_dir='./feature_cache'):
self.extractor = extractor
self.cache_dir = cache_dir
os.makedirs(cache_dir, exist_ok=True)
def _get_cache_path(self, audio_path: str) -> str:
"""Generate cache file path based on audio path hash"""
path_hash = hashlib.md5(audio_path.encode()).hexdigest()
return os.path.join(self.cache_dir, f"{path_hash}.pkl")
def extract_from_file(self, audio_path: str, use_cache=True) -> Dict[str, np.ndarray]:
"""
Extract features with caching
"""
cache_path = self._get_cache_path(audio_path)
# Check cache
if use_cache and os.path.exists(cache_path):
with open(cache_path, 'rb') as f:
features = pickle.load(f)
return features
# Extract features
features = self.extractor.extract_from_file(audio_path)
# Save to cache
with open(cache_path, 'wb') as f:
pickle.dump(features, f)
return features
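Usage note: because the cache key hashes the file path rather than the file contents, re-recording or editing an audio file will not invalidate its cached features - hash the file bytes instead if that matters for your pipeline.

cached = CachedFeatureExtractor(extractor)
features = cached.extract_from_file('speech.wav')  # extracts and writes cache
features = cached.extract_from_file('speech.wav')  # second call reads from disk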
2. Parallel Processing
from multiprocessing import Pool
from functools import partial
class ParallelFeatureExtractor:
"""
Extract features from multiple files in parallel
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=4):
self.extractor = extractor
self.n_workers = n_workers
def extract_from_files(self, audio_paths: List[str]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from multiple files in parallel
"""
with Pool(self.n_workers) as pool:
features_list = pool.map(
self.extractor.extract_from_file,
audio_paths
)
return features_list
# Usage
parallel_extractor = ParallelFeatureExtractor(extractor, n_workers=8)
audio_files = ['file1.wav', 'file2.wav', ...] # 1000s of files
features = parallel_extractor.extract_from_files(audio_files)
Advanced Feature Types
1. Learned Features (Embeddings)
Instead of hand-crafted features, learn representations from data.
import torch
import torch.nn as nn
class AudioEmbeddingExtractor(nn.Module):
"""
Extract learned audio embeddings
Use pre-trained models (wav2vec, HuBERT) as feature extractors
"""
def __init__(self, model_name='facebook/wav2vec2-base'):
super().__init__()
from transformers import Wav2Vec2Model
# Load pre-trained model
self.model = Wav2Vec2Model.from_pretrained(model_name)
self.model.eval() # Freeze for feature extraction
def extract(self, audio: np.ndarray, sr=16000) -> np.ndarray:
"""
Extract contextualized embeddings
Returns:
Embeddings: (time_steps, hidden_dim)
typically (time, 768) for base model
"""
# Convert to tensor
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
# Extract features
with torch.no_grad():
outputs = self.model(audio_tensor)
embeddings = outputs.last_hidden_state[0] # (time, 768)
return embeddings.numpy()
# Usage - MUCH more powerful than MFCCs for transfer learning
embedding_extractor = AudioEmbeddingExtractor()
embeddings = embedding_extractor.extract(audio)
print(f"Embeddings shape: {embeddings.shape}") # (time, 768)
Comparison:
| Feature Type | Dimension | Training Required | Transfer Learning | Accuracy |
|---|---|---|---|---|
| MFCCs | 40-120 | No | Poor | Baseline |
| Mel-spectrogram | 80-128 | No | Good | +5-10% |
| Wav2Vec embeddings | 768 | Pre-trained (self-supervised) | Excellent | +15-25% |
2. Filter Bank Features (FBank)
Alternative to MFCCs - skip the DCT step.
class FilterbankExtractor:
"""
Extract log mel-filterbank features
Similar to mel-spectrograms, popular in modern ASR
"""
def __init__(self, sr=16000, n_mels=80, n_fft=512, hop_length=160):
self.sr = sr
self.n_mels = n_mels
self.n_fft = n_fft
self.hop_length = hop_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log filter bank energies
Returns:
FBank: (n_mels, time_steps)
"""
# Mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels
)
# Log
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel
# FBank vs MFCC:
# - FBank: Keep all mel bins (80-128)
# - MFCC: Compress to 13-40 via DCT
#
# FBank often works better with neural networks
3. Prosodic Features
Capture rhythm, stress, and intonation.
class ProsodicFeatureExtractor:
"""
Extract prosodic features for emotion, speaker ID, etc.
"""
def extract_intensity_contour(self, audio, sr=16000, hop_length=160):
"""
Intensity (loudness) over time
"""
intensity = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
# Convert to dB
intensity_db = librosa.amplitude_to_db(intensity, ref=np.max)
return intensity_db
def extract_speaking_rate(self, audio, sr=16000):
"""
Estimate speaking rate (syllables per second)
Approximation: count peaks in energy envelope
"""
# Energy envelope
energy = librosa.feature.rms(y=audio, hop_length=160)[0]
# Find peaks (local maxima)
from scipy.signal import find_peaks
peaks, _ = find_peaks(energy, distance=10, prominence=0.1)
# Speaking rate
duration = len(audio) / sr
syllables_per_sec = len(peaks) / duration
return syllables_per_sec
def extract_all_prosodic(self, audio, sr=16000):
"""Extract all prosodic features"""
# Pitch
pitch_extractor = PitchExtractor(sr=sr)
pitch_stats = pitch_extractor.extract_pitch_features(audio)
# Intensity
intensity = self.extract_intensity_contour(audio, sr)
# Speaking rate
speaking_rate = self.extract_speaking_rate(audio, sr)
return {
**pitch_stats,
'mean_intensity': np.mean(intensity),
'std_intensity': np.std(intensity),
'speaking_rate': speaking_rate
}
Feature Quality & Validation
Ensure extracted features are high quality.
Feature Quality Metrics
class FeatureQualityChecker:
"""
Validate quality of extracted features
"""
def check_for_nans(self, features: Dict[str, np.ndarray]) -> bool:
"""Check for NaN/Inf values"""
for name, feat in features.items():
if np.isnan(feat).any() or np.isinf(feat).any():
print(f"⚠️ {name} contains NaN/Inf")
return False
return True
def check_dynamic_range(self, features: Dict[str, np.ndarray]) -> Dict[str, float]:
"""
Check dynamic range of features
Low dynamic range → feature not informative
"""
ranges = {}
for name, feat in features.items():
feat_range = feat.max() - feat.min()
ranges[name] = feat_range
if feat_range < 1e-6:
print(f"⚠️ {name} has very low dynamic range: {feat_range}")
return ranges
def check_feature_statistics(self, features_batch: List[np.ndarray]):
"""
Check statistics across batch
Ensure features are properly normalized
"""
# Stack all features
all_features = np.concatenate(features_batch, axis=1) # (n_features, total_time)
# Per-feature statistics
mean_per_feature = np.mean(all_features, axis=1)
std_per_feature = np.std(all_features, axis=1)
print("Feature Statistics:")
print(f" Mean range: [{mean_per_feature.min():.3f}, {mean_per_feature.max():.3f}]")
print(f" Std range: [{std_per_feature.min():.3f}, {std_per_feature.max():.3f}]")
# Check if normalized
if np.abs(mean_per_feature).max() > 0.1:
print("⚠️ Features not centered (mean far from 0)")
if np.abs(std_per_feature - 1.0).max() > 0.2:
print("⚠️ Features not standardized (std far from 1)")
Connection to Data Preprocessing Pipeline
Feature extraction for speech is analogous to data preprocessing for ML systems.
Parallel Concepts
| Speech Feature Extraction | ML Data Preprocessing |
|---|---|
| Handle missing audio | Handle missing values |
| Normalize features (mean=0, std=1) | Normalize numerical features |
| Pad/truncate variable length | Handle variable-length sequences |
| Validate audio quality | Schema validation |
| Cache extracted features | Cache preprocessed data |
| Batch processing | Distributed data processing |
Unified Preprocessing Framework
class UnifiedPreprocessor:
"""
Combined preprocessing for multimodal ML
Example: Speech + text + metadata
"""
def __init__(self):
# Audio features
self.audio_extractor = AudioFeatureExtractor(
FeatureConfig(feature_types=['mfcc', 'mel'])
)
# Text features (from transcripts)
from sklearn.feature_extraction.text import TfidfVectorizer
self.text_vectorizer = TfidfVectorizer(max_features=1000)
# Numerical features
from sklearn.preprocessing import StandardScaler
self.numerical_scaler = StandardScaler()
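        # NOTE (assumption): both transform() calls in preprocess_sample()
        # require the vectorizer and scaler to be fitted on training data first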
def preprocess_sample(self, audio, text, metadata):
"""
Preprocess multimodal sample
Args:
audio: Audio waveform
text: Transcript or description
metadata: User/item metadata (dict)
Returns:
Combined feature vector
"""
# Extract audio features
audio_features = self.audio_extractor.extract(audio)
        audio_pooled = np.mean(audio_features['mfcc'], axis=1)  # (120,) MFCCs+deltas pooled over time
# Extract text features
text_features = self.text_vectorizer.transform([text]).toarray()[0] # (1000,)
# Process metadata
        metadata_array = np.array([
            metadata['user_age'],
            metadata['user_gender'],   # assumed already numerically encoded
            metadata['device_type']    # assumed already numerically encoded
        ])
        metadata_scaled = self.numerical_scaler.transform([metadata_array])[0]
        # Concatenate all features
        combined = np.concatenate([
            audio_pooled,      # (120,)
            text_features,     # (1000,)
            metadata_scaled    # (3,)
        ])  # Total: (1123,)
return combined
Production Best Practices
1. Feature Versioning
Track feature extraction versions for reproducibility.
from datetime import datetime

class VersionedFeatureExtractor:
"""
Version feature extraction logic
Critical for:
- A/B testing different features
- Rollback if new features hurt performance
- Reproducibility
"""
VERSION = "1.2.0"
def __init__(self, config: FeatureConfig):
self.config = config
self.extractor = AudioFeatureExtractor(config)
def extract_with_metadata(self, audio_path: str):
"""
Extract features with version metadata
"""
features = self.extractor.extract_from_file(audio_path)
metadata = {
'version': self.VERSION,
'config': self.config.__dict__,
'timestamp': datetime.now().isoformat(),
'audio_path': audio_path
}
return {
'features': features,
'metadata': metadata
}
def save_features(self, features, output_path):
"""Save features with version info"""
np.savez_compressed(
output_path,
**features['features'],
metadata=json.dumps(features['metadata'])
)
2. Error Handling
Robust feature extraction handles failures gracefully.
import logging

logger = logging.getLogger(__name__)

class RobustFeatureExtractor:
"""
Feature extractor with error handling
"""
def __init__(self, extractor: AudioFeatureExtractor):
self.extractor = extractor
def extract_safe(self, audio_path: str) -> Optional[Dict]:
"""
Extract features with error handling
"""
try:
# Load audio
audio, sr = librosa.load(audio_path, sr=self.extractor.config.sr)
# Validate
if len(audio) == 0:
logger.warning(f"Empty audio: {audio_path}")
return None
if len(audio) < self.extractor.config.sr * 0.1: # < 100ms
logger.warning(f"Audio too short: {audio_path}")
return None
# Extract
features = self.extractor.extract(audio)
# Quality check
quality_checker = FeatureQualityChecker()
if not quality_checker.check_for_nans(features):
logger.error(f"Feature extraction failed (NaN): {audio_path}")
return None
return features
except Exception as e:
logger.error(f"Feature extraction error for {audio_path}: {e}")
return None
def extract_batch_robust(self, audio_paths: List[str]) -> List[Dict]:
"""
Extract from batch, skipping failures
"""
results = []
failures = []
for path in audio_paths:
features = self.extract_safe(path)
if features is not None:
results.append({'path': path, 'features': features})
else:
failures.append(path)
success_rate = len(results) / len(audio_paths)
logger.info(f"Feature extraction: {len(results)}/{len(audio_paths)} succeeded ({success_rate:.1%})")
if failures:
logger.warning(f"Failed files: {failures[:10]}") # Log first 10
return results
3. Monitoring Feature Quality
Track feature statistics over time to detect issues.
class FeatureMonitor:
"""
Monitor feature quality in production
"""
def __init__(self, expected_stats: Dict[str, Dict]):
"""
Args:
expected_stats: Expected statistics per feature type
{
'mfcc': {'mean_range': [-5, 5], 'std_range': [0.5, 2.0]},
'mel': {'mean_range': [-80, 0], 'std_range': [10, 30]}
}
"""
self.expected_stats = expected_stats
def validate_features(self, features: Dict[str, np.ndarray]) -> List[str]:
"""
Validate extracted features against expected statistics
Returns:
List of warnings
"""
warnings = []
for feat_name, feat_values in features.items():
if feat_name not in self.expected_stats:
continue
expected = self.expected_stats[feat_name]
# Check mean
actual_mean = np.mean(feat_values)
expected_mean_range = expected['mean_range']
if not (expected_mean_range[0] <= actual_mean <= expected_mean_range[1]):
warnings.append(
f"{feat_name}: mean {actual_mean:.2f} outside expected range {expected_mean_range}"
)
# Check std
actual_std = np.std(feat_values)
expected_std_range = expected['std_range']
if not (expected_std_range[0] <= actual_std <= expected_std_range[1]):
warnings.append(
f"{feat_name}: std {actual_std:.2f} outside expected range {expected_std_range}"
)
return warnings
def compute_statistics(self, features_batch: List[Dict[str, np.ndarray]]):
"""
Compute statistics across batch
Use to establish baseline expected_stats
"""
stats = {}
# Get feature names from first sample
feature_names = features_batch[0].keys()
for feat_name in feature_names:
# Collect all values
all_values = np.concatenate([
f[feat_name].flatten() for f in features_batch
])
stats[feat_name] = {
'mean': np.mean(all_values),
'std': np.std(all_values),
'min': np.min(all_values),
'max': np.max(all_values),
'percentiles': {
'25': np.percentile(all_values, 25),
'50': np.percentile(all_values, 50),
'75': np.percentile(all_values, 75),
'95': np.percentile(all_values, 95)
}
}
return stats
Data Augmentation in Feature Space
Augment features directly for training robustness.
SpecAugment
class SpecAugment:
"""
SpecAugment: Data augmentation on spectrograms
Proposed in "SpecAugment: A Simple Data Augmentation Method for ASR" (Google, 2019)
    Yields roughly 10-20% relative WER reduction on many benchmarks
"""
def __init__(
self,
time_mask_param=70,
freq_mask_param=15,
num_time_masks=2,
num_freq_masks=2
):
self.time_mask_param = time_mask_param
self.freq_mask_param = freq_mask_param
self.num_time_masks = num_time_masks
self.num_freq_masks = num_freq_masks
def time_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random time region
Sets random time frames to zero
"""
spec = spec.copy()
time_length = spec.shape[1]
for _ in range(self.num_time_masks):
t = np.random.randint(0, min(self.time_mask_param, time_length))
t0 = np.random.randint(0, time_length - t)
spec[:, t0:t0+t] = 0
return spec
def freq_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random frequency region
Sets random frequency bins to zero
"""
spec = spec.copy()
freq_length = spec.shape[0]
for _ in range(self.num_freq_masks):
f = np.random.randint(0, min(self.freq_mask_param, freq_length))
f0 = np.random.randint(0, freq_length - f)
spec[f0:f0+f, :] = 0
return spec
def augment(self, spec: np.ndarray) -> np.ndarray:
"""Apply both time and freq masking"""
spec = self.time_mask(spec)
spec = self.freq_mask(spec)
return spec
# Usage during training
augmenter = SpecAugment()
for audio, label in train_loader:
# Extract features
mel_spec = mel_extractor.extract(audio)
# Augment
mel_spec_aug = augmenter.augment(mel_spec)
# Train model
train_model(mel_spec_aug, label)
Batch Feature Extraction for Training
Extract features for entire dataset efficiently.
Batch Extraction Pipeline
import os
import logging
from pathlib import Path
from tqdm import tqdm
import h5py

logger = logging.getLogger(__name__)
class BatchFeatureExtractor:
"""
Extract features for large audio datasets
Use case: Prepare training data
- Extract once, train many times
- Save features to disk (HDF5 format)
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=8):
self.extractor = extractor
        self.n_workers = n_workers  # reserved for a parallel variant; the loop below is sequential
def extract_dataset(
self,
audio_dir: str,
output_path: str,
max_length_frames: int = 1000
):
"""
Extract features for all audio files in directory
Args:
audio_dir: Directory containing .wav files
output_path: HDF5 file to save features
max_length_frames: Pad/truncate to this length
"""
# Find all audio files
audio_files = list(Path(audio_dir).rglob('*.wav'))
print(f"Found {len(audio_files)} audio files")
# Create HDF5 file
with h5py.File(output_path, 'w') as hf:
            # Pre-allocate datasets (this sketch stores just the MFCC+delta stack)
            feature_dim = self.extractor.config.n_mfcc * 3  # MFCCs + deltas + delta-deltas
features_dataset = hf.create_dataset(
'features',
shape=(len(audio_files), feature_dim, max_length_frames),
dtype='float32'
)
lengths_dataset = hf.create_dataset(
'lengths',
shape=(len(audio_files),),
dtype='int32'
)
# Store file paths
paths_dataset = hf.create_dataset(
'paths',
shape=(len(audio_files),),
dtype=h5py.string_dtype()
)
# Extract features
for idx, audio_path in enumerate(tqdm(audio_files)):
try:
# Load audio
audio, sr = librosa.load(str(audio_path), sr=self.extractor.config.sr)
# Extract features
features = self.extractor.extract(audio)
# Get MFCCs with deltas
mfcc_deltas = features['mfcc'] # (120, time)
# Pad or truncate
handler = VariableLengthHandler()
mfcc_fixed = handler.pad_or_truncate(mfcc_deltas, max_length_frames)
# Store
features_dataset[idx] = mfcc_fixed
lengths_dataset[idx] = min(mfcc_deltas.shape[1], max_length_frames)
paths_dataset[idx] = str(audio_path)
except Exception as e:
logger.error(f"Failed to process {audio_path}: {e}")
# Store zeros for failed files
features_dataset[idx] = np.zeros((feature_dim, max_length_frames))
lengths_dataset[idx] = 0
paths_dataset[idx] = str(audio_path)
print(f"Features saved to {output_path}")
# Usage
batch_extractor = BatchFeatureExtractor(extractor, n_workers=8)
batch_extractor.extract_dataset(
audio_dir='./data/train/',
output_path='./features/train_features.h5',
max_length_frames=1000
)
# Load for training
with h5py.File('./features/train_features.h5', 'r') as hf:
features = hf['features'][:] # (N, feature_dim, max_length)
lengths = hf['lengths'][:] # (N,)
paths = hf['paths'][:] # (N,)
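For training, it can help to wrap the HDF5 file in a Dataset rather than loading everything into RAM. A minimal sketch (the class name H5FeatureDataset is ours), opening the file lazily so each DataLoader worker gets its own handle:

import h5py
import torch
from torch.utils.data import Dataset

class H5FeatureDataset(Dataset):
    """Serve precomputed features from the HDF5 file written above."""
    def __init__(self, h5_path: str):
        self.h5_path = h5_path
        self._file = None  # opened lazily, once per DataLoader worker
        with h5py.File(h5_path, 'r') as hf:
            self._len = hf['features'].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, 'r')
        feats = torch.from_numpy(self._file['features'][idx])  # (feature_dim, max_length)
        length = int(self._file['lengths'][idx])
        return feats, length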
Real-World Systems
Kaldi: Traditional ASR Feature Pipeline
Kaldi is the industry standard for traditional ASR.
Feature extraction:
# Kaldi feature extraction (MFCC + pitch)
compute-mfcc-feats --config=conf/mfcc.conf scp:wav.scp ark:mfcc.ark
compute-and-process-kaldi-pitch-feats scp:wav.scp ark:pitch.ark
# Combine features
paste-feats ark:mfcc.ark ark:pitch.ark ark:features.ark
Configuration (mfcc.conf):
--use-energy=true
--num-mel-bins=40
--num-ceps=40
--low-freq=20
--high-freq=8000
--sample-frequency=16000
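If you need Kaldi-compatible features without leaving Python, torchaudio ships a compliance module that mirrors these options. A sketch mapping the config above (parameter choices assumed from that config):

import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sr = torchaudio.load('speech.wav')  # (channels, time)
mfcc = kaldi.mfcc(
    waveform,
    num_ceps=40,
    num_mel_bins=40,
    low_freq=20.0,
    high_freq=8000.0,
    sample_frequency=16000.0,
    use_energy=True,
)
print(mfcc.shape)  # (num_frames, num_ceps)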
PyTorch: Modern Deep Learning Pipeline
import torchaudio
import torch
class TorchAudioExtractor:
"""
Feature extraction using torchaudio
Benefits:
- GPU acceleration
- Differentiable (can backprop through features)
- Integrated with PyTorch training
"""
def __init__(self, sr=16000, n_mfcc=40, n_mels=80):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_mels = n_mels
# Create transforms (can move to GPU)
self.mfcc_transform = torchaudio.transforms.MFCC(
sample_rate=sr,
n_mfcc=n_mfcc,
melkwargs={'n_mels': 40, 'n_fft': 512, 'hop_length': 160}
)
self.mel_transform = torchaudio.transforms.MelSpectrogram(
sample_rate=sr,
n_fft=512,
hop_length=160,
n_mels=n_mels
)
# Amplitude → dB conversion
self.db_transform = torchaudio.transforms.AmplitudeToDB()
def to(self, device):
"""
Move transforms to a device (CPU/GPU) and return self.
"""
self.mfcc_transform = self.mfcc_transform.to(device)
self.mel_transform = self.mel_transform.to(device)
self.db_transform = self.db_transform.to(device)
return self
def extract(self, audio: torch.Tensor) -> Dict[str, torch.Tensor]:
"""
Extract features (GPU-accelerated if audio on GPU)
Args:
audio: (batch, time) or (time,)
Returns:
Dictionary of features
"""
if audio.ndim == 1:
audio = audio.unsqueeze(0) # Add batch dimension
# Extract
mfccs = self.mfcc_transform(audio) # (batch, n_mfcc, time)
mel = self.mel_transform(audio) # (batch, n_mels, time)
mel_db = self.db_transform(mel)
return {
'mfcc': mfccs,
'mel': mel_db
}
# Usage with GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
extractor = TorchAudioExtractor().to(device)
# Load audio
audio, sr = torchaudio.load('speech.wav')
audio = audio.to(device)
# Extract (on GPU)
features = extractor.extract(audio)
Google: Production ASR Feature Extraction
Stack:
- Input: 16kHz audio
- Features: 80-bin log mel-filterbank
- Augmentation: SpecAugment
- Normalization: Per-utterance mean/variance normalization
- Model: Transformer encoder-decoder
Key optimizations:
- Precompute features for training data
- On-the-fly extraction for inference
- GPU-accelerated extraction for real-time systems
Choosing the Right Features
Different tasks need different features.
Feature Selection Guide
| Task | Best Features | Why |
|---|---|---|
| ASR (traditional) | MFCCs + deltas | Captures phonetic content |
| ASR (deep learning) | Mel-spectrograms | Works well with CNNs |
| Speaker Recognition | MFCCs + pitch + prosody | Speaker identity in pitch/prosody |
| Emotion Recognition | Prosodic + spectral | Emotion in prosody + voice quality |
| Keyword Spotting | Mel-spectrograms | Simple, fast with CNNs |
| Speech Enhancement | STFT magnitude + phase | Need phase for reconstruction |
| Voice Activity Detection | Energy + ZCR | Simple features sufficient (sketch below) |
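For instance, the VAD row above needs nothing beyond the time-domain features from earlier. A minimal sketch with illustrative thresholds (tune them per microphone and environment):

import numpy as np
import librosa

def simple_vad(audio, frame_length=400, hop_length=160,
               energy_thresh=0.02, zcr_thresh=0.25):
    """Frame-level speech/non-speech decisions from energy + ZCR."""
    energy = librosa.feature.rms(y=audio, frame_length=frame_length,
                                 hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(audio, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    # Speech frames: sufficient energy, and not noise-like (low ZCR)
    return (energy > energy_thresh) & (zcr < zcr_thresh)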
Combining Features
class MultiFeatureExtractor:
"""
Combine multiple feature types
Different features capture different aspects
"""
def __init__(self):
self.mfcc_ext = MFCCExtractor()
self.pitch_ext = PitchExtractor()
self.prosody_ext = ProsodicFeatureExtractor()
def extract_combined(self, audio):
"""
Extract and combine multiple feature types
"""
# MFCCs (40, time)
mfccs = self.mfcc_ext.extract(audio)
# Pitch (time,)
pitch, voiced = self.pitch_ext.extract_f0(audio)
pitch = pitch.reshape(1, -1) # (1, time)
# Energy (1, time)
energy = librosa.feature.rms(y=audio, hop_length=160)
# Align all features to same time dimension
min_time = min(mfccs.shape[1], pitch.shape[1], energy.shape[1])
mfccs = mfccs[:, :min_time]
pitch = pitch[:, :min_time]
energy = energy[:, :min_time]
# Stack
combined = np.vstack([mfccs, pitch, energy]) # (42, time)
return combined
Key Takeaways
✅ MFCCs are standard for speech recognition - compact and robust
✅ Mel-spectrograms work better with deep learning (CNNs, Transformers)
✅ Delta features capture temporal dynamics - critical for accuracy
✅ Normalize features for stable training (mean=0, std=1)
✅ Handle variable length with padding, pooling, or attention masks
✅ Cache features for repeated use - major speedup in training
✅ Streaming extraction possible with circular buffers
✅ Parallel processing speeds up batch feature extraction
✅ SpecAugment improves robustness through feature-space augmentation
✅ Monitor feature quality to detect pipeline issues early
✅ Version features for reproducibility and A/B testing
✅ Choose features based on task - no one-size-fits-all
FAQ
Q: What is the difference between MFCCs and mel-spectrograms?
A: MFCCs apply a Discrete Cosine Transform on top of the mel-spectrogram, compressing it into 13-40 coefficients that capture the spectral envelope. Mel-spectrograms preserve full spectral detail with 40-128 bins per frame. MFCCs work better with small traditional models, while mel-spectrograms perform better with CNNs and Transformers used in speech classification and streaming ASR.

Q: Which audio features should I use for my speech ML task?
A: Use MFCCs with deltas for traditional ASR and small models. Use mel-spectrograms for deep learning with CNNs or Transformers. Use pitch and prosodic features for speaker recognition and emotion detection. Use STFT magnitude with phase for speech enhancement and reconstruction tasks. Use energy and zero-crossing rate for VAD.

Q: How does SpecAugment improve speech model training?
A: SpecAugment randomly masks time regions and frequency bands in spectrograms during training, forcing the model to be robust to missing information. This simple technique yields a 10-20% relative WER reduction without requiring additional training data. It was proposed by Google in 2019 and is now standard practice in speech model training.
Originally published at: arunbaby.com/speech-tech/0003-audio-feature-extraction