Audio Feature Extraction for Speech ML
How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.
TL;DR
Audio feature extraction transforms high-dimensional raw waveforms (16,000 samples/sec) into compact ML-ready representations. MFCCs (40 coefficients per frame) capture spectral envelopes for traditional ASR, mel-spectrograms (80 bins) preserve full spectral detail for CNNs, and wav2vec2 embeddings (768 dimensions) provide 15-25% accuracy gains via transfer learning. Delta features capture temporal dynamics, SpecAugment provides 10-20% WER reduction through feature-space augmentation, and streaming circular buffers enable real-time extraction under 10ms. These features feed directly into streaming ASR, speech classification, VAD, and speaker recognition systems.

Introduction
Raw audio waveforms are high-dimensional, noisy, and difficult for ML models to learn from directly. Feature extraction transforms audio into compact, informative representations that:
- Capture important speech characteristics
- Reduce dimensionality (16kHz audio = 16,000 samples/sec → ~100 frames/sec of ~40 coefficients)
- Provide invariance to irrelevant variations (volume, recording device)
- Enable efficient model training
Why it matters:
- Improves accuracy: Good features → better models
- Reduces compute: Lower dimensionality = faster training/inference
- Enables transfer learning: Pre-extracted features work across tasks
- Production efficiency: Feature extraction can be cached
What you’ll learn:
- Core audio features (MFCCs, spectrograms, mel-scale)
- Time-domain vs frequency-domain features
- Production-grade extraction pipelines
- Optimization for real-time processing
- Feature engineering for speech tasks
Problem Definition
Design a feature extraction pipeline for speech ML systems.
Functional Requirements
- Feature Types
- Time-domain features (energy, zero-crossing rate)
- Frequency-domain features (spectrograms, MFCCs)
- Temporal features (deltas, delta-deltas)
- Learned features (embeddings)
- Input Handling (see the sketch after this list)
- Support multiple sample rates (8kHz, 16kHz, 48kHz)
- Handle variable-length audio
- Process both mono and stereo
- Support batch processing
- Output Format
- Fixed-size feature vectors
- Variable-length sequences
- 2D/3D tensors for neural networks
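To make the Input Handling requirements concrete, here is a minimal sketch (the helper name `standardize_input` is ours, not from any library) that downmixes stereo to mono and resamples to a target rate with librosa:

import librosa
import numpy as np

def standardize_input(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and resample so downstream extractors see one format."""
    if audio.ndim == 2:  # (channels, samples), e.g., from librosa.load(mono=False)
        audio = librosa.to_mono(audio)
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    return audio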
Non-Functional Requirements
- Performance
- Real-time: Extract features < 10ms for 1 sec audio
- Batch: Process 10K files/hour on single machine
- Memory: < 100MB RAM for streaming
- Quality
- Robust to noise
- Consistent across devices
- Reproducible (deterministic)
- Flexibility
- Configurable parameters
- Support multiple backends (librosa, torchaudio)
- Easy to extend with new features
Audio Basics
Waveform Representation
import numpy as np
import librosa
import matplotlib.pyplot as plt
# Load audio
audio, sr = librosa.load('speech.wav', sr=16000)
print(f"Sample rate: {sr} Hz")
print(f"Duration: {len(audio) / sr:.2f} seconds")
print(f"Shape: {audio.shape}")
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")
# Visualize waveform
plt.figure(figsize=(12, 4))
time = np.arange(len(audio)) / sr
plt.plot(time, audio)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio Waveform')
plt.show()
Key properties:
- Sample rate (sr): Samples per second (e.g., 16000 Hz = 16000 samples/sec)
- Duration: len(audio) / sr seconds
- Amplitude: Typically normalized to [-1, 1]
Feature 1: Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs are the most widely used features in speech recognition.
Why MFCCs?
- Mimic human hearing: Use mel scale (perceptual frequency scale; see the conversion sketch below)
- Compact: Represent spectral envelope with 13-40 coefficients
- Robust: Less sensitive to pitch variations
- Proven: Gold standard for ASR for decades
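The mel scale itself is a one-line formula. A quick sketch of the HTK-style conversion (note: librosa defaults to the slightly different Slaney variant):

import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel conversion: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000))  # ~1000 mels
print(hz_to_mel(8000))  # ~2840 mels - high frequencies get compressed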
How MFCCs Work
Audio Waveform
↓
1. Pre-emphasis (boost high frequencies)
↓
2. Frame the signal (25ms windows, 10ms hop)
↓
3. Apply window function (Hamming)
↓
4. FFT (Fast Fourier Transform)
↓
5. Mel filterbank (map to mel scale)
↓
6. Log (compress dynamic range)
↓
7. DCT (Discrete Cosine Transform)
↓
MFCCs (13-40 coefficients per frame)
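Before the librosa one-liner below, here is a step-by-step sketch of the same pipeline in NumPy/librosa. The 0.97 pre-emphasis coefficient is a common convention (an assumption here, not a universal constant), and the frame/filterbank parameters mirror the extractor class that follows:

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_scratch(audio, sr=16000, n_mfcc=13):
    # 1. Pre-emphasis: boost high frequencies (y[t] = x[t] - 0.97*x[t-1])
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # 2-4. Frame (25 ms window, 10 ms hop), Hamming window, FFT -> power spectrum
    stft = librosa.stft(emphasized, n_fft=512, hop_length=160,
                        win_length=400, window='hamming')
    power = np.abs(stft) ** 2
    # 5. Mel filterbank maps (1 + n_fft//2) linear bins down to 40 mel bins
    mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=40, fmin=20, fmax=8000)
    mel_energies = mel_fb @ power
    # 6. Log compresses the dynamic range
    log_mel = np.log(mel_energies + 1e-10)
    # 7. DCT decorrelates; keep the first n_mfcc coefficients
    return dct(log_mel, axis=0, norm='ortho')[:n_mfcc]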
Implementation
import librosa
import numpy as np
class MFCCExtractor:
"""
Extract MFCC features from audio
Standard configuration for speech recognition
"""
def __init__(
self,
sr=16000,
n_mfcc=40,
n_fft=512,
hop_length=160, # 10ms at 16kHz
n_mels=40,
fmin=20,
fmax=8000
):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs
Args:
audio: Audio waveform (1D array)
Returns:
MFCCs: (n_mfcc, time_steps)
"""
# Extract MFCCs
mfccs = librosa.feature.mfcc(
y=audio,
sr=self.sr,
n_mfcc=self.n_mfcc,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
return mfccs # Shape: (n_mfcc, time)
def extract_with_deltas(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs + deltas + delta-deltas
Deltas capture temporal dynamics
Returns:
Features: (n_mfcc * 3, time_steps)
"""
# MFCCs
mfccs = self.extract(audio)
# Delta (first derivative)
delta = librosa.feature.delta(mfccs)
# Delta-delta (second derivative)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack
features = np.vstack([mfccs, delta, delta2]) # (120, time)
return features
# Usage
extractor = MFCCExtractor()
mfccs = extractor.extract(audio)
print(f"MFCCs shape: {mfccs.shape}") # (40, time_steps)
# With deltas
features = extractor.extract_with_deltas(audio)
print(f"MFCCs+deltas shape: {features.shape}") # (120, time_steps)
Visualizing MFCCs
import matplotlib.pyplot as plt
def plot_mfccs(mfccs, sr, hop_length):
"""Visualize MFCC features"""
plt.figure(figsize=(12, 6))
# Convert frame indices to time
times = librosa.frames_to_time(
np.arange(mfccs.shape[1]),
sr=sr,
hop_length=hop_length
)
plt.imshow(
mfccs,
aspect='auto',
origin='lower',
extent=[times[0], times[-1], 0, mfccs.shape[0]],
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.xlabel('Time (s)')
plt.ylabel('MFCC Coefficient')
plt.title('MFCC Features')
plt.tight_layout()
plt.show()
plot_mfccs(mfccs, sr=16000, hop_length=160)
Feature 2: Mel-Spectrograms
Mel-spectrograms preserve more spectral detail than MFCCs.
What is a Spectrogram?
A spectrogram shows how the frequency content of a signal changes over time.
- X-axis: Time
- Y-axis: Frequency
- Color: Magnitude (energy)
Mel-Spectrogram vs MFCC
| Aspect | Mel-Spectrogram | MFCC |
|---|---|---|
| Dimensions | (n_mels, time) | (n_mfcc, time) |
| Information | Full spectrum | Spectral envelope |
| Size | 40-128 bins | 13-40 coefficients |
| Use case | CNNs, deep learning | Traditional ASR |
| Spectral detail | Higher (all mel bins kept) | Lower (DCT keeps only the envelope) |
Implementation
class MelSpectrogramExtractor:
"""
Extract log mel-spectrogram features
Popular for deep learning models (CNNs, Transformers)
"""
def __init__(
self,
sr=16000,
n_fft=512,
hop_length=160,
n_mels=80,
fmin=0,
fmax=8000
):
self.sr = sr
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log mel-spectrogram
Returns:
Log mel-spectrogram: (n_mels, time_steps)
"""
# Compute mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
# Convert to log scale (dB)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel # Shape: (n_mels, time)
def extract_normalized(self, audio: np.ndarray) -> np.ndarray:
"""
Extract and normalize to [0, 1]
Better for neural networks
"""
log_mel = self.extract(audio)
# Normalize to [0, 1]
log_mel_norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
return log_mel_norm
# Usage
mel_extractor = MelSpectrogramExtractor(n_mels=80)
mel_spec = mel_extractor.extract(audio)
print(f"Mel-spectrogram shape: {mel_spec.shape}") # (80, time_steps)
Visualizing Mel-Spectrogram
def plot_mel_spectrogram(mel_spec, sr, hop_length):
"""Visualize mel-spectrogram"""
plt.figure(figsize=(12, 6))
librosa.display.specshow(
mel_spec,
sr=sr,
hop_length=hop_length,
x_axis='time',
y_axis='mel',
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-Spectrogram')
plt.tight_layout()
plt.show()
plot_mel_spectrogram(mel_spec, sr=16000, hop_length=160)
Feature 3: Raw Spectrograms (STFT)
The Short-Time Fourier Transform (STFT) retains the full linear-frequency resolution of the FFT.
Implementation
class STFTExtractor:
"""
Extract raw STFT features
Used when you need full frequency resolution
"""
def __init__(
self,
n_fft=512,
hop_length=160,
win_length=400
):
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract magnitude spectrogram
Returns:
Spectrogram: (n_fft//2 + 1, time_steps)
"""
# Compute STFT
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
# Get magnitude
magnitude = np.abs(stft)
# Convert to dB
magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
return magnitude_db # Shape: (n_fft//2 + 1, time)
def extract_with_phase(self, audio: np.ndarray):
"""
Extract magnitude and phase
Phase information useful for reconstruction
"""
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
magnitude = np.abs(stft)
phase = np.angle(stft)
return magnitude, phase
# Usage
stft_extractor = STFTExtractor()
spectrogram = stft_extractor.extract(audio)
print(f"Spectrogram shape: {spectrogram.shape}") # (257, time_steps)
Feature 4: Time-Domain Features
Simple but effective features computed directly from waveform.
Implementation
class TimeDomainExtractor:
"""
Extract time-domain features
Fast to compute, useful for simple tasks
"""
def extract_energy(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Frame-wise energy (RMS)
Captures loudness/volume over time
"""
energy = librosa.feature.rms(
y=audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return energy
def extract_zero_crossing_rate(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Zero-crossing rate
Measures how often signal crosses zero
High ZCR → noisy/unvoiced
Low ZCR → tonal/voiced
"""
zcr = librosa.feature.zero_crossing_rate(
audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return zcr
def extract_all(self, audio: np.ndarray):
"""Extract all time-domain features"""
energy = self.extract_energy(audio)
zcr = self.extract_zero_crossing_rate(audio)
# Stack features
features = np.vstack([energy, zcr]) # (2, time)
return features
# Usage
time_extractor = TimeDomainExtractor()
time_features = time_extractor.extract_all(audio)
print(f"Time-domain features shape: {time_features.shape}") # (2, time_steps)
Feature 5: Pitch & Formants
Pitch and formants are linguistic features important for speech.
Pitch Extraction
class PitchExtractor:
"""
Extract fundamental frequency (F0)
Important for:
- Speaker recognition
- Emotion detection
- Prosody modeling
"""
def __init__(self, sr=16000, fmin=80, fmax=400):
self.sr = sr
        self.fmin = fmin  # Lower bound (~low end of male pitch)
        self.fmax = fmax  # Upper bound (~high end of female pitch)
def extract_f0(self, audio: np.ndarray, hop_length=160):
"""
Extract pitch (fundamental frequency)
Returns:
f0: Pitch values (Hz) per frame
voiced_flag: Boolean array (voiced vs unvoiced)
"""
        # Extract pitch using the pYIN algorithm.
        # (librosa.yin returns an estimate for every frame, so `f0 > 0` would
        # mark everything as voiced; pyin returns an explicit voiced flag.)
        f0, voiced_flag, voiced_prob = librosa.pyin(
            audio,
            fmin=self.fmin,
            fmax=self.fmax,
            sr=self.sr,
            hop_length=hop_length
        )
        # Unvoiced frames come back as NaN; zero them so downstream code
        # can mask with voiced_flag instead
        f0 = np.nan_to_num(f0)
        return f0, voiced_flag
def extract_pitch_features(self, audio: np.ndarray):
"""
Extract pitch statistics
Useful for speaker/emotion recognition
"""
f0, voiced = self.extract_f0(audio)
# Statistics on voiced frames
voiced_f0 = f0[voiced]
if len(voiced_f0) > 0:
features = {
'mean_pitch': np.mean(voiced_f0),
'std_pitch': np.std(voiced_f0),
'min_pitch': np.min(voiced_f0),
'max_pitch': np.max(voiced_f0),
'pitch_range': np.max(voiced_f0) - np.min(voiced_f0),
'voiced_ratio': np.sum(voiced) / len(voiced)
}
else:
features = {k: 0.0 for k in ['mean_pitch', 'std_pitch', 'min_pitch', 'max_pitch', 'pitch_range', 'voiced_ratio']}
return features
# Usage
pitch_extractor = PitchExtractor()
f0, voiced = pitch_extractor.extract_f0(audio)
print(f"Pitch shape: {f0.shape}")
pitch_stats = pitch_extractor.extract_pitch_features(audio)
print(f"Pitch statistics: {pitch_stats}")
Production Feature Pipeline
Combine all features into a unified pipeline.
Unified Feature Extractor
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class FeatureConfig:
"""Configuration for feature extraction"""
sr: int = 16000
    feature_types: Optional[List[str]] = None  # e.g., ['mfcc', 'mel', 'pitch']
# MFCC config
n_mfcc: int = 40
# Mel-spectrogram config
n_mels: int = 80
# Common config
n_fft: int = 512
hop_length: int = 160 # 10ms
# Normalization
normalize: bool = True
def __post_init__(self):
if self.feature_types is None:
self.feature_types = ['mfcc']
class AudioFeatureExtractor:
"""
Production-grade audio feature extractor
Supports multiple feature types, caching, and batch processing
"""
def __init__(self, config: FeatureConfig):
self.config = config
# Initialize extractors
self.mfcc_extractor = MFCCExtractor(
sr=config.sr,
n_mfcc=config.n_mfcc,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.mel_extractor = MelSpectrogramExtractor(
sr=config.sr,
n_mels=config.n_mels,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.pitch_extractor = PitchExtractor(sr=config.sr)
self.time_extractor = TimeDomainExtractor()
def extract(self, audio: np.ndarray) -> Dict[str, np.ndarray]:
"""
Extract features based on config
Args:
audio: Audio waveform
Returns:
Dictionary of features
"""
features = {}
if 'mfcc' in self.config.feature_types:
mfccs = self.mfcc_extractor.extract_with_deltas(audio)
if self.config.normalize:
mfccs = self._normalize(mfccs)
features['mfcc'] = mfccs
if 'mel' in self.config.feature_types:
mel = self.mel_extractor.extract(audio)
if self.config.normalize:
mel = self._normalize(mel)
features['mel'] = mel
if 'pitch' in self.config.feature_types:
f0, voiced = self.pitch_extractor.extract_f0(audio, hop_length=self.config.hop_length)
features['pitch'] = f0
features['voiced'] = voiced.astype(np.float32)
if 'time' in self.config.feature_types:
time_feats = self.time_extractor.extract_all(audio)
if self.config.normalize:
time_feats = self._normalize(time_feats)
features['time'] = time_feats
return features
def _normalize(self, features: np.ndarray) -> np.ndarray:
"""
Normalize features (mean=0, std=1) per coefficient
"""
mean = np.mean(features, axis=1, keepdims=True)
std = np.std(features, axis=1, keepdims=True) + 1e-8
normalized = (features - mean) / std
return normalized
def extract_from_file(self, audio_path: str) -> Dict[str, np.ndarray]:
"""
Extract features from audio file
"""
audio, sr = librosa.load(audio_path, sr=self.config.sr)
return self.extract(audio)
def extract_batch(self, audio_list: List[np.ndarray]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from batch of audio
"""
return [self.extract(audio) for audio in audio_list]
def save_config(self, path: str):
"""Save feature extraction config"""
with open(path, 'w') as f:
json.dump(self.config.__dict__, f, indent=2)
@staticmethod
def load_config(path: str) -> FeatureConfig:
"""Load feature extraction config"""
with open(path, 'r') as f:
config_dict = json.load(f)
return FeatureConfig(**config_dict)
# Usage
config = FeatureConfig(
feature_types=['mfcc', 'mel', 'pitch'],
n_mfcc=40,
n_mels=80,
normalize=True
)
extractor = AudioFeatureExtractor(config)
# Extract features
features = extractor.extract(audio)
print("Extracted features:", features.keys())
for name, feat in features.items():
print(f" {name}: {feat.shape}")
# Save config for reproducibility
extractor.save_config('feature_config.json')
Handling Variable-Length Audio
Audio clips vary in duration, but most models expect fixed-size inputs, so the pipeline must handle variable length explicitly.
Strategy 1: Padding/Truncation
class VariableLengthHandler:
"""
Handle variable-length audio
"""
def pad_or_truncate(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Pad or truncate features to fixed length
Args:
features: (n_features, time)
target_length: Target time dimension
Returns:
Fixed-length features: (n_features, target_length)
"""
current_length = features.shape[1]
if current_length < target_length:
# Pad with zeros
pad_width = ((0, 0), (0, target_length - current_length))
features = np.pad(features, pad_width, mode='constant')
elif current_length > target_length:
# Truncate (take first target_length frames)
features = features[:, :target_length]
return features
def create_mask(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Create attention mask for padded features
Returns:
Mask: (target_length,) - 1 for real frames, 0 for padding
"""
current_length = features.shape[1]
mask = np.zeros(target_length)
mask[:min(current_length, target_length)] = 1
return mask
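Usage (assuming the mfccs array from the extractor earlier; the 1000-frame target is arbitrary - 10 seconds at a 10 ms hop):

handler = VariableLengthHandler()

fixed = handler.pad_or_truncate(mfccs, target_length=1000)  # (40, 1000)
mask = handler.create_mask(mfccs, target_length=1000)       # (1000,)
print(f"Real frames: {int(mask.sum())}")  # frames that aren't padding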
Strategy 2: Temporal Pooling
class TemporalPooler:
"""
Pool variable-length features to fixed size
"""
def mean_pool(self, features: np.ndarray) -> np.ndarray:
"""
Average pool over time
Args:
features: (n_features, time)
Returns:
Pooled: (n_features,)
"""
return np.mean(features, axis=1)
def max_pool(self, features: np.ndarray) -> np.ndarray:
"""Max pool over time"""
return np.max(features, axis=1)
def stats_pool(self, features: np.ndarray) -> np.ndarray:
"""
Statistical pooling: mean + std
Returns:
Pooled: (n_features * 2,)
"""
mean = np.mean(features, axis=1)
std = np.std(features, axis=1)
return np.concatenate([mean, std])
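Usage - pooling collapses any-length features into one fixed vector per clip. Statistical pooling in particular is the standard front-end in x-vector-style speaker embedding systems:

pooler = TemporalPooler()

mean_vec = pooler.mean_pool(mfccs)    # (40,)
stats_vec = pooler.stats_pool(mfccs)  # (80,) - mean and std per coefficient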
Real-Time Feature Extraction
Streaming applications need incremental feature extraction rather than whole-file processing.
Streaming Feature Extractor
from collections import deque
from typing import Optional
class StreamingFeatureExtractor:
"""
Extract features from streaming audio
Use case: Real-time ASR, voice assistants
"""
def __init__(
self,
sr=16000,
frame_length_ms=25,
hop_length_ms=10,
buffer_duration_ms=500
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.buffer_length = int(sr * buffer_duration_ms / 1000)
# Circular buffer for audio
self.buffer = deque(maxlen=self.buffer_length)
# Feature extractor
self.extractor = MFCCExtractor(
sr=sr,
hop_length=self.hop_length
)
def add_audio_chunk(self, audio_chunk: np.ndarray):
"""
Add new audio chunk to buffer
Args:
audio_chunk: New audio samples
"""
self.buffer.extend(audio_chunk)
def extract_latest(self) -> Optional[np.ndarray]:
"""
Extract features from current buffer
Returns:
Features or None if buffer too small
"""
if len(self.buffer) < self.frame_length:
return None
# Convert buffer to array
audio = np.array(self.buffer)
# Extract features
features = self.extractor.extract(audio)
return features
def reset(self):
"""Clear buffer"""
self.buffer.clear()
# Usage
streaming_extractor = StreamingFeatureExtractor()
# Simulate streaming (100ms chunks)
chunk_size = 1600 # 100ms at 16kHz
for i in range(0, len(audio), chunk_size):
chunk = audio[i:i+chunk_size]
# Add to buffer
streaming_extractor.add_audio_chunk(chunk)
# Extract features
features = streaming_extractor.extract_latest()
if features is not None:
print(f"Chunk {i//chunk_size}: features shape = {features.shape}")
# Process features (send to model, etc.)
Performance Optimization
1. Caching Features
import os
import pickle
import hashlib
class CachedFeatureExtractor:
"""
Cache extracted features to disk
Avoid re-extracting for same audio
"""
def __init__(self, extractor: AudioFeatureExtractor, cache_dir='./feature_cache'):
self.extractor = extractor
self.cache_dir = cache_dir
os.makedirs(cache_dir, exist_ok=True)
def _get_cache_path(self, audio_path: str) -> str:
"""Generate cache file path based on audio path hash"""
path_hash = hashlib.md5(audio_path.encode()).hexdigest()
return os.path.join(self.cache_dir, f"{path_hash}.pkl")
def extract_from_file(self, audio_path: str, use_cache=True) -> Dict[str, np.ndarray]:
"""
Extract features with caching
"""
cache_path = self._get_cache_path(audio_path)
# Check cache
if use_cache and os.path.exists(cache_path):
with open(cache_path, 'rb') as f:
features = pickle.load(f)
return features
# Extract features
features = self.extractor.extract_from_file(audio_path)
# Save to cache
with open(cache_path, 'wb') as f:
pickle.dump(features, f)
return features
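Usage note: because the cache key hashes the file path rather than the file contents, re-recording or editing an audio file will not invalidate its cached features - hash the file bytes instead if that matters for your pipeline.

cached = CachedFeatureExtractor(extractor)
features = cached.extract_from_file('speech.wav')  # extracts and writes cache
features = cached.extract_from_file('speech.wav')  # second call reads from disk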
2. Parallel Processing
from multiprocessing import Pool
from functools import partial
class ParallelFeatureExtractor:
"""
Extract features from multiple files in parallel
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=4):
self.extractor = extractor
self.n_workers = n_workers
def extract_from_files(self, audio_paths: List[str]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from multiple files in parallel
"""
with Pool(self.n_workers) as pool:
features_list = pool.map(
self.extractor.extract_from_file,
audio_paths
)
return features_list
# Usage
parallel_extractor = ParallelFeatureExtractor(extractor, n_workers=8)
audio_files = ['file1.wav', 'file2.wav', ...] # 1000s of files
features = parallel_extractor.extract_from_files(audio_files)
Advanced Feature Types
1. Learned Features (Embeddings)
Instead of hand-crafted features, learn representations from data.
import torch
import torch.nn as nn
class AudioEmbeddingExtractor(nn.Module):
"""
Extract learned audio embeddings
Use pre-trained models (wav2vec, HuBERT) as feature extractors
"""
def __init__(self, model_name='facebook/wav2vec2-base'):
super().__init__()
from transformers import Wav2Vec2Model
# Load pre-trained model
self.model = Wav2Vec2Model.from_pretrained(model_name)
self.model.eval() # Freeze for feature extraction
def extract(self, audio: np.ndarray, sr=16000) -> np.ndarray:
"""
Extract contextualized embeddings
Returns:
Embeddings: (time_steps, hidden_dim)
typically (time, 768) for base model
"""
# Convert to tensor
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
# Extract features
with torch.no_grad():
outputs = self.model(audio_tensor)
embeddings = outputs.last_hidden_state[0] # (time, 768)
return embeddings.numpy()
# Usage - MUCH more powerful than MFCCs for transfer learning
embedding_extractor = AudioEmbeddingExtractor()
embeddings = embedding_extractor.extract(audio)
print(f"Embeddings shape: {embeddings.shape}") # (time, 768)
Comparison:
| Feature Type | Dimension | Training Required | Transfer Learning | Accuracy |
|---|---|---|---|---|
| MFCCs | 40-120 | No | Poor | Baseline |
| Mel-spectrogram | 80-128 | No | Good | +5-10% |
| Wav2Vec embeddings | 768 | Pre-trained (self-supervised) | Excellent | +15-25% |
2. Filter Bank Features (FBank)
Alternative to MFCCs - skip the DCT step.
class FilterbankExtractor:
"""
Extract log mel-filterbank features
Similar to mel-spectrograms, popular in modern ASR
"""
def __init__(self, sr=16000, n_mels=80, n_fft=512, hop_length=160):
self.sr = sr
self.n_mels = n_mels
self.n_fft = n_fft
self.hop_length = hop_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log filter bank energies
Returns:
FBank: (n_mels, time_steps)
"""
# Mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels
)
# Log
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel
# FBank vs MFCC:
# - FBank: Keep all mel bins (80-128)
# - MFCC: Compress to 13-40 via DCT
#
# FBank often works better with neural networks
3. Prosodic Features
Capture rhythm, stress, and intonation.
class ProsodicFeatureExtractor:
"""
Extract prosodic features for emotion, speaker ID, etc.
"""
def extract_intensity_contour(self, audio, sr=16000, hop_length=160):
"""
Intensity (loudness) over time
"""
intensity = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
# Convert to dB
intensity_db = librosa.amplitude_to_db(intensity, ref=np.max)
return intensity_db
def extract_speaking_rate(self, audio, sr=16000):
"""
Estimate speaking rate (syllables per second)
Approximation: count peaks in energy envelope
"""
# Energy envelope
energy = librosa.feature.rms(y=audio, hop_length=160)[0]
# Find peaks (local maxima)
from scipy.signal import find_peaks
peaks, _ = find_peaks(energy, distance=10, prominence=0.1)
# Speaking rate
duration = len(audio) / sr
syllables_per_sec = len(peaks) / duration
return syllables_per_sec
def extract_all_prosodic(self, audio, sr=16000):
"""Extract all prosodic features"""
# Pitch
pitch_extractor = PitchExtractor(sr=sr)
pitch_stats = pitch_extractor.extract_pitch_features(audio)
# Intensity
intensity = self.extract_intensity_contour(audio, sr)
# Speaking rate
speaking_rate = self.extract_speaking_rate(audio, sr)
return {
**pitch_stats,
'mean_intensity': np.mean(intensity),
'std_intensity': np.std(intensity),
'speaking_rate': speaking_rate
}
Feature Quality & Validation
Ensure extracted features are high quality.
Feature Quality Metrics
class FeatureQualityChecker:
"""
Validate quality of extracted features
"""
def check_for_nans(self, features: Dict[str, np.ndarray]) -> bool:
"""Check for NaN/Inf values"""
for name, feat in features.items():
if np.isnan(feat).any() or np.isinf(feat).any():
print(f"⚠️ {name} contains NaN/Inf")
return False
return True
def check_dynamic_range(self, features: Dict[str, np.ndarray]) -> Dict[str, float]:
"""
Check dynamic range of features
Low dynamic range → feature not informative
"""
ranges = {}
for name, feat in features.items():
feat_range = feat.max() - feat.min()
ranges[name] = feat_range
if feat_range < 1e-6:
print(f"⚠️ {name} has very low dynamic range: {feat_range}")
return ranges
def check_feature_statistics(self, features_batch: List[np.ndarray]):
"""
Check statistics across batch
Ensure features are properly normalized
"""
# Stack all features
all_features = np.concatenate(features_batch, axis=1) # (n_features, total_time)
# Per-feature statistics
mean_per_feature = np.mean(all_features, axis=1)
std_per_feature = np.std(all_features, axis=1)
print("Feature Statistics:")
print(f" Mean range: [{mean_per_feature.min():.3f}, {mean_per_feature.max():.3f}]")
print(f" Std range: [{std_per_feature.min():.3f}, {std_per_feature.max():.3f}]")
# Check if normalized
if np.abs(mean_per_feature).max() > 0.1:
print("⚠️ Features not centered (mean far from 0)")
if np.abs(std_per_feature - 1.0).max() > 0.2:
print("⚠️ Features not standardized (std far from 1)")
Connection to Data Preprocessing Pipeline
Feature extraction for speech is analogous to data preprocessing for ML systems.
Parallel Concepts
| Speech Feature Extraction | ML Data Preprocessing |
|---|---|
| Handle missing audio | Handle missing values |
| Normalize features (mean=0, std=1) | Normalize numerical features |
| Pad/truncate variable length | Handle variable-length sequences |
| Validate audio quality | Schema validation |
| Cache extracted features | Cache preprocessed data |
| Batch processing | Distributed data processing |
Unified Preprocessing Framework
class UnifiedPreprocessor:
"""
Combined preprocessing for multimodal ML
Example: Speech + text + metadata
"""
def __init__(self):
# Audio features
self.audio_extractor = AudioFeatureExtractor(
FeatureConfig(feature_types=['mfcc', 'mel'])
)
# Text features (from transcripts)
from sklearn.feature_extraction.text import TfidfVectorizer
self.text_vectorizer = TfidfVectorizer(max_features=1000)
# Numerical features
from sklearn.preprocessing import StandardScaler
self.numerical_scaler = StandardScaler()
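        # NOTE (assumption): both transform() calls in preprocess_sample()
        # require the vectorizer and scaler to be fitted on training data first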
def preprocess_sample(self, audio, text, metadata):
"""
Preprocess multimodal sample
Args:
audio: Audio waveform
text: Transcript or description
metadata: User/item metadata (dict)
Returns:
Combined feature vector
"""
# Extract audio features
audio_features = self.audio_extractor.extract(audio)
        audio_pooled = np.mean(audio_features['mfcc'], axis=1)  # (120,) MFCCs+deltas pooled over time
# Extract text features
text_features = self.text_vectorizer.transform([text]).toarray()[0] # (1000,)
# Process metadata
        metadata_array = np.array([
            metadata['user_age'],
            metadata['user_gender'],   # assumed already numerically encoded
            metadata['device_type']    # assumed already numerically encoded
        ])
        metadata_scaled = self.numerical_scaler.transform([metadata_array])[0]
        # Concatenate all features
        combined = np.concatenate([
            audio_pooled,      # (120,)
            text_features,     # (1000,)
            metadata_scaled    # (3,)
        ])  # Total: (1123,)
return combined
Production Best Practices
1. Feature Versioning
Track feature extraction versions for reproducibility.
from datetime import datetime

class VersionedFeatureExtractor:
"""
Version feature extraction logic
Critical for:
- A/B testing different features
- Rollback if new features hurt performance
- Reproducibility
"""
VERSION = "1.2.0"
def __init__(self, config: FeatureConfig):
self.config = config
self.extractor = AudioFeatureExtractor(config)
def extract_with_metadata(self, audio_path: str):
"""
Extract features with version metadata
"""
features = self.extractor.extract_from_file(audio_path)
metadata = {
'version': self.VERSION,
'config': self.config.__dict__,
'timestamp': datetime.now().isoformat(),
'audio_path': audio_path
}
return {
'features': features,
'metadata': metadata
}
def save_features(self, features, output_path):
"""Save features with version info"""
np.savez_compressed(
output_path,
**features['features'],
metadata=json.dumps(features['metadata'])
)
2. Error Handling
Robust feature extraction handles failures gracefully.
import logging

logger = logging.getLogger(__name__)

class RobustFeatureExtractor:
"""
Feature extractor with error handling
"""
def __init__(self, extractor: AudioFeatureExtractor):
self.extractor = extractor
def extract_safe(self, audio_path: str) -> Optional[Dict]:
"""
Extract features with error handling
"""
try:
# Load audio
audio, sr = librosa.load(audio_path, sr=self.extractor.config.sr)
# Validate
if len(audio) == 0:
logger.warning(f"Empty audio: {audio_path}")
return None
if len(audio) < self.extractor.config.sr * 0.1: # < 100ms
logger.warning(f"Audio too short: {audio_path}")
return None
# Extract
features = self.extractor.extract(audio)
# Quality check
quality_checker = FeatureQualityChecker()
if not quality_checker.check_for_nans(features):
logger.error(f"Feature extraction failed (NaN): {audio_path}")
return None
return features
except Exception as e:
logger.error(f"Feature extraction error for {audio_path}: {e}")
return None
def extract_batch_robust(self, audio_paths: List[str]) -> List[Dict]:
"""
Extract from batch, skipping failures
"""
results = []
failures = []
for path in audio_paths:
features = self.extract_safe(path)
if features is not None:
results.append({'path': path, 'features': features})
else:
failures.append(path)
success_rate = len(results) / len(audio_paths)
logger.info(f"Feature extraction: {len(results)}/{len(audio_paths)} succeeded ({success_rate:.1%})")
if failures:
logger.warning(f"Failed files: {failures[:10]}") # Log first 10
return results
3. Monitoring Feature Quality
Track feature statistics over time to detect issues.
class FeatureMonitor:
"""
Monitor feature quality in production
"""
def __init__(self, expected_stats: Dict[str, Dict]):
"""
Args:
expected_stats: Expected statistics per feature type
{
'mfcc': {'mean_range': [-5, 5], 'std_range': [0.5, 2.0]},
'mel': {'mean_range': [-80, 0], 'std_range': [10, 30]}
}
"""
self.expected_stats = expected_stats
def validate_features(self, features: Dict[str, np.ndarray]) -> List[str]:
"""
Validate extracted features against expected statistics
Returns:
List of warnings
"""
warnings = []
for feat_name, feat_values in features.items():
if feat_name not in self.expected_stats:
continue
expected = self.expected_stats[feat_name]
# Check mean
actual_mean = np.mean(feat_values)
expected_mean_range = expected['mean_range']
if not (expected_mean_range[0] <= actual_mean <= expected_mean_range[1]):
warnings.append(
f"{feat_name}: mean {actual_mean:.2f} outside expected range {expected_mean_range}"
)
# Check std
actual_std = np.std(feat_values)
expected_std_range = expected['std_range']
if not (expected_std_range[0] <= actual_std <= expected_std_range[1]):
warnings.append(
f"{feat_name}: std {actual_std:.2f} outside expected range {expected_std_range}"
)
return warnings
def compute_statistics(self, features_batch: List[Dict[str, np.ndarray]]):
"""
Compute statistics across batch
Use to establish baseline expected_stats
"""
stats = {}
# Get feature names from first sample
feature_names = features_batch[0].keys()
for feat_name in feature_names:
# Collect all values
all_values = np.concatenate([
f[feat_name].flatten() for f in features_batch
])
stats[feat_name] = {
'mean': np.mean(all_values),
'std': np.std(all_values),
'min': np.min(all_values),
'max': np.max(all_values),
'percentiles': {
'25': np.percentile(all_values, 25),
'50': np.percentile(all_values, 50),
'75': np.percentile(all_values, 75),
'95': np.percentile(all_values, 95)
}
}
return stats
Data Augmentation in Feature Space
Augment features directly for training robustness.
SpecAugment
class SpecAugment:
"""
SpecAugment: Data augmentation on spectrograms
Proposed in "SpecAugment: A Simple Data Augmentation Method for ASR" (Google, 2019)
    Yields roughly 10-20% relative WER reduction on many benchmarks
"""
def __init__(
self,
time_mask_param=70,
freq_mask_param=15,
num_time_masks=2,
num_freq_masks=2
):
self.time_mask_param = time_mask_param
self.freq_mask_param = freq_mask_param
self.num_time_masks = num_time_masks
self.num_freq_masks = num_freq_masks
def time_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random time region
Sets random time frames to zero
"""
spec = spec.copy()
time_length = spec.shape[1]
for _ in range(self.num_time_masks):
t = np.random.randint(0, min(self.time_mask_param, time_length))
t0 = np.random.randint(0, time_length - t)
spec[:, t0:t0+t] = 0
return spec
def freq_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random frequency region
Sets random frequency bins to zero
"""
spec = spec.copy()
freq_length = spec.shape[0]
for _ in range(self.num_freq_masks):
f = np.random.randint(0, min(self.freq_mask_param, freq_length))
f0 = np.random.randint(0, freq_length - f)
spec[f0:f0+f, :] = 0
return spec
def augment(self, spec: np.ndarray) -> np.ndarray:
"""Apply both time and freq masking"""
spec = self.time_mask(spec)
spec = self.freq_mask(spec)
return spec
# Usage during training
augmenter = SpecAugment()
for audio, label in train_loader:
# Extract features
mel_spec = mel_extractor.extract(audio)
# Augment
mel_spec_aug = augmenter.augment(mel_spec)
# Train model
train_model(mel_spec_aug, label)
Batch Feature Extraction for Training
Extract features for entire dataset efficiently.
Batch Extraction Pipeline
import os
import logging
from pathlib import Path
from tqdm import tqdm
import h5py

logger = logging.getLogger(__name__)
class BatchFeatureExtractor:
"""
Extract features for large audio datasets
Use case: Prepare training data
- Extract once, train many times
- Save features to disk (HDF5 format)
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=8):
self.extractor = extractor
        self.n_workers = n_workers  # reserved for a parallel variant; the loop below is sequential
def extract_dataset(
self,
audio_dir: str,
output_path: str,
max_length_frames: int = 1000
):
"""
Extract features for all audio files in directory
Args:
audio_dir: Directory containing .wav files
output_path: HDF5 file to save features
max_length_frames: Pad/truncate to this length
"""
# Find all audio files
audio_files = list(Path(audio_dir).rglob('*.wav'))
print(f"Found {len(audio_files)} audio files")
# Create HDF5 file
with h5py.File(output_path, 'w') as hf:
            # Pre-allocate datasets (this sketch stores just the MFCC+delta stack)
            feature_dim = self.extractor.config.n_mfcc * 3  # MFCCs + deltas + delta-deltas
features_dataset = hf.create_dataset(
'features',
shape=(len(audio_files), feature_dim, max_length_frames),
dtype='float32'
)
lengths_dataset = hf.create_dataset(
'lengths',
shape=(len(audio_files),),
dtype='int32'
)
# Store file paths
paths_dataset = hf.create_dataset(
'paths',
shape=(len(audio_files),),
dtype=h5py.string_dtype()
)
# Extract features
for idx, audio_path in enumerate(tqdm(audio_files)):
try:
# Load audio
audio, sr = librosa.load(str(audio_path), sr=self.extractor.config.sr)
# Extract features
features = self.extractor.extract(audio)
# Get MFCCs with deltas
mfcc_deltas = features['mfcc'] # (120, time)
# Pad or truncate
handler = VariableLengthHandler()
mfcc_fixed = handler.pad_or_truncate(mfcc_deltas, max_length_frames)
# Store
features_dataset[idx] = mfcc_fixed
lengths_dataset[idx] = min(mfcc_deltas.shape[1], max_length_frames)
paths_dataset[idx] = str(audio_path)
except Exception as e:
logger.error(f"Failed to process {audio_path}: {e}")
# Store zeros for failed files
features_dataset[idx] = np.zeros((feature_dim, max_length_frames))
lengths_dataset[idx] = 0
paths_dataset[idx] = str(audio_path)
print(f"Features saved to {output_path}")
# Usage
batch_extractor = BatchFeatureExtractor(extractor, n_workers=8)
batch_extractor.extract_dataset(
audio_dir='./data/train/',
output_path='./features/train_features.h5',
max_length_frames=1000
)
# Load for training
with h5py.File('./features/train_features.h5', 'r') as hf:
features = hf['features'][:] # (N, feature_dim, max_length)
lengths = hf['lengths'][:] # (N,)
paths = hf['paths'][:] # (N,)
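For training, it can help to wrap the HDF5 file in a Dataset rather than loading everything into RAM. A minimal sketch (the class name H5FeatureDataset is ours), opening the file lazily so each DataLoader worker gets its own handle:

import h5py
import torch
from torch.utils.data import Dataset

class H5FeatureDataset(Dataset):
    """Serve precomputed features from the HDF5 file written above."""
    def __init__(self, h5_path: str):
        self.h5_path = h5_path
        self._file = None  # opened lazily, once per DataLoader worker
        with h5py.File(h5_path, 'r') as hf:
            self._len = hf['features'].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, 'r')
        feats = torch.from_numpy(self._file['features'][idx])  # (feature_dim, max_length)
        length = int(self._file['lengths'][idx])
        return feats, length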
Real-World Systems
Kaldi: Traditional ASR Feature Pipeline
Kaldi is the industry standard for traditional ASR.
Feature extraction:
# Kaldi feature extraction (MFCC + pitch)
compute-mfcc-feats --config=conf/mfcc.conf scp:wav.scp ark:mfcc.ark
compute-and-process-kaldi-pitch-feats scp:wav.scp ark:pitch.ark
# Combine features
paste-feats ark:mfcc.ark ark:pitch.ark ark:features.ark
Configuration (mfcc.conf):
--use-energy=true
--num-mel-bins=40
--num-ceps=40
--low-freq=20
--high-freq=8000
--sample-frequency=16000
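If you need Kaldi-compatible features without leaving Python, torchaudio ships a compliance module that mirrors these options. A sketch mapping the config above (parameter choices assumed from that config):

import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sr = torchaudio.load('speech.wav')  # (channels, time)
mfcc = kaldi.mfcc(
    waveform,
    num_ceps=40,
    num_mel_bins=40,
    low_freq=20.0,
    high_freq=8000.0,
    sample_frequency=16000.0,
    use_energy=True,
)
print(mfcc.shape)  # (num_frames, num_ceps)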
PyTorch: Modern Deep Learning Pipeline
import torchaudio
import torch
class TorchAudioExtractor:
"""
Feature extraction using torchaudio
Benefits:
- GPU acceleration
- Differentiable (can backprop through features)
- Integrated with PyTorch training
"""
def __init__(self, sr=16000, n_mfcc=40, n_mels=80):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_mels = n_mels
# Create transforms (can move to GPU)
self.mfcc_transform = torchaudio.transforms.MFCC(
sample_rate=sr,
n_mfcc=n_mfcc,
melkwargs={'n_mels': 40, 'n_fft': 512, 'hop_length': 160}
)
self.mel_transform = torchaudio.transforms.MelSpectrogram(
sample_rate=sr,
n_fft=512,
hop_length=160,
n_mels=n_mels
)
# Amplitude → dB conversion
self.db_transform = torchaudio.transforms.AmplitudeToDB()
def to(self, device):
"""
Move transforms to a device (CPU/GPU) and return self.
"""
self.mfcc_transform = self.mfcc_transform.to(device)
self.mel_transform = self.mel_transform.to(device)
self.db_transform = self.db_transform.to(device)
return self
def extract(self, audio: torch.Tensor) -> Dict[str, torch.Tensor]:
"""
Extract features (GPU-accelerated if audio on GPU)
Args:
audio: (batch, time) or (time,)
Returns:
Dictionary of features
"""
if audio.ndim == 1:
audio = audio.unsqueeze(0) # Add batch dimension
# Extract
mfccs = self.mfcc_transform(audio) # (batch, n_mfcc, time)
mel = self.mel_transform(audio) # (batch, n_mels, time)
mel_db = self.db_transform(mel)
return {
'mfcc': mfccs,
'mel': mel_db
}
# Usage with GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
extractor = TorchAudioExtractor().to(device)
# Load audio
audio, sr = torchaudio.load('speech.wav')
audio = audio.to(device)
# Extract (on GPU)
features = extractor.extract(audio)
Google: Production ASR Feature Extraction
Stack:
- Input: 16kHz audio
- Features: 80-bin log mel-filterbank
- Augmentation: SpecAugment
- Normalization: Per-utterance mean/variance normalization
- Model: Transformer encoder-decoder
Key optimizations:
- Precompute features for training data
- On-the-fly extraction for inference
- GPU-accelerated extraction for real-time systems
Choosing the Right Features
Different tasks need different features.
Feature Selection Guide
| Task | Best Features | Why |
|---|---|---|
| ASR (traditional) | MFCCs + deltas | Captures phonetic content |
| ASR (deep learning) | Mel-spectrograms | Works well with CNNs |
| Speaker Recognition | MFCCs + pitch + prosody | Speaker identity in pitch/prosody |
| Emotion Recognition | Prosodic + spectral | Emotion in prosody + voice quality |
| Keyword Spotting | Mel-spectrograms | Simple, fast with CNNs |
| Speech Enhancement | STFT magnitude + phase | Need phase for reconstruction |
| Voice Activity Detection | Energy + ZCR | Simple features sufficient (sketch below) |
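For instance, the VAD row above needs nothing beyond the time-domain features from earlier. A minimal sketch with illustrative thresholds (tune them per microphone and environment):

import numpy as np
import librosa

def simple_vad(audio, frame_length=400, hop_length=160,
               energy_thresh=0.02, zcr_thresh=0.25):
    """Frame-level speech/non-speech decisions from energy + ZCR."""
    energy = librosa.feature.rms(y=audio, frame_length=frame_length,
                                 hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(audio, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    # Speech frames: sufficient energy, and not noise-like (low ZCR)
    return (energy > energy_thresh) & (zcr < zcr_thresh)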
Combining Features
class MultiFeatureExtractor:
"""
Combine multiple feature types
Different features capture different aspects
"""
def __init__(self):
self.mfcc_ext = MFCCExtractor()
self.pitch_ext = PitchExtractor()
self.prosody_ext = ProsodicFeatureExtractor()
def extract_combined(self, audio):
"""
Extract and combine multiple feature types
"""
# MFCCs (40, time)
mfccs = self.mfcc_ext.extract(audio)
# Pitch (time,)
pitch, voiced = self.pitch_ext.extract_f0(audio)
pitch = pitch.reshape(1, -1) # (1, time)
# Energy (1, time)
energy = librosa.feature.rms(y=audio, hop_length=160)
# Align all features to same time dimension
min_time = min(mfccs.shape[1], pitch.shape[1], energy.shape[1])
mfccs = mfccs[:, :min_time]
pitch = pitch[:, :min_time]
energy = energy[:, :min_time]
# Stack
combined = np.vstack([mfccs, pitch, energy]) # (42, time)
return combined
Key Takeaways
✅ MFCCs are standard for speech recognition - compact and robust
✅ Mel-spectrograms work better with deep learning (CNNs, Transformers)
✅ Delta features capture temporal dynamics - critical for accuracy
✅ Normalize features for stable training (mean=0, std=1)
✅ Handle variable length with padding, pooling, or attention masks
✅ Cache features for repeated use - major speedup in training
✅ Streaming extraction possible with circular buffers
✅ Parallel processing speeds up batch feature extraction
✅ SpecAugment improves robustness through feature-space augmentation
✅ Monitor feature quality to detect pipeline issues early
✅ Version features for reproducibility and A/B testing
✅ Choose features based on task - no one-size-fits-all
FAQ
Q: What is the difference between MFCCs and mel-spectrograms?
A: MFCCs apply a Discrete Cosine Transform on top of the mel-spectrogram, compressing it into 13-40 coefficients that capture the spectral envelope. Mel-spectrograms preserve full spectral detail with 40-128 bins per frame. MFCCs work better with small traditional models, while mel-spectrograms perform better with CNNs and Transformers used in speech classification and streaming ASR.

Q: Which audio features should I use for my speech ML task?
A: Use MFCCs with deltas for traditional ASR and small models. Use mel-spectrograms for deep learning with CNNs or Transformers. Use pitch and prosodic features for speaker recognition and emotion detection. Use STFT magnitude with phase for speech enhancement and reconstruction tasks. Use energy and zero-crossing rate for VAD.

Q: How does SpecAugment improve speech model training?
A: SpecAugment randomly masks time regions and frequency bands in spectrograms during training, forcing the model to be robust to missing information. This simple technique yields a 10-20% relative WER reduction without requiring additional training data. It was proposed by Google in 2019 and is now standard practice in speech model training.
Originally published at: arunbaby.com/speech-tech/0003-audio-feature-extraction