Hierarchical Speech Classification
“From broad categories to fine-grained speech understanding.”
TL;DR
Hierarchical speech classification organizes voice commands into a taxonomy (domain to intent to slot) instead of flat classification over thousands of classes. Multi-task learning with shared Conformer encoders and level-specific classification heads provides the best accuracy, exploiting that early layers capture broad acoustic features (domain) while deeper layers capture semantics (slots). Google Assistant and Alexa use this approach to route billions of commands daily. For the underlying end-to-end model architectures and how these classifiers feed into voice search ranking, see the dedicated articles.

1. What is Hierarchical Speech Classification?
Hierarchical speech classification organizes audio into a taxonomy of categories, moving from coarse to fine-grained predictions.
Example: Voice Command Classification
Intent
├── Media Control
│   ├── Music
│   │   ├── Play Song
│   │   └── Pause Song
│   └── Video
│       ├── Play Video
│       └── Stop Video
└── Smart Home
    ├── Lights
    │   ├── Turn On
    │   └── Turn Off
    └── Thermostat
Problem: Given the audio “Hey Google, turn on the bedroom lights”, classify into:
- Intent: Smart Home > Lights > Turn On
- Entity: “bedroom”
2. Why Hierarchical for Speech?
| Challenge | Flat Classification | Hierarchical Classification |
|---|---|---|
| Acoustic Similarity | Confuses “Play music” and “Pause music” | Groups under “Music Control” first |
| Scalability | A single model over 10,000+ commands | Modular (one model per subtree) |
| Out-of-Domain | No fallback | Can classify to parent if uncertain about child |
| Interpretability | Black box | Clear decision path |
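The last row of the table, falling back to a parent category when the child-level prediction is uncertain, can be implemented with a simple confidence threshold. A minimal sketch (the per-example logit inputs and the 0.6 threshold are illustrative assumptions, not from a production system):

```python
import torch.nn.functional as F

def predict_with_fallback(domain_logits, intent_logits, threshold=0.6):
    """Return the fine-grained intent if confident, otherwise fall back to the parent domain.

    Expects per-example logit vectors; the 0.6 threshold is an illustrative value.
    """
    intent_probs = F.softmax(intent_logits, dim=-1)
    intent_conf, intent_idx = intent_probs.max(dim=-1)
    if intent_conf.item() >= threshold:
        return ("intent", intent_idx.item())
    # Not confident at the child level: back off to the parent (domain) prediction
    domain_probs = F.softmax(domain_logits, dim=-1)
    domain_conf, domain_idx = domain_probs.max(dim=-1)
    return ("domain", domain_idx.item())
```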
3. Speech Hierarchy Types
Type 1: Speaker Recognition
Speech
├── Speaker 1
├── Speaker 2
└── Unknown
    ├── Male
    └── Female
        ├── Child
        └── Adult
Type 2: Language Identification
Audio
├── English
│   ├── US
│   ├── UK
│   └── Australia
├── Spanish
│   ├── Spain
│   └── Mexico
└── Chinese
    ├── Mandarin
    └── Cantonese
Type 3: Emotion Recognition
Emotion
├── Positive
│   ├── Happy
│   └── Excited
├── Negative
│   ├── Angry
│   └── Sad
└── Neutral
Type 4: Command Classification (Voice Assistants)
Domain
├── Music
│   ├── Play
│   ├── Pause
│   └── Skip
├── Navigation
│   ├── Directions
│   └── Traffic
└── Communication
    ├── Call
    └── Message
4. Hierarchical Classification Approaches
Approach 1: Global Audio Classifier
Train a single end-to-end model predicting all leaf categories from raw audio.
Architecture:
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class GlobalSpeechClassifier(nn.Module):
    def __init__(self, num_classes=10000):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.classifier = nn.Linear(768, num_classes)  # one logit per leaf command

    def forward(self, audio):
        features = self.wav2vec(audio).last_hidden_state  # [batch, time, 768]
        pooled = features.mean(dim=1)  # mean pooling over time
        logits = self.classifier(pooled)
        return logits
Pros:
- Simple architecture.
- End-to-end optimization.
Cons:
- Class Imbalance: “Play music” has 1M examples, “Set thermostat to 68°F” has 100.
- No hierarchy exploitation.
Approach 2: Coarse-to-Fine Pipeline
Stage 1: Classify into broad categories (Domain). Stage 2: For each domain, classify into intents.
Example:
# Stage 1: Domain classification
domain = domain_classifier(audio)  # Music, Navigation, Communication

# Stage 2: Intent classification
if domain == "Music":
    intent = music_intent_classifier(audio)  # Play, Pause, Skip
elif domain == "Navigation":
    intent = nav_intent_classifier(audio)  # Directions, Traffic
Pros:
- Modular: Can update one stage without touching the other.
- Balanced training (each stage sees balanced data).
Cons:
- Error Propagation: If Stage 1 is wrong, Stage 2 has no chance.
- Latency: Two forward passes.
Approach 3: Multi-Task Learning (MTL)
Train a shared encoder with multiple output heads (one per level).
Architecture:
class HierarchicalSpeechMTL(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Heads for each level
        self.domain_head = nn.Linear(768, 10)    # 10 domains
        self.intent_head = nn.Linear(768, 100)   # 100 intents
        self.slot_head = nn.Linear(768, 1000)    # 1000 slots

    def forward(self, audio):
        features = self.encoder(audio).last_hidden_state.mean(dim=1)
        domain_logits = self.domain_head(features)
        intent_logits = self.intent_head(features)
        slot_logits = self.slot_head(features)
        return {
            'domain': domain_logits,
            'intent': intent_logits,
            'slot': slot_logits,
        }
Loss:
loss = α * domain_loss + β * intent_loss + γ * slot_loss
Pros:
- Shared representations learn general audio features.
- Joint optimization.
Cons:
- Balancing loss weights (α, β, γ) is tricky.
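One common way to sidestep hand-tuning α, β, γ is to learn per-task weights from homoscedastic uncertainty in the style of Kendall et al. (2018). This is an addition here rather than something the article prescribes; a minimal sketch:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learns one log-variance per task so loss weights need not be hand-tuned.

    Sketch of the multi-task weighting scheme from Kendall et al. (2018),
    applied here to the domain/intent/slot losses; not part of the original article.
    """
    def __init__(self, num_tasks=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: list of per-task scalar losses, e.g. [domain_loss, intent_loss, slot_loss]
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```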
5. Handling Audio-Specific Challenges
Challenge 1: Acoustic Variability
Problem: “Play music” can be said in 1000 ways (different speakers, accents, noise).
Solution: Data Augmentation
import random
import torch
import torchaudio

def augment_audio(waveform, sample_rate=16000):
    # Speed perturbation: resample to a slightly different rate and treat the result
    # as if it were still at the original rate (torchaudio's TimeStretch operates on
    # spectrograms, so waveform-level speed perturbation is used here instead)
    rate = random.uniform(0.9, 1.1)
    waveform = torchaudio.functional.resample(waveform, sample_rate, int(sample_rate * rate))
    # Pitch shift by up to +/- 2 semitones
    waveform = torchaudio.functional.pitch_shift(waveform, sample_rate, n_steps=random.randint(-2, 2))
    # Add noise
    noise = torch.randn_like(waveform) * 0.005
    waveform = waveform + noise
    return waveform
Challenge 2: Imbalanced Hierarchy
Problem: “Play music” appears 1M times, “Set thermostat to 72°F and humidity to 50%” appears 10 times.
Solution: Hierarchical Sampling
- Sample uniformly across domains first.
- Then sample uniformly within each domain.
- Ensures rare intents get seen during training (see the sampler sketch below).
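A minimal sketch of the two-stage sampler, assuming examples are pre-grouped into a nested dict by domain and intent (that grouping format is an assumption for illustration):

```python
import random

def hierarchical_sample(examples_by_domain, batch_size):
    """Two-stage sampling: pick a domain uniformly, then an intent uniformly within it.

    examples_by_domain: {domain: {intent: [example, ...]}} -- an assumed format.
    """
    batch = []
    domains = list(examples_by_domain.keys())
    for _ in range(batch_size):
        domain = random.choice(domains)  # uniform over domains
        intent = random.choice(list(examples_by_domain[domain].keys()))  # uniform within the domain
        batch.append(random.choice(examples_by_domain[domain][intent]))
    return batch
```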
Deep Dive: Conformer for Hierarchical Speech
Conformer (Convolution + Transformer) is the SOTA architecture for speech.
Why Conformer?
- Local Features: Convolution captures phonetic details.
- Global Context: Self-attention captures long-range dependencies (e.g., in “turn on the bedroom lights”, “bedroom” modifies “lights”).
Hierarchical Conformer:
class HierarchicalConformer(nn.Module):
    def __init__(self):
        super().__init__()
        # ConformerBlock is a stand-in for a standard Conformer layer implementation
        self.conformer_blocks = nn.ModuleList([
            ConformerBlock() for _ in range(12)
        ])
        # Insert classification heads at different depths
        self.domain_head = nn.Linear(512, 10)     # After block 4
        self.intent_head = nn.Linear(512, 100)    # After block 8
        self.slot_head = nn.Linear(512, 1000)     # After block 12

    def forward(self, audio):
        x = audio
        outputs = {}
        for i, block in enumerate(self.conformer_blocks):
            x = block(x)
            if i == 3:   # After block 4
                outputs['domain'] = self.domain_head(x.mean(dim=1))
            if i == 7:   # After block 8
                outputs['intent'] = self.intent_head(x.mean(dim=1))
            if i == 11:  # After block 12
                outputs['slot'] = self.slot_head(x.mean(dim=1))
        return outputs
Intuition:
- Early layers: Broad acoustic features → Domain classification.
- Middle layers: Phonetic patterns → Intent classification.
- Deep layers: Semantic understanding → Slot filling.
Deep Dive: Hierarchical Attention
Use attention mechanisms to focus on different parts of the audio for different levels.
Example:
- Domain: Attend to the first word (“Play”, “Navigate”, “Call”).
- Intent: Attend to the verb + object (“Play music”, “Play video”).
- Slot: Attend to entities (“Play music by Taylor Swift”).
Implementation:
class HierarchicalAttention(nn.Module):
    def __init__(self, num_domains=10, num_intents=100):
        super().__init__()
        self.domain_attention = nn.MultiheadAttention(embed_dim=512, num_heads=8)
        self.intent_attention = nn.MultiheadAttention(embed_dim=512, num_heads=8)
        self.domain_head = nn.Linear(512, num_domains)
        self.intent_head = nn.Linear(512, num_intents)

    def forward(self, features):
        # features: [seq_len, batch, 512]
        # Domain: use the first ~10% of frames as queries over the whole utterance
        num_query = max(1, features.size(0) // 10)
        domain_context, _ = self.domain_attention(features[:num_query], features, features)
        domain_logits = self.domain_head(domain_context.mean(dim=0))
        # Intent: attend over the full utterance
        intent_context, _ = self.intent_attention(features, features, features)
        intent_logits = self.intent_head(intent_context.mean(dim=0))
        return domain_logits, intent_logits
Deep Dive: Speaker-Aware Hierarchical Classification
Problem: Different users say the same command differently.
Solution: Speaker Embeddings
class SpeakerAwareClassifier(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.speaker_encoder = SpeakerNet()  # x-vector or d-vector extractor (placeholder)
        self.fusion = nn.Linear(768 + 256, 512)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, audio):
        audio_features = self.audio_encoder(audio).last_hidden_state.mean(dim=1)
        speaker_embedding = self.speaker_encoder(audio)  # [batch, 256]
        combined = torch.cat([audio_features, speaker_embedding], dim=1)
        fused = self.fusion(combined)
        logits = self.classifier(fused)
        return logits
Benefit: Model learns speaker-specific patterns (e.g., User A always says “Play tunes”, User B says “Play music”).
Deep Dive: Hierarchical Spoken Language Understanding (SLU)
SLU combines Intent Classification and Slot Filling.
Example:
- Input: “Set a timer for 5 minutes”
- Intent: SetTimer
- Slots: duration = “5 minutes”
Hierarchy:
Root
├── Timer Intent
│   ├── Set Timer
│   ├── Cancel Timer
│   └── Check Timer
└── Alarm Intent
    ├── Set Alarm
    └── Snooze Alarm
Joint Model:
class HierarchicalSLU(nn.Module):
    def __init__(self, num_intents=100, num_slot_tags=50):
        super().__init__()
        self.encoder = BERTModel()  # placeholder text encoder; or a Conformer for end-to-end SLU from audio
        self.intent_classifier = nn.Linear(768, num_intents)
        self.slot_tagger = nn.Linear(768, num_slot_tags)  # BIO tagging

    def forward(self, tokens):
        embeddings = self.encoder(tokens)  # [batch, seq_len, 768]
        # Intent: use the [CLS] token
        intent_logits = self.intent_classifier(embeddings[:, 0, :])
        # Slots: tag every token
        slot_logits = self.slot_tagger(embeddings)
        return intent_logits, slot_logits
Deep Dive: Hierarchical Metrics for Speech
Metric 1: Intent Accuracy at Each Level
def hierarchical_accuracy(pred_path, true_path):
    correct_at_level = []
    on_correct_path = True
    for pred_node, true_node in zip(pred_path, true_path):
        # Once a level is wrong, everything below it also counts as wrong
        on_correct_path = on_correct_path and (pred_node == true_node)
        correct_at_level.append(1.0 if on_correct_path else 0.0)
    return correct_at_level

# Example:
# True: [Media, Music, Play]
# Pred: [Media, Video, Play]
# Accuracy: [1.0, 0.0, 0.0]  # Got Media right, but wrong afterwards
Metric 2: Partial Match Score
Give credit for getting part of the hierarchy correct. \[ \text{Score} = \frac{\sum_{i} w_i \cdot \mathbb{1}[\text{pred}_i = \text{true}_i]}{\sum_i w_i} \] where \(w_i\) increases with depth (deeper levels are weighted more).
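The score translates directly into code. A small sketch, with illustrative depth weights supplied by the caller:

```python
def partial_match_score(pred_path, true_path, weights):
    """Depth-weighted partial match: credit for each level predicted correctly."""
    total = sum(weights)
    earned = sum(w for w, p, t in zip(weights, pred_path, true_path) if p == t)
    return earned / total

# Example with illustrative weights (deeper levels weighted more):
# partial_match_score(["Media", "Video", "Play"], ["Media", "Music", "Play"], weights=[1, 2, 3])
# -> (1 + 3) / 6 ≈ 0.67
```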
Deep Dive: Google Assistant’s Hierarchical Command Classification
Google processes billions of voice commands daily.
Architecture:
- Hotword Detection: “Hey Google” (on-device, low power).
- Audio Streaming: Send audio to the cloud.
- ASR: Convert audio to text (Conformer-based RNN-T).
- Domain Classification: Is this Music, Navigation, SmartHome, etc.? (BERT classifier).
- Intent Classification: Within domain, what’s the intent? (Domain-specific BERT).
- Slot Filling: Extract entities (CRF on top of BERT).
- Execution: Call the appropriate API.
Hierarchical Optimization:
- Domain Model: Trained on all 1B+ queries.
- Intent Models: Separate model per domain, trained only on that domain’s data (more focused, higher accuracy).
Latency Budget:
- Hotword: < 100ms
- ASR: < 500ms
- NLU (Domain + Intent + Slot): < 200ms
- Total: < 800ms (target)
Deep Dive: Alexa’s Hierarchical Skill Routing
Amazon Alexa has 100,000+ skills (third-party voice apps).
Problem: Route the user’s command to the correct skill.
Hierarchy:
Utterance
├── Built-in Skills
│   ├── Music (Amazon Music)
│   ├── Shopping (Amazon Shopping)
│   └── SmartHome (Alexa Smart Home)
└── Third-Party Skills
    ├── Category: Games
    ├── Category: News
    └── Category: Productivity
Routing Algorithm:
- Explicit Invocation: “Ask Spotify to play music” → Route to the Spotify skill.
- Implicit Invocation: “Play music” → Disambiguate:
  - Check the user’s default music provider.
  - If still ambiguous, ask: “Would you like Amazon Music or Spotify?”
- Hierarchical Classification:
  - Level 1: Built-in vs. Third-Party.
  - Level 2: If Third-Party, which category?
  - Level 3: Within the category, which skill?
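A much-simplified sketch of this routing logic; the classifier methods and user-profile fields below are hypothetical placeholders, not Alexa's actual APIs:

```python
def route_utterance(text, user_profile, classifier):
    """Toy routing: explicit invocation first, then hierarchical classification.

    user_profile, classifier, and their methods are hypothetical placeholders.
    """
    # 1. Explicit invocation: "Ask <skill> to ..."
    if text.lower().startswith("ask "):
        skill_name = text.split()[1]
        return ("explicit", skill_name)

    # 2. Hierarchical classification: built-in vs third-party, then category, then skill
    level1 = classifier.predict_builtin_vs_third_party(text)
    if level1 == "built-in":
        return ("built-in", classifier.predict_builtin_skill(text))
    category = classifier.predict_category(text)
    candidates = classifier.predict_skills(text, category)

    # 3. Disambiguate with user preferences, otherwise ask a follow-up question
    preferred = user_profile.get("default_provider")
    if preferred in candidates:
        return ("third-party", preferred)
    return ("clarify", candidates)
```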
Deep Dive: Multilingual Hierarchical Speech
Challenge: Support 100+ languages.
Approach 1: Per-Language Models
- Train separate models for each language.
- Cons: 100 models to maintain.
Approach 2: Multilingual Shared Encoder
- Train a single wav2vec2 model on data from all languages.
- Add language-specific heads.
class MultilingualHierarchical(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder pretrained on ~100 languages (e.g., an XLS-R checkpoint)
        self.shared_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        hidden = self.shared_encoder.config.hidden_size
        # Language-specific intent heads
        self.language_heads = nn.ModuleDict({
            'en': nn.Linear(hidden, 1000),  # English intents
            'es': nn.Linear(hidden, 1000),  # Spanish intents
            'zh': nn.Linear(hidden, 1000),  # Chinese intents
        })

    def forward(self, audio, language):
        features = self.shared_encoder(audio).last_hidden_state.mean(dim=1)
        logits = self.language_heads[language](features)
        return logits
Benefit: Transfer learning. Low-resource languages benefit from high-resource languages.
Deep Dive: Confidence Calibration Across Levels
Problem: The model predicts:
- Domain: Music (confidence = 0.99)
- Intent: Play (confidence = 0.51)
Is the overall prediction reliable?
Solution: Hierarchical Confidence \[ C_{\text{overall}} = C_{\text{domain}} \times C_{\text{intent}} \times C_{\text{slot}} \]
If \(C_{\text{overall}} < 0.7\), ask for clarification: “Did you want to play music?”
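A tiny sketch of this check, assuming each level's head exposes a softmax confidence (the example values below are illustrative):

```python
def overall_confidence(level_confidences):
    """Multiply per-level confidences (domain, intent, slot) into one score."""
    score = 1.0
    for c in level_confidences:
        score *= c
    return score

conf = overall_confidence([0.99, 0.51, 0.90])  # example values from hypothetical heads
if conf < 0.7:
    print("Did you want to play music?")  # ask the user for clarification
```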
Deep Dive: Active Learning for Rare Intents
Problem: “Set thermostat to 68°F and humidity level to 45%” appears only 5 times in training data.
Solution: Active Learning
- Deploy model.
- Log all predictions with \(C_{\text{overall}} < 0.5\) (uncertain).
- Human reviews and labels these uncertain examples.
- Retrain model with new labels.
Hierarchical Active Learning:
- Prioritize examples where the model is uncertain at multiple levels.
- Example: Uncertain about both Domain and Intent → High priority for labeling.
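A small sketch of this prioritization, assuming logged examples carry per-level confidences (the field name is an assumed logging format):

```python
def labeling_priority(example, threshold=0.5):
    """Count how many hierarchy levels the model was uncertain about.

    example["confidences"] is an assumed format: {"domain": 0.4, "intent": 0.3, ...}.
    """
    return sum(1 for c in example["confidences"].values() if c < threshold)

# Sort logged low-confidence examples so multi-level uncertainty gets labeled first:
# queue = sorted(uncertain_examples, key=labeling_priority, reverse=True)
```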
Deep Dive: Temporal Hierarchies (Sequential Commands)
Problem: “Play Taylor Swift, then set a timer for 5 minutes.”
Two Intents in One Utterance:
- Play Music (artist = Taylor Swift)
- Set Timer (duration = 5 minutes)
Approach: Segmentation + Per-Segment Classification
# Step 1: Segment audio
segments = segment_audio(audio)  # ["Play Taylor Swift", "set a timer for 5 minutes"]

# Step 2: Classify each segment
for segment in segments:
    intent, slots = hierarchical_classifier(segment)
    execute(intent, slots)
Segmentation Techniques:
- Pause Detection: Split on silences > 500ms.
- Semantic Segmentation: Use a sequence tagging model to predict segment boundaries.
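For the pause-detection bullet above, a minimal energy-based splitter might look like this (the frame length and energy threshold are illustrative defaults; only the 500 ms pause figure comes from the text):

```python
def split_on_silence(waveform, sample_rate=16000, frame_ms=20,
                     energy_thresh=1e-4, min_pause_ms=500):
    """Split a 1-D mono waveform (torch.Tensor) wherever frame energy stays low for >= min_pause_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frames = waveform.unfold(0, frame_len, frame_len)  # [num_frames, frame_len]
    energies = frames.pow(2).mean(dim=1)               # per-frame energy
    silent = (energies < energy_thresh).tolist()
    min_silent_frames = min_pause_ms // frame_ms

    segments, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run == min_silent_frames:                   # pause long enough: cut before it
            end = (i - min_silent_frames + 1) * frame_len
            if end > start:
                segments.append(waveform[start:end])
            start = (i + 1) * frame_len
    if start < waveform.shape[0]:
        segments.append(waveform[start:])
    return segments
```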
Deep Dive: Hierarchical Few-Shot Learning
Problem: A new intent “Book a table at a restaurant” is added. We have only 10 labeled examples.
Solution: Prototypical Networks
import torch
import torch.nn.functional as F

def prototypical_network(support_set, query, encoder):
    # support_set: [(audio_1, label_1), (audio_2, label_2), ...]
    # encoder: any pretrained audio encoder mapping audio -> a fixed-size embedding
    # Compute a prototype (mean embedding) for each class
    prototypes = {}
    for audio, label in support_set:
        features = encoder(audio)
        prototypes.setdefault(label, []).append(features)
    for label in prototypes:
        prototypes[label] = torch.stack(prototypes[label]).mean(dim=0)
    # Classify the query by its nearest prototype (cosine distance)
    query_features = encoder(query)
    distances = {
        label: 1 - F.cosine_similarity(query_features, proto, dim=-1).item()
        for label, proto in prototypes.items()
    }
    predicted_label = min(distances, key=distances.get)
    return predicted_label
Benefit: Can add new intents with < 10 examples.
Deep Dive: Noise Robustness in Hierarchical Speech
Problem: Background noise (TV, traffic) degrades classification.
Solution: Multi-Condition Training (MCT)
def add_noise(clean_audio, noise_audio, snr_db):
    # Scale the noise so the signal-to-noise ratio (in dB, by waveform norm) hits snr_db
    target_noise_norm = clean_audio.norm() / (10 ** (snr_db / 20))
    scaled_noise = noise_audio * target_noise_norm / noise_audio.norm()
    noisy_audio = clean_audio + scaled_noise
    return noisy_audio

# Training
for audio, label in dataset:
    noise = random.choice(noise_dataset)  # TV, traffic, babble
    snr = random.uniform(-5, 20)          # dB
    noisy_audio = add_noise(audio, noise, snr)
    loss = criterion(model(noisy_audio), label)
Advanced: Denoising Front-End
- Use a speech enhancement model before the classifier.
- Example: Conv-TasNet, SuDoRM-RF.
Implementation: Full Hierarchical Speech Pipeline
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
class HierarchicalSpeechClassifier(nn.Module):
    def __init__(self, domain_classes=10, intent_classes=100):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Domain classifier (coarse)
        self.domain_head = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(384, domain_classes)
        )
        # Intent classifier (fine)
        self.intent_head = nn.Sequential(
            nn.Linear(768, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, intent_classes)
        )

    def forward(self, audio, return_features=False):
        # audio: [batch, waveform]
        features = self.wav2vec(audio).last_hidden_state  # [batch, time, 768]
        pooled = features.mean(dim=1)  # [batch, 768]
        domain_logits = self.domain_head(pooled)
        intent_logits = self.intent_head(pooled)
        if return_features:
            return domain_logits, intent_logits, pooled
        return domain_logits, intent_logits

def hierarchical_loss(domain_logits, intent_logits, domain_target, intent_target, alpha=0.3):
    domain_loss = nn.CrossEntropyLoss()(domain_logits, domain_target)
    intent_loss = nn.CrossEntropyLoss()(intent_logits, intent_target)
    return alpha * domain_loss + (1 - alpha) * intent_loss

# Training loop
model = HierarchicalSpeechClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    for audio, domain_label, intent_label in dataloader:
        optimizer.zero_grad()
        domain_logits, intent_logits = model(audio)
        loss = hierarchical_loss(domain_logits, intent_logits, domain_label, intent_label)
        loss.backward()
        optimizer.step()
Top Interview Questions
Q1: How do you handle code-switching (mixing languages) in hierarchical speech classification? Answer: Use a multilingual encoder (e.g., wav2vec2 fine-tuned on mixed-language data). Add a language identification head to detect which language(s) are spoken, then route to the appropriate intent classifier.
Q2: What if the hierarchy changes frequently (new intents added)? Answer: Use modular design: separate models for each level. When a new intent is added, only retrain the intent-level model. Alternatively, use label embeddings (encode intent names as text) so new intents can be added without retraining.
Q3: How do you ensure low latency for real-time voice assistants? Answer:
- Streaming Models: Use RNN-T or streaming Conformer that outputs predictions as audio arrives.
- Early Exit: If the domain classifier is very confident, skip deeper layers.
- Edge Deployment: Run lightweight models on-device (quantized, pruned).
Q4: How do you evaluate hierarchical speech models? Answer:
- Accuracy at Each Level: Report domain accuracy, intent accuracy separately.
- Partial Match Score: Give credit for getting higher levels correct even if lower levels are wrong.
- Confusion Matrices: Per-level confusion matrices to identify systematic errors.
Key Takeaways
- Hierarchy Reduces Confusion: Grouping similar commands improves accuracy.
- Multi-Task Learning: Shared encoder exploits commonalities across levels.
- Modular Design: Easier to update individual levels without retraining everything.
- Attention Mechanisms: Focus on different audio segments for different levels.
- Evaluation: Use hierarchical metrics (accuracy per level, partial match).
Summary
| Aspect | Insight |
|---|---|
| Approaches | Global, Coarse-to-Fine Pipeline, Multi-Task Learning |
| Architecture | Conformer (Convolution + Transformer) is SOTA |
| Challenges | Acoustic variability, imbalanced data, multilingual |
| Real-World | Google Assistant, Alexa use hierarchical routing |
FAQ
What is hierarchical speech classification and why use it over flat classification?
Hierarchical speech classification organizes voice commands into a taxonomy of nested categories (Domain to Intent to Action) instead of predicting from a flat list of thousands of classes. It reduces confusion between acoustically similar commands by grouping them first (“Play music” and “Pause music” are both under Media Control), scales to 10,000+ commands through modular per-subtree models, and provides graceful fallback to parent categories when the model is uncertain about leaf predictions.
How does Multi-Task Learning work for hierarchical speech?
A shared encoder (Conformer or wav2vec2) produces audio features, then separate classification heads predict at each hierarchy level. The combined loss is alpha * domain_loss + beta * intent_loss + gamma * slot_loss. Early Conformer layers learn broad acoustic features for domain classification, while deeper layers capture semantic patterns for intent and slot filling. Balancing the loss weights is the main tuning challenge.
How do you ensure low latency for real-time voice assistants?
Three techniques: streaming models (RNN-T or streaming Conformer) that output predictions as audio arrives, early exit where confident domain predictions skip deeper processing layers, and edge deployment with quantized/pruned models running on-device. Google Assistant targets an 800ms total budget across hotword, ASR, and NLU stages.
How do you handle code-switching in hierarchical speech classification?
Use a multilingual encoder (wav2vec2 fine-tuned on mixed-language data) with a language identification head. When code-switching is detected (e.g., “Chalo let’s go” mixing Hindi and English), the system routes to the appropriate language-specific intent classifier. Transfer learning from high-resource languages helps improve accuracy for low-resource languages.
Originally published at: arunbaby.com/speech-tech/0029-hierarchical-speech-classification