Custom Language Modeling
“The model knows ‘Apple’ the fruit. It needs to learn ‘Apple’ the stock ticker.”
TL;DR
Generic ASR models fail on domain jargon because their language models assign near-zero probability to specialized terms, causing the decoder to prefer common phrases. Custom language modeling fixes this without retraining the acoustic model: shallow fusion interpolates domain-specific n-gram scores during beam search, class-based LMs expand entity placeholders dynamically, and Whisper’s contextual prompting primes the decoder with expected vocabulary. Production systems compile per-user word lists into FSTs on-the-fly for real-time customization. Always measure Entity-WER on the boosted word list, not just global WER. This builds directly on ASR decoding beam search and connects to phonetic trie search for handling transcription variants.

1. Problem Statement
Generic ASR models (Whisper, Google Speech) are trained on general internet data, so they perform poorly on jargon.
- Medical: “Administer 50mg of Xarelto.” -> “The real toe.”
- Legal: “Habeas Corpus.” -> “Happy its corpse.”
- Corporate: “Meet me at the K8s sync.” -> “Kate’s sink.”
The Problem: How do we teach a pre-trained ASR model new vocabulary without retraining the massive acoustic model?
2. Fundamentals: The Noisy Channel Model
Recall the ASR equation:
P(Text | Audio) \propto P(Audio | Text) \times P(Text)
- Acoustic Model (P(Audio | Text)): “Does this sound like ‘Xarelto’?” (Maybe.)
- Language Model (P(Text)): “Is ‘Xarelto’ a word?” (Generic LM says no; prob ≈ 0.000001.)
Since the LM probability is near zero, the total score is low. The decoder chooses “The real toe” because P("The real toe") is high.
Solution: We hack P(Text).
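To make the failure concrete, here is a toy scoring of the two competing hypotheses. The log-probabilities are illustrative numbers, not outputs of any real model:

```python
import math

# Hypothetical log-probabilities for one utterance (illustrative values only).
hypotheses = {
    "administer Xarelto": {"am": math.log(0.60), "lm": math.log(1e-6)},
    "administer the real toe": {"am": math.log(0.05), "lm": math.log(1e-3)},
}

for text, s in hypotheses.items():
    # Total score = log P(Audio | Text) + log P(Text)
    total = s["am"] + s["lm"]
    print(f"{text}: {total:.2f}")
```

Even though “Xarelto” has a much better acoustic score, its tiny LM probability drags the total below “the real toe”, so the decoder picks the wrong hypothesis.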
3. Architecture: Shallow vs Deep Fusion
How do we inject the new words?
3.1 Shallow Fusion (The Standard)
We train a small, domain-specific LM (n-gram) on the client’s text documents.
During decoding (Beam Search), we interpolate the scores:
Score = \log P_{AM} + \alpha \log P_{GenericLM} + \beta \log P_{CustomLM}
If \beta is high, the custom model boosts “Xarelto”.
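A minimal sketch of the fused score, using the same illustrative log-probabilities as before (the alpha and beta values are assumptions, not tuned numbers):

```python
import math

def fused_score(log_p_am, log_p_generic, log_p_custom, alpha=0.5, beta=0.8):
    """Shallow-fusion score used to rank beam-search hypotheses."""
    return log_p_am + alpha * log_p_generic + beta * log_p_custom

# "Xarelto": strong acoustic match, unknown to the generic LM, common in the custom LM.
xarelto = fused_score(math.log(0.60), math.log(1e-6), math.log(0.05))
# "the real toe": weak acoustic match but fluent generic English, absent from the custom LM.
real_toe = fused_score(math.log(0.05), math.log(1e-3), math.log(1e-7))
print(xarelto > real_toe)  # True: the custom LM term flips the ranking
```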
3.2 Deep Fusion
We inject a specialized neural network layer inside the ASR network that attends to a list of custom words. This is harder to implement but more robust.
4. Implementation Approaches
4.1 Class-Based LMs
Instead of hardcoding “John Smith”, we train the LM with a placeholder tag: @NAME.
- Training sentence: “Call @NAME at 5pm.”
- Runtime: We provide a map {@NAME: ["John", "Sarah", "Mike"]}. The FST (Finite State Transducer) dynamically expands the @NAME node into arcs for John, Sarah, Mike.
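A toy version of that expansion step in plain Python. Real systems do this by composing FSTs (e.g. with OpenFst); the dict-based replacement below is only a sketch of the idea:

```python
def expand_classes(lm_sentences, class_map):
    """Replace each class placeholder with one copy per runtime entity,
    mimicking how an FST expands a @NAME node into parallel arcs."""
    expanded = []
    for sent in lm_sentences:
        hits = [tag for tag in class_map if tag in sent]
        if not hits:
            expanded.append(sent)
            continue
        tag = hits[0]  # sketch: handle one placeholder per sentence
        for entity in class_map[tag]:
            expanded.append(sent.replace(tag, entity))
    return expanded

templates = ["Call @NAME at 5pm."]
print(expand_classes(templates, {"@NAME": ["John", "Sarah", "Mike"]}))
# ['Call John at 5pm.', 'Call Sarah at 5pm.', 'Call Mike at 5pm.']
```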
4.2 Contextual Biasing (Attention)
In Transformer ASR (Whisper), we can pass a list of “Hint Strings” in the prompt.
prompt="Xarelto, Ibuprofen, Tylenol"
The model’s cross-attention mechanism attends to these tokens, increasing their likelihood.
5. Implementation: Contextual Biasing with Whisper
import whisper
model = whisper.load_model("base")
# 1. Standard Inference
audio = "audio_xarelto.mp3"
result_bad = model.transcribe(audio)
print(result_bad["text"])
# Output: "Patient needs the real toe."
# 2. Contextual Prompting
# We prepend the keywords to the decoder's context window.
# It acts like the model "just said" these words, priming it to say them again.
initial_prompt = "Medical Logic: Xarelto, Warfarin, Apixaban."
result_good = model.transcribe(audio, initial_prompt=initial_prompt)
print(result_good["text"])
# Output: "Patient needs Xarelto."
6. Training Considerations
6.1 Text Data Augmentation
To train the Custom LM, you need text.
- Source: Technical manuals, past transcripts, email logs.
- Normalization: You must convert “50mg” to “fifty milligrams” to match ASR output space.
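A minimal sketch of that normalization step. The unit and number tables here are tiny assumptions for illustration; production systems use full text-normalization grammars:

```python
import re

# Assumed lookup tables; a real normalizer covers far more cases.
UNITS = {"mg": "milligrams", "ml": "milliliters", "km": "kilometers"}
NUMBERS = {"5": "five", "50": "fifty", "100": "one hundred"}

def normalize(text):
    """Rewrite digit+unit strings into the spoken forms an ASR decoder emits."""
    def repl(m):
        num, unit = m.group(1), m.group(2)
        return f"{NUMBERS.get(num, num)} {UNITS.get(unit, unit)}"
    return re.sub(r"(\d+)\s*(mg|ml|km)\b", repl, text)

print(normalize("Administer 50mg of Xarelto."))
# Administer fifty milligrams of Xarelto.
```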
6.2 Pruning
A custom LM with 1 million words is slow. Prune the n-grams. Keep only unique jargon. Trust the Generic LM for “the”, “cat”, “is”.
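One simple pruning heuristic, sketched below: keep only n-grams that contain at least one word missing from the generic vocabulary. The `generic_vocab` set is an assumed stand-in for a real lexicon:

```python
# Assumed miniature generic vocabulary for illustration.
generic_vocab = {"administer", "of", "the", "patient", "needs"}

def prune_ngrams(ngram_counts, vocab):
    """Keep only n-grams containing out-of-vocabulary jargon; trust the
    generic LM for n-grams made entirely of common words."""
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if any(word not in vocab for word in ngram.lower().split())
    }

counts = {"administer xarelto": 12, "of the": 900, "the patient": 450}
print(prune_ngrams(counts, generic_vocab))
# {'administer xarelto': 12}
```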
7. Production Deployment: Dynamic Loading
In a SaaS ASR (like Otter.ai):
- User enters a meeting (“Project Apollo Sync”).
- System loads the “Project Apollo” word list (entities: “Apollo”, “Saturn”, “Launch”).
- System compiles a tiny FST on-the-fly (milliseconds).
- Decoder graph = Generic_Graph composed with Dynamic_FST.
This allows per-user customization.
8. Performance Metrics
Entity-WER.
- Global WER might be 5% with or without customization.
- But if the 5% error is the Patient’s Name, the transcript is useless.
- Measure accuracy specifically on the Boosted List.
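A simplified sketch of the metric: error rate restricted to the boosted word list. Real Entity-WER implementations align reference and hypothesis first; this recall-style version just checks whether each boosted reference word survived:

```python
def entity_wer(reference, hypothesis, entities):
    """Fraction of boosted-list words in the reference that are missing
    from the hypothesis (a simplified, alignment-free Entity-WER)."""
    ref_words = reference.lower().split()
    hyp_words = set(hypothesis.lower().split())
    targets = [w for w in ref_words if w in entities]
    if not targets:
        return 0.0
    missed = sum(1 for w in targets if w not in hyp_words)
    return missed / len(targets)

ref = "patient needs xarelto and warfarin"
hyp = "patient needs the real toe and warfarin"
print(entity_wer(ref, hyp, {"xarelto", "warfarin"}))  # 0.5
```

Here the global transcript is mostly right, but half the entities are lost, which is exactly what global WER hides.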
9. Failure Modes
- Over-Biasing:
  - Boost list: ["Call"].
  - User says: “Tall building.”
  - ASR hears: “Call building.”
  - Fix: a tunable biasing_weight parameter.
- Phonetic Confusion:
  - Boost: ["Resume"] (noun).
  - User: “Resume” (verb).
  - ASR gets it right, but downstream NLP gets confused by the tag.
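Tuning that weight can be sketched as a dev-set sweep: pick the value that minimizes a combined error, so neither over-biasing nor under-biasing wins. The evaluation curve below is a toy assumption, not measured data:

```python
def pick_biasing_weight(weights, eval_fn):
    """Return the weight minimizing Entity-WER + global WER on a dev set.
    eval_fn(w) -> (entity_wer, global_wer); here a toy stand-in."""
    return min(weights, key=lambda w: sum(eval_fn(w)))

def toy_eval(w):
    entity_wer = max(0.02, 0.30 - 0.10 * w)   # improves with heavier biasing
    global_wer = 0.05 + 0.02 * w * w          # degrades when over-biased
    return entity_wer, global_wer

print(pick_biasing_weight([0.0, 0.5, 1.0, 2.0, 4.0], toy_eval))  # 2.0
```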
10. Real-World Case Study: Smart Speakers
The Alexa contact list. When you say “Call Mom”, Alexa biases the ASR towards your contacts. It doesn’t boost “Mom” for everyone; it boosts “Mom”, “Dad”, “Arun” for you. This uses Personalized Language Models (PLMs).
11. State-of-the-Art: Neural Biasing
Recent research (Google) uses GNNs (Graph Neural Networks) to encode the relationship between entities in the bias list, handling thousands of entities (e.g., a massive Song Library) without degrading latency.
12. Key Takeaways
- Generic is not enough: Production ASR requires customization.
- Shallow Fusion is cheap: No GPU retraining needed, just statistical counting over text.
- Prompt Engineering works for ASR: Whisper’s prompt feature allows 0-shot adaptation.
- Metric Validity: Optimize for Entity-WER, not just WER.
FAQ
Why do generic ASR models fail on domain-specific vocabulary?
ASR models use the equation P(Text | Audio) proportional to P(Audio | Text) * P(Text). Even when the acoustic model correctly identifies sounds matching “Xarelto”, the generic language model assigns near-zero probability to that term because it was rarely seen in general training data. The decoder then prefers common phrases like “the real toe” that have much higher language model scores, overriding the acoustic evidence.
What is shallow fusion and how does it improve domain-specific ASR?
Shallow fusion trains a small domain-specific n-gram language model on text from the client’s domain (medical manuals, legal documents, corporate communications). During beam search decoding, the system interpolates three scores: the acoustic model, the generic language model, and the custom language model. A tunable weight beta controls how strongly the custom model boosts domain terms, requiring no GPU retraining of the acoustic model.
How does Whisper’s contextual prompting work for domain adaptation?
Whisper accepts an initial_prompt parameter that prepends text to the decoder’s context window, effectively priming the model as if those words were recently spoken. By setting the prompt to domain keywords (e.g., “Medical: Xarelto, Warfarin, Apixaban”), the model’s cross-attention mechanism becomes biased toward those tokens during decoding, significantly increasing the probability of correctly transcribing domain-specific terms with zero additional training.
What is over-biasing in custom language modeling and how do you prevent it?
Over-biasing occurs when the boost weight for custom vocabulary is set too high, causing the ASR to hallucinate boosted words even when the speaker said something different. For example, boosting “Call” might cause “Tall building” to be transcribed as “Call building.” Prevention requires careful tuning of the biasing_weight parameter and measuring accuracy specifically on the boosted word list (Entity-WER) to detect both under-biasing and over-biasing.
Originally published at: arunbaby.com/speech-tech/0050-custom-language-modeling
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch