Custom Language Modeling
“The model knows ‘Apple’ the fruit. It needs to learn ‘Apple’ the stock ticker.”
TL;DR
Generic ASR models fail on domain jargon because their language models assign near-zero probability to specialized terms, causing the decoder to prefer common phrases. Custom language modeling fixes this without retraining the acoustic model: shallow fusion interpolates domain-specific n-gram scores during beam search, class-based LMs expand entity placeholders dynamically, and Whisper’s contextual prompting primes the decoder with expected vocabulary. Production systems compile per-user word lists into FSTs on-the-fly for real-time customization. Always measure Entity-WER on the boosted word list, not just global WER. This builds directly on ASR decoding beam search and connects to phonetic trie search for handling transcription variants.

1. Problem Statement
Generic ASR models (Whisper, Google Speech) are trained on general internet data, so they perform poorly on jargon.
- Medical: “Administer 50mg of Xarelto.” -> “The real toe.”
- Legal: “Habeas Corpus.” -> “Happy its corpse.”
- Corporate: “Meet me at the K8s sync.” -> “Kate’s sink.”
The Problem: How do we teach a pre-trained ASR model new vocabulary without retraining the massive acoustic model?
2. Fundamentals: The Noisy Channel Model
Recall the ASR equation:
P(Text | Audio) \propto P(Audio | Text) \times P(Text)
- Acoustic Model (P(Audio | Text)): “Does this sound like ‘Xarelto’?” (Maybe.)
- Language Model (P(Text)): “Is ‘Xarelto’ a word?” (Generic LM says no; prob ≈ 0.000001.)
Since the LM probability is near zero, the total score is low. The decoder chooses “The real toe” because P("The real toe") is high.
Solution: We hack P(Text).
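To make the failure concrete, here is a toy scoring of the two competing hypotheses. The log-probabilities are illustrative numbers, not outputs of any real model:

```python
import math

# Hypothetical log-probabilities for one utterance (illustrative values only).
hypotheses = {
    "administer Xarelto": {"am": math.log(0.60), "lm": math.log(1e-6)},
    "administer the real toe": {"am": math.log(0.05), "lm": math.log(1e-3)},
}

for text, s in hypotheses.items():
    # Total score = log P(Audio | Text) + log P(Text)
    total = s["am"] + s["lm"]
    print(f"{text}: {total:.2f}")
```

Even though “Xarelto” has a much better acoustic score, its tiny LM probability drags the total below “the real toe”, so the decoder picks the wrong hypothesis.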
3. Architecture: Shallow vs Deep Fusion
How do we inject the new words?
3.1 Shallow Fusion (The Standard)
We train a small, domain-specific LM (n-gram) on the client’s text documents.
During decoding (Beam Search), we interpolate the scores:
Score = \log P_{AM} + \alpha \log P_{GenericLM} + \beta \log P_{CustomLM}
If \beta is high, the custom model boosts “Xarelto”.
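A minimal sketch of the fused score, using the same illustrative log-probabilities as before (the alpha and beta values are assumptions, not tuned numbers):

```python
import math

def fused_score(log_p_am, log_p_generic, log_p_custom, alpha=0.5, beta=0.8):
    """Shallow-fusion score used to rank beam-search hypotheses."""
    return log_p_am + alpha * log_p_generic + beta * log_p_custom

# "Xarelto": strong acoustic match, unknown to the generic LM, common in the custom LM.
xarelto = fused_score(math.log(0.60), math.log(1e-6), math.log(0.05))
# "the real toe": weak acoustic match but fluent generic English, absent from the custom LM.
real_toe = fused_score(math.log(0.05), math.log(1e-3), math.log(1e-7))
print(xarelto > real_toe)  # True: the custom LM term flips the ranking
```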
3.2 Deep Fusion
We inject a specialized neural network layer inside the ASR network that attends to a list of custom words. This is harder to implement but more robust.
4. Implementation Approaches
4.1 Class-Based LMs
Instead of hardcoding “John Smith”, we train the LM with a placeholder tag: @NAME.
- Training sentence: “Call @NAME at 5pm.”
- Runtime: We provide a map {@NAME: ["John", "Sarah", "Mike"]}. The FST (Finite State Transducer) dynamically expands the @NAME node into arcs for John, Sarah, Mike.
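A toy version of that expansion step in plain Python. Real systems do this by composing FSTs (e.g. with OpenFst); the dict-based replacement below is only a sketch of the idea:

```python
def expand_classes(lm_sentences, class_map):
    """Replace each class placeholder with one copy per runtime entity,
    mimicking how an FST expands a @NAME node into parallel arcs."""
    expanded = []
    for sent in lm_sentences:
        hits = [tag for tag in class_map if tag in sent]
        if not hits:
            expanded.append(sent)
            continue
        tag = hits[0]  # sketch: handle one placeholder per sentence
        for entity in class_map[tag]:
            expanded.append(sent.replace(tag, entity))
    return expanded

templates = ["Call @NAME at 5pm."]
print(expand_classes(templates, {"@NAME": ["John", "Sarah", "Mike"]}))
# ['Call John at 5pm.', 'Call Sarah at 5pm.', 'Call Mike at 5pm.']
```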
4.2 Contextual Biasing (Attention)
In Transformer ASR (Whisper), we can pass a list of “Hint Strings” in the prompt.
prompt="Xarelto, Ibuprofen, Tylenol"
The model’s cross-attention mechanism attends to these tokens, increasing their likelihood.
5. Implementation: Contextual Biasing with Whisper
import whisper
model = whisper.load_model("base")
# 1. Standard Inference
audio = "audio_xarelto.mp3"
result_bad = model.transcribe(audio)
print(result_bad["text"])
# Output: "Patient needs the real toe."
# 2. Contextual Prompting
# We prepend the keywords to the decoder's context window.
# It acts like the model "just said" these words, priming it to say them again.
initial_prompt = "Medical Logic: Xarelto, Warfarin, Apixaban."
result_good = model.transcribe(audio, initial_prompt=initial_prompt)
print(result_good["text"])
# Output: "Patient needs Xarelto."
6. Training Considerations
6.1 Text Data Augmentation
To train the Custom LM, you need text.
- Source: Technical manuals, past transcripts, email logs.
- Normalization: You must convert “50mg” to “fifty milligrams” to match ASR output space.
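A minimal sketch of that normalization step. The unit and number tables here are tiny assumptions for illustration; production systems use full text-normalization grammars:

```python
import re

# Assumed lookup tables; a real normalizer covers far more cases.
UNITS = {"mg": "milligrams", "ml": "milliliters", "km": "kilometers"}
NUMBERS = {"5": "five", "50": "fifty", "100": "one hundred"}

def normalize(text):
    """Rewrite digit+unit strings into the spoken forms an ASR decoder emits."""
    def repl(m):
        num, unit = m.group(1), m.group(2)
        return f"{NUMBERS.get(num, num)} {UNITS.get(unit, unit)}"
    return re.sub(r"(\d+)\s*(mg|ml|km)\b", repl, text)

print(normalize("Administer 50mg of Xarelto."))
# Administer fifty milligrams of Xarelto.
```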
6.2 Pruning
A custom LM with 1 million words is slow. Prune the n-grams. Keep only unique jargon. Trust the Generic LM for “the”, “cat”, “is”.
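One simple pruning heuristic, sketched below: keep only n-grams that contain at least one word missing from the generic vocabulary. The `generic_vocab` set is an assumed stand-in for a real lexicon:

```python
# Assumed miniature generic vocabulary for illustration.
generic_vocab = {"administer", "of", "the", "patient", "needs"}

def prune_ngrams(ngram_counts, vocab):
    """Keep only n-grams containing out-of-vocabulary jargon; trust the
    generic LM for n-grams made entirely of common words."""
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if any(word not in vocab for word in ngram.lower().split())
    }

counts = {"administer xarelto": 12, "of the": 900, "the patient": 450}
print(prune_ngrams(counts, generic_vocab))
# {'administer xarelto': 12}
```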
7. Production Deployment: Dynamic Loading
In a SaaS ASR (like Otter.ai):
- User enters a meeting (“Project Apollo Sync”).
- System loads the “Project Apollo” word list (entities: “Apollo”, “Saturn”, “Launch”).
- System compiles a tiny FST on-the-fly (milliseconds).
- Decoder graph = Generic_Graph composed with Dynamic_FST.
This allows per-user customization.
8. Performance Metrics
Entity-WER.
- Global WER might be 5% with or without customization.
- But if the 5% error is the Patient’s Name, the transcript is useless.
- Measure accuracy specifically on the Boosted List.
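A simplified sketch of the metric: error rate restricted to the boosted word list. Real Entity-WER implementations align reference and hypothesis first; this recall-style version just checks whether each boosted reference word survived:

```python
def entity_wer(reference, hypothesis, entities):
    """Fraction of boosted-list words in the reference that are missing
    from the hypothesis (a simplified, alignment-free Entity-WER)."""
    ref_words = reference.lower().split()
    hyp_words = set(hypothesis.lower().split())
    targets = [w for w in ref_words if w in entities]
    if not targets:
        return 0.0
    missed = sum(1 for w in targets if w not in hyp_words)
    return missed / len(targets)

ref = "patient needs xarelto and warfarin"
hyp = "patient needs the real toe and warfarin"
print(entity_wer(ref, hyp, {"xarelto", "warfarin"}))  # 0.5
```

Here the global transcript is mostly right, but half the entities are lost, which is exactly what global WER hides.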
9. Failure Modes
- Over-Biasing:
  - Boost list: ["Call"].
  - User says: “Tall building.”
  - ASR hears: “Call building.”
  - Fix: a tunable biasing_weight parameter.
- Phonetic Confusion:
  - Boost: ["Resume"] (noun).
  - User: “Resume” (verb).
  - ASR gets it right, but downstream NLP gets confused by the tag.
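Tuning that weight can be sketched as a dev-set sweep: pick the value that minimizes a combined error, so neither over-biasing nor under-biasing wins. The evaluation curve below is a toy assumption, not measured data:

```python
def pick_biasing_weight(weights, eval_fn):
    """Return the weight minimizing Entity-WER + global WER on a dev set.
    eval_fn(w) -> (entity_wer, global_wer); here a toy stand-in."""
    return min(weights, key=lambda w: sum(eval_fn(w)))

def toy_eval(w):
    entity_wer = max(0.02, 0.30 - 0.10 * w)   # improves with heavier biasing
    global_wer = 0.05 + 0.02 * w * w          # degrades when over-biased
    return entity_wer, global_wer

print(pick_biasing_weight([0.0, 0.5, 1.0, 2.0, 4.0], toy_eval))  # 2.0
```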
10. Real-World Case Study: Smart Speakers
The Alexa contact list. When you say “Call Mom”, Alexa biases the ASR towards your contacts. It doesn’t boost “Mom” for everyone; it boosts “Mom”, “Dad”, “Arun” for you. This uses Personalized Language Models (PLMs).
11. State-of-the-Art: Neural Biasing
Recent research (Google) uses GNNs (Graph Neural Networks) to encode the relationship between entities in the bias list, handling thousands of entities (e.g., a massive Song Library) without degrading latency.
12. Key Takeaways
- Generic is not enough: Production ASR requires customization.
- Shallow Fusion is cheap: No GPU retraining needed, just statistical counting over text.
- Prompt Engineering works for ASR: Whisper’s prompt feature allows 0-shot adaptation.
- Metric Validity: Optimize for Entity-WER, not just WER.
FAQ
Why do generic ASR models fail on domain-specific vocabulary?
ASR models use the equation P(Text | Audio) proportional to P(Audio | Text) * P(Text). Even when the acoustic model correctly identifies sounds matching “Xarelto”, the generic language model assigns near-zero probability to that term because it was rarely seen in general training data. The decoder then prefers common phrases like “the real toe” that have much higher language model scores, overriding the acoustic evidence.
What is shallow fusion and how does it improve domain-specific ASR?
Shallow fusion trains a small domain-specific n-gram language model on text from the client’s domain (medical manuals, legal documents, corporate communications). During beam search decoding, the system interpolates three scores: the acoustic model, the generic language model, and the custom language model. A tunable weight beta controls how strongly the custom model boosts domain terms, requiring no GPU retraining of the acoustic model.
How does Whisper’s contextual prompting work for domain adaptation?
Whisper accepts an initial_prompt parameter that prepends text to the decoder’s context window, effectively priming the model as if those words were recently spoken. By setting the prompt to domain keywords (e.g., “Medical: Xarelto, Warfarin, Apixaban”), the model’s cross-attention mechanism becomes biased toward those tokens during decoding, significantly increasing the probability of correctly transcribing domain-specific terms with zero additional training.
What is over-biasing in custom language modeling and how do you prevent it?
Over-biasing occurs when the boost weight for custom vocabulary is set too high, causing the ASR to hallucinate boosted words even when the speaker said something different. For example, boosting “Call” might cause “Tall building” to be transcribed as “Call building.” Prevention requires careful tuning of the biasing_weight parameter and measuring accuracy specifically on the boosted word list (Entity-WER) to detect both under-biasing and over-biasing.
Originally published at: arunbaby.com/speech-tech/0050-custom-language-modeling
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch