Character-Level Language Models
“Before machines could write essays, they had to learn to spell.”
TL;DR
Character-level language models predict text one letter at a time, offering infinite vocabulary and zero out-of-vocabulary issues at the cost of 5x longer sequences. RNNs with LSTM cells were the classic approach, using truncated backpropagation to handle long sequences. BPE subword tokenization ultimately won as the Goldilocks approach between character and word-level modeling, balancing manageable sequence lengths with robust vocabulary coverage. Character models remain relevant in speech recognition for spelling correction in noisy transcripts and in specialized domains like biological sequences. For how modern tokenization feeds into production systems, see LLM Serving.

1. Problem Statement
Modern LLMs (GPT-4) operate on tokens (sub-words). But to understand why, we must study the alternatives. Character-Level Modeling is the task of predicting the next character in a sequence.
- Input: `['h', 'e', 'l', 'l']`
- Target: `'o'`
Why build a Char-LM?
- Infinite Vocabulary: No “Unknown Token” (`<UNK>`) issues. It can generate any word.
- Robustness: Handles typos (`"helo"`) and biological sequences (`"ACTG"`) natively.
- Simplicity: Vocab size is ~100 (ASCII), not ~50,000 (GPT-2 BPE).
2. Understanding the Requirements
2.1 The Context Problem
Prediction depends on long-range dependencies.
- “The cat sat on the m…” -> a -> t (local context).
- “I grew up in France… I speak fluent F…” -> r -> e -> … (global context).
A Char-LM must remember history that is 5x longer than a Word-LM (since avg word length is 5 chars). If a sentence is 20 words, the Char-LM sees 100 steps.
2.2 Sparsity vs Density
- One-Hot Encoding: Sparse. `a` is always the same indicator vector `[1, 0, ...]`, with no notion of similarity between characters.
- Embedding: We instead learn a dense vector for ‘a’, capturing nuances like “vowels cluster together”.
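To make the contrast concrete, here is a PyTorch sketch (the vocab and embedding sizes are illustrative):

```python
import torch
import torch.nn as nn

# One-hot vs learned embedding for a single character index.
vocab_size, emb_dim = 100, 16
idx = torch.tensor([1])  # index of some character, e.g. 'a'

# One-hot: a single spike, every character equally far from every other
one_hot = nn.functional.one_hot(idx, num_classes=vocab_size).float()
print(one_hot.sum().item())  # 1.0

# Embedding: a dense, trainable 16-dim vector per character
emb = nn.Embedding(vocab_size, emb_dim)
vec = emb(idx)
print(vec.shape)  # torch.Size([1, 16])
```

During training, gradient descent moves these 16-dim vectors so that characters used in similar contexts end up near each other.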
3. High-Level Architecture
We compare RNN (Recurrent) vs Transformer (Attention).
RNN Style (The Classic):
```
[h] -> [e] -> [l] -> [l]
 |      |      |      |
 v      v      v      v
(S0)-> (S1)-> (S2)-> (S3) -> Predict 'o'
```
State S3 must encode the fact that “we are in the word ‘hello’”.
Transformer Style (Modern):
Input: [h, e, l, l]
Attention: l attends to h, e, l.
Output: Prob(o)
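The “l attends to h, e, l” constraint is enforced with a causal mask. A minimal sketch for the 4-char input above (position i may only attend to positions <= i):

```python
import torch

# Causal (autoregressive) attention mask for the input [h, e, l, l].
seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(mask.int())
# Row 0 ('h') attends only to itself; row 3 (last 'l') attends to all four.
```

In a real Transformer this boolean mask zeroes out (sets to -inf before softmax) all attention scores pointing at future characters.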
4. Component Deep-Dives
4.1 Tokenization Trade-offs
| Strategy | Vocab Size | Sequence Length (for 1000 words) | OOV Risk |
|---|---|---|---|
| Character | ~100 | ~5000 chars | None |
| Word | ~1M | 1000 words | High (Rare names) |
| Subword (BPE) | ~50k | ~1300 tokens | Low |
Why BPE won: It balances the trade-off. It keeps sequence length manageable (for Transformer O(N^2) attention) while handling rare words via characters.
4.2 The Softmax Bottleneck
Predicting 1 out of 100 chars is cheap. Predicting 1 out of 50,000 tokens is expensive (a large matrix multiply at the output layer). Char-LMs are incredibly fast at the final layer, but incredibly slow at inference overall, because they need many more decoding steps.
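A back-of-envelope sketch of the output-layer cost (the hidden size and vocab sizes are illustrative):

```python
# Per-step cost of the final projection: a hidden_size x vocab_size matmul.
hidden_size = 512
char_vocab, bpe_vocab = 100, 50_000

char_cost = hidden_size * char_vocab   # multiply-adds per decoding step
bpe_cost = hidden_size * bpe_vocab
print(bpe_cost // char_cost)  # 500
```

The token model does 500x more work per step at the output layer, but the char model pays that back by needing ~5x more steps, each of which is strictly serial at generation time.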
5. Data Flow: Training Pipeline
- Raw Text: “Hello world”
- Vectorizer: `[H, e, l, l, o, _, w, o, r, l, d]` -> `[8, 5, 12, 12, 15, 0, ...]`
- Windowing: Create `(Input, Target)` pairs:
  - `[8, 5, 12]` -> `12` (“Hel” -> “l”)
  - `[5, 12, 12]` -> `15` (“ell” -> “o”)
- Loss Calculation: Cross Entropy Loss on the prediction.
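The windowing step above can be sketched as a sliding window over the index sequence (the `make_pairs` helper and the toy 1-indexed alphabet coding are illustrative):

```python
# Windowing: slide a fixed-length context over the index sequence,
# pairing each context with the character that follows it.
def make_pairs(indices, window=3):
    return [(indices[i:i + window], indices[i + window])
            for i in range(len(indices) - window)]

# "Hello" under the toy coding used above (H=8, e=5, l=12, o=15)
indices = [8, 5, 12, 12, 15]
print(make_pairs(indices))
# [([8, 5, 12], 12), ([5, 12, 12], 15)]
```

Each pair becomes one training example; cross entropy is then computed on the predicted distribution over the ~100-character vocabulary.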
6. Implementation: RNN Char-LM
```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: [Batch, Seq_Len] (integer indices of chars)
        embeds = self.embedding(x)
        # rnn_out: [Batch, Seq_Len, Hidden]
        rnn_out, hidden = self.rnn(embeds, hidden)
        # Predict next char for EVERY step in the sequence
        logits = self.fc(rnn_out)
        return logits, hidden

    @torch.no_grad()
    def generate(self, start_char, max_len=100):
        # Inference loop. Assumes the global char_to_ix / ix_to_char
        # mappings built during preprocessing.
        curr_char = torch.tensor([[char_to_ix[start_char]]])
        hidden = None
        out = start_char
        for _ in range(max_len):
            logits, hidden = self.forward(curr_char, hidden)
            # Sample from the distribution over the LAST position
            probs = nn.functional.softmax(logits[0, -1], dim=0)
            next_ix = torch.multinomial(probs, 1).item()
            out += ix_to_char[next_ix]
            curr_char = torch.tensor([[next_ix]])
        return out
```
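A one-step training sketch to show how the per-position logits meet the loss. The model here mirrors the CharRNN above but is redefined inline (with random data) so the snippet runs standalone:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size = 27, 32  # illustrative sizes

class TinyCharRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.embedding(x), hidden)
        return self.fc(out), hidden

model = TinyCharRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (4, 10))  # [batch, seq] input indices
y = torch.randint(0, vocab_size, (4, 10))  # next-char targets, same shape

logits, _ = model(x)                        # [batch, seq, vocab]
# CrossEntropyLoss expects [N, C] logits vs [N] targets: flatten batch & time
loss = criterion(logits.reshape(-1, vocab_size), y.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For an untrained model the loss starts near ln(27) ≈ 3.3, the entropy of a uniform guess over the vocabulary.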
7. Scaling Strategies
7.1 Truncated Backpropagation through Time (TBPTT)
You cannot backpropagate through a book with 1 million characters. Gradients vanish or explode.
We process chunks of 100 characters.
Crucial: We pass the hidden state from Chunk 1 to Chunk 2, but we detach the gradient history. The model remembers the context, but doesn’t try to learn across the boundary.
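A minimal sketch of that detach pattern (sizes and the placeholder loss are illustrative):

```python
import torch
import torch.nn as nn

# TBPTT: carry the LSTM state across 100-step chunks, but detach it so
# gradients never flow back across a chunk boundary.
torch.manual_seed(0)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
long_seq = torch.randn(1, 300, 8)  # stand-in for a long character sequence
hidden = None

for start in range(0, 300, 100):
    chunk = long_seq[:, start:start + 100]
    out, hidden = rnn(chunk, hidden)
    loss = out.pow(2).mean()       # placeholder loss for the sketch
    loss.backward()                # gradients stay within this chunk
    # Keep the state's VALUES (context survives) but cut autograd history
    hidden = tuple(h.detach() for h in hidden)
```

Without the `detach`, the second chunk's backward pass would try to traverse the first chunk's already-freed graph and the cost of backprop would grow with every chunk.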
8. Failure Modes
- Hallucinated Words: “The quxi jumped over…”. Since it spells letter-by-letter, it can invent non-existent words that “sound” pronounceable.
- Incoherent Grammar: It closes parentheses `)` that were never opened `(`. LSTMs struggled with this kind of counting; Transformers fixed it.
9. Real-World Case Study: Andrej Karpathy’s char-rnn
The famous blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” trained a Char-RNN on:
- Shakespeare: Resulted in fake plays.
- Linux Kernel Code: Resulted in C code that almost compiled (including comments and indentation). This proved that neural nets learn Syntactic Structure just from statistical co-occurrence.
10. Connections to ML Systems
This connects to Custom Language Modeling in Speech.
- ASR systems use Char-LMs to correct spelling in noisy transcripts.
- If ASR hears “Helo”, the Char-LM says “l followed by o is unlikely after He, it should be ‘ll’”.
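That kind of judgment falls out of simple character statistics. A toy character-bigram LM sketch (the tiny corpus, the `p_next` helper, and the smoothing constants are illustrative; real ASR rescorers use neural char-LMs):

```python
from collections import Counter

# Toy character-bigram LM for rescoring ASR hypotheses.
corpus = "hello hello help hell shell yellow"
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p_next(prev, nxt, alpha=0.5, vocab=27):
    # Additive smoothing so unseen bigrams keep nonzero probability
    return (bigrams[(prev, nxt)] + alpha) / (context[prev] + alpha * vocab)

# After "hel", is the next char more likely 'l' (-> "hell...") or 'o' (-> "helo")?
print(p_next("l", "l") > p_next("l", "o"))  # True: the LM prefers doubling the l
```

The bigram counts in this corpus make `ll` more probable than `lo` after an `l`, so the rescorer nudges “helo” toward “hello”.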
11. Cost Analysis
Training: Cheap. A Char-RNN trains on a laptop CPU in minutes. Inference: Expensive.
- To generate a 1000-word essay (~5000 chars), you run the model 5000 times (serially).
- A Token-LM runs ~1300 times (per the BPE estimate in Section 4.1).
- This roughly 4x latency penalty is a key reason Char-LMs are not used for Chatbots.
12. Key Takeaways
- Granularity Matters: Breaking text down to atoms (chars) simplifies vocabulary but complicates structure learning.
- Embeddings: Even characters need embeddings. ‘A’ and ‘a’ should be close vectors.
- Subword Dominance: BPE won because it is the “Goldilocks” zone: short sequences, manageable vocab.
FAQ
What are the advantages of character-level language models?
Character-level models have an infinite vocabulary with no unknown token issues, handle typos and novel words natively, and use a tiny vocabulary of roughly 100 characters. This makes them robust for specialized domains like biological sequences (ACTG) and useful for spell correction in ASR systems where noisy transcripts need character-level reasoning.
Why did subword tokenization like BPE replace character-level models?
BPE balances the trade-off between sequence length and vocabulary size. Character models require 5x longer sequences (5000 characters for a 1000-word essay), which is prohibitively expensive for Transformer attention at O(N-squared). BPE keeps sequences at around 1300 tokens per 1000 words with a manageable 50K vocabulary and low out-of-vocabulary risk.
How does truncated backpropagation through time work for character models?
Instead of backpropagating through millions of characters, the model processes chunks of around 100 characters. The hidden state is passed from chunk to chunk so the model remembers context, but gradient history is detached at chunk boundaries to prevent vanishing or exploding gradients. The model remembers but does not try to learn across boundaries.
Why are character-level models not used for chatbots?
Generating a 1000-word essay requires running the model roughly 5000 times serially at the character level versus roughly 1300 times at the token level. This roughly 4x latency penalty makes character models impractical for interactive applications where users expect sub-second responses.
Originally published at: arunbaby.com/ml-system-design/0050-character-level-language-models