Character-Level Language Models
“Before machines could write essays, they had to learn to spell.”
TL;DR
Character-level language models predict text one letter at a time, offering infinite vocabulary and zero out-of-vocabulary issues at the cost of 5x longer sequences. RNNs with LSTM cells were the classic approach, using truncated backpropagation to handle long sequences. BPE subword tokenization ultimately won as the Goldilocks approach between character and word-level modeling, balancing manageable sequence lengths with robust vocabulary coverage. Character models remain relevant in speech recognition for spelling correction in noisy transcripts and in specialized domains like biological sequences. For how modern tokenization feeds into production systems, see LLM Serving.

1. Problem Statement
Modern LLMs (GPT-4) operate on tokens (sub-words). But to understand why, we must study the alternatives. Character-Level Modeling is the task of predicting the next character in a sequence.
- Input: `['h', 'e', 'l', 'l']`
- Target: `'o'`
Why build a Char-LM?
- Infinite Vocabulary: No “Unknown Token” (`<UNK>`) issues. It can generate any word.
- Robustness: Handles typos (`"helo"`) and biological sequences (`"ACTG"`) natively.
- Simplicity: Vocab size is ~100 (ASCII), not ~50,000 (GPT-2 BPE).
2. Understanding the Requirements
2.1 The Context Problem
Prediction depends on long-range dependencies.
- “The cat sat on the m…” -> a -> t (local context).
- “I grew up in France… I speak fluent F…” -> r -> e -> … (global context).
A Char-LM must remember history that is 5x longer than a Word-LM (since avg word length is 5 chars). If a sentence is 20 words, the Char-LM sees 100 steps.
2.2 Sparsity vs Density
- One-Hot Encoding: Sparse. `a` is always the same indicator vector `[1, 0, ...]`, with no notion of similarity between characters.
- Embedding: We instead learn a dense vector for ‘a’, capturing nuances like “vowels cluster together”.
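To make the contrast concrete, here is a PyTorch sketch (the vocab and embedding sizes are illustrative):

```python
import torch
import torch.nn as nn

# One-hot vs learned embedding for a single character index.
vocab_size, emb_dim = 100, 16
idx = torch.tensor([1])  # index of some character, e.g. 'a'

# One-hot: a single spike, every character equally far from every other
one_hot = nn.functional.one_hot(idx, num_classes=vocab_size).float()
print(one_hot.sum().item())  # 1.0

# Embedding: a dense, trainable 16-dim vector per character
emb = nn.Embedding(vocab_size, emb_dim)
vec = emb(idx)
print(vec.shape)  # torch.Size([1, 16])
```

During training, gradient descent moves these 16-dim vectors so that characters used in similar contexts end up near each other.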
3. High-Level Architecture
We compare RNN (Recurrent) vs Transformer (Attention).
RNN Style (The Classic):
```
[h] -> [e] -> [l] -> [l]
 |      |      |      |
 v      v      v      v
(S0)-> (S1)-> (S2)-> (S3) -> Predict 'o'
```
State S3 must encode the fact that “we are in the word ‘hello’”.
Transformer Style (Modern):
Input: [h, e, l, l]
Attention: l attends to h, e, l.
Output: Prob(o)
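The “l attends to h, e, l” constraint is enforced with a causal mask. A minimal sketch for the 4-char input above (position i may only attend to positions <= i):

```python
import torch

# Causal (autoregressive) attention mask for the input [h, e, l, l].
seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(mask.int())
# Row 0 ('h') attends only to itself; row 3 (last 'l') attends to all four.
```

In a real Transformer this boolean mask zeroes out (sets to -inf before softmax) all attention scores pointing at future characters.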
4. Component Deep-Dives
4.1 Tokenization Trade-offs
| Strategy | Vocab Size | Sequence Length (for 1000 words) | OOV Risk |
|---|---|---|---|
| Character | ~100 | ~5000 chars | None |
| Word | ~1M | 1000 words | High (Rare names) |
| Subword (BPE) | ~50k | ~1300 tokens | Low |
Why BPE won: It balances the trade-off. It keeps sequence length manageable (for Transformer O(N^2) attention) while handling rare words via characters.
4.2 The Softmax Bottleneck
Predicting 1 out of 100 chars is cheap. Predicting 1 out of 50,000 tokens is expensive (a large matrix multiply at the output layer). Char-LMs are incredibly fast at the final layer, but incredibly slow at inference overall, because they need many more decoding steps.
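A back-of-envelope sketch of the output-layer cost (the hidden size and vocab sizes are illustrative):

```python
# Per-step cost of the final projection: a hidden_size x vocab_size matmul.
hidden_size = 512
char_vocab, bpe_vocab = 100, 50_000

char_cost = hidden_size * char_vocab   # multiply-adds per decoding step
bpe_cost = hidden_size * bpe_vocab
print(bpe_cost // char_cost)  # 500
```

The token model does 500x more work per step at the output layer, but the char model pays that back by needing ~5x more steps, each of which is strictly serial at generation time.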
5. Data Flow: Training Pipeline
- Raw Text: “Hello world”
- Vectorizer: `[H, e, l, l, o, _, w, o, r, l, d]` -> `[8, 5, 12, 12, 15, 0, ...]`
- Windowing: Create `(Input, Target)` pairs:
  - `[8, 5, 12]` -> `12` (“Hel” -> “l”)
  - `[5, 12, 12]` -> `15` (“ell” -> “o”)
- Loss Calculation: Cross Entropy Loss on the prediction.
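The windowing step above can be sketched as a sliding window over the index sequence (the `make_pairs` helper and the toy 1-indexed alphabet coding are illustrative):

```python
# Windowing: slide a fixed-length context over the index sequence,
# pairing each context with the character that follows it.
def make_pairs(indices, window=3):
    return [(indices[i:i + window], indices[i + window])
            for i in range(len(indices) - window)]

# "Hello" under the toy coding used above (H=8, e=5, l=12, o=15)
indices = [8, 5, 12, 12, 15]
print(make_pairs(indices))
# [([8, 5, 12], 12), ([5, 12, 12], 15)]
```

Each pair becomes one training example; cross entropy is then computed on the predicted distribution over the ~100-character vocabulary.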
6. Implementation: RNN Char-LM
```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: [Batch, Seq_Len] (integer indices of chars)
        embeds = self.embedding(x)
        # rnn_out: [Batch, Seq_Len, Hidden]
        rnn_out, hidden = self.rnn(embeds, hidden)
        # Predict next char for EVERY step in the sequence
        logits = self.fc(rnn_out)
        return logits, hidden

    @torch.no_grad()
    def generate(self, start_char, max_len=100):
        # Inference loop. Assumes the global char_to_ix / ix_to_char
        # mappings built during preprocessing.
        curr_char = torch.tensor([[char_to_ix[start_char]]])
        hidden = None
        out = start_char
        for _ in range(max_len):
            logits, hidden = self.forward(curr_char, hidden)
            # Sample from the distribution over the LAST position
            probs = nn.functional.softmax(logits[0, -1], dim=0)
            next_ix = torch.multinomial(probs, 1).item()
            out += ix_to_char[next_ix]
            curr_char = torch.tensor([[next_ix]])
        return out
```
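A one-step training sketch to show how the per-position logits meet the loss. The model here mirrors the CharRNN above but is redefined inline (with random data) so the snippet runs standalone:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size = 27, 32  # illustrative sizes

class TinyCharRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.embedding(x), hidden)
        return self.fc(out), hidden

model = TinyCharRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (4, 10))  # [batch, seq] input indices
y = torch.randint(0, vocab_size, (4, 10))  # next-char targets, same shape

logits, _ = model(x)                        # [batch, seq, vocab]
# CrossEntropyLoss expects [N, C] logits vs [N] targets: flatten batch & time
loss = criterion(logits.reshape(-1, vocab_size), y.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For an untrained model the loss starts near ln(27) ≈ 3.3, the entropy of a uniform guess over the vocabulary.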
7. Scaling Strategies
7.1 Truncated Backpropagation through Time (TBPTT)
You cannot backpropagate through a book with 1 million characters. Gradients vanish or explode.
We process chunks of 100 characters.
Crucial: We pass the hidden state from Chunk 1 to Chunk 2, but we detach the gradient history. The model remembers the context, but doesn’t try to learn across the boundary.
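A minimal sketch of that detach pattern (sizes and the placeholder loss are illustrative):

```python
import torch
import torch.nn as nn

# TBPTT: carry the LSTM state across 100-step chunks, but detach it so
# gradients never flow back across a chunk boundary.
torch.manual_seed(0)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
long_seq = torch.randn(1, 300, 8)  # stand-in for a long character sequence
hidden = None

for start in range(0, 300, 100):
    chunk = long_seq[:, start:start + 100]
    out, hidden = rnn(chunk, hidden)
    loss = out.pow(2).mean()       # placeholder loss for the sketch
    loss.backward()                # gradients stay within this chunk
    # Keep the state's VALUES (context survives) but cut autograd history
    hidden = tuple(h.detach() for h in hidden)
```

Without the `detach`, the second chunk's backward pass would try to traverse the first chunk's already-freed graph and the cost of backprop would grow with every chunk.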
8. Failure Modes
- Hallucinated Words: “The quxi jumped over…”. Since it spells letter-by-letter, it can invent non-existent words that “sound” pronounceable.
- Incoherent Grammar: It closes parentheses `)` that were never opened `(`. LSTMs struggled with this kind of counting; Transformers fixed it.
9. Real-World Case Study: Andrej Karpathy’s char-rnn
The famous blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” trained a Char-RNN on:
- Shakespeare: Resulted in fake plays.
- Linux Kernel Code: Resulted in C code that almost compiled (including comments and indentation). This proved that neural nets learn Syntactic Structure just from statistical co-occurrence.
10. Connections to ML Systems
This connects to Custom Language Modeling in Speech.
- ASR systems use Char-LMs to correct spelling in noisy transcripts.
- If ASR hears “Helo”, the Char-LM says “l followed by o is unlikely after He, it should be ‘ll’”.
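That kind of judgment falls out of simple character statistics. A toy character-bigram LM sketch (the tiny corpus, the `p_next` helper, and the smoothing constants are illustrative; real ASR rescorers use neural char-LMs):

```python
from collections import Counter

# Toy character-bigram LM for rescoring ASR hypotheses.
corpus = "hello hello help hell shell yellow"
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p_next(prev, nxt, alpha=0.5, vocab=27):
    # Additive smoothing so unseen bigrams keep nonzero probability
    return (bigrams[(prev, nxt)] + alpha) / (context[prev] + alpha * vocab)

# After "hel", is the next char more likely 'l' (-> "hell...") or 'o' (-> "helo")?
print(p_next("l", "l") > p_next("l", "o"))  # True: the LM prefers doubling the l
```

The bigram counts in this corpus make `ll` more probable than `lo` after an `l`, so the rescorer nudges “helo” toward “hello”.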
11. Cost Analysis
Training: Cheap. A Char-RNN trains on a laptop CPU in minutes. Inference: Expensive.
- To generate a 1000-word essay (~5000 chars), you run the model 5000 times (serially).
- A Token-LM runs ~1300 times (per the BPE estimate in Section 4.1).
- This roughly 4x latency penalty is a key reason Char-LMs are not used for Chatbots.
12. Key Takeaways
- Granularity Matters: Breaking text down to atoms (chars) simplifies vocabulary but complicates structure learning.
- Embeddings: Even characters need embeddings. ‘A’ and ‘a’ should be close vectors.
- Subword Dominance: BPE won because it is the “Goldilocks” zone: short sequences, manageable vocab.
FAQ
What are the advantages of character-level language models?
Character-level models have an infinite vocabulary with no unknown token issues, handle typos and novel words natively, and use a tiny vocabulary of roughly 100 characters. This makes them robust for specialized domains like biological sequences (ACTG) and useful for spell correction in ASR systems where noisy transcripts need character-level reasoning.
Why did subword tokenization like BPE replace character-level models?
BPE balances the trade-off between sequence length and vocabulary size. Character models require 5x longer sequences (5000 characters for a 1000-word essay), which is prohibitively expensive for Transformer attention at O(N-squared). BPE keeps sequences at around 1300 tokens per 1000 words with a manageable 50K vocabulary and low out-of-vocabulary risk.
How does truncated backpropagation through time work for character models?
Instead of backpropagating through millions of characters, the model processes chunks of around 100 characters. The hidden state is passed from chunk to chunk so the model remembers context, but gradient history is detached at chunk boundaries to prevent vanishing or exploding gradients. The model remembers but does not try to learn across boundaries.
Why are character-level models not used for chatbots?
Generating a 1000-word essay requires running the model roughly 5000 times serially at the character level versus roughly 1300 times at the token level. This roughly 4x latency penalty makes character models impractical for interactive applications where users expect sub-second responses.
Originally published at: arunbaby.com/ml-system-design/0050-character-level-language-models