Character-Level Language Models
“Before machines could write essays, they had to learn to spell.”
1. Problem Statement
Modern LLMs (e.g., GPT-4) operate on tokens (sub-words). But to understand why, we must study the alternatives. Character-Level Modeling is the task of predicting the next character in a sequence.
- Input: ['h', 'e', 'l', 'l']
- Target: 'o'
Why build a Char-LM?
- Infinite Vocabulary: No "Unknown Token" (<UNK>) issues. It can generate any word.
- Robustness: Handles typos ("helo") and biological sequences ("ACTG") natively.
- Simplicity: Vocab size is ~100 (ASCII), not ~50,000 (GPT-2 BPE).
2. Understanding the Requirements
2.1 The Context Problem
Prediction depends on long-range dependencies.
- "The cat sat on the m…" -> 'a' -> 't' (local context: finishing "mat").
- "I grew up in France… I speak fluent F…" -> 'r' -> 'e' -> … (global context: finishing "French" requires remembering "France").
A Char-LM must remember history that is roughly 5x longer than a Word-LM's (average word length is about 5 characters). If a sentence is 20 words, the Char-LM sees ~100 steps.
2.2 Sparsity vs Density
- One-Hot Encoding: With only ~100 characters, a one-hot vector is tiny ('a' is always [1, 0, ...]), but it carries no notion of similarity between characters.
- Embedding: We still learn a dense vector for 'a', capturing nuances like "vowels cluster together".
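A minimal sketch of the two representations (the 97-character vocabulary and 16-dimensional embedding are assumed sizes):

```python
import torch
import torch.nn as nn

vocab_size = 97   # assumed: roughly the printable ASCII set
embed_dim = 16    # assumed embedding width

# One-hot: a 97-dim vector that is all zeros except one position.
a_index = torch.tensor([0])  # pretend 'a' maps to index 0
one_hot = nn.functional.one_hot(a_index, vocab_size).float()  # shape [1, 97]

# Learned embedding: a small dense vector, trained so that similar
# characters (e.g. vowels) can end up near each other.
embedding = nn.Embedding(vocab_size, embed_dim)
dense = embedding(a_index)  # shape [1, 16]

print(one_hot.shape, dense.shape)  # torch.Size([1, 97]) torch.Size([1, 16])
```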
3. High-Level Architecture
We compare RNN (Recurrent) vs Transformer (Attention).
RNN Style (The Classic):
```
[h] -> [e] -> [l] -> [l]
 |      |      |      |
 v      v      v      v
(S0)-> (S1)-> (S2)-> (S3) -> Predict 'o'
```
State S3 must encode "we are in the word 'hello'" in order to predict 'o'.
Transformer Style (Modern):
Input: [h, e, l, l]
Attention: l attends to h, e, l.
Output: Prob(o)
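A minimal sketch of causal self-attention over these four characters, using torch.nn.MultiheadAttention (the index values, model width, and single head are assumptions):

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 97, 32, 4   # assumed sizes
chars = torch.tensor([[7, 4, 11, 11]])     # pretend indices of h, e, l, l

embed = nn.Embedding(vocab_size, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

x = embed(chars)  # [1, 4, 32]
# Causal mask: True above the diagonal = "may not attend to future positions".
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, weights = attn(x, x, x, attn_mask=causal)

# The last 'l' attends only to h, e, l and itself -- never to future characters.
print(weights[0, -1])  # 4 attention weights that sum to 1
```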
4. Component Deep-Dives
4.1 Tokenization Trade-offs
| Strategy | Vocab Size | Sequence Length (for 1000 words) | OOV Risk |
|---|---|---|---|
| Character | ~100 | ~5000 chars | None |
| Word | ~1M | 1000 words | High (Rare names) |
| Subword (BPE) | ~50k | ~1300 tokens | Low |
Why BPE won: It balances the trade-off. It keeps sequence length manageable (important for the Transformer's $O(N^2)$ attention) while handling rare words by falling back to smaller pieces, down to single characters.
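A rough way to see the sequence-length side of the trade-off (character split vs. a crude whitespace split; a real BPE tokenizer such as GPT-2's would land around 1.3 tokens per word):

```python
text = "The quick brown fox jumps over the lazy dog. " * 100  # assumed sample text

char_tokens = list(text)    # character-level: one token per character
word_tokens = text.split()  # crude word-level: split on whitespace

print(len(char_tokens), len(word_tokens))  # 4500 vs 900
# The character sequence is 5x longer; since self-attention cost grows as
# O(N^2), attending over it is roughly 25x more expensive.
```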
4.2 The Softmax Bottleneck
Predicting 1 out of ~100 chars is cheap. Predicting 1 out of 50,000 tokens is expensive (a large matrix multiply at the final layer). Char-LMs are therefore very fast at the output layer, but slow at inference overall, because they need far more generation steps for the same amount of text.
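A back-of-the-envelope comparison of that final projection (the hidden size is an assumed value):

```python
hidden = 1024                          # assumed hidden size
char_vocab, bpe_vocab = 100, 50_000

# The output layer is a [hidden x vocab] matrix multiply per generated unit.
char_flops_per_step = 2 * hidden * char_vocab  # ~0.2 MFLOPs
bpe_flops_per_step = 2 * hidden * bpe_vocab    # ~100 MFLOPs

# But the char model needs roughly 4x more steps for the same text
# (~5 chars vs ~1.3 BPE tokens per word), so the saving is only at this layer.
print(char_flops_per_step, bpe_flops_per_step)
```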
5. Data Flow: Training Pipeline
- Raw Text: "Hello world"
- Vectorizer: [H, e, l, l, o, _, w, o, r, l, d] -> [8, 5, 12, 12, 15, 0, ...]
- Windowing: Create (Input, Target) pairs, e.g. [8, 5, 12] -> 12 ("Hel" -> "l") and [5, 12, 12] -> 15 ("ell" -> "o")
- Loss Calculation: Cross-Entropy Loss on each prediction.
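A minimal sketch of this pipeline (the window size of 3 is an assumption; the vocabulary is built from the text itself):

```python
import torch
import torch.nn.functional as F

text = "Hello world"
chars = sorted(set(text))                        # build the character vocabulary
char_to_ix = {c: i for i, c in enumerate(chars)}

ids = torch.tensor([char_to_ix[c] for c in text])  # vectorize

window = 3                                         # assumed context length
inputs = torch.stack([ids[i:i + window] for i in range(len(ids) - window)])
targets = ids[window:]                             # the next char after each window

# With model logits of shape [num_windows, vocab_size], training would use:
#   loss = F.cross_entropy(logits, targets)
print(inputs.shape, targets.shape)  # torch.Size([8, 3]) torch.Size([8])
```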
6. Implementation: RNN Char-LM
```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: [Batch, Seq_Len] (indices of chars)
        embeds = self.embedding(x)
        # rnn_out: [Batch, Seq_Len, Hidden]
        rnn_out, hidden = self.rnn(embeds, hidden)
        # Predict the next char for EVERY step in the sequence
        logits = self.fc(rnn_out)
        return logits, hidden

    @torch.no_grad()
    def generate(self, start_char, char_to_ix, ix_to_char, max_len=100):
        # Inference loop: feed one character at a time, carrying the hidden state
        curr_char = torch.tensor([[char_to_ix[start_char]]])
        hidden = None
        out = start_char
        for _ in range(max_len):
            logits, hidden = self(curr_char, hidden)
            probs = nn.functional.softmax(logits[0, -1], dim=0)
            next_ix = torch.multinomial(probs, 1).item()
            out += ix_to_char[next_ix]
            curr_char = torch.tensor([[next_ix]])
        return out
```
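A minimal usage sketch, reusing the char_to_ix map and the windowed inputs / targets tensors from the Section 5 sketch (the hyperparameters and the tiny training loop are illustrative, not tuned):

```python
import torch
import torch.nn.functional as F

ix_to_char = {i: c for c, i in char_to_ix.items()}

model = CharRNN(vocab_size=len(char_to_ix), hidden_size=128)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)

for step in range(200):           # tiny demo loop
    logits, _ = model(inputs)     # [num_windows, window, vocab]
    # Train on the prediction at the last position of each window.
    loss = F.cross_entropy(logits[:, -1, :], targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model.generate("H", char_to_ix, ix_to_char, max_len=50))
```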
7. Scaling Strategies
7.1 Truncated Backpropagation through Time (TBPTT)
You cannot backpropagate through a book of 1 million characters in one pass: gradients vanish or explode, and storing every activation exhausts memory.
We instead process chunks of, say, 100 characters.
Crucial: We pass the hidden state from Chunk 1 to Chunk 2, but we detach its gradient history. The model keeps the context, yet never backpropagates across the chunk boundary.
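A minimal TBPTT sketch using the CharRNN above (the chunk length, the model and optimizer, and the long_ids index tensor are assumed to exist):

```python
import torch
import torch.nn.functional as F

chunk_len = 100
hidden = None  # carried across chunks
# long_ids: a [1, very_long] tensor of character indices (assumed)

for start in range(0, long_ids.size(1) - chunk_len, chunk_len):
    x = long_ids[:, start : start + chunk_len]          # inputs
    y = long_ids[:, start + 1 : start + chunk_len + 1]  # targets = next chars

    logits, hidden = model(x, hidden)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the state (context) but cut the gradient history at the boundary.
    hidden = tuple(h.detach() for h in hidden)  # LSTM state is a (h, c) tuple
```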
8. Failure Modes
- Hallucinated Words: "The quxi jumped over…" Since it spells letter-by-letter, it can invent non-existent words that "sound" pronounceable.
- Incoherent Grammar: It closes parentheses ")" that were never opened "(". LSTMs struggled with this kind of counting; Transformers largely fixed it.
9. Real-World Case Study: Andrej Karpathy's char-rnn
The famous blog post “The Unreasonable Effectiveness of Recurrent Neural Networks” trained a Char-RNN on:
- Shakespeare: Resulted in fake plays.
- Linux Kernel Code: Resulted in C code that looked almost compilable (including comments and indentation).
This showed that neural nets learn Syntactic Structure just from statistical co-occurrence.
10. Connections to ML Systems
This connects to Custom Language Modeling in Speech (Speech Day 50).
- ASR systems use Char-LMs to correct spelling in noisy transcripts.
- If the ASR hears "Helo", the Char-LM flags that a single 'l' followed by 'o' is unlikely after "He"; it should be "ll".
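A minimal sketch of such rescoring: score each candidate transcript by its total character log-likelihood under the CharRNN and keep the higher-scoring spelling (the helper function and candidate strings are illustrative, not a real ASR API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def char_log_likelihood(model, text, char_to_ix):
    # Sum of log P(next char | previous chars) over the string.
    ids = torch.tensor([[char_to_ix[c] for c in text]])
    logits, _ = model(ids[:, :-1])            # predictions for chars 2..N
    log_probs = F.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

# A trained Char-LM should assign a higher score to the correct spelling.
for candidate in ["Helo world", "Hello world"]:
    print(candidate, char_log_likelihood(model, candidate, char_to_ix))
```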
11. Cost Analysis
Training: Cheap. A Char-RNN trains on a laptop CPU in minutes.
Inference: Expensive.
- To generate a 1000-word essay (5000 chars), you run the model 5000 times (serial).
- A Token-LM runs ~1,300 times (about 1.3 BPE tokens per word, per the table above).
- This roughly 4x latency penalty is why Char-LMs are not used for Chatbots.
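The arithmetic behind that estimate (the per-word rates are the same rough assumptions used throughout this post):

```python
words = 1000
chars_per_word = 5      # ignoring spaces
tokens_per_word = 1.3   # typical BPE rate (see the table in Section 4.1)

char_steps = words * chars_per_word           # 5000 serial forward passes
token_steps = round(words * tokens_per_word)  # ~1300 serial forward passes

print(char_steps, token_steps, round(char_steps / token_steps, 1))  # 5000 1300 3.8
```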
12. Key Takeaways
- Granularity Matters: Breaking text down to atoms (chars) simplifies vocabulary but complicates structure learning.
- Embeddings: Even characters need embeddings. ‘A’ and ‘a’ should be close vectors.
- Subword Dominance: BPE won because it is the “Goldilocks” zone—short sequences, manageable vocab.
Originally published at: arunbaby.com/ml-system-design/0050-character-level-language-models