Alien Dictionary
“Language is just a graph of symbols. If you know the order, you know the language.”
TL;DR
Alien Dictionary reconstructs an unknown alphabet ordering from a sorted word list. Compare adjacent words to extract ordering rules from the first differing character, build a directed graph, and run Kahn’s topological sort. Watch for the prefix edge case where a longer word incorrectly precedes its own prefix. This builds directly on the Course Schedule topological sort pattern and relates to character-level language modeling in ML.
1. Problem Statement
This is a legendary Facebook interview question. It combines String processing with Graph Theory.
There is a new alien language that uses the English alphabet. However, the order among the letters is unknown to you.
You are given a list of strings words from the alien language’s dictionary, where the strings in words are sorted lexicographically by the rules of this new language.
Return a string of the unique letters in the new alien language sorted in lexicographically increasing order by the new language’s rules. If there is no solution, return "".
Example 1:
- Input:
words = ["wrt","wrf","er","ett","rftt"] - Output:
"wertf" - Explanation:
wrtvswrf:tcomes beforef. (t -> f)wrfvser:wcomes beforee. (w -> e)ervsett:rcomes beforet. (r -> t)ettvsrftt:ecomes beforer. (e -> r)- Combined:
w -> e -> r -> t -> f.
2. Understanding the Problem
2.1 The Lexicographical Comparison
In a sorted dictionary (English):
- “Apple” comes before “Approve”.
- Why? First mismatch.
lvsr.l<r. - The characters after the first mismatch don’t matter!
Algorithm: To find the rules, we only need to compare adjacent words in the list.
Comparing word[0] vs word[1] gives us one rule.
Comparing word[0] vs word[99] gives us redundant info (transitive property).
2.2 Graph Representation
- Nodes: Unique characters in the list.
- Edges:
u -> vmeansucomes beforevin the alphabet. - Task: Find a Topological Sort of this graph.
3. Approach 1: Graph Construction + Kahn’s Algo
We break this into three phases.
Phase 1: Initialize
Scan every word to find all unique characters. Initialize Indegree = {c: 0} and Adj = {c: []}.
Phase 2: Build Edges
Iterate words from 0 to N-1.
Compare w1 and w2.
- Find first index
jwherew1[j] != w2[j]. - Add edge
w1[j] -> w2[j]. - Increment Indegree of
w2[j]. - Break. (Don’t check further characters).
Edge Case: “Prefix Problem”.
If w1 = "abc" and w2 = "ab".
In a valid dictionary, prefix (“ab”) must come BEFORE “abc”.
If “abc” comes before “ab”, the input is invalid. Return "".
Phase 3: Topological Sort
Run Kahn’s Algorithm (BFS) . If result length < num unique chars, there is a cycle (Invalid).
4. Implementation
from collections import deque, defaultdict
class Solution:
def alienOrder(self, words: list[str]) -> str:
# 1. Initialize data structures
adj = defaultdict(set)
in_degree = {c: 0 for w in words for c in w}
# 2. Build Graph
for i in range(len(words) - 1):
w1, w2 = words[i], words[i+1]
min_len = min(len(w1), len(w2))
# Check for prefix error ("abc" before "ab")
if len(w1) > len(w2) and w1[:min_len] == w2[:min_len]:
return ""
for j in range(min_len):
if w1[j] != w2[j]:
if w2[j] not in adj[w1[j]]:
adj[w1[j]].add(w2[j])
in_degree[w2[j]] += 1
break # Only the first mismatch matters
# 3. Topological Sort (Kahn's)
queue = deque([c for c in in_degree if in_degree[c] == 0])
result = []
while queue:
char = queue.popleft()
result.append(char)
for neighbor in adj[char]:
in_degree[neighbor] -= 1
if in_degree[neighbor] == 0:
queue.append(neighbor)
# 4. Cycle Check
if len(result) < len(in_degree):
return "" # Cycle detected
return "".join(result)
5. Testing Strategy
Test Case 1: Simple Line
Input: ["z", "x"].
zvsx. Edgez -> x.- Queue:
[z]. Resultzx.
Test Case 2: Cycle
Input: ["z", "x", "z"].
z -> x.x -> z.- Cycle. Queue empties early. Result length < 3. Return
"".
Test Case 3: Multiple Components
Input: ["a", "b", "c", "d"] (where a->b and c->d, but no link between systems).
- Result order between disconnected components doesn’t matter (e.g.,
ab cdoracbd). - Any valid topo sort is accepted.
6. Complexity Analysis
Let C be total length of all words (total characters).
Let U be number of unique characters (at most 26).
- Time: O(C).
- We scan the words list. Comparing two words takes O(L). Total edge building is O(C).
- Topo sort takes O(U + E). Since
U \le 26, this is constant time relative to input size! - Dominant term is scanning the input.
- Space: O(1) or O(U + E). Since
Uis fixed at 26, the graph is tiny.
7. Production Considerations
This logic is used in Versioning Systems (SemVer). “Determine if v1.10.2 > v1.2.9”. It parses the string into segments and performs the exact same “First Mismatch” logic used here.
8. Connections to ML Systems
This relates to Character-Level LMs (ML ).
- In Alien Dictionary, we infer the rules of the language from raw data.
- In Char-RNN, the model infers the probability
P(char_t | char_{t-1}). - Both are “Language Modeling” problems at the atomic level.
9. Interview Strategy
- Don’t skip the Prefix Check: This is the most common reason to fail. Mention
abcvsabexplicitly. - Explain “Set” for Edges: Using
listfor adjacency can lead to duplicate edges if the same rule appears twice (wrt, wrfand laterart, arf). Use aSetor check before adding. - Variable Names: Use
adjandin_degree. Clear naming helps debugging.
10. Key Takeaways
- Information Theory: You only get information from the first difference. Everything after is noise.
- Graph Construction: Sometimes the graph isn’t given; you have to build it by observing constraints.
- Cycles: Order means “No Cycles”. Always cycle check.
FAQ
How do you determine character order from an alien dictionary?
Compare adjacent words in the sorted list and find the first position where characters differ. This difference gives one ordering rule as a directed edge. Build a graph from all such rules and run topological sort to produce the full alphabet ordering.
Why do you only compare adjacent words in the alien dictionary?
Adjacent word comparisons give direct ordering rules. Comparing non-adjacent words yields redundant information due to the transitive property. You only extract one edge per adjacent pair from the first mismatched character position.
What is the prefix edge case in Alien Dictionary?
If a longer word like “abc” appears before its prefix “ab” in the sorted list, the input is invalid because in a valid dictionary the prefix must always come first. Detect this by checking if the longer word starts with the shorter word and return an empty string.
What is the time complexity of the Alien Dictionary solution?
O(C) where C is the total number of characters across all words. The graph construction pass over all words dominates. The topological sort operates on at most 26 characters, making it effectively constant time relative to input size.
Originally published at: arunbaby.com/dsa/0050-alien-dictionary