7 minute read

“Language is just a graph of symbols. If you know the order, you know the language.”

TL;DR

Alien Dictionary reconstructs an unknown alphabet ordering from a sorted word list. Compare adjacent words to extract ordering rules from the first differing character, build a directed graph, and run Kahn’s topological sort. Watch for the prefix edge case where a longer word incorrectly precedes its own prefix. This builds directly on the Course Schedule topological sort pattern and relates to character-level language modeling in ML.

1. Problem Statement

This is a legendary Facebook interview question. It combines String processing with Graph Theory.

There is a new alien language that uses the English alphabet. However, the order among the letters is unknown to you. You are given a list of strings words from the alien language’s dictionary, where the strings in words are sorted lexicographically by the rules of this new language.

Return a string of the unique letters in the new alien language sorted in lexicographically increasing order by the new language’s rules. If there is no solution, return "".

Example 1:

  • Input: words = ["wrt","wrf","er","ett","rftt"]
  • Output: "wertf"
  • Explanation:
  • wrt vs wrf: t comes before f. (t -> f)
  • wrf vs er: w comes before e. (w -> e)
  • er vs ett: r comes before t. (r -> t)
  • ett vs rftt: e comes before r. (e -> r)
  • Combined: w -> e -> r -> t -> f.

2. Understanding the Problem

2.1 The Lexicographical Comparison

In a sorted dictionary (English):

  • “Apple” comes before “Approve”.
  • Why? First mismatch. l vs r. l < r.
  • The characters after the first mismatch don’t matter!

Algorithm: To find the rules, we only need to compare adjacent words in the list. Comparing word[0] vs word[1] gives us one rule. Comparing word[0] vs word[99] gives us redundant info (transitive property).

2.2 Graph Representation

  • Nodes: Unique characters in the list.
  • Edges: u -> v means u comes before v in the alphabet.
  • Task: Find a Topological Sort of this graph.

3. Approach 1: Graph Construction + Kahn’s Algo

We break this into three phases.

Phase 1: Initialize

Scan every word to find all unique characters. Initialize Indegree = {c: 0} and Adj = {c: []}.

Phase 2: Build Edges

Iterate words from 0 to N-1. Compare w1 and w2.

  • Find first index j where w1[j] != w2[j].
  • Add edge w1[j] -> w2[j].
  • Increment Indegree of w2[j].
  • Break. (Don’t check further characters).

Edge Case: “Prefix Problem”. If w1 = "abc" and w2 = "ab". In a valid dictionary, prefix (“ab”) must come BEFORE “abc”. If “abc” comes before “ab”, the input is invalid. Return "".

Phase 3: Topological Sort

Run Kahn’s Algorithm (BFS) . If result length < num unique chars, there is a cycle (Invalid).


4. Implementation

from collections import deque, defaultdict

class Solution:
    def alienOrder(self, words: list[str]) -> str:
        # 1. Initialize data structures
        adj = defaultdict(set)
        in_degree = {c: 0 for w in words for c in w}

        # 2. Build Graph
        for i in range(len(words) - 1):
            w1, w2 = words[i], words[i+1]
            min_len = min(len(w1), len(w2))

            # Check for prefix error ("abc" before "ab")
            if len(w1) > len(w2) and w1[:min_len] == w2[:min_len]:
                return ""

                for j in range(min_len):
                    if w1[j] != w2[j]:
                        if w2[j] not in adj[w1[j]]:
                            adj[w1[j]].add(w2[j])
                            in_degree[w2[j]] += 1
                            break # Only the first mismatch matters

                            # 3. Topological Sort (Kahn's)
                            queue = deque([c for c in in_degree if in_degree[c] == 0])
                            result = []

                            while queue:
                                char = queue.popleft()
                                result.append(char)

                                for neighbor in adj[char]:
                                    in_degree[neighbor] -= 1
                                    if in_degree[neighbor] == 0:
                                        queue.append(neighbor)

                                        # 4. Cycle Check
                                        if len(result) < len(in_degree):
                                            return "" # Cycle detected

                                            return "".join(result)

5. Testing Strategy

Test Case 1: Simple Line

Input: ["z", "x"].

  • z vs x. Edge z -> x.
  • Queue: [z]. Result zx.

Test Case 2: Cycle

Input: ["z", "x", "z"].

  • z -> x.
  • x -> z.
  • Cycle. Queue empties early. Result length < 3. Return "".

Test Case 3: Multiple Components

Input: ["a", "b", "c", "d"] (where a->b and c->d, but no link between systems).

  • Result order between disconnected components doesn’t matter (e.g., ab cd or acbd).
  • Any valid topo sort is accepted.

6. Complexity Analysis

Let C be total length of all words (total characters). Let U be number of unique characters (at most 26).

  • Time: O(C).
  • We scan the words list. Comparing two words takes O(L). Total edge building is O(C).
  • Topo sort takes O(U + E). Since U \le 26, this is constant time relative to input size!
  • Dominant term is scanning the input.
  • Space: O(1) or O(U + E). Since U is fixed at 26, the graph is tiny.

7. Production Considerations

This logic is used in Versioning Systems (SemVer). “Determine if v1.10.2 > v1.2.9”. It parses the string into segments and performs the exact same “First Mismatch” logic used here.


8. Connections to ML Systems

This relates to Character-Level LMs (ML ).

  • In Alien Dictionary, we infer the rules of the language from raw data.
  • In Char-RNN, the model infers the probability P(char_t | char_{t-1}).
  • Both are “Language Modeling” problems at the atomic level.

9. Interview Strategy

  1. Don’t skip the Prefix Check: This is the most common reason to fail. Mention abc vs ab explicitly.
  2. Explain “Set” for Edges: Using list for adjacency can lead to duplicate edges if the same rule appears twice (wrt, wrf and later art, arf). Use a Set or check before adding.
  3. Variable Names: Use adj and in_degree. Clear naming helps debugging.

10. Key Takeaways

  1. Information Theory: You only get information from the first difference. Everything after is noise.
  2. Graph Construction: Sometimes the graph isn’t given; you have to build it by observing constraints.
  3. Cycles: Order means “No Cycles”. Always cycle check.

FAQ

How do you determine character order from an alien dictionary?

Compare adjacent words in the sorted list and find the first position where characters differ. This difference gives one ordering rule as a directed edge. Build a graph from all such rules and run topological sort to produce the full alphabet ordering.

Why do you only compare adjacent words in the alien dictionary?

Adjacent word comparisons give direct ordering rules. Comparing non-adjacent words yields redundant information due to the transitive property. You only extract one edge per adjacent pair from the first mismatched character position.

What is the prefix edge case in Alien Dictionary?

If a longer word like “abc” appears before its prefix “ab” in the sorted list, the input is invalid because in a valid dictionary the prefix must always come first. Detect this by checking if the longer word starts with the shorter word and return an empty string.

What is the time complexity of the Alien Dictionary solution?

O(C) where C is the total number of characters across all words. The graph construction pass over all words dominates. The topological sort operates on at most 26 characters, making it effectively constant time relative to input size.


Originally published at: arunbaby.com/dsa/0050-alien-dictionary