Group Anagrams
Master hash-based grouping to solve anagrams—the foundation of clustering systems and speaker diarization in production ML.
Problem Statement
Given an array of strings strs, group the anagrams together. You can return the answer in any order.
An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once.
Examples
Example 1:
Input: strs = ["eat","tea","tan","ate","nat","bat"]
Output: [["bat"],["nat","tan"],["ate","eat","tea"]]
Example 2:
Input: strs = [""]
Output: [[""]]
Example 3:
Input: strs = ["a"]
Output: [["a"]]
Constraints
1 <= strs.length <= 10^4
0 <= strs[i].length <= 100
strs[i] consists of lowercase English letters
Understanding the Problem
This is a fundamental grouping problem that teaches us:
- How to identify similar items (anagrams share same characters)
- How to use hash tables for efficient grouping
- How to design good hash keys for complex objects
- Pattern recognition for clustering algorithms
What Are Anagrams?
Two strings are anagrams if they contain the same characters with the same frequencies, just in different order.
Examples:
- “listen” and “silent” → anagrams (same letters: e,i,l,n,s,t)
- “eat”, “tea”, “ate” → all anagrams
- “cat” and “rat” → NOT anagrams (different letters)
Key Insight
Anagrams share a unique signature:
- Sorted characters: “eat” → “aet”, “tea” → “aet”
- Character count: both have {a:1, e:1, t:1}
We can use this signature as a hash key to group anagrams together.
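A quick sketch of both signatures in plain Python (using the standard `collections.Counter`):

```python
from collections import Counter

words = ["eat", "tea", "ate", "tan"]

# Signature 1: sorted characters -- identical for anagrams
print([''.join(sorted(w)) for w in words])  # ['aet', 'aet', 'aet', 'ant']

# Signature 2: character counts -- equal exactly when strings are anagrams
print(Counter("eat") == Counter("tea"))  # True
print(Counter("eat") == Counter("tan"))  # False
```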
Why This Problem Matters
- Hash table mastery: Learn to design effective hash keys
- Grouping pattern: Fundamental for clustering algorithms
- String manipulation: Common in text processing
- Real-world applications:
- Text deduplication
- Spam detection (similar messages)
- DNA sequence analysis
- Document clustering
- Speaker identification (similar voice characteristics)
The Clustering Connection
The grouping pattern in this problem is identical to clustering in ML:
| Group Anagrams | Clustering Systems | Speaker Diarization |
|---|---|---|
| Group strings by characters | Group data points by features | Group speech by speaker |
| Hash key: sorted string | Cluster ID: centroid | Speaker ID: voice embedding |
| O(NK log K) grouping | O(N × K) clustering | O(N × M) diarization |
All three use hash-based or similarity-based grouping to organize items.
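All three can be seen as instances of one generic helper; a minimal sketch (the `group_by` name and `key_fn` parameter are illustrative, not a library API):

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")

def group_by(items: Iterable[T], key_fn: Callable[[T], Hashable]) -> List[List[T]]:
    """Group items whose key_fn signatures match -- the shared core of
    anagram grouping, hash-based clustering, and diarization."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return list(groups.values())

# Anagrams: the signature is the sorted string
print(group_by(["eat", "tea", "tan"], lambda s: ''.join(sorted(s))))
# [['eat', 'tea'], ['tan']]
```

Swapping `key_fn` for a quantized feature vector or a speaker-embedding cluster ID turns the same helper into a clusterer or a diarizer.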
Approach 1: Brute Force - Compare All Pairs
Intuition
Compare every pair of strings to check if they’re anagrams, then group them.
Implementation
from typing import List
from collections import defaultdict
def groupAnagrams_bruteforce(strs: List[str]) -> List[List[str]]:
"""
Brute force: compare all pairs.
Time: O(N^2 × K) where N = number of strings, K = max string length
Space: O(NK)
Why this approach?
- Simple to understand
- Shows the naive solution
- Demonstrates need for optimization
Problem:
- Too slow for large inputs
- Redundant comparisons
"""
def are_anagrams(s1: str, s2: str) -> bool:
"""Check if two strings are anagrams."""
if len(s1) != len(s2):
return False
# Count characters in each string
from collections import Counter
return Counter(s1) == Counter(s2)
# Track which strings have been grouped
grouped = [False] * len(strs)
result = []
for i in range(len(strs)):
if grouped[i]:
continue
# Start new group with current string
group = [strs[i]]
grouped[i] = True
# Find all anagrams of current string
for j in range(i + 1, len(strs)):
if not grouped[j] and are_anagrams(strs[i], strs[j]):
group.append(strs[j])
grouped[j] = True
result.append(group)
return result
# Test
test_input = ["eat","tea","tan","ate","nat","bat"]
print(groupAnagrams_bruteforce(test_input))
# Output: [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
Analysis
Time Complexity: O(N² × K)
- N² pairs to compare
- Each comparison: O(K) to count characters
Space Complexity: O(NK)
- Store all strings in result
For N=10,000, K=100:
- Operations: 10,000² × 100 = 10 billion
- Too slow!
Approach 2: Sorting as Hash Key (Standard Solution)
The Key Insight
Anagrams become identical when sorted!
- “eat” → sort → “aet”
- “tea” → sort → “aet”
- “ate” → sort → “aet”
We can use the sorted string as a hash key to group anagrams.
Implementation
from collections import defaultdict
from typing import List
def groupAnagrams(strs: List[str]) -> List[List[str]]:
"""
Optimal solution using sorted string as hash key.
Time: O(N × K log K) where N = number of strings, K = max string length
Space: O(NK)
Algorithm:
1. For each string, create hash key by sorting it
2. Use hash table to group strings with same key
3. Return groups as list
Why this works:
- Sorting is canonical representation of anagram
- Hash table provides O(1) lookup
- Single pass through all strings
"""
# Hash table: sorted_string -> list of original strings
anagram_map = defaultdict(list)
for s in strs:
# Sort the string to create hash key
# "eat" -> ['a', 'e', 't'] -> "aet"
sorted_str = ''.join(sorted(s))
# Group by sorted key
anagram_map[sorted_str].append(s)
# Return all groups
return list(anagram_map.values())
# Test cases
test_cases = [
["eat","tea","tan","ate","nat","bat"],
[""],
["a"],
["abc", "bca", "cab", "xyz", "zyx", "yxz"],
]
for test in test_cases:
result = groupAnagrams(test)
print(f"Input: {test}")
print(f"Output: {result}\n")
Step-by-Step Visualization
Input: ["eat","tea","tan","ate","nat","bat"]
Step 1: Process "eat"
sorted("eat") = "aet"
anagram_map = {"aet": ["eat"]}
Step 2: Process "tea"
sorted("tea") = "aet"
anagram_map = {"aet": ["eat", "tea"]}
Step 3: Process "tan"
sorted("tan") = "ant"
anagram_map = {"aet": ["eat", "tea"], "ant": ["tan"]}
Step 4: Process "ate"
sorted("ate") = "aet"
anagram_map = {"aet": ["eat", "tea", "ate"], "ant": ["tan"]}
Step 5: Process "nat"
sorted("nat") = "ant"
anagram_map = {"aet": ["eat", "tea", "ate"], "ant": ["tan", "nat"]}
Step 6: Process "bat"
sorted("bat") = "abt"
anagram_map = {
"aet": ["eat", "tea", "ate"],
"ant": ["tan", "nat"],
"abt": ["bat"]
}
Output: [["eat","tea","ate"], ["tan","nat"], ["bat"]]
Approach 3: Character Count as Hash Key (Optimal for Large K)
Alternative Hash Key
Instead of sorting, we can use character frequencies as the hash key.
Why? When strings are very long (K ≫ 26), counting is asymptotically faster than sorting: O(K) vs O(K log K) per string.
Implementation
from collections import defaultdict
from typing import List
def groupAnagrams_count(strs: List[str]) -> List[List[str]]:
"""
Use character count as hash key.
Time: O(NK) where N = number of strings, K = max string length
Space: O(NK)
Advantage over sorting:
- O(K) instead of O(K log K) per string
- Better for very long strings
Hash key format:
- Tuple of 26 integers (a-z counts)
- e.g., "aab" -> (2, 1, 0, 0, ..., 0)
"""
anagram_map = defaultdict(list)
for s in strs:
# Count characters (a-z)
count = [0] * 26
for char in s:
count[ord(char) - ord('a')] += 1
# Use tuple as hash key (lists aren't hashable)
key = tuple(count)
anagram_map[key].append(s)
return list(anagram_map.values())
# Test
test = ["eat","tea","tan","ate","nat","bat"]
result = groupAnagrams_count(test)
print(f"Result: {result}")
Character Count Visualization
"eat" -> count array:
Index: 0 1 2 3 4 5 ... 19 20 21 ...
Char: a b c d e f ... t u v ...
Count: [1, 0, 0, 0, 1, 0, ... 1, 0, 0, ...]
(a=1, e=1, t=1)
Key = (1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0)
"tea" -> same count array -> same key!
"tan" -> different count array -> different key
Implementation: Production-Grade Solution
from collections import defaultdict
from typing import List, Dict, Optional
import logging
from enum import Enum
class GroupingStrategy(Enum):
"""Strategy for creating hash keys."""
SORTED = "sorted"
COUNT = "count"
PRIME = "prime" # Advanced: prime number hash
class AnagramGrouper:
"""
Production-ready anagram grouper with multiple strategies.
Features:
- Multiple grouping strategies
- Input validation
- Performance metrics
- Detailed logging
"""
def __init__(self, strategy: GroupingStrategy = GroupingStrategy.SORTED):
"""
Initialize grouper.
Args:
strategy: Grouping strategy to use
"""
self.strategy = strategy
self.logger = logging.getLogger(__name__)
# Metrics
self.comparisons = 0
self.groups_created = 0
def group_anagrams(self, strs: List[str]) -> List[List[str]]:
"""
Group anagrams using selected strategy.
Args:
strs: List of strings to group
Returns:
List of groups (each group is a list of anagrams)
Raises:
ValueError: If input is invalid
"""
# Validate input
if not isinstance(strs, list):
raise ValueError("Input must be a list of strings")
if not all(isinstance(s, str) for s in strs):
raise ValueError("All elements must be strings")
# Reset metrics
self.comparisons = 0
self.groups_created = 0
# Choose strategy
if self.strategy == GroupingStrategy.SORTED:
result = self._group_by_sorted(strs)
elif self.strategy == GroupingStrategy.COUNT:
result = self._group_by_count(strs)
elif self.strategy == GroupingStrategy.PRIME:
result = self._group_by_prime(strs)
else:
raise ValueError(f"Unknown strategy: {self.strategy}")
self.groups_created = len(result)
self.logger.info(
f"Grouped {len(strs)} strings into {self.groups_created} groups "
f"using {self.strategy.value} strategy"
)
return result
def _group_by_sorted(self, strs: List[str]) -> List[List[str]]:
"""Group using sorted string as key."""
anagram_map = defaultdict(list)
for s in strs:
key = ''.join(sorted(s))
anagram_map[key].append(s)
self.comparisons += 1
return list(anagram_map.values())
def _group_by_count(self, strs: List[str]) -> List[List[str]]:
"""Group using character count as key."""
anagram_map = defaultdict(list)
for s in strs:
# Count characters
count = [0] * 26
for char in s:
if 'a' <= char <= 'z':
count[ord(char) - ord('a')] += 1
else:
# Handle uppercase or non-alphabetic
char_lower = char.lower()
if 'a' <= char_lower <= 'z':
count[ord(char_lower) - ord('a')] += 1
key = tuple(count)
anagram_map[key].append(s)
self.comparisons += 1
return list(anagram_map.values())
def _group_by_prime(self, strs: List[str]) -> List[List[str]]:
"""
Group using prime number hash.
Assign each letter a prime number:
a=2, b=3, c=5, d=7, e=11, ...
Hash = product of primes for each character.
Advantage: Unique hash for each anagram group
Disadvantage: Can overflow for long strings
"""
# Prime numbers for a-z
primes = [
2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101
]
anagram_map = defaultdict(list)
for s in strs:
# Calculate prime product
hash_value = 1
for char in s:
if 'a' <= char <= 'z':
hash_value *= primes[ord(char) - ord('a')]
anagram_map[hash_value].append(s)
self.comparisons += 1
return list(anagram_map.values())
    def find_anagrams_of(self, target: str, strs: List[str]) -> List[str]:
        """
        Find all anagrams of a target string in a list.
        Args:
            target: Target string
            strs: List of strings to search
        Returns:
            List of strings that are anagrams of target
        """
        def make_key(s: str):
            """Canonical key for s under the current strategy."""
            if self.strategy == GroupingStrategy.COUNT:
                count = [0] * 26
                for char in s:
                    if 'a' <= char <= 'z':
                        count[ord(char) - ord('a')] += 1
                return tuple(count)
            # SORTED (and PRIME, which has no dedicated lookup here) fall
            # back to the sorted string, which is canonical for any anagram
            return ''.join(sorted(s))
        target_key = make_key(target)
        return [s for s in strs if make_key(s) == target_key]
def get_stats(self) -> Dict:
"""Get performance statistics."""
return {
"strategy": self.strategy.value,
"comparisons": self.comparisons,
"groups_created": self.groups_created,
}
# Example usage
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
# Test data
test_strs = ["eat","tea","tan","ate","nat","bat"]
# Test different strategies
for strategy in GroupingStrategy:
print(f"\n=== Testing {strategy.value} strategy ===")
grouper = AnagramGrouper(strategy=strategy)
result = grouper.group_anagrams(test_strs)
print(f"Result: {result}")
print(f"Stats: {grouper.get_stats()}")
# Test finding anagrams
anagrams_of_eat = grouper.find_anagrams_of("eat", test_strs)
print(f"Anagrams of 'eat': {anagrams_of_eat}")
Testing
Comprehensive Test Suite
import pytest
from typing import List
class TestAnagramGrouper:
"""Comprehensive test suite for anagram grouping."""
@pytest.fixture
def grouper(self):
return AnagramGrouper(strategy=GroupingStrategy.SORTED)
def test_basic_examples(self, grouper):
"""Test basic examples from problem."""
# Example 1
result = grouper.group_anagrams(["eat","tea","tan","ate","nat","bat"])
# Convert to sets for comparison (order doesn't matter)
result_sets = [set(group) for group in result]
expected_sets = [
{"eat", "tea", "ate"},
{"tan", "nat"},
{"bat"}
]
assert len(result_sets) == len(expected_sets)
for expected in expected_sets:
assert expected in result_sets
def test_empty_string(self, grouper):
"""Test with empty string."""
result = grouper.group_anagrams([""])
assert result == [[""]]
def test_single_string(self, grouper):
"""Test with single string."""
result = grouper.group_anagrams(["a"])
assert result == [["a"]]
def test_no_anagrams(self, grouper):
"""Test when no strings are anagrams."""
result = grouper.group_anagrams(["abc", "def", "ghi"])
assert len(result) == 3
assert all(len(group) == 1 for group in result)
def test_all_anagrams(self, grouper):
"""Test when all strings are anagrams."""
result = grouper.group_anagrams(["abc", "bca", "cab", "acb"])
assert len(result) == 1
assert len(result[0]) == 4
def test_long_strings(self, grouper):
"""Test with long strings."""
long1 = "a" * 100
long2 = "a" * 100
long3 = "b" * 100
result = grouper.group_anagrams([long1, long2, long3])
assert len(result) == 2
def test_strategy_equivalence(self):
"""Test that all strategies produce equivalent results."""
test_input = ["eat","tea","tan","ate","nat","bat"]
results = {}
for strategy in GroupingStrategy:
grouper = AnagramGrouper(strategy=strategy)
result = grouper.group_anagrams(test_input)
# Convert to frozensets for comparison
result_sets = frozenset(
frozenset(group) for group in result
)
results[strategy] = result_sets
# All strategies should produce same groupings
assert len(set(results.values())) == 1
    def test_case_insensitive(self):
        """Test case handling."""
        grouper = AnagramGrouper(strategy=GroupingStrategy.COUNT)
        result = grouper.group_anagrams(["Eat", "Tea", "eat"])
        # The COUNT strategy lowercases characters, so all three share one key
        assert len(result) == 1
def test_invalid_input(self, grouper):
"""Test input validation."""
with pytest.raises(ValueError):
grouper.group_anagrams("not a list")
with pytest.raises(ValueError):
grouper.group_anagrams([1, 2, 3])
def test_find_anagrams(self, grouper):
"""Test finding specific anagrams."""
strs = ["eat","tea","tan","ate","nat","bat"]
anagrams = grouper.find_anagrams_of("eat", strs)
assert set(anagrams) == {"eat", "tea", "ate"}
anagrams = grouper.find_anagrams_of("tan", strs)
assert set(anagrams) == {"tan", "nat"}
# Run tests
if __name__ == "__main__":
pytest.main([__file__, "-v"])
Complexity Analysis
Sorting Approach
Time Complexity: O(N × K log K)
- N strings to process
- Each string of length K needs sorting: O(K log K)
- Hash table operations: O(1) average
Space Complexity: O(NK)
- Store all strings in hash table: O(NK)
- Hash keys: O(NK)
Character Count Approach
Time Complexity: O(NK)
- N strings to process
- Each string of length K needs counting: O(K)
- Better than sorting for large K!
Space Complexity: O(NK)
- Store all strings: O(NK)
- Hash keys (26-element tuples): O(N)
Comparison
| Approach | Time | Space | Best For |
|---|---|---|---|
| Brute Force | O(N²K) | O(NK) | Never (too slow) |
| Sorting | O(NK log K) | O(NK) | Short to medium strings |
| Count | O(NK) | O(NK) | Long strings (K ≫ 26) |
| Prime | O(NK) | O(NK) | Theoretical interest |
When K is small (≤ 100), sorting is simpler and sufficient. When K is large (> 1000), character count wins asymptotically, though constant factors matter in practice.
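Asymptotics aside, constants matter: CPython's `sorted` runs in C while the counting loop runs in the interpreter, so the real crossover point is worth measuring on your own data. A small timing sketch (the helper names are illustrative; absolute numbers are machine-dependent):

```python
import timeit

def sorted_key(s: str) -> str:
    """O(K log K) key: sorted characters."""
    return ''.join(sorted(s))

def count_key(s: str) -> tuple:
    """O(K) key: 26-entry character-count tuple."""
    count = [0] * 26
    for ch in s:
        count[ord(ch) - ord('a')] += 1
    return tuple(count)

s = "abcdefghij" * 500  # K = 5000, well past the theoretical crossover
t_sort = timeit.timeit(lambda: sorted_key(s), number=200)
t_count = timeit.timeit(lambda: count_key(s), number=200)
print(f"sorted key: {t_sort:.4f}s, count key: {t_count:.4f}s")
```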
Production Considerations
1. Unicode Support
def group_anagrams_unicode(strs: List[str]) -> List[List[str]]:
"""
Group anagrams with full Unicode support.
Handles:
- Non-ASCII characters
- Emojis
- Accented characters
"""
from collections import Counter
anagram_map = defaultdict(list)
for s in strs:
# Use Counter for Unicode-safe counting
# Convert to frozenset of items for hashing
key = tuple(sorted(Counter(s).items()))
anagram_map[key].append(s)
return list(anagram_map.values())
# Test with Unicode
unicode_strs = ["café", "éfac", "hello", "हेलो", "😀😁", "😁😀"]
result = group_anagrams_unicode(unicode_strs)
print(result)
2. Case-Insensitive Grouping
def group_anagrams_case_insensitive(strs: List[str]) -> List[List[str]]:
"""Group anagrams ignoring case."""
anagram_map = defaultdict(list)
for s in strs:
# Normalize to lowercase before sorting
key = ''.join(sorted(s.lower()))
anagram_map[key].append(s)
return list(anagram_map.values())
# Test
result = group_anagrams_case_insensitive(["Eat", "Tea", "eat", "tea"])
print(result) # All in one group
3. Streaming / Online Grouping
class StreamingAnagramGrouper:
"""
Group anagrams in streaming fashion.
Useful when:
- Data doesn't fit in memory
- Processing real-time stream
- Need incremental results
"""
def __init__(self):
self.groups = defaultdict(list)
self.group_ids = {}
self.next_id = 0
def add_string(self, s: str) -> int:
"""
Add string to grouping.
Returns:
Group ID for this string
"""
key = ''.join(sorted(s))
if key not in self.group_ids:
self.group_ids[key] = self.next_id
self.next_id += 1
group_id = self.group_ids[key]
self.groups[group_id].append(s)
return group_id
def get_group(self, group_id: int) -> List[str]:
"""Get strings in a specific group."""
return self.groups.get(group_id, [])
def get_all_groups(self) -> List[List[str]]:
"""Get all groups."""
return list(self.groups.values())
# Usage
streamer = StreamingAnagramGrouper()
for word in ["eat", "tea", "tan", "ate"]:
group_id = streamer.add_string(word)
print(f"'{word}' -> group {group_id}")
print(f"Final groups: {streamer.get_all_groups()}")
4. Performance Monitoring
import time
from dataclasses import dataclass
@dataclass
class PerformanceMetrics:
"""Performance metrics for grouping operation."""
total_strings: int
total_groups: int
execution_time_ms: float
avg_group_size: float
largest_group_size: int
def __str__(self):
return (
f"Performance Metrics:\n"
f" Strings processed: {self.total_strings}\n"
f" Groups created: {self.total_groups}\n"
f" Execution time: {self.execution_time_ms:.2f}ms\n"
f" Avg group size: {self.avg_group_size:.2f}\n"
f" Largest group: {self.largest_group_size}"
)
def group_anagrams_with_metrics(strs: List[str]) -> tuple[List[List[str]], PerformanceMetrics]:
"""Group anagrams and return performance metrics."""
start_time = time.perf_counter()
# Group anagrams
anagram_map = defaultdict(list)
for s in strs:
key = ''.join(sorted(s))
anagram_map[key].append(s)
result = list(anagram_map.values())
# Calculate metrics
execution_time = (time.perf_counter() - start_time) * 1000
group_sizes = [len(group) for group in result]
metrics = PerformanceMetrics(
total_strings=len(strs),
total_groups=len(result),
execution_time_ms=execution_time,
avg_group_size=sum(group_sizes) / len(group_sizes) if group_sizes else 0,
largest_group_size=max(group_sizes) if group_sizes else 0
)
return result, metrics
Connections to ML Systems
The hash-based grouping pattern from this problem is fundamental to clustering systems:
1. Clustering Systems
Similarity to Group Anagrams:
- Anagrams: Group strings by character composition
- Clustering: Group data points by feature similarity
import numpy as np
from collections import defaultdict
class SimpleClusterer:
"""
Simple clustering using hash-based grouping.
Similar to anagram grouping:
- Hash key: quantized feature vector
- Grouping: points with same hash go to same cluster
"""
def __init__(self, num_bins: int = 10):
self.num_bins = num_bins
def cluster(self, points: np.ndarray) -> List[List[int]]:
"""
Cluster points using locality-sensitive hashing.
Args:
points: Array of shape (n_samples, n_features)
Returns:
List of clusters (each cluster is list of point indices)
"""
clusters = defaultdict(list)
for idx, point in enumerate(points):
# Create hash key by quantizing features
# (similar to sorting characters in anagram)
hash_key = tuple(
int(feature * self.num_bins) for feature in point
)
clusters[hash_key].append(idx)
return list(clusters.values())
# Example: Cluster 2D points
points = np.array([
[0.1, 0.1], # Cluster 1
[0.12, 0.11], # Cluster 1
[0.8, 0.9], # Cluster 2
[0.82, 0.88], # Cluster 2
])
clusterer = SimpleClusterer(num_bins=10)
clusters = clusterer.cluster(points)
print(f"Clusters: {clusters}")
2. Duplicate Detection
class DocumentDeduplicator:
"""
Detect duplicate or near-duplicate documents.
Uses same pattern as anagram grouping:
- Hash key: document signature
- Grouping: similar documents
"""
def __init__(self):
self.doc_groups = defaultdict(list)
def add_document(self, doc_id: str, text: str):
"""Add document to deduplication system."""
# Create signature (like anagram key)
signature = self._create_signature(text)
self.doc_groups[signature].append(doc_id)
def _create_signature(self, text: str) -> str:
"""
Create document signature.
Methods:
1. Word frequency (like character count)
2. N-gram hashing
3. MinHash
"""
# Simple: use sorted bag of words
words = text.lower().split()
return ' '.join(sorted(words))
def find_duplicates(self) -> List[List[str]]:
"""Find groups of duplicate documents."""
return [
group for group in self.doc_groups.values()
if len(group) > 1
]
# Usage
dedup = DocumentDeduplicator()
dedup.add_document("doc1", "the quick brown fox")
dedup.add_document("doc2", "quick brown fox the") # Duplicate!
dedup.add_document("doc3", "hello world")
duplicates = dedup.find_duplicates()
print(f"Duplicate groups: {duplicates}")
3. Feature Hashing
class FeatureHasher:
"""
Hash high-dimensional features to lower dimensions.
Similar to anagram grouping:
- Hash collisions group similar features
- Dimensionality reduction via hashing
"""
def __init__(self, n_features: int = 100):
self.n_features = n_features
def transform(self, feature_dict: Dict[str, float]) -> np.ndarray:
"""
Transform feature dictionary to fixed-size vector.
Args:
feature_dict: {feature_name: value}
Returns:
Dense feature vector
"""
# Initialize vector
vector = np.zeros(self.n_features)
# Hash each feature to a bin
for feature_name, value in feature_dict.items():
# Hash feature name to index
# (like sorting string to create key)
hash_idx = hash(feature_name) % self.n_features
vector[hash_idx] += value
return vector
# Example: Text features
hasher = FeatureHasher(n_features=10)
doc1_features = {"word_cat": 2, "word_dog": 1, "word_house": 1}
doc2_features = {"word_cat": 1, "word_dog": 2, "word_car": 1}
vec1 = hasher.transform(doc1_features)
vec2 = hasher.transform(doc2_features)
print(f"Vector 1: {vec1}")
print(f"Vector 2: {vec2}")
print(f"Similarity: {np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))}")
Key Parallels
| Group Anagrams | Clustering/ML |
|---|---|
| Sorted string as hash key | Feature signature as hash key |
| Group identical keys | Group similar signatures |
| O(1) hash lookup | O(1) hash lookup |
| Character frequency | Feature frequency |
| String similarity | Data point similarity |
Interview Strategy
How to Approach in an Interview
1. Clarify (1 min)
- Are all strings lowercase? (Yes, per constraints)
- Can strings be empty? (Yes)
- Does order of output matter? (No)
- Memory constraints? (Reasonable for N ≤ 10^4)
2. Explain Intuition (2 min)
"Anagrams have the same characters, just rearranged. If we sort
each string, anagrams become identical. We can use this sorted
string as a hash key to group them together in O(1) time per lookup."
3. Discuss Approaches (2 min)
1. Brute force: Compare all pairs - O(N²K)
2. Sorting as key: Sort each string - O(NK log K)
3. Count as key: Count characters - O(NK)
I'll implement approach 2 (sorting) as it's simpler and efficient
enough for the constraints.
4. Code (10 min)
- Clear variable names
- Add comments
- Handle edge cases
5. Test (3 min)
- Walk through example
- Edge cases: empty string, single string
6. Optimize (2 min)
- Discuss count approach for very long strings
Common Mistakes
- Forgetting to handle empty strings

  # Wrong: crashes on empty string
  key = sorted(s)[0]
  # Correct: works for empty string too
  key = ''.join(sorted(s))

- Using a list as a hash key

  # Wrong: lists aren't hashable
  key = sorted(s)  # Returns a list
  # Correct: convert to string or tuple
  key = ''.join(sorted(s))
  key = tuple(sorted(s))

- Not considering case sensitivity
  - Problem says lowercase only, but clarify in the interview

- Inefficient anagram checking in brute force

  # Inefficient: sorts both strings, O(K log K)
  def are_anagrams(s1, s2):
      return sorted(s1) == sorted(s2)

  # Better: use Counter, O(K)
  from collections import Counter
  def are_anagrams(s1, s2):
      return Counter(s1) == Counter(s2)
Follow-up Questions
Q1: How would you find the largest group of anagrams?
def find_largest_anagram_group(strs: List[str]) -> List[str]:
"""Find the group with most anagrams."""
anagram_map = defaultdict(list)
for s in strs:
key = ''.join(sorted(s))
anagram_map[key].append(s)
# Return largest group
return max(anagram_map.values(), key=len)
Q2: Can you do this without sorting?
Yes! Use character count (shown in Approach 3).
Q3: How would you handle very large datasets?
"""
For datasets that don't fit in memory:
1. Use external sorting / MapReduce
2. Process in batches
3. Use database with hash index
4. Stream processing with approximate grouping
"""
def group_anagrams_distributed(strs_iterator):
    """
    Process a large dataset in streaming fashion.
    MapReduce approach:
    - Map: string -> (sorted_key, string)
    - Reduce: group by sorted_key
    Here the "reduce" is a single in-memory dict; a real MapReduce job
    would shuffle each sorted_key to its own reducer.
    """
    anagram_map = defaultdict(list)
    for s in strs_iterator:  # consumes any iterator; never materializes the input
        anagram_map[''.join(sorted(s))].append(s)
    return list(anagram_map.values())
Q4: What if we want fuzzy matching (allow 1-2 character differences)?
def group_similar_strings(strs: List[str], max_diff: int = 1) -> List[List[str]]:
    """
    Group strings that are similar (not exact anagrams).
    Greedy sketch: compare character counts; one substitution changes two
    counts, so an L1 distance <= 2 * max_diff means at most max_diff
    differing characters. A production system would use locality-sensitive
    hashing or edit-distance clustering instead.
    """
    from collections import Counter
    groups, reps = [], []  # reps: Counter of each group's first member
    for s in strs:
        c = Counter(s)
        for i, rep in enumerate(reps):
            # L1 distance between count vectors
            if sum((c - rep).values()) + sum((rep - c).values()) <= 2 * max_diff:
                groups[i].append(s)
                break
        else:
            groups.append([s])
            reps.append(c)
    return groups
Key Takeaways
✅ Hash tables are perfect for grouping - O(1) lookup and insertion
✅ Good hash keys are canonical - sorted string represents all anagrams
✅ Two main approaches: Sorting O(NK log K) vs Counting O(NK)
✅ Character count is asymptotically faster for very long strings (K ≫ 26)
✅ Pattern applies broadly: Document clustering, duplicate detection, feature hashing
✅ Production considerations: Unicode support, case-insensitivity, streaming
✅ Same pattern in clustering: Hash key = feature signature, grouping = clustering
✅ Same pattern in speaker diarization: Hash key = voice embedding, grouping = speaker clusters
✅ Defaultdict is cleaner than manually checking key existence
✅ Testing is crucial: Edge cases (empty strings, single string, all anagrams)
Mental Model
Think of this problem as:
- Anagrams: Hash-based string grouping
- Clustering: Hash-based data point grouping
- Speaker Diarization: Hash-based audio segment grouping
All use the same pattern: create signature → hash → group by signature
Originally published at: arunbaby.com/dsa/0015-group-anagrams
If you found this helpful, consider sharing it with others who might benefit.