Median of Two Sorted Arrays
“Finding the middle ground between two ordered worlds.”
1. Problem Statement
Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
The overall run time complexity should be $O(\log(m+n))$.
Example 1:
Input: nums1 = [1,3], nums2 = [2]
Output: 2.00000
Explanation: merged array = [1,2,3] and median is 2.
Example 2:
Input: nums1 = [1,2], nums2 = [3,4]
Output: 2.50000
Explanation: merged array = [1,2,3,4] and median is (2 + 3) / 2 = 2.5.
Constraints:
nums1.length == mnums2.length == n0 <= m <= 10000 <= n <= 10001 <= m + n <= 2000-10^6 <= nums1[i], nums2[i] <= 10^6
2. Understanding the Median
Definition:
- For an odd-length array: the middle element.
- For an even-length array: the average of the two middle elements.
Key Insight: The median divides the combined array into two equal halves:
- Left half: All elements ≤ median.
- Right half: All elements ≥ median.
3. Approach 1: Merge and Find (Brute Force)
Algorithm:
- Merge both arrays into one sorted array.
- Find the median.
def findMedianSortedArrays(nums1, nums2):
merged = sorted(nums1 + nums2)
n = len(merged)
if n % 2 == 1:
return merged[n // 2]
else:
return (merged[n // 2 - 1] + merged[n // 2]) / 2
Complexity:
- Time: $O((m+n) \log(m+n))$ for sorting, or $O(m+n)$ if using merge sort’s merge step.
- Space: $O(m+n)$ for the merged array.
Issue: Doesn’t meet the $O(\log(m+n))$ requirement.
4. Approach 2: Two-Pointer Merge (No Full Merge)
Algorithm:
- Use two pointers to “virtually” merge.
- Stop when we reach the median position.
def findMedianSortedArrays(nums1, nums2):
m, n = len(nums1), len(nums2)
total = m + n
# We need element at (total-1)//2 and total//2 for median
target1 = (total - 1) // 2
target2 = total // 2
i = j = 0
val1 = val2 = 0
for k in range(target2 + 1):
if i < m and (j >= n or nums1[i] <= nums2[j]):
val = nums1[i]
i += 1
else:
val = nums2[j]
j += 1
if k == target1:
val1 = val
if k == target2:
val2 = val
return (val1 + val2) / 2
Complexity:
- Time: $O(m+n)$.
- Space: $O(1)$.
Still doesn’t meet the $O(\log(m+n))$ requirement.
5. Approach 3: Binary Search (Optimal)
Key Insight: We need to partition both arrays such that:
- Left partition has
(m + n + 1) // 2elements. - All elements in left partition ≤ all elements in right partition.
Algorithm:
- Binary search on the smaller array.
- For each partition of the smaller array, compute the corresponding partition of the larger array.
- Check if the partition is valid.
def findMedianSortedArrays(nums1, nums2):
# Ensure nums1 is the smaller array
if len(nums1) > len(nums2):
nums1, nums2 = nums2, nums1
m, n = len(nums1), len(nums2)
total = m + n
half = (total + 1) // 2
lo, hi = 0, m
while lo <= hi:
i = (lo + hi) // 2 # Partition index for nums1
j = half - i # Partition index for nums2
# Edge handling with -inf and +inf
nums1_left = nums1[i - 1] if i > 0 else float('-inf')
nums1_right = nums1[i] if i < m else float('inf')
nums2_left = nums2[j - 1] if j > 0 else float('-inf')
nums2_right = nums2[j] if j < n else float('inf')
# Check if partition is valid
if nums1_left <= nums2_right and nums2_left <= nums1_right:
# Valid partition found
if total % 2 == 1:
return max(nums1_left, nums2_left)
else:
return (max(nums1_left, nums2_left) + min(nums1_right, nums2_right)) / 2
elif nums1_left > nums2_right:
hi = i - 1 # Too many elements from nums1
else:
lo = i + 1 # Too few elements from nums1
return 0 # Should never reach here
Complexity:
- Time: $O(\log(\min(m, n)))$.
- Space: $O(1)$.
6. Detailed Walkthrough
Example: nums1 = [1, 3, 8, 9, 15], nums2 = [7, 11, 18, 19, 21, 25]
Setup:
- m = 5, n = 6, total = 11.
- half = (11 + 1) // 2 = 6.
- Need 6 elements in left partition.
Binary Search:
Iteration 1: lo=0, hi=5, i=2, j=6-2=4
- nums1_left = nums1[1] = 3
- nums1_right = nums1[2] = 8
- nums2_left = nums2[3] = 19
- nums2_right = nums2[4] = 21
- Check: 3 ≤ 21 ✓, 19 ≤ 8 ✗
- 19 > 8, so we need more from nums1. lo = 3.
Iteration 2: lo=3, hi=5, i=4, j=6-4=2
- nums1_left = nums1[3] = 9
- nums1_right = nums1[4] = 15
- nums2_left = nums2[1] = 11
- nums2_right = nums2[2] = 18
- Check: 9 ≤ 18 ✓, 11 ≤ 15 ✓
- Valid partition!
- Odd total: median = max(9, 11) = 11.
Merged Array (for verification): [1, 3, 7, 8, 9, 11, 15, 18, 19, 21, 25]. Median = 11. ✓
7. Why Binary Search Works
Invariant: If we take i elements from nums1, we must take half - i elements from nums2.
Validity Condition:
nums1[i-1] ≤ nums2[j]: Largest in nums1’s left ≤ smallest in nums2’s right.nums2[j-1] ≤ nums1[i]: Largest in nums2’s left ≤ smallest in nums1’s right.
Binary Search Logic:
- If
nums1[i-1] > nums2[j]: We took too many from nums1. Decrease i. - If
nums2[j-1] > nums1[i]: We took too few from nums1. Increase i.
8. Edge Cases
- One Array is Empty:
nums1 = [],nums2 = [2, 3]→ Median of nums2 = 2.5.- Handle with
-infand+infsentinels.
- Arrays of Different Sizes:
- Always binary search on the smaller array for efficiency.
- All Elements in One Array are Smaller:
nums1 = [1, 2],nums2 = [3, 4, 5, 6].- Partition: All of nums1 in left, some of nums2 in left.
- Single Element Arrays:
nums1 = [1],nums2 = [2]→ Median = 1.5.
9. System Design: Distributed Median Finding
Problem: Find the median of data distributed across k servers.
Naive Approach: Collect all data, sort, find median. $O(N \log N)$.
Efficient Approach (Sampling-Based):
- Sample: Each server sends a random sample.
- Estimate: Find approximate median from samples.
- Count: Each server counts elements < median, > median.
- Refine: Narrow the range containing the true median.
Algorithm (Binary Search on Value):
- Binary search on the median value (not indices).
- For each candidate value, count how many elements are ≤ it.
- If count = (total + 1) / 2, we found the median.
10. Deep Dive: Kth Element in Two Sorted Arrays
Generalization: Find the k-th smallest element in two sorted arrays.
Algorithm (Binary Search):
def findKthElement(nums1, nums2, k):
if len(nums1) > len(nums2):
nums1, nums2 = nums2, nums1
m, n = len(nums1), len(nums2)
lo, hi = max(0, k - n), min(k, m)
while lo <= hi:
i = (lo + hi) // 2
j = k - i
nums1_left = nums1[i - 1] if i > 0 else float('-inf')
nums1_right = nums1[i] if i < m else float('inf')
nums2_left = nums2[j - 1] if j > 0 else float('-inf')
nums2_right = nums2[j] if j < n else float('inf')
if nums1_left <= nums2_right and nums2_left <= nums1_right:
return max(nums1_left, nums2_left)
elif nums1_left > nums2_right:
hi = i - 1
else:
lo = i + 1
return -1
Use Case: Median is just k = (m + n + 1) // 2.
11. Production Application: Real-Time Percentile Computation
Scenario: Compute the 99th percentile latency from logs stored on multiple servers.
Algorithm:
- Histogram Binning: Each server maintains a histogram of latencies.
- Merge Histograms: Sum histograms across servers.
- Find Percentile: Walk through the merged histogram.
Optimization:
- Use exponential binning (log-scale) for accuracy across wide ranges.
- Stream histograms instead of raw data (reduces network traffic).
12. Interview Questions
- Median of Two Sorted Arrays (Classic): Solve with binary search.
- Kth Smallest Element: Generalize the median solution.
- Median of Data Stream: Design with two heaps.
- Distributed Median: Explain the sampling + counting approach.
- Running Median: Use balanced BST or two heaps.
13. Common Mistakes
- Not Handling Empty Arrays: Leads to index out of bounds.
- Off-by-One in Partition: Careful with i-1, j-1.
- Binary Searching the Wrong Array: Always search the smaller one.
- Integer Overflow: (a + b) / 2 can overflow. Use a + (b - a) / 2.
- Forgetting Even/Odd Cases: Median calculation differs.
14. Variant: Median of K Sorted Arrays
Problem: Given k sorted arrays, find the overall median.
Approach 1: Sequential Merge
- Merge pairs until one array remains.
- Find median.
- Time: $O(N \log k)$ where N is total elements.
Approach 2: Binary Search on Value
- Binary search on the median value.
- For each candidate, count elements ≤ it in all arrays using binary search.
- Time: $O(k \log V \log N)$ where V is the value range.
15. Mathematical Proof of Correctness
Theorem: The binary search algorithm correctly finds the median.
Proof:
- Partition Property: The left partition has exactly
(m + n + 1) // 2elements. - Validity: If
nums1[i-1] ≤ nums2[j]andnums2[j-1] ≤ nums1[i], then all left elements ≤ all right elements. - Binary Search Correctness:
- If
nums1[i-1] > nums2[j], we need fewer elements from nums1 (decrease i). - If
nums2[j-1] > nums1[i], we need more elements from nums1 (increase i).
- If
- Termination: Binary search terminates when a valid partition is found.
- Median: The median is max(left) or (max(left) + min(right)) / 2.
16. Deep Dive: Weighted Median
Problem: Given elements with weights, find the median where weight sums to 50% on each side.
Algorithm:
- Sort elements by value.
- Accumulate weights.
- Find where cumulative weight = total_weight / 2.
Use Case: Weighted voting, robust averaging.
17. Performance Benchmarking
import time
import random
def benchmark():
sizes = [(100, 100), (1000, 1000), (10000, 10000)]
for m, n in sizes:
nums1 = sorted(random.sample(range(1000000), m))
nums2 = sorted(random.sample(range(1000000), n))
# Binary Search
start = time.time()
for _ in range(1000):
findMedianSortedArrays(nums1, nums2)
bs_time = time.time() - start
# Merge
start = time.time()
for _ in range(1000):
findMedian_merge(nums1, nums2)
merge_time = time.time() - start
print(f"m={m}, n={n}: BS={bs_time:.4f}s, Merge={merge_time:.4f}s")
# Expected: Binary search is significantly faster for large arrays
18. Alternative: Divide and Conquer
Algorithm:
- Compare medians of both arrays.
- Discard the half that cannot contain the overall median.
- Recur on the remaining halves.
def findMedian_DC(nums1, nums2):
def kth(a, a_start, a_end, b, b_start, b_end, k):
if a_start > a_end:
return b[b_start + k - 1]
if b_start > b_end:
return a[a_start + k - 1]
if k == 1:
return min(a[a_start], b[b_start])
mid_a = float('inf') if a_start + k // 2 - 1 > a_end else a[a_start + k // 2 - 1]
mid_b = float('inf') if b_start + k // 2 - 1 > b_end else b[b_start + k // 2 - 1]
if mid_a < mid_b:
return kth(a, a_start + k // 2, a_end, b, b_start, b_end, k - k // 2)
else:
return kth(a, a_start, a_end, b, b_start + k // 2, b_end, k - k // 2)
m, n = len(nums1), len(nums2)
total = m + n
if total % 2 == 1:
return kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2 + 1)
else:
left = kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2)
right = kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2 + 1)
return (left + right) / 2
Complexity: $O(\log(m + n))$.
19. Conclusion
The Median of Two Sorted Arrays problem is a classic example of how binary search can achieve logarithmic time complexity in non-obvious ways.
Key Takeaways:
- Binary Search on Partition: Instead of searching for a value, search for a valid partition.
- Invariant Maintenance: Ensure left and right partitions satisfy the ordering property.
- Edge Handling: Use sentinels (-inf, +inf) for boundary cases.
- Generalization: The same technique works for finding the k-th element.
This problem demonstrates that mastering binary search goes beyond simple “find the target”—it’s about identifying the right search space.
20. Related Problems
- Median of Data Stream (LeetCode 295)
- Kth Smallest Element in a Sorted Matrix (LeetCode 378)
- Find K-th Smallest Pair Distance (LeetCode 719)
- Split Array Largest Sum (LeetCode 410)
Practice these to solidify your understanding of binary search on answer patterns!
21. Deep Dive: Median in a Stream (Two Heaps)
Problem: Find the median as elements are added one by one.
Algorithm:
- Maintain two heaps:
- Max-Heap (left): Smaller half.
- Min-Heap (right): Larger half.
-
Balance: left = right or left = right + 1. - Median = top(left) or (top(left) + top(right)) / 2.
import heapq
class MedianFinder:
def __init__(self):
self.left = [] # Max-heap (negate values)
self.right = [] # Min-heap
def addNum(self, num):
# Add to left (max-heap)
heapq.heappush(self.left, -num)
# Balance: Move largest from left to right
heapq.heappush(self.right, -heapq.heappop(self.left))
# Ensure left has at least as many elements
if len(self.left) < len(self.right):
heapq.heappush(self.left, -heapq.heappop(self.right))
def findMedian(self):
if len(self.left) > len(self.right):
return -self.left[0]
return (-self.left[0] + self.right[0]) / 2
Complexity:
- Add: $O(\log n)$.
- Find Median: $O(1)$.
22. Deep Dive: Median in a Matrix (Binary Search on Value)
Problem: Find the median of a row-wise sorted matrix.
Algorithm:
- Binary search on the value (not index).
- For each candidate value, count elements ≤ it in each row.
- If count = (rows × cols + 1) / 2, we found the median.
def matrixMedian(matrix):
rows, cols = len(matrix), len(matrix[0])
lo = min(row[0] for row in matrix)
hi = max(row[-1] for row in matrix)
target = (rows * cols + 1) // 2
while lo < hi:
mid = (lo + hi) // 2
# Count elements <= mid in all rows
count = sum(bisect.bisect_right(row, mid) for row in matrix)
if count < target:
lo = mid + 1
else:
hi = mid
return lo
Complexity: $O(R \cdot \log C \cdot \log(\max - \min))$.
23. Variant: Sliding Window Median
Problem: Find the median of each window of size k.
Approach:
- Use two heaps (like streaming median).
- Use lazy deletion to handle elements leaving the window.
Challenge: Efficiently removing elements from heaps.
from sortedcontainers import SortedList
def medianSlidingWindow(nums, k):
window = SortedList()
result = []
for i, num in enumerate(nums):
window.add(num)
if len(window) > k:
window.remove(nums[i - k])
if len(window) == k:
if k % 2 == 1:
result.append(window[k // 2])
else:
result.append((window[k // 2 - 1] + window[k // 2]) / 2)
return result
Complexity: $O(n \cdot \log k)$ with balanced BST.
24. Application: A/B Testing Statistics
Scenario: Compare median latency between control and treatment groups.
Challenge: Latencies are stored on multiple servers.
Approach:
- Each server computes a histogram of latencies.
- Merge histograms.
- Find median from merged histogram.
Benefit: Avoids transferring raw data.
25. Application: Database Query Optimization
Scenario: SQL query SELECT MEDIAN(column) FROM table.
Implementation:
- If table is sorted: Direct access to middle element.
- If unsorted: Use quickselect ($O(n)$ average) or sort ($O(n \log n)$).
- For approximate median: Use sampling or t-digest.
26. Deep Dive: Approximating Median with T-Digest
T-Digest is a data structure for approximate percentile computation.
Algorithm:
- Maintain a set of centroids (mean, count).
- Centroids near the median are small (high precision).
- Centroids at the extremes are large (low precision).
Operations:
- Add: Merge new element into nearest centroid.
- Query: Walk through centroids to find percentile.
Use Case: Real-time percentile computation at scale (Netflix, Elasticsearch).
27. Testing Strategy
Test Cases:
- Both Arrays Empty: Handle gracefully.
- One Array Empty: Return median of the other.
- Same Size Arrays:
[1,2],[3,4]→ 2.5. - Different Size Arrays:
[1,3],[2]→ 2. - All Elements in One Array Smaller:
[1,2],[3,4,5,6]. - Duplicates:
[1,1,1],[1,1,1]→ 1. - Large Arrays: Performance testing.
Edge Cases:
- Single element in each array.
- Negative numbers.
- Very large values.
28. Interview Tips
Step-by-Step Approach:
- Clarify: Sizes, sorted order, allowed time complexity.
- Brute Force: Merge and find median ($O(m+n)$).
- Optimize: Binary search on partition ($O(\log \min(m,n))$).
- Walk Through: Use a specific example.
- Edge Cases: Empty arrays, single elements.
- Code: Write clean, bug-free code.
- Complexity: Analyze time and space.
29. Common Follow-up Questions
Q: What if arrays are not sorted? A: Sort first ($O(n \log n)$), or use quickselect.
Q: What if we can’t afford $O(\log(m+n))$? A: Two-pointer merge is $O(m+n)$ and may be acceptable.
Q: What if arrays are on different machines? A: Use the sampling/counting approach for distributed median.
Q: What if we need the k-th element instead of median? A: Same algorithm, just change the target partition size.
30. Mastery Checklist
Mastery Checklist:
- Implement brute force (merge and find)
- Implement two-pointer (virtual merge)
- Implement binary search on partition
- Handle edge cases (empty arrays, single elements)
- Generalize to k-th element
- Understand the divide-and-conquer approach
- Explain why binary search works (partition invariant)
- Implement streaming median (two heaps)
- Implement sliding window median
- Understand distributed median algorithms
31. Conclusion
The Median of Two Sorted Arrays is one of the most elegant problems in computer science. It combines:
- Binary Search: A powerful algorithmic paradigm.
- Partition Logic: Dividing arrays into meaningful halves.
- Edge Case Handling: Using sentinels for boundaries.
Mastering this problem opens the door to many advanced topics: distributed computing, real-time statistics, and database optimization. It’s a testament to how mathematical insight can lead to efficient algorithms.
The journey from $O(m+n)$ to $O(\log \min(m,n))$ teaches us that the right perspective can transform a problem.