Median of Two Sorted Arrays
“Finding the middle ground between two ordered worlds.”
TL;DR
Binary search on the shorter array to find the correct partition where all left elements are smaller than all right elements. The partition splits both arrays so the left half contains exactly (m+n+1)/2 elements. Check if max_left1 <= min_right2 and max_left2 <= min_right1. This runs in O(log min(m,n)) time, a massive improvement over the O(m+n) merge approach. See also Merge K Sorted Lists for merge patterns and statistical analysis in ML evaluation.
1. Problem Statement
Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
The overall run time complexity should be O(\log(m+n)).
Example 1:
Input: nums1 = [1,3], nums2 = [2]
Output: 2.00000
Explanation: merged array = [1,2,3] and median is 2.
Example 2:
Input: nums1 = [1,2], nums2 = [3,4]
Output: 2.50000
Explanation: merged array = [1,2,3,4] and median is (2 + 3) / 2 = 2.5.
Constraints:
nums1.length == mnums2.length == n0 <= m <= 10000 <= n <= 10001 <= m + n <= 2000-10^6 <= nums1[i], nums2[i] <= 10^6
2. Understanding the Median
Definition:
- For an odd-length array: the middle element.
- For an even-length array: the average of the two middle elements.
Key Insight: The median divides the combined array into two equal halves:
- Left half: All elements ≤ median.
- Right half: All elements ≥ median.
3. Approach 1: Merge and Find (Brute Force)
Algorithm:
- Merge both arrays into one sorted array.
- Find the median.
def findMedianSortedArrays(nums1, nums2):
merged = sorted(nums1 + nums2)
n = len(merged)
if n % 2 == 1:
return merged[n // 2]
else:
return (merged[n // 2 - 1] + merged[n // 2]) / 2
Complexity:
- Time:
O((m+n) \log(m+n))for sorting, or O(m+n) if using merge sort’s merge step. - Space: O(m+n) for the merged array.
Issue: Doesn’t meet the O(\log(m+n)) requirement.
4. Approach 2: Two-Pointer Merge (No Full Merge)
Algorithm:
- Use two pointers to “virtually” merge.
- Stop when we reach the median position.
def findMedianSortedArrays(nums1, nums2):
m, n = len(nums1), len(nums2)
total = m + n
# We need element at (total-1)//2 and total//2 for median
target1 = (total - 1) // 2
target2 = total // 2
i = j = 0
val1 = val2 = 0
for k in range(target2 + 1):
if i < m and (j >= n or nums1[i] <= nums2[j]):
val = nums1[i]
i += 1
else:
val = nums2[j]
j += 1
if k == target1:
val1 = val
if k == target2:
val2 = val
return (val1 + val2) / 2
Complexity:
- Time: O(m+n).
- Space: O(1).
Still doesn’t meet the O(\log(m+n)) requirement.
5. Approach 3: Binary Search (Optimal)
Key Insight: We need to partition both arrays such that:
- Left partition has
(m + n + 1) // 2elements. - All elements in left partition ≤ all elements in right partition.
Algorithm:
- Binary search on the smaller array.
- For each partition of the smaller array, compute the corresponding partition of the larger array.
- Check if the partition is valid.
def findMedianSortedArrays(nums1, nums2):
# Ensure nums1 is the smaller array
if len(nums1) > len(nums2):
nums1, nums2 = nums2, nums1
m, n = len(nums1), len(nums2)
total = m + n
half = (total + 1) // 2
lo, hi = 0, m
while lo <= hi:
i = (lo + hi) // 2 # Partition index for nums1
j = half - i # Partition index for nums2
# Edge handling with -inf and +inf
nums1_left = nums1[i - 1] if i > 0 else float('-inf')
nums1_right = nums1[i] if i < m else float('inf')
nums2_left = nums2[j - 1] if j > 0 else float('-inf')
nums2_right = nums2[j] if j < n else float('inf')
# Check if partition is valid
if nums1_left <= nums2_right and nums2_left <= nums1_right:
# Valid partition found
if total % 2 == 1:
return max(nums1_left, nums2_left)
else:
return (max(nums1_left, nums2_left) + min(nums1_right, nums2_right)) / 2
elif nums1_left > nums2_right:
hi = i - 1 # Too many elements from nums1
else:
lo = i + 1 # Too few elements from nums1
return 0 # Should never reach here
Complexity:
- Time:
O(\log(\min(m, n))). - Space: O(1).
6. Detailed Walkthrough
Example: nums1 = [1, 3, 8, 9, 15], nums2 = [7, 11, 18, 19, 21, 25]
Setup:
- m = 5, n = 6, total = 11.
- half = (11 + 1) // 2 = 6.
- Need 6 elements in left partition.
Binary Search:
Iteration 1: lo=0, hi=5, i=2, j=6-2=4
- nums1_left = nums1[1] = 3
- nums1_right = nums1[2] = 8
- nums2_left = nums2[3] = 19
- nums2_right = nums2[4] = 21
- Check: 3 ≤ 21 ✓, 19 ≤ 8 ✗
- 19 > 8, so we need more from nums1. lo = 3.
Iteration 2: lo=3, hi=5, i=4, j=6-4=2
- nums1_left = nums1[3] = 9
- nums1_right = nums1[4] = 15
- nums2_left = nums2[1] = 11
- nums2_right = nums2[2] = 18
- Check: 9 ≤ 18 ✓, 11 ≤ 15 ✓
- Valid partition!
- Odd total: median = max(9, 11) = 11.
Merged Array (for verification): [1, 3, 7, 8, 9, 11, 15, 18, 19, 21, 25]. Median = 11. ✓
7. Why Binary Search Works
Invariant: If we take i elements from nums1, we must take half - i elements from nums2.
Validity Condition:
nums1[i-1] ≤ nums2[j]: Largest in nums1’s left ≤ smallest in nums2’s right.nums2[j-1] ≤ nums1[i]: Largest in nums2’s left ≤ smallest in nums1’s right.
Binary Search Logic:
- If
nums1[i-1] > nums2[j]: We took too many from nums1. Decrease i. - If
nums2[j-1] > nums1[i]: We took too few from nums1. Increase i.
8. Edge Cases
- One Array is Empty:
nums1 = [],nums2 = [2, 3]→ Median of nums2 = 2.5.- Handle with
-infand+infsentinels.
- Arrays of Different Sizes:
- Always binary search on the smaller array for efficiency.
- All Elements in One Array are Smaller:
nums1 = [1, 2],nums2 = [3, 4, 5, 6].- Partition: All of nums1 in left, some of nums2 in left.
- Single Element Arrays:
nums1 = [1],nums2 = [2]→ Median = 1.5.
9. System Design: Distributed Median Finding
Problem: Find the median of data distributed across k servers.
Naive Approach: Collect all data, sort, find median. O(N \log N).
Efficient Approach (Sampling-Based):
- Sample: Each server sends a random sample.
- Estimate: Find approximate median from samples.
- Count: Each server counts elements < median, > median.
- Refine: Narrow the range containing the true median.
Algorithm (Binary Search on Value):
- Binary search on the median value (not indices).
- For each candidate value, count how many elements are ≤ it.
- If count = (total + 1) / 2, we found the median.
10. Deep Dive: Kth Element in Two Sorted Arrays
Generalization: Find the k-th smallest element in two sorted arrays.
Algorithm (Binary Search):
def findKthElement(nums1, nums2, k):
if len(nums1) > len(nums2):
nums1, nums2 = nums2, nums1
m, n = len(nums1), len(nums2)
lo, hi = max(0, k - n), min(k, m)
while lo <= hi:
i = (lo + hi) // 2
j = k - i
nums1_left = nums1[i - 1] if i > 0 else float('-inf')
nums1_right = nums1[i] if i < m else float('inf')
nums2_left = nums2[j - 1] if j > 0 else float('-inf')
nums2_right = nums2[j] if j < n else float('inf')
if nums1_left <= nums2_right and nums2_left <= nums1_right:
return max(nums1_left, nums2_left)
elif nums1_left > nums2_right:
hi = i - 1
else:
lo = i + 1
return -1
Use Case: Median is just k = (m + n + 1) // 2.
11. Production Application: Real-Time Percentile Computation
Scenario: Compute the 99th percentile latency from logs stored on multiple servers.
Algorithm:
- Histogram Binning: Each server maintains a histogram of latencies.
- Merge Histograms: Sum histograms across servers.
- Find Percentile: Walk through the merged histogram.
Optimization:
- Use exponential binning (log-scale) for accuracy across wide ranges.
- Stream histograms instead of raw data (reduces network traffic).
12. Interview Questions
- Median of Two Sorted Arrays (Classic): Solve with binary search.
- Kth Smallest Element: Generalize the median solution.
- Median of Data Stream: Design with two heaps.
- Distributed Median: Explain the sampling + counting approach.
- Running Median: Use balanced BST or two heaps.
13. Common Mistakes
- Not Handling Empty Arrays: Leads to index out of bounds.
- Off-by-One in Partition: Careful with i-1, j-1.
- Binary Searching the Wrong Array: Always search the smaller one.
- Integer Overflow: (a + b) / 2 can overflow. Use a + (b - a) / 2.
- Forgetting Even/Odd Cases: Median calculation differs.
14. Variant: Median of K Sorted Arrays
Problem: Given k sorted arrays, find the overall median.
Approach 1: Sequential Merge
- Merge pairs until one array remains.
- Find median.
- Time: O(N \log k) where N is total elements.
Approach 2: Binary Search on Value
- Binary search on the median value.
- For each candidate, count elements ≤ it in all arrays using binary search.
- Time: O(k \log V \log N) where V is the value range.
15. Mathematical Proof of Correctness
Theorem: The binary search algorithm correctly finds the median.
Proof:
- Partition Property: The left partition has exactly
(m + n + 1) // 2elements. - Validity: If
nums1[i-1] ≤ nums2[j]andnums2[j-1] ≤ nums1[i], then all left elements ≤ all right elements. - Binary Search Correctness:
- If
nums1[i-1] > nums2[j], we need fewer elements from nums1 (decrease i). - If
nums2[j-1] > nums1[i], we need more elements from nums1 (increase i).
- If
- Termination: Binary search terminates when a valid partition is found.
- Median: The median is max(left) or (max(left) + min(right)) / 2.
16. Deep Dive: Weighted Median
Problem: Given elements with weights, find the median where weight sums to 50% on each side.
Algorithm:
- Sort elements by value.
- Accumulate weights.
- Find where cumulative weight = total_weight / 2.
Use Case: Weighted voting, robust averaging.
17. Performance Benchmarking
import time
import random
def benchmark():
sizes = [(100, 100), (1000, 1000), (10000, 10000)]
for m, n in sizes:
nums1 = sorted(random.sample(range(1000000), m))
nums2 = sorted(random.sample(range(1000000), n))
# Binary Search
start = time.time()
for _ in range(1000):
findMedianSortedArrays(nums1, nums2)
bs_time = time.time() - start
# Merge
start = time.time()
for _ in range(1000):
findMedian_merge(nums1, nums2)
merge_time = time.time() - start
print(f"m={m}, n={n}: BS={bs_time:.4f}s, Merge={merge_time:.4f}s")
# Expected: Binary search is significantly faster for large arrays
18. Alternative: Divide and Conquer
Algorithm:
- Compare medians of both arrays.
- Discard the half that cannot contain the overall median.
- Recur on the remaining halves.
def findMedian_DC(nums1, nums2):
def kth(a, a_start, a_end, b, b_start, b_end, k):
if a_start > a_end:
return b[b_start + k - 1]
if b_start > b_end:
return a[a_start + k - 1]
if k == 1:
return min(a[a_start], b[b_start])
mid_a = float('inf') if a_start + k // 2 - 1 > a_end else a[a_start + k // 2 - 1]
mid_b = float('inf') if b_start + k // 2 - 1 > b_end else b[b_start + k // 2 - 1]
if mid_a < mid_b:
return kth(a, a_start + k // 2, a_end, b, b_start, b_end, k - k // 2)
else:
return kth(a, a_start, a_end, b, b_start + k // 2, b_end, k - k // 2)
m, n = len(nums1), len(nums2)
total = m + n
if total % 2 == 1:
return kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2 + 1)
else:
left = kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2)
right = kth(nums1, 0, m - 1, nums2, 0, n - 1, total // 2 + 1)
return (left + right) / 2
Complexity: O(\log(m + n)).
19. Conclusion
The Median of Two Sorted Arrays problem is a classic example of how binary search can achieve logarithmic time complexity in non-obvious ways.
Key Takeaways:
- Binary Search on Partition: Instead of searching for a value, search for a valid partition.
- Invariant Maintenance: Ensure left and right partitions satisfy the ordering property.
- Edge Handling: Use sentinels (-inf, +inf) for boundary cases.
- Generalization: The same technique works for finding the k-th element.
This problem demonstrates that mastering binary search goes beyond simple “find the target”, it’s about identifying the right search space.
20. Related Problems
- Median of Data Stream (LeetCode 295)
- Kth Smallest Element in a Sorted Matrix (LeetCode 378)
- Find K-th Smallest Pair Distance (LeetCode 719)
- Split Array Largest Sum (LeetCode 410)
Practice these to solidify your understanding of binary search on answer patterns!
21. Deep Dive: Median in a Stream (Two Heaps)
Problem: Find the median as elements are added one by one.
Algorithm:
- Maintain two heaps:
- Max-Heap (left): Smaller half.
- Min-Heap (right): Larger half.
-
Balance: left = right or left = right + 1. - Median = top(left) or (top(left) + top(right)) / 2.
import heapq
class MedianFinder:
def __init__(self):
self.left = [] # Max-heap (negate values)
self.right = [] # Min-heap
def addNum(self, num):
# Add to left (max-heap)
heapq.heappush(self.left, -num)
# Balance: Move largest from left to right
heapq.heappush(self.right, -heapq.heappop(self.left))
# Ensure left has at least as many elements
if len(self.left) < len(self.right):
heapq.heappush(self.left, -heapq.heappop(self.right))
def findMedian(self):
if len(self.left) > len(self.right):
return -self.left[0]
return (-self.left[0] + self.right[0]) / 2
Complexity:
- Add: O(\log n).
- Find Median: O(1).
22. Deep Dive: Median in a Matrix (Binary Search on Value)
Problem: Find the median of a row-wise sorted matrix.
Algorithm:
- Binary search on the value (not index).
- For each candidate value, count elements ≤ it in each row.
- If count = (rows × cols + 1) / 2, we found the median.
def matrixMedian(matrix):
rows, cols = len(matrix), len(matrix[0])
lo = min(row[0] for row in matrix)
hi = max(row[-1] for row in matrix)
target = (rows * cols + 1) // 2
while lo < hi:
mid = (lo + hi) // 2
# Count elements <= mid in all rows
count = sum(bisect.bisect_right(row, mid) for row in matrix)
if count < target:
lo = mid + 1
else:
hi = mid
return lo
Complexity: O(R \cdot \log C \cdot \log(\max - \min)).
23. Variant: Sliding Window Median
Problem: Find the median of each window of size k.
Approach:
- Use two heaps (like streaming median).
- Use lazy deletion to handle elements leaving the window.
Challenge: Efficiently removing elements from heaps.
from sortedcontainers import SortedList
def medianSlidingWindow(nums, k):
window = SortedList()
result = []
for i, num in enumerate(nums):
window.add(num)
if len(window) > k:
window.remove(nums[i - k])
if len(window) == k:
if k % 2 == 1:
result.append(window[k // 2])
else:
result.append((window[k // 2 - 1] + window[k // 2]) / 2)
return result
Complexity: O(n \cdot \log k) with balanced BST.
24. Application: A/B Testing Statistics
Scenario: Compare median latency between control and treatment groups.
Challenge: Latencies are stored on multiple servers.
Approach:
- Each server computes a histogram of latencies.
- Merge histograms.
- Find median from merged histogram.
Benefit: Avoids transferring raw data.
25. Application: Database Query Optimization
Scenario: SQL query SELECT MEDIAN(column) FROM table.
Implementation:
- If table is sorted: Direct access to middle element.
- If unsorted: Use quickselect (O(n) average) or sort (O(n \log n)).
- For approximate median: Use sampling or t-digest.
26. Deep Dive: Approximating Median with T-Digest
T-Digest is a data structure for approximate percentile computation.
Algorithm:
- Maintain a set of centroids (mean, count).
- Centroids near the median are small (high precision).
- Centroids at the extremes are large (low precision).
Operations:
- Add: Merge new element into nearest centroid.
- Query: Walk through centroids to find percentile.
Use Case: Real-time percentile computation at scale (Netflix, Elasticsearch).
27. Testing Strategy
Test Cases:
- Both Arrays Empty: Handle gracefully.
- One Array Empty: Return median of the other.
- Same Size Arrays:
[1,2],[3,4]→ 2.5. - Different Size Arrays:
[1,3],[2]→ 2. - All Elements in One Array Smaller:
[1,2],[3,4,5,6]. - Duplicates:
[1,1,1],[1,1,1]→ 1. - Large Arrays: Performance testing.
Edge Cases:
- Single element in each array.
- Negative numbers.
- Very large values.
28. Interview Tips
Step-by-Step Approach:
- Clarify: Sizes, sorted order, allowed time complexity.
- Brute Force: Merge and find median (O(m+n)).
- Optimize: Binary search on partition (
O(\log \min(m,n))). - Walk Through: Use a specific example.
- Edge Cases: Empty arrays, single elements.
- Code: Write clean, bug-free code.
- Complexity: Analyze time and space.
29. Common Follow-up Questions
Q: What if arrays are not sorted? A: Sort first (O(n \log n)), or use quickselect.
Q: What if we can’t afford O(\log(m+n))?
A: Two-pointer merge is O(m+n) and may be acceptable.
Q: What if arrays are on different machines? A: Use the sampling/counting approach for distributed median.
Q: What if we need the k-th element instead of median? A: Same algorithm, just change the target partition size.
30. Mastery Checklist
Mastery Checklist:
- Implement brute force (merge and find)
- Implement two-pointer (virtual merge)
- Implement binary search on partition
- Handle edge cases (empty arrays, single elements)
- Generalize to k-th element
- Understand the divide-and-conquer approach
- Explain why binary search works (partition invariant)
- Implement streaming median (two heaps)
- Implement sliding window median
- Understand distributed median algorithms
31. Conclusion
The Median of Two Sorted Arrays is one of the most elegant problems in computer science. It combines:
- Binary Search: A powerful algorithmic paradigm.
- Partition Logic: Dividing arrays into meaningful halves.
- Edge Case Handling: Using sentinels for boundaries.
Mastering this problem opens the door to many advanced topics: distributed computing, real-time statistics, and database optimization. It’s a testament to how mathematical insight can lead to efficient algorithms.
The journey from O(m+n) to O(\log \min(m,n)) teaches us that the right perspective can transform a problem.
FAQ
Why does binary search on the partition work for finding the median?
The median splits the combined sorted array into two equal halves. By binary searching the partition position in the shorter array, we automatically determine the partition in the longer array (since both halves must contain (m+n+1)/2 elements). The correct partition satisfies max_left1 <= min_right2 and max_left2 <= min_right1, ensuring all left elements are smaller than all right elements.
Why do we binary search on the shorter array?
Searching the shorter array guarantees O(log min(m,n)) time complexity, which is better than O(log max(m,n)). It also prevents index out-of-bounds errors since the partition in the longer array is derived from the shorter array’s partition, and a valid partition range is always available in the longer array.
How do you handle edge cases with empty partitions?
When the partition places all elements of one array on one side, the other side’s boundary is empty. Use negative infinity for the left boundary and positive infinity for the right boundary of an empty partition. This ensures the comparison conditions max_left <= min_right are automatically satisfied, making the edge cases fall through naturally without special handling.
What is the difference between finding the median and finding the Kth element?
Finding the Kth smallest element generalizes the median problem. The binary search approach eliminates half the candidates at each step by comparing elements at positions k/2 in each array. The median is simply the (m+n+1)/2-th element for an odd total length, or the average of the middle two elements for an even total length.
Originally published at: arunbaby.com/dsa/0043-median-of-two-sorted-arrays