The median is the middle value in an ordered integer list. If the size of the list is even, there is no single middle value, so the median is the mean of the two middle values.
Examples:
[2,3,4], the median is 3
[2,3], the median is (2 + 3) / 2 = 2.5
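Design a data structure that supports the following two operations:
addNum(num): add an integer from the data stream to the data structure.
findMedian(): return the median of all elements added so far.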
For example:
addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3)
findMedian() -> 2
Credits:
Special thanks to @Louis1992 for adding this problem and creating all test cases.
Intuition
Do what the question says.
Algorithm
Store the numbers in a resize-able container. Every time you need to output the median, sort the container and output the median.
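The steps above can be sketched in C++ roughly as follows. This is a minimal illustration, not the article's original listing: the class name SortingMedianFinder is hypothetical, and findMedian assumes at least one number has been added.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical name; stores every number and sorts on each median query.
class SortingMedianFinder {
    std::vector<int> store;  // resizable container holding every number seen

public:
    // Amortized O(1): append to the end of the container.
    void addNum(int num) { store.push_back(num); }

    // O(n log n): sort in place, then read the middle element(s).
    // Assumption: at least one number has been added.
    double findMedian() {
        std::sort(store.begin(), store.end());
        const std::size_t n = store.size();
        if (n % 2 == 1) return store[n / 2];
        // Even size: mean of the two middle values (the cast avoids int overflow).
        return (static_cast<double>(store[n / 2 - 1]) + store[n / 2]) / 2.0;
    }
};
```

Calling addNum(1), addNum(2) and then findMedian() on this sketch yields 1.5, matching the example in the problem statement.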
Complexity Analysis
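Time complexity: O(n log n) per findMedian call. Adding a number takes amortized O(1) time for a container with an efficient resizing scheme, so the cost is dominated by the sort, which takes O(n log n) time for a standard comparison sort.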
Space complexity: linear space to hold input in a container. No extra space other than that needed (since sorting can usually be done in-place).
Intuition
Keep the input container sorted at all times (i.e. maintain the sorted nature of the container as an invariant).
Algorithm
Which algorithm allows a number to be added to a sorted list of numbers and yet keeps the entire list sorted? Well, for one, insertion sort!
We assume that the current list is already sorted. When a new number comes, we have to add it to the list while maintaining the sorted nature of the list. This is achieved easily by finding the correct place to insert the incoming number, using a binary search (remember, the list is always sorted). Once the position is found, we need to shift all higher elements by one space to make room for the incoming number.
This method works well when the number of insertion queries is smaller than or about the same as the number of median-finding queries.
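A minimal C++ sketch of this idea is shown below, assuming a std::vector as the sorted container. The class name InsertionMedianFinder is hypothetical, and findMedian assumes the container is non-empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical name; keeps the container sorted after every insertion.
class InsertionMedianFinder {
    std::vector<int> store;  // invariant: always sorted in ascending order

public:
    void addNum(int num) {
        // Binary search for the first element not less than num: O(log n).
        auto pos = std::lower_bound(store.begin(), store.end(), num);
        // Inserting shifts every later element one slot to the right: O(n).
        store.insert(pos, num);
    }

    // O(1): the middle element(s) are directly addressable.
    // Assumption: at least one number has been added.
    double findMedian() const {
        const std::size_t n = store.size();
        if (n % 2 == 1) return store[n / 2];
        return (static_cast<double>(store[n / 2 - 1]) + store[n / 2]) / 2.0;
    }
};
```

Here std::vector::insert performs the shifting described above, so each addNum call is linear in the worst case even though the search itself is logarithmic.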
Complexity Analysis
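Time complexity: O(n) per insertion. The binary search takes O(log n) time to find the correct insertion position, but the insertion itself can take up to O(n) time, since elements have to be shifted inside the container to make room for the new element. Finding the median is then O(1).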
Pop quiz: Can we use a linear search instead of a binary search to find insertion position, without incurring any significant runtime penalty?
Intuition
The above two approaches gave us some valuable insights on how to tackle this problem. Concretely, one can infer two things:
But perhaps the most important insight, which is not readily observable, is the fact that we only need a consistent way to access the median elements. Keeping the entire input sorted is not a requirement.
Well, if only there were a data structure which could handle our needs.
As it turns out, there are two data structures for the job: heaps (or priority queues) and self-balancing binary search trees.
Heaps are a natural ingredient for this dish! Adding an element to a heap takes logarithmic time. Heaps also give direct access to the maximal/minimal element in a group.
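If we could maintain two heaps in the following way:
A max-heap to store the smaller half of the input numbers
A min-heap to store the larger half of the input numbers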
This gives access to median values in the input: they comprise the top of the heaps!
Wait, what? How?
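If both heaps are balanced (or nearly balanced), and the max-heap contains all the smaller numbers while the min-heap contains all the larger numbers, then we can say that every number in the max-heap is smaller than or equal to its top element (call it x), and every number in the min-heap is larger than or equal to its top element (call it y).
Then x and/or y are smaller than (or equal to) almost half of the elements and larger than (or equal to) the other half. That is the definition of median elements.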
This leads us to a huge point of pain in this approach: balancing the two heaps!
Algorithm
Two priority queues:
A max-heap lo to store the smaller half of the numbers
A min-heap hi to store the larger half of the numbers
The max-heap lo is allowed to store, at worst, one more element than the min-heap hi. Hence if we have processed k elements, lo is allowed to hold ceil(k/2) elements, while hi can hold floor(k/2) elements.
This gives us the nice property that when the heaps are perfectly balanced, the median can be derived from the tops of both heaps. Otherwise, the top of the max-heap lo holds the legitimate median.
Adding a number num:
1. Add num to max-heap lo. Since lo received a new element, we must do a balancing step for hi. So remove the largest element from lo and offer it to hi.
2. The min-heap hi might end up holding more elements than the max-heap lo after the previous operation. We fix that by removing the smallest element from hi and offering it to lo.
The above step ensures that we do not disturb the nice little size property we just mentioned.
A little example will clear this up! Say we take input from the stream [41, 35, 62, 5, 97, 108]. The run-through of the algorithm looks like this:
Adding number 41
MaxHeap lo: [41] // MaxHeap stores the largest value at the top (index 0)
MinHeap hi: [] // MinHeap stores the smallest value at the top (index 0)
Median is 41
=======================
Adding number 35
MaxHeap lo: [35]
MinHeap hi: [41]
Median is 38
=======================
Adding number 62
MaxHeap lo: [41, 35]
MinHeap hi: [62]
Median is 41
=======================
Adding number 5
MaxHeap lo: [35, 5]
MinHeap hi: [41, 62]
Median is 38
=======================
Adding number 97
MaxHeap lo: [41, 35, 5]
MinHeap hi: [62, 97]
Median is 41
=======================
Adding number 108
MaxHeap lo: [41, 35, 5]
MinHeap hi: [62, 97, 108]
Median is 51.5
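The two-heap procedure described above can be sketched in C++ with std::priority_queue. This is an illustrative sketch under stated assumptions rather than the article's original listing: the class name HeapMedianFinder is hypothetical, and findMedian assumes at least one number has been added.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Hypothetical name; lo and hi follow the naming used in the text.
class HeapMedianFinder {
    std::priority_queue<int> lo;                                       // max-heap: smaller half
    std::priority_queue<int, std::vector<int>, std::greater<int>> hi;  // min-heap: larger half

public:
    void addNum(int num) {
        lo.push(num);       // step 1: add to the max-heap
        hi.push(lo.top());  // balancing step: hand the largest of lo over to hi
        lo.pop();
        if (lo.size() < hi.size()) {  // step 2: hi must never outgrow lo
            lo.push(hi.top());
            hi.pop();
        }
    }

    // Assumption: at least one number has been added.
    double findMedian() const {
        if (lo.size() > hi.size()) return lo.top();  // odd count: top of lo
        return (static_cast<double>(lo.top()) + hi.top()) / 2.0;
    }
};
```

Feeding this sketch the values 41, 35, 62, 5, 97, 108 one at a time gives the medians 41, 38, 41, 38, 41 and 51.5.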
Complexity Analysis
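Time complexity: O(log n) per addNum call. At worst, there are three heap insertions and two deletions from the top, each of which takes about O(log n) time. Finding the median takes constant O(1) time, since the tops of the heaps are directly accessible.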
Space complexity: linear space to hold input in containers.
Intuition
Self-balancing Binary Search Trees (like an AVL Tree) have some very interesting properties. They keep the tree's height within a logarithmic bound, so inserting a new element has reasonably good time performance. In a well-balanced tree, the median typically sits at or near the root. Solving this problem with the same idea as Approach #3, but using a self-balancing BST, seems like a good choice. The catch is that implementing such a tree is non-trivial and error-prone.
Why reinvent the wheel? Most languages implement a multiset class which emulates such behavior. The only remaining problem is keeping track of the median elements. That is easily solved with pointers! [2]
We maintain two pointers: one for the lower median element and the other for the higher median element. When the total number of elements is odd, both the pointers point to the same median element (since there is only one median in this case). When the number of elements is even, the pointers point to two consecutive elements, whose mean is the representative median of the input.
Algorithm
Two iterators/pointers, lo_median and hi_median, which iterate over the data multiset.
While adding a number num, three cases arise:
1. The container is currently empty. Hence we simply insert num and set both pointers to point to this element.
2. The container currently holds an odd number of elements. This means that both the pointers currently point to the same element.
   If num is not equal to the current median element, then num goes on either side of it. Whichever side it goes, the size of that part increases and hence the corresponding pointer is updated. For example, if num is less than the median element, the size of the lesser half of input increases by 1 on inserting num. Thus it makes sense to decrement lo_median.
   If num is equal to the current median element, then the action taken is dependent on how num is inserted into data. NOTE: In C++, std::multiset::insert places a new element after all existing elements of equal value. Hence we increment hi_median.
3. The container currently holds an even number of elements. This means that the pointers currently point to consecutive elements.
   If num is a number between both median elements, then num becomes the new median. Both pointers must point to it.
   Otherwise, num increases the size of either the lesser or higher half of the input. We update the pointers accordingly. It is important to remember that both the pointers must point to the same element now.
Finding the median is easy! It is simply the mean of the elements pointed to by the two pointers lo_median and hi_median.
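The three cases above can be sketched in C++ as follows. This is a hedged reconstruction rather than a verbatim copy of any original listing: the class name MultisetMedianFinder is hypothetical, findMedian assumes a non-empty container, and the sketch relies on std::multiset::insert placing a new element after existing equal elements (guaranteed since C++11).

```cpp
#include <cstddef>
#include <set>

// Hypothetical name; data, lo_median and hi_median follow the text.
class MultisetMedianFinder {
    std::multiset<int> data;
    std::multiset<int>::iterator lo_median, hi_median;

public:
    MultisetMedianFinder() : lo_median(data.end()), hi_median(data.end()) {}

    void addNum(int num) {
        const std::size_t n = data.size();  // size before the insertion
        data.insert(num);                   // lands after any equal elements

        if (n == 0) {
            lo_median = hi_median = data.begin();  // case 1: first element
        } else if (n % 2 == 1) {
            // Case 2: odd size before, so lo_median == hi_median.
            if (num < *lo_median) --lo_median;  // lesser half grew
            else ++hi_median;                   // num landed after the median
        } else {
            // Case 3: even size before, pointers on consecutive elements.
            if (num > *lo_median && num < *hi_median) {
                ++lo_median;  // num sits between them and becomes the median
                --hi_median;
            } else if (num >= *hi_median) {
                ++lo_median;  // higher half grew: old hi element is the median
            } else {
                lo_median = --hi_median;  // lesser half grew (num <= *lo_median)
            }
        }
    }

    // Assumption: at least one number has been added.
    double findMedian() const {
        return (static_cast<double>(*lo_median) + *hi_median) / 2.0;
    }
};
```

The last branch steps hi_median back rather than reusing lo_median because an element equal to *lo_median is inserted after it, so the new median is whatever now precedes the old hi_median.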
A much shorter (but harder to understand) one-pointer version [3] of this solution is also possible.
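One way such a one-pointer variant might look is sketched below. It is not necessarily the version the author refers to: the class name OnePointerMedianFinder and the member mid are hypothetical, mid is assumed to track the upper median (index size / 2 in sorted order), and findMedian assumes a non-empty container.

```cpp
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical name; a single iterator tracks the upper median.
class OnePointerMedianFinder {
    std::multiset<int> data;
    std::multiset<int>::iterator mid;  // element at index size / 2

public:
    OnePointerMedianFinder() : mid(data.end()) {}

    void addNum(int num) {
        const std::size_t n = data.size();  // size before the insertion
        data.insert(num);                   // lands after any equal elements

        if (n == 0)
            mid = data.begin();  // first element
        else if (num < *mid)
            mid = (n % 2 == 1) ? mid : std::prev(mid);  // inserted before mid
        else
            mid = (n % 2 == 1) ? std::next(mid) : mid;  // inserted after mid
    }

    // Assumption: at least one number has been added.
    double findMedian() const {
        if (data.size() % 2 == 1) return *mid;
        // Even size: the lower median is the element just before mid.
        return (static_cast<double>(*std::prev(mid)) + *mid) / 2.0;
    }
};
```

The lower median is recovered on demand with std::prev, which is why a single iterator is enough.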
Complexity Analysis
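Time complexity: O(log n) per addNum call, since inserting a number takes O(log n) time for a standard multiset scheme. [4] Finding the median takes constant O(1) time, because the median elements are directly accessible through the two pointers.
Space complexity: linear space to hold input in container.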
There are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:
Buckets! If the numbers in the stream are statistically distributed, then it is easier to keep track of the bucket where the median would land than of the entire array. Once you know the correct bucket, simply sort it to find the median. If the bucket size is significantly smaller than the size of the input processed, this results in a huge time saving. @mitbbs8080 has an interesting implementation here.
Reservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
Segment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always. The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.
Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the k-th order element (the k-th smallest) stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice. [5]
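Of the alternatives listed above, the segment tree idea is perhaps the easiest to illustrate briefly. The sketch below is a simplified counting tree under a strong assumption that is not part of the original problem: every input value lies in [0, range). The class name SegmentTreeMedianFinder and the helper kth are hypothetical, and findMedian assumes at least one number has been added.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical name; counts occurrences per value in a complete binary tree.
class SegmentTreeMedianFinder {
    std::size_t leaves;            // number of leaves, rounded up to a power of two
    std::vector<std::size_t> cnt;  // cnt[node] = how many stored values fall under node

    // Value of the k-th smallest stored number (0-based); requires k < cnt[1].
    int kth(std::size_t k) const {
        std::size_t node = 1;
        while (node < leaves) {
            node *= 2;                // try the left child first
            if (k >= cnt[node]) {     // target lies beyond the left subtree
                k -= cnt[node];
                ++node;               // move to the right child
            }
        }
        return static_cast<int>(node - leaves);
    }

public:
    explicit SegmentTreeMedianFinder(std::size_t range) : leaves(1) {
        while (leaves < range) leaves *= 2;
        cnt.assign(2 * leaves, 0);
    }

    // Assumption: 0 <= num < range. Updates one root-to-leaf path: O(log range).
    void addNum(int num) {
        for (std::size_t node = leaves + static_cast<std::size_t>(num); node >= 1; node /= 2)
            ++cnt[node];
    }

    // O(log range): one or two descents from the root.
    double findMedian() const {
        const std::size_t n = cnt[1];  // total number of stored values
        if (n % 2 == 1) return kth(n / 2);
        return (static_cast<double>(kth(n / 2 - 1)) + kth(n / 2)) / 2.0;
    }
};
```

Both operations cost time proportional to the logarithm of the value range rather than of the number of elements, which is the trade-off behind the "limited range of input values" remark.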
Analysis written by @babhishek21.
Priority Queues queue out elements based on a predefined priority. They are an abstract concept and can, as such, be implemented in many different ways. Heaps are an efficient way to implement Priority Queues.
Inspired from this post by @StefanPochmann.
GNU libstdc++ users are in luck! Take a look at this StackOverflow answer.