Programming: Median of medians

Median of Medians Algorithm

Quickselect is linear-time on average, but it can require quadratic time with poor pivot choices. This is because quickselect is a decrease and conquer algorithm, with each step taking O(n) time in the size of the remaining search set. If the search set decreases exponentially quickly in size (by a fixed proportion), this yields a geometric series times the O(n) factor of a single step, and thus linear overall time. However, if the search set decreases slowly in size, such as linearly (by a fixed number of elements, in the worst case only reducing by one element each time), then a linear sum of linear steps yields quadratic overall time. For example, the worst case occurs when pivoting on the smallest element at each step, such as applying quickselect for the maximum element to already sorted data and taking the first element as pivot each time.

If one instead consistently chooses "good" pivots, this is avoided and one always gets linear performance even in the worst case. A "good" pivot is one for which we can establish that a constant proportion of elements fall both below and above it, as then the search set decreases at least by a constant proportion at each step, hence exponentially quickly, and the overall time remains linear. The median is a good pivot – the best for sorting, and the best overall choice for selection – decreasing the search set by half at each step. Thus if one can compute the median in linear time, this only adds linear time to each step, and thus the overall complexity of the algorithm remains linear.

The median-of-medians algorithm does not actually compute the exact median, but computes an approximate median, namely a point that is guaranteed to be between the 30th and 70th percentiles . Thus the search set decreases by a fixed proportion at each step, namely at least 30% (so at most 70% left). Lastly, the overhead of computing the pivot consists of recursing in a set of size 20% the size of the original search set, plus a linear factor, so at linear cost at each step, the problem is reduced to 90% (20% and 70%) the original size, which is a fixed proportion smaller, and thus by induction the overall complexity is linear in size.

Median-of-medians algorithm:

Line up elements in groups of five (this number 5 is not important, it could be e.g. 7 without changing the algorithm much). Call each group S[i], with i ranging from 1 to n/5.
Find the median of each group. (Call this x[i]). This takes 6 comparisons per group, so 6n/5 total (it is linear time because we are taking medians of very small subsets).
Find the median of the x[i], using a recursive call to the algorithm. If we write a recurrence in which T(n) is the time to run the algorithm on a list of n items, this step takes time T(n/5). Let M be this median of medians.
Use M to partition the input and call the algorithm recursively on one of the partitions, just like in quickselect.

The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around n/10 elements (½×n/5) for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than 3(n/10) elements outside the block, and greater than another 3(n/10) elements inside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:

One iteration on the list {0,1,2,3,...99}
	12	15	11	2	9	5	0	7	3	21	44	40	1	18	20	32	19	35	37	39
	13	16	14	8	10	26	6	33	4	27	49	46	52	25	51	34	43	56	72	79
Medians	17	23	24	28	29	30	31	36	42	47	50	55	58	60	63	65	66	67	81	83
	22	45	38	53	61	41	62	82	54	48	59	57	71	78	64	80	70	76	85	87
	96	95	94	86	89	69	68	97	73	92	74	88	99	84	75	90	77	93	98	91

(red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red")

Proof of O(n) running time

The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time

$T(n) \leq T(n \cdot 2/10) + T(n \cdot 7/10) + c \cdot n.$

The O(n) term c n is for the partitioning work (we visited each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction – or summing the geometric series, obtaining $1/(1-9/10) = 1/(1/10) = 10$ for overall scaling factor – one can easily show that

$T(n) \leq 10 \cdot c \cdot n \in O(n).$

Code

// selects the median of medians in an array
int select(int *a, int s, int e, int k){
    // if the partition length is less than or equal to 5
    // we can sort and find the kth element of it
    // this way we can find the median of n/5 partitions
    if(e-s+1 <= 5){
        sort(a+s, a+e);
        return s+k-1;
    }
    
    // if array is bigger 
    // we partition the array in subarrays of size 5
    // no. of partitions = n/5 = (e+1)/5
    // iterate through each partition
    // and recursively calculate the median of all of them
    // and keep putting the medians in the starting of the array
    for(int i=0; i<(e+1)/5; i++){
        int left = 5*i;
        int right = left + 4;
        if(right > e) right = e;
        int median = select(a, 5*i, 5*i+4, 3);
        swap(a[median], a[i]);
    }
    
    // now we have array 
    // a[0] = median of 1st 5 sized partition
    // a[1] = median of 2nd 5 sized partition
    // and so on till n/5
    // to find out the median of these n/5 medians
    // we need to select the n/10th element of this set (i.e. middle of it)
    return select(a, 0, (e+1)/5, (e+1)/10);
}

int main(){
    int a[] = {6,7,8,1,2,3,4,5,9,10};
    int n = 10;
    
    int mom = select(a, 0, n-1, n/2);
    cout<<"Median of Medians: " << a[mom] << endl;
    return 0;
}

3 comments:

Victor12 October 2015 at 15:08
in line:
return select(a, 0, (e+1)/5, (e+1)/10);
If e = 8,
returns -1... bug
Unknown20 February 2018 at 05:42
citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product
whole milk,butter,yogurt,rice,abrasive cleaner
rolls/buns
other vegetables,UHT-milk,rolls/buns,bottled beer,liquor (appetizer)
pot plants
whole milk,cereals
oak14 November 2020 at 11:19
The code was motivational to build a median of median algorithm. The median can now be used as a partition. However, the code has bugs for values less than 5. Here is the modified version:
int select(vector &arr, int s, int e, int k){

if(e-s+1 <= 5){

//If the difference was 1, return s only
int end;
end = e;
end = end + 1;
int diff = end - s;
if(diff == 1)
{
return s;
}
else{
insertionSort(arr, s, e+1);
return s + k;
}
}

// if array is bigger
// we partition the array in subarrays of size 5
// and total number of partitions is ceiling of division by 5

float maxIter = (float(e+1)/5);

maxIter = ceil(maxIter);

//Instead of finding the median from 3rd position, if the difference is even subtract 1
//If the difference is odd divide by 2
for(int i=0; i < maxIter; i++){
cout << endl;

int left = 5*i;

int right = left + 4;

if(right > e){

right = e;
}

int difference = right - left;
difference++;

if(difference%2==0)
{
difference = (difference/2) - 1;

}
else
{
difference = difference/2;

}

int median = select(arr, left, right, difference);

swap(arr[median], arr[i]);

}

selector = (e+1)/10;

return select(arr, 0, maxIter - 1, selector);

}

Sunday 1 September 2013

Median of medians

Median of Medians Algorithm

Proof of O(n) running time

3 comments: