In computer science, a selection algorithm is an algorithm for finding the kth smallest value in a collection of ordered values, such as numbers. The value that it finds is called the kth order statistic. Different choices of k produce the minimum, maximum, and median elements in the given collection. Selection algorithms include quickselect and the median of medians algorithm. When applied to a collection of n values, these algorithms take linear time, O(n) as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time O(1).
Algorithms
Sorting and heapselect
As a baseline algorithm, selection of the kth smallest value in a collection of n values can be performed very simply by the following two steps:
- Sort the collection
- If the output of the sorting algorithm is an array, jump to its kth element; otherwise, scan the sorted sequence to find the kth element.
The time for this method is dominated by the sorting step, which requires O(n log n) time using a comparison sort. Even when integer sorting algorithms may be used, these are generally slower than the O(n) time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly optimized sorting routine is provided as part of a runtime library but a selection algorithm is not.
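For illustration, a minimal sketch of this baseline in Python (here, as in the other sketches below, k is counted from zero, so k = 0 selects the minimum):

```python
def select_by_sorting(values, k):
    """Return the kth smallest value (k counted from zero) by sorting.

    The running time is dominated by the O(n log n) comparison sort.
    """
    return sorted(values)[k]


print(select_by_sorting([5, 1, 4, 2, 3], 0))  # 1 (the minimum)
print(select_by_sorting([5, 1, 4, 2, 3], 2))  # 3 (the median)
```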
For a sorting algorithm that generates one item at a time, such as selection sort, the scan can be done in tandem with the sort, and the sort can be terminated once the kth element has been found. Applying this optimization to heapsort produces the heapselect algorithm, which can select the kth smallest value in time O(n + k log n). This is fast when k is small relative to n, but degenerates to O(n log n) for larger values of k, such as the choice k = n/2 used for median finding.
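A sketch of heapselect using Python's heapq module:

```python
import heapq


def heapselect(values, k):
    """Return the kth smallest value (k counted from zero) via a min-heap.

    heapify takes O(n); each of the k pops costs O(log n),
    for O(n + k log n) in total -- fast when k is small relative to n.
    """
    heap = list(values)
    heapq.heapify(heap)      # O(n) bottom-up heap construction
    for _ in range(k):       # discard the k smallest values
        heapq.heappop(heap)
    return heap[0]           # the kth smallest now sits at the root
```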
Pivoting
Many methods for selection are based on choosing a special "pivot" element from the input, and using comparisons with this element to divide the remaining input values into two subsets: the set L of elements less than the pivot, and the set R of elements greater than the pivot. The algorithm can then determine where the kth smallest value is to be found, based on a comparison of k with the sizes of these sets. In particular, if k ≤ |L|, the kth smallest value is in L, and can be found recursively by applying the same selection algorithm to L. If k = |L| + 1, then the kth smallest value is the pivot, and it can be returned immediately. In the remaining case, the kth smallest value is in R, and more specifically it is the element in position k − |L| − 1 of R. It can be found by applying a selection algorithm recursively, seeking the value in this position in R.
As with the related pivoting-based quicksort algorithm, the partition of the input into L and R may be done by making new collections for these sets, or by a method that partitions a given list or array data type in-place. Details vary depending on how the input collection is represented.
The time to compare the pivot against all the other values is O(n). However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot.
- If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a geometric series to O(n). However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each call.
- Quickselect chooses the pivot uniformly at random from the input values. It can be described as a variant of quicksort, with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections L and R, quickselect only makes one of these two calls. Its expected time is O(n); a sketch appears after this list.
- The median of medians method partitions the input into sets of five elements, and then uses a decision tree to find the median of each of these sets. It then recursively calls the same selection algorithm to find the median of these n/5 medians, using the result as its pivot. It can be shown that, for this choice of pivot, max(|L|, |R|) ≤ 7n/10. Thus, a problem on n elements is reduced to two recursive problems on n/5 elements (to find the pivot) and at most 7n/10 elements. The total size of these two recursive subproblems is at most 9n/10, allowing the total time to be analyzed as a geometric series adding to O(n). Unlike quickselect, this algorithm is deterministic, not randomized. It is commonly taught in undergraduate algorithms classes as an example of a divide and conquer algorithm that does not divide into two equal subproblems. However, the high constant factors in its O(n) time bound make it unsuitable for practical use.
- Hybrid algorithms such as introselect can be used to achieve the practical performance of quickselect with a fallback to the median of medians guaranteeing worst-case O(n) time.
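A compact sketch of quickselect along the lines described above (k counted from zero; for clarity this version builds the new lists L and R rather than partitioning in place, so it is not a production implementation):

```python
import random


def quickselect(values, k):
    """Return the kth smallest value (k counted from zero) in expected O(n) time."""
    pivot = random.choice(values)          # pivot chosen uniformly at random
    L = [v for v in values if v < pivot]   # values less than the pivot
    R = [v for v in values if v > pivot]   # values greater than the pivot
    if k < len(L):                         # answer lies among the smaller values
        return quickselect(L, k)
    if k >= len(values) - len(R):          # answer lies among the larger values
        return quickselect(R, k - (len(values) - len(R)))
    return pivot                           # otherwise the answer is the pivot
```

Values equal to the pivot fall between L and R, which is why the middle case can return the pivot immediately even when the input contains duplicates.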
Sublinear data structures
Given an unorganized list of data, linear time (Ω(n)) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the kth largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
The strategy to find an order statistic in sublinear time is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables.
When only the minimum (or maximum) is needed, a good approach is to use a heap, which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log n) or better. More generally, a self-balancing binary search tree can easily be augmented to make it possible to both insert an element and find the kth largest element in O(log n) time; this is called an order statistic tree. We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log n) ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
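As an illustrative sketch of this counting idea, here is selection on a plain (unbalanced) binary search tree augmented with subtree sizes; a true order statistic tree would attach the same bookkeeping to a self-balancing tree such as a red-black tree so that both insertion and selection stay O(log n) in the worst case:

```python
class Node:
    """Binary search tree node augmented with the size of its subtree."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def insert(node, key):
    """Insert key, updating subtree sizes along the search path."""
    if node is None:
        return Node(key)
    node.size += 1
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node


def select(node, k):
    """Return the kth smallest key (k counted from zero) in node's subtree."""
    left_size = node.left.size if node.left else 0
    if k < left_size:
        return select(node.left, k)               # answer is in the left subtree
    if k == left_size:
        return node.key                           # this node is the kth smallest
    return select(node.right, k - left_size - 1)  # skip left subtree and root


root = None
for x in [5, 1, 4, 2, 3]:
    root = insert(root, x)
print(select(root, 2))  # 3
```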
Another simple strategy is based on some of the same concepts as the hash table. When we know the range of values beforehand, we can divide that range into h subintervals and assign these to h buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the kth element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket.
If we choose h of size roughly sqrt(n), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(n)) time. Unfortunately, this strategy is also sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the kth largest hash value isn't very useful). Additionally, like hash tables this structure requires table resizings to maintain efficiency as elements are added and n becomes much larger than h². A useful case of this is finding an order statistic or extremum in a finite range of data. Using the above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like frequency tables used to classify the data in descriptive statistics.
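A sketch of the bucket scheme under the stated assumptions (all values are known to lie in a half-open range [lo, hi), and h is taken to be roughly sqrt(n); the function name bucket_select is ours, and for brevity the chosen bucket is simply sorted where the text would instead apply a linear-time selection such as quickselect to it):

```python
import math


def bucket_select(values, k, lo, hi):
    """Return the kth smallest value (k counted from zero), all in [lo, hi)."""
    h = max(1, math.isqrt(len(values)))        # about sqrt(n) buckets
    width = (hi - lo) / h
    buckets = [[] for _ in range(h)]
    for v in values:
        i = min(int((v - lo) / width), h - 1)  # clamp values at the top edge
        buckets[i].append(v)
    for bucket in buckets:                     # scan buckets left to right,
        if k < len(bucket):                    # accumulating counts
            return sorted(bucket)[k]           # select within this bucket
        k -= len(bucket)


print(bucket_select([0.3, 0.1, 0.9, 0.5], 1, 0.0, 1.0))  # 0.3
```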
Lower bounds
In The Art of Computer Programming, Donald E. Knuth discussed a number of lower bounds for the number of comparisons required to locate the t smallest entries of an unorganized list of n items (using only comparisons). There is a trivial lower bound of n − 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of n − 1 comparisons.
The story becomes more complex for other indexes. We define W_t(n) as the minimum number of comparisons required to find the t smallest values. Knuth references a paper published by S. S. Kislitsyn, which shows an upper bound on this value:

W_t(n) ≤ n − t + ⌈log₂ n⌉ + ⌈log₂(n − 1)⌉ + ⋯ + ⌈log₂(n + 2 − t)⌉ for n ≥ t.
This bound is achievable for t=2 but better, more complex bounds are known for larger t.[citation needed]
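For instance, at t = 2 the sum reduces to its single term ⌈log₂ n⌉, giving W_2(n) ≤ n − 2 + ⌈log₂ n⌉. This matches the classical tournament argument: after the n − 1 games that determine the winner, the second smallest must be among the at most ⌈log₂ n⌉ players who lost directly to the winner, and can be found with ⌈log₂ n⌉ − 1 further comparisons.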
Language support
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is C++, which provides a templated nth_element method with a guarantee of expected linear time, and which also partitions the data, requiring that the nth element be sorted into its correct place, with elements before the nth element less than it and elements after the nth element greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data.[1][2]
For Perl, the module Sort::Key::Top, available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the Statistics::CaseResampling module provides a function to calculate quantiles using Quickselect.
Python's standard library (since 2.4) includes heapq.nsmallest() and nlargest(), returning sorted lists, in O(n log k) time.[3] NumPy has the partition() function.
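For example, a quick sketch of these Python facilities in use (return values shown in comments):

```python
import heapq

import numpy as np

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]

print(heapq.nsmallest(3, data))  # [1, 2, 3] -- sorted, in O(n log k) time
print(heapq.nlargest(3, data))   # [9, 8, 7]

a = np.array(data)
print(np.partition(a, 2)[2])     # 3 -- index 2 holds its sorted-order value;
                                 # the rest is merely partitioned around it
```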
Matlab includes maxk() and mink() functions, which return the maximal (minimal) k values in a vector as well as their indices.
Because language support for sorting is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed, for lazy languages, this simplistic approach can even achieve the best complexity possible for the k smallest/greatest sorted (with maximum/minimum as a special case) if the sort is lazy enough[citation needed].
References
- ^ Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)
- ^ nth_element, SGI STL
- ^ "Python - What is the time complexity of heapq.nlargest?".
Bibliography
- Blum, M.; Floyd, R. W.; Pratt, V. R.; Rivest, R. L.; Tarjan, R. E. (August 1973). "Time bounds for selection" (PDF). Journal of Computer and System Sciences. 7 (4): 448–461. doi:10.1016/S0022-0000(73)80033-9.
- Floyd, R. W.; Rivest, R. L. (March 1975). "Expected time bounds for selection". Communications of the ACM. 18 (3): 165–172. doi:10.1145/360680.360691. S2CID 3064709.
- Kiwiel, K. C. (2005). "On Floyd and Rivest's SELECT algorithm". Theoretical Computer Science. 347 (1–2): 214–238. doi:10.1016/j.tcs.2005.06.032.
- Knuth, Donald E. (1997). The Art of Computer Programming, Volume 3: Sorting and Searching (3rd ed.). Addison-Wesley. ISBN 0-201-89685-0. Section 5.3.3: Minimum-Comparison Selection, pp. 207–219.
- Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03293-7. Chapter 9: Medians and Order Statistics, pp. 183–196; Section 14.1: Dynamic order statistics, pp. 302–308.
External links
- "Lecture notes for January 25, 1996: Selection and order statistics", ICS 161: Design and Analysis of Algorithms, David Eppstein