Selection algorithm

In computer science, a selection algorithm is an algorithm for finding the $k$ th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the $k$ th order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect, and the median of medians algorithm. When applied to a collection of $n$ values, these algorithms take linear time, $O(n)$ as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time $O(1)$ .

Problem statement

An algorithm for the selection problem takes as input a collection of values, and a number $k$ . It outputs the $k$ th smallest of these values. For this should be well-defined, it should be possible to sort the values into an order from smallest to largest; for instance, they may be numbers, or some other kind of object with a numeric key. However, they are not assumed to have been already sorted. Often, selection algorithms are restricted to a comparison-based model of computation, as in comparison sort algorithms, where the algorithm has access to a comparison operation that can determine the relative ordering of any two values, but may not perform any other kind of arithmetic operations on these values.

To simplify the problem, some sources may assume that the values are all distinct from each other,^[1] or that some consistent tie-breaking method has been used to assign an ordering to pairs of items with the same value as each other. Another variation in the problem definition concerns the numbering of the ordered values: is the smallest value obtained by setting $k=0$ , as in zero-based numbering of arrays, or is it obtained by setting $k=1$ , following the usual English-language conventions for the smallest, second-smallest, etc.? This article follows the conventions used by Cormen et al., according to which all values are distinct and the minimum value is obtained from $k=1$ .^[1]

With these conventions, the maximum value, among a collection of $n$ values, is obtained by setting $k=n$ . When $n$ is an odd number, the median of the collection is obtained by setting $k=(n+1)/2$ . When $n$ is even, there are two choices for the median, obtained by rounding this choice of $k$ down or up, respectively: the lower median with $k=n/2$ and the upper median with $k=n/2+1$ .^[1]

Algorithms

Sorting and heapselect

As a baseline algorithm, selection of the $k$ th smallest value in a collection of values can be performed very simply by the following two steps:

Sort the collection
If the output of the sorting algorithm is an array, jump to its $k$ th element; otherwise, scan the sorted sequence to find the $k$ th element.

The time for this method is dominated by the sorting step, which requires $\Theta (n\log n)$ time using a comparison sort.^[1] Even when integer sorting algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly-optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not.

For a sorting algorithm that generates one item at a time, such as selection sort, the scan can be done in tandem with the sort, and the sort can be terminated once the $k$ element has been found. One possible design of a consolation bracket in a single-elimination tournament, in which the teams who lost to the eventual winner play another mini-tournament to determine second place, can be seen as an instance of this method.^[2] Applying this optimization to heapsort produces the heapselect algorithm, which can select the $k$ th smallest value in time $O(n+k\log n)$ . This is fast when $k$ is small relative to $n$ , but degenerates to $O(n\log n)$ for larger values of $k$ , such as the choice $k=n/2$ used for median finding.

Decision trees

Pivoting

Many methods for selection are based on choosing a special "pivot" element from the input, and using comparisons with this element to divide the remaining $n-1$ input values into two subsets: the set $L$ of elements less than the pivot, and the set $R$ of elements greater than the pivot. The algorithm can then determine where the $k$ th smallest value is to be found, based on a comparison of $k$ with the sizes of these sets. In particular, if $k\leq |L|$ , the $k$ th smallest value is in $L$ , and can be found recursively by applying the same selection algorithm to $L$ . If $k=|L|+1$ , then the $k$ th smallest value is the pivot, and it can be returned immediately. In the remaining case, the $k$ th smallest value is in $R$ , and more specifically it is the element in position $k-|L|-1$ of $R$ . It can be found by applying a selection algorithm recursively, seeking the value in this position in $R$ .^[3]

As with the related pivoting-based quicksort algorithm, the partition of the input into $L$ and $R$ may be done by making new collections for these sets, or by a method that partitions a given list or array data type in-place. Details vary depending on how the input collection is represented.^[4] The time to compare the pivot against all the other values is $O(n)$ .^[3] However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot.

If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a geometric series to $O(n)$ . However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each call.^[3]
Quickselect chooses the pivot uniformly at random from the input values. It can be described as a variant of quicksort, with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections $L$ and $R$ , quickselect only makes one of these two calls. Its expected time is $O(n)$ .^[1]^[3]
The Floyd–Rivest algorithm, a variation of quickselect, chooses a pivot by randomly sampling a subset of $r$ data values, for some sample size $r$ , and then recursively selecting an element near position $rk/n$ of the sample to use as the pivot. This choice causes the pivot to be likely to be close in the sorted sequence to the eventual result of the overall selection algorithm, so that each pivoting step eliminates as many values as possible. With a careful choice of sample size, and with the index of the recursive selection call chosen either somewhat above or somewhat below $rk/n$ (to control which side of $k$ the pivot is likely to land on), this method can achieve an expected number of comparisons that is $n+\min(k,n-k)+O({\sqrt {n}})$ .^[5]
The median of medians method partitions the input into sets of five elements, and then uses some other method (rather than a recursive call) to find the median of each of these sets in constant time per set. It then recursively calls the same selection algorithm to find the median of these $n/5$ medians, using the result as its pivot. It can be shown that, for this choice of pivot, $\max(|L|,|R|)\leq 7n/10$ . Thus, a problem on $n$ elements is reduced to two recursive problems on $n/5$ and at most $7n/10$ elements. The total size of these two recursive subproblems is at most $9n/10$ , allowing the total time to be analyzed as a geometric series adding to $O(n)$ . Unlike quickselect, this algorithm is deterministic, not randomized.^[1]^[2] It was the first linear-time deterministic selection algorithm known,^[2] and is commonly taught in undergraduate algorithms classes as an example of a divide and conquer algorithm that does not divide into two equal subproblems. However, the high constant factors in its $O(n)$ time bound make it unsuitable for practical use.
Hybrid algorithms such as introselect can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case $O(n)$ time.^[6]

Sublinear data structures

When data is already organized into a data structure, it may be possible to perform selection in an amount of time that is sublinear in the number of values. As a simple case of this, for data already sorted into an array, selecting the $k$ element may be performed by a single array lookup, in constant time.

For data organized as a binary heap it is possible to perform selection in time $O(k)$ , independent of the size $n$ of the whole tree, and faster than the $O(k\log n)$ time bound that would be obtained from best-first search.^[7] This same method can be applied more generally to data organized as any kind of heap-ordered tree (a tree in which each node stores one value in which the parent of each non-root node has a smaller value than its child). This method of performing selection in a heap has been applied to problems of listing multiple solutions to combinatorial optimization problems, such as finding the $k$ shortest paths in a weighted graph, by defining a state space of solutions in the form of an implicitly defined heap-ordered tree, and then applying this selection algorithm to this tree.^[8]

For a collection of data values undergoing dynamic insertions and deletions, it is possible to augment a self-balancing binary search tree structure with a constant amount of additional information per tree node, allowing insertions, deletions, and selection queries that ask for the $k$ th element in the current set to all be performed in $O(\log n)$ time per operation.^[1]

Lower bounds

In The Art of Computer Programming, Donald E. Knuth discussed a number of lower bounds for the number of comparisons required to locate the t smallest entries of an unorganized list of n items (using only comparisons). There is a trivial lower bound of n − 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of n − 1 comparisons.

The story becomes more complex for other indexes. We define $W_{t}(n)$ as the minimum number of comparisons required to find the t smallest values. Knuth references a paper published by S. S. Kislitsyn, which shows an upper bound on this value:

W_{t}(n)\leq n-t+\sum _{n+1-t<j\leq n}\lceil {\log _{2}\,j}\rceil \quad {\text{for}}\,n\geq t

This bound is achievable for t=2 but better, more complex bounds are known for larger t.^{[citation needed]}

Language support

Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is C++, which provides a templated nth_element method with a guarantee of expected linear time, and also partitions the data, requiring that the nth element be sorted into its correct place, elements before the nth element are less than it, and elements after the nth element are greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data.^[9]^[10]

For Perl, the module Sort::Key::Top, available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the Statistics::CaseResampling module provides a function to calculate quantiles using Quickselect.

Python's standard library (since 2.4) includes heapq.nsmallest() and nlargest(), returning sorted lists.

Matlab includes maxk() and mink() functions, which return the maximal (minimal) k values in a vector as well as their indices.

History

Quickselect was presented without analysis by Tony Hoare in 1965,^[11] and first analyzed in a 1971 technical report by Donald Knuth.^[5] The first known linear time deterministic selection algorithm is the median of medians method, published in 1973 by Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ron Rivest, and Robert Tarjan. They trace the formulation of the selection problem to work of Lewis Carroll in 1883, who pointed out that the usual design of single-elimination sports tournaments does not guarantee that the second-best player wins second place, and to work of Hugo Steinhaus circa 1930, who followed up this same line of thought by asking for a tournament design that can make this guarantee, with a minimum number of games played (that is, comparisons).^[2]

References

^ ^a ^b ^c ^d ^e ^f ^g Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. "Chapter 9: Medians and order statistics". Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. pp. 213–227. ISBN 0-262-03384-4.; "Section 14.1: Dynamic order statistics", pp. 339–345
^ ^a ^b ^c ^d Blum, Manuel; Floyd, Robert W.; Pratt, Vaughan; Rivest, Ronald L.; Tarjan, Robert E. (1973). "Time bounds for selection" (PDF). Journal of Computer and System Sciences. 7: 448–461. doi:10.1016/S0022-0000(73)80033-9. MR 0329916.
^ ^a ^b ^c ^d Kleinberg, Jon; Tardos, Éva (2006). "13.5 Randomized divide and conquer: median-finding and quicksort". Algorithm Design. Addison-Wesley. pp. 727–734. ISBN 9780321295354.
^ For instance, Cormen et al. use an in-place array partition, while Kleinberg and Tardos describe the input as a set and use a method that partitions it into two new sets.
^ ^a ^b Floyd, Robert W.; Rivest, Ronald L. (March 1975). "Expected time bounds for selection". Communications of the ACM. 18 (3): 165–172. doi:10.1145/360680.360691. S2CID 3064709.
^ Musser, David R. (August 1997). "Introspective sorting and selection algorithms". Software: Practice and Experience. 27 (8). Wiley: 983–993. doi:10.1002/(sici)1097-024x(199708)27:8<983::aid-spe117>3.0.co;2-#.
^ Frederickson, Greg N. (1993). "An optimal algorithm for selection in a min-heap". Information and Computation. 104 (2): 197–214. doi:10.1006/inco.1993.1030. MR 1221889.
^ Eppstein, David (1999). "Finding the $k$ shortest paths". SIAM Journal on Computing. 28 (2): 652–673. doi:10.1137/S0097539795290477. MR 1634364.
^ Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)
^ nth_element, SGI STL
^ Hoare, C. A. R. (July 1961). "Algorithm 65: Find". Communications of the ACM. 4 (7): 321–322. doi:10.1145/366622.366647.