Selection algorithm: Difference between revisions

Content deleted Content added
remove <math> in intro without deleting random strings
Undid revision 1272462687 by 49.207.232.175 (talk) the values must have an ordering. They may not be presented in that ordering.
 
(224 intermediate revisions by 72 users not shown)
Line 1:
{{Short description|Method for finding kth smallest value}}
{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
{{For|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the ''k''th smallest number in a list (such a number is called the ''k''th ''[[order statistic]]''). This includes the cases of finding the [[minimum]], [[maximum]], and [[median]] elements. There are O(''n''), worst-case linear time, selection algorithms. Selection is a subproblem of more complex problems like the [[nearest neighbor problem]] and [[shortest path]] problems.
{{Good article}}
{{Use mdy dates|cs1-dates=ly|date=April 2023}}
{{Use list-defined references|date=April 2023}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the <math>k</math>th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the {{nowrap|<math>k</math>th}} [[order statistic]]. Selection includes as special cases the problems of finding the [[minimum]], [[median]], and [[maximum]] element in the collection. Selection algorithms include [[quickselect]], and the [[median of medians]] algorithm. When applied to a collection of <math>n</math> values, these algorithms take [[linear time]], <math>O(n)</math> as expressed using [[big O notation]]. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted [[Array (data structure)|array]] takes {{nowrap|time <math>O(1)</math>.}}
 
==Problem statement==
== Selection by sorting ==
An algorithm for the selection problem takes as input a collection of values, and a {{nowrap|number <math>k</math>.}} It outputs the {{nowrap|<math>k</math>th}} smallest of these values, or, in some versions of the problem, a collection of the <math>k</math> smallest values. For this to be well-defined, it should be possible to [[Sorting|sort]] the values into an order from smallest to largest; for instance, they may be [[Integer (computer science)|integers]], [[floating-point number]]s, or some other kind of [[Object (computer science)|object]] with a numeric key. However, they are not assumed to have been already sorted. Often, selection algorithms are restricted to a comparison-based [[model of computation]], as in [[comparison sort]] algorithms, where the algorithm has access to a [[Relational operator|comparison operation]] that can determine the relative ordering of any two values, but may not perform any other kind of [[Arithmetic|arithmetic operations]] on these values.{{r|cunmun}}
Selection can be [[Reduction (complexity)|reduced]] to [[sorting algorithm|sorting]] by sorting the list and then extracting the desired element. This method is efficient when many selections need to be made from a list, in which case only one initial, expensive sort is needed, followed by many cheap extraction operations. In general, this method requires O(''n'' log ''n'') time, where ''n'' is the length of the list (although a lower bound is possible with [[radix sort]]).
 
To simplify the problem, some works on this problem assume that the values are all distinct from each {{nowrap|other,{{r|clrs}}}} or that some consistent tie-breaking method has been used to assign an ordering to pairs of items with the same value as each other. Another variation in the problem definition concerns the numbering of the ordered values: is the smallest value obtained by {{nowrap|setting <math>k=0</math>,}} as in [[zero-based numbering]] of arrays, or is it obtained by {{nowrap|setting <math>k=1</math>,}} following the usual English-language conventions for the smallest, second-smallest, etc.? This article follows the conventions used by Cormen et al., according to which all values are distinct and the minimum value is obtained from {{nowrap|<math>k=1</math>.{{r|clrs}}}}
==Linear minimum/maximum algorithms==
Linear time algorithms to find minima or maxima work by iterating over the list and keeping track of the minimum or maximum element so far.
 
With these conventions, the maximum value, among a collection of <math>n</math> values, is obtained by {{nowrap|setting <math>k=n</math>.}} When <math>n</math> is an [[odd number]], the [[median]] of the collection is obtained by {{nowrap|setting <math>k=(n+1)/2</math>.}} When <math>n</math> is even, there are two choices for the median, obtained by rounding this choice of <math>k</math> down or up, respectively: the ''lower median'' with <math>k=n/2</math> and the ''upper median'' with {{nowrap|<math>k=n/2+1</math>.{{r|clrs}}}}
== Nonlinear general selection algorithm ==
Using the same ideas used in minimum/maximum algorithms, we can construct a simple, but inefficient general algorithm for finding the ''k''th smallest or ''k''th largest item in a list, requiring O(''kn'') time, which is effective when ''k'' is small. To accomplish this, we simply find the most extreme value and move it to the beginning until we reach our desired index. This can be seen as an incomplete [[selection sort]]. Here is the minimum-based algorithm:
 
==Algorithms==
'''function''' select(list[1..n], k)
===Sorting and heapselect===
'''for''' i '''from''' 1 '''to''' k
As a baseline algorithm, selection of the {{nowrap|<math>k</math>th}} smallest value in a collection of values can be performed by the following two steps:
minIndex = i
* [[Sorting|Sort]] the collection
minValue = list[i]
* If the output of the sorting algorithm is an [[Array (data type)|array]], retrieve its {{nowrap|<math>k</math>th}} element; otherwise, scan the sorted sequence to find the {{nowrap|<math>k</math>th}} element.
'''for''' j '''from''' i+1 '''to''' n
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a {{nowrap|[[comparison sort]].{{r|clrs|skiena}}}} Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly-optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not. For inputs of moderate size, sorting can be faster than non-random selection algorithms, because of the smaller constant factors in its running {{nowrap|time.{{r|erickson}}}} This method also produces a sorted version of the collection, which may be useful for other later computations, and in particular for selection with other choices {{nowrap|of <math>k</math>.{{r|skiena}}}}
'''if''' list[j] < minValue
minIndex = j
minValue = list[j]
swap list[i] and list[minIndex]
'''return''' list[k]
 
For a sorting algorithm that generates one item at a time, such as [[selection sort]], the scan can be done in tandem with the sort, and the sort can be terminated once the {{nowrap|<math>k</math>th}} element has been found. One possible design of a consolation bracket in a [[single-elimination tournament]], in which the teams who lost to the eventual winner play another mini-tournament to determine second place, can be seen as an instance of this {{nowrap|method.{{r|bfprt}}}} Applying this optimization to [[heapsort]] produces the [[heapselect]] algorithm, which can select the {{nowrap|<math>k</math>th}} smallest value in {{nowrap|time <math>O(n+k\log n)</math>.{{r|brodal}}}} This is fast when <math>k</math> is small relative {{nowrap|to <math>n</math>,}} but degenerates to <math>O(n\log n)</math> for larger values {{nowrap|of <math>k</math>,}} such as the choice <math>k=n/2</math> used for median finding.
Other advantages of this method are:
* After locating the ''j''th smallest element, it requires only O(''j'' + (''k''-''j'')<sup>2</sup>) time to find the ''k''th smallest element, or only O(1) for ''k'' ≤ ''j''.
* It can be done with [[linked list]] data structures, whereas the one based on partition requires [[random access]].
 
===Pivoting===
== Partition-based general selection algorithm ==
Many methods for selection are based on choosing a special "pivot" element from the input, and using comparisons with this element to divide the remaining <math>n-1</math> input values into two subsets: the set <math>L</math> of elements less than the pivot, and the set <math>R</math> of elements greater than the pivot. The algorithm can then determine where the {{nowrap|<math>k</math>th}} smallest value is to be found, based on a comparison of <math>k</math> with the sizes of these sets. In particular, {{nowrap|if <math>k\le|L|</math>,}} the {{nowrap|<math>k</math>th}} smallest value is {{nowrap|in <math>L</math>,}} and can be found recursively by applying the same selection algorithm {{nowrap|to <math>L</math>.}} {{nowrap|If <math>k=|L|+1</math>,}} then the {{nowrap|<math>k</math>th}} smallest value is the pivot, and it can be returned immediately. In the remaining case, the {{nowrap|<math>k</math>th}} smallest value is {{nowrap|in <math>R</math>,}} and more specifically it is the element in position <math>k-|L|-1</math> {{nowrap|of <math>R</math>.}} It can be found by applying a selection algorithm recursively, seeking the value in this position {{nowrap|in <math>R</math>.{{r|kletar}}}}
 
As with the related pivoting-based [[quicksort]] algorithm, the partition of the input into <math>L</math> and <math>R</math> may be done by making new collections for these sets, or by a method that partitions a given list or array data type in-place. Details vary depending on how the input collection is {{nowrap|represented.<ref>For instance, Cormen et al. use an in-place array partition, while Kleinberg and Tardos describe the input as a set and use a method that partitions it into two new sets.</ref>}} The time to compare the pivot against all the other values {{nowrap|is <math>O(n)</math>.{{r|kletar}}}} However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot. If the pivot is chosen badly, the running time of this method can be as slow {{nowrap|as <math>O(n^2)</math>.{{r|erickson}}}}
A general selection algorithm that is efficient in practice, but has poor worst-case performance, was conceived by the inventor of [[quicksort]], [[C.A.R. Hoare]], and is known as '''Hoare's selection algorithm''' or '''quickselect'''.
*If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a [[geometric series]] {{nowrap|to <math>O(n)</math>.}} However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each {{nowrap|call.{{r|kletar}}}}
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a [[prune and search]] algorithm,{{r|gootam}} a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> {{nowrap|and <math>R</math>,}} quickselect only makes one of these two calls. Its [[expected time]] {{nowrap|is <math>O(n)</math>.{{r|clrs|kletar|gootam}}}} For any constant <math>C</math>, the probability that its number of comparisons exceeds <math>Cn</math> is superexponentially small {{nowrap|in <math>C</math>.{{r|devroye}}}}
*The [[Floyd–Rivest algorithm]], a variation of quickselect, chooses a pivot by randomly sampling a subset of <math>r</math> data values, for some sample {{nowrap|size <math>r</math>,}} and then recursively selecting two elements somewhat above and below position <math>rk/n</math> of the sample to use as pivots. With this choice, it is likely that <math>k</math> is sandwiched between the two pivots, so that after pivoting only a small number of data values between the pivots are left for a recursive call. This method can achieve an expected number of comparisons that is {{nowrap|<math>n+\min(k,n-k)+o(n)</math>.{{r|floriv}}}} In their original work, Floyd and Rivest claimed that the <math>o(n)</math> term could be made as small as <math>O(\sqrt n)</math> by a recursive sampling scheme, but the correctness of their analysis has been {{nowrap|questioned.{{r|brown|prt}}}} Instead, more rigorous analysis has shown that a version of their algorithm achieves <math>O(\sqrt{n\log n})</math> for this {{nowrap|term.{{r|knuth}}}} Although the usual analysis of both quickselect and the Floyd–Rivest algorithm assumes the use of a [[true random number generator]], a version of the Floyd–Rivest algorithm using a [[pseudorandom number generator]] seeded with only logarithmically many true random bits has been proven to run in linear time with high probability.{{r|karrag}}
[[File:Mid-of-mid.png|thumb|upright=1.35|Visualization of pivot selection for the [[median of medians]] method. Each set of five elements is shown as a column of dots in the figure, sorted in increasing order from top to bottom. If their medians (the green and purple dots in the middle row) are sorted in increasing order from left to right, and the median of medians is chosen as the pivot, then the <math>3n/10</math> elements in the upper left quadrant will be less than the pivot, and the <math>3n/10</math> elements in the lower right quadrant will be greater than the pivot, showing that many elements will be eliminated by pivoting.]]
*The [[median of medians]] method partitions the input into sets of five elements, and uses some other non-recursive method to find the median of each of these sets in constant time per set. It then recursively calls itself to find the median of these <math>n/5</math> medians. Using the resulting median of medians as the pivot produces a partition with {{nowrap|<math>\max(|L|,|R|)\le 7n/10</math>.}} Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> elements (to find the pivot) and at most <math>7n/10</math> elements (after the pivot is used). The total size of these two recursive subproblems is at {{nowrap|most <math>9n/10</math>,}} allowing the total time to be analyzed as a geometric series adding {{nowrap|to <math>O(n)</math>.}} Unlike quickselect, this algorithm is deterministic, not {{nowrap|randomized.{{r|clrs|erickson|bfprt}}}} It was the first linear-time deterministic selection algorithm {{nowrap|known,{{r|bfprt}}}} and is commonly taught in undergraduate algorithms classes as an example of a [[Divide-and-conquer algorithm|divide and conquer]] that does not divide into two equal {{nowrap|subproblems.{{r|clrs|erickson|gootam|gurwitz}}}} However, the high constant factors in its <math>O(n)</math> time bound make it slower than quickselect in {{nowrap|practice,{{r|skiena|gootam}}}} and slower even than sorting for inputs of moderate {{nowrap|size.{{r|erickson}}}}
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case <math>O(n)</math> {{nowrap|time.{{r|musser}}}}
 
===Factories===
In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices <code>left</code> to <code>right</code>) into two parts, those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element <code>list[pivotIndex]</code>:
The deterministic selection algorithms with the smallest known numbers of comparisons, for values of <math>k</math> that are far from <math>1</math> {{nowrap|or <math>n</math>,}} are based on the concept of ''factories'', introduced in 1976 by [[Arnold Schönhage]], [[Mike Paterson]], and {{nowrap|[[Nick Pippenger]].{{r|spp}}}} These are methods that build [[partial order]]s of certain specified types, on small subsets of input values, by using comparisons to combine smaller partial orders. As a very simple example, one type of factory can take as input a sequence of single-element partial orders, compare pairs of elements from these orders, and produce as output a sequence of two-element totally ordered sets. The elements used as the inputs to this factory could either be input values that have not been compared with anything yet, or "waste" values produced by other factories. The goal of a factory-based algorithm is to combine together different factories, with the outputs of some factories going to the inputs of others, in order to eventually obtain a partial order in which one element (the {{nowrap|<math>k</math>th}} smallest) is larger than some <math>k-1</math> other elements and smaller than another <math>n-k</math> others. A careful design of these factories leads to an algorithm that, when applied to median-finding, uses at most <math>2.942n</math> comparisons. For other values {{nowrap|of <math>k</math>,}} the number of comparisons is {{nowrap|smaller.{{r|dz99}}}}
 
===Parallel algorithms===
'''function''' partition(list, left, right, pivotIndex)
[[Parallel algorithm]]s for selection have been studied since 1975, when [[Leslie Valiant]] introduced the parallel comparison tree model for analyzing these algorithms, and proved that in this model selection using a linear number of comparisons requires <math>\Omega(\log\log n)</math> parallel steps, even for selecting the minimum or {{nowrap|maximum.{{r|valiant}}}} Researchers later found parallel algorithms for selection in <math>O(\log\log n)</math> steps, matching this {{nowrap|bound.{{r|akss|azapip}}}} In a randomized parallel comparison tree model it is possible to perform selection in a bounded number of steps and a linear number of {{nowrap|comparisons.{{r|reischuk}}}} On the more realistic [[parallel RAM]] model of computing, with exclusive read exclusive write memory access, selection can be performed in time <math>O(\log n)</math> with <math>O(n/\log n)</math> processors, which is optimal both in time and in the number of {{nowrap|processors.{{r|han}}}} With concurrent memory access, slightly faster parallel time is possible in {{nowrap|general,{{r|chr}}}} and the <math>\log n</math> term in the time bound can be replaced {{nowrap|by <math>\log k</math>.{{r|dieram}}}}
pivotValue := list[pivotIndex]
swap list[pivotIndex] and list[right] ''// Move pivot to end''
storeIndex := left
'''for''' i '''from''' left '''to''' right-1
'''if''' list[i] < pivotValue
swap list[storeIndex] and list[i]
increment storeIndex
swap list[right] and list[storeIndex] ''// Move pivot to its final place''
'''return''' storeIndex
 
===Sublinear data structures===
In quicksort, we recursively sort both branches, leading to best-case [[Big-O notation|Ω]](''n'' log ''n'') time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all those preceding it in sorted order and all those following it in sorted order. Thus a single recursive call locates the desired element in the correct partition:
When data is already organized into a [[data structure]], it may be possible to perform selection in an amount of time that is sublinear in the number of values. As a simple case of this, for data already sorted into an array, selecting the {{nowrap|<math>k</math>th}} element may be performed by a single array lookup, in constant {{nowrap|time.{{r|frejoh}}}} For values organized into a two-dimensional array of {{nowrap|size <math>m\times n</math>,}} with sorted rows and columns, selection may be performed in time {{nowrap|<math>O\bigl(m\log(2n/m)\bigr)</math>,}} or faster when <math>k</math> is small relative to the array {{nowrap|dimensions.{{r|frejoh|kkzz}}}} For a collection of <math>m</math> one-dimensional sorted arrays, with <math>k_i</math> items less than the selected item in the {{nowrap|<math>i</math>th}} array, the time is {{nowrap|<math display=inline>O\bigl(m+\sum_{i=1}^m\log(k_i+1)\bigr)</math>.{{r|kkzz}}}}
 
Selection from data in a [[binary heap]] takes {{nowrap|time <math>O(k)</math>.}} This is independent of the size <math>n</math> of the heap, and faster than the <math>O(k\log n)</math> time bound that would be obtained from {{nowrap|[[best-first search]].{{r|kkzz|frederickson}}}} This same method can be applied more generally to data organized as any kind of heap-ordered tree (a tree in which each node stores one value in which the parent of each non-root node has a smaller value than its child). This method of performing selection in a heap has been applied to problems of listing multiple solutions to combinatorial optimization problems, such as finding the [[k shortest path routing|{{mvar|k}} shortest paths]] in a weighted graph, by defining a [[State space (computer science)|state space]] of solutions in the form of an [[implicit graph|implicitly defined]] heap-ordered tree, and then applying this selection algorithm to this {{nowrap|tree.{{r|kpaths}}}} In the other direction, linear time selection algorithms have been used as a subroutine in a [[priority queue]] data structure related to the heap, improving the time for extracting its {{nowrap|<math>k</math>th}} item from <math>O(\log n)</math> to {{nowrap|<math>O(\log^* n+\log k)</math>;}} here <math>\log^* n</math> is the {{nowrap|[[iterated logarithm]].{{r|bks}}}}
''// Returns the k-th smallest element of list within left..right inclusive.''
'''function''' select(list, left, right, k)
'''if''' left = right ''// If the list contains only one element''
'''return''' list[left] ''// Return that element''
pivotIndex := ... ''// select a pivotIndex between left and right''
pivotNewIndex := partition(list, left, right, pivotIndex)
pivotDist := pivotNewIndex - left + 1
''// The pivot is in its final sorted position,''
''// so pivotDist reflects its 1-based position if list were sorted''
'''if''' pivotDist = k
'''return''' list[pivotNewIndex]
'''else if''' k < pivotDist
'''return''' select(list, left, pivotNewIndex - 1, k)
'''else'''
'''return''' select(list, pivotNewIndex + 1, right, k - pivotDist)
 
For a collection of data values undergoing dynamic insertions and deletions, the [[order statistic tree]] augments a [[self-balancing binary search tree]] structure with a constant amount of additional information per tree node, allowing insertions, deletions, and selection queries that ask for the {{nowrap|<math>k</math>th}} element in the current set to all be performed in <math>O(\log n)</math> time per {{nowrap|operation.{{r|clrs}}}} Going beyond the comparison model of computation, faster times per operation are possible for values that are small integers, on which binary arithmetic operations are {{nowrap|allowed.{{r|pattho}}}} It is not possible for a [[streaming algorithms|streaming algorithm]] with memory sublinear in both <math>n</math> and <math>k</math> to solve selection queries exactly for dynamic data, but the [[count–min sketch]] can be used to solve selection queries approximately, by finding a value whose position in the ordering of the elements (if it were added to them) would be within <math>\varepsilon n</math> steps of <math>k</math>, for a sketch whose size is within logarithmic factors of <math>1/\varepsilon</math>.{{r|cormut}}
Note the resemblance to quicksort: just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log ''n'') of its O(''n'') partitions. This simple procedure has expected linear performance, and, like quicksort, has quite good performance in practice. It is also an [[in-place algorithm]], requiring only constant memory overhead, since the [[tail recursion]] can be eliminated with a loop like this:
 
== Lower bounds ==
'''function''' select(list, left, right, k)
The <math>O(n)</math> running time of the selection algorithms described above is necessary, because a selection algorithm that can handle inputs in an arbitrary order must take that much time to look at all of its inputs. If any one of its input values is not compared, that one value could be the one that should have been selected, and the algorithm can be made to produce an incorrect answer.{{r|kkzz}} Beyond this simple argument, there has been a significant amount of research on the exact number of comparisons needed for selection, both in the randomized and deterministic cases.
'''loop'''
pivotIndex := ... ''// select pivotIndex between left and right''
pivotNewIndex := partition(list, left, right, pivotIndex)
pivotDist := pivotNewIndex - left + 1
'''if''' pivotDist = k
'''return''' list[pivotNewIndex]
'''else if''' k < pivotDist
right := pivotNewIndex - 1
'''else'''
k := k - pivotDist
left := pivotNewIndex + 1
 
Selecting the minimum of <math>n</math> values requires <math>n-1</math> comparisons, because the <math>n-1</math> values that are not selected must each have been determined to be non-minimal, by being the largest in some comparison, and no two of these values can be largest in the same comparison. The same argument applies symmetrically to selecting the {{nowrap|maximum.{{r|knuth}}}}
Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously, and so can require as much as O(''n''<sup>2</sup>) time. David Musser describes a "median-of-3 killer" sequence that can force the well-known median-of-three pivot selection algorithm to fail with worst-case behavior (see ''[[#Introselect|Introselect]]'' section below).
 
The next simplest case is selecting the second-smallest. After several incorrect attempts, the first tight lower bound on this case was published in 1964 by Soviet mathematician [[Sergey Kislitsyn]]. It can be shown by observing that selecting the second-smallest also requires distinguishing the smallest value from the rest, and by considering the number <math>p</math> of comparisons involving the smallest value that an algorithm for this problem makes. Each of the <math>p</math> items that were compared to the smallest value is a candidate for second-smallest, and <math>p-1</math> of these values must be found larger than another value in a second comparison in order to rule them out as second-smallest.
== Linear general selection algorithm - Median of Medians algorithm ==
With <math>n-1</math> values being the larger in at least one comparison, and <math>p-1</math> values being the larger in at least two comparisons, there are a total of at least <math>n+p-2</math> comparisons. An [[Adversary model|adversary argument]], in which the outcome of each comparison is chosen in order to maximize <math>p</math> (subject to consistency with at least one possible ordering) rather than by the numerical values of the given items, shows that it is possible to force <math>p</math> to be {{nowrap|at least <math>\log_2 n</math>.}} Therefore, the worst-case number of comparisons needed to select the second smallest {{nowrap|is <math>n+\lceil\log_2 n\rceil-2</math>,}} the same number that would be obtained by holding a single-elimination tournament with a run-off tournament among the values that lost to the smallest value. However, the expected number of comparisons of a randomized selection algorithm can be better than this bound; for instance, selecting the second-smallest of six elements requires seven comparisons in the worst case, but may be done by a randomized algorithm with an expected number of {{nowrap|6.5 comparisons.{{r|knuth}}}}
{{Infobox Algorithm
|name=Median of Medians
|class='''Selection algorithm'''
|image=
|data=[[Array data structure|Array]]
|best-time= <math>O(n)</math>
|time=<math>O(n)</math>
|space=<math>O(1)</math> auxiliary
|optimal=Yes
}}
 
More generally, selecting the {{nowrap|<math>k</math>th}} element out of <math>n</math> requires at least <math>n+\min(k,n-k)-O(1)</math> comparisons, in the average case, matching the number of comparisons of the Floyd–Rivest algorithm up to its <math>o(n)</math> term. The argument is made directly for deterministic algorithms, with a number of comparisons that is averaged over all possible [[permutation]]s of the input {{nowrap|values.{{r|cunmun}}}} By [[Yao's principle]], it also applies to the expected number of comparisons for a randomized algorithm on its worst-case {{nowrap|input.{{r|chan}}}}
A worst-case linear algorithm for the general case of selecting the ''k''th largest element was published by [[Manuel Blum|Blum]], [[Robert Floyd|Floyd]], [[Vaughan Ronald Pratt|Pratt]], [[Ron Rivest|Rivest]] and [[Robert Tarjan|Tarjan]] in their 1973 paper "Time bounds for selection", sometimes called '''BFPRT''' after the last names of the authors. It is based on the quickselect algorithm and is also known as the '''median-of-medians algorithm'''.
 
For deterministic algorithms, it has been shown that selecting the {{nowrap|<math>k</math>th}} element requires <math>\bigl(1+H(k/n)\bigr)n+\Omega(\sqrt n)</math> comparisons, where <math display=block>H(x)=x\log_2\frac1x + (1-x)\log_2\frac1{1-x}</math> is the {{nowrap|[[binary entropy function]].{{r|benjoh}}}} The special case of median-finding has a slightly larger lower bound on the number of comparisons, {{nowrap|at least <math>(2+\varepsilon)n</math>,}} for {{nowrap|<math>\varepsilon\approx 2^{-80}</math>.{{r|dz01}}}}
Although quickselect is linear-time on average, it can require quadratic time with poor pivot choices (consider the case of pivoting around the largest element at each step). The solution to make it O(n) in the ''worst'' case is to consistently find "good" pivots. A good pivot is one for which we can establish that a constant proportion of elements fall both below and above it.
 
== Exact numbers of comparisons ==
The ''Select'' algorithm divides the list into groups of five elements. (Left over elements are ignored for now.) Then, for each group of five, the median is calculated (an operation that can potentially be made very fast if the five values can be loaded into [[Processor register|registers]] and compared). (If sorting in-place, then these medians are moved into one contiguous block in the list.) ''Select'' is then called recursively on this sublist of ''n''/5 elements to find their true median. Finally, the "median of medians" is chosen to be the pivot.
[[File:Median of 5.svg|thumb|upright=0.9|Finding the median of five values using six comparisons. Each step shows the comparisons to be performed next as yellow line segments, and a [[Hasse diagram]] of the order relations found so far (with smaller=lower and larger=higher) as blue line segments. The red elements have already been found to be greater than three others and so cannot be the median. The larger of the two elements in the final comparison is the median.]]
[[Donald Knuth|Knuth]] supplies the following triangle of numbers summarizing pairs of <math>n</math> and <math>k</math> for which the exact number of comparisons needed by an optimal selection algorithm is known. The {{nowrap|<math>n</math>th}} row of the triangle (starting with <math>n=1</math> in the top row) gives the numbers of comparisons for inputs of <math>n</math> values, and the {{nowrap|<math>k</math>th}} number within each row gives the number of comparisons needed to select the {{nowrap|<math>k</math>th}} smallest value from an input of that size. The rows are symmetric because selecting the {{nowrap|<math>k</math>th}} smallest requires exactly the same number of comparisons, in the worst case, as selecting the {{nowrap|<math>k</math>th}} {{nowrap|largest.{{r|knuth}}}}
 
{{center|0}}
''//returns the index of the median of medians.''
{{center|1 &nbsp;&nbsp; 1}}
''//requires a variant of select, "selectIdx"''
{{center|2 &nbsp;&nbsp; 3 &nbsp;&nbsp; 2}}
''//which returns the index of the selected item rather than the value''
{{center|3 &nbsp;&nbsp; 4 &nbsp;&nbsp; 4 &nbsp;&nbsp; 3}}
'''function''' medianOfMedians(list, left, right)
{{center|4 &nbsp;&nbsp; 6 &nbsp;&nbsp; 6 &nbsp;&nbsp; 6 &nbsp;&nbsp; 4}}
numMedians = (right - left) / 5
{{center|5 &nbsp;&nbsp; 7 &nbsp;&nbsp; 8 &nbsp;&nbsp; 8 &nbsp;&nbsp; 7 &nbsp;&nbsp; 5}}
'''for''' i '''from''' 0 '''to''' numMedians
{{center|6 &nbsp;&nbsp; 8 &nbsp; 10 &nbsp; 10 &nbsp; 10 &nbsp; 8 &nbsp;&nbsp; 6}}
''//get the median of the five-element subgroup''
{{center|7 &nbsp;&nbsp; 9 &nbsp; 11 &nbsp; 12 &nbsp; 12 &nbsp; 11 &nbsp; 9 &nbsp;&nbsp; 7}}
subLeft := left + i*5
{{center|8 &nbsp; 11 &nbsp; 12 &nbsp; 14 &nbsp; 14 &nbsp; 14 &nbsp; 12 &nbsp; 11 &nbsp; 8}}
subRight := subLeft + 4
{{center|9 &nbsp; 12 &nbsp; 14 &nbsp; 15 &nbsp; 16 &nbsp; 16 &nbsp; 15 &nbsp; 14 &nbsp; 12 &nbsp; 9}}
if (subRight > right) subRight := right
''//alternatively, use a faster method that works on lists of size 5''
medianIdx := selectIdx(list, subLeft, subRight, (subRight - subLeft) / 2)
''//move the median to a contiguous block at the beginning of the list''
swap list[left+i] and list[medianIdx]
''//select the median from the contiguous block''
'''return''' selectIdx(list, left, left + numMedians, numMedians / 2)
 
Most, but not all, of the entries on the left half of each row can be found using the formula <math display=block>n-k+(k-1)\bigl\lceil\log_2(n+2-k)\bigr\rceil.</math> This describes the number of comparisons made by a method of Abdollah Hadian and [[Milton Sobel]], related to heapselect, that finds the smallest value using a single-elimination tournament and then repeatedly uses a smaller tournament among the values eliminated by the eventual tournament winners to find the next successive values until reaching the {{nowrap|<math>k</math>th}} smallest.{{r|knuth|hadsob}} Some of the larger entries were proven to be optimal using a computer search.{{r|knuth|gkp}}
=== Properties of pivot ===
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around <math>n/10</math> elements <math>(1/2 * (n/5))</math> for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than <math>3(n/10)</math> elements outside the block, and greater than another <math>3(n/10)</math> elements inside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:
 
== Language support ==
{|class="wikitable" border=1
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is the [[Standard Template Library]] for [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time.{{r|skiena}}
|+ One iteration on the list {0,1,2,3,...99}
!
| bgcolor=gray|12||||bgcolor=gray|15||||bgcolor=gray|11||||bgcolor=gray|2||||bgcolor=gray|9||||bgcolor=gray|5||||bgcolor=gray|0||||bgcolor=gray|7||||bgcolor=gray|3||||bgcolor=gray|21||||bgcolor=gray|44||||bgcolor=gray|40||||bgcolor=gray|1||||bgcolor=gray|18||||bgcolor=gray|20||||bgcolor=gray|32||||bgcolor=gray|19||||bgcolor=gray|35||||bgcolor=gray|37||||bgcolor=gray|39
|-
!
| bgcolor=gray|13||||bgcolor=gray|16||||bgcolor=gray|14||||bgcolor=gray|8||||bgcolor=gray|10||||bgcolor=gray|26||||bgcolor=gray|6||||bgcolor=gray|33||||bgcolor=gray|4||||bgcolor=gray|27||||bgcolor=white|49||||bgcolor=gray|46||||bgcolor=white|52||||bgcolor=gray|25||||bgcolor=white|51||||bgcolor=gray|34||||bgcolor=gray|43||||bgcolor=white|56||||bgcolor=white|72||||bgcolor=white|79
|-
! Medians
| bgcolor=gray|17||||bgcolor=gray|23||||bgcolor=gray|24||||bgcolor=gray|28||||bgcolor=gray|29||||bgcolor=gray|30||||bgcolor=gray|31||||bgcolor=gray|36||||bgcolor=gray|42||||bgcolor=red|47||||bgcolor=white|50||||bgcolor=white|55||||bgcolor=white|58||||bgcolor=white|60||||bgcolor=white|63||||bgcolor=white|65||||bgcolor=white|66||||bgcolor=white|67||||bgcolor=white|81||||bgcolor=white|83
|-
!
| bgcolor=gray|22||||bgcolor=gray|45||||bgcolor=gray|38||||bgcolor=white|53||||bgcolor=white|61||||bgcolor=gray|41||||bgcolor=white|62||||bgcolor=white|82||||bgcolor=white|54||||bgcolor=white|48||||bgcolor=white|59||||bgcolor=white|57||||bgcolor=white|71||||bgcolor=white|78||||bgcolor=white|64||||bgcolor=white|80||||bgcolor=white|70||||bgcolor=white|76||||bgcolor=white|85||||bgcolor=white|87
|-
!
| bgcolor=white|96||||bgcolor=white|95||||bgcolor=white|94||||bgcolor=white|86||||bgcolor=white|89||||bgcolor=white|69||||bgcolor=white|68||||bgcolor=white|97||||bgcolor=white|73||||bgcolor=white|92||||bgcolor=white|74||||bgcolor=white|88||||bgcolor=white|99||||bgcolor=white|84||||bgcolor=white|75||||bgcolor=white|90||||bgcolor=white|77||||bgcolor=white|93||||bgcolor=white|98||||bgcolor=white|91
|-
|}
 
[[Python (programming language)|Python]]'s standard library includes <code>heapq.nsmallest</code> and <code>heapq.nlargest</code> functions for returning the smallest or largest elements from a collection, in sorted order. The implementation maintains a [[binary heap]], limited to holding <math>k</math> elements, and initialized to the first <math>k</math> elements in the collection. Then, each subsequent item of the collection may replace the largest or smallest element in the heap if it is smaller or larger than this element. The algorithm's memory usage is superior to heapselect (the former only holds <math>k</math> elements in memory at a time while the latter requires manipulating the entire dataset into memory). Running time depends on data ordering. The best case is <math>O((n - k) + k\log k)</math> for already sorted data. The worst-case is <math>O(n\log k)</math> for reverse sorted data. In average cases, there are likely to be few heap updates and most input elements are processed with only a single comparison. For example, extracting the 100 largest or smallest values out of 10,000,000 random inputs makes 10,009,401 comparisons on average.{{r|python}}
(red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red")
 
Since 2017, [[Matlab]] has included <code>maxk()</code> and <code>mink()</code> functions, which return the maximal (minimal) <math>k</math> values in a vector as well as their indices. The Matlab documentation does not specify which algorithm these functions use or what their running {{nowrap|time is.{{r|matlab}}}}
5-tuples are shown here sorted by median, for clarity. Sorting the tuples is not necessary because we only need the median for use as pivot element.
 
== History==
Note that all elements above/left of the red (30% of the 100 elements) are less, and all elements below/right of the red (another 30% of the 100 elements) are greater.
[[Quickselect]] was presented without analysis by [[Tony Hoare]] {{nowrap|in 1965,{{r|hoare}}}} and first analyzed in a 1971 technical report by {{nowrap|[[Donald Knuth]].{{r|floriv}}}} The first known linear time deterministic selection algorithm is the [[median of medians]] method, published in 1973 by [[Manuel Blum]], [[Robert W. Floyd]], [[Vaughan Pratt]], [[Ron Rivest]], and [[Robert Tarjan]].{{r|bfprt}} They trace the formulation of the selection problem to work of Charles L. Dodgson (better known as [[Lewis Carroll]]) who in 1883 pointed out that the usual design of single-elimination sports tournaments does not guarantee that the second-best player wins second place,{{r|bfprt|carroll}} and to work of [[Hugo Steinhaus]] circa 1930, who followed up this same line of thought by asking for a tournament design that can make this guarantee, with a minimum number of games played (that is, {{nowrap|comparisons).{{r|bfprt}}}}
 
== See also ==
=== Proof of O(n) running time ===
* {{slink|Geometric median|Computation}}, algorithms for higher-dimensional generalizations of medians
* [[Median filter]], application of median-finding algorithms in image processing
 
== References ==
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
{{reflist|refs=
:<math>T(n) \leq T(n \cdot 2/10) + T(n \cdot 7/10) + O(n).</math>
 
<ref name=akss>{{cite journal
The O(''n'') is for the partitioning work (we visited each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
| last1 = Ajtai | first1 = Miklós | author1-link = Miklós Ajtai
From this, one can then show that
| last2 = Komlós | first2 = János | author2-link = János Komlós (mathematician)
:<math>T(n) \leq c \cdot n \cdot (1 + (9/10) + (9/10)^2 + \cdots) \in O(n).</math>
| last3 = Steiger | first3 = W. L.
| last4 = Szemerédi | first4 = Endre | author4-link = Endre Szemerédi
| doi = 10.1016/0022-0000(89)90035-4
| issue = 1
| journal = [[Journal of Computer and System Sciences]]
| mr = 990052
| pages = 125–133
| title = Optimal parallel selection has complexity <math>O(\log\log n)</math>
| volume = 38
| year = 1989}}</ref>
 
<ref name=azapip>{{cite journal
=== Important notes ===
| last1 = Azar | first1 = Yossi
| last2 = Pippenger | first2 = Nicholas | author2-link = Nick Pippenger
| doi = 10.1016/0166-218X(90)90128-Y
| issue = 1–2
| journal = [[Discrete Applied Mathematics]]
| mr = 1055590
| pages = 49–58
| title = Parallel selection
| volume = 27
| year = 1990}}</ref>
 
<ref name=benjoh>{{cite conference
Although this approach optimizes quite well, it is typically outperformed in practice by the expected linear algorithm with random pivot choices{{Citation needed|date=December 2009}}.
| last1 = Bent | first1 = Samuel W.
| last2 = John | first2 = John W.
| editor-last = Sedgewick | editor-first = Robert
| contribution = Finding the median requires <math>2n</math> comparisons
| doi = 10.1145/22145.22169
| pages = 213–216
| publisher = Association for Computing Machinery
| title = Proceedings of the 17th Annual ACM Symposium on Theory of Computing, May 6–8, 1985, Providence, Rhode Island, USA
| year = 1985| doi-access = free
| isbn = 0-89791-151-2
}}</ref>
 
<ref name=bfprt>{{cite journal
The median-of-medians algorithm can be used to construct a worst-case O(''n''&nbsp;log&nbsp;''n'') [[quicksort]] algorithm, by using it to find the median at every step.
| last1 = Blum | first1 = Manuel | author1-link = Manuel Blum
| last2 = Floyd | first2 = Robert W. | author2-link = Robert W. Floyd
| last3 = Pratt | first3 = Vaughan | author3-link = Vaughan Pratt
| last4 = Rivest | first4 = Ronald L. | author4-link = Ron Rivest
| last5 = Tarjan | first5 = Robert E. | author5-link = Robert Tarjan
| doi = 10.1016/S0022-0000(73)80033-9 | doi-access = free
| journal = [[Journal of Computer and System Sciences]]
| mr = 329916
| pages = 448–461
| title = Time bounds for selection
| url = http://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf
| volume = 7
| year = 1973| issue = 4 }}</ref>
 
<ref name=bks>{{cite journal
== Introselect ==
| last1 = Babenko | first1 = Maxim
| last2 = Kolesnichenko | first2 = Ignat
| last3 = Smirnov | first3 = Ivan
| doi = 10.1007/s00224-018-9866-1
| issue = 4
| journal = Theory of Computing Systems
| mr = 3942251
| pages = 637–646
| title = Cascade heap: towards time-optimal extractions
| volume = 63
| year = 2019| s2cid = 253740380
}}</ref>
 
<ref name=brodal>{{cite conference
[[David Musser]]'s well-known [[introsort]] achieves practical performance comparable to quicksort while preserving ''O''(''n'' log ''n'') worst-case behavior by creating a hybrid of quicksort and [[heapsort]]. In the same paper, Musser introduced an "introspective selection" algorithm, popularly called '''introselect''', which combines Hoare's algorithm with the worst-case linear algorithm described above to achieve worst-case linear selection with performance similar to Hoare's algorithm.<ref>David R. Musser. Introspective Sorting and Selection Algorithms. ''Software: Practice and Experience'', vol. 27, no. 8, pp.983&ndash;993. 1997. Section: Introspective Selection Algorithms.</ref> It works by optimistically starting out with Hoare's algorithm and only switching to the worst-time linear algorithm if it recurses too many times without making sufficient progress. Simply limiting the recursion to constant depth is not good enough, since this would make the algorithm switch on all sufficiently large lists. Musser discusses a couple of simple approaches:
| last = Brodal | first = Gerth Stølting | author-link = Gerth Stølting Brodal
* Keep track of the list of sizes of the subpartitions processed so far. If at any point ''k'' recursive calls have been made without halving the list size, for some small positive ''k'', switch to the worst-case linear algorithm.
| editor1-last = Brodnik | editor1-first = Andrej
* Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant ''k'', switch to the worst-case linear algorithm. This sum is easy to track in a single scalar variable.
| editor2-last = López-Ortiz | editor2-first = Alejandro
| editor3-last = Raman | editor3-first = Venkatesh
| editor4-last = Viola | editor4-first = Alfredo
| contribution = A survey on priority queues
| doi = 10.1007/978-3-642-40273-9_11
| pages = 150–163
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Space-Efficient Data Structures, Streams, and Algorithms – Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday
| volume = 8066
| year = 2013| isbn = 978-3-642-40272-2 }}</ref>
 
<ref name=brown>{{cite journal
Both approaches limit the recursion depth to ''k'' ⌈log ''n''⌉ = ''O''(log ''n'') and the total running time to ''O''(''n)''. The paper suggested that more research on introselect was forthcoming, but the author retired in 2007 without having published any such further research.
| last = Brown | first = Theodore
| date = September 1976
| doi = 10.1145/355694.355704
| issue = 3
| journal = ACM Transactions on Mathematical Software
| pages = 301–304
| title = Remark on Algorithm 489
| volume = 2| s2cid = 13985011
}}</ref>
 
<ref name=carroll>{{cite book|title=Lawn Tennis Tournaments: The True Method of Assigning Prizes with a Proof of the Fallacy of the Present Method|first=Charles L.|last=Dodgson|author-link=Lewis Carroll|year=1883|___location=London|publisher=Macmillan and Co.}} See also {{cite book|title=The Mathematical World of Charles L. Dodgson (Lewis Carroll)|editor1-first=Robin|editor1-last=Wilson|editor2-first=Amirouche|editor2-last=Moktefi|publisher=Oxford University Press|year=2019|isbn=9780192549013|contribution=Lawn tennis tournaments|page=129|contribution-url=https://books.google.com/books?id=OBGIDwAAQBAJ&pg=PA129}}</ref>
== Selection as incremental sorting ==
One of the advantages of the sort-and-index approach, as mentioned, is its ability to [[amortized analysis|amortize]] the sorting cost over many subsequent selections. However, sometimes the number of selections that will be done is not known in advance, and may be either small or large. In these cases, we can adapt the algorithms given above to simultaneously select an element while [[Partial sorting|partially sorting]] the list, thus accelerating future selections.
 
<ref name=chan>{{cite journal
Both the selection procedure based on minimum-finding and the one based on partitioning can be seen as a form of partial sort. The minimum-based algorithm sorts the list up to the given index, and so clearly speeds up future selections, especially of smaller indexes. The partition-based algorithm does not achieve the same behaviour automatically, but can be adapted to remember its previous pivot choices and reuse them wherever possible, avoiding costly partition operations, particularly the top-level one. The list becomes gradually more sorted as more partition operations are done incrementally; no pivots are ever "lost". If desired, this same pivot list could be passed on to quicksort to reuse, again avoiding many costly partition operations.
| last = Chan | first = Timothy M. | author-link = Timothy M. Chan
| doi = 10.1145/1721837.1721842
| issue = 2
| journal = [[ACM Transactions on Algorithms]]
| mr = 2675693
| page = A26:1–A26:16
| title = Comparison-based time-space lower bounds for selection
| volume = 6
| year = 2010| s2cid = 11742607 }}</ref>
 
<ref name=chr>{{cite conference
== Using data structures to select in sublinear time ==
| last1 = Chaudhuri | first1 = Shiva
Given an unorganized list of data, linear time (Ω(''n'')) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the ''k''th largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
| last2 = Hagerup | first2 = Torben
| last3 = Raman | first3 = Rajeev
| editor1-last = Borzyszkowski | editor1-first = Andrzej M.
| editor2-last = Sokolowski | editor2-first = Stefan
| contribution = Approximate and exact deterministic parallel selection
| doi = 10.1007/3-540-57182-5_27
| pages = 352–361
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Mathematical Foundations of Computer Science 1993, 18th International Symposium, MFCS'93, Gdansk, Poland, August 30 – September 3, 1993, Proceedings
| volume = 711
| year = 1993| hdl = 11858/00-001M-0000-0014-B748-C
| isbn = 978-3-540-57182-7
| hdl-access = free
}}</ref>
 
<ref name=clrs>{{Introduction to Algorithms|edition=3|chapter=Chapter 9: Medians and order statistics|pages=213–227}}; "Section 14.1: Dynamic order statistics", pp. 339–345</ref>
The strategy to find an order statistic in [[sublinear time]] is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables.
 
<ref name=cormut>{{cite journal
When only the minimum (or maximum) is needed, a good approach is to use a [[Heap_(data_structure)|heap]], which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log ''n'') or better. More generally, a [[self-balancing binary search tree]] can easily be [[Order statistic tree|augmented]] to make it possible to both insert an element and find the ''k''th largest element in O(log ''n'') time. We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log ''n'') ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
| last1 = Cormode | first1 = Graham
| last2 = Muthukrishnan | first2 = S. | author2-link = S. Muthukrishnan (computer scientist)
| doi = 10.1016/j.jalgor.2003.12.001
| issue = 1
| journal = Journal of Algorithms
| mr = 2132028
| pages = 58–75
| title = An improved data stream summary: the count-min sketch and its applications
| volume = 55
| year = 2005}}</ref>
 
<ref name=cunmun>{{cite journal
Another simple strategy is based on some of the same concepts as the [[hash table]]. When we know the range of values beforehand, we can divide that range into ''h'' subintervals and assign these to ''h'' buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the ''k''th element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket.
| last1 = Cunto | first1 = Walter
| last2 = Munro | first2 = J. Ian | author2-link = Ian Munro (computer scientist)
| doi = 10.1145/62044.62047
| issue = 2
| journal = [[Journal of the ACM]]
| mr = 1072421
| pages = 270–279
| title = Average case selection
| volume = 36
| year = 1989| s2cid = 10947879
| doi-access = free
}}</ref>
 
<ref name=devroye>{{cite journal
If we choose ''h'' of size roughly sqrt(''n''), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(''n'')) time. Unfortunately, this strategy is also sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the ''k''th largest hash value isn't very useful). Additionally, like hash tables this structure requires table resizings to maintain efficiency as elements are added and ''n'' becomes much larger than ''h''<sup>2</sup>. A useful case of this is finding an order statistic or extremum in a finite range of data. Using above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like [[frequency tables]] used to classify the data in [[descriptive statistics]].
| last = Devroye | first = Luc
| doi = 10.1016/0022-0000(84)90009-6
| issue = 1
| journal = Journal of Computer and System Sciences
| mr = 761047
| pages = 1–7
| title = Exponential bounds for the running time of a selection algorithm
| url = http://luc.devroye.org/devroye-selection1984.pdf
| volume = 29
| year = 1984}} {{cite journal
| last = Devroye | first = Luc
| doi = 10.1007/s00453-001-0046-2
| issue = 3
| journal = Algorithmica
| mr = 1855252
| pages = 291–303
| title = On the probabilistic worst-case time of 'find'
| url = https://luc.devroye.org/wcfind.pdf
| volume = 31
| year = 2001| s2cid = 674040
}}</ref>
 
<ref name=dieram>{{cite journal
== Selecting k smallest or largest elements ==
| last1 = Dietz | first1 = Paul F.
{{main|Partial sorting}}
| last2 = Raman | first2 = Rajeev
Another fundamental selection problem is that of selecting the ''k'' smallest or ''k'' largest elements, which is particularly useful where we want to present just the "top ''k''" of an unsorted list, such as the top 100 corporations by gross sales. This is also commonly called [[partial sorting]].
| doi = 10.1006/jagm.1998.0971
| issue = 1
| journal = [[Journal of Algorithms]]
| mr = 1661179
| pages = 33–51
| title = Small-rank selection in parallel, with applications to heap construction
| volume = 30
| year = 1999}}</ref>
 
<ref name=dz99>{{cite journal
== Lower bounds ==
| last1 = Dor | first1 = Dorit | author1-link = Dorit Dor
In ''[[The Art of Computer Programming]]'', Donald E. Knuth discussed a number of lower bounds for the number of comparisons required to locate the ''t'' smallest entries of an unorganized list of ''n'' items (using only comparisons). There is a trivial lower bound of ''n'' &minus; 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of ''n'' &minus; 1 comparisons.
| last2 = Zwick | first2 = Uri | author2-link = Uri Zwick
| doi = 10.1137/S0097539795288611
| issue = 5
| journal = [[SIAM Journal on Computing]]
| mr = 1694164
| pages = 1722–1758
| title = Selecting the median
| volume = 28
| year = 1999| s2cid = 2633282
}}</ref>
 
<ref name=dz01>{{cite journal
The story becomes more complex for other indexes. We define <math>W_{t}(n)</math> as the minimum number of comparisons required to find the ''t'' smallest values. Knuth references a paper published by S. S. Kislitsyn, which shows an upper bound on this value:
| last1 = Dor | first1 = Dorit | author1-link = Dorit Dor
| last2 = Zwick | first2 = Uri | author2-link = Uri Zwick
| doi = 10.1137/S0895480199353895
| issue = 3
| journal = [[SIAM Journal on Discrete Mathematics]]
| mr = 1857348
| pages = 312–325
| title = Median selection requires <math>(2+\varepsilon)N</math> comparisons
| volume = 14
| year = 2001}}</ref>
 
<ref name=erickson>{{cite book|title=Algorithms|url=https://jeffe.cs.illinois.edu/teaching/algorithms/|first=Jeff|last=Erickson|date=June 2019|chapter=1.8: Linear-time selection|pages=35–39}}</ref>
:<math>W_{t}(n) \leq n - t + \sum_{n+1-t < j \leq n} \lceil{\log_2\, j}\rceil \quad \text{for}\, n \geq t</math>
 
<ref name=floriv>{{cite journal
This bound is achievable for ''t''=2 but better, more complex bounds are known for larger ''t''.
| last1 = Floyd | first1 = Robert W. | author1-link = Robert W. Floyd
| last2 = Rivest | first2 = Ronald L. | author2-link = Ron Rivest
| date = March 1975
| doi = 10.1145/360680.360691
| issue = 3
| journal = [[Communications of the ACM]]
| pages = 165–172
| s2cid = 3064709
| title = Expected time bounds for selection
| volume = 18| doi-access = free
}} See also "Algorithm 489: the algorithm SELECT—for finding the {{nowrap|<math>i</math>th}} smallest of <math>n</math> elements", p. 173, {{doi|10.1145/360680.360694}}.</ref>
 
<ref name=frederickson>{{cite journal
== Language support ==
| last = Frederickson | first = Greg N.
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time. It is implied but not required that it is based on Hoare's algorithm by its requirement of expected linear time. (Ref section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E), see also [http://www.sgi.com/tech/stl/nth_element.html SGI STL description of nth_element])
| doi = 10.1006/inco.1993.1030
| issue = 2
| journal = [[Information and Computation]]
| mr = 1221889
| pages = 197–214
| title = An optimal algorithm for selection in a min-heap
| volume = 104
| year = 1993| doi-access = free
}}</ref>
 
<ref name=frejoh>{{cite journal
C++ also provides the [http://www.sgi.com/tech/stl/partial_sort.html partial_sort] algorithm, which solves the problem of selecting the smallest ''k'' elements (sorted), with a time complexity of O(''n'' log ''k''). No algorithm is provided for selecting the greatest ''k'' elements since this should be done by inverting the ordering [[Predicate (computer programming)|predicate]].
| last1 = Frederickson | first1 = Greg N.
| last2 = Johnson | first2 = Donald B. | author2-link = Donald B. Johnson
| doi = 10.1137/0213002
| issue = 1
| journal = [[SIAM Journal on Computing]]
| mr = 731024
| pages = 14–30
| title = Generalized selection and ranking: sorted matrices
| volume = 13
| year = 1984}}</ref>
 
<ref name=gkp>{{cite journal
For [[Perl]], the module [http://search.cpan.org/dist/Sort-Key-Top Sort::Key::Top], available from [[CPAN]], provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the [http://search.cpan.org/dist/Statistics-CaseResampling Statistics::CaseResampling] module provides a function to calculate quantiles using quickselect.
| last1 = Gasarch | first1 = William | author1-link = William Gasarch
| last2 = Kelly | first2 = Wayne
| last3 = Pugh | first3 = William
| date = July 1996
| doi = 10.1145/235767.235772
| issue = 2
| journal = [[ACM SIGACT News]]
| pages = 88–96
| title = Finding the {{nowrap|<math>i</math>th}} largest of <math>n</math> for small <math>i,n</math>
| volume = 27| s2cid = 3133332 }}</ref>
 
<ref name=gootam>{{cite book|title=Algorithm Design and Applications|first1=Michael T.|last1=Goodrich|author1-link= Michael T. Goodrich|first2=Roberto|last2=Tamassia|author2-link=Roberto Tamassia|publisher=Wiley|year=2015|chapter=9.2: Selection|pages=270–275|isbn=978-1-118-33591-8}}</ref>
[[Python (programming language)|Python]]'s standard library (since 2.4) includes <code>[http://docs.python.org/library/heapq.html heapq].nsmallest()</code> and <code>nlargest()</code>, returning sorted lists, the former in O(''n'' + ''k'' log ''n'') time, the latter in O(''n'' log ''k'') time.
 
<ref name=gurwitz>{{cite journal
Because [[sorting algorithm#Language support|language support for sorting]] is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed for [[Lazy evaluation|lazy languages]], this simplistic approach can even achieve the best complexity possible for the ''k'' smallest/greatest sorted (with maximum/minimum as a special case) if the sort is lazy enough.
| last = Gurwitz | first = Chaya
| doi = 10.1109/13.144650
| issue = 3
| journal = IEEE Transactions on Education
| pages = 230–232
| title = On teaching median-finding algorithms
| volume = 35
| year = 1992| bibcode = 1992ITEdu..35..230G
}}</ref>
 
<ref name=hadsob>{{cite report|first1=Abdollah|last1=Hadian|first2=Milton|last2=Sobel|author2-link=Milton Sobel|hdl=11299/199105|series=School of Statistics Technical Reports|volume=121|publisher=University of Minnesota|title=Selecting the {{nowrap|<math>t</math>-th}} largest using binary errorless comparisons|date=May 1969}}</ref>
==Online selection algorithm==
In certain selection problems, selection must be online, that is, an element can only be selected from a sequential input at the instance of observation and each selection, respectively refusal, is irrevocable. The problem is to select, under these constraints, a
specific element of the input sequence (as for example the largest or the smallest value)
with largest probability. This problem can be tackled by the [[Odds algorithm]] designed by [[F. Thomas Bruss]] who coined the name Odds algorithm. It is also known as Bruss-algorithm or Bruss-strategy. This algorithm yields the optimal under an independence condition; it is also optimal itself as an algorithm with the number of computations being linear in the length of input.
 
<ref name=han>{{cite journal
== Notes ==
| last = Han | first = Yijie
<references />
| doi = 10.1145/1290672.1290675
| issue = 4
| journal = [[ACM Transactions on Algorithms]]
| mr = 2364962
| page = A38:1–A38:11
| title = Optimal parallel selection
| volume = 3
| year = 2007| s2cid = 9645870
}}</ref>
 
<ref name=hoare>{{cite journal
== References ==
| last = Hoare | first = C. A. R. | author-link = Tony Hoare
* [[Manuel Blum|M. Blum]], [[Robert Floyd|R.W. Floyd]], [[Vaughan Ronald Pratt|V. Pratt]], [[Ron Rivest|R. Rivest]] and [[Robert Tarjan|R. Tarjan]], "Time bounds for selection," ''J. Comput. System Sci''. 7 (1973) 448-461.
| date = July 1961
* K. C. Kiwiel. On Floyd and Rivest’s SELECT Algorithm, ''Theoretical Computer Sci.'' 347 (2005) 214-238.
| doi = 10.1145/366622.366647
* [[Donald Knuth]]. ''[[The Art of Computer Programming]]'', Volume 3: ''Sorting and Searching'', Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Section 5.3.3: Minimum-Comparison Selection, pp.207&ndash;219.
| issue = 7
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 9: Medians and Order Statistics, pp.183&ndash;196. Section 14.1: Dynamic order statistics, pp.302&ndash;308.
| journal = [[Communications of the ACM]]
* {{DADS|Select|select}}
| pages = 321–322
| title = Algorithm 65: Find
| volume = 4}}</ref>
 
<ref name=karrag>{{cite journal
==External links==
| last1 = Karloff | first1 = Howard J.
* [http://www.ics.uci.edu/~eppstein/161/960130.html Design and Analysis of Algorithms], for a detailed explanation of the recurrence relation for the median-of-medians
| last2 = Raghavan | first2 = Prabhakar | author2-link = Prabhakar Raghavan
| doi = 10.1145/174130.174132
| issue = 3
| journal = [[Journal of the ACM]]
| mr = 1370358
| pages = 454–476
| title = Randomized algorithms and pseudorandom numbers
| volume = 40
| year = 1993| s2cid = 17956460
| doi-access = free
}}</ref>
 
<ref name=kkzz>{{cite conference
| last1 = Kaplan | first1 = Haim
| last2 = Kozma | first2 = László
| last3 = Zamir | first3 = Or
| last4 = Zwick | first4 = Uri | author4-link = Uri Zwick
| editor1-last = Fineman | editor1-first = Jeremy T.
| editor2-last = Mitzenmacher | editor2-first = Michael | editor2-link = Michael Mitzenmacher
| arxiv = 1802.07041
| contribution = Selection from heaps, row-sorted matrices, and <math>X+Y</math> using soft heaps
| doi = 10.4230/OASIcs.SOSA.2019.5
| pages = 5:1–5:21
| publisher = Schloss Dagstuhl – Leibniz-Zentrum für Informatik
| series = OASIcs
| title = 2nd Symposium on Simplicity in Algorithms, SOSA 2019, January 8–9, 2019, San Diego, CA, USA
| volume = 69
| year = 2019| doi-access = free
}}</ref>
 
<ref name=kletar>{{cite book
| last1 = Kleinberg | first1 = Jon | author1-link = Jon Kleinberg
| last2 = Tardos | first2 = Éva | author2-link = Éva Tardos
| contribution = 13.5 Randomized divide and conquer: median-finding and quicksort
| isbn = 9780321295354
| pages = 727–734
| publisher = Addison-Wesley
| title = Algorithm Design
| year = 2006}}</ref>
 
<ref name=knuth>{{cite book
| last = Knuth | first = Donald E. | author-link = Donald Knuth
| contribution = Section 5.3.3: Minimum-comparison selection
| edition = 2nd
| isbn = 0-201-89685-0
| pages = 207–219
| publisher = Addison-Wesley
| title = The Art of Computer Programming, Volume 3: Sorting and Searching
| title-link = The Art of Computer Programming
| year = 1998}}</ref>
 
<ref name=kpaths>{{cite journal
| last = Eppstein | first = David | author-link = David Eppstein
| doi = 10.1137/S0097539795290477
| issue = 2
| journal = [[SIAM Journal on Computing]]
| mr = 1634364
| pages = 652–673
| title = Finding the <math>k</math> shortest paths
| volume = 28
| year = 1999}}</ref>
 
<ref name=musser>{{cite journal
| last = Musser | first = David R.
| date = August 1997
| doi = 10.1002/(sici)1097-024x(199708)27:8<983::aid-spe117>3.0.co;2-#
| issue = 8
| journal = Software: Practice and Experience
| pages = 983–993
| publisher = Wiley
| title = Introspective sorting and selection algorithms
| volume = 27}}</ref>
 
<ref name=pattho>{{cite conference
| last1 = Pătraşcu | first1 = Mihai | author1-link = Mihai Pătrașcu (computer scientist)
| last2 = Thorup | first2 = Mikkel
| arxiv = 1408.3045
| contribution = Dynamic integer sets with optimal rank, select, and predecessor search
| doi = 10.1109/FOCS.2014.26
| pages = 166–175
| publisher = IEEE Computer Society
| title = 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18–21, 2014
| year = 2014| isbn = 978-1-4799-6517-5 }}</ref>
 
<ref name=prt>{{cite journal
| last1 = Postmus | first1 = J. T.
| last2 = Rinnooy Kan | first2 = A. H. G. | author2-link = Alexander Rinnooy Kan
| last3 = Timmer | first3 = G. T.
| doi = 10.1145/182.358440
| issue = 11
| journal = [[Communications of the ACM]]
| mr = 784120
| pages = 878–881
| title = An efficient dynamic selection method
| volume = 26
| year = 1983| s2cid = 3211474
| doi-access = free
}}</ref>
 
<ref name=reischuk>{{cite journal
| last = Reischuk | first = Rüdiger
| doi = 10.1137/0214030
| issue = 2
| journal = [[SIAM Journal on Computing]]
| mr = 784745
| pages = 396–409
| title = Probabilistic parallel algorithms for sorting and selection
| volume = 14
| year = 1985}}</ref>
 
<ref name=skiena>{{cite book
| last = Skiena | first = Steven S. | author-link = Steven Skiena
| chapter = 17.3: Median and selection
| doi = 10.1007/978-3-030-54256-6
| edition = Third
| isbn = 978-3-030-54255-9
| mr = 4241430
| pages = 514–516
| publisher = Springer
| series = Texts in Computer Science
| title = The Algorithm Design Manual
| year = 2020| s2cid = 22382667 }}</ref>
 
<ref name=spp>{{cite journal
| last1 = Schönhage | first1 = A. | author1-link = Arnold Schönhage
| last2 = Paterson | first2 = M. | author2-link = Mike Paterson
| last3 = Pippenger | first3 = N. | author3-link = Nick Pippenger
| doi = 10.1016/S0022-0000(76)80029-3
| issue = 2
| journal = [[Journal of Computer and System Sciences]]
| mr = 428794
| pages = 184–199
| title = Finding the median
| volume = 13
| year = 1976| s2cid = 29867292 }}</ref>
 
<ref name=valiant>{{cite journal
| last = Valiant | first = Leslie G. | author-link = Leslie Valiant
| doi = 10.1137/0204030
| issue = 3
| journal = [[SIAM Journal on Computing]]
| mr = 378467
| pages = 348–355
| title = Parallelism in comparison problems
| volume = 4
| year = 1975}}</ref>
 
<ref name=matlab>{{cite web|url=https://www.mathworks.com/help/matlab/ref/mink.html|title=mink: Find k smallest elements of array|work=Matlab R2023a documentation|publisher=Mathworks|access-date=2023-03-30}}</ref>
 
<ref name=python>{{cite web|url=https://github.com/python/cpython/blob/main/Lib/heapq.py|title=heapq package source code|work=Python library|access-date=2023-08-06}}; see also the linked [https://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest/ comparison of algorithm performance on best-case data].</ref>
 
}}
 
{{DEFAULTSORT:Selection Algorithm}}
[[Category:Selection algorithms| ]]
 
[[ru:BFPRT-Алгоритм]]