Selection algorithm: Difference between revisions

*If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the running times of the calls would add in a [[geometric series]] to <math>O(n)</math>. However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each call.{{r|kletar}}
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> and <math>R</math>, quickselect only makes one of these two calls. Its [[expected time]] is <math>O(n)</math>.{{r|clrs|kletar}}
*The [[Floyd–Rivest algorithm]], a variation of quickselect, chooses a pivot by randomly sampling a subset of <math>r</math> data values, for some sample size <math>r</math>, and then recursively selecting two elements somewhat above and below position <math>rk/n</math> of the sample to use as pivots. With this choice, it is likely that <math>k</math> is sandwiched between the two pivots, so that after pivoting only a small number of data values between the pivots are left for a recursive call. This method can achieve an expected number of comparisons that is <math>n+\min(k,n-k)+o(n)</math>.{{r|floriv}} In their original work, Floyd and Rivest claimed that the <math>o(n)</math> term could be made as small as <math>O(\sqrt n)</math> by a recursive sampling scheme, but the correctness of their analysis has been questioned.{{r|brown|prt}} Instead, more rigorous analysis has shown that a version of their algorithm achieves <math>O(\sqrt{n\log n})</math> for this term.{{r|knuth}}
*The [[median of medians]] method partitions the input into sets of five elements, and then uses some other method (rather than a recursive call) to find the median of each of these sets in constant time per set. It then recursively calls the same selection algorithm to find the median of these <math>n/5</math> medians, using the result as its pivot. It can be shown that, for this choice of pivot, <math>\max(|L|,|R|)\le 7n/10</math>. Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> and at most <math>7n/10</math> elements. The total size of these two recursive subproblems is at most <math>9n/10</math>, allowing the total time to be analyzed as a geometric series adding to <math>O(n)</math>. Unlike quickselect, this algorithm is deterministic, not randomized.{{r|clrs|bfprt}} It was the first linear-time deterministic selection algorithm known,{{r|bfprt}} and is commonly taught in undergraduate algorithms classes as an example of a [[divide and conquer]] algorithm that does not divide into two equal subproblems. However, the high constant factors in its <math>O(n)</math> time bound make it unsuitable for practical use.
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case <math>O(n)</math> time.{{r|musser}}
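The quickselect strategy described above can be illustrated with a short Python sketch (the function name and the list-based partitioning are illustrative only; practical implementations partition in place):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element of values (k = 0 gives the minimum).

    A sketch of randomized quickselect: partition around a uniformly
    random pivot, then recurse into the one side containing position k.
    """
    pivot = random.choice(values)
    left = [v for v in values if v < pivot]    # the subcollection L
    right = [v for v in values if v > pivot]   # the subcollection R
    if k < len(left):
        return quickselect(left, k)            # answer lies in L
    if k >= len(values) - len(right):
        # answer lies in R; shift k past L and the copies of the pivot
        return quickselect(right, k - (len(values) - len(right)))
    return pivot                               # position k holds the pivot value
```

Because only one of the two possible recursive calls is made, and a random pivot splits the input evenly on average, the expected running times of the calls form a geometric series summing to <math>O(n)</math>.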
 
== Lower bounds ==
The <math>O(n)</math> running time of the selection algorithms described above is necessary, because a selection algorithm that can handle inputs in an arbitrary order must take that much time to look at all of its inputs; if any one of its input values is not compared, that one value could be the one that should have been selected, and the algorithm can be made to produce an incorrect answer. However, beyond this simple argument, there has been a significant amount of research on the exact number of comparisons needed for selection, both in the randomized and deterministic cases.
{{anchor|Tournament Algorithm}}
 
Selecting the minimum of <math>n</math> values requires <math>n-1</math> comparisons, because the <math>n-1</math> values that are not selected must each have been determined to be non-minimal, by being the largest in some comparison, and no two of these values can be largest in the same comparison. The same argument applies symmetrically to selecting the maximum.{{r|knuth}}
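This argument can be made concrete by counting comparisons in the standard linear scan for the minimum, sketched here in Python (the comparison counter is for illustration only):

```python
def minimum(values):
    """Return (minimum, number of comparisons used).

    A linear scan uses exactly n - 1 comparisons: each of the n - 1
    non-selected values is the larger element in at least one
    comparison, which is what the lower bound requires.
    """
    comparisons = 0
    best = values[0]
    for v in values[1:]:
        comparisons += 1
        if v < best:
            best = v
    return best, comparisons
```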

The next simplest case is selecting the second-smallest. After several incorrect attempts, the first tight lower bound on this case was published in 1964 by Soviet mathematician [[Sergey Kislitsyn]]. The bound can be shown by observing that selecting the second-smallest also requires distinguishing the smallest value from the rest, and by considering the number <math>p</math> of comparisons involving the smallest value that an algorithm for this problem makes. Each of the <math>p</math> items that were compared to the smallest value is a candidate for second-smallest, and <math>p-1</math> of these values must be found to be larger than another value in a second comparison in order to rule them out as second-smallest.
With <math>n-1</math> values being the larger in at least one comparison, and <math>p-1</math> values being the larger in at least two comparisons, there are a total of at least <math>n+p-2</math> comparisons. An [[Adversary model|adversary argument]], in which the outcome of each comparison is chosen in order to maximize <math>p</math> (subject to consistency with at least one possible ordering) rather than by the numerical values of the given items, shows that it is possible to force <math>p</math> to be at least <math>\log_2 n</math>. Therefore, the worst-case number of comparisons needed to select the second smallest is <math>n+\lceil\log_2 n\rceil-2</math>, the same number that would be obtained by holding a single-elimination tournament with a run-off tournament among the values that lost to the smallest value. However, the expected number of comparisons of a randomized selection algorithm can be better than this bound; for instance, selecting the second-smallest of six elements requires seven comparisons in the worst case, but may be done by a randomized algorithm with an expected number of 6.5 comparisons.{{r|knuth}}
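The single-elimination tournament with a run-off can be sketched in Python as follows (the pairing scheme shown is one arbitrary choice; the count is exactly <math>n+\log_2 n-2</math> comparisons when <math>n</math> is a power of two, and at most <math>n+\lceil\log_2 n\rceil-2</math> in general):

```python
def second_smallest(values):
    """Return (second-smallest value, number of comparisons used).

    Runs a single-elimination tournament, remembering which values
    lost directly to each surviving player, then holds a run-off
    among the values beaten by the overall winner.
    """
    comparisons = 0
    # Each "player" carries its value and the list of values it has beaten.
    players = [(v, []) for v in values]
    while len(players) > 1:
        nxt = []
        for i in range(0, len(players) - 1, 2):
            (a, beat_a), (b, beat_b) = players[i], players[i + 1]
            comparisons += 1
            if a <= b:
                nxt.append((a, beat_a + [b]))
            else:
                nxt.append((b, beat_b + [a]))
        if len(players) % 2 == 1:      # an odd player out gets a bye
            nxt.append(players[-1])
        players = nxt
    winner, losers_to_winner = players[0]
    # Run-off: the second-smallest must have lost directly to the winner.
    runner_up = losers_to_winner[0]
    for v in losers_to_winner[1:]:
        comparisons += 1
        if v < runner_up:
            runner_up = v
    return runner_up, comparisons
```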
 
 
== Language support ==
| title = Algorithm Design
| year = 2006}}</ref>
 
<ref name=knuth>{{cite book
| last = Knuth | first = Donald E. | author-link = Donald Knuth
| contribution = Section 5.3.3: Minimum-comparison selection
| edition = 2nd
| isbn = 0-201-89685-0
| pages = 207–219
| publisher = Addison-Wesley
| title = The Art of Computer Programming, Volume 3: Sorting and Searching
| title-link = The Art of Computer Programming
| year = 1998}}</ref>
 
<ref name=kpaths>{{cite journal
 
}}
 
 
{{DEFAULTSORT:Selection Algorithm}}