Selection algorithm: Difference between revisions

Content deleted Content added
Lower bounds: too many howevers
Skiena
Line 16:
* [[Sorting|Sort]] the collection
* If the output of the sorting algorithm is an array, jump to its <math>k</math>th element; otherwise, scan the sorted sequence to find the <math>k</math>th element.
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a [[comparison sort]].{{r|clrs|skiena}} Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly-optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not.
 
For a sorting algorithm that generates one item at a time, such as [[selection sort]], the scan can be done in tandem with the sort, and the sort can be terminated once the <math>k</math> element has been found. One possible design of a consolation bracket in a [[single-elimination tournament]], in which the teams who lost to the eventual winner play another mini-tournament to determine second place, can be seen as an instance of this method.{{r|bfprt}} Applying this optimization to [[heapsort]] produces the [[heapselect]] algorithm, which can select the <math>k</math>th smallest value in time <math>O(n+k\log n)</math>. This is fast when <math>k</math> is small relative to <math>n</math>, but degenerates to <math>O(n\log n)</math> for larger values of <math>k</math>, such as the choice <math>k=n/2</math> used for median finding.
Line 29:
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> and <math>R</math>, quickselect only makes one of these two calls. Its [[expected time]] is <math>O(n)</math>.{{r|clrs|kletar}}
*The [[Floyd–Rivest algorithm]], a variation of quickselect, chooses a pivot by randomly sampling a subset of <math>r</math> data values, for some sample size <math>r</math>, and then recursively selecting two elements somewhat above and below position <math>rk/n</math> of the sample to use as pivots. With this choice, it is likely that <math>k</math> is sandwiched between the two pivots, so that after pivoting only a small number of data values between the pivots are left for a recursive call. This method can achieve an expected number of comparisons that is <math>n+\min(k,n-k)+o(n)</math>.{{r|floriv}} In their original work, Floyd and Rivest claimed that the <math>o(n)</math> term could be made as small as <math>O(\sqrt n)</math> by a recursive sampling scheme, but the correctness of their analysis has been questioned.{{r|brown|prt}} Instead, more rigorous analysis has shown that a version of their algorithm achieves <math>O(\sqrt{n\log n})</math> for this term.{{r|knuth}}
*The [[median of medians]] method partitions the input into sets of five elements, and then uses some other method (rather than a recursive call) to find the median of each of these sets in constant time per set. It then recursively calls the same selection algorithm to find the median of these <math>n/5</math> medians, using the result as its pivot. It can be shown that, for this choice of pivot, <math>\max(|L|,|R|)\le 7n/10</math>. Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> and at most <math>7n/10</math> elements. The total size of these two recursive subproblems is at most <math>9n/10</math>, allowing the total time to be analyzed as a geometric series adding to <math>O(n)</math>. Unlike quickselect, this algorithm is deterministic, not randomized.{{r|clrs|bfprt}} It was the first linear-time deterministic selection algorithm known,{{r|bfprt}} and is commonly taught in undergraduate algorithms classes as an example of a [[divide and conquer]] algorithm that does not divide into two equal subproblems. However, the high constant factors in its <math>O(n)</math> time bound make it unsuitableslower forthan practicalquickselect usein practice.{{r|skiena}}
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case <math>O(n)</math> time.{{r|musser}}
 
Line 55:
 
== Language support ==
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is the [[Standard Template Library]] for [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time, and also partitions the data, requiring that the ''n''th element be sorted into its correct place, elements before the ''n''th element are less than it, and elements after the ''n''th element are greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data.{{r|iso|sgiskiena}}
 
For [[Perl]], the module [https://metacpan.org/module/Sort::Key::Top Sort::Key::Top], available from [[CPAN]], provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the [https://metacpan.org/module/Statistics::CaseResampling Statistics::CaseResampling] module provides a function to calculate quantiles using Quickselect.
Line 246:
| volume = 26
| year = 1983}}</ref>
 
<ref name=skiena>{{cite book
| last = Skiena | first = Steven S. | author-link = Steven Skiena
| chapter = 17.3: Median and selection
| doi = 10.1007/978-3-030-54256-6
| edition = Third
| isbn = 978-3-030-54255-9
| mr = 4241430
| pages = 514–516
| publisher = Springer
| series = Texts in Computer Science
| title = The Algorithm Design Manual
| year = 2020}}</ref>
 
<ref name=spp>{{cite journal
Line 259 ⟶ 272:
| volume = 13
| year = 1976}}</ref>
 
<ref name=iso>Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)</ref>
 
<ref name=sgi>[http://www.sgi.com/tech/stl/nth_element.html nth_element], SGI STL</ref>
 
}}