Selection algorithm: Difference between revisions

Content deleted Content added
Start rewrite, cutting a lot of unsourced cruft. New material is also unsourced for now but should be sourceable.
pull refs to end for greater source readability and remove some bad ones; promote CLRS to ref
Line 2:
{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
{{more footnotes|date=July 2017}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the <math>k</math>th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the <math>k</math>th [[order statistic]]. DifferentFor choicesa collection of <math>n</math> values, setting <math>k</math> produceto <math>1</math>, <math>(n+1)/2</math> (with some choice of rounding), or <math>n</math> produces the [[minimum]], [[maximummedian]], and [[medianmaximum]] elements in the given collection, respectively.{{r|clrs}} Selection algorithms include [[quickselect]], and the [[median of medians]] algorithm. When applied to a collection of <math>n</math> values, these algorithms take [[linear time]], <math>O(n)</math> as expressed using [[big O notation]]. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted [[Array (data structure)|array]] takes time <math>O(1)</math>.
 
==Algorithms==
Line 9:
* [[Sorting|Sort]] the collection
* If the output of the sorting algorithm is an array, jump to its <math>k</math>th element; otherwise, scan the sorted sequence to find the <math>k</math>th element.
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a [[comparison sort]].{{r|clrs}} Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly-optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not.
 
For a sorting algorithm that generates one item at a time, such as [[selection sort]], the scan can be done in tandem with the sort, and the sort can be terminated once the <math>k</math> element has been found. Applying this optimization to [[heapsort]] produces the [[heapselect]] algorithm, which can select the <math>k</math>th smallest value in time <math>O(n+k\log n)</math>. This is fast when <math>k</math> is small relative to <math>n</math>, but degenerates to <math>O(n\log n)</math> for larger values of <math>k</math>, such as the choice <math>k=n/2</math> used for median finding.
Line 22:
The time to compare the pivot against all the other values is <math>O(n)</math>. However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot.
*If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a [[geometric series]] to <math>O(n)</math>. However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each call.
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> and <math>R</math>, quickselect only makes one of these two calls. Its [[expected time]] is <math>O(n)</math>.{{r|clrs}}
*The [[median of medians]] method partitions the input into sets of five elements, and then uses some other method (rather than a decisionrecursive treecall) to find the median of each of these sets in constant time per set. It then recursively calls the same selection algorithm to find the median of these <math>n/5</math> medians, using the result as its pivot. It can be shown that, for this choice of pivot, <math>\max(|L|,|R|)\le 7n/10</math>. Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> and at most <math>7n/10</math> elements. The total size of these two recursive subproblems is at most <math>9n/10</math>, allowing the total time to be analyzed as a geometric series adding to <math>O(n)</math>. Unlike quickselect, this algorithm is deterministic, not randomized.{{r|clrs}} It is commonly taught in undergraduate algorithms classes as an example of a [[divide and conquer]] algorithm that does not divide into two equal subproblems. However, the high constant factors in its <math>O(n)</math> time bound make it unsuitable for practical use.
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case <math>O(n)</math> time.
 
Line 48:
 
== Language support ==
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time, and also partitions the data, requiring that the ''n''th element be sorted into its correct place, elements before the ''n''th element are less than it, and elements after the ''n''th element are greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data.<ref>Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)</ref><ref>[http://www.{{r|iso|sgi.com/tech/stl/nth_element.html nth_element], SGI STL</ref>}}
 
For [[Perl]], the module [https://metacpan.org/module/Sort::Key::Top Sort::Key::Top], available from [[CPAN]], provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the [https://metacpan.org/module/Statistics::CaseResampling Statistics::CaseResampling] module provides a function to calculate quantiles using Quickselect.
 
[[Python (programming language)|Python]]'s standard library (since 2.4) includes <code>[https://docs.python.org/library/heapq.html heapq].nsmallest()</code> and <code>nlargest()</code>, returning sorted lists, in O(''n'' log ''k'') time.<ref>{{Cite web|url=https://stackoverflow.com/a/23038826|title=Python - What is the time complexity of heapq.nlargest?}}</ref> Numpy has the <code>partition()</code> function.
 
[[Matlab]] includes <code>maxk()</code> and <code>mink()</code> functions, which return the maximal (minimal) k values in a vector as well as their indices.
 
Because [[sorting algorithm#Language support|language support for sorting]] is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed, for [[Lazy evaluation|lazy languages]], this simplistic approach can even achieve the best complexity possible for the ''k'' smallest/greatest sorted (with maximum/minimum as a special case) if the sort is lazy enough{{Citation needed|date=April 2014}}.
 
== See also ==
Line 63 ⟶ 61:
 
== References ==
{{reflist}}|refs=
{{refbegin}}
 
<ref name=clrs>{{Introduction to Algorithms|edition=3|chapter=Chapter 9: Medians and order statistics|pages=213–227}}; "Section 14.1: Dynamic order statistics", pp. 339–345</ref>
==Bibliography==
 
<ref name=iso>Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)</ref>
 
<ref name=sgi>[http://www.sgi.com/tech/stl/nth_element.html nth_element], SGI STL</ref>
 
}}
 
==Further reading==
{{refbegin}}
* {{Cite journal | last1 = Blum | first1 = M. | author-link1 = Manuel Blum| last2 = Floyd | first2 = R. W. | author-link2 = Robert W. Floyd| last3 = Pratt | first3 = V. R. | author-link3 = Vaughan Pratt| last4 = Rivest | first4 = R. L. | author-link4 = Ron Rivest| last5 = Tarjan | first5 = R. E. | author-link5 = Robert Tarjan | title = Time bounds for selection | doi = 10.1016/S0022-0000(73)80033-9 | journal = Journal of Computer and System Sciences | volume = 7 | issue = 4 | pages = 448–461 | date =August 1973 | url = http://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf| doi-access = free }}
* {{Cite journal | last1 = Floyd | first1 = R. W. | author-link1 = Robert W. Floyd | last2 = Rivest | first2 = R. L. | author-link2 = Ron Rivest | doi = 10.1145/360680.360691 | title = Expected time bounds for selection | journal = Communications of the ACM | volume = 18 | issue = 3 | pages = 165–172 | date=March 1975 | s2cid = 3064709 }}