Selection algorithm: Difference between revisions

Content deleted Content added
Partial selection sort: The swap shall be made at the end and NOT inside the inner "for" loop.
m Capitalize lower-cased proper nouns (Quickselect and Quicksort) to be consistent with existing capitalized ones
Line 3:
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the ''k''th smallest number in a [[List (abstract data type)|list]] or [[Array data structure|array]]; such a number is called the ''k''th ''[[order statistic]]''. This includes the cases of finding the [[minimum]], [[maximum]], and [[median]] elements. There are O(''n'')-time (worst-case linear time) selection algorithms, and sublinear performance is possible for structured data; in the extreme, O(1) for an array of sorted data. Selection is a subproblem of more complex problems like the [[nearest neighbor problem|nearest neighbor]] and [[shortest path]] problems. Many selection algorithms are derived by generalizing a [[sorting algorithm]], and conversely some sorting algorithms can be derived as repeated application of selection.
 
The simplest case of a selection algorithm is finding the minimum (or maximum) element by iterating through the list, keeping track of the running minimum – the minimum so far – (or maximum) and can be seen as related to the [[selection sort]]. Conversely, the hardest case of a selection algorithm is finding the median. In fact, a specialized median-selection algorithm can be used to build a general selection algorithm, as in [[median of medians]]. The best-known selection algorithm is [[quickselectQuickselect]], which is related to [[quicksortQuicksort]]; like quicksortQuicksort, it has (asymptotically) optimal average performance, but poor worst-case performance, though it can be modified to give optimal worst-case performance as well.
 
== Selection by sorting ==
Line 11:
 
=== Unordered partial sorting ===
If partial sorting is relaxed so that the ''k'' smallest elements are returned, but not in order, the factor of O(''k'' log ''k'') can be eliminated. An additional maximum selection (taking O(''k'') time) is required, but since <math>k \leq n</math>, this still yields asymptotic complexity of O(''n''). In fact, partition-based selection algorithms yield both the ''k''th smallest element itself and the ''k'' smallest elements (with other elements not in order). This can be done in O(''n'') time – average complexity of [[quickselectQuickselect]], and worst-case complexity of refined partition-based selection algorithms.
 
Conversely, given a selection algorithm, one can easily get an unordered partial sort (''k'' smallest elements, not in order) in O(''n'') time by iterating through the list and recording all elements less than the ''k''th element. If this results in fewer than ''k''&nbsp;&minus;&nbsp;1 elements, any remaining elements equal the ''k''th element. Care must be taken, due to the possibility of equality of elements: one must not include all elements less than ''or equal to'' the ''k''th element, as elements greater than the ''k''th element may also be equal to it.
Line 20:
A simple example of selection by partial sorting is to use the partial [[selection sort]].
 
The obvious linear time algorithm to find the minimum (resp. maximum) – iterating over the list and keeping track of the minimum (resp. maximum) element so far – can be seen as a partial selection sort that selects the 1 smallest element. However, many other partial sorts also reduce to this algorithm for the case ''k''&nbsp; =&nbsp;1, such as a partial heap sort.
 
More generally, a partial selection sort yields a simple selection algorithm which takes O(''kn'') time. This is asymptotically inefficient, but can be sufficiently efficient if ''k'' is small, and is easy to implement. Concretely, we simply find the minimum value and move it to the beginning, repeating on the remaining list until we have accumulated ''k'' elements, and then return the ''k''th element. Here is partial selection sort-based algorithm:
Line 38:
{{further|Quickselect}}
 
Linear performance can be achieved by a partition-based selection algorithm, most basically [[quickselectQuickselect]]. Quickselect is a variant of [[quicksortQuicksort]] – in both one chooses a pivot and then partitions the data by it, but while Quicksort recurses on both sides of the partition, Quickselect only recurses on one side, namely the side on which the desired ''k''th element is. As with Quicksort, this has optimal average performance, in this case linear, but poor worst-case performance, in this case quadratic. This occurs for instance by taking the first element as the pivot and searching for the maximum element, if the data is already sorted. In practice this can be avoided by choosing a random element as pivot, which yields [[almost certain]] linear performance. Alternatively, a more careful deterministic pivot strategy can be used, such as [[median of medians]]. These are combined in the hybrid [[introselect]] algorithm (analogous to [[introsort]]), which starts with Quickselect but falls back to median of medians if progress is slow, resulting in both fast average performance and optimal worst-case performance of O(''n'').
 
The partition-based algorithms are generally done in place, which thus results in partially sorting the data. They can be done out of place, not changing the original data, at the cost of O(''n'') additional space.
Line 51:
 
== Incremental sorting by selection ==
Converse to selection by sorting, one can incrementally sort by repeated selection. Abstractly, selection only yields a single element, the ''k''th element. However, practical selection algorithms frequently involve partial sorting, or can be modified to do so. Selecting by partial sorting naturally does so, sorting the elements up to ''k'', and selecting by partitioning also sorts some elements: the pivots are sorted to the correct positions, with the ''k''th element being the final pivot, and the elements between the pivots have values between the pivot values. The difference between partition-based selection and partition-based sorting, as in quickselectQuickselect versus quicksortQuicksort, is that in selection one recurses on only one side of each pivot, sorting only the pivots (an average of log(''n'') pivots are used), rather than recursing on both sides of the pivot.
 
This can be used to speed up subsequent selections on the same data; in the extreme, a fully sorted array allows O(1) selection. Further, compared with first doing a full sort, incrementally sorting by repeated selection [[amortized analysis|amortizes]] the sorting cost over multiple selections.
Line 97:
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time, and also partitions the data, requiring that the ''n''th element be sorted into its correct place, elements before the ''n''th element are less than it, and elements after the ''n''th element are greater than it. It is implied but not required that it is based on Hoare's algorithm (or some variant) by its requirement of expected linear time and partitioning of data.<ref>Section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E)</ref><ref>[http://www.sgi.com/tech/stl/nth_element.html nth_element], SGI STL</ref>
 
For [[Perl]], the module [https://metacpan.org/module/Sort::Key::Top Sort::Key::Top], available from [[CPAN]], provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the [https://metacpan.org/module/Statistics::CaseResampling Statistics::CaseResampling] module provides a function to calculate quantiles using quickselectQuickselect.
 
[[Python (programming language)|Python]]'s standard library (since 2.4) includes <code>[https://docs.python.org/library/heapq.html heapq].nsmallest()</code> and <code>nlargest()</code>, returning sorted lists, in O(''n'' log ''k'') time.<ref>https://stackoverflow.com/a/23038826</ref>