Selection algorithm: Difference between revisions

Content deleted Content added
Bibliography: I don't think any material here still resembles the DADS link
Start rewrite, cutting a lot of unsourced cruft. New material is also unsourced for now but should be sourceable.
Line 2:
{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
{{more footnotes|date=July 2017}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the ''<math>k''</math>th smallest numbervalue in a [[Listcollection (abstractof dataordered type)|list]]values, orsuch [[Arrayas datanumbers. structure|array]];The suchvalue athat it numberfinds is called the ''<math>k''</math>th ''[[order statistic]]''. ThisDifferent includeschoices theof cases of<math>k</math> findingproduce the [[minimum]], [[maximum]], and [[median]] elements. Therein arethe O(''n'')-timegiven (worst-casecollection. linearSelection time)algorithms selectioninclude algorithms[[quickselect]], and sublinear performance is possible for structured data; in the extreme, O(1) for an array[[median of sortedmedians]] dataalgorithm. SelectionWhen isapplied to a subproblemcollection of more<math>n</math> complexvalues, problemsthese likealgorithms thetake [[nearestlinear neighbor problem|nearest neighbortime]], and<math>O(n)</math> as expressed using [[shortestbig pathO notation]] problems. ManyFor selectiondata algorithmsthat areis derivedalready bystructured, generalizingfaster aalgorithms [[sortingmay algorithm]],be andpossible; converselyas somean sortingextreme algorithmscase, canselection bein derivedan asalready-sorted repeated[[Array application(data ofstructure)|array]] selectiontakes time <math>O(1)</math>.
 
==Algorithms==
The simplest case of a selection algorithm is finding the minimum (or maximum) element by iterating through the list, keeping track of the running minimum – the minimum so far – (or maximum) and can be seen as related to the [[selection sort]]. Conversely, the hardest case of a selection algorithm is finding the median. In fact, a specialized median-selection algorithm can be used to build a general selection algorithm, as in [[median of medians]]. The best-known selection algorithm is [[Quickselect]], which is related to [[Quicksort]]; like Quicksort, it has (asymptotically) optimal average performance, but poor worst-case performance, though it can be modified to give optimal worst-case performance as well.
===Sorting and heapselect===
As a baseline algorithm, selection of the <math>k</math>th smallest value in a collection of values can be performed very simply by the following two steps:
* [[Sorting|Sort]] the collection
* If the output of the sorting algorithm is an array, jump to its <math>k</math>th element; otherwise, scan the sorted sequence to find the <math>k</math>th element.
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a [[comparison sort]]. Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly-optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not.
 
For a sorting algorithm that generates one item at a time, such as [[selection sort]], the scan can be done in tandem with the sort, and the sort can be terminated once the <math>k</math> element has been found. Applying this optimization to [[heapsort]] produces the [[heapselect]] algorithm, which can select the <math>k</math>th smallest value in time <math>O(n+k\log n)</math>. This is fast when <math>k</math> is small relative to <math>n</math>, but degenerates to <math>O(n\log n)</math> for larger values of <math>k</math>, such as the choice <math>k=n/2</math> used for median finding.
== Selection by sorting ==
By sorting the list or array then selecting the desired element, selection can be [[Reduction (complexity)|reduced]] to [[sorting algorithm|sorting]]. This method is inefficient for selecting a single element, but is efficient when many selections need to be made from an array, in which case only one initial, expensive sort is needed, followed by many cheap selection operations – O(1) for an array, though selection is O(''n'') in a linked list, even if sorted, due to lack of [[random access]]. In general, sorting requires O(''n'' log ''n'') time, where ''n'' is the length of the list, although a lower bound is possible with non-comparative sorting algorithms like [[radix sort]] and [[counting sort]].
 
===Decision trees===
Rather than sorting the whole list or array, one can instead use [[partial sorting]] to select the ''k'' smallest or ''k'' largest elements. The ''k''th smallest (resp., ''k''th largest element) is then the largest (resp., smallest element) of the partially sorted list – this then takes O(1) to access in an array and O(''k'') to access in a list.
 
===Pivoting===
=== Unordered partial sorting ===
Many methods for selection are based on choosing a special "pivot" element from the input, and using comparisons with this element to divide the remaining <math>n-1</math> input values into two subsets: the set <math>L</math> of elements less than the pivot, and the set <math>R</math> of elements greater than the pivot. The algorithm can then determine where the <math>k</math>th smallest value is to be found, based on a comparison of <math>k</math> with the sizes of these sets. In particular, if <math>k\le|L|</math>, the <math>k</math>th smallest value is in <math>L</math>, and can be found recursively by applying the same selection algorithm to <math>L</math>. If <math>k=|L|+1</math>, then the <math>k</math>th smallest value is the pivot, and it can be returned immediately. In the remaining case, the <math>k</math>th smallest value is in <math>R</math>, and more specifically it is the element in position <math>k-|L|-1</math> of <math>R</math>. It can be found by applying a selection algorithm recursively, seeking the value in this position in <math>R</math>.
If partial sorting is relaxed so that the ''k'' smallest elements are returned, but not in order, the factor of O(''k'' log ''k'') can be eliminated. An additional maximum selection (taking O(''k'') time) is required, but since <math>k \leq n</math>, this still yields asymptotic complexity of O(''n''). In fact, partition-based selection algorithms yield both the ''k''th smallest element itself and the ''k'' smallest elements (with other elements not in order). This can be done in O(''n'') time – average complexity of [[Quickselect]], and worst-case complexity of refined partition-based selection algorithms.
 
As with the related pivoting-based [[quicksort]] algorithm, the partition of the input into <math>L</math> and <math>R</math> may be done by making new collections for these sets, or by a method that partitions a given list or array data type in-place. Details vary depending on how the input collection is represented.
Conversely, given a selection algorithm, one can easily get an unordered partial sort (''k'' smallest elements, not in order) in O(''n'') time by iterating through the list and recording all elements less than the ''k''th element. If this results in fewer than ''k''&nbsp;&minus;&nbsp;1 elements, any remaining elements equal the ''k''th element. Care must be taken, due to the possibility of equality of elements: one must not include all elements less than ''or equal to'' the ''k''th element, as elements greater than the ''k''th element may also be equal to it.
 
The time to compare the pivot against all the other values is <math>O(n)</math>. However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot.
Thus unordered partial sorting (lowest ''k'' elements, but not ordered) and selection of the ''k''th element are very similar problems. Not only do they have the same asymptotic complexity, O(''n''), but a solution to either one can be converted into a solution to the other by a straightforward algorithm (finding a max of ''k'' elements, or filtering elements of a list below a cutoff of the value of the ''k''th element).
*If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a [[geometric series]] to <math>O(n)</math>. However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each call.
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> and <math>R</math>, quickselect only makes one of these two calls. Its [[expected time]] is <math>O(n)</math>.
*The [[median of medians]] method partitions the input into sets of five elements, and then uses a decision tree to find the median of each of these sets. It then recursively calls the same selection algorithm to find the median of these <math>n/5</math> medians, using the result as its pivot. It can be shown that, for this choice of pivot, <math>\max(|L|,|R|)\le 7n/10</math>. Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> and at most <math>7n/10</math> elements. The total size of these two recursive subproblems is at most <math>9n/10</math>, allowing the total time to be analyzed as a geometric series adding to <math>O(n)</math>. Unlike quickselect, this algorithm is deterministic, not randomized. It is commonly taught in undergraduate algorithms classes as an example of a [[divide and conquer]] algorithm that does not divide into two equal subproblems. However, the high constant factors in its <math>O(n)</math> time bound make it unsuitable for practical use.
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to medians of medians guaranteeing worst-case <math>O(n)</math> time.
 
===Sublinear data structures===
=== Partial selection sort ===
A simple example of selection by partial sorting is to use the partial [[selection sort]].
 
The obvious linear time algorithm to find the minimum (resp. maximum) – iterating over the list and keeping track of the minimum (resp. maximum) element so far – can be seen as a partial selection sort that selects the 1 smallest element. However, many other partial sorts also reduce to this algorithm for the case ''k'' =1, such as a partial heap sort.
 
More generally, a partial selection sort yields a simple selection algorithm which takes O(''kn'') time. This is asymptotically inefficient, but can be sufficiently efficient if ''k'' is small, and is easy to implement. Concretely, we simply find the minimum value and move it to the beginning, repeating on the remaining list until we have accumulated ''k'' elements, and then return the ''k''th element. Here is partial selection sort-based algorithm:
 
'''function''' select(list[1..n], k)
'''for''' i '''from''' 1 '''to''' k
minIndex = i
minValue = list[i]
'''for''' j '''from''' i+1 '''to''' n '''do'''
'''if''' list[j] < minValue '''then'''
minIndex = j
minValue = list[j]
swap list[i] and list[minIndex]
'''return''' list[k]
 
== Partition-based selection ==
{{further|Quickselect}}
 
Linear performance can be achieved by a partition-based selection algorithm, most basically [[Quickselect]]. Quickselect is a variant of [[Quicksort]] – in both one chooses a pivot and then partitions the data by it, but while Quicksort recurses on both sides of the partition, Quickselect only recurses on one side, namely the side on which the desired ''k''th element is. As with Quicksort, this has optimal average performance, in this case linear, but poor worst-case performance, in this case quadratic. This occurs for instance by taking the first element as the pivot and searching for the maximum element, if the data is already sorted. In practice this can be avoided by choosing a random element as pivot, which yields [[almost certain]] linear performance. Alternatively, a more careful deterministic pivot strategy can be used, such as [[median of medians]]. These are combined in the hybrid [[introselect]] algorithm (analogous to [[introsort]]), which starts with Quickselect but falls back to median of medians if progress is slow, resulting in both fast average performance and optimal worst-case performance of O(''n'').
 
The partition-based algorithms are generally done in place, which thus results in partially sorting the data. They can be done out of place, not changing the original data, at the cost of O(''n'') additional space.
 
=== Median selection as pivot strategy ===
{{further|Median of medians}}
A median-selection algorithm can be used to yield a general selection algorithm or sorting algorithm, by applying it as the pivot strategy in Quickselect or Quicksort; if the median-selection algorithm is asymptotically optimal (linear-time), the resulting selection or sorting algorithm is as well. In fact, an exact median is not necessary – an approximate median is sufficient. In the [[median of medians]] selection algorithm, the pivot strategy computes an approximate median and uses this as pivot, recursing on a smaller set to compute this pivot. In practice the overhead of pivot computation is significant, so these algorithms are generally not used, but this technique is of theoretical interest in relating selection and sorting algorithms.
 
In detail, given a median-selection algorithm, one can use it as a pivot strategy in Quickselect, obtaining a selection algorithm. If the median-selection algorithm is optimal, meaning O(''n''), then the resulting general selection algorithm is also optimal, again meaning linear. This is because Quickselect is a [[divide and conquer algorithm]], and using the median at each pivot means that at each step the search set decreases by half in size, so the overall complexity is a [[geometric series]] times the complexity of each step, and thus simply a constant times the complexity of a single step, in fact <math>2 = 1/(1-(1/2))</math> times (summing the series).
 
Similarly, given a median-selection algorithm or general selection algorithm applied to find the median, one can use it as a pivot strategy in Quicksort, obtaining a sorting algorithm. If the selection algorithm is optimal, meaning O(''n''), then the resulting sorting algorithm is optimal, meaning O(''n'' log ''n''). The median is the best pivot for sorting, as it evenly divides the data, and thus guarantees optimal sorting, assuming the selection algorithm is optimal. A sorting analog to median of medians exists, using the pivot strategy (approximate median) in Quicksort, and similarly yields an optimal Quicksort.
 
== Incremental sorting by selection ==
Converse to selection by sorting, one can incrementally sort by repeated selection. Abstractly, selection only yields a single element, the ''k''th element. However, practical selection algorithms frequently involve partial sorting, or can be modified to do so. Selecting by partial sorting naturally does so, sorting the elements up to ''k'', and selecting by partitioning also sorts some elements: the pivots are sorted to the correct positions, with the ''k''th element being the final pivot, and the elements between the pivots have values between the pivot values. The difference between partition-based selection and partition-based sorting, as in Quickselect versus Quicksort, is that in selection one recurses on only one side of each pivot, sorting only the pivots (an average of log(''n'') pivots are used), rather than recursing on both sides of the pivot.
 
This can be used to speed up subsequent selections on the same data; in the extreme, a fully sorted array allows O(1) selection. Further, compared with first doing a full sort, incrementally sorting by repeated selection [[amortized analysis|amortizes]] the sorting cost over multiple selections.
 
For partially sorted data (up to ''k''), so long as the partially sorted data and the index ''k'' up to which the data is sorted are recorded, subsequent selections of ''j'' less than or equal to ''k'' can simply select the ''j''th element, as it is already sorted, while selections of ''j'' greater than ''k'' only need to sort the elements above the ''k''th position.
 
For partitioned data, if the list of pivots is stored (for example, in a sorted list of the indices), then subsequent selections only need to select in the interval between two pivots (the nearest pivots below and above). The biggest gain is from the top-level pivots, which eliminate costly large partitions: a single pivot near the middle of the data cuts the time for future selections in half. The pivot list will grow over subsequent selections, as the data becomes more sorted, and can even be passed to a partition-based sort as the basis of a full sort.
 
== Using data structures to select in sublinear time ==
Given an unorganized list of data, linear time (Ω(''n'')) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the ''k''th largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
 
Line 80 ⟶ 46:
 
This bound is achievable for ''t''=2 but better, more complex bounds are known for larger ''t''.{{citation needed|date=April 2018}}
 
== Space complexity ==
The required space complexity of selection is O(1) additional storage, in addition to storing the array in which selection is being performed. Such space complexity can be achieved while preserving optimal O(n) time complexity.<ref>Lai T.W., Wood D. (1988) Implicit selection. In: Karlsson R., Lingas A. (eds) SWAT 88. SWAT 1988. Lecture Notes in Computer Science, vol 318. Springer, Berlin, Heidelberg</ref>
{{Expand section|date=January 2019}}
 
==Online selection algorithm==
[[online algorithm|Online]] selection may refer narrowly to computing the ''k''th smallest element of a stream, in which case partial sorting algorithms — with ''k'' + O(1) space for the ''k'' smallest elements so far — can be used, but partition-based algorithms cannot be.
 
Alternatively, selection itself may be required to be [[online algorithm|online]], that is, an element can only be selected from a sequential input at the instance of observation and each selection, respectively refusal, is irrevocable. The problem is to select, under these constraints, a specific element of the input sequence (as for example the largest or the smallest value) with largest probability. This problem can be tackled by the [[Odds algorithm]], which yields the optimal under an independence condition; it is also optimal itself as an algorithm with the number of computations being linear in the length of input.
 
The simplest example is the [[secretary problem]] of choosing the maximum with high probability, in which case optimal strategy (on random data) is to track the running maximum of the first ''n''/''e'' elements and reject them, and then select the first element that is higher than this maximum.
 
== Related problems ==
One may generalize the selection problem to apply to ranges within a list, yielding the problem of [[Range Queries|range queries]]. The question of [[Range Queries#Median|range median queries]] (computing the medians of multiple ranges) has been analyzed.
 
== Language support ==