Selection algorithm

{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the ''k''th smallest number in a [[List (abstract data type)|list]] or [[Array data structure|array]]; such a number is called the ''k''th ''[[order statistic]]''. This includes the cases of finding the [[minimum]], [[maximum]], and [[median]] elements. There are O(''n'') (worst-case linear time) selection algorithms, and sublinear performance is possible for structured data; in the extreme, O(1) for an array of sorted data. Selection is a subproblem of more complex problems like the [[nearest neighbor problem]] and [[shortest path]] problems. Many selection algorithms are derived by generalizing a [[sorting algorithm]], and conversely some sorting algorithms can be derived as repeated application of selection.
 
The simplest case of a selection algorithm is finding the minimum (or maximum) element by iterating through the list and keeping track of the running minimum (resp. maximum) – the minimum so far – which can be seen as related to [[selection sort]]. The best-known selection algorithm is [[quickselect]], which is related to [[quicksort]]; like quicksort, it has (asymptotically) optimal average performance but poor worst-case performance, though it can be modified to give optimal worst-case performance as well.
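
As a minimal sketch, the single-pass minimum scan described above might look like this in Python (the function name is illustrative):

```python
def find_min(items):
    """Return the minimum by a single left-to-right scan, using O(n) comparisons."""
    it = iter(items)
    best = next(it)          # raises StopIteration on empty input
    for x in it:
        if x < best:         # update the running minimum ("minimum so far")
            best = x
    return best
```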
 
== Selection by sorting ==
Selection can be [[Reduction (complexity)|reduced]] to [[sorting algorithm|sorting]] by sorting the list or array and then selecting the desired element. This method is inefficient for selecting a single element, but is efficient when many selections need to be made from an array, in which case only one initial, expensive sort is needed, followed by many cheap selection operations – O(1) for an array, though selection is O(''n'') in a list, even if sorted, due to the lack of [[random access]]. In general, sorting requires O(''n'' log ''n'') time, where ''n'' is the length of the list (although a lower bound is possible with non-comparative sorting algorithms like [[radix sort]] and [[counting sort]]).
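
The sort-once, select-many pattern can be sketched in Python; `make_selector` is a hypothetical helper name, not a standard API:

```python
def make_selector(data):
    """Sort once (O(n log n)), then answer kth-smallest queries in O(1) each."""
    ordered = sorted(data)           # one initial, expensive sort
    def kth_smallest(k):             # k is 1-based, as in the text
        return ordered[k - 1]        # O(1) random access into the sorted array
    return kth_smallest
```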
 
Rather than sorting the whole list or array, one can instead use [[partial sorting]] to select the ''k'' smallest or ''k'' largest elements. The ''k''th smallest (resp., ''k''th largest element) is then the largest (resp., smallest element) of the partially sorted list – this then takes O(1) to access in an array and O(''k'') to access in a list. This is more efficient than full sorting, but less efficient than simply selecting, and takes O(''n'' + ''k'' log ''k'') time, due to the sorting of the ''k'' elements. Partial sorting algorithms can often be derived from (total) sorting algorithms. As with total sorting, partial sorting means that further selections (below the ''k''th element) can be done in O(1) time for an array and O(''k'') time for a list. Further, if the partial sorting also partitions the original data into "sorted" and "unsorted", as with an in-place sort, the partial sort can be extended to a larger partial sort by only sorting the incremental portion, and if this is done, further selections above the ''k''th element can also be done relatively cheaply.
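
A sketch of selection via partial sorting, using the standard library's `heapq.nsmallest` (which runs in roughly O(''n'' log ''k'') time rather than the O(''n'' + ''k'' log ''k'') bound quoted above):

```python
import heapq

def kth_smallest(data, k):
    """kth smallest via partial sorting: only the k smallest elements are sorted."""
    smallest_k = heapq.nsmallest(k, data)   # the k smallest, in ascending order
    return smallest_k[-1]                   # kth smallest = largest of the k smallest
```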
 
If partial sorting is relaxed so that the ''k'' smallest elements are returned, but not in order, the factor of O(''k'' log ''k'') can be eliminated. An additional maximum selection (taking O(''k'') time) is required, but since <math>k \leq n</math>, this still yields asymptotic complexity of O(''n''). Partition-based selection algorithms in fact yield both the ''k''th smallest element itself and the ''k'' smallest elements (with other elements not in order). This can be done in O(''n'') time – average complexity of [[quickselect]], and worst-case complexity of refined partition-based selection algorithms.
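
One way to realize a partition-based selection is [[quickselect]] with Lomuto partitioning; this Python sketch (names illustrative) returns the ''k''th smallest element and, as a side effect, leaves the ''k'' smallest elements at the front of the list, not in order:

```python
import random

def quickselect(a, k):
    """Return the kth smallest (1-based) element of list a in expected O(n) time.
    On return, a is partitioned so that a[:k] holds the k smallest elements
    (in no particular order) and a[k:] holds the rest."""
    lo, hi = 0, len(a) - 1
    target = k - 1
    while True:
        if lo == hi:
            return a[lo]
        # random pivot avoids the O(n^2) worst case on already-sorted input
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):              # Lomuto partition step
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]    # pivot into its final position
        if target == store:
            return a[store]
        elif target < store:
            hi = store - 1
        else:
            lo = store + 1
```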

Conversely, given a selection algorithm, one can easily get an unordered partial sort (the ''k'' smallest elements, not in order) in O(''n'') time by iterating through the list and recording all elements less than the ''k''th element. If this results in fewer than ''k''&nbsp;&minus;&nbsp;1 elements, any remaining elements equal the ''k''th element. Care must be taken, due to the possibility of equal elements: one must not simply take all elements less than ''or equal to'' the ''k''th element, as there may be more than ''k'' such elements when the ''k''th element is duplicated.
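
Assuming the ''k''th smallest value has already been found by some selection algorithm, the tie-aware collection step described above might be sketched as (function name illustrative):

```python
def k_smallest_unordered(data, kth_value, k):
    """Given the kth smallest value, collect the k smallest elements in O(n).
    Take every element strictly below kth_value, then pad with copies of
    kth_value itself -- taking all elements <= kth_value could yield more
    than k elements when the kth value is duplicated."""
    below = [x for x in data if x < kth_value]
    return below + [kth_value] * (k - len(below))
```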
 
=== Partial selection sort ===
A simple example of selection by partial sorting is to use the partial [[selection sort]].
 
The obvious linear time algorithm to find the minimum (resp. maximum) – iterating over the list and keeping track of the minimum (resp. maximum) element so far – can be seen as a partial selection sort that selects the single smallest element. Many other partial sorts also reduce to this algorithm for the case ''k''&nbsp;=&nbsp;1, such as a partial heap sort.
 
More generally, a partial selection sort yields a simple selection algorithm which takes O(''kn'') time. This is asymptotically inefficient, but can be sufficiently efficient if ''k'' is small, and is easy to implement. Concretely, we simply find the minimum value and move it to the beginning, repeating on the remaining list until we have accumulated ''k'' elements. Here is the minimum-based algorithm:
 
 '''function''' select(list[1..n], k)
     '''for''' i '''from''' 1 '''to''' k
         minIndex = i
         minValue = list[i]
         '''for''' j '''from''' i+1 '''to''' n
             '''if''' list[j] < minValue
                 minIndex = j
                 minValue = list[j]
         swap list[i] and list[minIndex]
     '''return''' list[k]
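
For reference, a minimal runnable sketch of the same partial selection sort in Python:

```python
def select_kth(lst, k):
    """Partial selection sort: k passes over the list, O(k*n) time overall.
    After the call, lst[:k] holds the k smallest elements in sorted order."""
    n = len(lst)
    for i in range(k):                   # only the first k positions get sorted
        min_index = i
        for j in range(i + 1, n):        # scan the unsorted remainder for its minimum
            if lst[j] < lst[min_index]:
                min_index = j
        lst[i], lst[min_index] = lst[min_index], lst[i]
    return lst[k - 1]                    # kth smallest, 1-based
```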
 
Other advantages of this method are:
* After locating the ''j''th smallest element, it requires only O(''j'' + (''k''-''j'')<sup>2</sup>) time to find the ''k''th smallest element, or only O(1) for ''k'' ≤ ''j''.
* It can be done with [[linked list]] data structures, whereas the one based on partition requires [[random access]].
 
== Partition-based general selection algorithm ==
 
If we choose ''h'' of size roughly sqrt(''n''), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(''n'')) time. Unfortunately, this strategy is sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the ''k''th largest hash value is not useful). Additionally, like hash tables, this structure requires table resizings to maintain efficiency as elements are added and ''n'' becomes much larger than ''h''<sup>2</sup>. A useful special case is finding an order statistic or extremum in a finite range of data: using the above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like [[frequency tables]] used to classify the data in [[descriptive statistics]].
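
A minimal sketch of the bucket scheme, assuming the values are known to lie in a range [lo, hi] and the table is built for a single query (the function name is illustrative; the expected O(sqrt(''n'')) figure above assumes the table is maintained across many selections, whereas building it here is an O(''n'') pass):

```python
import math

def bucket_select(data, k, lo, hi):
    """Bucket-based selection for values known to lie in [lo, hi].
    With about sqrt(n) buckets and roughly uniform data, only one bucket
    (expected size about sqrt(n)) needs sorting."""
    n = len(data)
    h = max(1, math.isqrt(n))            # about sqrt(n) buckets
    width = (hi - lo) / h or 1           # bucket interval (avoid zero width)
    buckets = [[] for _ in range(h)]
    for x in data:
        idx = min(h - 1, int((x - lo) / width))
        buckets[idx].append(x)
    remaining = k
    for b in buckets:                    # locate the bucket holding the kth value
        if remaining <= len(b):
            return sorted(b)[remaining - 1]   # sort only that one bucket
        remaining -= len(b)
    raise ValueError("k out of range")
```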
 
== Selecting k smallest or largest elements ==
{{main|Partial sorting}}
Another fundamental selection problem is that of selecting the ''k'' smallest or ''k'' largest elements, which is particularly useful where we want to present just the "top ''k''" of an unsorted list, such as the top 100 corporations by gross sales. This is also commonly called [[partial sorting]].
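
For example, a "top ''k''" query can be answered with Python's `heapq.nlargest` (the corporation names below are made-up sample data):

```python
import heapq

sales = {"Acme": 120, "Globex": 450, "Initech": 90, "Umbrella": 300}
# top 2 corporations by gross sales, without fully sorting the data
top2 = heapq.nlargest(2, sales, key=sales.get)
```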
 
== Lower bounds ==