Selection algorithm: Difference between revisions

The obvious linear time algorithm to find the minimum (resp. maximum) – iterating over the list and keeping track of the minimum (resp. maximum) element so far – can be seen as a partial selection sort that selects the single smallest element. However, many other partial sorts also reduce to this algorithm for the case ''k'' = 1, such as a partial heap sort.
 
More generally, a partial selection sort yields a simple selection algorithm which takes O(''kn'') time. This is asymptotically inefficient, but can be sufficiently efficient if ''k'' is small, and is easy to implement. Concretely, we simply find the minimum value and move it to the beginning, repeating on the remaining list until we have accumulated ''k'' elements, and then return the ''k''th element. Here is the partial selection sort-based algorithm:
 
 '''function''' select(list[1..n], k)
     '''for''' i '''from''' 1 '''to''' k
         minIndex := i
         minValue := list[i]
         '''for''' j '''from''' i+1 '''to''' n
             '''if''' list[j] < minValue
                 minIndex := j
                 minValue := list[j]
         swap list[i] and list[minIndex]
     '''return''' list[k]
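The same O(''kn'') procedure can be sketched in Python; the function name <code>select_kn</code> and the copy of the input are choices made here for illustration, not part of the pseudocode above:

```python
def select_kn(lst, k):
    """Return the k-th smallest element (1-based) via partial selection sort.

    O(k*n) time: performs k passes of selection sort, then reads off the
    k-th element of the sorted prefix. Works on a copy of the input.
    """
    a = list(lst)
    n = len(a)
    for i in range(k):
        # find the minimum of the unsorted suffix a[i:]
        min_index = i
        for j in range(i + 1, n):
            if a[j] < a[min_index]:
                min_index = j
        # move it to the front of the unsorted region
        a[i], a[min_index] = a[min_index], a[i]
    return a[k - 1]
```

For small ''k'' this is competitive despite the poor asymptotics, since each pass is a simple linear scan with no recursion or extra memory.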
 
== Partition-based general selection algorithms ==
{{see|Quickselect}}
 
A general selection algorithm that is efficient in practice, but has poor worst-case performance, was conceived by the inventor of [[quicksort]], [[C.A.R. Hoare]], and is known as '''Hoare's selection algorithm''' or '''quickselect'''.
 
In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices <code>left</code> to <code>right</code>) into two parts, those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element <code>list[pivotIndex]</code>:
 
 '''function''' partition(list, left, right, pivotIndex)
     pivotValue := list[pivotIndex]
     swap list[pivotIndex] and list[right]  ''// Move pivot to end''
     storeIndex := left
     '''for''' i '''from''' left '''to''' right-1
         '''if''' list[i] < pivotValue
             swap list[storeIndex] and list[i]
             increment storeIndex
     swap list[right] and list[storeIndex]  ''// Move pivot to its final place''
     '''return''' storeIndex
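The pseudocode above is the Lomuto partition scheme; a direct Python transcription (with an illustrative usage line) looks like this:

```python
def partition(lst, left, right, pivot_index):
    """Partition lst[left..right] (inclusive) around lst[pivot_index].

    Elements smaller than the pivot end up before it, the rest after it;
    the pivot's final index is returned. Runs in linear time, in place.
    """
    pivot_value = lst[pivot_index]
    lst[pivot_index], lst[right] = lst[right], lst[pivot_index]  # move pivot to end
    store_index = left
    for i in range(left, right):
        if lst[i] < pivot_value:
            lst[store_index], lst[i] = lst[i], lst[store_index]
            store_index += 1
    lst[right], lst[store_index] = lst[store_index], lst[right]  # pivot to final place
    return store_index

data = [9, 2, 7, 4, 6]
p = partition(data, 0, 4, 2)  # partition around the value 7
```

After the call, every element before index <code>p</code> is smaller than 7 and every element after it is at least 7, with 7 itself at index <code>p</code>.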
 
In quicksort, we recursively sort both branches, leading to best-case [[Big-O notation|Ω]](''n'' log ''n'') time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all elements preceding it less than the pivot and all elements following it greater than or equal to it. Thus a single recursive call locates the desired element in the correct partition:
 
 ''// Returns the k-th smallest element of list within left..right inclusive.''
 '''function''' select(list, left, right, k)
     '''if''' left = right  ''// If the list contains only one element''
         '''return''' list[left]  ''// Return that element''
     pivotIndex := ...  ''// select a pivotIndex between left and right''
     pivotNewIndex := partition(list, left, right, pivotIndex)
     pivotDist := pivotNewIndex - left + 1
     ''// The pivot is in its final sorted position,''
     ''// so pivotDist reflects its 1-based position if list were sorted''
     '''if''' pivotDist = k
         '''return''' list[pivotNewIndex]
     '''else if''' k < pivotDist
         '''return''' select(list, left, pivotNewIndex - 1, k)
     '''else'''
         '''return''' select(list, pivotNewIndex + 1, right, k - pivotDist)
 
Note the resemblance to quicksort: just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log ''n'') of its O(''n'') partitions. This simple procedure has expected linear performance, and, like quicksort, has quite good performance in practice. It is also an [[in-place algorithm]], requiring only constant memory overhead, since the [[tail recursion]] can be eliminated with a loop like this:
 
 '''function''' select(list, left, right, k)
     '''loop'''
         pivotIndex := ...  ''// select pivotIndex between left and right''
         pivotNewIndex := partition(list, left, right, pivotIndex)
         pivotDist := pivotNewIndex - left + 1
         '''if''' pivotDist = k
             '''return''' list[pivotNewIndex]
         '''else if''' k < pivotDist
             right := pivotNewIndex - 1
         '''else'''
             k := k - pivotDist
             left := pivotNewIndex + 1
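The iterative version translates directly to Python. A random pivot is used here, which gives the expected linear running time; this is one possible pivot strategy, standing in for the <code>...</code> in the pseudocode:

```python
import random

def partition(lst, left, right, pivot_index):
    """Lomuto partition of lst[left..right] around lst[pivot_index]."""
    pivot_value = lst[pivot_index]
    lst[pivot_index], lst[right] = lst[right], lst[pivot_index]
    store_index = left
    for i in range(left, right):
        if lst[i] < pivot_value:
            lst[store_index], lst[i] = lst[i], lst[store_index]
            store_index += 1
    lst[right], lst[store_index] = lst[store_index], lst[right]
    return store_index

def quickselect(lst, k):
    """Return the k-th smallest element (1-based); modifies lst in place."""
    left, right = 0, len(lst) - 1
    while True:
        if left == right:
            return lst[left]
        pivot_index = random.randint(left, right)  # random pivot: expected O(n)
        pivot_new_index = partition(lst, left, right, pivot_index)
        pivot_dist = pivot_new_index - left + 1
        if pivot_dist == k:
            return lst[pivot_new_index]
        elif k < pivot_dist:
            right = pivot_new_index - 1
        else:
            k -= pivot_dist
            left = pivot_new_index + 1
```

Because the tail recursion is replaced by the loop, only constant auxiliary memory is used beyond the list itself.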
 
Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously, and so can require as much as O(''n''<sup>2</sup>) time. David Musser describes a "median-of-3 killer" sequence that can force the well-known median-of-three pivot selection algorithm to fail with worst-case behavior (see ''[[#Introselect|Introselect]]'' section below).
 
== Linear general selection algorithm - Median of Medians algorithm ==
{{Infobox Algorithm
|name=Median of Medians
|class='''Selection algorithm'''
|image=
|data=[[Array data structure|Array]]
|best-time= <math>O(n)</math>
|time=<math>O(n)</math>
|space=<math>O(1)</math> auxiliary
|optimal=Yes
}}
 
A worst-case linear algorithm for the general case of selecting the ''k''th smallest element was published by [[Manuel Blum|Blum]], [[Robert Floyd|Floyd]], [[Vaughan Ronald Pratt|Pratt]], [[Ron Rivest|Rivest]] and [[Robert Tarjan|Tarjan]] in their 1973 paper "Time bounds for selection", sometimes called '''BFPRT''' after the last names of the authors. It is based on the quickselect algorithm and is also known as the '''median-of-medians algorithm'''.
 
Although quickselect is linear-time on average, it can require quadratic time with poor pivot choices (consider the case of pivoting around the largest element at each step). The solution to make it O(n) in the ''worst'' case is to consistently find "good" pivots. A good pivot is one for which we can establish that a constant proportion of elements fall both below and above it.
 
The ''Select'' algorithm divides the list into groups of five elements. (Leftover elements are ignored for now.) Then, for each group of five, the median is calculated (an operation that can potentially be made very fast if the five values can be loaded into [[Processor register|registers]] and compared). (If sorting in-place, then these medians are moved into one contiguous block in the list.) ''Select'' is then called recursively on this sublist of ''n''/5 elements to find their true median. Finally, this "median of medians" is chosen to be the pivot.
 
 ''// returns the index of the median of medians''
 ''// requires a variant of select, "selectIdx",''
 ''// which returns the index of the selected item rather than the value''
 '''function''' medianOfMedians(list, left, right)
     numMedians := (right - left) / 5
     '''for''' i '''from''' 0 '''to''' numMedians
         ''// get the median of the five-element subgroup''
         subLeft := left + i*5
         subRight := subLeft + 4
         '''if''' subRight > right
             subRight := right
         ''// alternatively, use a faster method that works on lists of size 5''
         medianIdx := selectIdx(list, subLeft, subRight, (subRight - subLeft) / 2)
         ''// move the median to a contiguous block at the beginning of the list''
         swap list[left+i] and list[medianIdx]
     ''// select the median from the contiguous block''
     '''return''' selectIdx(list, left, left + numMedians, numMedians / 2)
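The whole scheme can be sketched in Python. This is a value-based sketch that copies sublists instead of partitioning in place, trading the O(1) auxiliary space of the pseudocode above for clarity; the function name <code>mom_select</code> is illustrative:

```python
def mom_select(seq, k):
    """Return the k-th smallest element (1-based) of seq.

    Median-of-medians pivot selection gives worst-case O(n) time.
    """
    a = list(seq)
    while True:
        if len(a) <= 5:
            return sorted(a)[k - 1]
        # median of each group of five (constant work per group);
        # the final, possibly shorter group contributes its own median
        medians = []
        for i in range(0, len(a), 5):
            group = sorted(a[i:i + 5])
            medians.append(group[len(group) // 2])
        # recursively find the true median of the medians to use as pivot
        pivot = mom_select(medians, (len(medians) + 1) // 2)
        lows = [x for x in a if x < pivot]
        highs = [x for x in a if x > pivot]
        num_equal = len(a) - len(lows) - len(highs)
        if k <= len(lows):
            a = lows                    # answer lies among the smaller elements
        elif k <= len(lows) + num_equal:
            return pivot                # the pivot itself is the answer
        else:
            k -= len(lows) + num_equal  # answer lies among the larger elements
            a = highs
```

Splitting into strictly-smaller and strictly-larger sublists also handles duplicate elements correctly, since every element equal to the pivot is accounted for by <code>num_equal</code>.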
 
=== Properties of pivot ===
The chosen pivot is greater than half of the elements in the list of medians and less than the other half, which is around ''n''/10 elements (½ × ''n''/5) on each side. Each of these medians is the median of a group of five, so it is greater than two and less than two of the other elements in its group. Hence the pivot is greater than at least 3(''n''/10) elements (half the medians, together with the two smaller elements in each of their groups) and, symmetrically, less than at least another 3(''n''/10) elements. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:
 
{|class="wikitable" border=1
|+ One iteration on the list {0,1,2,3,...99}
!
| bgcolor=gray|12||||bgcolor=gray|15||||bgcolor=gray|11||||bgcolor=gray|2||||bgcolor=gray|9||||bgcolor=gray|5||||bgcolor=gray|0||||bgcolor=gray|7||||bgcolor=gray|3||||bgcolor=gray|21||||bgcolor=gray|44||||bgcolor=gray|40||||bgcolor=gray|1||||bgcolor=gray|18||||bgcolor=gray|20||||bgcolor=gray|32||||bgcolor=gray|19||||bgcolor=gray|35||||bgcolor=gray|37||||bgcolor=gray|39
|-
!
| bgcolor=gray|13||||bgcolor=gray|16||||bgcolor=gray|14||||bgcolor=gray|8||||bgcolor=gray|10||||bgcolor=gray|26||||bgcolor=gray|6||||bgcolor=gray|33||||bgcolor=gray|4||||bgcolor=gray|27||||bgcolor=white|49||||bgcolor=gray|46||||bgcolor=white|52||||bgcolor=gray|25||||bgcolor=white|51||||bgcolor=gray|34||||bgcolor=gray|43||||bgcolor=white|56||||bgcolor=white|72||||bgcolor=white|79
|-
! Medians
| bgcolor=gray|17||||bgcolor=gray|23||||bgcolor=gray|24||||bgcolor=gray|28||||bgcolor=gray|29||||bgcolor=gray|30||||bgcolor=gray|31||||bgcolor=gray|36||||bgcolor=gray|42||||bgcolor=red|47||||bgcolor=white|50||||bgcolor=white|55||||bgcolor=white|58||||bgcolor=white|60||||bgcolor=white|63||||bgcolor=white|65||||bgcolor=white|66||||bgcolor=white|67||||bgcolor=white|81||||bgcolor=white|83
|-
!
| bgcolor=gray|22||||bgcolor=gray|45||||bgcolor=gray|38||||bgcolor=white|53||||bgcolor=white|61||||bgcolor=gray|41||||bgcolor=white|62||||bgcolor=white|82||||bgcolor=white|54||||bgcolor=white|48||||bgcolor=white|59||||bgcolor=white|57||||bgcolor=white|71||||bgcolor=white|78||||bgcolor=white|64||||bgcolor=white|80||||bgcolor=white|70||||bgcolor=white|76||||bgcolor=white|85||||bgcolor=white|87
|-
!
| bgcolor=white|96||||bgcolor=white|95||||bgcolor=white|94||||bgcolor=white|86||||bgcolor=white|89||||bgcolor=white|69||||bgcolor=white|68||||bgcolor=white|97||||bgcolor=white|73||||bgcolor=white|92||||bgcolor=white|74||||bgcolor=white|88||||bgcolor=white|99||||bgcolor=white|84||||bgcolor=white|75||||bgcolor=white|90||||bgcolor=white|77||||bgcolor=white|93||||bgcolor=white|98||||bgcolor=white|91
|-
|}
 
(red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red")
 
5-tuples are shown here sorted by median, for clarity. Sorting the tuples is not necessary because we only need the median for use as pivot element.
 
Note that all elements above/left of the red (30% of the 100 elements) are less, and all elements below/right of the red (another 30% of the 100 elements) are greater.
 
=== Proof of O(n) running time ===
 
The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is only 20% of the size of the original list, while the other recursive call recurses on at most 70% of the list. This gives the recurrence
:<math>T(n) \leq T(n \cdot 2/10) + T(n \cdot 7/10) + c \cdot n.</math>
 
The O(''n'') term ''cn'' accounts for the partitioning work: each element is visited a constant number of times, in order to form the ''n''/5 groups and to find each group's median in O(1) time.
 
From this, using induction, one can easily show that
:<math>T(n) \leq 10 \cdot c \cdot n \in O(n).</math>
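Concretely, the induction step substitutes the hypothesis <math>T(m) \leq 10cm</math> for all <math>m < n</math> into the recurrence:

:<math>T(n) \leq 10c \cdot \frac{2n}{10} + 10c \cdot \frac{7n}{10} + cn = 2cn + 7cn + cn = 10cn,</math>

so the bound <math>T(n) \leq 10cn</math> is maintained.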
 
 
=== Important notes ===
 
Although this approach achieves optimal asymptotic complexity, it is typically outperformed in practice by the expected-linear algorithm with random pivot choices{{Citation needed|date=December 2009}}.
 
The median-of-medians algorithm can be used to construct a worst-case O(''n''&nbsp;log&nbsp;''n'') [[quicksort]] algorithm, by using it to find the median at every step.
 
== Introselect ==
 
[[David Musser]]'s well-known [[introsort]] achieves practical performance comparable to quicksort while preserving ''O''(''n'' log ''n'') worst-case behavior by creating a hybrid of quicksort and [[heapsort]]. In the same paper, Musser introduced an "introspective selection" algorithm, popularly called '''introselect''', which combines Hoare's algorithm with the worst-case linear algorithm described above to achieve worst-case linear selection with performance similar to Hoare's algorithm.<ref>David R. Musser. Introspective Sorting and Selection Algorithms. ''Software: Practice and Experience'', vol. 27, no. 8, pp.983&ndash;993. 1997. Section: Introspective Selection Algorithms.</ref> It works by optimistically starting out with Hoare's algorithm and only switching to the worst-case linear algorithm if it recurses too many times without making sufficient progress. Simply limiting the recursion to constant depth is not good enough, since this would make the algorithm switch on all sufficiently large lists. Musser discusses a couple of simple approaches:
* Keep track of the list of sizes of the subpartitions processed so far. If at any point ''k'' recursive calls have been made without halving the list size, for some small positive constant ''k'', switch to the worst-case linear algorithm.
* Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant ''k'', switch to the worst-case linear algorithm. This sum is easy to track in a single scalar variable.
 
Both approaches limit the recursion depth to ''k'' ⌈log ''n''⌉ = ''O''(log ''n'') and the total running time to ''O''(''n''). The paper suggested that more research on introselect was forthcoming, but the author retired in 2007 without having published any such further research.
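The second heuristic can be sketched in Python as follows. For brevity, <code>fallback_select</code> here simply sorts the remaining range, standing in for the worst-case linear median-of-medians algorithm that a real introselect would switch to; the name and the <code>depth_factor</code> parameter are choices made for this sketch:

```python
import random

def introselect(lst, k, depth_factor=2.0):
    """Return the k-th smallest element (1-based) of lst.

    Quickselect that switches algorithms once the summed partition sizes
    exceed depth_factor * n (Musser's second heuristic).
    """
    def fallback_select(a, left, right, k0):
        # stand-in for the worst-case linear algorithm
        return sorted(a[left:right + 1])[k0 - 1]

    def partition(a, left, right, pivot_index):
        pivot_value = a[pivot_index]
        a[pivot_index], a[right] = a[right], a[pivot_index]
        store = left
        for i in range(left, right):
            if a[i] < pivot_value:
                a[store], a[i] = a[i], a[store]
                store += 1
        a[right], a[store] = a[store], a[right]
        return store

    a = list(lst)
    left, right = 0, len(a) - 1
    budget = depth_factor * len(a)  # total partition size we are willing to pay
    while left < right:
        budget -= right - left + 1
        if budget < 0:              # insufficient progress: switch algorithms
            return fallback_select(a, left, right, k)
        p = partition(a, left, right, random.randint(left, right))
        dist = p - left + 1
        if dist == k:
            return a[p]
        elif k < dist:
            right = p - 1
        else:
            k -= dist
            left = p + 1
    return a[left]
```

Tracking the budget in a single scalar variable, as the second heuristic suggests, keeps the bookkeeping cost negligible compared to the partitioning work.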
 
== Selection as incremental sorting ==