Selection algorithm: Difference between revisions

{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the ''k''th smallest number in a list (such a number is called the ''k''th ''[[order statistic]]''). This includes the cases of finding the [[minimum]], [[maximum]], and [[median]] elements. There are worst-case linear time, O(''n''), selection algorithms. Selection is a subproblem of more complex problems like the [[nearest neighbor problem]] and [[shortest path]] problems.
 
== Selection by sorting ==
Selection can be [[Reduction (complexity)|reduced]] to [[sorting algorithm|sorting]] by sorting the list and then extracting the desired element. This method is efficient when many selections need to be made from a list, in which case only one initial, expensive sort is needed, followed by many cheap extraction operations. In general, this method requires O(''n'' log ''n'') time, where ''n'' is the length of the list (although a lower bound is possible with [[radix sort]]).
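For illustration, a minimal Python sketch of this approach (the function name and the 0-based ''k'' are our own choices):

```python
def select_by_sorting(items, k):
    """Return the k-th smallest element (0-based) by sorting a copy.

    Costs O(n log n) for the sort, but if the sorted copy is kept,
    every further selection from the same list is a constant-time
    index operation.
    """
    return sorted(items)[k]

data = [9, 1, 7, 3, 5]
ordered = sorted(data)      # one expensive sort...
assert ordered[0] == 1      # ...then cheap extractions: the minimum
assert ordered[2] == 5      # the median
assert ordered[4] == 9      # the maximum
```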
 
==Linear minimum/maximum algorithms==
Other advantages of this method are:
* After locating the ''j''th smallest element, it requires only O(''j'' + (''k''-''j'')<sup>2</sup>) time to find the ''k''th smallest element, or only O(1) for ''k'' ≤ ''j''.
* It can be done with [[linked list]] data structures, whereas the one based on partition requires [[random access]].
 
== Partition-based general selection algorithm ==
A general selection algorithm that is efficient in practice, but has poor worst-case performance, was conceived by the inventor of [[quicksort]], [[C.A.R. Hoare]], and is known as '''Hoare's selection algorithm''' or '''quickselect'''.
 
In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices <code>left</code> to <code>right</code>) into two parts, those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element <code>list[pivotIndex]</code>:
 
'''function''' partition(list, left, right, pivotIndex)
''// Returns the k-th smallest element of list within left..right inclusive.''
'''function''' select(list, left, right, k)
    '''if''' left = right ''// If the list contains only one element''
        '''return''' list[left] ''// Return that element''
    pivotIndex := ... ''// select a pivotIndex between left and right''
    '''else'''
        k := k - pivotDist
        left := pivotNewIndex + 1
 
Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously, and so can require as much as O(''n''<sup>2</sup>) time. David Musser describes a "median-of-3 killer" sequence that can force the well-known median-of-three pivot selection algorithm to fail with worst-case behavior (see ''[[#Introselect|Introselect]]'' section below).
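The whole procedure can be sketched in Python as follows (our rendering, not the article's pseudocode verbatim: ''k'' is 0-based, the pivot is chosen at random to make consistently bad pivots unlikely, and the partition works in place on a copy):

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-based) of items.

    Expected O(n) time with random pivots; O(n^2) in the worst case.
    """
    lst = list(items)
    left, right = 0, len(lst) - 1
    while True:
        if left == right:
            return lst[left]
        # choose a pivot at random, then move it to the end
        pivot_index = random.randint(left, right)
        lst[pivot_index], lst[right] = lst[right], lst[pivot_index]
        pivot = lst[right]
        store = left
        for i in range(left, right):       # partition step
            if lst[i] < pivot:
                lst[store], lst[i] = lst[i], lst[store]
                store += 1
        lst[store], lst[right] = lst[right], lst[store]
        if k == store:                     # pivot landed at position k
            return lst[store]
        elif k < store:                    # k-th smallest is on the left
            right = store - 1
        else:                              # ...or on the right
            left = store + 1
```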
 
== Linear general selection algorithm - Median of Medians algorithm ==
}}
 
A worst-case linear algorithm for the general case of selecting the ''k''th largest element was published by [[Manuel Blum|Blum]], [[Robert Floyd|Floyd]], [[Vaughan Ronald Pratt|Pratt]], [[Ron Rivest|Rivest]] and [[Robert Tarjan|Tarjan]] in their 1973 paper "Time bounds for selection", sometimes called '''BFPRT''' after the last names of the authors. It is based on the quickselect algorithm and is also known as the '''median-of-medians algorithm'''.
 
Although quickselect is linear-time on average, it can require quadratic time with poor pivot choices (consider the case of pivoting around the largest element at each step). The solution to make it O(n) in the ''worst'' case is to consistently find "good" pivots. A good pivot is one for which we can establish that a constant proportion of elements fall both below and above it.
 
The ''Select'' algorithm divides the list into groups of five elements. (Left over elements are ignored for now.) Then, for each group of five, the median is calculated (an operation that can potentially be made very fast if the five values can be loaded into [[Processor register|registers]] and compared). (If sorting in-place, then these medians are moved into one contiguous block in the list.) ''Select'' is then called recursively on this sublist of ''n''/5 elements to find their true median. Finally, the "median of medians" is chosen to be the pivot.
 
''//returns the index of the median of medians.''
'''for''' i '''from''' 0 '''to''' numMedians
    ''//get the median of the five-element subgroup''
    subLeft := left + i*5
    subRight := subLeft + 4
    '''if''' (subRight > right) subRight := right
    ''//alternatively, use a faster method that works on lists of size 5''
    medianIdx := selectIdx(list, subLeft, subRight, (subRight - subLeft) / 2)
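For illustration, a compact Python sketch of this scheme (our rendering: it uses a three-way partition around the pivot, and a leftover group of fewer than five elements simply contributes its own median rather than being ignored):

```python
def mom_select(items, k):
    """Return the k-th smallest element (0-based) in worst-case O(n)
    time using the median-of-medians pivot rule."""
    lst = list(items)
    while True:
        if len(lst) <= 5:
            return sorted(lst)[k]
        # median of each group of five (the last group may be shorter)
        medians = [sorted(group)[len(group) // 2]
                   for group in (lst[i:i + 5] for i in range(0, len(lst), 5))]
        # recursively find the true median of the n/5 medians
        pivot = mom_select(medians, len(medians) // 2)
        # three-way partition around the pivot
        lows = [x for x in lst if x < pivot]
        pivots = [x for x in lst if x == pivot]
        highs = [x for x in lst if x > pivot]
        if k < len(lows):
            lst = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            lst = highs
```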
 
=== Properties of pivot ===
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around <math>n/10</math> elements <math>((1/2)\cdot(n/5))</math> for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than <math>3(n/10)</math> elements outside the block, and greater than another <math>3(n/10)</math> elements inside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:
 
{|class="wikitable" border=1
|+ One iteration on the list {0,1,2,3,...99}
!
| bgcolor=gray|12||||bgcolor=gray|15||||bgcolor=gray|11||||bgcolor=gray|2||||bgcolor=gray|9||||bgcolor=gray|5||||bgcolor=gray|0||||bgcolor=gray|7||||bgcolor=gray|3||||bgcolor=gray|21||||bgcolor=gray|44||||bgcolor=gray|40||||bgcolor=gray|1||||bgcolor=gray|18||||bgcolor=gray|20||||bgcolor=gray|32||||bgcolor=gray|19||||bgcolor=gray|35||||bgcolor=gray|37||||bgcolor=gray|39
|}
 
(red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red")
 
5-tuples are shown here sorted by median, for clarity. Sorting the tuples is not necessary because we only need the median for use as pivot element.
 
Note that all elements above/left of the red (30% of the 100 elements) are less, and all elements below/right of the red (another 30% of the 100 elements) are greater.
 
=== Proof of O(n) running time ===
:<math>T(n) \leq T(n \cdot 2/10) + T(n \cdot 7/10) + O(n).</math>
 
The O(''n'') term is for the partitioning work: each element is visited a constant number of times, in order to form the ''n''/5 groups and take each median in O(1) time.
From this, one can then show that
:<math>T(n) \leq c \cdot n \cdot (1 + (9/10) + (9/10)^2 + \cdots) \in O(n).</math>
 
* Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant ''k'', switch to the worst-case linear algorithm. This sum is easy to track in a single scalar variable.
 
Both approaches limit the recursion depth to ''k'' ⌈log ''n''⌉ = ''O''(log ''n'') and the total running time to ''O''(''n''). The paper suggested that more research on introselect was forthcoming, but the author retired in 2007 without having published any such further research.
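A minimal Python sketch of the depth-limiting idea (our simplification: for brevity the fallback is sorting, which guarantees O(''n'' log ''n'') rather than the linear bound that falling back to median-of-medians would give):

```python
import math
import random

def introselect(items, k):
    """Quickselect with a recursion-depth cap of 2*ceil(log2 n);
    once the cap is hit, switch to a fallback with guaranteed bounds.
    (A real introselect would fall back to the worst-case-linear
    median-of-medians algorithm instead of sorting.)"""
    lst = list(items)
    limit = 2 * max(1, math.ceil(math.log2(len(lst) + 1)))
    for _ in range(limit):
        if len(lst) == 1:
            return lst[0]
        pivot = random.choice(lst)
        lows = [x for x in lst if x < pivot]
        pivots = [x for x in lst if x == pivot]
        if k < len(lows):
            lst = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            lst = [x for x in lst if x > pivot]
    return sorted(lst)[k]  # depth cap exceeded: guaranteed fallback
```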
 
== Selection as incremental sorting ==
One of the advantages of the sort-and-index approach, as mentioned, is its ability to [[amortized analysis|amortize]] the sorting cost over many subsequent selections. However, sometimes the number of selections that will be done is not known in advance, and may be either small or large. In these cases, we can adapt the algorithms given above to simultaneously select an element while [[Partial sorting|partially sorting]] the list, thus accelerating future selections.
 
Both the selection procedure based on minimum-finding and the one based on partitioning can be seen as a form of partial sort. The minimum-based algorithm sorts the list up to the given index, and so clearly speeds up future selections, especially of smaller indexes. The partition-based algorithm does not achieve the same behavior automatically, but can be adapted to remember its previous pivot choices and reuse them wherever possible, avoiding costly partition operations, particularly the top-level one. The list becomes gradually more sorted as more partition operations are done incrementally; no pivots are ever "lost". If desired, this same pivot list could be passed on to quicksort to reuse, again avoiding many costly partition operations.
 
== Using data structures to select in sublinear time ==
Given an unorganized list of data, linear time (Ω(''n'')) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the ''k''th largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
 
The strategy to find an order statistic in [[sublinear time]] is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables.
 
When only the minimum (or maximum) is needed, a good approach is to use a [[Heap_(data_structure)|heap]], which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log ''n'') or better. More generally, a [[self-balancing binary search tree]] can easily be [[Order statistic tree|augmented]] to make it possible to both insert an element and find the ''k''th largest element in O(log ''n'') time. We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log ''n'') ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
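A minimal Python sketch of the counting idea, using a plain (unbalanced) binary search tree for brevity; a production version would use a self-balancing tree and update the counts during rotations as described:

```python
class Node:
    """Search-tree node augmented with its subtree size."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # count of nodes in the subtree rooted here

def insert(root, key):
    """Insert key, bumping the size of every ancestor on the way down."""
    if root is None:
        return Node(key)
    root.size += 1
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def kth_smallest(root, k):
    """0-based k-th smallest: the left-subtree count decides the path."""
    left_size = root.left.size if root.left else 0
    if k < left_size:
        return kth_smallest(root.left, k)
    if k == left_size:
        return root.key
    return kth_smallest(root.right, k - left_size - 1)

root = None
for x in [5, 2, 8, 1, 3]:
    root = insert(root, x)
assert kth_smallest(root, 0) == 1   # minimum
assert kth_smallest(root, 2) == 3   # median of {1, 2, 3, 5, 8}
```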
Another simple strategy is based on some of the same concepts as the [[hash table]]. When we know the range of values beforehand, we can divide that range into ''h'' subintervals and assign these to ''h'' buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the ''k''th element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket.
 
If we choose ''h'' of size roughly sqrt(''n''), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(''n'')) time. Unfortunately, this strategy is also sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the ''k''th largest hash value isn't very useful). Additionally, like hash tables this structure requires table resizings to maintain efficiency as elements are added and ''n'' becomes much larger than ''h''<sup>2</sup>. A useful case of this is finding an order statistic or extremum in a finite range of data. Using the above table with bucket interval 1 and maintaining counts in each bucket is much superior to other methods. Such hash tables are like [[frequency tables]] used to classify the data in [[descriptive statistics]].
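A small Python sketch of the bucket scheme, assuming the value range [lo, hi) is known in advance (the helper names are ours; selection inside the final bucket is done by sorting for brevity, where an expected-linear-time selection could be substituted):

```python
def make_buckets(values, lo, hi, h):
    """Distribute values from the known range [lo, hi) into h buckets."""
    width = (hi - lo) / h
    buckets = [[] for _ in range(h)]
    for v in values:
        buckets[min(h - 1, int((v - lo) / width))].append(v)
    return buckets

def bucket_select(buckets, k):
    """Walk buckets left to right, subtracting counts, then select
    inside the bucket that contains the k-th smallest element (0-based)."""
    for bucket in buckets:
        if k < len(bucket):
            # sorting stands in for an expected-linear-time selection
            return sorted(bucket)[k]
        k -= len(bucket)
    raise IndexError("k out of range")

buckets = make_buckets([13, 2, 77, 40, 58, 91, 25, 6], 0, 100, 10)
assert bucket_select(buckets, 3) == 25  # 4th smallest of the 8 values
```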
 
== Selecting k smallest or largest elements ==
{{main|Partial sorting}}
Another fundamental selection problem is that of selecting the ''k'' smallest or ''k'' largest elements, which is particularly useful where we want to present just the "top ''k''" of an unsorted list, such as the top 100 corporations by gross sales. This is also commonly called [[partial sorting]].
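For example, Python's standard library solves exactly this problem with the heapq module, in O(''n'' log ''k'') time:

```python
import heapq

data = [50, 30, 99, 12, 70, 3, 45]
top3 = heapq.nlargest(3, data)      # "top k", in descending order
bottom3 = heapq.nsmallest(3, data)  # smallest k, in ascending order
assert top3 == [99, 70, 50]
assert bottom3 == [3, 12, 30]
```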
 
== Lower bounds ==
In ''[[The Art of Computer Programming]]'', Donald E. Knuth discussed a number of lower bounds for the number of comparisons required to locate the ''t'' smallest entries of an unorganized list of ''n'' items (using only comparisons). There is a trivial lower bound of ''n'' &minus; 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of ''n'' &minus; 1 comparisons.
 
The story becomes more complex for other indexes. We define <math>W_{t}(n)</math> as the minimum number of comparisons required to find the ''t'' smallest values. Knuth references a paper published by S. S. Kislitsyn, which shows an upper bound on this value:
:<math>W_{t}(n) \leq n - t + \sum_{n+1-t < j \leq n} \lceil{\log_2\, j}\rceil \quad \text{for}\, n \geq t</math>
 
This bound is achievable for ''t''=2 but better, more complex bounds are known for larger ''t''.
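The bound is easy to evaluate numerically (a small Python helper; the function name is ours):

```python
import math

def kislitsyn_bound(t, n):
    """Upper bound on W_t(n): n - t plus the sum of ceil(log2 j)
    over n + 1 - t < j <= n."""
    return n - t + sum(math.ceil(math.log2(j))
                       for j in range(n + 2 - t, n + 1))

# t = 1: the sum is empty, so the bound reduces to the trivial
# n - 1 comparisons needed to find the minimum.
assert kislitsyn_bound(1, 8) == 7
# t = 2, n = 8: 8 - 2 + ceil(log2 8) = 9 comparisons suffice.
assert kislitsyn_bound(2, 8) == 9
```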
 
== Language support ==
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time. The expected-linear-time requirement suggests, but does not mandate, an implementation based on Hoare's algorithm. (Ref section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E), see also [http://www.sgi.com/tech/stl/nth_element.html SGI STL description of nth_element])
 
C++ also provides the [http://www.sgi.com/tech/stl/partial_sort.html partial_sort] algorithm, which solves the problem of selecting the smallest ''k'' elements (sorted), with a time complexity of O(''n'' log ''k''). No algorithm is provided for selecting the greatest ''k'' elements since this should be done by inverting the ordering [[Predicate (computer programming)|predicate]].
 
For [[Perl]], the module [http://search.cpan.org/dist/Sort-Key-Top Sort::Key::Top], available from [[CPAN]], provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures. Furthermore, the [http://search.cpan.org/dist/Statistics-CaseResampling Statistics::CaseResampling] module provides a function to calculate quantiles using quickselect.