Selection algorithm: Difference between revisions

{{Short description|An algorithm for finding the kth smallest number in a list or array}}
{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the <math>k</math>th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the {{nowrap|<math>k</math>th}} [[order statistic]]. Selection includes as special cases the problems of finding the [[minimum]], [[median]], and [[maximum]] element in the collection. Selection algorithms include [[quickselect]] and the [[median of medians]] algorithm. When applied to a collection of <math>n</math> values, these algorithms take [[linear time]], <math>O(n)</math> as expressed using [[big O notation]]. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted [[Array (data structure)|array]] takes {{nowrap|time <math>O(1)</math>.}}
 
==Problem statement==
An algorithm for the selection problem takes as input a collection of values, and a {{nowrap|number <math>k</math>.}} It outputs the {{nowrap|<math>k</math>th}} smallest of these values.
For this to be well-defined, it should be possible to [[Sorting|sort]] the values into an order from smallest to largest; for instance, they may be numbers, or some other kind of object with a numeric key. However, they are not assumed to have been already sorted. Often, selection algorithms are restricted to a comparison-based [[model of computation]], as in [[comparison sort]] algorithms, where the algorithm has access to a comparison operation that can determine the relative ordering of any two values, but may not perform any other kind of arithmetic operations on these values.
 
To simplify the problem, some sources may assume that the values are all distinct from each {{nowrap|other,{{r|clrs}}}} or that some consistent tie-breaking method has been used to assign an ordering to pairs of items with the same value as each other. Another variation in the problem definition concerns the numbering of the ordered values: is the smallest value obtained by {{nowrap|setting <math>k=0</math>,}} as in [[zero-based numbering]] of arrays, or is it obtained by {{nowrap|setting <math>k=1</math>,}} following the usual English-language conventions for the smallest, second-smallest, etc.? This article follows the conventions used by Cormen et al., according to which all values are distinct and the minimum value is obtained from {{nowrap|<math>k=1</math>.{{r|clrs}}}}
 
With these conventions, the maximum value, among a collection of <math>n</math> values, is obtained by {{nowrap|setting <math>k=n</math>.}} When <math>n</math> is an [[odd number]], the [[median]] of the collection is obtained by {{nowrap|setting <math>k=(n+1)/2</math>.}} When <math>n</math> is even, there are two choices for the median, obtained by rounding this choice of <math>k</math> down or up, respectively: the ''lower median'' with <math>k=n/2</math> and the ''upper median'' with {{nowrap|<math>k=n/2+1</math>.{{r|clrs}}}}
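
The rank conventions above can be illustrated with a short Python sketch. The function name <code>median_ranks</code> is hypothetical and used only for illustration; it returns the one-based ranks of the lower and upper medians, which coincide when <math>n</math> is odd.

```python
def median_ranks(n: int) -> tuple[int, int]:
    """Return the 1-based ranks (lower, upper) of the median among n distinct values."""
    if n < 1:
        raise ValueError("n must be positive")
    lower = (n + 1) // 2  # (n+1)/2 rounded down; equals n/2 when n is even
    upper = n // 2 + 1    # (n+1)/2 rounded up; equals n/2 + 1 when n is even
    return lower, upper
```

For example, <code>median_ranks(5)</code> gives the single rank 3 twice, while <code>median_ranks(6)</code> gives ranks 3 and 4.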
 
==Algorithms==
===Sorting and heapselect===
As a baseline algorithm, selection of the {{nowrap|<math>k</math>th}} smallest value in a collection of values can be performed very simply by the following two steps:
* [[Sorting|Sort]] the collection
* If the output of the sorting algorithm is an array, jump to its {{nowrap|<math>k</math>th}} element; otherwise, scan the sorted sequence to find the {{nowrap|<math>k</math>th}} element.
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a {{nowrap|[[comparison sort]].{{r|clrs|skiena}}}} Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not. For inputs of moderate size, sorting can be faster than non-random selection algorithms, because of the smaller constant factors in its running {{nowrap|time.{{r|erickson}}}} This method also produces a sorted version of the collection, which may be useful for other later computations, and in particular for selection with other choices {{nowrap|of <math>k</math>.{{r|skiena}}}}
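
A minimal Python sketch of this baseline follows. The function name <code>select_by_sorting</code> is an assumption made for illustration, and <math>k</math> is one-based, following the convention of this article.

```python
def select_by_sorting(values, k):
    """Return the k-th smallest value (1-based) by sorting a copy of the input."""
    ordered = sorted(values)  # comparison sort of a copy; the input is not modified
    if not 1 <= k <= len(ordered):
        raise IndexError("k must satisfy 1 <= k <= n")
    return ordered[k - 1]     # the sorted output is a list, so jump to position k
```

Because the sorted list could be kept, later selections with other choices of <math>k</math> would need only a lookup.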
 
For a sorting algorithm that generates one item at a time, such as [[selection sort]], the scan can be done in tandem with the sort, and the sort can be terminated once the {{nowrap|<math>k</math>th}} element has been found. One possible design of a consolation bracket in a [[single-elimination tournament]], in which the teams who lost to the eventual winner play another mini-tournament to determine second place, can be seen as an instance of this {{nowrap|method.{{r|bfprt}}}} Applying this optimization to [[heapsort]] produces the [[heapselect]] algorithm, which can select the {{nowrap|<math>k</math>th}} smallest value in {{nowrap|time <math>O(n+k\log n)</math>.}} This is fast when <math>k</math> is small relative {{nowrap|to <math>n</math>,}} but degenerates to <math>O(n\log n)</math> for larger values {{nowrap|of <math>k</math>,}} such as the choice <math>k=n/2</math> used for median finding.
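
The heapselect idea can be sketched in Python with the standard <code>heapq</code> module. This is an illustrative sketch, not a reference implementation; the function name <code>heapselect</code> and the one-based <math>k</math> are choices made here.

```python
import heapq


def heapselect(values, k):
    """Return the k-th smallest value (1-based) by partially running heapsort."""
    heap = list(values)
    if not 1 <= k <= len(heap):
        raise IndexError("k must satisfy 1 <= k <= n")
    heapq.heapify(heap)          # build a min-heap in O(n) time
    for _ in range(k - 1):
        heapq.heappop(heap)      # discard the k-1 smallest values, O(log n) each
    return heap[0]               # the smallest remaining value is the k-th smallest
```

The loop stops as soon as the wanted element reaches the top of the heap, which is where the <math>O(n+k\log n)</math> bound comes from.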
 
===Pivoting===
Many methods for selection are based on choosing a special "pivot" element from the input, and using comparisons with this element to divide the remaining <math>n-1</math> input values into two subsets: the set <math>L</math> of elements less than the pivot, and the set <math>R</math> of elements greater than the pivot. The algorithm can then determine where the {{nowrap|<math>k</math>th}} smallest value is to be found, based on a comparison of <math>k</math> with the sizes of these sets. In particular, {{nowrap|if <math>k\le|L|</math>,}} the {{nowrap|<math>k</math>th}} smallest value is {{nowrap|in <math>L</math>,}} and can be found recursively by applying the same selection algorithm {{nowrap|to <math>L</math>.}} {{nowrap|If <math>k=|L|+1</math>,}} then the {{nowrap|<math>k</math>th}} smallest value is the pivot, and it can be returned immediately. In the remaining case, the {{nowrap|<math>k</math>th}} smallest value is {{nowrap|in <math>R</math>,}} and more specifically it is the element in position <math>k-|L|-1</math> {{nowrap|of <math>R</math>.}} It can be found by applying a selection algorithm recursively, seeking the value in this position {{nowrap|in <math>R</math>.{{r|kletar}}}}
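
The case analysis above might be sketched in Python as follows. The sketch builds new lists for <math>L</math> and <math>R</math> rather than partitioning in place, writes the recursion as a loop, and assumes that all values are distinct, as in this article. The name <code>pivot_select</code> and the parameter <code>choose_pivot</code> are hypothetical; the default of a uniformly random pivot is only one possible choice.

```python
import random


def pivot_select(values, k, choose_pivot=random.choice):
    """Return the k-th smallest (1-based) of distinct values by repeated pivoting."""
    values = list(values)
    if not 1 <= k <= len(values):
        raise IndexError("k must satisfy 1 <= k <= n")
    while True:
        pivot = choose_pivot(values)
        L = [v for v in values if v < pivot]  # elements less than the pivot
        R = [v for v in values if v > pivot]  # elements greater than the pivot
        if k <= len(L):
            values = L                        # the answer lies in L, same rank
        elif k == len(L) + 1:
            return pivot                      # the pivot itself has rank k
        else:
            k -= len(L) + 1                   # rank within R is k - |L| - 1
            values = R
```

Passing a different <code>choose_pivot</code> function, for instance one that always returns the first element, gives the same answers but may change the running time.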
 
As with the related pivoting-based [[quicksort]] algorithm, the partition of the input into <math>L</math> and <math>R</math> may be done by making new collections for these sets, or by a method that partitions a given list or array data type in-place. Details vary depending on how the input collection is {{nowrap|represented.<ref>For instance, Cormen et al. use an in-place array partition, while Kleinberg and Tardos describe the input as a set and use a method that partitions it into two new sets.</ref>}} The time to compare the pivot against all the other values {{nowrap|is <math>O(n)</math>.{{r|kletar}}}} However, pivoting methods differ in how they choose the pivot, which affects how big the subproblems in each recursive call will be. The efficiency of these methods depends greatly on the choice of the pivot. If the pivot is chosen badly, the running time of this method can be as slow {{nowrap|as <math>O(n^2)</math>.{{r|erickson}}}}
*If the pivot were exactly at the median of the input, then each recursive call would have at most half as many values as the previous call, and the total times would add in a [[geometric series]] {{nowrap|to <math>O(n)</math>.}} However, finding the median is itself a selection problem, on the entire original input. Trying to find it by a recursive call to a selection algorithm would lead to an infinite recursion, because the problem size would not decrease in each {{nowrap|call.{{r|kletar}}}}
*[[Quickselect]] chooses the pivot uniformly at random from the input values. It can be described as a variant of [[quicksort]], with the same pivoting strategy, but where quicksort makes two recursive calls to sort the two subcollections <math>L</math> {{nowrap|and <math>R</math>,}} quickselect only makes one of these two calls. Its [[expected time]] {{nowrap|is <math>O(n)</math>.{{r|clrs|kletar}}}}
*The [[Floyd–Rivest algorithm]], a variation of quickselect, chooses a pivot by randomly sampling a subset of <math>r</math> data values, for some sample {{nowrap|size <math>r</math>,}} and then recursively selecting two elements somewhat above and below position <math>rk/n</math> of the sample to use as pivots. With this choice, it is likely that <math>k</math> is sandwiched between the two pivots, so that after pivoting only a small number of data values between the pivots are left for a recursive call. This method can achieve an expected number of comparisons that is {{nowrap|<math>n+\min(k,n-k)+o(n)</math>.{{r|floriv}}}} In their original work, Floyd and Rivest claimed that the <math>o(n)</math> term could be made as small as <math>O(\sqrt n)</math> by a recursive sampling scheme, but the correctness of their analysis has been {{nowrap|questioned.{{r|brown|prt}}}} Instead, more rigorous analysis has shown that a version of their algorithm achieves <math>O(\sqrt{n\log n})</math> for this {{nowrap|term.{{r|knuth}}}}
*The [[median of medians]] method partitions the input into sets of five elements, and then uses some other method (rather than a recursive call) to find the median of each of these sets in constant time per set. It then recursively calls the same selection algorithm to find the median of these <math>n/5</math> medians, using the result as its pivot. It can be shown that, for this choice of pivot, {{nowrap|<math>\max(|L|,|R|)\le 7n/10</math>.}} Thus, a problem on <math>n</math> elements is reduced to two recursive problems on <math>n/5</math> and at most <math>7n/10</math> elements. The total size of these two recursive subproblems is at {{nowrap|most <math>9n/10</math>,}} allowing the total time to be analyzed as a geometric series adding {{nowrap|to <math>O(n)</math>.}} Unlike quickselect, this algorithm is deterministic, not {{nowrap|randomized.{{r|clrs|erickson|bfprt}}}} It was the first linear-time deterministic selection algorithm {{nowrap|known,{{r|bfprt}}}} and is commonly taught in undergraduate algorithms classes as an example of a [[divide and conquer]] algorithm that does not divide into two equal subproblems. However, the high constant factors in its <math>O(n)</math> time bound make it slower than quickselect in {{nowrap|practice,{{r|skiena}}}} and slower even than sorting for inputs of moderate {{nowrap|size.{{r|erickson}}}}
*Hybrid algorithms such as [[introselect]] can be used to achieve the practical performance of quickselect with a fallback to median of medians guaranteeing worst-case <math>O(n)</math> {{nowrap|time.{{r|musser}}}}
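
As an illustration of the median of medians pivot rule described in the list above, the following Python sketch sorts each group of at most five elements to find its median and then recurses on the group medians. It assumes distinct values, copies data into new lists instead of working in place, and is meant to show the structure of the recursion rather than to be efficient. The function name is hypothetical.

```python
def median_of_medians_select(values, k):
    """Return the k-th smallest (1-based) of distinct values, deterministically."""
    values = list(values)
    if not 1 <= k <= len(values):
        raise IndexError("k must satisfy 1 <= k <= n")
    if len(values) <= 5:
        return sorted(values)[k - 1]          # constant-size base case

    # Median of each group of (at most) five elements, constant time per group.
    medians = []
    for i in range(0, len(values), 5):
        group = sorted(values[i:i + 5])
        medians.append(group[(len(group) - 1) // 2])

    # Recursive call on about n/5 medians; their (lower) median is the pivot.
    pivot = median_of_medians_select(medians, (len(medians) + 1) // 2)

    L = [v for v in values if v < pivot]
    R = [v for v in values if v > pivot]
    if k <= len(L):
        return median_of_medians_select(L, k)
    if k == len(L) + 1:
        return pivot
    return median_of_medians_select(R, k - len(L) - 1)
```

The two recursive calls correspond to the subproblems of size <math>n/5</math> and at most <math>7n/10</math>, so the running time satisfies a recurrence of the form <math>T(n)\le T(n/5)+T(7n/10)+O(n)</math>.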
 
===Factories===
The deterministic selection algorithms with the smallest known numbers of comparisons, for values of <math>k</math> that are far from <math>1</math> {{nowrap|or <math>n</math>,}} are based on the concept of ''factories'', introduced in 1976 by [[Arnold Schönhage]], [[Mike Paterson]], and {{nowrap|[[Nick Pippenger]].{{r|spp}}}} These are methods that build [[partial order]]s of certain specified types, on small subsets of input values, by combining smaller partial orders using comparisons on their elements. As a very simple case, for instance, one type of factory can take as input a sequence of single-element partial orders and produce as output two-element totally ordered sets, obtained by comparing the two elements of two input orders. The goal of such an algorithm is to combine together different factories, with the outputs of some factories going to the inputs of others, in order to eventually obtain a partial order in which one element (the {{nowrap|<math>k</math>th}} smallest) is larger than some <math>k-1</math> other elements and smaller than another <math>n-k</math> others. A careful design of these factories leads to an algorithm that, when applied to median-finding, uses at most <math>2.942n</math> comparisons. For other values {{nowrap|of <math>k</math>,}} the number of comparisons is {{nowrap|smaller.{{r|dz99}}}}
 
===Sublinear data structures===
When data is already organized into a [[data structure]], it may be possible to perform selection in an amount of time that is sublinear in the number of values. As a simple case of this, for data already sorted into an array, selecting the {{nowrap|<math>k</math>th}} element may be performed by a single array lookup, in constant time.
 
For data organized as a [[binary heap]] it is possible to perform selection in {{nowrap|time <math>O(k)</math>,}} independent of the size <math>n</math> of the whole tree, and faster than the <math>O(k\log n)</math> time bound that would be obtained from {{nowrap|[[best-first search]].{{r|frederickson}}}} This same method can be applied more generally to data organized as any kind of heap-ordered tree (a tree in which each node stores one value and in which the parent of each non-root node has a smaller value than its child). This method of performing selection in a heap has been applied to problems of listing multiple solutions to combinatorial optimization problems, such as finding the [[k shortest path routing|{{mvar|k}} shortest paths]] in a weighted graph, by defining a [[State space (computer science)|state space]] of solutions in the form of an [[implicit graph|implicitly defined]] heap-ordered tree, and then applying this selection algorithm to this {{nowrap|tree.{{r|kpaths}}}}
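
The following Python sketch shows the simpler best-first search approach mentioned above, not the <math>O(k)</math> method. It assumes the heap is stored as a list in the layout used by Python's <code>heapq</code> module, with the children of index <math>i</math> at indices <math>2i+1</math> and <math>2i+2</math>. The function name is hypothetical.

```python
import heapq


def heap_kth_smallest(heap, k):
    """Return the k-th smallest (1-based) value of a list-based binary min-heap.

    Best-first search from the root; the input heap is not modified.
    """
    if not 1 <= k <= len(heap):
        raise IndexError("k must satisfy 1 <= k <= n")
    frontier = [(heap[0], 0)]                 # candidates as (value, index) pairs
    for _ in range(k - 1):
        _, i = heapq.heappop(frontier)        # next-smallest value overall
        for child in (2 * i + 1, 2 * i + 2):  # its children become candidates
            if child < len(heap):
                heapq.heappush(frontier, (heap[child], child))
    return frontier[0][0]
```

The frontier never holds more than <math>k</math> candidates, so this sketch takes <math>O(k\log k)</math> time, which is within the <math>O(k\log n)</math> bound quoted for best-first search and does not depend on visiting the whole heap.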
 
For a collection of data values undergoing dynamic insertions and deletions, it is possible to augment a [[self-balancing binary search tree]] structure with a constant amount of additional information per tree node, allowing insertions, deletions, and selection queries that ask for the {{nowrap|<math>k</math>th}} element in the current set to all be performed in <math>O(\log n)</math> time per {{nowrap|operation.{{r|clrs}}}}
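
The augmentation can be sketched in Python by storing in each node the number of nodes in its subtree. For brevity the sketch below uses a plain binary search tree with no rebalancing and no deletion, so it does not by itself achieve the <math>O(\log n)</math> bound in the worst case; it only illustrates how the stored sizes guide a selection query. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    size: int = 1  # number of nodes in the subtree rooted at this node


def subtree_size(node: Optional[Node]) -> int:
    return node.size if node is not None else 0


def insert(node: Optional[Node], key: int) -> Node:
    """Insert key (unbalanced, for illustration) and update subtree sizes."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size += 1
    return node


def select(node: Optional[Node], k: int) -> int:
    """Return the k-th smallest key (1-based) by walking down from the root."""
    while node is not None:
        left = subtree_size(node.left)
        if k <= left:
            node = node.left          # answer is among the smaller keys
        elif k == left + 1:
            return node.key           # exactly `left` keys are smaller
        else:
            k -= left + 1             # skip the left subtree and this node
            node = node.right
    raise IndexError("k must satisfy 1 <= k <= n")
```

A selection query follows a single root-to-leaf path, so its time is proportional to the height of the tree; a self-balancing tree would keep that height logarithmic.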
 
== Lower bounds ==
The <math>O(n)</math> running time of the selection algorithms described above is necessary, because a selection algorithm that can handle inputs in an arbitrary order must take that much time to look at all of its inputs; if any one of its input values is not compared, that one value could be the one that should have been selected, and the algorithm can be made to produce an incorrect answer. Beyond this simple argument, there has been a significant amount of research on the exact number of comparisons needed for selection, both in the randomized and deterministic cases.
 
Selecting the minimum of <math>n</math> values requires <math>n-1</math> comparisons, because the <math>n-1</math> values that are not selected must each have been determined to be non-minimal, by being the larger value in some comparison, and no two of these values can be the larger value in the same comparison. The same argument applies symmetrically to selecting the {{nowrap|maximum.{{r|knuth}}}}
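
The matching upper bound is easy to check with a small Python sketch that counts comparisons during a single scan. The function name <code>minimum_with_count</code> is hypothetical.

```python
def minimum_with_count(values):
    """Return (minimum, number of comparisons used) for a non-empty collection."""
    it = iter(values)
    try:
        best = next(it)
    except StopIteration:
        raise ValueError("values must be non-empty") from None
    comparisons = 0
    for v in it:
        comparisons += 1   # one comparison per remaining value
        if v < best:
            best = v       # the old best is ruled out as the larger value
    return best, comparisons
```

Each comparison rules out exactly one value as non-minimal, so a scan over <math>n</math> values uses exactly <math>n-1</math> comparisons.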
 
The next simplest case is selecting the second-smallest. After several incorrect attempts, the first tight lower bound on this case was published in 1964 by Soviet mathematician [[Sergey Kislitsyn]]. It can be shown by observing that selecting the second-smallest also requires distinguishing the smallest value from the rest, and by considering the number <math>p</math> of comparisons involving the smallest value that an algorithm for this problem makes. Each of the <math>p</math> items that were compared to the smallest value is a candidate for second-smallest, and <math>p-1</math> of these values must be found larger than another value in a second comparison in order to rule them out as second-smallest.