{{Short description|Method for finding kth smallest value}}
{{Good article}}
{{Use mdy dates|cs1-dates=ly|date=April 2023}}
{{Use list-defined references|date=April 2023}}
{{for|simulated natural selection in genetic algorithms|Selection (genetic algorithm)}}
In [[computer science]], a '''selection algorithm''' is an [[algorithm]] for finding the <math>k</math>th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the {{nowrap|<math>k</math>th}} [[order statistic]]. Selection includes as special cases the problems of finding the [[minimum]], [[median]], and [[maximum]] element in the collection. Selection algorithms include [[quickselect]] and the [[median of medians]] algorithm. When applied to a collection of <math>n</math> values, these algorithms take [[linear time]], <math>O(n)</math> as expressed using [[big O notation]]. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted [[Array (data structure)|array]] takes {{nowrap|time <math>O(1)</math>.}}
==Problem statement==
An algorithm for the selection problem takes as input a collection of values, and a {{nowrap|number <math>k</math>.}} It outputs the {{nowrap|<math>k</math>th}} smallest of these values, or, in some versions of the problem, a collection of the <math>k</math> smallest values. For this to be well-defined, it should be possible to [[Sorting|sort]] the values into an order from smallest to largest; for instance, they may be [[Integer (computer science)|integers]], [[floating-point number]]s, or some other kind of [[Object (computer science)|object]] with a numeric key. However, they are not assumed to have been already sorted. Often, selection algorithms are restricted to a comparison-based [[model of computation]], as in [[comparison sort]] algorithms, where the algorithm has access to a [[Relational operator|comparison operation]] that can determine the relative ordering of any two values, but may not perform any other kind of [[Arithmetic|arithmetic operations]] on these values.{{r|cunmun}}
To simplify the problem, some formulations assume that the values are all distinct from each other, or that some consistent tie-breaking rule has been applied to pairs of equal values. This article follows the convention that all values are distinct and that the minimum value is obtained by {{nowrap|setting <math>k=1</math>.}}
With these conventions, the maximum value, among a collection of <math>n</math> values, is obtained by {{nowrap|setting <math>k=n</math>.}} When <math>n</math> is an [[odd number]], the [[median]] of the collection is obtained by {{nowrap|setting <math>k=(n+1)/2</math>.}} When <math>n</math> is even, there are two choices for the median, obtained by rounding this choice of <math>k</math> down or up, respectively: the ''lower median'' with <math>k=n/2</math> and the ''upper median'' with {{nowrap|<math>k=n/2+1</math>.{{r|clrs}}}}
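As a concrete illustration of these conventions (an explanatory sketch only; the helper name and sample values are invented for this example and are not taken from the cited sources), selection by sorting in Python makes the indexing explicit:

<syntaxhighlight lang="python">
def kth_smallest_by_sorting(values, k):
    """Return the kth smallest value, counting k from 1 as in this article."""
    return sorted(values)[k - 1]   # Python lists are 0-indexed

values = [40, 10, 30, 50, 20, 60]  # n = 6, an even number of values
n = len(values)
print(kth_smallest_by_sorting(values, n))           # maximum: 60
print(kth_smallest_by_sorting(values, n // 2))      # lower median: 30
print(kth_smallest_by_sorting(values, n // 2 + 1))  # upper median: 40
</syntaxhighlight>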
==Algorithms==
===Sorting and heapselect===
As a baseline algorithm, selection of the {{nowrap|<math>k</math>th}} smallest value in a collection of values can be performed by the following two steps:
* [[Sorting|Sort]] the collection.
* If the output of the sorting algorithm is an [[Array (data type)|array]], retrieve its {{nowrap|<math>k</math>th}} element; otherwise, scan the sorted sequence to find the {{nowrap|<math>k</math>th}} element.
The time for this method is dominated by the sorting step, which requires <math>\Theta(n\log n)</math> time using a {{nowrap|[[comparison sort]].{{r|clrs|skiena}}}} Even when [[integer sorting]] algorithms may be used, these are generally slower than the linear time that may be achieved using specialized selection algorithms. Nevertheless, the simplicity of this approach makes it attractive, especially when a highly optimized sorting routine is provided as part of a runtime library, but a selection algorithm is not. For inputs of moderate size, sorting can be faster than non-random selection algorithms, because of the smaller constant factors in its running {{nowrap|time.{{r|erickson}}}} This method also produces a sorted version of the collection, which may be useful for other later computations, and in particular for selection with other choices {{nowrap|of <math>k</math>.{{r|skiena}}}}
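The heap-based method suggested by this section's title can be sketched as follows (an illustrative sketch, not code from the cited sources; the function name is chosen here for clarity). By the standard analysis, arranging the collection into a binary [[heap (data structure)|min-heap]] takes <math>O(n)</math> time, after which removing the minimum <math>k-1</math> times and reading the root gives the answer in <math>O(n+k\log n)</math> total time, which beats full sorting when <math>k</math> is small:

<syntaxhighlight lang="python">
import heapq

def heapselect(values, k):
    """kth smallest value (k counted from 1, with 1 <= k <= len(values))."""
    heap = list(values)      # copy, so the input is left unchanged
    heapq.heapify(heap)      # bottom-up heap construction, O(n)
    for _ in range(k - 1):   # discard the k-1 smallest values, O(k log n)
        heapq.heappop(heap)
    return heap[0]           # the kth smallest is now at the root
</syntaxhighlight>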
== Exact numbers of comparisons ==
[[File:Median of 5.svg|thumb|upright=0.9|Finding the median of five values using six comparisons. Each step shows the comparisons to be performed next as yellow line segments, and a [[Hasse diagram]] of the order relations found so far (with smaller=lower and larger=higher) as blue line segments. The red elements have already been found to be greater than three others and so cannot be the median. The larger of the two elements in the final comparison is the median.]]
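The six-comparison scheme illustrated in the figure can also be expressed in code. The following Python sketch is a mirror-image variant of the figure's procedure (it discards values found to be ''smaller'' than three others and returns the smaller value of its final comparison), assumes the five values are distinct as in this article's conventions, and uses a function name chosen only for this illustration:

<syntaxhighlight lang="python">
def median_of_five(a, b, c, d, e):
    """Median of five distinct values using at most six comparisons."""
    if b < a:                      # comparison 1
        a, b = b, a
    if d < c:                      # comparison 2
        c, d = d, c
    if c < a:                      # comparison 3: make a the smaller pair minimum
        a, b, c, d = c, d, a, b
    # a is now smaller than b, c, and d, so it cannot be the median;
    # the median of the five is the second smallest of b, c, d, e.
    if e < b:                      # comparison 4
        b, e = e, b
    if b < c:                      # comparison 5: b is the smallest of the four
        return min(e, c)           # comparison 6
    return min(b, d)               # comparison 6: c is the smallest of the four
</syntaxhighlight>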
[[Donald Knuth|Knuth]] supplies the following triangle of numbers summarizing pairs of <math>n</math> and <math>k</math> for which the exact number of comparisons needed by an optimal selection algorithm is known. The {{nowrap|<math>n</math>th}} row of the triangle (starting with <math>n=1</math> in the top row) gives the numbers of comparisons for inputs of <math>n</math> values, and the {{nowrap|<math>k</math>th}} number within each row gives the number of comparisons needed to select the {{nowrap|<math>k</math>th}} smallest value from an input of that size. The rows are symmetric because selecting the {{nowrap|<math>k</math>th}} smallest requires exactly the same number of comparisons, in the worst case, as selecting the {{nowrap|<math>k</math>th}} {{nowrap|largest.{{r|knuth}}}}
{{center|0}}
Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is the [[Standard Template Library]] for [[C++]], which provides a templated <code>nth_element</code> method with a guarantee of expected linear time.{{r|skiena}}
[[Python (programming language)|Python]]'s standard library includes the <code>heapq.nsmallest</code> and <code>heapq.nlargest</code> functions, which return a sorted list of the <math>k</math> smallest or largest elements of a collection.{{r|python}}
Since 2017, [[Matlab]] has included <code>maxk()</code> and <code>mink()</code> functions, which return the maximal (minimal) <math>k</math> values in a vector as well as their indices. The Matlab documentation does not specify which algorithm these functions use or what their running {{nowrap|time is.{{r|matlab}}}}
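For illustration (a usage sketch relying only on the documented behavior of these standard-library functions, with sample values invented for the example), the Python facilities mentioned above can be applied to the <math>k</math>th-smallest problem as follows; because <code>heapq.nsmallest</code> returns the <math>k</math> smallest values in sorted order, its last element is the <math>k</math>th smallest:

<syntaxhighlight lang="python">
import heapq

values = [40, 10, 30, 50, 20, 60]
k = 3
print(heapq.nsmallest(k, values))       # [10, 20, 30]
print(heapq.nsmallest(k, values)[-1])   # 30, the kth smallest value
print(heapq.nlargest(k, values))        # [60, 50, 40]
</syntaxhighlight>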
== History ==
[[Quickselect]] was presented without analysis by [[Tony Hoare]] {{nowrap|in 1965,{{r|hoare}}}} and first analyzed in a 1971 technical report by {{nowrap|[[Donald Knuth]].{{r|floriv}}}} The first known linear time deterministic selection algorithm is the [[median of medians]] method, published in 1973 by [[Manuel Blum]], [[Robert W. Floyd]], [[Vaughan Pratt]], [[Ron Rivest]], and [[Robert Tarjan]].{{r|bfprt}} They trace the formulation of the selection problem to the work of Charles L. Dodgson (better known as [[Lewis Carroll]]), who in 1883 pointed out that the usual design of single-elimination sports tournaments does not guarantee that the second-best competitor finishes in second place.{{r|carroll}}
== See also ==
| publisher = Association for Computing Machinery
| title = Proceedings of the 17th Annual ACM Symposium on Theory of Computing, May 6–8, 1985, Providence, Rhode Island, USA
| year = 1985
| isbn = 0-89791-151-2
}}</ref>
<ref name=bfprt>{{cite journal
| title = Space-Efficient Data Structures, Streams, and Algorithms – Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday
| volume = 8066
| year = 2013| isbn = 978-3-642-40272-2 }}</ref>
<ref name=brown>{{cite journal
| volume = 2| s2cid = 13985011
}}</ref>
<ref name=carroll>{{cite book|title=Lawn Tennis Tournaments: The True Method of Assigning Prizes with a Proof of the Fallacy of the Present Method|first=Charles L.|last=Dodgson|author-link=Lewis Carroll|year=1883|___location=London|publisher=Macmillan and Co.}} See also {{cite book|title=The Mathematical World of Charles L. Dodgson (Lewis Carroll)|editor1-first=Robin|editor1-last=Wilson|editor2-first=Amirouche|editor2-last=Moktefi|publisher=Oxford University Press|year=2019|isbn=9780192549013|contribution=Lawn tennis tournaments|page=129|contribution-url=https://books.google.com/books?id=OBGIDwAAQBAJ&pg=PA129}}</ref>
<ref name=chan>{{cite journal
| title = Mathematical Foundations of Computer Science 1993, 18th International Symposium, MFCS'93, Gdansk, Poland, August 30 – September 3, 1993, Proceedings
| volume = 711
| year = 1993
| isbn = 978-3-540-57182-7
| hdl-access = free
}}</ref>
<ref name=clrs>{{Introduction to Algorithms|edition=3|chapter=Chapter 9: Medians and order statistics|pages=213–227}}; "Section 14.1: Dynamic order statistics", pp. 339–345</ref>
| volume = 36
| year = 1989| s2cid = 10947879
| doi-access = free
}}</ref>
| s2cid = 3064709
| title = Expected time bounds for selection
| volume = 18| doi-access = free
}}</ref>
<ref name=frederickson>{{cite journal
| title = An optimal algorithm for selection in a min-heap
| volume = 104
| year = 1993
}}</ref>
<ref name=frejoh>{{cite journal
| volume = 40
| year = 1993| s2cid = 17956460
| doi-access = free
}}</ref>
| title = 2nd Symposium on Simplicity in Algorithms, SOSA 2019, January 8–9, 2019, San Diego, CA, USA
| volume = 69
| year = 2019
}}</ref>
<ref name=kletar>{{cite book
| publisher = IEEE Computer Society
| title = 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18–21, 2014
| year = 2014| isbn = 978-1-4799-6517-5 }}</ref>
<ref name=prt>{{cite journal
| volume = 26
| year = 1983| s2cid = 3211474
| doi-access = free
}}</ref>
<ref name=matlab>{{cite web|url=https://www.mathworks.com/help/matlab/ref/mink.html|title=mink: Find k smallest elements of array|work=Matlab R2023a documentation|publisher=Mathworks|access-date=2023-03-30}}</ref>
<ref name=python>{{cite web|url=https://
}}