Selection algorithm: Difference between revisions

tail bounds on quickselect, after Mike Goodrich pointed out a mistake related to this in the quickselect article
kkzz
Line 37:
 
===Sublinear data structures===
When data is already organized into a [[data structure]], it may be possible to perform selection in an amount of time that is sublinear in the number of values. As a simple case of this, for data already sorted into an array, selecting the {{nowrap|<math>k</math>th}} element may be performed by a single array lookup, in constant {{nowrap|time.{{r|frejoh}}}} For values organized into a two-dimensional array of {{nowrap|size <math>m\times n</math>,}} with sorted rows and columns, selection may be performed in time {{nowrap|<math>O\bigl(m\log(2n/m)\bigr)</math>,}} or faster when <math>k</math> is small relative to the array {{nowrap|dimensions.{{r|frejoh|kkzz}}}} For a collection of <math>m</math> one-dimensional sorted arrays, with <math>k_i</math> items less than the selected item in the {{nowrap|<math>i</math>th}} array, the time is {{nowrap|<math display=inline>O\bigl(m+\sum_{i=1}^m\log(k_i+1)\bigr)</math>.{{r|kkzz}}}}
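
As a rough illustration of selection from a collection of <math>m</math> sorted arrays, the following Python sketch uses a simple <math>m</math>-way merge driven by a binary heap. This is a swapped-in baseline, not the algorithm from the cited sources: it runs in <math>O(m + k\log m)</math> time rather than achieving the bound stated above. The function name <code>select_from_sorted_arrays</code> and the 1-based convention for <code>k</code> are assumptions made for this example.

```python
import heapq


def select_from_sorted_arrays(arrays, k):
    """Return the k-th smallest value (1-based) across several sorted lists.

    Illustrative baseline only: a heap-driven merge that stops after k steps.
    """
    total = sum(len(a) for a in arrays)
    if not 1 <= k <= total:
        raise IndexError("k out of range")

    # One heap entry per nonempty array: (value, array index, position).
    frontier = [(a[0], i, 0) for i, a in enumerate(arrays) if a]
    heapq.heapify(frontier)

    # Discard the k - 1 smallest values, advancing within each array.
    for _ in range(k - 1):
        _, i, pos = heapq.heappop(frontier)
        if pos + 1 < len(arrays[i]):
            heapq.heappush(frontier, (arrays[i][pos + 1], i, pos + 1))

    # The smallest remaining value is the k-th smallest overall.
    return frontier[0][0]
```

For example, <code>select_from_sorted_arrays([[1, 4, 9], [2, 3, 10], [], [5]], 4)</code> returns 4, the fourth smallest of the seven values.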
 
Selection from data in a [[binary heap]] takes {{nowrap|time <math>O(k)</math>.}} This is independent of the size <math>n</math> of the heap, and faster than the <math>O(k\log n)</math> time bound that would be obtained from {{nowrap|[[best-first search]].{{r|kkzz|frederickson}}}} This same method can be applied more generally to data organized as any kind of heap-ordered tree (a tree in which each node stores one value, and in which the parent of each non-root node has a smaller value than that node). This method of performing selection in a heap has been applied to problems of listing multiple solutions to combinatorial optimization problems, such as finding the [[k shortest path routing|{{mvar|k}} shortest paths]] in a weighted graph, by defining a [[State space (computer science)|state space]] of solutions in the form of an [[implicit graph|implicitly defined]] heap-ordered tree, and then applying this selection algorithm to this {{nowrap|tree.{{r|kpaths}}}} In the other direction, linear time selection algorithms have been used as a subroutine in a [[priority queue]] data structure related to the heap, improving the time for extracting its {{nowrap|<math>k</math>th}} item from <math>O(\log n)</math> to {{nowrap|<math>O(\log^* n+\log k)</math>;}} here <math>\log^* n</math> is the {{nowrap|[[iterated logarithm]].{{r|bks}}}}
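
The best-first search baseline mentioned above can be sketched in Python as follows. This is not the <math>O(k)</math> method from the cited sources; it is the simpler approach that explores the heap in sorted order using an auxiliary priority queue of candidate nodes. The function name <code>kth_smallest_in_heap</code>, the array layout (children of index <code>i</code> at <code>2*i + 1</code> and <code>2*i + 2</code>), and the 1-based <code>k</code> are assumptions for this example.

```python
import heapq


def kth_smallest_in_heap(heap, k):
    """Return the k-th smallest value (1-based) of an array-based binary min-heap.

    Best-first search: repeatedly expand the smallest unexplored candidate.
    The input heap is read but never modified.
    """
    if not 1 <= k <= len(heap):
        raise IndexError("k out of range")

    # Candidates are (value, index) pairs; the root is the only one at first.
    candidates = [(heap[0], 0)]

    for _ in range(k - 1):
        _, i = heapq.heappop(candidates)
        # The children of an expanded node become new candidates.
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(candidates, (heap[child], child))

    # After k - 1 expansions the smallest candidate is the k-th smallest value.
    return candidates[0][0]
```

In this sketch the auxiliary queue never holds more than <math>k</math> entries, so its running time depends on <math>k</math> but not on the heap size <math>n</math>; it still carries a logarithmic factor per step that the <math>O(k)</math> method avoids.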
 
For a collection of data values undergoing dynamic insertions and deletions, the [[order statistic tree]] augments a [[self-balancing binary search tree]] structure with a constant amount of additional information per tree node, allowing insertions, deletions, and selection queries that ask for the {{nowrap|<math>k</math>th}} element in the current set to all be performed in <math>O(\log n)</math> time per {{nowrap|operation.{{r|clrs}}}} Going beyond the comparison model of computation, faster times per operation are possible for values that are small integers, on which binary arithmetic operations are {{nowrap|allowed.{{r|pattho}}}} It is not possible for a [[streaming algorithms|streaming algorithm]] with memory sublinear in both <math>n</math> and <math>k</math> to solve selection queries exactly for dynamic data, but the [[count–min sketch]] can be used to solve selection queries approximately, by finding a value whose position in the ordering of the elements (if it were added to them) would be within <math>\varepsilon n</math> steps of <math>k</math>, for a sketch whose size is within logarithmic factors of <math>1/\varepsilon</math>.{{r|cormut}}
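
The size augmentation behind an order statistic tree can be illustrated with a minimal Python sketch. To keep it short, this example uses a plain binary search tree with no rebalancing, so it does not guarantee the <math>O(\log n)</math> bound; a real order statistic tree would maintain the same <code>size</code> field through the rotations of a self-balancing tree. The names <code>Node</code>, <code>insert</code>, and <code>select</code> are hypothetical, and <code>k</code> is taken to be 1-based.

```python
class Node:
    """A binary search tree node augmented with its subtree size."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def size(node):
    return node.size if node is not None else 0


def insert(node, key):
    """Insert key into the subtree and return its root (no rebalancing)."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    # Restore the augmented field on the way back up.
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, k):
    """Return the k-th smallest key (1-based) in the subtree rooted at node."""
    while node is not None:
        left = size(node.left)
        if k == left + 1:
            return node.key
        if k <= left:
            node = node.left
        else:
            # Skip the left subtree and the current node.
            k -= left + 1
            node = node.right
    raise IndexError("k out of range")
```

Each step of <code>select</code> descends one level, so its cost is proportional to the height of the tree, which is what a balancing scheme keeps logarithmic.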
 
== Lower bounds ==
The <math>O(n)</math> running time of the selection algorithms described above is necessary, because a selection algorithm that can handle inputs in an arbitrary order must take that much time to look at all of its inputs; if any one of its input values is not compared, that one value could be the one that should have been selected, and the algorithm can be made to produce an incorrect answer.{{r|kkzz}} Beyond this simple argument, there has been a significant amount of research on the exact number of comparisons needed for selection, in both the randomized and deterministic cases.
 
Selecting the minimum of <math>n</math> values requires <math>n-1</math> comparisons, because the <math>n-1</math> values that are not selected must each have been determined to be non-minimal, by being the largest in some comparison, and no two of these values can be largest in the same comparison. The same argument applies symmetrically to selecting the {{nowrap|maximum.{{r|knuth}}}}
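
The <math>n-1</math> lower bound for the minimum is matched by a straightforward linear scan, as the following Python sketch shows by counting comparisons explicitly. The function name <code>minimum_with_count</code> is made up for this illustration.

```python
def minimum_with_count(values):
    """Return (minimum, number of comparisons used) for a nonempty sequence."""
    if not values:
        raise ValueError("cannot select from an empty sequence")

    comparisons = 0
    best = values[0]
    for v in values[1:]:
        # Each comparison eliminates exactly one value as non-minimal:
        # whichever of v and best is the larger.
        comparisons += 1
        if v < best:
            best = v
    return best, comparisons
```

On an input of five values such as <code>[7, 3, 9, 1, 4]</code>, the scan returns the minimum 1 after exactly four comparisons.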
Line 374:
| volume = 40
| year = 1993}}</ref>
 
<ref name=kkzz>{{cite conference
| last1 = Kaplan | first1 = Haim
| last2 = Kozma | first2 = László
| last3 = Zamir | first3 = Or
| last4 = Zwick | first4 = Uri
| editor1-last = Fineman | editor1-first = Jeremy T.
| editor2-last = Mitzenmacher | editor2-first = Michael
| contribution = Selection from heaps, row-sorted matrices, and <math>X+Y</math> using soft heaps
| doi = 10.4230/OASIcs.SOSA.2019.5
| pages = 5:1–5:21
| publisher = Schloss Dagstuhl – Leibniz-Zentrum für Informatik
| series = OASIcs
| title = 2nd Symposium on Simplicity in Algorithms, SOSA 2019, January 8–9, 2019, San Diego, CA, USA
| volume = 69
| year = 2019}}</ref>
 
<ref name=kletar>{{cite book