Selection algorithm: Difference between revisions

pull refs to end for greater source readability and remove some bad ones; promote CLRS to ref
Line 27:
 
===Sublinear data structures===
When data is already organized into a [[data structure]], it may be possible to perform selection in an amount of time that is sublinear in the number of values. As a simple case of this, for data already sorted into an array, selecting the <math>k</math>th element may be performed by a single array lookup, in constant time.
Given an unorganized list of data, linear time (Ω(''n'')) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the ''k''th largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
 
For data organized as a [[binary heap]], or more generally as a heap-ordered tree (a tree in which each node stores a value and has a bounded number of children, all having values larger than its own), it is possible to select the <math>k</math>th smallest value in time <math>O(k)</math>. This method has been applied to several problems of listing multiple solutions to combinatorial optimization problems, such as finding the [[k shortest path routing|{{mvar|k}} shortest paths]] in a weighted graph, by defining a [[State space (computer science)|state space]] of solutions in the form of an [[implicit graph|implicitly defined]] heap-ordered tree, and then applying this selection algorithm to this tree.
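
The <math>O(k)</math> method is intricate, so the following is only a minimal sketch of a simpler substitute, best-first search over the heap, which runs in <math>O(k\log k)</math> time rather than <math>O(k)</math>. It assumes the heap is stored as a Python list in the array layout used by <code>heapq</code>; the function name <code>kth_smallest_in_heap</code> is hypothetical.

```python
import heapq


def kth_smallest_in_heap(heap, k):
    """Return the k-th smallest value (1-indexed) of an array-based min-heap.

    Best-first search: a node can only be the next smallest value once its
    parent has been extracted, so an auxiliary priority queue of candidate
    nodes is enough. Only O(k) heap nodes are touched, for O(k log k) time.
    This is NOT the O(k) algorithm, just a simpler illustration.
    """
    if not 1 <= k <= len(heap):
        raise IndexError("k out of range")
    frontier = [(heap[0], 0)]  # (value, index in the heap array)
    for _ in range(k - 1):
        _, i = heapq.heappop(frontier)
        # The children of array position i sit at 2i + 1 and 2i + 2.
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(frontier, (heap[child], child))
    return frontier[0][0]
```

The input heap is never modified, which is what allows the same idea to run on an implicitly defined heap-ordered tree whose children are generated on demand.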
The strategy to find an order statistic in [[sublinear time]] is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree-based structures and frequency tables.
 
For a collection of data values undergoing dynamic insertions and deletions, it is possible to augment a [[self-balancing binary search tree]] structure with a constant amount of additional information per tree node, allowing insertions, deletions, and selection queries that ask for the <math>k</math>th element in the current set to all be performed in <math>O(\log n)</math> time per operation.{{r|clrs}}
When only the minimum (or maximum) is needed, a good approach is to use a [[Heap (data structure)|heap]], which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log ''n'') or better. More generally, a [[self-balancing binary search tree]] can easily be augmented to make it possible to both insert an element and find the ''k''th largest element in O(log ''n'') time; this is called an ''[[order statistic tree]].'' We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log ''n'') ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
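
The size-augmentation idea can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not a full order statistic tree: the tree below is an ordinary binary search tree with no rebalancing, so the <math>O(\log n)</math> bounds hold only when the tree happens to stay balanced. Each node stores the size of its subtree (its descendants plus itself), and the names <code>Node</code>, <code>insert</code> and <code>select</code> are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here


def size(node):
    """Subtree size, treating a missing child as an empty subtree."""
    return node.size if node is not None else 0


def insert(node, key):
    """Insert key below node and return the (possibly new) subtree root.

    Assumption: no rebalancing is done. A self-balancing tree would also
    recompute `size` for the nodes involved in each rotation.
    """
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, k):
    """Return the k-th smallest key (1-indexed) in the subtree at node."""
    while node is not None:
        left = size(node.left)
        if k == left + 1:
            return node.key
        if k <= left:
            node = node.left       # answer lies in the left subtree
        else:
            k -= left + 1          # skip the left subtree and this node
            node = node.right
    raise IndexError("k out of range")
```

In this sketch the ''k''th largest of ''n'' keys is obtained as <code>select(root, n - k + 1)</code>.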
 
Another simple strategy is based on some of the same concepts as the [[hash table]]. When we know the range of values beforehand, we can divide that range into ''h'' subintervals and assign these to ''h'' buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the ''k''th element, we maintain a count of the number of elements in each bucket, then scan the buckets from left to right adding up counts until we find the bucket containing the desired element, then use the expected linear-time algorithm to find the correct element in that bucket.
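
A minimal sketch of this bucket scheme is given below, assuming numeric values known to lie in a range from <code>lo</code> to <code>hi</code>. The class name <code>BucketSelect</code> and its methods are hypothetical, Python's list lengths stand in for the per-bucket counts, and for brevity the final step sorts the chosen bucket instead of running an expected linear-time selection algorithm inside it.

```python
class BucketSelect:
    """Selection over values in [lo, hi] using h equal-width buckets."""

    def __init__(self, lo, hi, h):
        self.lo, self.hi, self.h = lo, hi, h
        self.buckets = [[] for _ in range(h)]

    def _index(self, x):
        # Map x to its subinterval; clamp so that x == hi lands in the
        # last bucket instead of one past the end.
        i = int((x - self.lo) * self.h / (self.hi - self.lo))
        return min(max(i, 0), self.h - 1)

    def insert(self, x):
        self.buckets[self._index(x)].append(x)

    def select(self, k):
        """Return the k-th smallest stored value (1-indexed)."""
        if k < 1:
            raise IndexError("k out of range")
        for bucket in self.buckets:
            if k <= len(bucket):
                # Simplification: sort the bucket rather than use an
                # expected linear-time selection algorithm within it.
                return sorted(bucket)[k - 1]
            k -= len(bucket)  # skip this bucket's elements entirely
        raise IndexError("k out of range")
```

Because the mapping from values to bucket indices is monotone, every value in an earlier bucket is no larger than any value in a later one, which is what makes the left-to-right count scan correct.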
 
If we choose ''h'' of size roughly sqrt(''n''), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(''n'')) time. Unfortunately, this strategy is sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the ''k''th largest hash value isn't very useful). Additionally, like hash tables, this structure requires table resizings to maintain efficiency as elements are added and ''n'' becomes much larger than ''h''<sup>2</sup>. A useful special case is finding an order statistic or extremum in data drawn from a finite range of values. Using the above table with a bucket interval of 1 and maintaining a count in each bucket, a selection needs only a scan of the counts, with no search inside a bucket. Such tables are like the [[frequency tables]] used to classify data in [[descriptive statistics]].
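
The bucket-interval-1 case can be sketched as a plain frequency table. The sketch assumes integer data in a known range from <code>lo</code> to <code>hi</code>; the function names <code>build_counts</code> and <code>select_from_counts</code> are hypothetical.

```python
def build_counts(data, lo, hi):
    """Frequency table: counts[i] is the number of occurrences of lo + i."""
    counts = [0] * (hi - lo + 1)
    for x in data:
        counts[x - lo] += 1
    return counts


def select_from_counts(counts, k, lo=0):
    """Return the k-th smallest value (1-indexed) described by counts.

    Scans the table once, so the cost depends on the size of the value
    range rather than on the number of stored elements.
    """
    if k < 1:
        raise IndexError("k out of range")
    for offset, c in enumerate(counts):
        if k <= c:
            return lo + offset
        k -= c  # all c copies of this value rank before the target
    raise IndexError("k out of range")
```

Inserting or deleting a value is a single increment or decrement of one counter, so the table stays cheap to maintain as the data changes.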
 
== Lower bounds ==