Partial sorting: Difference between revisions

Content deleted Content added
restate more simply, elab
Selecting k smallest or largest elements: L3 → L2 section, remove lede that’s been merged into article lede
Line 3:
In terms of indices, in a partially sorted list, for every index ''i'' from 1 to ''k,'' the ''i''th element is in the same place as it would be in the fully sorted list: element ''i'' of the partially sorted list contains [[order statistic]] ''i'' of the input list.
 
=== Direct application of the linear-time selection algorithm ===
== Selecting k smallest or largest elements ==
A special case of partial sorting is that of selecting the ''k'' smallest or ''k'' largest elements, which is particularly useful where we want to present just the "top ''k''" of an unsorted list, such as the top 100 corporations by gross sales.
 
=== Direct application of the linear-time selection algorithm ===
 
The linear-time selection algorithm described above can be used to find the ''k'' smallest or the ''k'' largest elements in worst-case linear time O(''n''). To find the ''k'' smallest elements, find the ''k''th smallest element using the [[Selection algorithm#Linear general selection algorithm - Median of Medians algorithm|linear-time median-of-medians selection algorithm]]. After that, [[Quicksort#In-place_version|partition]] the array with the ''k''th smallest element as pivot. The ''k'' smallest elements will be the first ''k'' elements.
 
=== Data structure-based solutions ===
 
Another simple method is to add each element of the list into a [[priority queue]] data structure, such as a [[heap (data structure)|heap]] or [[self-balancing binary search tree]], with at most ''k'' elements. Whenever the data structure has more than ''k'' elements, we remove the largest element, which can be done in O(log ''k'') time. Each insertion operation also takes O(log ''k'') time, resulting in O(''n''log ''k'') time overall. If [[heap (data structure)|heap]] is used, you may need to sort it further resulting in O((''n'' + ''k'')log''k'').
Line 18 ⟶ 15:
We can achieve an O(log ''n'') time solution using [[skip lists]]. Skip lists are sorted data structures that allow insertion, deletion and indexed retrieval in O(log ''n'') time. Thus, for any given percentile, we can insert a new element into (and possibly delete an old element from) the list in O(log ''n''), calculate the corresponding index(es) and finally access the percentile value in O(log ''n'') time. See, for example, this Python-based implementation for calculating [http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli running median].
 
=== Optimised sorting algorithms ===
More efficient than any of these are specialized partial sorting algorithms based on [[mergesort]] and [[quicksort]]. The simplest is the quicksort variation: there is no need to recursively sort partitions which only contain elements that would fall after the ''k''th place in the end (starting from the "left" boundary). Thus, if the pivot falls in position ''k'' or later, we recurse only on the left partition:
 
Line 44 ⟶ 41:
The resulting algorithm requires an expected time of only O(''n''), which is the best such an algorithm can hope for.
 
===Tournament Algorithm===
Another method is the tournament algorithm. The idea is to conduct a knockout minimal round tournament to decide the ranks. It first organises the games (comparisons) between adjacent pairs and moves the winners to next round until championship (the first best) is decided. It also constructs the tournament tree along the way. Now the second best element must be among the direct losers to winner and these losers can be found out by walking in the binary tree in O(log ''n'') time. It organises another tournament to decide the second best among these potential elements. The third best must be one among the losers of the second best in either of the two tournament trees. The approach continues until we find ''k'' elements. This algorithm takes O(''n'' + k log ''n'') complexity, which for any fixed ''k'' independent of ''n'' is O(''n'').