In [[computer science]], '''partial sorting''' is a [[Relaxation (approximation)|relaxed]] variant of the [[Sorting algorithm|sorting]] problem. Total sorting is the problem of returning a list of items such that its elements all appear in order, while partial sorting is returning a list of the ''k'' smallest (or ''k'' largest) elements in order. The other elements (above the ''k'' smallest ones) may also be sorted, as in an in-place partial sort, or may be discarded, which is common in streaming partial sorts. A common practical example of partial sorting is computing the "Top 100" of some list. A further relaxation only requires returning a list of the ''k'' smallest elements, but without requiring that these be ordered. This latter form is quite close to the [[selection algorithm|selection]] problem, and a solution to one problem can be easily converted to a solution to the other.
 
In terms of indices, in a partially sorted list, for every index ''i'' from 1 to ''k,'' the ''i''-th element is in the same place as it would be in the fully sorted list: element ''i'' of the partially sorted list contains [[order statistic]] ''i'' of the input list.
 
==Offline problems==
===Heap-based solution===
[[Heap (data structure)|Heaps]] admit a simple single-pass partial sort when {{mvar|k}} is fixed: insert the first {{mvar|k}} elements of the input into a max-heap. Then make one pass over the remaining elements, add each to the heap in turn, and remove the largest element. Each insertion operation takes {{math|''O''(log ''k'')}} time, resulting in {{math|''O''(''n'' log ''k'')}} time overall; this "partial heapsort" algorithm is practical for small values of {{mvar|k}} and in [[online algorithm|online]] settings.<ref name="aofa04slides"/> An "online heapselect" algorithm described below, based on a min-heap, takes {{math|''O''(''n'' + ''k'' log ''n'')}}.<ref name="aofa04slides"/>
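This single-pass scheme can be sketched in Python with the standard <code>heapq</code> module (the function name <code>smallest_k</code> is illustrative); since <code>heapq</code> provides only a min-heap, the bounded max-heap is simulated by negating keys:

```python
import heapq

def smallest_k(iterable, k):
    """Return the k smallest elements in sorted order in one pass,
    O(n log k) time. heapq is a min-heap, so the max-heap of the
    k smallest seen so far is simulated with negated keys."""
    heap = []                        # heap[0] is minus the largest kept value
    for x in iterable:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:           # x displaces the current k-th smallest
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)

print(smallest_k([9, 1, 8, 2, 7, 3, 6], 3))  # [1, 2, 3]
```

Because the heap never exceeds {{mvar|k}} elements, this sketch also works on streams whose length is unknown in advance.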
 
===Solution by partitioning selection===
A further relaxation requiring only a list of the {{mvar|k}} smallest elements, but without requiring that these be ordered, makes the problem equivalent to [[Selection algorithm#Partition-based selection|partition-based selection]]; the original partial sorting problem can be solved by such a selection algorithm to obtain an array where the first {{mvar|k}} elements are the {{mvar|k}} smallest, and sorting these, at a total cost of {{math|''O''(''n'' + ''k'' log ''k'')}} operations. A popular choice to implement this algorithm scheme is to combine [[quickselect]] and [[quicksort]]; the result is sometimes called "quickselsort".<ref name="aofa04slides"/>
 
Common in current (as of 2022) C++ STL implementations is a pass of [[Heap (data structure)#Applications|heapselect]] for a list of ''k'' elements, followed by a [[heapsort]] for the final result.<ref>{{cite web |title=std::partial_sort |url=https://en.cppreference.com/w/cpp/algorithm/partial_sort |website=en.cppreference.com}}</ref>
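The selection-then-sort scheme can be sketched in Python, here with a random-pivot [[quickselect]] (the names <code>partition</code> and <code>partial_sort</code> are illustrative, not a library API):

```python
import random

def partition(a, lo, hi, p):
    """Lomuto partition of a[lo..hi] around the value at index p;
    returns the pivot's final index."""
    a[p], a[hi] = a[hi], a[p]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store

def partial_sort(a, k):
    """Quickselect until a[:k] holds the k smallest elements, then
    sort that prefix: expected O(n + k log k) time in total."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        p = partition(a, lo, hi, random.randrange(lo, hi + 1))
        if p == k - 1:
            break
        if p < k - 1:
            lo = p + 1
        else:
            hi = p - 1
    a[:k] = sorted(a[:k])

data = [5, 9, 1, 7, 3, 8, 2]
partial_sort(data, 3)
print(data[:3])  # [1, 2, 3]
```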
 
==={{anchor|Partial quicksort}} Specialised sorting algorithms===
More efficient than any of the aforementioned are specialized partial sorting algorithms based on [[mergesort]] and [[quicksort]]. In the quicksort variant, there is no need to recursively sort partitions which only contain elements that would fall after the {{mvar|k}}'th place in the final sorted array (starting from the "left" boundary). Thus, if the pivot falls in position {{mvar|k}} or later, we recurse only on the left partition:<ref>{{cite conference |last=Martínez |first=Conrado |title=Partial quicksort |conference=Proc. 6th ACM-SIAM Workshop on Algorithm Engineering and Experiments and 1st ACM-SIAM Workshop on Analytic Algorithmics and Combinatorics |year=2004 |url=https://www.lsi.upc.edu/~conrado/research/reports/ALCOMFT-TR-03-50.pdf}}</ref>
 
 '''function''' partial_quicksort(A, i, j, k) '''is'''
     '''if''' i < j '''then'''
         p ← pivot(A, i, j)
         p ← partition(A, i, j, p)
         partial_quicksort(A, i, p-1, k)
         '''if''' p < k-1 '''then'''
             partial_quicksort(A, p+1, j, k)
 
The resulting algorithm is called partial quicksort and requires an ''expected'' time of only {{math|''O''(''n'' + ''k'' log ''k'')}}, and is quite efficient in practice, especially if a [[selection sort]] is used as a base case when {{mvar|k}} becomes small relative to {{mvar|n}}. However, the worst-case running time is still quadratic, arising from consistently bad pivot selections. Pivot selection along the lines of the worst-case linear-time selection algorithm (see {{section link|Quicksort|Choice of pivot}}) could be used to get better worst-case performance. Partial quicksort, quickselect (including the multiple variant), and quicksort can all be generalized into what is known as a ''chunksort''.<ref name="aofa04slides"/>
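The pseudocode above can be fleshed out in Python with a standard Lomuto partition and a random pivot (an illustrative sketch, not the cited implementation):

```python
import random

def partition(a, i, j, p):
    """Lomuto partition of a[i..j] around the value at index p;
    returns the pivot's final index."""
    a[p], a[j] = a[j], a[p]
    pivot, store = a[j], i
    for t in range(i, j):
        if a[t] < pivot:
            a[t], a[store] = a[store], a[t]
            store += 1
    a[store], a[j] = a[j], a[store]
    return store

def partial_quicksort(a, i, j, k):
    """Sort a[i..j] only as far as needed for a[:k] to end up sorted."""
    if i < j:
        p = partition(a, i, j, random.randrange(i, j + 1))
        partial_quicksort(a, i, p - 1, k)
        if p < k - 1:                          # right partition may still hold
            partial_quicksort(a, p + 1, j, k)  # elements of the first k

data = [4, 6, 2, 9, 1, 5, 3]
partial_quicksort(data, 0, len(data) - 1, 3)
print(data[:3])  # [1, 2, 3]
```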
 
==Incremental sorting==
Incremental sorting is a version of the partial sorting problem where the input is given up front but {{mvar|k}} is unknown: given a {{mvar|k}}-sorted array, it should be possible to extend the partially sorted part so that the array becomes {{math|(''k''+1)}}-sorted.{{r|paredes}}
 
[[Heap (data structure)|Heaps]] lead to an {{math|''O''(''n'' + ''k'' log ''n'')}} "online heapselect" solution to incremental partial sorting: first "heapify", in linear time, the complete input array to produce a min-heap. Then extract the minimum of the heap {{mvar|k}} times.<ref name="aofa04slides">{{cite conference |author=Conrado Martínez |year=2004 |title=On partial sorting |url=https://www.lsi.upc.edu/~conrado/research/talks/aofa04.pdf |conference=10th Seminar on the Analysis of Algorithms}}</ref>
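This heapify-then-extract scheme is brief in Python (the name <code>heapselect</code> is illustrative):

```python
import heapq

def heapselect(a, k):
    """Heapify the whole input in Theta(n), then pop the minimum k
    times at O(log n) each: O(n + k log n) overall."""
    heap = list(a)                    # copy so the input is untouched
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(k)]

print(heapselect([7, 2, 9, 4, 1, 8], 3))  # [1, 2, 4]
```

Because the heap retains the remaining elements, further minima can be extracted on demand, which is what makes this solution incremental.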
 
A different incremental sort can be obtained by modifying quickselect. The version due to Paredes and Navarro maintains a [[stack (data structure)|stack]] of pivots across calls, so that incremental sorting can be accomplished by repeatedly requesting the smallest item of an array {{mvar|A}} from the following algorithm:<ref name="paredes">{{Cite conference| doi = 10.1137/1.9781611972863.16| chapter = Optimal Incremental Sorting| title = Proc. Eighth Workshop on Algorithm Engineering and Experiments (ALENEX)| pages = 171–182| year = 2006| last1 = Paredes | first1 = Rodrigo| last2 = Navarro | first2 = Gonzalo| isbn = 978-1-61197-286-3| citeseerx = 10.1.1.218.4119}}</ref>
 
<div style="margin-left: 35px; width: 600px">
{{framebox|blue}}
Algorithm {{math|IQS(''A'' : array, ''i'' : integer, ''S'' : stack)}} returns the {{mvar|i}}'th smallest element in {{mvar|A}}
* If {{math|''i'' {{=}} top(''S'')}}:
** Pop {{mvar|S}}
** Return {{math|''A''[''i'']}}
* Let {{math|pivot ← random [''i'', top(''S''))}}
* Update {{math|pivot ← partition(''A''[''i'' : top(''S'')), ''A''[pivot])}}
* Push {{math|pivot}} onto {{mvar|S}}
* Return {{math|IQS(''A'', ''i'', ''S'')}}
{{frame-footer}}
</div>
 
The stack {{mvar|S}} is initialized to contain only the length {{mvar|n}} of {{mvar|A}}. {{mvar|k}}-sorting the array is done by calling {{math|IQS(''A'', ''i'', ''S'')}} for {{math|''i'' {{=}} 0, 1, 2, ...}}; this sequence of calls has [[average-case complexity]] {{math|''O''(''n'' + ''k'' log ''k'')}}, which is asymptotically equivalent to {{math|''O''(''n'' + ''k'' log ''n'')}}. The worst-case time is quadratic, but this can be fixed by replacing the random pivot selection by the [[median of medians]] algorithm.{{r|paredes}}
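An illustrative Python transcription of IQS, using a Lomuto partition over half-open ranges (this is a sketch, not the authors' reference implementation):

```python
import random

def partition(a, lo, hi, p):
    """Lomuto partition of the half-open range a[lo:hi] around the
    value at index p; returns the pivot's final index."""
    a[p], a[hi - 1] = a[hi - 1], a[p]
    pivot, store = a[hi - 1], lo
    for t in range(lo, hi - 1):
        if a[t] < pivot:
            a[t], a[store] = a[store], a[t]
            store += 1
    a[store], a[hi - 1] = a[hi - 1], a[store]
    return store

def iqs(a, i, stack):
    """Incremental quickselect: return the i-th smallest element of a,
    reusing the pivots kept on `stack` across calls."""
    if i == stack[-1]:
        stack.pop()
        return a[i]
    p = random.randrange(i, stack[-1])     # pivot index in [i, top(S))
    p = partition(a, i, stack[-1], p)
    stack.append(p)
    return iqs(a, i, stack)

data = [6, 3, 9, 1, 5, 2]
s = [len(data)]                            # stack holds only n initially
print([iqs(data, i, s) for i in range(3)])  # [1, 2, 3]
```

Each call only partitions the range between {{mvar|i}} and the nearest pivot already on the stack, which is why repeated calls share work.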
 
==Tournament algorithm==
Another method is the tournament algorithm. The idea is to conduct a knockout tournament to decide the ranks. It first organises the games (comparisons) between adjacent pairs and moves the winners to the next round until the champion (the best element) is decided, constructing the tournament tree along the way. The second-best element must then be among the direct losers to the champion, and these losers can be found by walking down the binary tree in O(log ''n'') time. Another tournament is organised to decide the second best among these candidates; the third best must in turn be among the losers to the second best in either of the two tournament trees. The approach continues until ''k'' elements have been found. This algorithm takes O(''n'' + ''k'' log ''n'') time, which for any fixed ''k'' independent of ''n'' is O(''n'').
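The tournament method can be sketched in Python (an illustrative sketch assuming ''k'' ≤ ''n''; instead of running explicit sub-tournaments among the losers, this common equivalent formulation replaces each extracted champion with a sentinel and replays only the games on its leaf-to-root path):

```python
import math

def tournament_k_smallest(items, k):
    """Select the k smallest by a knockout tournament: play all initial
    rounds bottom-up, then repeatedly knock out the champion and replay
    only the O(log n) games along its path to the root."""
    n = 1
    while n < len(items):
        n *= 2                              # pad to a power of two
    leaves = list(items) + [math.inf] * (n - len(items))
    tree = [0] * (2 * n)                    # tree[1] is the root; each
    for i in range(n):                      # internal node holds the index
        tree[n + i] = i                     # of its winning leaf
    for i in range(n - 1, 0, -1):           # play the initial n-1 games
        l, r = tree[2 * i], tree[2 * i + 1]
        tree[i] = l if leaves[l] <= leaves[r] else r
    result = []
    for _ in range(k):
        w = tree[1]
        result.append(leaves[w])
        leaves[w] = math.inf                # knock the champion out
        i = (n + w) // 2
        while i >= 1:                       # replay the champion's path
            l, r = tree[2 * i], tree[2 * i + 1]
            tree[i] = l if leaves[l] <= leaves[r] else r
            i //= 2
    return result

print(tournament_k_smallest([5, 3, 8, 1, 9, 2, 7], 3))  # [1, 2, 3]
```

Building the tree plays ''n'' − 1 games and each of the ''k'' extractions replays about log<sub>2</sub> ''n'' games, matching the O(''n'' + ''k'' log ''n'') bound above.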
 
== Language/library support ==
* The [[C++]] standard specifies a library function called <code>[https://en.cppreference.com/w/cpp/algorithm/partial_sort std::partial_sort]</code>.
* The [[Python (programming language)|Python]] standard library includes functions <code>[https://docs.python.org/library/heapq.html#heapq.nlargest nlargest]</code> and <code>[https://docs.python.org/library/heapq.html#heapq.nsmallest nsmallest]</code> in its <code>heapq</code> module.
* The [[Julia_(programming_language)|Julia]] standard library includes a <code>[https://docs.julialang.org/en/v1/base/sort/#Base.Sort.PartialQuickSort PartialQuickSort]</code> algorithm used in <code>[https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsort! partialsort!]</code> and variants.
 
== See also ==
* [[Selection algorithm]]
 
==References==
{{reflist}}
 
== External links ==
* J.M. Chambers (1971). [http://dl.acm.org/citation.cfm?id=362602 Partial sorting]. [[Communications of the ACM|CACM]] '''14'''(5):357–358.
* [http://www.siam.org/meetings/analco04/abstracts/CMartinez.pdf Partial quicksort]
 
[[Category:Sorting algorithms]]
[[Category:Online sorts]]
[[Category:Articles with example pseudocode]]