{{Short description|Sequence merge algorithm in computer science}}
{{DISPLAYTITLE:''k''-way merge algorithm}}
In [[computer science]], '''''k''-way merge algorithms''' or multiway merges are a specific type of [[Merge algorithm|sequence merge algorithm]] that take in ''k'' sorted lists and merge them into a single sorted list. The term generally refers to merge algorithms that take in more than two sorted lists; two-way merges are also referred to as binary merges. ''k''-way merges are also used as a building block in [[external sorting]] algorithms.
 
== Two-way merge ==
A 2-way merge, or a binary merge, has been studied extensively due to its key role in [[merge sort]]. A classic example is the merge routine that appears frequently in merge sort examples. The classic merge outputs the data item with the lowest key at each step; given some sorted lists, it produces a sorted list containing all the elements in any of the input lists, and it does so in time proportional to the sum of the lengths of the input lists.
 
Denote by A[1..p] and B[1..q] two arrays sorted in increasing order.
Further, denote by C[1..n] the output array.
The canonical 2-way merge algorithm<ref>{{cite book|author1=Thomas H. Cormen|authorlink1=Thomas H. Cormen|author2=Charles E. Leiserson|author3=Ronald L. Rivest|authorlink3=Ron Rivest|author4=Clifford Stein|title=Introduction To Algorithms|url=https://books.google.com/books?id=NLngYyWFl_YC&pg=PA11|year=2001|publisher=MIT Press|isbn=978-0-262-03293-3|pages=28–29}}</ref> stores indices i, j, and k into A, B, and C respectively.
Initially, these indices refer to the first element, i.e., are 1.
If A[i] < B[j], then the algorithm copies A[i] into C[k] and increases i and k.
Otherwise, the algorithm copies B[j] into C[k] and increases j and k. A special case arises when one of the two input arrays has been exhausted; the remaining elements of the other array are then copied into C.
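A minimal sketch of this 2-way merge in Python (the function and variable names are illustrative and not taken from the cited text):

<syntaxhighlight lang="python">
def merge_two(A, B):
    """Canonical 2-way merge of two sorted lists A and B into a new sorted list C."""
    C = []
    i = j = 0
    while i < len(A) and j < len(B):
        if A[i] < B[j]:          # copy the smaller head and advance its index
            C.append(A[i])
            i += 1
        else:
            C.append(B[j])
            j += 1
    C.extend(A[i:])              # one of these slices is empty; the other holds the leftovers
    C.extend(B[j:])
    return C
</syntaxhighlight>

For example, merge_two([1, 3, 5], [2, 4]) returns [1, 2, 3, 4, 5].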
== ''k''-way merge ==
The ''k''-way merge problem consists of merging ''k'' sorted arrays to produce a single sorted array containing all of their elements. Denote by n the total number of elements; n is equal to the size of the output array and the sum of the sizes of the ''k'' input arrays.
For simplicity, we assume that none of the input arrays is empty.
As a consequence <math>k \le n</math>, which simplifies the reported running times.
The problem can be solved in <math>\mathcal{O}(n \cdot \log(k))</math> running time with <math>\mathcal{O}(n)</math> space.
Several algorithms that achieve this running time exist.
 
=== Iterative 2-way merge ===
 
The problem can be solved by iteratively merging two of the ''k'' arrays using a 2-way merge until only a single array is left. If the arrays are merged in arbitrary order, then the resulting running time can be as bad as Θ(kn), which is suboptimal.
The running time can be improved by iteratively merging the first with the second, the third with the fourth, and so on. As the number of arrays is halved in each iteration, there are only Θ(log k) iterations. In each iteration every element is moved exactly once. The running time per iteration is therefore in Θ(n) as n is the number of elements. The total running time is therefore in Θ(n log k).
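The pairwise strategy can be sketched in Python as follows (the function name is illustrative; the standard-library heapq.merge is used for the individual 2-way merges):

<syntaxhighlight lang="python">
from heapq import merge as merge_two   # standard-library 2-way merge of sorted iterables

def kway_merge_pairwise(lists):
    """Merge k sorted lists by repeatedly merging them in pairs."""
    lists = [list(l) for l in lists]
    while len(lists) > 1:
        merged = []
        # Merge the first with the second, the third with the fourth, and so on.
        for i in range(0, len(lists) - 1, 2):
            merged.append(list(merge_two(lists[i], lists[i + 1])))
        if len(lists) % 2 == 1:          # an odd list out is carried into the next round
            merged.append(lists[-1])
        lists = merged
    return lists[0] if lists else []
</syntaxhighlight>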
 
We can further improve upon this algorithm by iteratively merging the two shortest arrays. This greedy choice minimizes the total merge cost, so it can never be worse than the strategy described in the previous paragraph; the running time is therefore in O(n log k). In some border cases the running time is strictly better: consider the degenerate case in which all but one array contain only one element. The strategy explained in the previous paragraph needs Θ(n log k) running time, while the improved one needs only Θ(n + k log k) running time.
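A sketch of this shortest-first strategy, keeping the lists in a min-heap ordered by their current length (names are illustrative; Python's heapq module is assumed):

<syntaxhighlight lang="python">
import heapq

def kway_merge_shortest_first(lists):
    """Repeatedly merge the two currently shortest lists."""
    lists = [list(l) for l in lists if l]        # drop empty inputs
    if not lists:
        return []
    # Heap entries are (length, tie-breaker, list); the tie-breaker avoids comparing lists.
    heap = [(len(l), i, l) for i, l in enumerate(lists)]
    heapq.heapify(heap)
    counter = len(lists)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)            # currently shortest list
        _, _, b = heapq.heappop(heap)            # second shortest list
        merged = list(heapq.merge(a, b))         # 2-way merge in O(len(a) + len(b))
        heapq.heappush(heap, (len(merged), counter, merged))
        counter += 1
    return heap[0][2]
</syntaxhighlight>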
 
=== Direct ''k''-way merge ===
A straightforward solution keeps one index per input array, repeatedly scans the ''k'' current front elements to find the smallest, appends it to the output, and advances the corresponding index. Finding the smallest element by a linear scan takes O(k) time per output element, so this naive approach runs in O(kn) time.
 
We can improve upon this by computing the smallest element faster.
By using either [[Heap (data structure)|heaps]], tournament trees, or [[splay tree]]s, the smallest element can be determined in O(log k) time.
The resulting running times are therefore in O(n log k).
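As an illustration, a heap-based direct ''k''-way merge might look as follows in Python (the names are illustrative; this is essentially the behaviour of the standard-library function heapq.merge):

<syntaxhighlight lang="python">
import heapq

def kway_merge_heap(lists):
    """Direct k-way merge using a min-heap of the current head of each list."""
    out = []
    # Heap entries are (head value, list index, position of that head in its list).
    heap = [(l[0], i, 0) for i, l in enumerate(lists) if l]
    heapq.heapify(heap)                       # O(k)
    while heap:
        value, i, pos = heapq.heappop(heap)   # smallest current head, O(log k)
        out.append(value)
        if pos + 1 < len(lists[i]):           # advance within list i, if possible
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return out
</syntaxhighlight>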
 
==== Tournament tree ====
[[File:Tournament tree.png|thumb|Tournament tree]]
The tournament tree<ref name="knuth98">
{{cite book| last = Knuth| first = Donald| author-link = Donald Knuth| series = [[The Art of Computer Programming]]| volume= 3| title= Sorting and Searching| edition = 2nd| publisher = Addison-Wesley| year= 1998| chapter = Chapter 5.4.1. Multiway Merging and Replacement Selection| pages = 252–255| isbn = 0-201-89685-0}}</ref> is based on an elimination tournament, as in sports competitions. In each game, two of the input elements compete. The winner is promoted to the next round. Therefore, we get a [[binary tree]] of games. The list is sorted in ascending order, so the winner of a game is the smaller of the two elements.
 
[[File:Loser tree.png|thumb|Loser tree]]
===== Algorithm =====
 
A tournament tree can be represented as a balanced binary tree by adding sentinels to the input lists (i.e. adding a member to the end of each list with a value of infinity) and by adding null lists (comprising only a sentinel) until the number of lists is a power of two. The balanced tree can be stored in a single array. The parent element can be reached by dividing the current index by two.
 
When one of the leaves is updated, all games from the leaf to the root are replayed. In the following [[pseudocode]], an object-oriented tree is used instead of an array because it is easier to understand. Additionally, the number of lists to merge is assumed to be a power of two.
 
'''function''' merge(L1, ..., Ln)
buildTree(heads of L1, ..., Ln)
'''while''' tree has elements
winner := tree.winner
       output winner.value
       winner.value := next element of the winner's input list (or the sentinel ∞ if the list is exhausted)
       replay the games on the path from the winner's leaf to the root
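For concreteness, the array-stored variant described above can be sketched in Python as follows (all names are illustrative and not taken from the cited sources):

<syntaxhighlight lang="python">
INF = float("inf")  # sentinel value appended to every list

def kway_merge_tournament(lists):
    """k-way merge with a winner tree stored in a single array."""
    # Pad with sentinel-only lists until the number of lists is a power of two,
    # then append a sentinel to every list.
    k = 1
    while k < len(lists):
        k *= 2
    lists = [list(l) + [INF] for l in lists] + [[INF]] * (k - len(lists))
    pos = [0] * k                          # current position within each list
    # tree[k..2k-1] are leaves; each internal node stores the index of the winning leaf.
    tree = [0] * (2 * k)
    for i in range(k):
        tree[k + i] = i                    # leaf i represents list i

    def value(leaf):                       # current head element of the leaf's list
        return lists[leaf][pos[leaf]]

    def winner(a, b):                      # the smaller head wins the game
        return a if value(a) <= value(b) else b

    # Build the tree: play all games bottom-up in Θ(k).
    for node in range(k - 1, 0, -1):
        tree[node] = winner(tree[2 * node], tree[2 * node + 1])

    out = []
    total = sum(len(l) - 1 for l in lists)     # number of real (non-sentinel) elements
    for _ in range(total):
        w = tree[1]                        # the overall winner sits at the root
        out.append(value(w))
        pos[w] += 1                        # advance the winner's list (possibly to its sentinel)
        # Replay only the games on the path from the winner's leaf to the root.
        node = (k + w) // 2
        while node >= 1:
            tree[node] = winner(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return out
</syntaxhighlight>

The replay loop touches only Θ(log k) nodes per output element, which matches the running time analysis below.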
===== Running time =====
 
The tree is initially built in Θ(k) time. In each step of merging, only the games on the path from the new element to the root need to be replayed. In each layer, only one comparison is needed. As the tree is balanced, the path from one of the input arrays to the root contains only Θ(log k) elements. In total, there are n elements that need to be transferred. The resulting total running time is therefore in Θ(n log k).<ref name="knuth98" />
===== Example =====
[[File:Loser tree merge.gif|centre|thumb|Visualization for the whole algorithm]]
 
=== Lower bound on running time ===
 
One can show that no comparison-based ''k''-way merge algorithm exists with a running time in ''O''(''n'' f(''k'')), where f grows asymptotically slower than a logarithm and n is the total number of elements (excluding data with desirable distributions such as disjoint ranges). The proof is a straightforward reduction from comparison-based sorting. Suppose such an algorithm existed; then we could construct a comparison-based sorting algorithm with running time ''O''(''n'' f(''n'')) as follows: chop the input array into n arrays of size 1, then merge these n arrays with the ''k''-way merge algorithm. The resulting array is sorted and the algorithm has a running time in ''O''(''n'' f(''n'')). This contradicts the well-known result that no comparison-based sorting algorithm with a worst-case running time below ''O''(''n'' log ''n'') exists.
 
== External sorting ==
''k''-way merges are used in external sorting procedures.<ref>{{Cite book|title = Data Structures and Algorithm Analysis in C++, Third Edition|url = https://books.google.com/books?id=AijEAgAAQBAJ|publisher = Courier Corporation|date = 2012-07-26|isbn = 9780486172620|first = Clifford A.|last = Shaffer}}</ref> [[External sorting|External sorting algorithms]] are a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). ''k''-way merge algorithms usually take place in the second stage of external sorting algorithms, much like they do for merge sort.
 
A multiway merge allows the files outside of memory to be merged in fewer passes than with a binary merge. If there are 6 runs that need to be merged, a binary merge requires 3 merge passes, whereas a 6-way merge needs only a single pass. This reduction in merge passes is especially important given the large amounts of data typically being sorted, as it yields greater speed-ups while also reducing the number of accesses to slower storage.
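As a small illustration of the pass count (not from the cited source), assuming each pass merges groups of up to ''k'' runs:

<syntaxhighlight lang="python">
import math

def merge_passes(runs, k):
    """Number of passes needed to merge `runs` initial runs with a k-way merge."""
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / k)   # each pass reduces the number of runs by a factor of k
        passes += 1
    return passes

print(merge_passes(6, 2))  # 3 passes with a binary merge
print(merge_passes(6, 6))  # 1 pass with a 6-way merge
</syntaxhighlight>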
 
== References ==