{{Short description|Sequence merge algorithm in computer science}}
{{orphan|date=November 2015}}
{{DISPLAYTITLE:''k''-way merge algorithm}}
In [[computer science]], '''''k''-way merge algorithms''' or multiway merges are [[Merge algorithm|sequence merge algorithms]] that take in ''k'' sorted lists and merge them into a single sorted list. The term generally refers to merge algorithms that take in a number of sorted lists greater than two; two-way merges are also referred to as binary merges. The ''k''-way merge is also used in [[external sorting]].
 
== Two-way merge ==
A two-way merge, or binary merge, has been studied extensively due to its key role in [[merge sort]]. An example is the classic merge that appears frequently in merge sort examples. The classic merge outputs the data item with the lowest key at each step; given some sorted lists, it produces a sorted list containing all the elements in any of the input lists, and it does so in time proportional to the sum of the lengths of the input lists. Algorithms exist that can operate in better than linear time, such as the Hwang–Lin merging algorithm.<ref>F. K. Hwang and S. Lin, "A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets", SIAM Journal on Computing 1 (1972), 31–39.</ref>
 
Denote by A[1..p] and B[1..q] two arrays sorted in increasing order.
Further, denote by C[1..n] the output array.
The canonical 2-way merge algorithm<ref>{{cite book|author1=Thomas H. Cormen|authorlink1=Thomas H. Cormen|author2=Charles E. Leiserson|author3=Ronald L. Rivest|authorlink3=Ron Rivest|author4=Clifford Stein|title=Introduction To Algorithms|url=https://books.google.com/books?id=NLngYyWFl_YC&pg=PA11|year=2001|publisher=MIT Press|isbn=978-0-262-03293-3|pages=28–29}}</ref> stores indices i, j, and k into A, B, and C respectively.
Initially, these indices refer to the first element, i.e., are 1.
If A[i] < B[j], then the algorithm copies A[i] into C[k] and increases i and k.
Otherwise, the algorithm copies B[j] into C[k] and increases j and k.
A special case arises if either i or j has reached the end of A or B.
In this case the algorithm copies the remaining elements of B or A into C and terminates.
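The steps above can be sketched in Python (an illustrative sketch, not taken from the cited reference; the function name is arbitrary and 0-based indices are used instead of the 1-based ones in the text):

```python
def merge(A, B):
    """Canonical 2-way merge: indices i and j scan A and B; k writes into C."""
    C = [None] * (len(A) + len(B))
    i = j = k = 0
    while i < len(A) and j < len(B):
        if A[i] < B[j]:
            C[k] = A[i]; i += 1
        else:
            C[k] = B[j]; j += 1
        k += 1
    # Special case: one input is exhausted; copy the rest of the other.
    while i < len(A):
        C[k] = A[i]; i += 1; k += 1
    while j < len(B):
        C[k] = B[j]; j += 1; k += 1
    return C
```

The running time is proportional to len(A) + len(B), since each iteration of every loop writes exactly one element into C.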
 
== ''k''-way merge ==
The ''k''-way merge problem consists of merging k sorted arrays to produce a single sorted array with the same elements.
Denote by n the total number of elements.
n is equal to the size of the output array and the sum of the sizes of the k input arrays.
For simplicity, we assume that none of the input arrays is empty.
As a consequence <math>k \le n</math>, which simplifies the reported running times.
The problem can be solved in <math>\mathcal{O}(n \cdot \log(k))</math> running time with <math>\mathcal{O}(n)</math> space.
Several algorithms that achieve this running time exist.
 
=== Iterative 2-way merge ===
 
The problem can be solved by iteratively merging two of the k arrays using a 2-way merge until only a single array is left. If the arrays are merged in arbitrary order, the resulting running time is in O(kn), which is suboptimal.
 
The running time can be improved by iteratively merging the first with the second, the third with the fourth, and so on. As the number of arrays is halved in each iteration, there are only Θ(log k) iterations. In each iteration every element is moved exactly once. The running time per iteration is therefore in Θ(n) as n is the number of elements. The total running time is therefore in Θ(n log k).
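The halving strategy can be sketched in Python (an illustrative sketch; `binary_merge` is the classic 2-way merge from the previous section and both names are arbitrary):

```python
def binary_merge(a, b):
    """Classic 2-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def iterative_merge(lists):
    """Merge k sorted lists in Theta(log k) pairwise rounds of Theta(n) work each."""
    lists = list(lists)
    if not lists:
        return []
    while len(lists) > 1:
        # Merge the first with the second, the third with the fourth, and so on;
        # an odd list at the end is carried over to the next round unchanged.
        lists = [binary_merge(lists[i], lists[i + 1]) if i + 1 < len(lists) else lists[i]
                 for i in range(0, len(lists), 2)]
    return lists[0]
```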
 
We can further improve upon this algorithm by iteratively merging the two shortest arrays. This strategy cannot be worse than the one described in the previous paragraph, so the running time is in O(n log k). In border cases the running time can be better: consider the degenerate case where all but one array contain only one element. The strategy of the previous paragraph needs Θ(n log k) running time, while the improved one needs only Θ(n + k log k).
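The shortest-two-first strategy can be sketched with a min-heap keyed by list length (an illustrative sketch; all names are arbitrary):

```python
import heapq

def binary_merge(a, b):
    """Classic 2-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def shortest_first_merge(lists):
    """Repeatedly merge the two shortest remaining lists (a greedy strategy)."""
    # The counter breaks length ties so the heap never compares two lists directly.
    heap = [(len(lst), i, lst) for i, lst in enumerate(lists)]
    heapq.heapify(heap)
    tie = len(lists)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)   # shortest list
        _, _, b = heapq.heappop(heap)   # second-shortest list
        heapq.heappush(heap, (len(a) + len(b), tie, binary_merge(a, b)))
        tie += 1
    return heap[0][2] if heap else []
```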
 
=== Direct ''k''-way merge ===
 
In this case, we would simultaneously merge k-runs together.
 
A straightforward implementation would scan all k arrays to determine the minimum.
This straightforward implementation results in a running time of Θ(kn).
Note that this is mentioned only as a possibility, for the sake of discussion. Although it would work, it is not efficient.
 
We can improve upon this by computing the smallest element faster.
By using either [[Heap (data structure)|heaps]], tournament trees, or [[splay tree]]s, the smallest element can be determined in O(log k) time.
The resulting running times are therefore in O(n log k).
 
The heap is more commonly used, although a tournament tree is faster in practice. A heap uses approximately 2*log(k) comparisons in each step because it handles the tree from the root down to the bottom and needs to compare both children of each node. Meanwhile, a tournament tree only needs log(k) comparisons because it starts on the bottom of the tree and works up to the root, only making a single comparison in each layer. The tournament tree should therefore be the preferred implementation.
 
==== Heap ====
 
The idea is to maintain a min-heap of the k lists, each keyed by their smallest current element. A simple algorithm builds an output buffer with nodes from the heap. Start by building a min-heap of nodes, where each node consists of a head element of the list, and the rest (or tail) of the list. Because the lists are sorted initially, the head is the smallest element of each list; the heap property guarantees that the root contains the minimum element over all lists. Extract the root node from the heap, add the head element to the output buffer, create a new node out of the tail, and insert it into the heap. Repeat until there is only one node left in the heap, at which point just append that remaining list (head and tail) to the output buffer.
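The heap-based algorithm above can be sketched in Python using the standard-library heapq module (an illustrative sketch; the heap entries carry the list index and position instead of explicit head/tail nodes, and the function name is arbitrary):

```python
import heapq

def kway_merge(lists):
    """Direct k-way merge: a min-heap holds (head value, list index, position)."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                       # build the heap of k heads in O(k)
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)   # extract the overall minimum, O(log k)
        out.append(value)
        if pos + 1 < len(lists[i]):           # reinsert the tail's new head, O(log k)
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return out
```

Each of the n elements passes through the heap once, giving the O(n log k) running time stated above.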
 
Using pointers, an in-place heap algorithm
<ref>{{cite book
|last1=Bentley
|first1=Jon Louis
|author-link=Jon Bentley (computer scientist)
|title=Programming Pearls
|date=2000
|publisher=Addison Wesley
|isbn=0201657880
|pages=147–162
|edition=2nd}}</ref>
allocates a min-heap of pointers into the input arrays.
Initially these pointers point to the smallest elements of the input arrays.
The pointers are sorted by the value that they point to.
In an O(k) preprocessing step the heap is created using the standard heapify procedure.
Afterwards, the algorithm iteratively transfers the element that the root pointer points to, increases this pointer, and executes the standard sift-down (increase key) procedure upon the root element.
The running time of the sift-down procedure is bounded by O(log k).
As there are n elements, the total running time is O(n log k).
 
Note that the operation of replacing the key and iteratively doing decrease-key or sift-down is not supported by many priority queue libraries such as the C++ STL and Java. Performing an extract-min followed by an insert is less efficient.
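Python's standard-library heapq module does expose this pattern as heapq.heapreplace, which overwrites the root and performs a single sift-down. A sketch of the pointer-style algorithm using it (illustrative; it assumes the lists never contain None, which is used as the end-of-list marker):

```python
import heapq

def kway_merge_replace(lists):
    """k-way merge that replaces the root and sifts down in one step,
    instead of a separate extract-min followed by an insert."""
    iterators = [iter(lst) for lst in lists]
    heap = []
    for i, it in enumerate(iterators):
        head = next(it, None)
        if head is not None:                   # assumption: no None elements
            heap.append((head, i))
    heapq.heapify(heap)
    out = []
    while heap:
        value, i = heap[0]                     # the root is the overall minimum
        out.append(value)
        nxt = next(iterators[i], None)
        if nxt is None:
            heapq.heappop(heap)                # list i is exhausted; shrink heap
        else:
            heapq.heapreplace(heap, (nxt, i))  # replace root + single sift-down
    return out
```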
 
==== Tournament Tree ====
 
[[File:Tournament tree.png|thumb|Tournament tree]]
The Tournament Tree <ref name="knuth98">
{{cite book| last = Knuth| first = Donald| author-link = Donald Knuth| series = [[The Art of Computer Programming]]| volume= 3| title= Sorting and Searching| edition = 2nd| publisher = Addison-Wesley| year= 1998| chapter = Chapter 5.4.1. Multiway Merging and Replacement Selection| pages = 252–255| isbn = 0-201-89685-0}}</ref> is based on an elimination tournament, as in sports competitions. In each game, two of the input elements compete. The winner is promoted to the next round. Therefore, we get a [[binary tree]] of games. The list is sorted in ascending order, so the winner of a game is the smaller one of both elements.
 
[[File:Loser tree.png|thumb|Loser tree]]
 
For k-way merging, it is more efficient to only store the loser of each game (see image). The data structure is therefore called a loser tree. When building the tree or replacing an element with the next one from its list, we still promote the winner of the game to the top. The tree is filled like in a sports match but the nodes only store the loser. Usually, an additional node above the root is added that represents the overall winner. Every leaf stores a pointer to one of the input arrays. Every inner node stores a value and an index. The index of an inner node indicates which input array the value comes from. The value contains a copy of the first element of the corresponding input array.
 
The algorithm iteratively appends the minimum element to the result and then removes the element from the corresponding input list. It updates the nodes on the path from the updated leaf to the root (''replacement selection''). The removed element is the overall winner. Therefore, it has won each game on the path from the input array to the root. When selecting a new element from the input array, the element needs to compete against the previous losers on the path to the root. When using a loser tree, the partners for replaying the games are already stored in the nodes. The loser of each replayed game is written to the node and the winner is iteratively promoted to the top. When the root is reached, the new overall winner is found and can be used in the next round of merging.
 
The images of the tournament tree and the loser tree in this section use the same data and can be compared to understand the way a loser tree works.
 
===== Algorithm =====
 
A tournament tree can be represented as a balanced binary tree by adding sentinels to the input lists (i.e. adding a member to the end of each list with a value of infinity) and by adding null lists (comprising only a sentinel) until the number of lists is a power of two. The balanced tree can be stored in a single array. The parent element can be reached by dividing the current index by two.
 
When one of the leaves is updated, all games from the leaf to the root are replayed. In the following [[pseudocode]], an object oriented tree is used instead of an array because it is easier to understand. Additionally, the number of lists to merge is assumed to be a power of two.
 
 '''function''' merge(L1, ..., Ln)
     buildTree(heads of L1, ..., Ln)
     '''while''' tree has elements
         winner := tree.winner
         output winner.value
         new := winner.index.next
         replayGames(winner, new) // Replacement selection
 
 '''function''' replayGames(node, new)
     loser, winner := playGame(node, new)
     node.value := loser.value
     node.index := loser.index
     '''if''' node != root
         replayGames(node.parent, winner)
 
 '''function''' buildTree(elements)
     nextLayer := new Array()
     '''while''' elements not empty
         el1 := elements.take()
         el2 := elements.take()
         loser, winner := playGame(el1, el2)
         parent := new Node(el1, el2, loser)
         nextLayer.add(parent)
     '''if''' nextLayer.size == 1
         '''return''' nextLayer // only root
     '''else'''
         '''return''' buildTree(nextLayer)
 
===== Running time =====
 
In the beginning, the tree is first created in time Θ(k). In each step of merging, only the games on the path from the new element to the root need to be replayed. In each layer, only one comparison is needed. As the tree is balanced, the path from one of the input arrays to the root contains only Θ(log k) elements. In total, there are n elements that need to be transferred. The resulting total running time is therefore in Θ(n log k).<ref name="knuth98" />
===== Example =====
 
The following section contains a detailed example for the replacement selection step and one example for a complete merge containing multiple replacement selections.
 
====== Replacement selection ======
 
Games are replayed from the bottom to the top. In each layer of the tree, the currently stored element of the node competes against the element provided from the layer below. The winner is promoted to the top until we find the new overall winner. The loser is stored in the node of the tree.
 
[[File:Loser tree replacement selection.gif|centre|thumb|Example for replacement selection]]
 
{| class="wikitable"
|-
! Step !! Action
|-
| 1 || Leaf 1 (overall winner) is replaced by 9, the next element from the input list.
|-
| 2 || Replaying the game 9 vs 7 (previous loser). 7 wins because it is smaller. Therefore, 7 is promoted to the top while 9 is saved in the node.
|-
| 3 || Replaying the game 7 vs 3 (previous loser). 3 wins because it is smaller. Therefore, 3 is promoted to the top while 7 is saved in the node.
|-
| 4 || Replaying the game 3 vs 2 (previous loser). 2 wins because it is smaller. Therefore, 2 is promoted to the top while 3 is saved in the node.
|-
| 5 || The new overall winner 2 is saved above the root.
|}
 
====== Merge ======
To execute the merge itself, the overall smallest element is repeatedly replaced with the next input element. After that, the games to the top are replayed.
 
This example uses four sorted arrays as input.
 
{2, 7, 16}
{5, 10, 20}
{3, 6, 21}
{4, 8, 9}
 
The algorithm is initiated with the heads of each input list. Using these elements, a binary tree of losers is built. For merging, the lowest list element 2 is determined by looking at the overall minimum element at the top of the tree. That value is then popped off, and its leaf is refilled with 7, the next value in the input list. The games on the way to the top are replayed as in the previous section about replacement selection. The next element that is removed is 3. Starting from the next value in the list, 6, the games are replayed up until the root. This is repeated until the minimum of the tree equals infinity.
 
[[File:Loser tree merge.gif|centre|thumb|Visualization for the whole algorithm]]
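The tournament can be sketched in Python on the same four input arrays. This simplified variant stores the winner of each game rather than the loser, so replaying a game compares both children in every layer (the 2·log(k) comparison behavior noted earlier); all names are illustrative:

```python
def tournament_merge(lists):
    """k-way merge via an array-based tournament (winner) tree.

    Sentinels of infinity terminate each list, and sentinel-only lists pad
    the number of leaves up to a power of two, as described above."""
    INF = float("inf")
    k = 1
    while k < max(len(lists), 1):
        k *= 2
    heads = [lst + [INF] for lst in lists] + [[INF]] * (k - len(lists))
    pos = [0] * k                      # current position within each list

    def val(leaf):
        return heads[leaf][pos[leaf]]

    # t[k .. 2k-1] are the leaves; t[1 .. k-1] store each game's winning leaf.
    t = [0] * (2 * k)
    for i in range(k):
        t[k + i] = i
    for j in range(k - 1, 0, -1):      # initial tournament, bottom-up
        a, b = t[2 * j], t[2 * j + 1]
        t[j] = a if val(a) <= val(b) else b

    out = []
    while val(t[1]) != INF:            # stop when the minimum is infinity
        w = t[1]                       # overall winner
        out.append(val(w))
        pos[w] += 1                    # refill the leaf with the next element
        j = (k + w) // 2               # replay games from leaf w up to the root
        while j >= 1:
            a, b = t[2 * j], t[2 * j + 1]
            t[j] = a if val(a) <= val(b) else b
            j //= 2
    return out
```

Each replay touches one game per layer, i.e. Θ(log k) games per extracted element.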
 
=== Lower bound on running time ===
 
One can show that no comparison-based ''k''-way merge algorithm exists with a running time in ''O''(''n'' f(k)) where f grows asymptotically slower than a logarithm, and n being the total number of elements. (Excluding data with desirable distributions such as disjoint ranges.) The proof is a straightforward reduction from comparison-based sorting. Suppose that such an algorithm existed, then we could construct a comparison-based sorting algorithm with running time ''O''(''n'' f(''n'')) as follows: Chop the input array into n arrays of size 1. Merge these n arrays with the ''k''-way merge algorithm. The resulting array is sorted and the algorithm has a running time in ''O''(''n'' f(''n'')). This is a contradiction to the well-known result that no comparison-based sorting algorithm with a worst case running time below ''O''(''n'' log ''n'') exists.
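The reduction in the proof can be demonstrated directly (an illustrative sketch; heapq.merge is the Python standard library's k-way merge):

```python
import heapq

def sort_by_merging(xs):
    """Reduction from sorting: chop the input into n one-element arrays,
    then k-way merge them with k = n."""
    singletons = ([x] for x in xs)
    return list(heapq.merge(*singletons))
```

If the k-way merge ran in O(n f(k)) for a sub-logarithmic f, this would sort n elements in O(n f(n)), contradicting the comparison-sorting lower bound.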
 
== External sorting ==
{{See also|External sorting}}
''k''-way merges are used in external sorting procedures.<ref>{{Cite book|title = Data Structures and Algorithm Analysis in C++, Third Edition|url = https://books.google.com/books?id=AijEAgAAQBAJ|publisher = Courier Corporation|date = 2012-07-26|isbn = 9780486172620|first = Clifford A.|last = Shaffer}}</ref> [[External sorting|External sorting algorithms]] are a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). ''k''-way merge algorithms usually take place in the second stage of external sorting algorithms, much like they do for merge sort.
 
A multiway merge allows the files outside of memory to be merged in fewer passes than a binary merge. If there are 6 runs that need to be merged, a binary merge needs 3 merge passes, as opposed to the single merge pass of a 6-way merge. This reduction of merge passes is especially important considering the large amount of data that is usually being sorted in the first place, allowing for greater speed-ups while also reducing the number of accesses to slower storage.
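The pass count above follows from each pass reducing the number of runs by a factor of k, i.e. ⌈log<sub>k</sub> (runs)⌉ passes in total. A small sketch (illustrative; the function name is arbitrary):

```python
def merge_passes(runs, k):
    """Number of passes needed to reduce `runs` sorted runs to one,
    merging up to k runs per pass."""
    passes = 0
    while runs > 1:
        runs = -(-runs // k)   # ceiling division: each pass leaves ceil(runs/k) runs
        passes += 1
    return passes
```

For the example in the text: merge_passes(6, 2) gives 3 passes for a binary merge, while merge_passes(6, 6) gives a single pass for a 6-way merge.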
 
== References ==
{{Reflist}}
 
== Further reading ==
* {{cite book| last = Knuth| first = Donald| authorlink = Donald Knuth| series = [[The Art of Computer Programming]]| volume= 3| title= Sorting and Searching| edition = 2nd| publisher = Addison-Wesley| year= 1998| chapter = Section 5.2.4: Sorting by Merging| pages = 158–168| isbn = 0-201-89685-0}}
* {{cite book|author1=Thomas H. Cormen|author2=Charles E. Leiserson|author3=Ronald L. Rivest|author4=Clifford Stein|title=Introduction To Algorithms|url=https://books.google.com/books?id=NLngYyWFl_YC&pg=PA11|year=2001|publisher=MIT Press|isbn=978-0-262-03293-3|pages=28–29}}

{{uncategorized|date=November 2015}}

[[Category:Sorting algorithms]]