K-way merge algorithm
{{orphan|date=November 2015}}
In computer science, '''k-way merge algorithms''' or '''multiway merges''' are [[Merge algorithm|sequence merge algorithms]] that take in multiple sorted lists and merge them into a single sorted list. The term generally refers to merges of more than two lists; a merge of exactly two lists is called a 2-way or binary merge and serves as a building block within k-way merge algorithms. K-way merges are used, for example, in sorting algorithms such as [[Patience sorting|patience sorting]] and in [[External sorting|external sorting]].
 
== 2-Way Merge ==
A 2-way merge, or binary merge, has been studied extensively due to its key role in [[Merge sort]]. An example is the classic merge that appears frequently in merge sort examples: at each step it outputs the data item with the lowest key, so given some sorted lists it produces a sorted list containing all the elements in any of the input lists, in time proportional to the sum of the lengths of the input lists. There are also algorithms that can use fewer comparisons in special cases, such as the Hwang–Lin merging algorithm.<ref>F. K. Hwang and S. Lin, "A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets", SIAM Journal on Computing 1 (1972), 31–39.</ref>
 
=== Example<ref>{{Cite web|title = ALGORITHM TO MERGE SORTED ARRAYS (Java, C++) {{!}} Algorithms and Data Structures|url = http://www.algolist.net/Algorithms/Merge/Sorted_arrays|website = www.algolist.net|accessdate = 2015-11-19}}</ref> ===
Assume that we have two arrays A[1..p] and B[1..q] sorted in non-decreasing order, and that we want to merge them into an array C[1..n] with the same order, where n = p + q.

The canonical 2-way merge algorithm<ref>{{cite book|author1=Thomas H. Cormen|authorlink1=Thomas H. Cormen|author2=Charles E. Leiserson|author3=Ronald L. Rivest|authorlink3=Ron Rivest|author4=Clifford Stein|title=Introduction To Algorithms|url=https://books.google.com/books?id=NLngYyWFl_YC&pg=PA11|year=2001|publisher=MIT Press|isbn=978-0-262-03293-3|pages=28–29}}</ref> stores indices i, j, and k into A, B, and C respectively.
Initially, these indices refer to the first element, i.e., they are 1.
If A[i] < B[j], then the algorithm copies A[i] into C[k] and increases i and k.
Otherwise, the algorithm copies B[j] into C[k] and increases j and k.
A special case arises if either i or j has reached the end of A or B.
In this case the algorithm copies the remaining elements of the other array into C and terminates.
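The algorithm above can be sketched in Python (0-indexed, as is idiomatic in the language; `merge` is a name chosen for illustration):

```python
def merge(a, b):
    """Classic 2-way merge of two sorted lists."""
    c = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    # Special case: one list is exhausted, copy the remainder of the other.
    c.extend(a[i:])
    c.extend(b[j:])
    return c
```

Each element is copied exactly once, so the running time is proportional to the sum of the input lengths.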
 
== K-Way Merge ==
The K-way merge problem consists of merging k sorted arrays to produce a single sorted array with the same elements.
Denote by n the total number of elements.
n is equal to the size of the output array and to the sum of the sizes of the k input arrays.
For simplicity, we assume that none of the input arrays is empty.
As a consequence k < n, which simplifies the reported running times.
The problem can be solved in O(n log k) running time with O(n) space.
Several algorithms that achieve this running time exist.
 
=== Iterative 2-Way Merge ===
If the number of sorted lists k is greater than two, the 2-way merge function found in merge sort can still be used to merge everything into a single sorted list.

The problem can be solved by iteratively merging two of the k arrays using a 2-way merge until only a single array is left.
[[File:HuffmanCodeAlg.png|thumb|An example of Huffman Coding, which uses the same technique as in the optimal merge. The values shown would represent each list length.]]
If the arrays are merged in an arbitrary order, the resulting worst-case running time is O(kn).
This is suboptimal.
 
Formally, let D = {n<sub>1</sub>, ..., n<sub>k</sub>} be the set of sequences to be merged. Pick n<sub>i</sub>, n<sub>j</sub> ∈ D and merge them together using the 2-way merge function. The new set of sequences is D' = (D − {n<sub>i</sub>, n<sub>j</sub>}) ∪ {n<sub>i</sub> + n<sub>j</sub>}. This process is repeated until |D| = 1. How the algorithm picks n<sub>i</sub> and n<sub>j</sub> determines the cost of the overall algorithm.

The running time can be improved by iteratively merging the first array with the second, the third with the fourth, and so on.
As the number of arrays is halved in each iteration, there are only Θ(log k) iterations.
In each iteration every element is moved exactly once.
The running time per iteration is therefore in Θ(n), where n is the total number of elements.
The total running time is therefore in Θ(n log k).
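The pairwise halving strategy can be sketched in Python (`merge2` and `merge_pairwise` are names chosen for illustration):

```python
def merge2(a, b):
    """Classic 2-way merge of two sorted lists."""
    c = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    c.extend(a[i:])
    c.extend(b[j:])
    return c

def merge_pairwise(lists):
    """Merge k sorted lists by pairing them up; the list count halves each round."""
    lists = list(lists)
    while len(lists) > 1:
        merged = []
        # Merge the first with the second, the third with the fourth, and so on.
        for i in range(0, len(lists) - 1, 2):
            merged.append(merge2(lists[i], lists[i + 1]))
        if len(lists) % 2:
            merged.append(lists[-1])  # odd list carries over to the next round
        lists = merged
    return lists[0] if lists else []
```

Every element is moved once per round and there are Θ(log k) rounds, giving the Θ(n log k) bound.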
 
==== Optimal Merge Pattern ====
We can further improve upon this algorithm by iteratively merging the two shortest arrays.
The optimal merge pattern is found by utilizing a [[Greedy algorithm]] that selects the two shortest lists to merge at each step. This technique is similar to the one used in [[Huffman coding]]. The algorithm picks n<sub>i</sub>, n<sub>j</sub> ∈ D such that |n| ≥ |n<sub>i</sub>| and |n| ≥ |n<sub>j</sub>| for all n ∈ D − {n<sub>i</sub>, n<sub>j</sub>}.<ref>{{Cite book|title = DESIGN AND ANALYSIS OF ALGORITHMS|url = https://books.google.com/books?id=jYz1AAAAQBAJ|publisher = PHI Learning Pvt. Ltd.|date = 2013-08-21|isbn = 9788120348066|first = MANAS RANJAN|last = KABAT}}</ref> By choosing the shortest lists to merge each time, the algorithm minimizes how many times each value has to be copied into a new merged list.
This strategy can therefore not be worse than the pairwise strategy described in the previous paragraph, and its running time is also in O(n log k).
In border cases the running time can be even better.
Consider for example the degenerate case where all but one array contain only one element.
The pairwise strategy explained in the previous paragraph needs Θ(n log k) running time, while the improved one only needs Θ(n) running time.
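A minimal sketch of the greedy pattern, keeping the current lists in a min-heap keyed by their lengths (the helper names and the tie-breaking counter are choices made for this sketch):

```python
import heapq
from itertools import count

def merge2(a, b):
    """Classic 2-way merge of two sorted lists."""
    c = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    c.extend(a[i:])
    c.extend(b[j:])
    return c

def optimal_merge(lists):
    """Always merge the two shortest lists first, as in Huffman coding."""
    tie = count()  # tie-breaker so the lists themselves are never compared
    heap = [(len(l), next(tie), l) for l in lists]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)  # shortest list
        _, _, b = heapq.heappop(heap)  # second-shortest list
        m = merge2(a, b)
        heapq.heappush(heap, (len(m), next(tie), m))
    return heap[0][2] if heap else []
```

The merge order mirrors how Huffman coding combines the two lowest-weight nodes at each step.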
 
=== Direct K-Way Merge ===

The basic idea of a direct k-way merge consists of efficiently computing the minimum element of all k arrays and then transferring it to the output array.
A straightforward implementation would scan all k arrays to determine the minimum.
This straightforward implementation results in a running time of Θ(kn).
We can improve upon this by computing the smallest element faster.
By using either [[Heap (data structure)|heaps]] or tournament trees, the smallest element can be determined in O(log k) time.
The resulting running times are therefore in O(n log k).

A closely related method, the ideal merge, keeps the sorted lists themselves in a priority queue ordered by the value of their head elements: the head element of the first list is repeatedly removed and appended to the output, after which that list is reinserted into the queue according to its new head element, until all lists are empty. The ideal merging technique was discussed and saw use as a part of UnShuffle Sort.<ref>'''Art S. Kagel''', ''Unshuffle Algorithm, Not Quite a Sort?'', Computer Language Magazine, 3(11), November 1985.</ref> Depending on how the priority queue is implemented, the running time can vary: if it is kept as a sorted list, inserting a list according to its new head element is done through a [[Linear search]] and the running time is Θ(kn), where n is the total number of elements and k is the number of sorted lists; if the heads are kept in a [[Heap (data structure)|heap]], the running time becomes Θ(n log k).

==== Heap ====

The heap algorithm allocates a min-heap of pointers into the input arrays.
Initially these pointers point to the smallest elements of the input arrays.
The pointers are sorted by the value that they point to.
In an O(k) preprocessing step the heap is created using the standard heapify procedure.
Afterwards, the algorithm iteratively transfers the element that the root pointer points to, increases this pointer, and executes the standard increase key procedure upon the root element.
The running time of the increase key procedure is bounded by O(log k).
As there are n elements, the total running time is O(n log k).
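The heap-based merge above can be sketched in Python with the standard library's `heapq` module; `k_way_merge` is an illustrative name (Python's built-in `heapq.merge` provides similar functionality):

```python
import heapq

def k_way_merge(lists):
    """Direct k-way merge with a min-heap of (value, array index, element index)."""
    out = []
    # O(k) preprocessing: heapify pointers to the heads of all non-empty arrays.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        value, i, j = heap[0]
        out.append(value)  # transfer the element the root pointer points to
        if j + 1 < len(lists[i]):
            # Advance the pointer and restore the heap property in O(log k).
            heapq.heapreplace(heap, (lists[i][j + 1], i, j + 1))
        else:
            heapq.heappop(heap)  # this input array is exhausted
    return out
```

Each of the n output elements costs one O(log k) heap operation, giving O(n log k) overall.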
 
==== Tournament Tree ====

The tournament tree<ref>{{cite book| last = Knuth| first = Donald| authorlink = Donald Knuth| series = [[The Art of Computer Programming]]| volume= 3| title= Sorting and Searching| edition = 2nd| publisher = Addison-Wesley| year= 1998| chapter = Chapter 5.4.1. Multiway Merging and Replacement Selection| pages = 158–168| isbn = 0-201-89685-0| ref = harv}}</ref> algorithm constructs a balanced [[binary tree]] with k leaves. Maintaining the comparison results in this tree limits the number of comparisons between the head elements of the arrays.<ref>{{Cite book|title = Fundamentals of data structures|url = https://books.google.com/books?id=kdRQAAAAMAAJ|publisher = Computer Science Press|date = 1983-01-01|isbn = 9780914894209|first = Ellis|last = Horowitz|first2 = Sartaj|last2 = Sahni}}</ref>
Every leaf stores a pointer into one of the input arrays.
Further, for every node in the tree, a value and an index are stored.
For the k-th leaf the value is the value that its pointer points to and the index is k.
For every internal node the value is the minimum of the values of the node's children.
The index of an internal node indicates which input array the value comes from.
The root of the tournament tree contains the smallest element and the index of the corresponding input array.
The algorithm iteratively transfers this element and advances the corresponding pointer.
It then updates the values of the nodes on the path from the corresponding leaf to the root.
As the tree is balanced, this path contains only Θ(log k) nodes.
As there are n elements that need to be extracted, the resulting total running time is Θ(n log k).
 
===== Example =====
Let's say we have 4 sorted arrays:<ref>{{Cite web|title = kway - Very fast merge sort - Google Project Hosting|url = https://code.google.com/p/kway/|website = code.google.com|accessdate = 2015-11-22}}</ref><blockquote><code>{5, 10, 15, 20}</code></blockquote><blockquote><code>{10, 13, 16, 19}</code></blockquote><blockquote><code>{2, 19, 26, 40}</code></blockquote><blockquote><code>{18, 22, 23, 24}</code></blockquote>We start with the heads of each array and then build a binary tree from there.
 
[[File:Binary Ideal Merge 1.png|centre|thumb]]
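A winner-tree variant of the structure described above can be sketched as follows. This is a simplified illustration (the name `tournament_merge` and the flat array layout are choices made for this sketch), not a tuned implementation:

```python
def tournament_merge(lists):
    """K-way merge using a winner (tournament) tree stored in a flat array."""
    k = len(lists)
    INF = float('inf')
    ptr = [0] * k          # per-array read pointers
    size = 1
    while size < k:        # round the leaf count up to a power of two
        size *= 2
    # tree[1] is the root; leaves occupy tree[size .. 2*size-1].
    # Each node holds (value, array index); empty slots hold (INF, -1).
    tree = [(INF, -1)] * (2 * size)
    for i in range(k):
        tree[size + i] = (lists[i][0], i) if lists[i] else (INF, -1)
    for node in range(size - 1, 0, -1):  # build: parent = min of its children
        tree[node] = min(tree[2 * node], tree[2 * node + 1])
    out = []
    total = sum(len(lst) for lst in lists)
    for _ in range(total):
        value, i = tree[1]               # root holds the overall minimum
        out.append(value)
        ptr[i] += 1                      # advance the corresponding pointer
        leaf = size + i
        tree[leaf] = (lists[i][ptr[i]], i) if ptr[i] < len(lists[i]) else (INF, -1)
        node = leaf // 2
        while node >= 1:                 # replay matches on the leaf-to-root path
            tree[node] = min(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return out
```

Only the Θ(log k) nodes on one leaf-to-root path are updated per extracted element.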
 
== External Sorting<ref>{{Cite book|title = Data Structures and Algorithm Analysis in C++, Third Edition|url = https://books.google.com/books?id=AijEAgAAQBAJ|publisher = Courier Corporation|date = 2012-07-26|isbn = 9780486172620|first = Clifford A.|last = Shaffer}}</ref> ==
K-way merges find great use in external sorting procedures. [[External sorting|External sorting algorithms]] are a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted does not fit into the main memory of a computing device (usually RAM) and must instead reside in the slower external memory (usually a hard drive). K-way merge algorithms usually take place in the second stage of external sorting algorithms, much as they do in merge sort.
 
A multiway merge, like those discussed above, allows the runs held outside of memory to be merged in fewer passes than a binary merge. If there are 6 runs that need to be merged, a binary merge needs 3 merge passes, as opposed to a 6-way merge's single merge pass. This reduction of merge passes is especially important considering the large amount of information that is usually being sorted, as it allows for greater speed-ups while also reducing the number of accesses to slower memory.
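The pass-count arithmetic can be checked with a short helper (`merge_passes` is a hypothetical function for illustration, assuming each pass combines up to k runs into one):

```python
import math

def merge_passes(runs, k):
    """Number of passes needed to merge `runs` sorted runs, k at a time."""
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / k)  # each pass leaves one run per group of k
        passes += 1
    return passes
```

For 6 runs, a binary merge takes `merge_passes(6, 2)` = 3 passes, while a 6-way merge takes `merge_passes(6, 6)` = 1.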
 
== References ==
{{Reflist}}
 
== Further reading ==
* {{cite book| last = Knuth| first = Donald| authorlink = Donald Knuth| series = [[The Art of Computer Programming]]| volume= 3| title= Sorting and Searching| edition = 2nd| publisher = Addison-Wesley| year= 1998| chapter = Section 5.2.4: Sorting by Merging| pages = 158–168| isbn = 0-201-89685-0| ref = harv}}
* {{cite book|author1=Thomas H. Cormen|authorlink1=Thomas H. Cormen|author2=Charles E. Leiserson|author3=Ronald L. Rivest|authorlink3=Ron Rivest|author4=Clifford Stein|title=Introduction To Algorithms|url=https://books.google.com/books?id=NLngYyWFl_YC&pg=PA11|year=2001|publisher=MIT Press|isbn=978-0-262-03293-3|pages=28–29}}
 
[[Category:Sorting algorithms]]