Merge algorithm

Merge algorithms are a family of algorithms that run sequentially over multiple sorted lists, typically producing more sorted lists as output. This is well-suited for machines with tape drives. Use has declined due to large random access memories, and many applications of merge algorithms have faster alternatives when a random-access memory is available.^{[citation needed]}

The general merge algorithm has a set of pointers p_0..n that point to positions in a set of lists L_0..n. Initially they point to the first item in each list. The algorithm is as follows:

While any of p_0..n still point to data inside of L_0..n instead of past the end:

do something with the data items p_0..n point to in their respective lists
find out which of those pointers points to the item with the lowest key; advance one of those pointers to the next item in its list

Pseudocode

A simple non-recursive pseudocode implementation of merge with two lists might be written as follows:

function merge(a, b)
	var list result
	var int i, j := 0
	
	while (i < length(a)) and (j < length(b))
		if a[i] = b[j]
		       add a[i] to result  // In case list a and b are identical, each holding only
                                           // one element, say x, this algorithm would output the list
                                           // consisting of a single x instead of the list holding two
                                           // x's which is at least counter-intuitive.
                       i := i + 1
                       j := j + 1
		else if a[i] < b[j]
			add a[i] to result
                       i := i + 1
		else
			add b[j] to result
			j := j + 1

	while i < length(a)
		add a[i] to result
		i := i + 1
	while j < length(b)
		add b[j] to result
		j := j + 1
	
	return result

Analysis

Merge algorithms generally run in time proportional to the sum of the lengths of the lists; merge algorithms that operate on large numbers of lists at once will multiply the sum of the lengths of the lists by the time to figure out which of the pointers points to the lowest item, which can be accomplished with a heap-based priority queue in O(log n) time, for O(m log n) time, where n is the number of lists being merged and m is the sum of the lengths of the lists. When merging two lists of length m, there is a lower bound of 2m − 1 comparisons required in the worst case.

The classic merge (the one used in merge sort) outputs the data item with the lowest key at each step; given some sorted lists, it produces a sorted list containing all the elements in any of the input lists, and it does so in time proportional to the sum of the lengths of the input lists.

In parallel computing, arrays of sorted values may be merged efficiently using an all nearest smaller values computation.^[1]

Uses

Merge can also be used for a variety of other things:

given a set of current account balances and a set of transactions, both sorted by account number, produce the set of new account balances after the transactions are applied; this requires always advancing the "new transactions" pointer in preference to the "account number" pointer when the two have equal keys, and adding all the numbers on either tape with the same account number to produce the new balance.
produce a sorted list of records with keys present in all the lists (equijoin); this requires outputting a record whenever the keys of all the p_0..n are equal.
similarly for finding the largest number on one tape smaller than each number on another tape (e.g. to figure out what tax bracket each person is in).
similarly for computing set differences: all the records in one list with no corresponding records in another.
insertion/update operations in search engines to produce an inverted index
merging plays a central role in Dijkstra's functional programming technique for generating regular numbers

Language support

Few languages have built-in support for merging. An exception is C++'s STL: it has the function std::merge, which merges two sorted ranges of iterators, and std::inplace_merge, which merges two consecutive sorted ranges in-place. In addition, the std::list (linked list) class has its own merge method which merges another list into itself. The type of the elements merged must support the less-than (<) operator, or it must be provided with a custom comparator.

Also, PHP has the array_merge() and array_merge_recursive() functions.

Notes

^ Berkman, Omer; Schieber, Baruch; Vishkin, Uzi (1993), "Optimal double logarithmic parallel algorithms based on finding all nearest smaller values", Journal of Algorithms, 14 (3): 344–370, doi:10.1006/jagm.1993.1018.

References

Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 158–160 of section 5.2.4: Sorting by Merging. Section 5.3.2: Minimum-Comparison Merging, pp.197–207.

External links

Merge algorithm for sorted arrays (with Java and C++ implementations)

[1] Berkman, Omer; Schieber, Baruch; Vishkin, Uzi (1993), "Optimal double logarithmic parallel algorithms based on finding all nearest smaller values", Journal of Algorithms, 14 (3): 344–370, doi:10.1006/jagm.1993.1018.

[1]