Ruzzo–Tompa algorithm

The Ruzzo-Tompa algorithm is a linear time algorithm for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers^[1]. This algorithm is an improvement over previously known quadratic time algorithms. The maximum scoring subsequence from the set produced by the algorithm is a solution to the Maximum subarray problem.

The problem of find disjoint maximal subsequences is of practical importance in the analysis of DNA. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of sequence homology^[2].

Algorithm

The standard implementation of the Ruzzo-Tompa algorithm runs in $O(n)$ time and uses $O(n)$ space, where $n$ is the length of the list of scores. The description of the algorithm provided by Ruzzo and Tompa is as follows:

Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list

I_{1},I_{2},...,I_{j}

of disjoint subsequences. For each subsequence

I_{j},recordthecumulativetotal<math>L_{j}

of all scores up to but not including the leftmost score of

I_{j}

, and the total

R_{j}

up to and including the rightmost score of

I_{j}

.

The following Python code implements the Ruzzo-Tompa algorithm:

def RuzzoTompa(scores):
	k=0
	total = 0;
	# Allocating arrays of size n
	I,L,R,Lidx = [[0]*len(scores) for _ in range(4)]
	for i,s in enumerate(scores):
		total += s
		if s > 0:
			I[k] = (i,i+1)
			Lidx[k] = i
			L[k] = total-s
			R[k] = total
			while(True):
				maxj = None
				for j in range(k-1,-1,-1):
					if L[j] < L[k]:
						maxj = j
						break;
				if maxj != None and R[maxj] < R[k]:
					I[maxj] = (Lidx[maxj],i+1)
					R[maxj] = total
					k = maxj
				else:
					k+=1
					break;
	
	return [scores[I[l][0]:I[l][1]] for l in range(k)]

References

^ Ruzzo, Walter L.; Martin, Tompa (1999). "A Linear Time Algorithm for Finding All Maximal Scoring Subsequences". Proceedings. International Conference on Intelligent Systems for Molecular Biology: 234–241. PMID 10786306.
^ Karlin, S; Altschul, SF (Jun 15, 1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. PMID 8390686.

[1] Ruzzo, Walter L.; Martin, Tompa (1999). "A Linear Time Algorithm for Finding All Maximal Scoring Subsequences". Proceedings. International Conference on Intelligent Systems for Molecular Biology: 234–241. PMID 10786306.

[2] Karlin, S; Altschul, SF (Jun 15, 1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. PMID 8390686.

[1]

[2]