Ruzzo–Tompa algorithm: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 19:13, 17 April 2018 edit John Douglas Huff (talk \| contribs) 32 edits No edit summary ← Previous edit		Latest revision as of 08:42, 27 August 2025 edit undo Citation bot (talk \| contribs) Bots 5,863,401 edits Added bibcode. Removed URL that duplicated identifier. \| Use this bot. Report bugs. \| Suggested by Headbomb \| Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox \| #UCB_webform_linked 332/967
(40 intermediate revisions by 16 users not shown)
Line 1: The '''~~Ruzzo-Tompa~~Ruzzo–Tompa algorithm''' or the '''RT algorithm'''<ref name=":0">{{Cite journal \|last1=Spouge \|first1=John L. \|last2=Ramírez \|first2=Leonardo Mariño \|last3=Sheetlin \|first3=Sergey L. \|date=2014 \|title=Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps \|journal=International Journal of Bioinformatics Research and Applications \|language=en \|volume=10 \|issue=4/5 \|pages=384–408 \|doi=10.1504/IJBRA.2014.062991 \|issn=1744-5485 \|pmc=4135518 \|pmid=24989859}}</ref> is a ~~linear~~[[Time complexity#Linear time\|linear-time]] [[algorithm]] for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers.<ref name="ruzzo_tompa">{{cite journal\|last1=Ruzzo\|first1=Walter L.\|last2=Martin\|first2=Tompa\|title=A ~~Linear~~linear ~~Time~~time ~~Algorithm~~algorithm for ~~Finding~~finding ~~All~~all ~~Maximal~~maximal ~~Scoring~~scoring ~~Subsequences~~subsequences\|journal=Proceedings. International Conference on Intelligent Systems for Molecular Biology\|date=1999\|pages=~~234-241~~234–241\|pmid=10786306\|isbn=9781577350835\|url=https://dl.acm.org/citation.cfm?id=660812\|ref=ruzzo-tompa}}</ref> The Ruzzo–Tompa algorithm was proposed by Walter L. Ruzzo and Martin Tompa.<ref>{{Cite web \|title=A Linear Time Algorithm for Finding All Maximal Scoring Subsequences \|url=https://homes.cs.washington.edu/~ruzzo/papers/maxseq.pdf}}</ref> This algorithm is an improvement over previously known quadratic time algorithms.<ref name=":0" /> The maximum scoring subsequence from the set produced by the algorithm is also a solution to the [[~~Maximum~~maximum subarray problem]]. The Ruzzo–Tompa algorithm has applications in [[bioinformatics]],<ref name="karlin" /> [[web scraping]],<ref name="pasternack">{{cite book\|last1=Pasternack\|first1=Jeff\|last2=Roth\|first2=Dan\|title=Proceedings of the 18th international conference on World wide web \|chapter=Extracting article text from the web with maximum subsequence segmentation \|date=2009\|pages=971–980\|doi=10.1145/1526709.1526840\|isbn=9781605584874\|s2cid=346124}}</ref> and [[information retrieval]].<ref name="liang">{{cite book\|last1=Liang\|first1=Shangsong\|last2=Ren\|first2=Zhaochun\|last3=Weerkamp\|first3=Wouter\|last4=Meij\|first4=Edgar\|last5=de Rijke\|first5=Maarten\|title=Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management \|chapter=Time-Aware Rank Aggregation for Microblog Search \|date=2014\|pages=989–998\|doi=10.1145/2661829.2661905\|isbn=9781450325981\|citeseerx=10.1.1.681.6828\|s2cid=14287901}}</ref> The problem of find disjoint maximal subsequences is of practical importance in the analysis of [[DNA]]. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of [[sequence homology]]<ref name="ruzzo_tompa">{{cite journal\|last1=Karlin\|first1=S\|last2=Altschul\|first2=SF\|title=Applications and statistics for multiple high-scoring segments in molecular sequences\|journal=Proceedings of the National Academy of Sciences of the United States of America\|date=Jun 15, 1993\|volume=90\|issue=12\|pages=5873-5877\|pmid=8390686}}</ref>▼ ==Applications== ==Problem Definition==▼ ===Bioinformatics=== ▲The Ruzzo–Tompa algorithm has been used in [[Bioinformatics]] tools to study biological data. The problem of ~~find~~finding disjoint maximal subsequences is of practical importance in the analysis of [[DNA]]. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of [[sequence homology]].<ref name="~~ruzzo_tompa~~karlin">{{cite journal\|last1=Karlin\|first1=S\|last2=Altschul\|first2=SF\|title=Applications and statistics for multiple high-scoring segments in molecular sequences\|journal=Proceedings of the National Academy of Sciences of the United States of America\|date=Jun 15, 1993\|volume=90\|issue=12\|pages=~~5873-5877~~5873–5877\|pmid=8390686\|pmc=46825\|doi=10.1073/pnas.90.12.5873\|bibcode=1993PNAS...90.5873K\|doi-access=free}}</ref> The algorithm is used in [[sequence alignment]] which is used as a method of identifying similar [[DNA]], [[RNA]], or [[protein]] sequences.<ref>{{Cite book \|last1=Spouge \|first1=John L. \|last2=Mariño-Ramírez \|first2=Leonardo \|last3=Sheetlin \|first3=Sergey L. \|title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) \|chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice \|year=2012 \|pages=1–6 \|doi=10.1109/ICCABS.2012.6182645\|isbn=978-1-4673-1321-6 \|s2cid=14584619 }}</ref> Accounting for the ordering of pairs of high-scoring subsequences in two sequences creates better sequence alignments. This is because the biological model suggests that separate high-scoring subsequence pairs arise from insertions or deletions within a matching region. Requiring consistent ordering of high-scoring subsequence pairs increases their statistical significance.<ref name="karlin" /> ==Algorithm==▼ ===Web scraping=== The Ruzzo–Tompa algorithm is used in [[Web scraping]] to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first [[Lexical analysis#Tokenization\|tokenized]] and the score for each token is found using local, token-level classifiers.<ref>{{Cite web \|date=2021-07-30 \|title=Web Scraping: Everything You Need To Know \|url=https://datamam.com/web-scraping/ \|access-date=2023-02-16 \|website=Datamam \|language=en-US}}</ref> A modified version of the Ruzzo–Tompa algorithm is then used to find the k highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.<ref name="pasternack" /> ===Information retrieval=== The Ruzzo–Tompa algorithm has been used in [[Information retrieval]] search algorithms. Liang et al. proposed a [[data fusion]] method to combine the search results of several microblog search algorithms. In their method, the Ruzzo–Tompa algorithm is used to detect [[Bursting\|bursts]] of information.<ref name="liang" /> ▲==Problem ~~Definition~~definition== The problem of finding all maximal subsequences is defined as follows: Given a list of real numbered scores <math>x_1,x_2,\ldots,x_n</math>, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence <math>S_{i,j} = \sum_{i\leq k\leq j} x_k</math>. The subsequences must be disjoint (non-overlapping) and have a positive score.<ref>{{Cite book \|last1=Spouge \|first1=John L. \|last2=Mariño-Ramírez \|first2=Leonardo \|last3=Sheetlin \|first3=Sergey L. \|title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) \|chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice \|year=2012 \|pages=1–6 \|doi=10.1109/ICCABS.2012.6182645\|isbn=978-1-4673-1321-6 \|s2cid=14584619 }}</ref> ==Other algorithms== There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use existing, linear time algorithms to find the maximum subsequence (see [[maximum subarray problem]]) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of [[Quicksort]]: The maximum subsequence could be small in comparison to the rest of sequence, leading to a running time of <math>O(n^2)</math> in the worst case. ▲==Algorithm== The standard implementation of the Ruzzo-Tompa algorithm runs in <math>O(n)</math> time and uses <math>O(n)</math> space, where <math>n</math> is the length of the list of scores. The algorithm uses [[dynamic programming]] to progressively build the final solution by incrementally solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows:▼ [[File:Animation of Ruzzo-Tompa Algorithm.ogv\|300px\|thumb\|This animation shows the Ruzzo–Tompa algorithm running with an input sequence of 11 integers each represented by a line segment in the graph. Segments with bold lines represent maximal segments found so far. The animation shows the state of <math>I, R </math> and <math>L</math> at each step. Below that it shows the current state the algorithm which correspond to steps 1–4 in the [[#Algorithm\|Algorithm]] section of this page. The red highlight shows the algorithm finding a value for <math>j</math> in steps 1 and 3. If the value of <math>j</math> satisfies the inequalities in those steps the highlight turns green. : Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list <math>I_1,I_2,...,I_j</math> of disjoint subsequences. For each subsequence <math>I_j</math>, record the cumulative total <math>L_j</math> of all scores up to but not including the leftmost score of <math>I_j</math>, and the total <math>R_j</math> up to and including the rightmost score of <math>I_j</math>.▼ At the end of the animation, the maximal subsequences will be bolded and displayed in <math>I</math>.<ref name=":0" /> : The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores are require no special processing, so the next score is read. A positive score is incorporated into a new sub-sequence <math>I_k</math> of length one that is then integrated into the list by the following process.▼ ]] ▲The standard implementation of the ~~Ruzzo-Tompa~~Ruzzo–Tompa algorithm runs in <math>O(n)</math> time and uses ~~<math>~~''O''(''n'')~~</math>~~ space, where ~~<math>~~''n~~</math>~~'' is the length of the list of scores. The algorithm uses [[dynamic programming]] to progressively build the final solution by incrementally solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows: ▲: Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list <math>I_1,I_2,~~...~~\ldots,I_j</math> of disjoint subsequences. For each subsequence <math>I_j</math>, record the cumulative total <math>L_j</math> of all scores up to but not including the leftmost score of <math>I_j</math>, and the total <math>R_j</math> up to and including the rightmost score of <math>I_j</math>. ▲: The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores ~~are~~ require no special processing, so the next score is read. A positive score is incorporated into a new sub-sequence <math>I_k</math> of length one that is then integrated into the list by the following process.: # The list <math>I</math> is searched from right to left for the maximum value of <math>j</math> satisfying <math>L_j<L_k</math> # If there is no such <math>j</math>, then add <math>I_k</math> to the end of the list. # If there is such a <math>j</math>, and <math>R_j \geq R_k</math>, then add <math>I_k</math> to the end of the list. # Otherwise (i.e., there is such a j, but <math>R_j < R_k</math>), extend the subsequence <math>I_k</math> to the left to encompass everything up to and including the leftmost score in <math>I_j</math>. Delete subsequences <math>I_j,I_j+1,~~...~~\ldots,I_k-1</math> from the list, and append <math>I_k</math> to the end of the list. Reconsider the newly extended subsequence <math>I_k</math> (now renumbered <math>I_j</math>) as in step 1. :Once the end of the input is reached, all subsequences remaining on the list <math>I</math> are maximal.<ref name="ruzzo_tompa">< /~~ref~~> The following [[Python (programming language)\|Python]] code implements the ~~Ruzzo-Tompa~~Ruzzo–Tompa algorithm:▼ As example of the algorithm running, consider the input score list <math>L=[4,-5,3,-2,1,2]</math>. On input <math>4</math> in step 1, no satisfying <math>j</math> is found so step 2 applies and <math>[4]</math> is appended to <math>I</math>, so <math>I=[[4]], R=[4], L=[0]</math>. The input <math>-5</math> is skipped and the input <math>3</math> is read. In step 1, no satisfying <math>j</math> is found so <math>[3]</math> is appended to <math>I</math>, so <math>I=[[4],[3]], R=[4,2], L=[0,-1]</math>. The input <math>-2</math> is skipped and the input <math>1</math> is read. In step 1, a value of <math>j=1</math> is found to satisfy <math>L[j]<L[k]</math>, so step 3 applies. In step 3, <math>R[j]\ngeq R[k]</math>, so <math>[1] is appended to <math>I</math>, and now <math>I=[[4],[3],[1]], R=[4,2,1], L=[0,-1,0]</math>. <~~source~~syntaxhighlight lang="python" line="1">▼ ▲The following [[Python (programming language)\|Python]] code implements the Ruzzo-Tompa algorithm: def ~~RuzzoTompa~~ruzzo_tompa(scores):▼ """Ruzzo–Tompa algorithm.""" ▲<source lang="python" line="1"> ~~total~~k = 0;▼ ▲def RuzzoTompa(scores): ktotal = 0 ▲ total = 0; # Allocating arrays of size n I, L, R, Lidx = [[0] * len(scores) for _ in range(4)] for i, s in enumerate(scores): total += s if s > 0: # store I[k] by (start,end) indices of scores I[k] = (i, i + 1) Lidx[k] = i L[k] = total - s R[k] = total while( True): maxj = None for j in range(k - 1, -1, -1): if L[j] < L[k]: maxj = j break; if maxj !=is not None and R[maxj] < R[k]: I[maxj] = (Lidx[maxj], i + 1) R[maxj] = total k = maxj else: k += 1 break; # Getting maximal subsequences using stored indices return [scores[I[l][0] : I[l][1]] for l in range(k)] </syntaxhighlight> ~~</source>~~ == See also == * [[Maximum subarray problem]] * [[Quicksort]] == References == ~~<!-- Inline citations added to your article will automatically display here. See https://en.wikipedia.org/wiki/WP:REFB for instructions on how to add citations. -->~~ {{reflist}} == Further reading == * {{cite journal \| last1=Ali \| first1=Syed Arslan \| last2=Raza \| first2=Basit \| last3=Malik \| first3=Ahmad Kamran \| last4=Shahid \| first4=Ahmad Raza \| last5=Faheem \| first5=Muhammad \| last6=Alquhayz \| first6=Hani \| last7=Kumar \| first7=Yogan Jaya \| title=An Optimally Configured and Improved Deep Belief Network (OCI-DBN) Approach for Heart Disease Prediction Based on Ruzzo–Tompa and Stacked Genetic Algorithm \| journal=IEEE Access \| publisher=Institute of Electrical and Electronics Engineers (IEEE) \| volume=8 \| year=2020 \| issn=2169-3536 \| doi=10.1109/access.2020.2985646 \| pages=65947–65958\| bibcode=2020IEEEA...865947A \| s2cid=215817246 \| doi-access=free }} {{DEFAULTSORT:Ruzzo-Tompa algorithm}} [[Category:Optimization algorithms and methods]] [[Category:Dynamic programming]] [[Category:Articles with example Python (programming language) code]]