Ruzzo–Tompa algorithm

The '''Ruzzo–Tompa algorithm''' or the '''RT algorithm'''<ref name=":0">{{Cite journal |last1=Spouge |first1=John L. |last2=Mariño-Ramírez |first2=Leonardo |last3=Sheetlin |first3=Sergey L. |date=2014 |title=Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps |journal=International Journal of Bioinformatics Research and Applications |language=en |volume=10 |issue=4/5 |pages=384–408 |doi=10.1504/IJBRA.2014.062991 |issn=1744-5485 |pmc=4135518 |pmid=24989859}}</ref> is a [[Time complexity#Linear time|linear-time]] [[algorithm]] for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers.<ref name="ruzzo_tompa">{{cite journal|last1=Ruzzo|first1=Walter L.|last2=Tompa|first2=Martin|title=A linear time algorithm for finding all maximal scoring subsequences|journal=Proceedings. International Conference on Intelligent Systems for Molecular Biology|date=1999|pages=234–241|pmid=10786306|isbn=9781577350835|url=https://dl.acm.org/citation.cfm?id=660812|ref=ruzzo-tompa}}</ref> The algorithm was proposed by Walter L. Ruzzo and Martin Tompa.<ref>{{Cite web |title=A Linear Time Algorithm for Finding All Maximal Scoring Subsequences |url=https://homes.cs.washington.edu/~ruzzo/papers/maxseq.pdf}}</ref> It is an improvement over previously known quadratic-time algorithms.<ref name=":0" /> The maximum scoring subsequence from the set produced by the algorithm is also a solution to the [[maximum subarray problem]].

The Ruzzo–Tompa algorithm has applications in [[bioinformatics]],<ref name="karlin" /> [[web scraping]],<ref name="pasternack">{{cite book|last1=Pasternack|first1=Jeff|last2=Roth|first2=Dan|title=Proceedings of the 18th international conference on World wide web |chapter=Extracting article text from the web with maximum subsequence segmentation |date=2009|pages=971–980|doi=10.1145/1526709.1526840|isbn=9781605584874|s2cid=346124}}</ref> and [[information retrieval]].<ref name="liang">{{cite book|last1=Liang|first1=Shangsong|last2=Ren|first2=Zhaochun|last3=Weerkamp|first3=Wouter|last4=Meij|first4=Edgar|last5=de Rijke|first5=Maarten|title=Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management |chapter=Time-Aware Rank Aggregation for Microblog Search |date=2014|pages=989–998|doi=10.1145/2661829.2661905|isbn=9781450325981|citeseerx=10.1.1.681.6828|s2cid=14287901}}</ref>

==Applications==

===Bioinformatics===
The Ruzzo–Tompa algorithm has been used in [[bioinformatics]] tools to study biological data. The problem of finding disjoint maximal subsequences is of practical importance in the analysis of [[DNA]]. Maximal subsequence algorithms have been used in the identification of transmembrane segments and the evaluation of [[sequence homology]].<ref name="karlin">{{cite journal|last1=Karlin|first1=S|last2=Altschul|first2=SF|title=Applications and statistics for multiple high-scoring segments in molecular sequences|journal=Proceedings of the National Academy of Sciences of the United States of America|date=Jun 15, 1993|volume=90|issue=12|pages=5873–5877|pmid=8390686|pmc=46825|doi=10.1073/pnas.90.12.5873|bibcode=1993PNAS...90.5873K|doi-access=free}}</ref>

The algorithm is used in [[sequence alignment]], a method of identifying similar [[DNA]], [[RNA]], or [[protein]] sequences.<ref>{{Cite book |last1=Spouge |first1=John L. |last2=Mariño-Ramírez |first2=Leonardo |last3=Sheetlin |first3=Sergey L. |title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) |chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice |year=2012 |pages=1–6 |doi=10.1109/ICCABS.2012.6182645|isbn=978-1-4673-1321-6 |s2cid=14584619 }}</ref> Accounting for the ordering of pairs of high-scoring subsequences in two sequences creates better sequence alignments. This is because the biological model suggests that separate high-scoring subsequence pairs arise from insertions or deletions within a matching region. Requiring consistent ordering of high-scoring subsequence pairs increases their statistical significance.<ref name="karlin" />

===Web scraping===
The Ruzzo–Tompa algorithm is used in [[web scraping]] to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first [[Lexical analysis#Tokenization|tokenized]] and the score for each token is found using local, token-level classifiers.<ref>{{Cite web |date=2021-07-30 |title=Web Scraping: Everything You Need To Know |url=https://datamam.com/web-scraping/ |access-date=2023-02-16 |website=Datamam |language=en-US}}</ref> A modified version of the Ruzzo–Tompa algorithm is then used to find the ''k'' highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.<ref name="pasternack" />

===Information retrieval===
The Ruzzo–Tompa algorithm has been used in [[information retrieval]] search algorithms. Liang et al. proposed a [[data fusion]] method to combine the search results of several microblog search algorithms.
In their method, the Ruzzo–Tompa algorithm is used to detect [[Bursting|bursts]] of information.<ref name="liang" />

==Problem definition==
The problem of finding all maximal subsequences is defined as follows: Given a list of real-valued scores <math>x_1,x_2,\ldots,x_n</math>, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence is <math>S_{i,j} = \sum_{i\leq k\leq j} x_k</math>. The subsequences must be disjoint (non-overlapping) and have a positive score.<ref>{{Cite book |last1=Spouge |first1=John L. |last2=Mariño-Ramírez |first2=Leonardo |last3=Sheetlin |first3=Sergey L. |title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) |chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice |year=2012 |pages=1–6 |doi=10.1109/ICCABS.2012.6182645|isbn=978-1-4673-1321-6 |s2cid=14584619 }}</ref>

==Other algorithms==
There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use existing linear-time algorithms to find the maximum subsequence (see [[maximum subarray problem]]) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of [[Quicksort]]: the maximum subsequence could be small in comparison to the rest of the sequence, leading to a running time of <math>O(n^2)</math> in the worst case.

==Algorithm==
[[File:Animation of Ruzzo-Tompa Algorithm.ogv|300px|thumb|This animation shows the Ruzzo–Tompa algorithm running with an input sequence of 11 integers, each represented by a line segment in the graph. Segments with bold lines represent maximal segments found so far. The animation shows the state of <math>I, R </math> and <math>L</math> at each step. Below that it shows the current state of the algorithm, which corresponds to steps 1–4 in the [[#Algorithm|Algorithm]] section of this page. The red highlight shows the algorithm finding a value for <math>j</math> in steps 1 and 3. If the value of <math>j</math> satisfies the inequalities in those steps, the highlight turns green. At the end of the animation, the maximal subsequences will be bolded and displayed in <math>I</math>.<ref name=":0" />]]

The standard implementation of the Ruzzo–Tompa algorithm runs in <math>O(n)</math> time and uses ''O''(''n'') space, where ''n'' is the length of the list of scores. The algorithm uses [[dynamic programming]] to build the final solution by incrementally solving progressively larger subsets of the problem.
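For comparison with the linear-time method, the recursive approach described under "Other algorithms" can be sketched roughly as follows. This is an illustrative sketch rather than code from the cited sources: the function names <code>best_span</code> and <code>maximal_subsequences_recursive</code> are hypothetical, the input is an arbitrary example, and the tie-breaking rule (restarting a run whenever its sum is not positive, so that the shortest of several equal-scoring candidates is chosen) is an assumption.

```python
def best_span(scores, lo, hi):
    """Return (start, end) of the highest-scoring positive run in scores[lo:hi].

    Uses a single left-to-right pass in the style of the maximum subarray
    problem. Returns None if no run has a positive score.
    """
    best_sum, best = 0, None
    run_sum, run_start = 0, lo
    for i in range(lo, hi):
        if run_sum <= 0:
            # A non-positive prefix cannot improve a run, so restart here.
            run_sum, run_start = scores[i], i
        else:
            run_sum += scores[i]
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def maximal_subsequences_recursive(scores, lo=0, hi=None):
    """Find the best run, then recurse on the parts to its left and right.

    Each call scans its whole range, so the worst case is quadratic.
    """
    if hi is None:
        hi = len(scores)
    span = best_span(scores, lo, hi)
    if span is None:
        return []
    start, end = span
    left = maximal_subsequences_recursive(scores, lo, start)
    right = maximal_subsequences_recursive(scores, end, hi)
    return left + [scores[start:end]] + right


example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
print(maximal_subsequences_recursive(example))
# prints [[4], [3], [1, 2, -2, 2, -2, 1, 5]]
```

Because every call rescans its range and the recursion can be as deep as the input is long, this sketch is suitable only for small inputs.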
The description of the algorithm provided by Ruzzo and Tompa is as follows:

: Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list <math>I_1,I_2,\ldots,I_j</math> of disjoint subsequences. For each subsequence <math>I_j</math>, record the cumulative total <math>L_j</math> of all scores up to but not including the leftmost score of <math>I_j</math>, and the total <math>R_j</math> up to and including the rightmost score of <math>I_j</math>.
: The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores require no special processing, so the next score is read. A positive score is incorporated into a new subsequence <math>I_k</math> of length one that is then integrated into the list by the following process:
# The list <math>I</math> is searched from right to left for the maximum value of <math>j</math> satisfying <math>L_j<L_k</math>.
# If there is no such <math>j</math>, then add <math>I_k</math> to the end of the list.
# If there is such a <math>j</math>, and <math>R_j \geq R_k</math>, then add <math>I_k</math> to the end of the list.
# Otherwise (i.e., there is such a <math>j</math>, but <math>R_j < R_k</math>), extend the subsequence <math>I_k</math> to the left to encompass everything up to and including the leftmost score in <math>I_j</math>. Delete subsequences <math>I_j,I_{j+1},\ldots,I_{k-1}</math> from the list, and append <math>I_k</math> to the end of the list. Reconsider the newly extended subsequence <math>I_k</math> (now renumbered <math>I_j</math>) as in step 1.
: Once the end of the input is reached, all subsequences remaining on the list <math>I</math> are maximal.<ref name="ruzzo_tompa" />

The following [[Python (programming language)|Python]] code implements the Ruzzo–Tompa algorithm:

<syntaxhighlight lang="python" line="1">
def ruzzo_tompa(scores):
    """Ruzzo–Tompa algorithm."""
    k = 0      # number of subsequences currently on the list
    total = 0  # cumulative sum of the scores read so far
    # Allocating arrays of size n
    I, L, R, Lidx = [[0] * len(scores) for _ in range(4)]
    for i, s in enumerate(scores):
        total += s
        if s > 0:
            # store I[k] by (start,end) indices of scores
            I[k] = (i, i + 1)
            Lidx[k] = i
            L[k] = total - s
            R[k] = total
            while True:
                # Step 1: search right to left for the largest j with L[j] < L[k]
                # (a plain scan, kept for simplicity)
                maxj = None
                for j in range(k - 1, -1, -1):
                    if L[j] < L[k]:
                        maxj = j
                        break
                if maxj is not None and R[maxj] < R[k]:
                    # Step 4: merge I[maxj..k] into one subsequence and reconsider it
                    I[maxj] = (Lidx[maxj], i + 1)
                    R[maxj] = total
                    k = maxj
                else:
                    # Steps 2 and 3: keep I[k] at the end of the list
                    k += 1
                    break
    # Getting maximal subsequences using stored indices
    return [scores[I[l][0] : I[l][1]] for l in range(k)]
</syntaxhighlight>

== See also ==
* [[Maximum subarray problem]]
* [[Quicksort]]

== References ==
{{reflist}}

== Further reading ==
* {{cite journal | last1=Ali | first1=Syed Arslan | last2=Raza | first2=Basit | last3=Malik | first3=Ahmad Kamran | last4=Shahid | first4=Ahmad Raza | last5=Faheem | first5=Muhammad | last6=Alquhayz | first6=Hani | last7=Kumar | first7=Yogan Jaya | title=An Optimally Configured and Improved Deep Belief Network (OCI-DBN) Approach for Heart Disease Prediction Based on Ruzzo–Tompa and Stacked Genetic Algorithm | journal=IEEE Access | publisher=Institute of Electrical and Electronics Engineers (IEEE) | volume=8 | year=2020 | issn=2169-3536 | doi=10.1109/access.2020.2985646 | pages=65947–65958| bibcode=2020IEEEA...865947A | s2cid=215817246 | doi-access=free }}

{{DEFAULTSORT:Ruzzo-Tompa algorithm}}
[[Category:Optimization algorithms and methods]]
[[Category:Dynamic programming]]
[[Category:Articles with example Python (programming language) code]]