Content deleted Content added
Citation bot (talk | contribs) Added bibcode. Removed URL that duplicated identifier. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 332/967 |
|||
(37 intermediate revisions by 16 users not shown) | |||
Line 1:
The '''
The Ruzzo–Tompa algorithm has applications in [[bioinformatics]],<ref name="karlin" /> [[web scraping]],<ref name="pasternack">{{cite book|last1=Pasternack|first1=Jeff|last2=Roth|first2=Dan|title=Proceedings of the 18th international conference on World wide web |chapter=Extracting article text from the web with maximum subsequence segmentation |date=2009|pages=971–980|doi=10.1145/1526709.1526840|isbn=9781605584874|s2cid=346124}}</ref> and [[information retrieval]].<ref name="liang">{{cite book|last1=Liang|first1=Shangsong|last2=Ren|first2=Zhaochun|last3=Weerkamp|first3=Wouter|last4=Meij|first4=Edgar|last5=de Rijke|first5=Maarten|title=Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management |chapter=Time-Aware Rank Aggregation for Microblog Search |date=2014|pages=989–998|doi=10.1145/2661829.2661905|isbn=9781450325981|citeseerx=10.1.1.681.6828|s2cid=14287901}}</ref>
The problem of find disjoint maximal subsequences is of practical importance in the analysis of [[DNA]]. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of [[sequence homology]]<ref name="ruzzo_tompa">{{cite journal|last1=Karlin|first1=S|last2=Altschul|first2=SF|title=Applications and statistics for multiple high-scoring segments in molecular sequences|journal=Proceedings of the National Academy of Sciences of the United States of America|date=Jun 15, 1993|volume=90|issue=12|pages=5873-5877|pmid=8390686}}</ref>▼
==Applications==
==Problem Definition==▼
===Bioinformatics===
▲The Ruzzo–Tompa algorithm has been used in [[Bioinformatics]] tools to study biological data. The problem of
The algorithm is used in [[sequence alignment]] which is used as a method of identifying similar [[DNA]], [[RNA]], or [[protein]] sequences.<ref>{{Cite book |last1=Spouge |first1=John L. |last2=Mariño-Ramírez |first2=Leonardo |last3=Sheetlin |first3=Sergey L. |title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) |chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice |year=2012 |pages=1–6 |doi=10.1109/ICCABS.2012.6182645|isbn=978-1-4673-1321-6 |s2cid=14584619 }}</ref> Accounting for the ordering of pairs of high-scoring subsequences in two sequences creates better sequence alignments. This is because the biological model suggests that separate high-scoring subsequence pairs arise from insertions or deletions within a matching region. Requiring consistent ordering of high-scoring subsequence pairs increases their statistical significance.<ref name="karlin" />
==Algorithm==▼
===Web scraping===
The Ruzzo–Tompa algorithm is used in [[Web scraping]] to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first [[Lexical analysis#Tokenization|tokenized]] and the score for each token is found using local, token-level classifiers.<ref>{{Cite web |date=2021-07-30 |title=Web Scraping: Everything You Need To Know |url=https://datamam.com/web-scraping/ |access-date=2023-02-16 |website=Datamam |language=en-US}}</ref> A modified version of the Ruzzo–Tompa algorithm is then used to find the k highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.<ref name="pasternack" />
===Information retrieval===
The Ruzzo–Tompa algorithm has been used in [[Information retrieval]] search algorithms. Liang et al. proposed a [[data fusion]] method to combine the search results of several microblog search algorithms. In their method, the Ruzzo–Tompa algorithm is used to detect [[Bursting|bursts]] of information.<ref name="liang" />
The problem of finding all maximal subsequences is defined as follows: Given a list of real numbered scores <math>x_1,x_2,\ldots,x_n</math>, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence <math>S_{i,j} = \sum_{i\leq k\leq j} x_k</math>. The subsequences must be disjoint (non-overlapping) and have a positive score.<ref>{{Cite book |last1=Spouge |first1=John L. |last2=Mariño-Ramírez |first2=Leonardo |last3=Sheetlin |first3=Sergey L. |title=2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) |chapter=The ruzzo-tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice |year=2012 |pages=1–6 |doi=10.1109/ICCABS.2012.6182645|isbn=978-1-4673-1321-6 |s2cid=14584619 }}</ref>
==Other algorithms==
There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use existing, linear time algorithms to find the maximum subsequence (see [[maximum subarray problem]]) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of [[Quicksort]]: The maximum subsequence could be small in comparison to the rest of sequence, leading to a running time of <math>O(n^2)</math> in the worst case.
▲==Algorithm==
The standard implementation of the Ruzzo-Tompa algorithm runs in <math>O(n)</math> time and uses <math>O(n)</math> space, where <math>n</math> is the length of the list of scores. The algorithm uses [[dynamic programming]] to progressively build the final solution by incrementally solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows:▼
[[File:Animation of Ruzzo-Tompa Algorithm.ogv|300px|thumb|This animation shows the Ruzzo–Tompa algorithm running with an input sequence of 11 integers each represented by a line segment in the graph. Segments with bold lines represent maximal segments found so far. The animation shows the state of <math>I, R </math> and <math>L</math> at each step. Below that it shows the current state the algorithm which correspond to steps 1–4 in the [[#Algorithm|Algorithm]] section of this page. The red highlight shows the algorithm finding a value for <math>j</math> in steps 1 and 3. If the value of <math>j</math> satisfies the inequalities in those steps the highlight turns green.
: Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list <math>I_1,I_2,...,I_j</math> of disjoint subsequences. For each subsequence <math>I_j</math>, record the cumulative total <math>L_j</math> of all scores up to but not including the leftmost score of <math>I_j</math>, and the total <math>R_j</math> up to and including the rightmost score of <math>I_j</math>.▼
At the end of the animation, the maximal subsequences will be bolded and displayed in <math>I</math>.<ref name=":0" />
: The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores are require no special processing, so the next score is read. A positive score is incorporated into a new sub-sequence <math>I_k</math> of length one that is then integrated into the list by the following process.▼
]]
▲The standard implementation of the
▲: Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list <math>I_1,I_2,
▲: The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores
# The list <math>I</math> is searched from right to left for the maximum value of <math>j</math> satisfying <math>L_j<L_k</math>
# If there is no such <math>j</math>, then add <math>I_k</math> to the end of the list.
# If there is such a <math>j</math>, and <math>R_j \geq R_k</math>, then add <math>I_k</math> to the end of the list.
# Otherwise (i.e., there is such a j, but <math>R_j < R_k</math>), extend the subsequence <math>I_k</math> to the left to encompass everything up to and including the leftmost score in <math>I_j</math>. Delete subsequences <math>I_j,I_j+1,
:Once the end of the input is reached, all subsequences remaining on the list <math>I</math> are maximal.<ref name="ruzzo_tompa"
The following [[Python (programming language)|Python]] code implements the
▲The following [[Python (programming language)|Python]] code implements the Ruzzo-Tompa algorithm:
"""Ruzzo–Tompa algorithm."""
▲<source lang="python" line="1">
▲def RuzzoTompa(scores):
▲ total = 0;
# Allocating arrays of size n
I, L, R, Lidx = [[0] * len(scores) for _ in range(4)]
for i, s in enumerate(scores):
# Getting maximal subsequences using stored indices
return [scores[I[l][0] : I[l][1]] for l in range(k)]
</syntaxhighlight>
== See also ==
* [[Maximum subarray problem]]
* [[Quicksort]]
== References ==
{{reflist}}
== Further reading ==
* {{cite journal | last1=Ali | first1=Syed Arslan | last2=Raza | first2=Basit | last3=Malik | first3=Ahmad Kamran | last4=Shahid | first4=Ahmad Raza | last5=Faheem | first5=Muhammad | last6=Alquhayz | first6=Hani | last7=Kumar | first7=Yogan Jaya | title=An Optimally Configured and Improved Deep Belief Network (OCI-DBN) Approach for Heart Disease Prediction Based on Ruzzo–Tompa and Stacked Genetic Algorithm | journal=IEEE Access | publisher=Institute of Electrical and Electronics Engineers (IEEE) | volume=8 | year=2020 | issn=2169-3536 | doi=10.1109/access.2020.2985646 | pages=65947–65958| bibcode=2020IEEEA...865947A | s2cid=215817246 | doi-access=free }}
{{DEFAULTSORT:Ruzzo-Tompa algorithm}}
[[Category:Optimization algorithms and methods]]
[[Category:Dynamic programming]]
[[Category:Articles with example Python (programming language) code]]
|