Ruzzo–Tompa algorithm: Difference between revisions

 
===Web Scraping===
The Ruzzo–Tompa algorithm is used in [[Web scraping]] to extract blocks of text from web pages. Pasternack and Roth proposed a method for extracting the important blocks of text from HTML documents. Each web page is first [[Lexical_analysis#Tokenization|tokenized]], and a score is assigned to each token by local, token-level classifiers. A modified version of the Ruzzo–Tompa algorithm is then used to find the ''k'' highest-valued subsequences of tokens, which serve as predictions of the important blocks of text in the article.<ref name="pasternack" />
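
The scoring-and-selection step can be illustrated with a minimal Python sketch. It assumes the per-token scores are already available as a list of numbers (the classifiers that would produce them are not shown). It implements the standard Ruzzo–Tompa procedure for finding all maximal scoring subsequences and then keeps the ''k'' best. It does not reproduce the specific modification made by Pasternack and Roth. The function names <code>maximal_scoring_subsequences</code> and <code>top_k_subsequences</code> are hypothetical, and this simple version rescans the candidate list instead of using the bookkeeping needed for the linear-time bound.

```python
def maximal_scoring_subsequences(scores):
    """Return (start, end, total) for every maximal scoring subsequence.

    `end` is exclusive. Simple variant: the candidate list is rescanned
    from the right, so the worst case is quadratic rather than linear.
    """
    candidates = []  # entries: (start, end, L, R), with L/R cumulative sums
    total = 0
    for i, score in enumerate(scores):
        before = total
        total += score
        if score <= 0:
            continue  # non-positive scores never start a candidate
        start, end, left, right = i, i + 1, before, total
        while True:
            # Find the rightmost candidate j with L_j < L of the new one.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= left:
                j -= 1
            if j < 0 or candidates[j][3] >= right:
                candidates.append((start, end, left, right))
                break
            # Otherwise extend leftwards over candidate j and drop j onward.
            start, left = candidates[j][0], candidates[j][2]
            del candidates[j:]
    return [(s, e, r - l) for s, e, l, r in candidates]


def top_k_subsequences(scores, k):
    """Keep the k highest-valued maximal subsequences (hypothetical helper)."""
    found = maximal_scoring_subsequences(scores)
    return sorted(found, key=lambda item: item[2], reverse=True)[:k]


# Made-up token scores, for illustration only.
token_scores = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
best = top_k_subsequences(token_scores, 2)
```

With these illustrative scores, <code>best</code> holds the token ranges <code>(4, 11)</code> and <code>(0, 1)</code>, with totals 7 and 4. In the web-scraping setting, each such range would be mapped back to the corresponding span of tokens in the page.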
 
 
==Problem Definition==