Revision as of 03:09, 2 May 2018 by John Douglas Huff (no edit summary)
Revision as of 03:30, 2 May 2018 by John Douglas Huff (no edit summary)
Line 10:
===Web Scraping===
The Ruzzo-Tompa algorithm is used in [[Web scraping]] to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first [[Lexical_analysis#Tokenization|tokenized]] and the score for each token is found using local, token-level classifiers. A modified version of the Ruzzo-Tompa algorithm is then used to find the k highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.<ref name="pasternack" />

~~The Ruzzo-Tompa algorithm is used in [[Web scraping]] to extract blocks of text from web pages.~~

==Problem Definition==
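The pipeline in the Web Scraping paragraph (score each token, then pick high-valued runs of tokens) can be illustrated with a minimal Python sketch. It implements the standard Ruzzo-Tompa scan for all maximal scoring subsequences and then simply sorts them to keep the k best. This is not the modified version Pasternack and Roth describe, whose details are not given here; the function names, the hand-written score list, and the sort-based top-k step are all assumptions made for illustration.

```python
def maximal_scoring_subsequences(scores):
    """Return all maximal scoring subsequences as (start, end, total),
    with `end` exclusive. Standard Ruzzo-Tompa scan."""
    # Each candidate is [start, end, L, R]:
    #   L = cumulative sum before the candidate's first score
    #   R = cumulative sum through the candidate's last score
    candidates = []
    total = 0
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        cur = [i, i + 1, left, total]
        while True:
            # Search right to left for the nearest candidate with L_j < L_cur.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= cur[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= cur[3]:
                candidates.append(cur)
                break
            # Otherwise extend cur leftward to absorb candidates j..end,
            # then reconsider the merged candidate.
            cur = [candidates[j][0], cur[1], candidates[j][2], cur[3]]
            del candidates[j:]
    return [(a, b, r - l) for a, b, l, r in candidates]


def top_k_subsequences(scores, k):
    """Hypothetical helper: keep the k highest-valued maximal subsequences."""
    found = maximal_scoring_subsequences(scores)
    return sorted(found, key=lambda t: t[2], reverse=True)[:k]


# Illustrative token scores (in practice these would come from a classifier).
token_scores = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
print(top_k_subsequences(token_scores, 2))
```

Each returned triple gives the half-open token range and its total score, so the predicted text block is the token slice `tokens[start:end]`. The inner right-to-left search is written as a plain loop for readability, so this sketch does not achieve the linear running time of the original algorithm.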

Ruzzo–Tompa algorithm: Difference between revisions