====Unsupervised approach: TextRank====
Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of [[training set|training data]]. Many documents with known keyphrases are needed. Furthermore, training on a specific ___domain tends to customize the extraction process to that ___domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate.
Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm<ref>Rada Mihalcea and Paul Tarau, 2004: ''TextRank: Bringing Order into Texts'', Department of Computer Science, University of North Texas {{Cite web |url=http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf |title=Archived copy |access-date=2012-07-20 |archive-url=https://web.archive.org/web/20120617170501/http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf |archive-date=2012-06-17 |url-status=bot: unknown}}</ref> exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that [[PageRank]] selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from [[social network]]s. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.
 
TextRank is a general-purpose [[Graph (abstract data type)|graph]]-based ranking algorithm for [[natural language processing|NLP]]. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or [[lexical (semiotics)|lexical]] [[semantic similarity|similarity]] between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to [[eigenvalue]] 1 (i.e., the [[stationary distribution]] of the [[random walk]] on the graph).
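
The following Python sketch illustrates this procedure on a word co-occurrence graph. It is a minimal illustration, not a reference implementation: the function name <code>textrank_keywords</code>, the simple regular-expression tokenization, the co-occurrence window of 2, and the damping value 0.85 are assumptions made here for brevity rather than the exact settings of the original TextRank system, which also filters candidates by part of speech and collapses adjacent top-ranked words into multiword keyphrases.

<syntaxhighlight lang="python">
import re
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=10):
    """Rank words of `text` by a PageRank-style score on a co-occurrence graph.

    Illustrative sketch only: tokenization, window size and damping factor
    are simplifying assumptions, not the settings of the original paper.
    """
    # Tokenize; a real system would keep only nouns/adjectives as candidates.
    words = re.findall(r"[a-zA-Z][a-zA-Z-]+", text.lower())

    # Build an undirected, weighted graph: an edge links two words that
    # co-occur within `window` positions, weighted by co-occurrence count.
    weights = defaultdict(float)
    neighbours = defaultdict(set)
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            v = words[j]
            if u == v:
                continue
            weights[(u, v)] += 1.0
            weights[(v, u)] += 1.0
            neighbours[u].add(v)
            neighbours[v].add(u)

    # Power iteration: repeatedly apply the damped, weighted PageRank update,
    # which converges towards the stationary distribution of the random walk.
    vertices = list(neighbours)
    score = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        new_score = {}
        for v in vertices:
            rank = 0.0
            for u in neighbours[v]:
                out_weight = sum(weights[(u, x)] for x in neighbours[u])
                rank += weights[(u, v)] / out_weight * score[u]
            new_score[v] = (1 - damping) + damping * rank
        score = new_score

    # Highest-scoring vertices are the candidate keywords.
    return sorted(score, key=score.get, reverse=True)[:top_n]

print(textrank_keywords(
    "Graph based ranking treats a document as a graph of words and ranks "
    "words by their centrality in that graph of co-occurring words."
))
</syntaxhighlight>

The power iteration above is simply one way of approximating the eigenvector with eigenvalue 1 described in the preceding paragraph; the same stationary distribution could instead be obtained by an explicit eigendecomposition of the damped stochastic matrix.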