Content deleted Content added
remove loginurls |
Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.8.8 |
||
Line 66:
====Unsupervised approach: TextRank====
Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of [[training set|training data]]. Many documents with known keyphrases are needed. Furthermore, training on a specific ___domain tends to customize the extraction process to that ___domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate.
Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm<ref>Rada Mihalcea and Paul Tarau, 2004: ''TextRank: Bringing Order into Texts'', Department of Computer Science University of North Texas {{
TextRank is a general purpose [[Graph (abstract data type)|graph]]-based ranking algorithm for [[natural language processing|NLP]]. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or [[lexical (semiotics)|lexical]] [[semantic similarity|similarity]] between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to [[eigenvalue]] 1 (i.e., the [[stationary distribution]] of the [[random walk]] on the graph).
|