== Evaluation ==
There are multiple methods that can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as [[accuracy]], [[f1 score]], or an [[receiver operating characteristic|ROC curve]] do relatively well. However, F1 scores are difficult to calculate because it is hard to produce a complete list of paraphrases for a given phrase, and because good paraphrases are dependent upon context. A metric designed to counter these problems is ParaMetric.<ref name=Burch2>{{cite conference|last1=Callison-Burch|first1=Chris|last2=Cohn|first2=Trevor|last3=Lapata|first3=Mirella|title=ParaMetric: An Automatic Evaluation Metric for Paraphrasing|booktitle=Proceedings of the 22nd International Conference on Computational Linguistics|place=Manchester|year=2008|pages=97-104|url=https://pdfs.semanticscholar.org/be0d/0df960833c1bea2a39ba9a17e5ca958018cd.pdf}}</ref> ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing the automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can also be used to rate paraphrase generation systems, provided they use phrase alignment as part of their generation process. A noted drawback of ParaMetric is the large and exhaustive set of manual alignments that must be created before a rating can be produced.
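Below is a minimal sketch of alignment-based precision and recall in the spirit of ParaMetric, assuming that both the automatic system and the human annotator produce sets of aligned phrase pairs; the function name and example data are illustrative, not the reference ParaMetric implementation.

<syntaxhighlight lang="python">
# Illustrative sketch: precision and recall of automatic phrase alignments
# measured against a manually created gold alignment (hypothetical data).
def alignment_precision_recall(system_pairs, gold_pairs):
    system_pairs, gold_pairs = set(system_pairs), set(gold_pairs)
    matched = system_pairs & gold_pairs          # alignments found by both
    precision = len(matched) / len(system_pairs) if system_pairs else 0.0
    recall = len(matched) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

gold = {("under fire", "criticised"), ("stepped down", "resigned")}
system = {("under fire", "criticised"), ("stepped down", "quit")}
print(alignment_precision_recall(system, gold))  # (0.5, 0.5)
</syntaxhighlight>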
The evaluation of paraphrase generation has difficulties similar to those of evaluating [[machine translation]]. The quality of a paraphrase often depends on its context, whether it is being used as a summary, and how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest way to evaluate paraphrase generation is through human judges, but such evaluation tends to be time-consuming. Automated approaches to evaluation prove challenging, as the task is essentially as difficult as paraphrase recognition. While originally used to evaluate machine translations, [[BLEU]] has also been used successfully to evaluate paraphrase generation models. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and other similar evaluation metrics.<ref name=Chen>{{cite conference|last1=Chen|first1=David|last2=Dolan|first2=William|title=Collecting Highly Parallel Data for Paraphrase Evaluation|booktitle=Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies|place=Portland, Oregon|year=2008|pages=190-200|url=https://dl.acm.org/citation.cfm?id=2002497}}</ref>
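The effect of lexical dissimilarity on BLEU can be seen in a small sketch. The example below assumes the NLTK library and uses made-up sentences; an equally valid but lexically different paraphrase receives a much lower score because it shares few n-grams with the source.

<syntaxhighlight lang="python">
# Illustrative only: BLEU rewards n-gram overlap with the reference, so a
# valid but lexically dissimilar paraphrase scores poorly.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

source = "the economy grew rapidly last year".split()
close_paraphrase = "the economy expanded rapidly last year".split()
distant_paraphrase = "last year saw very fast economic growth".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([source], close_paraphrase, smoothing_function=smooth))
print(sentence_bleu([source], distant_paraphrase, smoothing_function=smooth))
</syntaxhighlight>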
Metrics specifically designed to evaluate paraphrase generation include PEM<ref name=Liu>{{cite conference|last1=Liu|first1=Chang|last2=Dahlmeier|first2=Daniel|last3=Ng|first3=Hwee Tou|title=PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts|booktitle=Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing|place=MIT, Massachusetts|year=2010|pages=923-932|url=http://www.aclweb.org/anthology/D10-1090}}</ref> and PINC,<ref name=Chen></ref> along with the aforementioned ParaMetric. PEM (paraphrase evaluation metric) attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single value heuristic calculated using [[n-gram|N-gram]] overlap in a pivot language. However, a large drawback of PEM is that it must be trained using a large, in-___domain parallel corpus as well as human judges.<ref name=Chen></ref> In other words, it is tantamount to training a paraphrase recognition system in order to evaluate a paraphrase generation system. PINC, on the other hand, rates how dissimilar a paraphrase is from its source sentence by measuring the fraction of n-grams that appear in the paraphrase but not in the source; it is essentially the opposite of BLEU and is intended to be used in conjunction with it.<ref name=Chen></ref>
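The following is a minimal sketch of a PINC-style dissimilarity score under the formulation described above (the fraction of candidate n-grams absent from the source, averaged over n-gram orders); the function names and example sentences are illustrative rather than the authors' implementation.

<syntaxhighlight lang="python">
# Illustrative sketch of a PINC-style score: higher values mean the
# paraphrase shares fewer n-grams with the source sentence.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("the economy grew rapidly last year",
           "last year saw very fast economic growth"))
</syntaxhighlight>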
== References ==