Paraphrasing (computational linguistics)

== Evaluation ==
The evaluation of paraphrase generation presents difficulties similar to those of evaluating [[machine translation]]. The quality of a paraphrase often depends on its context, on whether it is being used as a summary, and on how it was generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest way to evaluate paraphrase generation is through human judges, but human evaluation tends to be time-consuming. Automated evaluation is challenging, as it is essentially a problem as difficult as paraphrase recognition. Although [[BLEU]] was originally used to evaluate machine translation, it has also been used successfully to evaluate paraphrase generation models. However, a source sentence often has several lexically different but equally valid paraphrases, which hurts BLEU and other similar evaluation metrics.<ref name=Chen>{{cite conference|last1=Chen|first1=David|last2=Dolan|first2=William|title=Collecting Highly Parallel Data for Paraphrase Evaluation|booktitle=Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies|place=Portland, Oregon|year=2011|pages=190-200|url=https://dl.acm.org/citation.cfm?id=2002497}}</ref>
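To illustrate why lexically divergent but equally valid paraphrases are penalized by n-gram overlap metrics, the following is a minimal sketch using NLTK's sentence-level BLEU implementation; the example sentences are invented, and the exact scores only matter for showing the relative gap between a near-copy and a genuine paraphrase.

<syntaxhighlight lang="python">
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = ["the economy grew faster than analysts expected".split()]

# A near-copy reuses most of the reference's wording, while the
# paraphrase preserves the meaning with almost no shared n-grams.
near_copy = "the economy grew faster than expected".split()
paraphrase = "growth outpaced what analysts had predicted".split()

print(sentence_bleu(reference, near_copy, smoothing_function=smooth))
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))
# The near-copy scores far higher, even though both are valid paraphrases.
</syntaxhighlight>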
 
Metrics specifically designed to evaluate paraphrase generation include PINC<ref name=Chen></ref> and PEM<ref name=Liu>{{cite conference|last1=Liu|first1=Chang|last2=Dahlmeier|first2=Daniel|last3=Ng|first3=Hwee Tou|title=PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts|booktitle=Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing|place=MIT, Massachusetts|year=2010|pages=923-932|url=http://www.aclweb.org/anthology/D10-1090}}</ref> along with the aforementioned ParaMetric. PINC (Paraphrase In N-gram Changes) is designed to be used in conjunction with BLEU and to cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the [[Jaccard index|Jaccard distance]] between the n-grams of the source and candidate sentences: it rewards candidate n-grams that do not appear in the source sentence, while pairing it with BLEU maintains some semantic equivalence. PEM (Paraphrase Evaluation Metric), on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated from [[n-gram|n-gram]] overlap in a pivot language. However, a significant drawback of PEM is that it must be trained using a large, in-domain parallel corpus as well as human judges.<ref name=Chen></ref> In other words, it is tantamount to training a paraphrase recognition system in order to evaluate a paraphrase generation system.
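The following is a minimal sketch of a PINC-style computation as described above, assuming the score is averaged over n-gram sizes up to four; the function name and example sentences are illustrative and not taken from the cited paper.

<syntaxhighlight lang="python">
def pinc(source_tokens, candidate_tokens, max_n=4):
    """Fraction of the candidate's n-grams that do NOT occur in the
    source sentence, averaged over n-gram sizes 1..max_n."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    per_n_scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate_tokens, n)
        if not cand:
            continue  # candidate is shorter than n tokens
        src = ngrams(source_tokens, n)
        per_n_scores.append(1.0 - len(cand & src) / len(cand))
    return sum(per_n_scores) / len(per_n_scores) if per_n_scores else 0.0

source = "the dog chased the cat across the yard".split()
candidate = "a cat was pursued across the yard by the dog".split()
print(round(pinc(source, candidate), 3))  # closer to 1.0 = more lexical change
</syntaxhighlight>

A high PINC score by itself only indicates lexical novelty, which is why it is reported alongside BLEU (computed against reference paraphrases) rather than used on its own.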
 
== References ==