Revision as of 06:08, 23 August 2021 edit BrownHairedGirl (talk \| contribs) Autopatrolled, Extended confirmed users, File movers, Pending changes reviewers, Rollbackers 2,942,733 edits fix refs ← Previous edit		Revision as of 23:35, 31 August 2021 edit undo Ars rhetorica (talk \| contribs) 6 edits →Evaluation: Added information on common dataset for training paraphrase detectors. Next edit →
The evaluation of paraphrase generation faces difficulties similar to those of evaluating [[machine translation]]. The quality of a paraphrase often depends on its context, on whether it is being used as a summary, and on how it was generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest way to evaluate paraphrase generation is to use human judges, but human evaluation tends to be time-consuming. Automated evaluation is challenging because it is essentially as difficult a problem as paraphrase recognition. Although originally designed to evaluate machine translation, bilingual evaluation understudy ([[BLEU]]) has also been used successfully to evaluate paraphrase generation models. However, a source phrase often has several lexically different but equally valid paraphrases, which hurts BLEU and other similar evaluation metrics.<ref name=Chen>{{cite conference \|last1=Chen \|first1=David \|last2=Dolan \|first2=William \|title=Collecting Highly Parallel Data for Paraphrase Evaluation \|conference=Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies \|place=Portland, Oregon \|year=2011 \|pages=190–200 \|url=https://dl.acm.org/citation.cfm?id=2002497}}</ref> Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC)<ref name=Chen /> and paraphrase evaluation metric (PEM)<ref name=Liu>{{cite conference\|last1=Liu\|first1=Chang\|last2=Dahlmeier\|first2=Daniel\|last3=Ng\|first3=Hwee Tou\|title=PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts \|conference=Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing \|place=MIT, Massachusetts \|year=2010 \|pages=923–932 \|url=http://www.aclweb.org/anthology/D10-1090}}</ref> along with the aforementioned ParaMetric.
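As a rough illustration of how a lexical-dissimilarity score such as PINC can be computed, the sketch below averages, over n-gram orders 1 to N, the fraction of the candidate's distinct n-grams that do not occur in the source. It is a minimal sketch rather than a reference implementation: the function names, the lowercased whitespace tokenization, the default N of 4, and the handling of candidates too short to form an n-gram are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """Mean share of candidate n-grams (n = 1..max_n) that are absent from the source."""
    # Assumption: lowercased whitespace tokenization stands in for a real tokenizer.
    src = source.lower().split()
    cand = candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            # Assumption: orders the candidate is too short for are skipped.
            continue
        overlap = cand_ngrams & ngrams(src, n)
        scores.append(1.0 - len(overlap) / len(cand_ngrams))
    # Assumption: an empty candidate scores 0.0.
    return sum(scores) / len(scores) if scores else 0.0
```

A candidate identical to its source scores 0, and one sharing no words with it scores 1. A higher score therefore indicates more lexical novelty but says nothing about whether the meaning was preserved, which is why such a score is reported alongside BLEU.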
PINC is designed to be used in conjunction with BLEU and to help cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the [[Jaccard index\|Jaccard distance]] between the two sentences, except that n-grams appearing only in the source sentence are excluded, so that a candidate is not rewarded merely for omitting n-grams from the source, which helps maintain some semantic equivalence. PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated from [[N-gram]] overlap in a pivot language. However, a major drawback of PEM is that it must be trained using large, in-domain parallel corpora as well as human judges.<ref name=Chen /> In other words, it is tantamount to training a paraphrase recognition system in order to evaluate a paraphrase generation system.

The Quora Question Pairs Dataset, which contains hundreds of thousands of duplicate questions, has become a common dataset for the evaluation of paraphrase detectors.<ref>{{cite web \|title=Paraphrase Identification on Quora Question Pairs \|url=https://paperswithcode.com/sota/paraphrase-identification-on-quora-question\|website=Papers with Code}}</ref> As of 2021, the best-performing paraphrase detection models of the preceding three years have all used the Transformer architecture and have all relied on large amounts of pre-training on more general data before fine-tuning on the question pairs.

== See also ==

Paraphrasing (computational linguistics): Difference between revisions