Revision as of 03:55, 24 April 2021 by Anachronist (m): Reverted edits by Divyang444 (talk) to last version by Monkbot. Tag: Rollback
Revision as of 08:52, 6 May 2021 by OAbot (m): Open access bot: doi added to citation with #oabot.
Line 39:
== Evaluation ==
Several methods can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as [[accuracy]], [[f1 score]], or a [[receiver operating characteristic\|ROC curve]] do relatively well. However, F1 scores are difficult to calculate, because it is hard to produce a complete list of paraphrases for a given phrase and because good paraphrases depend on context. A metric designed to counter these problems is ParaMetric.<ref name=Burch2>{{cite conference\|last1=Callison-Burch\|first1=Chris\|last2=Cohn\|first2=Trevor\|last3=Lapata\|first3=Mirella\|title=ParaMetric: An Automatic Evaluation Metric for Paraphrasing\|book-title=Proceedings of the 22nd International Conference on Computational Linguistics\|place=Manchester\|year=2008\|pages=97–104\|doi=10.3115/1599081.1599094\|s2cid=837398\|url=https://pdfs.semanticscholar.org/be0d/0df960833c1bea2a39ba9a17e5ca958018cd.pdf\|doi-access=free}}</ref> ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing the automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can also be used to rate paraphrase generation systems, provided that the system uses phrase alignment as part of its generation process. A noted drawback of ParaMetric is the large and exhaustive set of manual alignments that must be created before a rating can be produced.

The evaluation of paraphrase generation has difficulties similar to those of evaluating [[machine translation]]. The quality of a paraphrase often depends on its context, on whether it is being used as a summary, and on how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase.
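
The precision and recall comparison between an automatic and a manual alignment, as described for ParaMetric above, can be illustrated with a minimal Python sketch. This is not the published ParaMetric implementation: the representation of an alignment as a set of (source phrase, paraphrase) pairs, the function name <code>alignment_scores</code>, and the example phrases are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: an alignment is a set of
# (source phrase, paraphrase) pairs.
Alignment = Set[Tuple[str, str]]


def alignment_scores(predicted: Alignment, gold: Alignment) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted pairs against gold pairs."""
    correct = len(predicted & gold)  # pairs found in both alignments
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up manual (gold) and automatic (predicted) alignments.
gold = {("a lot of", "many"), ("purchase", "buy"), ("at this time", "now")}
predicted = {
    ("a lot of", "many"),
    ("purchase", "buy"),
    ("purchase", "sell"),
    ("in", "on"),
}

precision, recall, f1 = alignment_scores(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this sketch recall can only be as trustworthy as the gold set is complete, which mirrors the drawback noted above: the manual alignments have to be exhaustive before the score is meaningful.
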
The simplest method of evaluating paraphrase generation is to use human judges. Unfortunately, evaluation by human judges tends to be time-consuming. Automated evaluation is challenging because it is essentially as difficult a problem as paraphrase recognition. Although originally used to evaluate machine translations, bilingual evaluation understudy ([[BLEU]]) has also been used successfully to evaluate paraphrase generation models. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and other similar evaluation metrics.<ref name=Chen>{{cite conference\|last1=Chen\|first1=David\|last2=Dolan\|first2=William\|title=Collecting Highly Parallel Data for Paraphrase Evaluation\|book-title=Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies\|place=Portland, Oregon\|year=2011\|pages=190–200\|url=https://dl.acm.org/citation.cfm?id=2002497}}</ref>
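
The sensitivity of overlap-based metrics to lexical variation can be seen in a small Python sketch. It computes only a clipped unigram precision, which is one ingredient of BLEU; it is not full BLEU, since it omits higher-order n-grams, the geometric mean, and the brevity penalty. The function name <code>clipped_unigram_precision</code> and the example sentences are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def clipped_unigram_precision(candidate: str, references: List[str]) -> float:
    """Fraction of candidate tokens that also occur in a reference.

    Each token's count is clipped to its maximum count in any single
    reference. This is a simplified stand-in for BLEU, not BLEU itself.
    """
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0
    max_ref_counts: Counter = Counter()
    for ref in references:
        for token, count in Counter(ref.lower().split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(
        min(count, max_ref_counts[token]) for token, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())


reference = "the cat sat on the mat"
close_copy = "the cat sat on a mat"               # nearly identical wording
valid_paraphrase = "a feline rested upon the rug"  # same meaning, new words

score_copy = clipped_unigram_precision(close_copy, [reference])
score_paraphrase = clipped_unigram_precision(valid_paraphrase, [reference])
print(f"close copy: {score_copy:.3f}, paraphrase: {score_paraphrase:.3f}")
```

The near copy scores far higher than the lexically different paraphrase, even though a good paraphrase is usually expected to differ from its source. Supplying several reference paraphrases can reduce this effect but does not remove it.
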

Paraphrasing (computational linguistics): Difference between revisions