Paraphrasing (computational linguistics)



For the linguistics definition, see paraphrase.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases.

Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection.[1] Paraphrasing is also useful in the evaluation of machine translation,[2] as well as in the generation of new samples to expand existing corpora.[3]

Models

Multiple sequence alignment

Barzilay and Lee[3] proposed a method to generate paraphrases from monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using "multi-sequence alignment to generate sentence-level paraphrases ... from [an] unannotated corpus data"; as such, it can be considered an instance of unsupervised learning. The training algorithm has two main goals:

  • finding recurring patterns in each individual corpus, e.g. "X (injured/wounded) Y people, Z seriously", where X, Y, and Z are variables
  • finding pairings between such patterns that represent paraphrases, e.g. "X (injured/wounded) Y people, Z seriously" and "Y were (wounded/hurt) by X, among them Z were in serious condition"

Accordingly, the training algorithm consists of four steps. First, sentences describing similar events with similar structure are clustered together, with similarity judged by n-gram overlap. Second, patterns are induced by computing a multiple-sequence alignment over the sentences in each cluster, producing a lattice. During this step, areas of high variability, defined as the areas between words shared by more than 50% of the cluster's sentences, are taken to be instances of arguments and are replaced with slots. Third, lattices are matched between corpora based on matching or similar arguments within their slots. Finally, new paraphrases are generated by taking a new sentence, determining the cluster to which it most closely belongs, and selecting an appropriately matching lattice. If a matching lattice is found, the slot arguments are extracted and used to generate as many new paraphrases as there are lattices in the matching cluster.
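The clustering step above can be sketched with a toy similarity measure. The greedy single-pass clustering and bigram setting below are simplifications for illustration, not the hierarchical clustering Barzilay and Lee actually use:

```python
def ngrams(tokens, n=2):
    """Set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(s1, s2, n=2):
    """Jaccard-style n-gram overlap between two tokenized sentences."""
    a, b = ngrams(s1, n), ngrams(s2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(sentences, n=2, threshold=0.3):
    """Greedy single-pass clustering by n-gram overlap
    (a simplification of the clustering used in the paper)."""
    clusters = []
    for sent in sentences:
        tokens = sent.split()
        for c in clusters:
            if any(overlap(tokens, other, n) >= threshold for other in c):
                c.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters

sentences = [
    "a bomb injured 5 people , 2 seriously",
    "a bomb wounded 5 people , 2 seriously",
    "stocks fell sharply in early trading",
]
groups = cluster(sentences)
# the two injury reports share enough bigrams to land in one cluster,
# while the unrelated sentence forms its own cluster
```

Sentences grouped this way would then feed the alignment step, where shared words anchor the lattice and variable regions become slots.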

Translation

Bannard and Callison-Burch[4] proposed extracting paraphrases from bilingual parallel corpora by pivoting: English phrases that are aligned to the same phrase in another language are treated as potential paraphrases of one another, and the paraphrase probability is estimated by marginalizing over the shared foreign phrases.
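The cited approach[4] scores a paraphrase pair by summing, over every shared foreign phrase f, the product of the translation probabilities p(f | e1) and p(e2 | f). A minimal sketch, with toy phrase tables (the phrases and probabilities below are illustrative assumptions, not data from the paper):

```python
from collections import defaultdict

def pivot_paraphrase_probs(e2f, f2e):
    """p(e2 | e1) = sum over foreign phrases f of p(f | e1) * p(e2 | f).
    e2f: English phrase -> {foreign phrase: translation probability};
    f2e: foreign phrase -> {English phrase: translation probability}."""
    para = defaultdict(dict)
    for e1, foreign in e2f.items():
        for f, p_f in foreign.items():
            for e2, p_e2 in f2e.get(f, {}).items():
                if e2 != e1:  # a phrase is not a paraphrase of itself
                    para[e1][e2] = para[e1].get(e2, 0.0) + p_f * p_e2
    return para

# toy phrase tables: "under control" aligns to one German phrase,
# which back-translates to two English phrases
e2f = {"under control": {"unter kontrolle": 1.0}}
f2e = {"unter kontrolle": {"under control": 0.6, "in check": 0.4}}
probs = pivot_paraphrase_probs(e2f, f2e)
# probs["under control"]["in check"] accumulates 1.0 * 0.4
```

In practice the phrase tables come from word-aligned bilingual corpora, and sums run over many foreign pivot phrases rather than one.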

Autoencoders

Socher et al.[1] approached paraphrase detection with unfolding recursive autoencoders, which learn vector representations for the phrases in a sentence's parse tree. The pairwise similarities between the phrase vectors of two sentences form a variable-size matrix, which is dynamically pooled to a fixed size and fed to a classifier.
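The cited autoencoder approach[1] relies on dynamic pooling to map a variable-size similarity matrix onto a fixed-size grid so that a standard classifier can consume it. A minimal sketch using min-pooling over roughly equal row and column bands (this sketch assumes the matrix has at least p rows and p columns):

```python
def dynamic_pool(matrix, p=2):
    """Min-pool a variable-size matrix down to a fixed p x p grid by
    splitting its rows and columns into p roughly equal bands.
    Assumes the matrix has at least p rows and p columns."""
    def bands(size):
        base, extra = divmod(size, p)
        out, start = [], 0
        for i in range(p):
            width = base + (1 if i < extra else 0)
            out.append(range(start, start + width))
            start += width
        return out
    row_bands = bands(len(matrix))
    col_bands = bands(len(matrix[0]))
    return [[min(matrix[r][c] for r in rb for c in cb) for cb in col_bands]
            for rb in row_bands]

# a 3 x 4 matrix of pairwise phrase distances, pooled to 2 x 2
pooled = dynamic_pool([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12]], p=2)
```

Min-pooling suits matrices of distances, where small values mark the closest phrase pairs; for similarity scores, max-pooling would play the same role.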

Skip-thought vectors

Kiros et al.[5] introduced skip-thought vectors, sentence embeddings produced by an encoder trained to predict the sentences surrounding each sentence in a corpus. The similarity between the skip-thought vectors of two sentences can then serve as a feature for paraphrase detection.
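With sentence embeddings of this kind[5], paraphrase similarity reduces to comparing vectors. A minimal sketch, with small hand-made vectors standing in for the output of a trained skip-thought encoder (the vectors and sentences are assumptions for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hand-made 3-d vectors standing in for a trained encoder's output,
# which in practice would be high-dimensional
vec_a = [0.9, 0.1, 0.3]   # e.g. "a bomb injured five people"
vec_b = [0.8, 0.2, 0.4]   # e.g. "five people were wounded by a bomb"
score = cosine(vec_a, vec_b)
# a score near 1 is evidence that the sentences paraphrase each other
```

In the cited work, features derived from pairs of sentence vectors (such as their componentwise product and absolute difference) feed a classifier rather than a raw similarity threshold.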

Evaluation methods

References

  1. ^ a b Socher, Richard; Huang, Eric; Pennington, Jeffrey; Ng, Andrew; Manning, Christopher (2011). "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection".
  2. ^ Callison-Burch, Chris (October 25–27, 2008). "Syntactic Constraints on Paraphrases Extracted from Parallel Corpora". EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. pp. 196–205.
  3. ^ a b Barzilay, Regina; Lee, Lillian (May–June 2003). "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment". Proceedings of HLT-NAACL 2003.
  4. ^ Bannard, Colin; Callison-Burch, Chris (2005). "Paraphrasing with Bilingual Parallel Corpora". Proceedings of the 43rd Annual Meeting of the ACL. Ann Arbor, Michigan. pp. 597–604.
  5. ^ Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015). "Skip-Thought Vectors".