Paraphrasing (computational linguistics)

For the linguistics definition, see paraphrase.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases.

Applications of paraphrasing are varied and include information retrieval, question answering, text summarization, and plagiarism detection.^[1] Paraphrasing is also useful in the evaluation of machine translation,^[2] as well as in the generation of new samples to expand existing corpora.^[3]

Paraphrase generation

Multiple sequence alignment

Barzilay and Lee^[3] proposed a method to generate paraphrases from monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using "multi-sequence alignment to generate sentence-level paraphrases... from [an] unannotated corpus data"; as such, it can be considered an instance of unsupervised learning. The main goals of the training algorithm are thus:

finding recurring patterns in each individual corpus, e.g. " $X$ (injured/wounded) $Y$ people, $Z$ seriously" where $X, Y, Z$ are variables
finding pairings between such patterns that represent paraphrases, e.g. " $X$ (injured/wounded) $Y$ people, $Z$ seriously" and " $Y$ were (wounded/hurt) by $X$ , among them $Z$ were in serious condition"

Accordingly, the training algorithm consists of four steps. First, sentences describing similar events with similar structure are clustered together, with similarity judged by n-gram overlap. Second, patterns are induced by computing a multiple-sequence alignment between the sentences in each cluster, which produces a lattice. During this step, areas of high variability are taken to be instances of arguments and are replaced with slots; they are identified as the areas between words shared by more than 50% of the cluster's sentences. Third, lattices are matched between corpora based on matching or similar arguments within their slots. Finally, new paraphrases are generated by taking a new sentence, determining which sentence cluster it most closely belongs to, and selecting an appropriately matching lattice. If a matching lattice is found, the slot arguments are determined and then used to generate as many new paraphrases as there are lattices in the matching cluster.
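The clustering measure and the slot-finding rule can be illustrated with a minimal Python sketch. It is not Barzilay and Lee's implementation: it assumes Jaccard overlap of bigrams as the n-gram similarity, and it skips the multiple-sequence alignment entirely, treating any word that appears in more than 50% of a cluster's sentences as a shared word. The function names and the toy cluster are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Jaccard overlap of two sentences' n-gram sets (an assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def backbone_words(cluster, threshold=0.5):
    """Words occurring in more than `threshold` of the cluster's sentences."""
    counts = Counter(word for sentence in cluster for word in set(sentence))
    return {w for w, c in counts.items() if c / len(cluster) > threshold}

def mark_slots(sentence, backbone):
    """Collapse each run of non-backbone words into a single SLOT token."""
    out = []
    for word in sentence:
        if word in backbone:
            out.append(word)
        elif not out or out[-1] != "SLOT":
            out.append("SLOT")
    return out

# Toy cluster of tokenized sentences (invented for illustration).
cluster = [
    "a bomb injured 12 people , 3 seriously".split(),
    "a blast injured 40 people , 9 seriously".split(),
    "the attack injured 7 people , 2 seriously".split(),
]
backbone = backbone_words(cluster)
pattern = mark_slots(cluster[0], backbone)
print(pattern)
```

For this toy cluster the shared words are "a", "injured", "people", "," and "seriously", so the variable regions between them become slots, mirroring the " $X$ (injured/wounded) $Y$ people, $Z$ seriously" pattern above.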

Phrase-based machine translation

Paraphrases can also be generated through phrase-based translation, as proposed by Bannard and Callison-Burch.^[4] The chief concept consists of aligning phrases in a pivot language to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter kontrolle" in its German counterpart. The phrase "unter kontrolle" is then found in another German sentence, where the aligned English phrase is "in check", a paraphrase of "under control".

The probability distribution can be modeled as $\Pr(e_{2}|e_{1})$ , the probability that phrase $e_{2}$ is a paraphrase of $e_{1}$ , which is equivalent to $\Pr(e_{2}|f)\Pr(f|e_{1})$ summed over all $f$ , the potential phrase translations in the pivot language. Additionally, the sentence $S$ containing $e_{1}$ is added as a prior to give the paraphrase context. Thus the optimal paraphrase, ${\hat {e_{2}}}$ , can be calculated as:

$${\hat {e_{2}}}=\arg \max _{e_{2}\neq e_{1}}\Pr(e_{2}|e_{1},S)=\arg \max _{e_{2}\neq e_{1}}\sum _{f}\Pr(e_{2}|f,S)\Pr(f|e_{1},S)$$

$\Pr(e_{2}|f)$ and $\Pr(f|e_{1})$ can be approximated by their relative frequencies in the aligned corpus. Adding $S$ as a prior is modeled by calculating the probability of forming the sentence $S$ when $e_{1}$ is substituted with $e_{2}$ .
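As a rough illustration of the pivot sum, the Python sketch below estimates $\Pr(e_{2}|e_{1})$ from a list of aligned phrase pairs using relative frequencies. It deliberately leaves out the sentence context $S$, and the function name, the phrase pairs, and their counts are all invented for the example rather than taken from Bannard and Callison-Burch.

```python
from collections import defaultdict

def pivot_paraphrase_probs(aligned_pairs, e1):
    """Estimate Pr(e2 | e1) = sum_f Pr(e2 | f) * Pr(f | e1) by relative frequency.

    aligned_pairs is a list of (english_phrase, pivot_phrase) alignments.
    The sentence context S from the formula is ignored in this sketch.
    """
    count_ef = defaultdict(int)  # how often e was aligned with f
    count_e = defaultdict(int)   # how often e was aligned at all
    count_f = defaultdict(int)   # how often f was aligned at all
    for e, f in aligned_pairs:
        count_ef[(e, f)] += 1
        count_e[e] += 1
        count_f[f] += 1

    probs = defaultdict(float)
    for (e, f), c in count_ef.items():
        if e != e1:
            continue
        p_f_given_e1 = c / count_e[e1]
        for (e2, f2), c2 in count_ef.items():
            if f2 == f and e2 != e1:  # the arg max excludes e2 == e1
                probs[e2] += (c2 / count_f[f]) * p_f_given_e1
    return dict(probs)

# Hypothetical alignment counts extracted from an English-German corpus.
pairs = (
    [("under control", "unter kontrolle")] * 3
    + [("in check", "unter kontrolle")]
    + [("under control", "im griff")]
    + [("in hand", "im griff")]
)
probs = pivot_paraphrase_probs(pairs, "under control")
best = max(probs, key=probs.get)
print(best, probs)
```

With these made-up counts, "in check" scores $\tfrac{1}{4}\cdot \tfrac{3}{4}=0.1875$ and "in hand" scores $\tfrac{1}{2}\cdot \tfrac{1}{4}=0.125$, so "in check" is selected.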

Paraphrase recognition

Recursive autoencoders

Paraphrase recognition has been attempted by Socher et al.^[1] through the use of recursive autoencoders. The main concept is to produce a vector representation of a sentence and of its components by recursively applying an autoencoder. Paraphrases should have similar vector representations; these representations are processed and then fed as input into a neural network for classification.

Given a sentence $W$ with $m$ words, the autoencoder is designed to take two $n$ -dimensional word embeddings as input and produce an $n$ -dimensional vector as output. The same autoencoder is applied to every pair of words in $W$ to produce $\lfloor m/2\rfloor$ vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as is to the next level of recursion. The autoencoder is then trained to reproduce every vector in the full recursion tree, including the initial word embeddings.
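The recursion described above can be sketched in Python as follows. The trained encoder is replaced by a stand-in that simply averages its two inputs, so the sketch only shows how the tree of vectors is built, not how the autoencoder is learned; `encode_pair` and `recursive_encode` are hypothetical names.

```python
def encode_pair(left, right):
    """Stand-in for the trained encoder: average two n-dimensional vectors."""
    return [(a + b) / 2 for a, b in zip(left, right)]

def recursive_encode(embeddings):
    """Return every vector in the recursion tree, word embeddings first."""
    all_vectors = list(embeddings)
    level = list(embeddings)
    while len(level) > 1:
        next_level = []
        start = 0
        if len(level) % 2 == 1:
            # Odd number of inputs: forward the first vector unchanged.
            next_level.append(level[0])
            start = 1
        for i in range(start, len(level), 2):
            parent = encode_pair(level[i], level[i + 1])
            next_level.append(parent)
            all_vectors.append(parent)
        level = next_level
    return all_vectors

# Toy 2-dimensional "embeddings" for sentences of 4 and 3 words.
four_words = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
three_words = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
print(len(recursive_encode(four_words)), len(recursive_encode(three_words)))
```

Under this pairing scheme a sentence of $m$ words always yields $2m-1$ vectors in total: the $m$ word embeddings plus $m-1$ vectors produced by the encoder.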

Given two sentences $W_{1}$ and $W_{2}$ of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The Euclidean distance is then taken between every combination of vectors in $W_{1}$ and $W_{2}$ to produce a similarity matrix $S\in \mathbb {R} ^{7\times 5}$ . $S$ is then passed through a dynamic min-pooling layer to produce a fixed-size $n_{p}\times n_{p}$ matrix. Since the size of $S$ varies from one sentence pair to the next, $S$ is split into $n_{p}$ roughly even sections along each dimension, and the minimum of each resulting region is kept. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a softmax output.
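A minimal Python sketch of the distance matrix and the dynamic min-pooling step is given below. The way the rows and columns are divided into roughly even spans, and the reuse of an entry when the matrix is smaller than the pooling grid, are assumptions of this sketch; the normalization and the classifier are omitted, and all function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def similarity_matrix(vectors1, vectors2):
    """Pairwise distances between the tree vectors of two sentences."""
    return [[euclidean(u, v) for v in vectors2] for u in vectors1]

def split_bounds(length, parts):
    """Split range(length) into `parts` roughly even (start, end) spans.

    If length < parts, some spans reuse an index (assumption of this sketch).
    """
    bounds = []
    for i in range(parts):
        start = i * length // parts
        end = max(start + 1, (i + 1) * length // parts)
        bounds.append((start, end))
    return bounds

def dynamic_min_pool(S, n_p):
    """Reduce a variable-size matrix S to n_p x n_p by taking region minima."""
    row_spans = split_bounds(len(S), n_p)
    col_spans = split_bounds(len(S[0]), n_p)
    return [
        [min(S[r][c] for r in range(r0, r1) for c in range(c0, c1))
         for (c0, c1) in col_spans]
        for (r0, r1) in row_spans
    ]

# A 7 x 5 matrix with S[i][j] = i + j, standing in for real distances.
S = [[float(i + j) for j in range(5)] for i in range(7)]
pooled = dynamic_min_pool(S, 2)
print(pooled)  # [[0.0, 2.0], [3.0, 5.0]]
```

Here the 7 rows are split into spans of 3 and 4 and the 5 columns into spans of 2 and 3, so every sentence pair produces a pooled matrix of the same $n_{p}\times n_{p}$ shape regardless of sentence length.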

Skip-thought vectors

Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence in a fashion similar to the skip-gram model.^[5] Skip-thought vectors are produced by a skip-thought model, which consists of three key components: an encoder and two decoders. Given a corpus of documents, the skip-thought model is trained to take a sentence as input and encode it into a skip-thought vector. The skip-thought vector is used as input for both decoders, one of which attempts to reproduce the previous sentence and the other the following sentence in its entirety. The encoder and decoders can be implemented as recurrent neural networks (RNNs), such as LSTMs.

Since paraphrases carry the same semantic meaning, they should have similar skip-thought vectors. Thus a simple logistic regression can be trained to good performance with the absolute difference and component-wise product of two skip-thought vectors as input.
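The input features for such a classifier can be built as in the Python sketch below, which concatenates the absolute difference and the component-wise product of two sentence vectors and scores them with a logistic function. The vectors, weights, and bias are placeholders; in practice the vectors would come from a trained skip-thought encoder and the weights from fitting the regression on labeled pairs.

```python
import math

def pair_features(u, v):
    """Concatenate |u - v| and the component-wise product u * v."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return abs_diff + product

def paraphrase_probability(features, weights, bias):
    """Logistic regression score; weights and bias are assumed to be learned."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder 2-dimensional "skip-thought vectors" for two sentences.
u = [1.0, -2.0]
v = [3.0, 0.5]
features = pair_features(u, v)
print(features)  # [2.0, 2.5, 3.0, -1.0]

# Untrained (all-zero) weights give an uninformative probability of 0.5.
print(paraphrase_probability(features, [0.0] * len(features), 0.0))
```

The feature vector has twice the dimension of a single sentence vector, and both parts are symmetric in the two sentences, so the order in which a pair is presented does not affect the score.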

Evaluation and challenges

Development in the field has been slowed by the lack of a standard evaluation method and by the cost of existing ones.^[6] The field also has few data sets. For paraphrase generation, results are currently evaluated by hand by two native speakers.

References

[1] Socher, Richard; Huang, Eric; Pennington, Jeffrey; Ng, Andrew; Manning, Christopher (2011). "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection".
[2] Callison-Burch, Chris (2008). "Syntactic Constraints on Paraphrases Extracted from Parallel Corpora". EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. pp. 196–205.
[3] Barzilay, Regina; Lee, Lillian (2003). "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment". Proceedings of HLT-NAACL 2003.
[4] Bannard, Colin; Callison-Burch, Chris (2005). "Paraphrasing Bilingual Parallel Corpora". Proceedings of the 43rd Annual Meeting of the ACL. Ann Arbor, Michigan. pp. 597–604.
[5] Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015). "Skip-Thought Vectors".
[6] Chen, David; Dolan, William (2011). "Collecting Highly Parallel Data for Paraphrase Evaluation". Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon. pp. 190–200.

External links

Microsoft Research Paraphrase Corpus - a dataset consisting of 5800 pairs of sentences extracted from news articles with annotations of whether a pair captures paraphrase/semantic equivalence
Paraphrase Database (PPDB) - A searchable database containing millions of paraphrases in 16 different languages
