{{short description|Automatic generation or recognition of paraphrased text}}
{{about|automated generation and recognition of paraphrases||Paraphrase (disambiguation)}}
'''Paraphrase''' or '''
== Paraphrase generation ==
=== Multiple sequence alignment ===
Barzilay and Lee<ref name=Barzilay>{{cite conference|last1=Barzilay|first1=Regina|last2=Lee|first2=Lillian|title=Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment|
* finding recurring patterns in each individual corpus, e.g. "{{mvar|X}} (injured/wounded) {{mvar|Y}} people, {{mvar|Z}} seriously" where {{mvar|X, Y, Z}} are variables
* finding pairings between such patterns that represent paraphrases, e.g. "{{mvar|X}} (injured/wounded) {{mvar|Y}} people, {{mvar|Z}} seriously" and "{{mvar|Y}} were (wounded/hurt) by {{mvar|X}}, among them {{mvar|Z}} were in serious condition"
This is achieved by first clustering similar sentences together using [[n-gram]] overlap. Recurring patterns are found within clusters by using multi-sequence alignment. Then the position of argument words
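The clustering step can be illustrated with a minimal Python sketch. The source only states that similar sentences are grouped using n-gram overlap; the Jaccard measure, the threshold of 0.3, the greedy single-pass strategy, and the function names <code>ngram_overlap</code> and <code>cluster</code> are assumptions made for illustration, not the method of the cited paper.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, max_n=2):
    """Jaccard overlap of all 1..max_n-grams of two sentences (assumed measure)."""
    grams_a, grams_b = set(), set()
    for n in range(1, max_n + 1):
        grams_a |= ngrams(a.lower().split(), n)
        grams_b |= ngrams(b.lower().split(), n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def cluster(sentences, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed sentence is similar enough."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if ngram_overlap(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])  # no similar cluster found: start a new one
    return clusters


sentences = [
    "a bomb injured 12 people , 3 seriously",
    "a blast wounded 12 people , 3 seriously",
    "the council approved the new budget",
]
clusters = cluster(sentences)
```

In this toy input the first two sentences share most of their unigrams and bigrams and end up in one cluster, while the third forms its own.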
=== Phrase-based
Paraphrases can also be generated through the use of [[statistical machine translation#Phrase-based translation|phrase-based translation]] as proposed by Bannard and Callison-Burch.<ref name=Bannard>{{cite conference |last1=Bannard|first1=Colin|last2=Callison-Burch|first2=Chris|title=Paraphrasing Bilingual Parallel Corpora |
The probability distribution can be modeled as <math>\Pr(e_2 | e_1)</math>, the probability that phrase <math>e_2</math> is a paraphrase of <math>e_1</math>, which is equivalent to <math>\Pr(e_2|f) \Pr(f|e_1)</math> summed over all <math>f</math>, where <math>f</math> is a potential phrase translation in the pivot language. Additionally, the sentence <math>e_1</math> is added as a prior to add context to the paraphrase. Thus the optimal paraphrase, <math>\hat{e_2}</math>, can be modeled as:
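The marginalization over pivot phrases described above can be sketched in Python. The translation tables below are made-up toy numbers, the names <code>paraphrase_probs</code> and <code>best_paraphrase</code> are hypothetical, and excluding <math>e_1</math> itself as a candidate is an assumption. The sentence-context prior is left out, so this is only the context-free sum <math>\sum_f \Pr(e_2|f) \Pr(f|e_1)</math>.

```python
from collections import defaultdict

# Toy translation tables (invented numbers, for illustration only).
# p_f_given_e[e1][f] stands for Pr(f | e1); p_e_given_f[f][e2] stands for Pr(e2 | f).
p_f_given_e = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
p_e_given_f = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in check": 0.25, "in hand": 0.25},
}


def paraphrase_probs(e1):
    """Sum Pr(e2 | f) * Pr(f | e1) over every pivot phrase f."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e[e1].items():
        for e2, p_e2 in p_e_given_f[f].items():
            if e2 != e1:  # assumption: a phrase is not its own paraphrase
                scores[e2] += p_e2 * p_f
    return dict(scores)


def best_paraphrase(e1):
    """Return the candidate with the highest summed probability."""
    scores = paraphrase_probs(e1)
    return max(scores, key=scores.get)


probs = paraphrase_probs("under control")
best = best_paraphrase("under control")
```

With these toy tables, "in check" collects probability mass through both pivot phrases, whereas "in hand" is reachable through only one.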
=== Long short-term memory ===
[[Long short-term memory]] (LSTM) models have been used successfully to generate paraphrases.<ref name=Prakash>{{Citation|last1=Prakash|first1=Aaditya|last2=Hasan|first2=Sadid A.|last3=Lee|first3=Kathy|last4=Datla|first4=Vivek|last5=Qadir|first5=Ashequl|last6=Liu|first6=Joey|last7=Farri|first7=Oladimeji|title=Neural Paraphrase Generation with Stacked Residual LSTM Networks|year=2016|arxiv=1610.03098|bibcode=2016arXiv161003098P}}</ref> In short, the model consists of an encoder and a decoder component, both implemented using variations of a stacked [[Vanishing gradient problem#Residual networks|residual]] LSTM. First, the encoding LSTM takes a [[one-hot]] encoding of all the words in a sentence as input and produces a final hidden vector, which can
=== Transformers ===
With the introduction of [[Transformer (machine learning model)|Transformer models]], paraphrase generation approaches improved their ability to generate text by scaling [[neural network]] parameters and heavily parallelizing training through [[Feedforward neural network|feed-forward layers]].<ref>{{Cite book |last1=Zhou |first1=Jianing |last2=Bhat |first2=Suma |title=Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing |chapter=Paraphrase Generation: A Survey of the State of the Art |date=2021 |chapter-url=https://aclanthology.org/2021.emnlp-main.414 |language=en |location=Online and Punta Cana, Dominican Republic |publisher=Association for Computational Linguistics |pages=5075–5086 |doi=10.18653/v1/2021.emnlp-main.414|s2cid=243865349 |doi-access=free }}</ref> These models generate text fluently enough that human experts can fail to tell whether an example was human-authored or machine-generated.<ref>{{Cite journal |last1=Dou |first1=Yao |last2=Forbes |first2=Maxwell |last3=Koncel-Kedziorski |first3=Rik |last4=Smith |first4=Noah |last5=Choi |first5=Yejin |date=2022 |title=Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text |url=https://aclanthology.org/2022.acl-long.501 |journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |language=en |location=Dublin, Ireland |publisher=Association for Computational Linguistics |pages=7250–7274 |doi=10.18653/v1/2022.acl-long.501|s2cid=247315430 |doi-access=free |arxiv=2107.01294 }}</ref> Transformer-based paraphrase generation relies on [[Autoencoder|autoencoding]], [[Autoregressive model|autoregressive]], or [[Seq2seq|sequence-to-sequence]] methods.
Autoencoder models predict word replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and seq2seq models generate new text from the source, predicting one word at a time.<ref>{{Cite journal |last1=Liu |first1=Xianggen |last2=Mou |first2=Lili |last3=Meng |first3=Fandong |last4=Zhou |first4=Hao |last5=Zhou |first5=Jie |last6=Song |first6=Sen |date=2020 |title=Unsupervised Paraphrasing by Simulated Annealing |url=https://www.aclweb.org/anthology/2020.acl-main.28 |journal=Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |language=en |location=Online |publisher=Association for Computational Linguistics |pages=302–312 |doi=10.18653/v1/2020.acl-main.28|s2cid=202537332 |doi-access=free |arxiv=1909.03588 }}</ref><ref>{{Cite book |last1=Wahle |first1=Jan Philip |last2=Ruas |first2=Terry |last3=Meuschke |first3=Norman |last4=Gipp |first4=Bela |title=2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) |chapter=Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection |year=2021 |location=Champaign, IL, USA |publisher=IEEE |pages=226–229 |doi=10.1109/JCDL52503.2021.00065 |isbn=978-1-6654-1770-9|s2cid=232320374 |arxiv=2103.12450 }}</ref> Other work makes paraphrasing controllable along predefined quality dimensions, such as semantic preservation or lexical diversity.<ref>{{Cite journal |last1=Bandel |first1=Elron |last2=Aharonov |first2=Ranit |last3=Shmueli-Scheuer |first3=Michal |last4=Shnayderman |first4=Ilya |last5=Slonim |first5=Noam |last6=Ein-Dor |first6=Liat |date=2022 |title=Quality Controlled Paraphrase Generation |url=https://aclanthology.org/2022.acl-long.45 |journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |language=en |location=Dublin, Ireland |publisher=Association for Computational Linguistics |pages=596–609 |doi=10.18653/v1/2022.acl-long.45|doi-access=free |arxiv=2203.10940 }}</ref> Many Transformer-based paraphrase generation methods rely on unsupervised learning to leverage large amounts of training data and scale their methods.<ref>{{Cite book |last1=Lee |first1=John Sie Yuen |last2=Lim |first2=Ho Hung |last3=Carol Webster |first3=Carol |title=Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies |chapter=Unsupervised Paraphrasability Prediction for Compound Nominalizations |date=2022 |chapter-url=https://aclanthology.org/2022.naacl-main.237 |language=en |location=Seattle, United States |publisher=Association for Computational Linguistics |pages=3254–3263 |doi=10.18653/v1/2022.naacl-main.237|s2cid=250390695 |doi-access=free }}</ref><ref>{{Cite book |last1=Niu |first1=Tong |last2=Yavuz |first2=Semih |last3=Zhou |first3=Yingbo |last4=Keskar |first4=Nitish Shirish |last5=Wang |first5=Huan |last6=Xiong |first6=Caiming |title=Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing |chapter=Unsupervised Paraphrasing with Pretrained Language Models |date=2021 |chapter-url=https://aclanthology.org/2021.emnlp-main.417 |language=en |location=Online and Punta Cana, Dominican Republic |publisher=Association for Computational Linguistics |pages=5136–5150 |doi=10.18653/v1/2021.emnlp-main.417|s2cid=237497412 |doi-access=free }}</ref>
== Paraphrase recognition ==
=== Recursive
Paraphrase recognition has been attempted by Socher et al.<ref name=Socher>{{Citation |last1=Socher |first1=Richard |last2=Huang |first2=Eric |last3=Pennington |first3=Jeffrey |last4=Ng |first4=Andrew |last5=Manning |first5=Christopher |title=Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection |
Given a sentence <math>W</math> with <math>m</math> words, the autoencoder is designed to take two <math>n</math>-dimensional [[word embedding]]s as input and produce an <math>n</math>-dimensional vector as output. The same autoencoder is applied to every pair of words in <math>W</math> to produce <math>\lfloor m/2 \rfloor</math> vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as
Given two sentences <math>W_1</math> and <math>W_2</math> of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations including the initial word embeddings. The [[Euclidean distance]] is then taken between every combination of vectors in <math>W_1</math> and <math>W_2</math> to produce a similarity matrix <math>S \in \mathbb{R}^{7 \times 5}</math>. <math>S</math> is then subject to a dynamic min-[[
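The vector counts and the shape of the similarity matrix in this example can be checked with a short Python sketch. The function <code>encode_pair</code> is a hypothetical stand-in that simply averages its two inputs; a trained autoencoder would use learned weights. The two-dimensional embeddings are invented toy values, and forwarding the first vector unchanged on odd-sized levels is assumed from the description above.

```python
import math


def encode_pair(u, v):
    """Stand-in for the trained autoencoder: averages two n-dim vectors (assumption)."""
    return [(a + b) / 2 for a, b in zip(u, v)]


def all_representations(embeddings):
    """Return the word embeddings plus every vector newly produced by the recursion."""
    nodes = list(embeddings)
    level = list(embeddings)
    while len(level) > 1:
        carried = [level[0]] if len(level) % 2 else []  # odd count: forward the first vector
        rest = level[len(carried):]
        merged = [encode_pair(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
        nodes.extend(merged)  # a forwarded vector is not counted a second time
        level = carried + merged
    return nodes


def distance_matrix(reps_1, reps_2):
    """Euclidean distance between every pair of representations."""
    return [[math.dist(u, v) for v in reps_2] for u in reps_1]


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # toy embeddings, 4 words
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # toy embeddings, 3 words
reps_1, reps_2 = all_representations(w1), all_representations(w2)
S = distance_matrix(reps_1, reps_2)
```

Because the two sentences generally differ in length, the resulting matrix has a variable size, here 7 rows by 5 columns.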
=== Skip-thought vectors ===
Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence,
Since paraphrases carry the same meaning, they should have similar skip-thought vectors. Thus a simple [[logistic regression]] can be trained to
=== Transformers ===
As in paraphrase generation, [[Transformer (machine learning model)|Transformer models]] have been highly successful at identifying paraphrases. Models such as BERT can be adapted with a [[binary classification]] layer and trained end-to-end on identification tasks.<ref>{{Cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=Proceedings of the 2019 Conference of the North |date=2019 |url=http://aclweb.org/anthology/N19-1423 |language=en |location=Minneapolis, Minnesota |publisher=Association for Computational Linguistics |pages=4171–4186 |doi=10.18653/v1/N19-1423|s2cid=52967399 |url-access=subscription |doi-access=free }}</ref><ref>{{Citation |last1=Wahle |first1=Jan Philip |title=Identifying Machine-Paraphrased Plagiarism |date=2022 |url=https://link.springer.com/10.1007/978-3-030-96957-8_34 |work=Information for a Better World: Shaping the Global Future |volume=13192 |pages=393–413 |editor-last=Smits |editor-first=Malte |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-96957-8_34 |isbn=978-3-030-96956-1 |access-date=2022-10-06 |last2=Ruas |first2=Terry |last3=Foltýnek |first3=Tomáš |last4=Meuschke |first4=Norman |last5=Gipp |first5=Bela|s2cid=232307572 |arxiv=2103.11909 }}</ref> Transformers achieve strong results when transferring between domains and paraphrasing techniques compared to more traditional machine learning methods such as [[logistic regression]].
Other successful methods based on the Transformer architecture include using [[Adversarial machine learning|adversarial learning]] and [[Meta-learning (computer science)|meta-learning]].<ref>{{Cite book |last1=Nighojkar |first1=Animesh |last2=Licato |first2=John |title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) |chapter=Improving Paraphrase Detection with the Adversarial Paraphrasing Task |date=2021 |chapter-url=https://aclanthology.org/2021.acl-long.552 |language=en |location=Online |publisher=Association for Computational Linguistics |pages=7106–7116 |doi=10.18653/v1/2021.acl-long.552|s2cid=235436269 |doi-access=free }}</ref><ref>{{Cite book |last1=Dopierre |first1=Thomas |last2=Gravier |first2=Christophe |last3=Logerais |first3=Wilfried |title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) |chapter=ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing |date=2021 |chapter-url=https://aclanthology.org/2021.acl-long.191 |language=en |location=Online |publisher=Association for Computational Linguistics |pages=2454–2466 |doi=10.18653/v1/2021.acl-long.191|s2cid=236460333 |doi-access=free }}</ref>
== Evaluation ==
The evaluation of paraphrase generation faces difficulties similar to those of evaluating [[machine translation]].
Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC)<ref name=Chen /> and paraphrase evaluation metric (PEM)<ref name=Liu>{{cite conference|last1=Liu|first1=Chang|last2=Dahlmeier|first2=Daniel|last3=Ng|first3=Hwee Tou|title=PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts |
The Quora Question Pairs Dataset, which contains hundreds of thousands of duplicate questions, has become a common dataset for the evaluation of paraphrase detectors.<ref>{{cite web |title=Paraphrase Identification on Quora Question Pairs |url=https://paperswithcode.com/sota/paraphrase-identification-on-quora-question|website=Papers with Code}}</ref> Consistently reliable paraphrase detectors have all used the Transformer architecture, and all have relied on large amounts of pre-training on more general data before fine-tuning on the question pairs.
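As an illustration of the PINC metric mentioned above, the following Python sketch computes the average fraction of candidate n-grams that do not appear in the source sentence, so that higher values indicate more lexical change. Whitespace tokenization, lowercasing, the default maximum order of 4, and skipping n-gram orders longer than the candidate are assumptions of this sketch, and the function name <code>pinc</code> is hypothetical.

```python
def ngram_set(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """Average over n of (1 - share of candidate n-grams also found in the source)."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_grams = ngram_set(cand, n)
        if not cand_grams:  # candidate shorter than n tokens: skip this order (assumption)
            continue
        overlap = len(ngram_set(src, n) & cand_grams)
        scores.append(1 - overlap / len(cand_grams))
    return sum(scores) / len(scores) if scores else 0.0


identical = pinc("a b c d", "a b c d")
disjoint = pinc("a b c d", "w x y z")
partial = pinc("the cat sat", "the cat slept")
```

A candidate that copies the source scores 0 and one that shares no n-grams with it scores 1. The score measures only lexical novelty, not whether the meaning is preserved.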
== See also ==
*
*
*
== References ==
* [https://www.microsoft.com/en-us/download/details.aspx?id=52398 Microsoft Research Paraphrase Corpus] - a dataset consisting of 5800 pairs of sentences extracted from news articles annotated to note whether a pair captures semantic equivalence
* [http://paraphrase.org/#/ Paraphrase Database (PPDB)] - A searchable database containing millions of paraphrases in 16 different languages
[[Category:Computational linguistics]]
[[Category:Machine learning]]