Paraphrasing (computational linguistics): Difference between revisions

Revision as of 06:01, 27 December 2023 by Citation bot (Alter: title, template type. Add: chapter. Removed proxy/dead URL that duplicated identifier. Removed parameters.)
Latest revision as of 04:48, 27 August 2025 by Citation bot (Removed URL that duplicated identifier. Removed parameters. Suggested by Headbomb.)
(8 intermediate revisions by 5 users not shown)
=== Multiple sequence alignment ===
Barzilay and Lee<ref name=Barzilay>{{cite conference\|last1=Barzilay\|first1=Regina\|last2=Lee\|first2=Lillian\|title=Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment\|conference=Proceedings of HLT-NAACL 2003\|date=May–June 2003\|url=https://www.cs.cornell.edu/home/llee/papers/statpar.home.html}}</ref> proposed a method to generate paraphrases from monolingual [[parallel text\|parallel corpora]], namely news articles covering the same event on the same day. Training uses [[multiple sequence alignment\|multi-sequence alignment]] to generate sentence-level paraphrases from an unannotated corpus. This is done by
* finding recurring patterns in each individual corpus, e.g. "{{mvar\|X}} (injured/wounded) {{mvar\|Y}} people, {{mvar\|Z}} seriously" where {{mvar\|X, Y, Z}} are variables
* finding pairings between such patterns that represent paraphrases, e.g. "{{mvar\|X}} (injured/wounded) {{mvar\|Y}} people, {{mvar\|Z}} seriously" and "{{mvar\|Y}} were (wounded/hurt) by {{mvar\|X}}, among them {{mvar\|Z}} were in serious condition"

This is achieved by first clustering similar sentences together using [[n-gram]] overlap. Recurring patterns are found within clusters by using multi-sequence alignment. The positions of argument words are then determined by finding areas of high variability within each cluster, that is, the regions between words shared by more than 50% of a cluster's sentences. Pairings between patterns are then found by comparing similar variable words between different corpora. Finally, new paraphrases can be generated by choosing a matching cluster for a source sentence, then substituting the source sentence's arguments into any number of patterns in the cluster.
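The clustering and argument-finding steps can be illustrated with a minimal Python sketch. It is not Barzilay and Lee's implementation: it assumes the sentences in a cluster are already aligned column by column and have equal length (a real multiple sequence alignment handles gaps and produces a lattice), and the function names, the Jaccard form of the overlap score, and the <code>SLOT</code> marker are all hypothetical choices made for illustration.

```python
from collections import Counter

def ngrams(tokens, n=1):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=1):
    """Jaccard overlap of the n-gram sets of two token lists (assumed metric)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def induce_pattern(aligned, threshold=0.5):
    """Turn pre-aligned, equal-length sentences into a slotted pattern.

    A column whose most common token occurs in more than `threshold` of the
    sentences is kept as a backbone word; any other column is treated as a
    region of high variability and becomes a slot. Adjacent slot columns
    are merged into a single slot.
    """
    pattern = []
    for column in zip(*aligned):
        token, count = Counter(column).most_common(1)[0]
        if count / len(column) > threshold:
            pattern.append(token)
        elif not pattern or pattern[-1] != "SLOT":
            pattern.append("SLOT")
    return pattern

# Toy cluster of three already-aligned sentences.
cluster = [
    "a bomber injured 20 people".split(),
    "a gunman injured 7 people".split(),
    "a blast wounded 12 people".split(),
]

print(ngram_overlap(cluster[0], cluster[1]))  # unigram overlap of two sentences
print(induce_pattern(cluster))  # ['a', 'SLOT', 'injured', 'SLOT', 'people']
```

In this toy cluster, "a", "injured" and "people" appear in more than half of the sentences and form the backbone, while the varying agent and casualty count become slots.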
=== Phrase-based machine translation ===
Paraphrases can also be generated through the use of [[statistical machine translation#Phrase-based translation\|phrase-based translation]] as proposed by Bannard and Callison-Burch.<ref name=Bannard>{{cite conference \|last1=Bannard\|first1=Colin\|last2=Callison-Burch\|first2=Chris\|title=Paraphrasing with Bilingual Parallel Corpora \|conference=Proceedings of the 43rd Annual Meeting of the ACL \|place=Ann Arbor, Michigan\|pages=597–604\|year=2005\|url=https://dl.acm.org/citation.cfm?id=1219914}}</ref> The core idea is to align phrases through a [[pivot language]] to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter kontrolle" in its German counterpart. The phrase "unter kontrolle" is then found in another German sentence with the aligned English phrase being "in check," a paraphrase of "under control."

=== Transformers ===
With the introduction of [[Transformer (machine learning model)\|Transformer models]], paraphrase generation approaches improved their ability to generate text by scaling [[neural network]] parameters and heavily parallelizing training through [[Feedforward neural network\|feed-forward layers]].<ref>{{Cite book \|last1=Zhou \|first1=Jianing \|last2=Bhat \|first2=Suma \|title=Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \|chapter=Paraphrase Generation: A Survey of the State of the Art \|date=2021 \|chapter-url=https://aclanthology.org/2021.emnlp-main.414 \|language=en \|location=Online and Punta Cana, Dominican Republic \|publisher=Association for Computational Linguistics \|pages=5075–5086 \|doi=10.18653/v1/2021.emnlp-main.414\|s2cid=243865349 \|doi-access=free }}</ref> These models generate text so fluently that human experts often cannot tell whether an example was human-authored or machine-generated.<ref>{{Cite journal 
\|last1=Dou \|first1=Yao \|last2=Forbes \|first2=Maxwell \|last3=Koncel-Kedziorski \|first3=Rik \|last4=Smith \|first4=Noah \|last5=Choi \|first5=Yejin \|date=2022 \|title=Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text \|url=https://aclanthology.org/2022.acl-long.501 \|journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) \|language=en \|location=Dublin, Ireland \|publisher=Association for Computational Linguistics \|pages=7250–7274 \|doi=10.18653/v1/2022.acl-long.501\|s2cid=247315430 \|doi-access=free \|arxiv=2107.01294 }}</ref> Transformer-based paraphrase generation relies on [[Autoencoder\|autoencoding]], [[Autoregressive model\|autoregressive]], or [[Seq2seq\|sequence-to-sequence]] methods. Autoencoder models predict word replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and seq2seq models generate new text from the source, predicting one word at a time.<ref>{{Cite journal \|last1=Liu \|first1=Xianggen \|last2=Mou \|first2=Lili \|last3=Meng \|first3=Fandong \|last4=Zhou \|first4=Hao \|last5=Zhou \|first5=Jie \|last6=Song \|first6=Sen \|date=2020 \|title=Unsupervised Paraphrasing by Simulated Annealing \|url=https://www.aclweb.org/anthology/2020.acl-main.28 \|journal=Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics \|language=en \|location=Online \|publisher=Association for Computational Linguistics \|pages=302–312 \|doi=10.18653/v1/2020.acl-main.28\|s2cid=202537332 \|doi-access=free \|arxiv=1909.03588 }}</ref><ref>{{Cite book \|last1=Wahle \|first1=Jan Philip \|last2=Ruas \|first2=Terry \|last3=Meuschke \|first3=Norman \|last4=Gipp \|first4=Bela \|title=2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) \|chapter=Are Neural Language Models Good Plagiarists? 
A Benchmark for Neural Paraphrase Detection \|year=2021 \|location=Champaign, IL, USA \|publisher=IEEE \|pages=226–229 \|doi=10.1109/JCDL52503.2021.00065 \|isbn=978-1-6654-1770-9\|s2cid=232320374 \|arxiv=2103.12450 }}</ref> Other work aims to make paraphrasing controllable along predefined quality dimensions, such as semantic preservation or lexical diversity.<ref>{{Cite journal \|last1=Bandel \|first1=Elron \|last2=Aharonov \|first2=Ranit \|last3=Shmueli-Scheuer \|first3=Michal \|last4=Shnayderman \|first4=Ilya \|last5=Slonim \|first5=Noam \|last6=Ein-Dor \|first6=Liat \|date=2022 \|title=Quality Controlled Paraphrase Generation \|url=https://aclanthology.org/2022.acl-long.45 \|journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) \|language=en \|location=Dublin, Ireland \|publisher=Association for Computational Linguistics \|pages=596–609 \|doi=10.18653/v1/2022.acl-long.45\|doi-access=free \|arxiv=2203.10940 }}</ref> Many Transformer-based paraphrase generation methods rely on unsupervised learning to leverage large amounts of training data and to scale.<ref>{{Cite book \|last1=Lee \|first1=John Sie Yuen \|last2=Lim \|first2=Ho Hung \|last3=Webster \|first3=Carol \|title=Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \|chapter=Unsupervised Paraphrasability Prediction for Compound Nominalizations \|date=2022 \|chapter-url=https://aclanthology.org/2022.naacl-main.237 \|language=en \|location=Seattle, United States \|publisher=Association for Computational Linguistics \|pages=3254–3263 \|doi=10.18653/v1/2022.naacl-main.237\|s2cid=250390695 \|doi-access=free }}</ref><ref>{{Cite book \|last1=Niu \|first1=Tong \|last2=Yavuz \|first2=Semih \|last3=Zhou \|first3=Yingbo \|last4=Keskar \|first4=Nitish 
Shirish \|last5=Wang \|first5=Huan \|last6=Xiong \|first6=Caiming \|title=Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \|chapter=Unsupervised Paraphrasing with Pretrained Language Models \|date=2021 \|chapter-url=https://aclanthology.org/2021.emnlp-main.417 \|language=en \|location=Online and Punta Cana, Dominican Republic \|publisher=Association for Computational Linguistics \|pages=5136–5150 \|doi=10.18653/v1/2021.emnlp-main.417\|s2cid=237497412 \|doi-access=free }}</ref>

== Paraphrase recognition ==
=== Recursive autoencoders ===
Paraphrase recognition has been attempted by Socher et al.<ref name=Socher>{{Citation \|last1=Socher \|first1=Richard \|last2=Huang \|first2=Eric \|last3=Pennington \|first3=Jeffrey \|last4=Ng \|first4=Andrew \|last5=Manning \|first5=Christopher \|title=Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection \|chapter=Advances in Neural Information Processing Systems 24 \|year=2011 \|chapter-url=http://www.socher.org/index.php/Main/DynamicPoolingAndUnfoldingRecursiveAutoencodersForParaphraseDetection \|access-date=2017-12-29 \|archive-date=2018-01-06 \|archive-url=https://web.archive.org/web/20180106173348/http://www.socher.org/index.php/Main/DynamicPoolingAndUnfoldingRecursiveAutoencodersForParaphraseDetection \|url-status=dead }}</ref> through the use of recursive [[autoencoder]]s. The main concept is to produce a vector representation of a sentence and its components by recursively applying an autoencoder. Paraphrases should have similar vector representations; these representations are processed, then fed as input into an [[artificial neural network\|neural network]] for classification.

Given a sentence <math>W</math> with <math>m</math> words, the autoencoder is designed to take two <math>n</math>-dimensional [[word embedding]]s as input and produce an <math>n</math>-dimensional vector as output.
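As an illustration of this pairwise composition, the following minimal Python sketch applies a two-to-one encoder level by level until a single vector remains. It is not the authors' code: the <code>compose</code> function is a hypothetical stand-in (element-wise averaging) for a trained autoencoder, the function names are illustrative, and an odd leftover is handled by forwarding the first vector unchanged to the next level.

```python
def compose(left, right):
    """Stand-in for the trained encoder: two n-dim vectors in, one n-dim vector out.

    Assumption: element-wise averaging replaces the learned weights.
    """
    return [(a + b) / 2 for a, b in zip(left, right)]

def encode_sentence(embeddings):
    """Return every vector in the recursion tree, word embeddings first, root last."""
    level = list(embeddings)
    all_vectors = list(level)
    while len(level) > 1:
        # With an odd number of inputs, forward the first vector unchanged.
        carried = [level[0]] if len(level) % 2 else []
        rest = level[1:] if carried else level
        merged = [compose(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
        all_vectors.extend(merged)  # only newly produced vectors are counted
        level = carried + merged
    return all_vectors

four_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
three_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(len(encode_sentence(four_words)), len(encode_sentence(three_words)))  # 7 5
```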
The same autoencoder is applied to every pair of words in <math>W</math> to produce <math>\lfloor m/2 \rfloor</math> vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion. The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.

Given two sentences <math>W_1</math> and <math>W_2</math> of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The [[Euclidean distance]] is then taken between every combination of vectors in <math>W_1</math> and <math>W_2</math> to produce a similarity matrix <math>S \in \mathbb{R}^{7 \times 5}</math>. <math>S</math> is then passed through a dynamic min-[[pooling layer]] to produce a fixed-size <math>n_p \times n_p</math> matrix. Since the size of <math>S</math> is not uniform across all potential sentences, <math>S</math> is split into <math>n_p</math> roughly even sections. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a [[softmax function\|softmax]] output. The dynamic-pooling-to-softmax model is trained using pairs of known paraphrases.

=== Skip-thought vectors ===

=== Transformers ===
As with paraphrase generation, [[Transformer (machine learning model)\|Transformer models]] have been applied with great success to identifying paraphrases.
Models such as BERT can be adapted with a [[binary classification]] layer and trained end-to-end on identification tasks.<ref>{{Cite journal \|last1=Devlin \|first1=Jacob \|last2=Chang \|first2=Ming-Wei \|last3=Lee \|first3=Kenton \|last4=Toutanova \|first4=Kristina \|title=Proceedings of the 2019 Conference of the North \|date=2019 \|url=http://aclweb.org/anthology/N19-1423 \|language=en \|location=Minneapolis, Minnesota \|publisher=Association for Computational Linguistics \|pages=4171–4186 \|doi=10.18653/v1/N19-1423\|s2cid=52967399 \|url-access=subscription \|doi-access=free }}</ref><ref>{{Citation \|last1=Wahle \|first1=Jan Philip \|title=Identifying Machine-Paraphrased Plagiarism \|date=2022 \|url=https://link.springer.com/10.1007/978-3-030-96957-8_34 \|work=Information for a Better World: Shaping the Global Future \|volume=13192 \|pages=393–413 \|editor-last=Smits \|editor-first=Malte \|place=Cham \|publisher=Springer International Publishing \|language=en \|doi=10.1007/978-3-030-96957-8_34 \|isbn=978-3-030-96956-1 \|access-date=2022-10-06 \|last2=Ruas \|first2=Terry \|last3=Foltýnek \|first3=Tomáš \|last4=Meuschke \|first4=Norman \|last5=Gipp \|first5=Bela\|s2cid=232307572 \|arxiv=2103.11909 }}</ref> Compared to more traditional machine learning methods such as [[logistic regression]], Transformers achieve strong results when transferring between domains and paraphrasing techniques.
Other successful methods based on the Transformer architecture include [[Adversarial machine learning\|adversarial learning]] and [[Meta-learning (computer science)\|meta-learning]].<ref>{{Cite book \|last1=Nighojkar \|first1=Animesh \|last2=Licato \|first2=John \|title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) \|chapter=Improving Paraphrase Detection with the Adversarial Paraphrasing Task \|date=2021 \|chapter-url=https://aclanthology.org/2021.acl-long.552 \|language=en \|location=Online \|publisher=Association for Computational Linguistics \|pages=7106–7116 \|doi=10.18653/v1/2021.acl-long.552\|s2cid=235436269 \|doi-access=free }}</ref><ref>{{Cite book \|last1=Dopierre \|first1=Thomas \|last2=Gravier \|first2=Christophe \|last3=Logerais \|first3=Wilfried \|title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) \|chapter=ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing \|date=2021 \|chapter-url=https://aclanthology.org/2021.acl-long.191 \|language=en \|location=Online \|publisher=Association for Computational Linguistics \|pages=2454–2466 \|doi=10.18653/v1/2021.acl-long.191\|s2cid=236460333 \|doi-access=free }}</ref>

== Evaluation ==

== See also ==
* {{annotated link\|Round-trip translation}}
* {{annotated link\|Text simplification}}
* {{annotated link\|Text normalization}}

== References ==