Neonrights (talk | contribs) |
Neonrights (talk | contribs) m →Models |
Line 8:
=== Multi-Sequence Alignment ===
Barzilay and Lee<ref name=Barzilay>{{cite conference|last1=Barzilay|first1=Regina|last2=Lee|first2=Lillian|title=Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment|book-title=Proceedings of HLT-NAACL 2003|date=May–June 2003|url=http://www.cs.cornell.edu/home/llee/papers/statpar.home.html}}</ref> proposed a method to generate paraphrases using monolingual [[parallel text|parallel corpora]], namely news articles covering the same event on the same day. Training consists of using "[[multiple sequence alignment|multi-sequence alignment]] to generate sentence-level paraphrases... from [an] unannotated corpus data"; as such, it can be considered an instance of [[unsupervised learning]]. The main goals of the training algorithm are thus:
* finding recurring patterns in each individual corpus, e.g. "{{mvar|X}} (injured/wounded) {{mvar|Y}} people, {{mvar|Z}} seriously", where {{mvar|X, Y, Z}} are variables
Line 17 ⟶ 14:
Accordingly, the training algorithm consists of four steps. First, sentences describing similar events with similar structure are clustered together, with similarity judged by [[n-gram]] overlap. Second, patterns are induced by computing a multiple-sequence alignment between the sentences in each cluster, producing a ''lattice''. During this step, areas of high variability are treated as instances of arguments and replaced with ''slots''; these are the areas between words shared by more than 50% of the cluster's sentences. Third, lattices are matched between corpora based on matching or similar arguments within their slots. Finally, new paraphrases are generated by taking a new sentence, determining which sentence cluster it most closely belongs to, and selecting an appropriately matching lattice. If a matching lattice is found, the slot arguments are determined and then used to generate as many new paraphrases as there are lattices in the matching cluster.
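The first two steps can be illustrated with a minimal sketch. It is not Barzilay and Lee's implementation: it assumes Jaccard overlap of word bigrams as the similarity measure, and it skips the multiple-sequence alignment and lattice entirely, treating each sentence as a bag of words to find those shared by more than 50% of a cluster. The function names, the <code>SLOT</code> marker, and the example sentences are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap of n-gram sets (an assumed similarity measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def backbone_words(cluster, threshold=0.5):
    """Words occurring in more than `threshold` of the cluster's sentences."""
    # Count each word once per sentence, so counts are sentence frequencies.
    counts = Counter(word for sentence in cluster for word in set(sentence))
    return {word for word, c in counts.items() if c > threshold * len(cluster)}


def mark_slots(sentence, backbone):
    """Collapse each maximal run of non-backbone words into one SLOT marker."""
    result = []
    for word in sentence:
        if word in backbone:
            result.append(word)
        elif not result or result[-1] != "SLOT":
            result.append("SLOT")
    return result


# Hypothetical cluster of already-tokenized sentences about similar events.
cluster = [
    "a bomb injured 12 people in haifa".split(),
    "a blast injured 5 people in gaza".split(),
    "a bomb wounded 30 people in jerusalem".split(),
]

similarity = ngram_overlap(cluster[0], cluster[1])
backbone = backbone_words(cluster)
pattern = mark_slots(cluster[0], backbone)
print(similarity, sorted(backbone), pattern)
```

Because this simplification ignores word positions, adjacent variable regions merge into a single slot; a real alignment-based lattice keeps them distinct and also preserves alternatives such as "injured/wounded".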
=== Translation ===