{{Short description|Purely statistical model of language}} {{DISPLAYTITLE:Word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a purely statistical model of language. It has been superseded by [[recurrent neural network]]-based models, which have in turn been superseded by [[large language model]]s.<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if ''n'' − 1 words, an ''n''-gram model.<ref name=jm/> Special tokens <math>\langle s\rangle</math> and <math>\langle /s\rangle</math> were introduced to denote the start and end of a sentence.
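This Markov assumption can be written out explicitly; under it, the probability of a sentence <math>w_1, \ldots, w_m</math> is approximated by conditioning each word only on the preceding ''n'' − 1 words:

<math display="block">P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})</math>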
To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in the corpus. Various methods were used to estimate these probabilities, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
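As an illustration, here is a minimal sketch of "add-one" (Laplace) smoothing for a bigram model; the function name and the toy corpus are made up for this example, not taken from any particular implementation:

```python
from collections import Counter

def laplace_prob(tokens, context, word):
    """Add-one (Laplace) smoothed estimate of P(word | context).
    Each n-gram count is incremented by 1, so unseen n-grams
    receive a small nonzero probability instead of zero."""
    n = len(context) + 1
    vocab_size = len(set(tokens))
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return (ngrams[context + (word,)] + 1) / (contexts[context] + vocab_size)

tokens = "the cat sat on the mat".split()
print(laplace_prob(tokens, ("the",), "cat"))  # seen bigram: (1 + 1) / (2 + 5)
print(laplace_prob(tokens, ("the",), "sat"))  # unseen bigram: (0 + 1) / (2 + 5)
```

Note that the unseen bigram "the sat" gets a small positive probability rather than zero, at the cost of slightly lowering the probability of every observed bigram.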
== Unigram model ==
''n''-gram-based searching was also used for [[plagiarism detection]].
== Bias–variance tradeoff ==
{{Main|Bias–variance tradeoff}}
To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e. one over triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
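The trade-off can be seen in a small unsmoothed maximum-likelihood sketch (the function name and toy corpus are illustrative only): longer contexts give sharper estimates, but on a small corpus they are far more often unseen:

```python
from collections import Counter

def mle_prob(tokens, context, word):
    """Unsmoothed maximum-likelihood estimate of P(word | context) for an
    n-gram model with n = len(context) + 1; None if the context is unseen."""
    n = len(context) + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    seen = contexts[context]
    return ngrams[context + (word,)] / seen if seen else None

tokens = "the cat sat on the mat".split()
print(mle_prob(tokens, ("the",), "cat"))        # bigram estimate: 0.5
print(mle_prob(tokens, ("the", "cat"), "sat"))  # trigram estimate: 1.0
print(mle_prob(tokens, ("cat", "on"), "the"))   # unseen trigram context: None
```

The trigram estimate is more specific than the bigram one, but a context as short as two words is already unseen in this tiny corpus, which is why larger ''n'' demands a larger training corpus.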
=== Skip-gram language model ===
[[File:1-skip-2-gram.svg|thumb|1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"]]
The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word ''n''-gram language model) faced. Words represented in an embedding vector were no longer necessarily consecutive, but could leave gaps that are ''skipped'' over (thus the name "skip-gram").<ref>{{cite web|url=http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|title=A Closer Look at Skip-gram Modelling|author=David Guthrie|date=2006|display-authors=etal|access-date=27 April 2014|archive-url=https://web.archive.org/web/20170517144625/http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|archive-date=17 May 2017|url-status=dead}}</ref>
Formally, a {{mvar|k}}-skip-{{mvar|n}}-gram is a length-{{mvar|n}} subsequence where the components occur at distance at most {{mvar|k}} from each other.
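Under one common reading of this definition (each pair of adjacent components may skip at most {{mvar|k}} intervening words), the {{mvar|k}}-skip-{{mvar|n}}-grams of a sentence can be enumerated as in the following sketch; the function name is illustrative:

```python
from itertools import combinations

def k_skip_n_grams(tokens, n, k):
    """Enumerate k-skip-n-grams: ordered length-n subsequences in which
    each pair of adjacent components skips at most k intervening words."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        if all(idx[j + 1] - idx[j] <= k + 1 for j in range(n - 1)):
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "the rain in Spain".split()
print(k_skip_n_grams(tokens, 2, 1))
# ordinary bigrams plus pairs that skip one word, e.g. ('the', 'in')
```

With ''k'' = 0 this reduces to the ordinary (consecutive) ''n''-grams, which is one way to see that skip-grams strictly generalize them.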
<math display="block">v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})</math>
where ≈ is made precise by stipulating that its right-hand side must be the [[Nearest neighbor search|nearest neighbor]] of the value of the left-hand side.<ref name="mikolov">{{cite arXiv |first1=Tomas |last1=Mikolov |first2=Kai |last2=Chen |first3=Greg |last3=Corrado |first4=Jeffrey |last4=Dean |eprint=1301.3781 |title=Efficient estimation of word representations in vector space |year=2013|class=cs.CL }}</ref><ref name="compositionality">{{cite conference |title=Distributed Representations of Words and Phrases and their Compositionality |last1=Mikolov |first1=Tomas |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado |first4=Greg |last5=Dean |first5=Jeffrey |year=2013 |conference=Advances in Neural Information Processing Systems}}</ref>
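A toy sketch of this nearest-neighbor arithmetic follows; the 2-dimensional vectors are invented purely for illustration (real word embeddings have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Hypothetical 2-D embedding table, for illustration only.
emb = {
    "king":   np.array([0.9, 0.8]),
    "queen":  np.array([0.9, 0.2]),
    "male":   np.array([0.1, 0.7]),
    "female": np.array([0.1, 0.1]),
}

def analogy(a, b, c):
    """Vocabulary word whose vector is nearest to v(a) - v(b) + v(c),
    excluding the three query words themselves."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(emb[w] - target))

print(analogy("king", "male", "female"))  # → queen
```

Excluding the query words from the candidate set is the usual convention, since the left-hand side of the analogy is often closest to one of its own inputs.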
== Syntactic ''n''-grams ==