{{DISPLAYTITLE:word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a [[statistical model]] of language based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. Such models were the dominant approach until 2003, when they were superseded by a [[Feedforward neural network#Timeline|feedforward]] neural-network language model (with a single hidden layer and a context of several words, trained on up to 14 million words on a CPU cluster) developed by [[Yoshua Bengio]] and co-authors;<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> neural models have since been superseded by [[deep learning]]-based [[large language model]]s.
 
To prevent a zero probability from being assigned to unseen words, each word's probability is set slightly lower than its relative frequency count in the corpus, reserving part of the total probability mass for ''n''-grams not contained in the training data. Various methods were used for this, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
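As an illustration, the following is a minimal sketch of "add-one" (Laplace) smoothing for bigrams; the toy corpus and function names are invented for this example, and the denominator adds the vocabulary size so that the smoothed probabilities still sum to one.

```python
from collections import Counter

def add_one_bigram_prob(tokens, prev, word):
    """Laplace-smoothed P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab_size = len(unigrams)  # V: number of distinct word types
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

corpus = "the cat sat on the mat".split()
# The bigram ("cat", "on") never occurs, yet it still gets non-zero probability
print(add_one_bigram_prob(corpus, "cat", "on") > 0)  # True
# A seen bigram: count(the, cat) = 1, count(the) = 2, V = 5, so (1+1)/(2+5) = 2/7
print(add_one_bigram_prob(corpus, "the", "cat"))
```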
 
If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if ''n''&nbsp;−&nbsp;1 words, an ''n''-gram model.<ref name=jm/> For example, a bigram language model would assign the probabilities of words in the sentence ''I saw the red house'' as:
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})</math>
 
where <math>\langle s\rangle</math> and <math>\langle /s\rangle</math> are special tokens denoting the start and end of a sentence, respectively.
 
== Unigram model ==
In a unigram (''n''&nbsp;=&nbsp;1) language model, the context window is empty: the probability of each word depends only on that word's own frequency in the corpus, so the probability of a sentence is approximated by the product of the probabilities of its words:

<math display="block">P(w_1,\ldots,w_m) \approx \prod^m_{i=1} P(w_i)</math>
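Since each word is drawn independently under the unigram assumption, a sentence probability is just a product of relative frequencies. A minimal sketch, using an illustrative toy corpus:

```python
from collections import Counter
from math import prod

def unigram_prob(tokens, sentence):
    """P(sentence) under a unigram model: product of relative word frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return prod(counts[w] / total for w in sentence)

corpus = "I saw the red house I saw the dog".split()  # 9 tokens
# P(I) = 2/9 and P(saw) = 2/9, so P("I saw") = 4/81, regardless of word order
print(unigram_prob(corpus, ["I", "saw"]))
```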
 
== Bigram model ==
 
In a bigram word (''n''&nbsp;=&nbsp;2) language model, the probability of the sentence ''I saw the red house'' is approximated as
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})</math>
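The factorization above can be computed directly from bigram counts. A minimal sketch, assuming maximum-likelihood estimates and an illustrative two-sentence training corpus:

```python
from collections import Counter
from math import prod

def bigram_sentence_prob(corpus_tokens, sentence):
    """P(sentence) = product of P(w_i | w_{i-1}), with <s>/</s> padding."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    padded = ["<s>"] + sentence + ["</s>"]
    return prod(bigrams[(a, b)] / unigrams[a] for a, b in zip(padded, padded[1:]))

corpus = "<s> I saw the red house </s> <s> I saw the red dog </s>".split()
# Every conditional in the sentence is 1 except P(house | red) = 1/2,
# so the sentence probability is 0.5
print(bigram_sentence_prob(corpus, "I saw the red house".split()))  # 0.5
```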
 
== Trigram model ==
 
In a trigram (''n''&nbsp;=&nbsp;3) language model, the approximation is
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid \langle s\rangle,\langle s\rangle) P(\text{saw}\mid\langle s\rangle,\text{I}) P(\text{the}\mid\text{I, saw}) P(\text{red}\mid\text{saw, the}) P(\text{house}\mid\text{the, red}) P(\langle /s\rangle\mid \text{red, house})</math>
 
Note that the context of the first ''n''&nbsp;–&nbsp;1 ''n''-grams is filled with start-of-sentence markers, typically denoted <nowiki><s></nowiki>.
 
Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house.''
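This effect of the end-of-sentence marker can be checked numerically: without a <math>\langle /s\rangle</math> term, every additional word multiplies in a factor of at most 1, so a prefix can never be less probable than a sentence extending it. A sketch under maximum-likelihood bigram estimates (corpus and names are illustrative):

```python
from collections import Counter
from math import prod

def bigram_prob(corpus_tokens, words):
    """Product of P(w_i | w_{i-1}) over a word sequence, with NO </s> term."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    seq = ["<s>"] + words
    return prod(bigrams[(a, b)] / unigrams[a] for a, b in zip(seq, seq[1:]))

corpus = "<s> I saw the red house </s> <s> I saw the dog </s>".split()
prefix = bigram_prob(corpus, "I saw the".split())          # 1.0
full = bigram_prob(corpus, "I saw the red house".split())  # 0.5
# Each extra conditional factor is <= 1, so the prefix always wins
print(prefix >= full)  # True
```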
 
== Approximation method ==
The approximation method calculates the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> as
 
<math display="block">P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=2} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})</math>
 
It is assumed that the probability of observing the ''i''<sup>th</sup> word ''w<sub>i</sub>'' (in the context window consisting of the preceding ''i''&nbsp;−&nbsp;1 words) can be approximated by the probability of observing it in the shortened context window consisting of the preceding ''n''&nbsp;−&nbsp;1 words (''n''<sup>th</sup>-order [[Markov property]]). To clarify, for ''n''&nbsp;=&nbsp;3 and ''i''&nbsp;=&nbsp;2 we have <math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})=P(w_2\mid w_1)</math>.
 
The conditional probability can be calculated from ''n''-gram model frequency counts:
 
<math display="block">P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>
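The count-based estimate above can be sketched in a few lines; the helper names and the toy corpus are invented for this illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0

corpus = "<s> I saw the red house </s> <s> I saw the dog </s>".split()
# "saw" follows "I" in both sentences: P(saw | I) = 2/2
print(conditional_prob(corpus, ["I"], "saw"))  # 1.0
# "the" is followed by "red" once and by "dog" once: P(red | the) = 1/2
print(conditional_prob(corpus, ["the"], "red"))  # 0.5
```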
 
== References ==
{{Reflist}}