{{DISPLAYTITLE:word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a [[statistical model]] of language based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. Such models were the dominant approach until 2003, when they were superseded by a [[Feedforward neural network#Timeline|feedforward]] neural-network language model (with a single hidden layer and a context of several words, trained on up to 14 million words on a CPU cluster) developed by [[Yoshua Bengio]] and co-authors;<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> neural models have since been superseded by [[deep learning]]-based [[large language model]]s.
 
To prevent a zero probability from being assigned to unseen words, each word's probability is set slightly lower than its relative frequency count in the corpus, reserving part of the total probability mass for ''n''-grams not contained in the training data. Various methods were used for this, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
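As an illustration, the following is a minimal sketch of "add-one" (Laplace) smoothing for bigrams; the toy corpus and function names are invented for this example, and the denominator adds the vocabulary size so that the smoothed probabilities still sum to one.

```python
from collections import Counter

def add_one_bigram_prob(tokens, prev, word):
    """Laplace-smoothed P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab_size = len(unigrams)  # V: number of distinct word types
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

corpus = "the cat sat on the mat".split()
# The bigram ("cat", "on") never occurs, yet it still gets non-zero probability
print(add_one_bigram_prob(corpus, "cat", "on") > 0)  # True
# A seen bigram: count(the, cat) = 1, count(the) = 2, V = 5, so (1+1)/(2+5) = 2/7
print(add_one_bigram_prob(corpus, "the", "cat"))
```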
 
If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if ''n''&nbsp;−&nbsp;1 words, an ''n''-gram model.<ref name=jm/> For example, a bigram language model would assign the probabilities of words in the sentence ''I saw the red house'' as:
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})</math>
 
where <math>\langle s\rangle</math> and <math>\langle /s\rangle</math> are special tokens denoting the start and end of a sentence, respectively.
 
== Unigram model ==
In a unigram (''n''&nbsp;=&nbsp;1) language model, the context window is empty: the probability of each word depends only on that word's own frequency in the corpus, so the probability of a sentence is approximated by the product of the probabilities of its words:

<math display="block">P(w_1,\ldots,w_m) \approx \prod^m_{i=1} P(w_i)</math>
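Since each word is drawn independently under the unigram assumption, a sentence probability is just a product of relative frequencies. A minimal sketch, using an illustrative toy corpus:

```python
from collections import Counter
from math import prod

def unigram_prob(tokens, sentence):
    """P(sentence) under a unigram model: product of relative word frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return prod(counts[w] / total for w in sentence)

corpus = "I saw the red house I saw the dog".split()  # 9 tokens
# P(I) = 2/9 and P(saw) = 2/9, so P("I saw") = 4/81, regardless of word order
print(unigram_prob(corpus, ["I", "saw"]))
```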
 
== Bigram model ==
 
In a bigram word (''n''&nbsp;=&nbsp;2) language model, the probability of the sentence ''I saw the red house'' is approximated as
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})</math>
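The factorization above can be computed directly from bigram counts. A minimal sketch, assuming maximum-likelihood estimates and an illustrative two-sentence training corpus:

```python
from collections import Counter
from math import prod

def bigram_sentence_prob(corpus_tokens, sentence):
    """P(sentence) = product of P(w_i | w_{i-1}), with <s>/</s> padding."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    padded = ["<s>"] + sentence + ["</s>"]
    return prod(bigrams[(a, b)] / unigrams[a] for a, b in zip(padded, padded[1:]))

corpus = "<s> I saw the red house </s> <s> I saw the red dog </s>".split()
# Every conditional in the sentence is 1 except P(house | red) = 1/2,
# so the sentence probability is 0.5
print(bigram_sentence_prob(corpus, "I saw the red house".split()))  # 0.5
```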
 
== Trigram model ==
 
In a trigram (''n''&nbsp;=&nbsp;3) language model, the approximation is
 
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid \langle s\rangle,\langle s\rangle) P(\text{saw}\mid\langle s\rangle,\text{I}) P(\text{the}\mid\text{I, saw}) P(\text{red}\mid\text{saw, the}) P(\text{house}\mid\text{the, red}) P(\langle /s\rangle\mid \text{red, house})</math>
 
Note that the context of the first ''n''&nbsp;–&nbsp;1 ''n''-grams is filled with start-of-sentence markers, typically denoted <nowiki><s></nowiki>.
 
Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house.''
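This effect of the end-of-sentence marker can be checked numerically: without a <math>\langle /s\rangle</math> term, every additional word multiplies in a factor of at most 1, so a prefix can never be less probable than a sentence extending it. A sketch under maximum-likelihood bigram estimates (corpus and names are illustrative):

```python
from collections import Counter
from math import prod

def bigram_prob(corpus_tokens, words):
    """Product of P(w_i | w_{i-1}) over a word sequence, with NO </s> term."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    seq = ["<s>"] + words
    return prod(bigrams[(a, b)] / unigrams[a] for a, b in zip(seq, seq[1:]))

corpus = "<s> I saw the red house </s> <s> I saw the dog </s>".split()
prefix = bigram_prob(corpus, "I saw the".split())          # 1.0
full = bigram_prob(corpus, "I saw the red house".split())  # 0.5
# Each extra conditional factor is <= 1, so the prefix always wins
print(prefix >= full)  # True
```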
 
== Approximation method ==
The approximation method calculates the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> as
 
<math display="block">P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=2} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})</math>
 
It is assumed that the probability of observing the ''i''<sup>th</sup> word ''w<sub>i</sub>'' (in the context window consisting of the preceding ''i''&nbsp;−&nbsp;1 words) can be approximated by the probability of observing it in the shortened context window consisting of the preceding ''n''&nbsp;−&nbsp;1 words (''n''<sup>th</sup>-order [[Markov property]]). To clarify, for ''n''&nbsp;=&nbsp;3 and ''i''&nbsp;=&nbsp;2 we have <math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})=P(w_2\mid w_1)</math>.
 
The conditional probability can be calculated from ''n''-gram model frequency counts:
 
<math display="block">P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>
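The count-based estimate above can be sketched in a few lines; the helper names and the toy corpus are invented for this illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0

corpus = "<s> I saw the red house </s> <s> I saw the dog </s>".split()
# "saw" follows "I" in both sentences: P(saw | I) = 2/2
print(conditional_prob(corpus, ["I"], "saw"))  # 1.0
# "the" is followed by "red" once and by "dog" once: P(red | the) = 1/2
print(conditional_prob(corpus, ["the"], "red"))  # 0.5
```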
 
== References ==
{{Reflist}}