Word n-gram language model: Difference between revisions

{{DISPLAYTITLE:word ''n''-gram language model}}
A '''word n-gram model''' was a [[language model]] that generated probabilities of a series of words, based on the (over-simplified) assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. It was used in [[natural language processing]] until 2003, when it was outperformed and superseded by a [[multi-layer perceptron]] with a single hidden layer and a context of several words, trained on up to 14 million words with a CPU cluster, by [[Yoshua Bengio]] and co-authors.<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> It has since been superseded by [[neural language model|deep learning]]-based [[large language model]]s.
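A fixed-window model of this kind can be sketched in a few lines of Python. The toy corpus and the bigram (''n'' = 2) setting below are illustrative assumptions, not from the article; a real model would be trained on millions of words:

```python
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration).
corpus = "the cat sat on the mat the cat ate the fish".split()

n = 2  # bigram model: the next word depends only on the 1 previous word
ngrams = Counter(zip(*(corpus[i:] for i in range(n))))        # counts of n-grams
contexts = Counter(zip(*(corpus[i:] for i in range(n - 1))))  # counts of (n-1)-gram contexts

def prob(word, *context):
    """P(word | context) as a relative frequency (maximum likelihood)."""
    return ngrams[context + (word,)] / contexts[context]

print(prob("cat", "the"))  # 2 of the 4 occurrences of "the" are followed by "cat" -> 0.5
```

Because the window is fixed, any dependency on words further back than ''n'' − 1 positions is simply invisible to the model.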
 
The probabilities were not equal to raw frequency counts, because a pure frequency model could not assign any portion of the total probability mass to ''n''-grams not seen in the training data. Various smoothing methods were used, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.