{{Short description|Purely statistical model of language}} {{DISPLAYTITLE:Word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a purely statistical model of language. It has been superseded by [[recurrent neural network]]-based models, which have in turn been superseded by [[large language model]]s.<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if ''n'' − 1 words, an ''n''-gram model.<ref name=jm/> Special tokens <math>\langle s\rangle</math> and <math>\langle /s\rangle</math> were introduced to denote the start and end of a sentence.
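This Markov assumption can be written out explicitly; under it, the probability of a sentence <math>w_1, \ldots, w_m</math> is approximated by conditioning each word only on the preceding ''n'' − 1 words:

<math display="block">P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})</math>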
To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in the corpus. Various methods were used to estimate these probabilities, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
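As an illustration, here is a minimal sketch of "add-one" (Laplace) smoothing for a bigram model; the function name and the toy corpus are made up for this example, not taken from any particular implementation:

```python
from collections import Counter

def laplace_prob(tokens, context, word):
    """Add-one (Laplace) smoothed estimate of P(word | context).
    Each n-gram count is incremented by 1, so unseen n-grams
    receive a small nonzero probability instead of zero."""
    n = len(context) + 1
    vocab_size = len(set(tokens))
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return (ngrams[context + (word,)] + 1) / (contexts[context] + vocab_size)

tokens = "the cat sat on the mat".split()
print(laplace_prob(tokens, ("the",), "cat"))  # seen bigram: (1 + 1) / (2 + 5)
print(laplace_prob(tokens, ("the",), "sat"))  # unseen bigram: (0 + 1) / (2 + 5)
```

Note that the unseen bigram "the sat" gets a small positive probability rather than zero, at the cost of slightly lowering the probability of every observed bigram.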
== Unigram model ==
''n''-gram-based searching was also used for [[plagiarism detection]].
== Bias–variance tradeoff ==
{{Main|Bias–variance tradeoff}}
To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e. one over triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
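The trade-off can be seen in a small unsmoothed maximum-likelihood sketch (the function name and toy corpus are illustrative only): longer contexts give sharper estimates, but on a small corpus they are far more often unseen:

```python
from collections import Counter

def mle_prob(tokens, context, word):
    """Unsmoothed maximum-likelihood estimate of P(word | context) for an
    n-gram model with n = len(context) + 1; None if the context is unseen."""
    n = len(context) + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    seen = contexts[context]
    return ngrams[context + (word,)] / seen if seen else None

tokens = "the cat sat on the mat".split()
print(mle_prob(tokens, ("the",), "cat"))        # bigram estimate: 0.5
print(mle_prob(tokens, ("the", "cat"), "sat"))  # trigram estimate: 1.0
print(mle_prob(tokens, ("cat", "on"), "the"))   # unseen trigram context: None
```

The trigram estimate is more specific than the bigram one, but a context as short as two words is already unseen in this tiny corpus, which is why larger ''n'' demands a larger training corpus.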
=== Skip-gram language model ===
[[File:1-skip-2-gram.svg|thumb|1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"]]
The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word ''n''-gram language model) faced. Words represented in an embedding vector were no longer necessarily consecutive, but could leave gaps that are ''skipped'' over (thus the name "skip-gram").<ref>{{cite web|url=http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|title=A Closer Look at Skip-gram Modelling|author=David Guthrie|date=2006|display-authors=etal|access-date=27 April 2014|archive-url=https://web.archive.org/web/20170517144625/http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|archive-date=17 May 2017|url-status=dead}}</ref>
Formally, a {{mvar|k}}-skip-{{mvar|n}}-gram is a length-{{mvar|n}} subsequence where the components occur at distance at most {{mvar|k}} from each other.
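Under one common reading of this definition (each pair of adjacent components may skip at most {{mvar|k}} intervening words), the {{mvar|k}}-skip-{{mvar|n}}-grams of a sentence can be enumerated as in the following sketch; the function name is illustrative:

```python
from itertools import combinations

def k_skip_n_grams(tokens, n, k):
    """Enumerate k-skip-n-grams: ordered length-n subsequences in which
    each pair of adjacent components skips at most k intervening words."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        if all(idx[j + 1] - idx[j] <= k + 1 for j in range(n - 1)):
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "the rain in Spain".split()
print(k_skip_n_grams(tokens, 2, 1))
# ordinary bigrams plus pairs that skip one word, e.g. ('the', 'in')
```

With ''k'' = 0 this reduces to the ordinary (consecutive) ''n''-grams, which is one way to see that skip-grams strictly generalize them.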
<math display="block">v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})</math>
where ≈ is made precise by stipulating that its right-hand side must be the [[Nearest neighbor search|nearest neighbor]] of the value of the left-hand side.<ref name="mikolov">{{cite arXiv |first1=Tomas |last1=Mikolov |first2=Kai |last2=Chen |first3=Greg |last3=Corrado |first4=Jeffrey |last4=Dean |eprint=1301.3781 |title=Efficient estimation of word representations in vector space |year=2013|class=cs.CL }}</ref><ref name="compositionality">{{cite conference |title=Distributed Representations of Words and Phrases and their Compositionality |last1=Mikolov |first1=Tomas |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado |first4=Greg |last5=Dean |first5=Jeffrey |year=2013 |conference=Advances in Neural Information Processing Systems}}</ref>
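A toy sketch of this nearest-neighbor arithmetic follows; the 2-dimensional vectors are invented purely for illustration (real word embeddings have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Hypothetical 2-D embedding table, for illustration only.
emb = {
    "king":   np.array([0.9, 0.8]),
    "queen":  np.array([0.9, 0.2]),
    "male":   np.array([0.1, 0.7]),
    "female": np.array([0.1, 0.1]),
}

def analogy(a, b, c):
    """Vocabulary word whose vector is nearest to v(a) - v(b) + v(c),
    excluding the three query words themselves."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(emb[w] - target))

print(analogy("king", "male", "female"))  # → queen
```

Excluding the query words from the candidate set is the usual convention, since the left-hand side of the analogy is often closest to one of its own inputs.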
== Syntactic ''n''-grams ==