{{Short description|Purely statistical model of language}} {{DISPLAYTITLE:Word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a purely statistical model of language. It has been superseded by [[recurrent neural network]]–based models, which have in turn been superseded by [[large language model]]s.<ref>{{Cite journal |url=https://dl.acm.org/doi/10.5555/944919.944966 |title=A neural probabilistic language model |first1=Yoshua |last1=Bengio |first2=Réjean |last2=Ducharme |first3=Pascal |last3=Vincent |first4=Christian |last4=Janvin |date=March 1, 2003 |journal=The Journal of Machine Learning Research |volume=3 |pages=1137–1155 |via=ACM Digital Library}}</ref> It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if ''n'' − 1 words, an ''n''-gram model.<ref name=jm/> Special tokens, <math>\langle s\rangle</math> and <math>\langle /s\rangle</math>, are introduced to denote the start and end of a sentence.
To prevent a zero probability being assigned to unseen words, each word's probability is slightly higher than its frequency count in a [[Text corpus|corpus]]. To calculate it, various methods were used, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off models]].
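As a minimal sketch of the simplest of these methods, "add-one" (Laplace) smoothing for a bigram model might look like the following; the toy corpus and the choice of ''n'' = 2 are illustrative assumptions, not part of any particular implementation:

```python
from collections import Counter

def laplace_bigram(tokens, vocab):
    """Add-one (Laplace) smoothed bigram estimator:
    P(w | prev) = (count(prev, w) + 1) / (count(prev) + |V|)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])  # every token that precedes another
    V = len(vocab)
    def p(word, prev):
        return (bigrams[(prev, word)] + 1) / (histories[prev] + V)
    return p

tokens = "the cat sat on the mat".split()
p = laplace_bigram(tokens, vocab=set(tokens))
print(p("cat", "the"))  # (1 + 1) / (2 + 5)
print(p("sat", "the"))  # unseen bigram, yet nonzero: (0 + 1) / (2 + 5)
```

Without the added pseudocounts, the second probability would be exactly zero.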
== Unigram model ==
{{see also|Bag-of-words model}}
A special case, where ''n'' = 1, is called a unigram model. It ignores context entirely: the probability of each word depends only on that word's own frequency in the corpus, so the probability of a sequence is simply the product of the probabilities of its words:
<math display="block">P_\text{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3).</math>
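The factorization above can be sketched as follows; the toy corpus is an illustrative assumption, and no smoothing is applied:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text corpus.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_uni(word):
    """Maximum-likelihood unigram probability (no smoothing)."""
    return counts[word] / total

def p_sequence(words):
    """P_uni(t1 t2 ... tk) = product of the individual word probabilities."""
    prob = 1.0
    for w in words:
        prob *= p_uni(w)
    return prob

print(p_uni("the"))                 # 3/9
print(p_sequence(["the", "cat"]))   # (3/9) * (2/9)
```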
== Bigram model ==
In a word bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})</math>
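This product of conditional probabilities can be computed from bigram counts. Below is a minimal sketch with an invented two-sentence training corpus; real systems estimate the counts from much larger corpora and apply smoothing:

```python
from collections import Counter

# Tiny, hypothetical training corpus with sentence markers.
sentences = [["<s>", "I", "saw", "the", "red", "house", "</s>"],
             ["<s>", "I", "saw", "the", "dog", "</s>"]]

bigrams = Counter()
histories = Counter()
for s in sentences:
    histories.update(s[:-1])        # every token that conditions a successor
    bigrams.update(zip(s, s[1:]))   # (previous word, word) pairs

def p(word, prev):
    """MLE bigram probability P(word | prev); zero for unseen pairs."""
    return bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0

# P(I saw the red house) under the bigram approximation:
seq = ["<s>", "I", "saw", "the", "red", "house", "</s>"]
prob = 1.0
for prev, word in zip(seq, seq[1:]):
    prob *= p(word, prev)
print(prob)  # 0.5 for this toy corpus, since only P(red | the) = 1/2
```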
== Trigram model ==
In a trigram (''n'' = 3) language model, the approximation is
<math display="block">P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid \langle s\rangle,\langle s\rangle) P(\text{saw}\mid\langle s\rangle,\text{I}) P(\text{the}\mid\text{I, saw}) P(\text{red}\mid\text{saw, the}) P(\text{house}\mid\text{the, red}) P(\langle /s\rangle\mid\text{red, house})</math>
Note that the context of the first ''n'' – 1 ''n''-grams is filled with start-of-sentence markers, typically denoted <nowiki><s></nowiki>.
Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence ''*I saw the'' would always be higher than that of the longer sentence ''I saw the red house.''
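The padding with start markers and the closing end marker can be sketched as a small helper; the marker spellings are conventional but otherwise arbitrary:

```python
def ngrams_with_markers(words, n):
    """Pad a sentence with n-1 start markers and one end marker, then
    return each position as (context of n-1 words, predicted word)."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    return [(tuple(padded[i:i + n - 1]), padded[i + n - 1])
            for i in range(len(padded) - n + 1)]

trigrams = ngrams_with_markers("I saw the red house".split(), 3)
print(trigrams[0])   # (('<s>', '<s>'), 'I')
print(trigrams[-1])  # (('red', 'house'), '</s>')
```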
== Approximation method ==
The approximation method calculates the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> as
<math display="block">P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})</math>
It is assumed that the probability of observing the ''i''<sup>th</sup> word ''w<sub>i</sub>'' given the preceding ''i'' − 1 words can be approximated by the probability of observing it after only the preceding ''n'' − 1 words (an (''n'' − 1)<sup>th</sup>-order [[Markov property]]). For example, for ''n'' = 3 and ''i'' = 2 we have <math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})=P(w_2\mid w_1)</math>, since only one word precedes ''w''<sub>2</sub>.
The conditional probability can be calculated from ''n''-gram model frequency counts:
<math display="block">P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>
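This ratio of frequency counts can be sketched directly; the function below is a maximum-likelihood estimator for ''n'' ≥ 2, with no smoothing:

```python
from collections import Counter

def mle_conditional(tokens, n):
    """MLE estimator P(w_i | previous n-1 words) from n-gram counts (n >= 2):
    count(context + word) / count(context)."""
    ngram = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    history = Counter()
    for gram, c in ngram.items():
        history[gram[:-1]] += c   # denominator: counts of the n-1 word contexts
    def p(word, context):
        ctx = tuple(context)
        return ngram[ctx + (word,)] / history[ctx] if history[ctx] else 0.0
    return p

tokens = "the cat sat on the mat".split()
p = mle_conditional(tokens, 2)
print(p("cat", ("the",)))  # count(the cat) / count(the .) = 1/2
```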
=== Out-of-vocabulary words ===
{{Main|Statistical machine translation}}
An issue when using ''n''-gram language models is out-of-vocabulary (OOV) words. They are encountered in [[computational linguistics]] and [[natural language processing]] when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the ''n''-grams in the [[text corpus|corpus]] that contain an out-of-vocabulary word are ignored. The ''n''-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.<ref>{{cite journal | last1 = Wołk | first1 = K. | last2 = Marasek | first2 = K. | last3 = Glinkowski | first3 = W. | year = 2015 | title = Telemedicine as a special case of Machine Translation | journal = Computerized Medical Imaging and Graphics | volume = 46 Pt 2 | pages = 249–56 | doi = 10.1016/j.compmedimag.2015.09.005 | pmid = 26617328 | arxiv = 1510.04600 | bibcode = 2015arXiv151004600W | s2cid = 12361426 }}</ref>
Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. ''<unk>'') into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special <unk> token before ''n''-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of ''n''-grams involving out-of-vocabulary words.<ref>{{Cite conference | author = Wołk K., Marasek K. | year = 2014 | title = Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2014 | arxiv = 1509.09097 | conference = Proceedings of the 11th International Workshop on Spoken Language Translation. Tahoe Lake, USA }}</ref>
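The replacement step can be sketched as follows; the frequency threshold of 2 is an illustrative assumption (systems choose it, or a fixed vocabulary, according to their needs):

```python
from collections import Counter

def replace_rare(tokens, min_count=2, unk="<unk>"):
    """Map words seen fewer than min_count times to the <unk> token,
    so subsequent n-gram counts include an explicit OOV symbol."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else unk for w in tokens]

tokens = "the cat sat on the mat".split()
print(replace_rare(tokens))
# ['the', '<unk>', '<unk>', '<unk>', 'the', '<unk>']
```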
== ''n''-grams for approximate matching ==
{{Main|Approximate string matching}}
''n''-grams were also used for approximate matching. If we convert strings (with only letters in the English alphabet) into character 3-grams, we get a <math>26^3</math>-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. However, we know empirically that if two strings of real text have a similar vector representation (as measured by [[cosine similarity|cosine distance]]) then they are likely to be similar. Other metrics have also been applied to vectors of ''n''-grams with varying, sometimes better, results. For example, [[z-score]]s have been used to compare documents by examining how many standard deviations each ''n''-gram differs from its mean occurrence in a large collection, or [[text corpus]], of documents (which form the "background" vector). In the event of small counts, the [[g-score]] (also known as [[g-test]]) gave better results.
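A sketch of the cosine-similarity comparison over character 3-gram count vectors, using sparse dictionaries instead of an explicit <math>26^3</math>-dimensional array:

```python
from collections import Counter
import math

def char_ngrams(s, n=3):
    """Character n-gram counts of a string as a sparse vector."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(char_ngrams("banana"), char_ngrams("bananas")))  # high, near 1
print(cosine(char_ngrams("banana"), char_ngrams("orange")))   # no shared 3-grams: 0
```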
It is also possible to take a more principled approach to the statistics of ''n''-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in [[Bayesian inference]].
''n''-gram-based searching was also used for [[plagiarism detection]].
== Bias–variance tradeoff ==
{{Main|Bias–variance tradeoff}}
To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that a trigram model (i.e., triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
=== Smoothing techniques ===
There is a problem of balancing weight between ''infrequent grams'' (for example, if a proper name appeared in the training data) and ''frequent grams''. Also, items not seen in the training data will be given a [[probability]] of 0.0 without [[smoothing]]. For unseen but plausible data from a sample, one can introduce [[pseudocount]]s. Pseudocounts are generally motivated on Bayesian grounds.
In practice it was necessary to ''smooth'' the probability distributions by also assigning non-zero probabilities to unseen words or ''n''-grams. The reason is that models derived directly from the ''n''-gram frequency counts have severe problems when confronted with any ''n''-grams that have not explicitly been seen before – [[PPM compression algorithm|the zero-frequency problem]]. Various smoothing methods were used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen ''n''-grams; see [[Rule of succession]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off models]]. Some of these methods are equivalent to assigning a [[prior distribution]] to the probabilities of the ''n''-grams and using [[Bayesian inference]] to compute the resulting [[posterior distribution|posterior]] ''n''-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations.
* [[Linear interpolation]] (e.g., taking the [[weighted mean]] of the unigram, bigram, and trigram)
* [[Good–Turing frequency estimation|Good–Turing]] discounting
* [[Witten–Bell discounting]]
* [[Additive smoothing|Lidstone's smoothing]]
* [[Katz's back-off model]] (trigram)
* [[Kneser–Ney smoothing]]
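The first method in the list above, linear interpolation, can be sketched as follows. The component models are placeholders and the weights are hypothetical; in practice the weights are tuned on held-out data (for example by expectation–maximization) and must sum to 1:

```python
def interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mean of unigram, bigram and trigram estimates.
    lambdas are illustrative; they must be nonnegative and sum to 1."""
    l1, l2, l3 = lambdas
    def p(w, w1, w2):
        # P(w | w1, w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1, w2)
        return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)
    return p

# Placeholder component models returning constant probabilities:
p = interpolated(lambda w: 0.1,
                 lambda w, w2: 0.2,
                 lambda w, w1, w2: 0.4)
print(p("house", "the", "red"))  # 0.1*0.1 + 0.3*0.2 + 0.6*0.4 = 0.31
```

Because the unigram term is never zero for in-vocabulary words, the interpolated estimate avoids the zero-frequency problem even when the trigram was never observed.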
=== Skip-gram language model ===
[[File:1-skip-2-gram.svg|thumb|1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"]]
The skip-gram language model is an attempt to overcome the data sparsity problem that the word ''n''-gram language model faces: the words forming a context are no longer required to be consecutive, but may leave gaps that are ''skipped'' over (thus the name "skip-gram").<ref>{{cite web|url=http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|title=A Closer Look at Skip-gram Modelling|author=David Guthrie|date=2006|display-authors=etal|access-date=27 April 2014|archive-url=https://web.archive.org/web/20170517144625/http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|archive-date=17 May 2017|url-status=dead}}</ref>
Formally, a {{mvar|k}}-skip-{{mvar|n}}-gram is a length-{{mvar|n}} subsequence where the components occur at distance at most {{mvar|k}} from each other.
For example, in the input text:
:''the rain in Spain falls mainly on the plain''
the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences
:''the in'', ''rain Spain'', ''in falls'', ''Spain mainly'', ''falls on'', ''mainly the'', and ''on plain''.
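The definition can be sketched by enumerating index subsequences whose successive positions skip at most {{mvar|k}} intervening words (a brute-force sketch; efficient extractors use a sliding window):

```python
from itertools import combinations

def k_skip_n_grams(words, k, n):
    """All length-n subsequences whose successive elements skip at most
    k intervening words (k=0 gives ordinary n-grams)."""
    result = []
    for idxs in combinations(range(len(words)), n):
        if all(idxs[j + 1] - idxs[j] <= k + 1 for j in range(n - 1)):
            result.append(tuple(words[i] for i in idxs))
    return result

print(k_skip_n_grams("the rain in Spain".split(), k=1, n=2))
# [('the', 'rain'), ('the', 'in'), ('rain', 'in'),
#  ('rain', 'Spain'), ('in', 'Spain')]
```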
In the skip-gram model, semantic relations between words are represented by [[linear combination]]s, capturing a form of [[compositionality]]. For example, in some such models, if {{mvar|v}} is the function that maps a word {{mvar|w}} to its {{mvar|n}}-dimensional vector representation, then
<math display="block">v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})</math>
where ≈ is made precise by stipulating that its right-hand side must be the [[Nearest neighbor search|nearest neighbor]] of the value of the left-hand side.<ref name="mikolov">{{cite arXiv |first1=Tomas |last1=Mikolov |first2=Kai |last2=Chen |first3=Greg |last3=Corrado |first4=Jeffrey |last4=Dean |eprint=1301.3781 |title=Efficient estimation of word representations in vector space |year=2013|class=cs.CL }}</ref><ref name="compositionality">{{cite conference |title=Distributed Representations of Words and Phrases and their Compositionality |last1=Mikolov |first1=Tomas |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado|first4=Greg S. |last5=Dean |first5=Jeff |conference=[[Advances in Neural Information Processing Systems]] |pages=3111–3119 |year=2013 |url=http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf |access-date=22 June 2015 |archive-date=29 October 2020 |archive-url=https://web.archive.org/web/20201029083132/https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf |url-status=live }}</ref>
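The nearest-neighbor stipulation can be sketched with hand-invented 2-dimensional vectors; these toy embeddings are purely illustrative, whereas real models such as word2vec learn vectors with hundreds of dimensions from large corpora:

```python
import math

# Hypothetical 2-d embeddings, invented so the analogy works out.
vec = {"king":   [0.9, 0.8],
       "male":   [0.9, 0.1],
       "female": [0.1, 0.1],
       "queen":  [0.1, 0.8],
       "apple":  [0.5, 0.0]}

# v(king) - v(male) + v(female), computed coordinate-wise:
target = [k - m + f for k, m, f in zip(vec["king"], vec["male"], vec["female"])]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Nearest neighbor of the target, excluding the query words themselves:
nearest = min((w for w in vec if w not in {"king", "male", "female"}),
              key=lambda w: dist(vec[w], target))
print(nearest)  # queen
```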
== Syntactic ''n''-grams ==
Syntactic ''n''-grams are ''n''-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text.<ref name="sng">{{cite book |doi=10.1007/978-3-642-37798-3_1 |chapter=Syntactic Dependency-Based N-grams as Classification Features |chapter-url=http://www.cic.ipn.mx/~sidorov/sn_grams_MICAI2012.pdf |title=Advances in Computational Intelligence |volume=7630 |pages=1–11 |series=Lecture Notes in Computer Science |year=2013 |last1=Sidorov |first1=Grigori |last2=Velasquez |first2=Francisco |last3=Stamatatos |first3=Efstathios |last4=Gelbukh |first4=Alexander |last5=Chanona-Hernández |first5=Liliana |editor-last1=Batyrshin |editor-first1=I. |editor-last2=Mendoza |editor-first2=M. G. |isbn=978-3-642-37797-6 |access-date=18 May 2019 |archive-date=8 August 2017 |archive-url=https://web.archive.org/web/20170808102048/http://www.cic.ipn.mx/~sidorov/sn_grams_MICAI2012.pdf |url-status=live }}</ref><ref>{{cite journal |last1=Sidorov |first1=Grigori |title=Syntactic Dependency-Based ''n''-grams in Rule Based Automatic English as Second Language Grammar Correction |journal=International Journal of Computational Linguistics and Applications |date=2013 |volume=4 |issue=2 |pages=169–188 |citeseerx=10.1.1.644.907 |url=http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.644.907&rep=rep1&type=pdf |access-date=7 October 2021 |archive-date=7 October 2021 |archive-url=https://web.archive.org/web/20211007004718/http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.644.907&rep=rep1&type=pdf |url-status=live }}</ref><ref>{{cite journal | last1 = Figueroa | first1 = Alejandro | last2 = Atkinson | first2 = John | year = 2012 | title = Contextual Language Models For Ranking Answers To Natural Language Definition Questions | url = https://www.researchgate.net/publication/262176888 | journal = Computational Intelligence | volume = 28 | issue = 4 | pages = 528–548 | doi = 10.1111/j.1467-8640.2012.00426.x | s2cid = 27378409 | access-date = 27 May 
2015 | archive-date = 27 October 2021 | archive-url = https://web.archive.org/web/20211027174055/https://www.researchgate.net/publication/262176888_Contextual_Language_Models_For_Ranking_Answers_To_Natural_Language_Definition_Questions | url-status = live }}</ref> For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic ''n''-grams following the tree structure of its [[Dependency grammar|dependency relations]]: news-economic, effect-little, effect-on-markets-financial.<ref name="sng" />
Syntactic ''n''-grams are intended to reflect syntactic structure more faithfully than linear ''n''-grams, and have many of the same applications, especially as features in a [[vector space model]]. For certain tasks, syntactic ''n''-grams give better results than standard ''n''-grams, for example in authorship attribution.<ref>{{cite journal |last1=Sidorov |first1=Grigori |last2=Velasquez |first2=Francisco |last3=Stamatatos |first3=Efstathios |last4=Gelbukh |first4=Alexander |last5=Chanona-Hernández |first5=Liliana |title=Syntactic ''n''-Grams as Machine Learning Features for Natural Language Processing |journal=Expert Systems with Applications |volume=41 |issue=3 |pages=853–860 |doi=10.1016/j.eswa.2013.08.015 |year=2014|s2cid=207738654 }}</ref>
Another type of syntactic ''n''-grams are part-of-speech ''n''-grams, defined as fixed-length contiguous overlapping subsequences that are extracted from part-of-speech sequences of text. Part-of-speech ''n''-grams have several applications, most commonly in information retrieval.<ref>{{cite journal | last1 = Lioma | first1 = C. | last2 = van Rijsbergen | first2 = C. J. K. | year = 2008 | title = Part of Speech ''n''-Grams and Information Retrieval | url = http://main.cl-lab.dk/www/publications/2008/pdf/part-of-speech-n-grams-and-information-retrieval.pdf | journal = French Review of Applied Linguistics | volume = XIII | issue = 1 | pages = 9–22 | via = Cairn | access-date = 12 March 2018 | archive-date = 13 March 2018 | archive-url = https://web.archive.org/web/20180313031441/http://main.cl-lab.dk/www/publications/2008/pdf/part-of-speech-n-grams-and-information-retrieval.pdf | url-status = live }}</ref>
== Other applications ==
''n''-grams find use in several areas of computer science, [[computational linguistics]], and applied mathematics.
They have been used to:
* design [[kernel trick|kernels]] that allow [[machine learning]] algorithms such as [[support vector machine]]s to learn from string data{{citation needed|date=January 2022}}
* find likely candidates for the [[autocorrect|correct spelling]] of a misspelled word<ref>U.S. Patent 6618697, [https://patentimages.storage.googleapis.com/84/a6/5f/6b58b2e2c2da12/US6618697.pdf Method for rule-based correction of spelling and grammar errors]</ref>
* improve compression in [[data compression|compression algorithms]] where a small area of data requires ''n''-grams of greater length
* assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, [[speech recognition]], OCR ([[optical character recognition]]), [[Intelligent Character Recognition]] ([[Intelligent character recognition|ICR]]), [[machine translation]] and similar applications
* improve retrieval in [[information retrieval]] systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
* improve retrieval performance in genetic sequence analysis as in the [[BLAST (biotechnology)|BLAST]] family of programs
* identify the language a text is in or the species a small sequence of DNA was taken from
* predict letters or words at random in order to create text, as in the [[dissociated press]] algorithm.
* [[cryptanalysis]]{{citation needed|date=January 2022}}
== See also ==
* [[Collocation]]
* [[Feature engineering]]
* [[Hidden Markov model]]
* [[Longest common substring]]
* [[MinHash]]
* [[n-tuple]]
* [[String kernel]]
==References==