{{Short description|Purely statistical model of language}} {{DISPLAYTITLE:Word ''n''-gram language model}}
A '''word ''n''-gram language model''' is a purely statistical [[language model]]. It was superseded by [[Recurrent neural network|recurrent]] neural-network–based models in the 2000s, which were in turn superseded by [[large language model]]s in the early 2020s.<ref>{{Cite journal|url=https://dl.acm.org/doi/10.5555/944919.944966|title=A neural probabilistic language model|first1=Yoshua|last1=Bengio|first2=Réjean|last2=Ducharme|first3=Pascal|last3=Vincent|first4=Christian|last4=Janvin|date=March 1, 2003|journal=The Journal of Machine Learning Research|volume=3|pages=1137–1155|via=ACM Digital Library}}</ref> It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if ''n'' − 1 words, an ''n''-gram model.<ref name=jm/> Special tokens are introduced to denote the start and end of a sentence, <math>\langle s\rangle</math> and <math>\langle /s\rangle</math>.

To prevent a zero probability being assigned to unseen words, each word's probability is made slightly lower than its frequency count in a corpus would suggest. Various smoothing methods have been used to calculate it, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
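Add-one smoothing can be sketched as follows for a bigram model; the toy corpus, tokenization, and helper name are illustrative assumptions, not a reference implementation:

```python
from collections import Counter

# Toy corpus with sentence-boundary tokens <s> and </s>.
tokens = "<s> the cat sat </s> <s> the dog sat </s>".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def p_add_one(w_prev, w):
    """P(w | w_prev) with add-one (Laplace) smoothing: every bigram count
    is incremented by 1, so unseen bigrams get a small nonzero probability."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

p_add_one("the", "cat")  # seen bigram: (1 + 1) / (2 + 6) = 0.25
p_add_one("the", "sat")  # unseen bigram: (0 + 1) / (2 + 6) = 0.125
```

Without the added counts, the unseen bigram ''the sat'' would receive probability zero and make any sequence containing it impossible.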
== Unigram model ==
{{see also|Bag-of-words model}}
A special case, where ''n'' = 1, is called a unigram model. The probability of each word in a sequence is then independent of all other words, so the probability of a sequence factors into a product of word probabilities:
<math display="block">P_\text{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3).</math>
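Under this independence assumption, each word's probability can be estimated by its relative frequency in a corpus; a minimal sketch (the corpus is illustrative):

```python
from collections import Counter

corpus = "the rain in Spain falls mainly on the plain".split()
counts = Counter(corpus)
total = len(corpus)

def p_uni(sequence):
    # P(t1 t2 ... tn) = P(t1) * P(t2) * ... * P(tn)
    p = 1.0
    for t in sequence.split():
        p *= counts[t] / total
    return p

p_uni("the rain")  # (2/9) * (1/9)
```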
{{Main|Statistical machine translation}}
Out-of-vocabulary (OOV) words are an issue when using ''n''-gram language models. They are encountered in [[computational linguistics]] and [[natural language processing]] when the input includes words that were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the ''n''-grams in the [[text corpus|corpus]] that contain an out-of-vocabulary word are ignored. The ''n''-gram probabilities are smoothed over all the words in the vocabulary, even those that were not observed.<ref>{{cite journal | last1 = Wołk | first1 = K. | last2 = Marasek | first2 = K. | last3 = Glinkowski | first3 = W. | year = 2015 | title = Telemedicine as a special case of Machine Translation | journal = Computerized Medical Imaging and Graphics | volume = 46 Pt 2 | pages = 249–56 | doi = 10.1016/j.compmedimag.2015.09.005 | pmid = 26617328 | arxiv = 1510.04600 | bibcode = 2015arXiv151004600W | s2cid = 12361426 }}</ref>
Nonetheless, in some cases it is essential to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. ''<unk>'') into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special ''<unk>'' token before ''n''-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of ''n''-grams involving out-of-vocabulary words.<ref>{{Cite conference | author = Wołk K., Marasek K. | year = 2014 | title = Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2014 | arxiv = 1509.09097 | conference = Proceedings of the 11th International Workshop on Spoken Language Translation. Tahoe Lake, USA }}</ref>
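The replacement step can be sketched as follows; the fixed vocabulary and sentence are illustrative:

```python
from collections import Counter

def replace_oov(tokens, vocab, unk="<unk>"):
    # map every token outside the fixed vocabulary to the <unk> token
    return [t if t in vocab else unk for t in tokens]

vocab = {"<s>", "the", "rain", "in", "</s>"}
tokens = "<s> the rain in Spain </s>".split()
mapped = replace_oov(tokens, vocab)
# mapped == ["<s>", "the", "rain", "in", "<unk>", "</s>"]

# n-gram counts now include transitions involving <unk>
bigram_counts = Counter(zip(mapped, mapped[1:]))
```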
== ''n''-grams for approximate matching ==
''n''-gram-based searching has also been used for [[plagiarism detection]].
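A common way to score approximate matches is the overlap between the character ''n''-gram sets of two strings, e.g. as a Jaccard similarity; this sketch and its choice of trigrams are illustrative:

```python
def char_ngrams(s, n=3):
    # set of overlapping character n-grams of s
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    # shared n-grams divided by total distinct n-grams
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

jaccard("plagiarism", "plagiarist")  # 7 shared trigrams out of 9 distinct
```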
== Bias–variance tradeoff ==
{{Main|Bias–variance tradeoff}}
To choose a value for ''n'' in an ''n''-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness: a trigram model (i.e., triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram model is often used with smaller ones.
=== Skip-gram language model ===
[[File:1-skip-2-gram.svg|thumb|1-skip-2-grams for the text "the rain in Spain falls mainly on the plain"]]
The skip-gram language model is an attempt to overcome the data sparsity problem of the word ''n''-gram language model. Words represented in an embedding vector are no longer necessarily consecutive, but may leave gaps that are ''skipped'' over (hence the name "skip-gram").<ref>{{cite web|url=http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|title=A Closer Look at Skip-gram Modelling|author=David Guthrie|date=2006|display-authors=etal|access-date=27 April 2014|archive-url=https://web.archive.org/web/20170517144625/http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf|archive-date=17 May 2017|url-status=dead}}</ref>
Formally, a {{mvar|k}}-skip-{{mvar|n}}-gram is a length-{{mvar|n}} subsequence where the components occur at distance at most {{mvar|k}} from each other.
:''the in'', ''rain Spain'', ''in falls'', ''Spain mainly'', ''falls on'', ''mainly the'', and ''on plain''.
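The 1-skip-2-grams listed above (word pairs with exactly one intervening word skipped) can be generated with a short sketch; the helper name is an assumption:

```python
def skip_bigrams(tokens, k):
    # pairs of words with exactly k intervening words skipped
    return [(tokens[i], tokens[i + k + 1]) for i in range(len(tokens) - k - 1)]

words = "the rain in Spain falls mainly on the plain".split()
skip_bigrams(words, 1)
# [('the', 'in'), ('rain', 'Spain'), ('in', 'falls'), ('Spain', 'mainly'),
#  ('falls', 'on'), ('mainly', 'the'), ('on', 'plain')]
```

Note that the full set of {{mvar|k}}-skip-{{mvar|n}}-grams also includes the ordinary ''n''-grams (zero words skipped).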
In the skip-gram model, semantic relations between words are represented by [[linear combination]]s, capturing a form of [[compositionality]]. For example, in some such models, if {{mvar|v}} is the function that maps a word {{mvar|w}} to its {{mvar|n}}-dimensional vector representation, then
<math display="block">v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})</math>
where ≈ is made precise by stipulating that its right-hand side must be the [[Nearest neighbor search|nearest neighbor]] of the value of the left-hand side.<ref name="mikolov">{{cite arXiv |first1=Tomas |last1=Mikolov |first2=Kai |last2=Chen |first3=Greg |last3=Corrado |first4=Jeffrey |last4=Dean |eprint=1301.3781 |title=Efficient estimation of word representations in vector space |year=2013|class=cs.CL }}</ref><ref name="compositionality">{{cite conference |title=Distributed Representations of Words and Phrases and their Compositionality |last1=Mikolov |first1=Tomas |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado|first4=Greg S. |last5=Dean |first5=Jeff |conference=[[Advances in Neural Information Processing Systems]] |pages=3111–3119 |year=2013 |url=http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf |access-date=22 June 2015 |archive-date=29 October 2020 |archive-url=https://web.archive.org/web/20201029083132/https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf |url-status=live }}</ref>
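The analogy can be illustrated with hand-picked toy 2-dimensional vectors; these numbers are purely illustrative (real embeddings are learned from large corpora and have hundreds of dimensions):

```python
import math

# Hand-picked toy vectors, chosen so the analogy works out exactly.
vectors = {
    "king":   (0.9, 0.8),
    "queen":  (0.9, 0.2),
    "male":   (0.1, 0.7),
    "female": (0.1, 0.1),
}

def nearest(target, vocab, exclude=()):
    # nearest neighbor by Euclidean distance, skipping the query words
    return min((w for w in vocab if w not in exclude),
               key=lambda w: math.dist(vocab[w], target))

v = vectors
target = tuple(k - m + f for k, m, f in zip(v["king"], v["male"], v["female"]))
nearest(target, vectors, exclude=("king", "male", "female"))  # 'queen'
```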
== Syntactic ''n''-grams ==
Syntactic ''n''-grams are intended to reflect syntactic structure more faithfully than linear ''n''-grams, and have many of the same applications, especially as features in a [[vector space model]]. For certain tasks, such as authorship attribution, syntactic ''n''-grams give better results than standard ''n''-grams.<ref>{{cite journal |last1=Sidorov |first1=Grigori |last2=Velasquez |first2=Francisco |last3=Stamatatos |first3=Efstathios |last4=Gelbukh |first4=Alexander |last5=Chanona-Hernández |first5=Liliana |title=Syntactic ''n''-Grams as Machine Learning Features for Natural Language Processing |journal=Expert Systems with Applications |volume=41 |issue=3 |pages=853–860 |doi=10.1016/j.eswa.2013.08.015 |year=2014|s2cid=207738654 }}</ref>
Another type of syntactic ''n''-gram is the part-of-speech ''n''-gram, defined as a fixed-length contiguous overlapping subsequence extracted from the part-of-speech sequence of a text. Part-of-speech ''n''-grams have several applications, most commonly in information retrieval.<ref>{{cite journal | last1 = Lioma | first1 = C. | last2 = van Rijsbergen | first2 = C. J. K. | year = 2008 | title = Part of Speech ''n''-Grams and Information Retrieval | url = http://main.cl-lab.dk/www/publications/2008/pdf/part-of-speech-n-grams-and-information-retrieval.pdf | journal = French Review of Applied Linguistics | volume = XIII | issue = 1 | pages = 9–22 | via = Cairn | access-date = 12 March 2018 | archive-date = 13 March 2018 | archive-url = https://web.archive.org/web/20180313031441/http://main.cl-lab.dk/www/publications/2008/pdf/part-of-speech-n-grams-and-information-retrieval.pdf | url-status = live }}</ref>
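Extracting part-of-speech ''n''-grams is straightforward once a text has been POS-tagged; the tag sequence below is an illustrative assumption:

```python
def pos_ngrams(tags, n=3):
    # fixed-length contiguous overlapping subsequences of the tag sequence
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# e.g. tags for "the cat chased the red ball"
tags = ["DT", "NN", "VBD", "DT", "JJ", "NN"]
pos_ngrams(tags)
# [('DT', 'NN', 'VBD'), ('NN', 'VBD', 'DT'),
#  ('VBD', 'DT', 'JJ'), ('DT', 'JJ', 'NN')]
```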
== Other applications ==
* assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, [[speech recognition]], OCR ([[optical character recognition]]), [[Intelligent Character Recognition]] ([[Intelligent character recognition|ICR]]), [[machine translation]] and similar applications
* improve retrieval in [[information retrieval]] systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
* improve retrieval performance in genetic sequence analysis as in the [[BLAST (biotechnology)|BLAST]] family of programs
* identify the language a text is in or the species a small sequence of DNA was taken from
* predict letters or words at random in order to create text, as in the [[dissociated press]] algorithm.
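Language identification, for instance, can be approximated by comparing a text's character ''n''-gram profile against per-language profiles; the tiny "training" sentences below are illustrative stand-ins for the large corpora a real system would use:

```python
from collections import Counter

def char_trigrams(text):
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny illustrative profiles; real systems build them from large corpora.
profiles = {
    "english": char_trigrams("the quick brown fox jumps over the lazy dog"),
    "spanish": char_trigrams("el rapido zorro marron salta sobre el perro perezoso"),
}

def identify(text):
    sample = char_trigrams(text)
    # score = total count of shared trigrams (multiset intersection)
    return max(profiles, key=lambda lang: sum((sample & profiles[lang]).values()))

identify("the fox and the dog")  # 'english'
```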
== See also ==
* [[Collocation]]
* [[Feature extraction]]
* [[Hidden Markov model]]
* [[Longest common substring problem]]
* [[MinHash]]
* [[n-tuple]]
* [[String kernel]]
==References==