Word n-gram language model: Difference between revisions

{{see also|Bag-of-words model}}
 
A special case, where n=0, is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence, and each word's probability in the sequence is equal to the word's probability in the entire document.
 
<math display="block">P_\text{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3).</math>
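The estimation and the product formula above can be sketched in Python. This is an illustrative sketch, not code from any particular library: a unigram probability is assumed to be a word's relative frequency in the document, and the sequence probability is the product of word probabilities.

```python
from collections import Counter

def unigram_model(tokens):
    """Estimate a unigram model: each word's probability is its
    relative frequency in the document, so probabilities sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sequence_probability(model, sequence):
    """P_uni(t1 t2 t3) = P(t1) * P(t2) * P(t3): word probabilities
    are independent, so the sequence probability is their product."""
    prob = 1.0
    for token in sequence:
        prob *= model.get(token, 0.0)  # unseen words get probability 0
    return prob

# Toy document, whitespace-tokenized (an assumption for illustration).
doc = "a cat sat on a mat".split()
model = unigram_model(doc)
```

Here `model["a"]` is 2/6, since "a" occurs twice among six tokens, and `sequence_probability(model, ["a", "cat"])` is (2/6)·(1/6).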
 
The model consists of units, each treated as one-state [[Finite-state machine|finite automata]].<ref>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze (2009). ''An Introduction to Information Retrieval''. pp. 237–240. Cambridge University Press.</ref> Words with their probabilities in a document can be illustrated as follows.
 
{| class="wikitable"
|-
! Word !! Its probability in doc
|-
| a || 0.1
|}
 
The total mass of word probabilities distributed across the document's vocabulary is 1:
<math display="block">\sum_{\text{word in doc}} P(\text{word}) = 1</math>
 
The probability generated for a specific query is calculated as
 
<math display="block">P(\text{query}) = \prod_{\text{word in query}} P(\text{word})</math>
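The query formula doubles as a ranking rule: each document's unigram model assigns the query a probability, and documents with higher probabilities rank higher. A minimal sketch, with made-up document names and probabilities (only the query words are shown for each model):

```python
def query_probability(model, query_tokens):
    # P(query) = product of P(word) over the words of the query;
    # a word missing from the model contributes probability 0.
    prob = 1.0
    for word in query_tokens:
        prob *= model.get(word, 0.0)
    return prob

# Hypothetical unigram models of two documents (illustrative numbers).
models = {
    "Doc1": {"a": 0.1, "cat": 0.3},
    "Doc2": {"a": 0.3, "cat": 0.2},
}

query = ["a", "cat"]
scores = {name: query_probability(model, query)
          for name, model in models.items()}
# Rank documents by the probability their model assigns to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these numbers, Doc1 scores 0.1·0.3 = 0.03 and Doc2 scores 0.3·0.2 = 0.06, so Doc2 ranks first.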
 
Unigram models of different documents have different hit probabilities of words in them. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. An example of unigram models of two documents:
 
{| class="wikitable"
|-
! Word !! Its probability in Doc1 !! Its probability in Doc2
|-
| a || 0.1 || 0.3