Word n-gram language model: Difference between revisions

{{see also|Bag-of-words model}}
 
A special case, where n=0, is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence, and each word's probability in the sequence is equal to the word's probability in the entire document.
 
<math display="block">P_\text{uni}(t_1t_2t_3)=P(t_1)P(t_2)P(t_3).</math>
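The estimation and the product formula above can be sketched in Python. This is an illustrative sketch, not code from any particular library: a unigram probability is assumed to be a word's relative frequency in the document, and the sequence probability is the product of word probabilities.

```python
from collections import Counter

def unigram_model(tokens):
    """Estimate a unigram model: each word's probability is its
    relative frequency in the document, so probabilities sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sequence_probability(model, sequence):
    """P_uni(t1 t2 t3) = P(t1) * P(t2) * P(t3): word probabilities
    are independent, so the sequence probability is their product."""
    prob = 1.0
    for token in sequence:
        prob *= model.get(token, 0.0)  # unseen words get probability 0
    return prob

# Toy document, whitespace-tokenized (an assumption for illustration).
doc = "a cat sat on a mat".split()
model = unigram_model(doc)
```

Here `model["a"]` is 2/6, since "a" occurs twice among six tokens, and `sequence_probability(model, ["a", "cat"])` is (2/6)·(1/6).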
 
The model consists of units, each treated as one-state [[Finite-state machine|finite automata]].<ref>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze (2009). ''An Introduction to Information Retrieval''. pp. 237–240. Cambridge University Press.</ref> Words with their probabilities in a document can be illustrated as follows.
 
{| class="wikitable"
|-
! Word !! Its probability in doc
|-
| a || 0.1
|}
 
The total mass of word probabilities distributed across the document's vocabulary is 1:
<math display="block">\sum_{\text{word in doc}} P(\text{word}) = 1</math>
 
The probability generated for a specific query is calculated as
 
<math display="block">P(\text{query}) = \prod_{\text{word in query}} P(\text{word})</math>
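The query formula doubles as a ranking rule: each document's unigram model assigns the query a probability, and documents with higher probabilities rank higher. A minimal sketch, with made-up document names and probabilities (only the query words are shown for each model):

```python
def query_probability(model, query_tokens):
    # P(query) = product of P(word) over the words of the query;
    # a word missing from the model contributes probability 0.
    prob = 1.0
    for word in query_tokens:
        prob *= model.get(word, 0.0)
    return prob

# Hypothetical unigram models of two documents (illustrative numbers).
models = {
    "Doc1": {"a": 0.1, "cat": 0.3},
    "Doc2": {"a": 0.3, "cat": 0.2},
}

query = ["a", "cat"]
scores = {name: query_probability(model, query)
          for name, model in models.items()}
# Rank documents by the probability their model assigns to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these numbers, Doc1 scores 0.1·0.3 = 0.03 and Doc2 scores 0.3·0.2 = 0.06, so Doc2 ranks first.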
 
Unigram models of different documents have different hit probabilities of words in them. The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to these probabilities. An example of unigram models of two documents:
 
{| class="wikitable"
|-
! Word !! Its probability in Doc1 !! Its probability in Doc2
|-
| a || 0.1 || 0.3