<math display="block">P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>
The terms '''bigram''' and '''trigram''' language models denote ''n''-gram models with ''n'' = 2 and ''n'' = 3, respectively.
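The count-based estimate above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy corpus (the token list, the `mle_prob` helper, and the example query are all invented for the sketch), not a production implementation:

```python
from collections import Counter

# Hypothetical toy corpus, tokenized into a list of words.
tokens = "the cat sat on the mat the cat ate".split()

n = 2  # bigram model (n = 2)

# count(w_{i-1}, w_i): frequency of each bigram in the corpus.
ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
# count(w_{i-1}): frequency of each context (every position that has a successor).
context_counts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))

def mle_prob(context, word):
    # P(word | context) = count(context, word) / count(context)
    return ngram_counts[context + (word,)] / context_counts[context]

# "the" occurs 3 times as a context, followed by "cat" twice.
p = mle_prob(("the",), "cat")  # 2/3
```

Note that this raw estimate assigns probability zero to any bigram absent from the corpus, which motivates the smoothing discussed below.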
Typically, the ''n''-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any ''n''-grams that have not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or ''n''-grams. Various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen ''n''-grams, as an [[uninformative prior]]) to more sophisticated models, such as [[Good–Turing discounting]] or [[Katz's back-off model|back-off model]]s.
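The simplest of these, add-one (Laplace) smoothing, can be sketched as follows; the toy corpus and helper name are hypothetical, and the formula shown is the standard add-one estimate, (count + 1) / (context count + |V|), where |V| is the vocabulary size:

```python
from collections import Counter

# Hypothetical toy corpus.
tokens = "the cat sat on the mat".split()
vocab = set(tokens)
V = len(vocab)  # vocabulary size |V|

# Bigram and context counts from the corpus.
bigrams = Counter(zip(tokens, tokens[1:]))
contexts = Counter(tokens[:-1])

def laplace_prob(prev, word):
    # Add-one smoothing: every bigram, seen or unseen, gets count + 1;
    # the denominator adds |V| so probabilities still sum to 1 per context.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + V)

seen = laplace_prob("the", "cat")    # (1 + 1) / (2 + 5) = 2/7
unseen = laplace_prob("cat", "mat")  # (0 + 1) / (1 + 5) = 1/6, no longer zero
```

Unseen bigrams now receive a small but nonzero probability, at the cost of discounting the observed counts; the more sophisticated methods named above distribute the reserved mass less crudely.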