A statistical language model assigns a probability P(w1, …, wm) to a sequence of m words by means of a probability distribution.


Unigram Models

A unigram model used in information retrieval can be treated as the combination of several one-state finite automata. It splits the probabilities of different terms in a context, e.g. from

P(t1, t2, t3) = P(t1) P(t2 | t1) P(t3 | t1, t2)

to

Puni(t1, t2, t3) = P(t1) P(t2) P(t3).


In this model, the probability of each word depends only on the word itself, so the units are one-state finite automata. Each automaton has a single state, reached with a single probability; taken over the whole model, these one-state probabilities must sum to 1.
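As an illustrative sketch (the probability table below is a made-up example, not estimated from any particular corpus), a sentence can be scored under a unigram model in Python as the product of independent per-word probabilities, mirroring the formula for Puni above:

# Hypothetical per-word probabilities; in practice these are estimated
# from term frequencies in a collection and must sum to 1 over the vocabulary.
unigram_probs = {
    "a": 0.2,
    "red": 0.1,
    "house": 0.05,
}

def unigram_probability(sentence, probs):
    """Probability of a sentence under a unigram model:
    each word is scored independently of its context."""
    p = 1.0
    for word in sentence.split():
        p *= probs.get(word, 0.0)  # unseen words get probability 0 here
    return p

print(unigram_probability("a red house", unigram_probs))  # 0.2 * 0.1 * 0.05 ≈ 0.001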


N-gram Models

In an n-gram model, the probability P(w1, …, wm) of observing the sentence w1, …, wm is approximated as

P(w1, …, wm) = ∏i=1…m P(wi | w1, …, wi−1) ≈ ∏i=1…m P(wi | wi−(n−1), …, wi−1).
Here, it is assumed that the probability of observing the ith word wi given the context history of the preceding i−1 words can be approximated by the probability of observing it given the shortened context of only the preceding n−1 words (an (n−1)th-order Markov property).
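For example, under a bigram model (n = 2), the probability of the sentence "I saw the red house" is approximated by conditioning each word only on its immediate predecessor:

P(I, saw, the, red, house) ≈ P(I) P(saw | I) P(the | saw) P(red | the) P(house | red).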


The conditional probability can be calculated from n-gram frequency counts:

P(wi | wi−(n−1), …, wi−1) = count(wi−(n−1), …, wi−1, wi) / count(wi−(n−1), …, wi−1).
The terms bigram and trigram language model denote n-gram language models with n = 2 and n = 3, respectively.
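As a rough sketch of this count-based estimation (the toy corpus and the <s>/</s> sentence-boundary markers below are illustrative assumptions, not part of the original text), a bigram model (n = 2) can be built in Python as follows:

from collections import Counter

# Toy corpus; a real model would be trained on a large text collection.
corpus = [
    ["<s>", "i", "saw", "the", "red", "house", "</s>"],
    ["<s>", "i", "saw", "the", "dog", "</s>"],
]

# bigram_counts holds count(prev, word); context_counts holds count(prev),
# i.e. the numerator and denominator of the formula above for n = 2.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_probability(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_probability("the", "red"))  # 0.5: "the" is followed by "red" in 1 of its 2 occurrences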


