Forward–backward algorithm
The '''forward-backward algorithm''' is an [[inference]] [[algorithm]] for [[hidden Markov models]] which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:t}:= o_1,\dots,o_t</math>, i.e. it computes, for all hidden state variables <math>X_k \in \{X_1, \dots, X_t\}</math>, the distribution <math>P(X_k\ |\ o_{1:t})</math>. This inference task is usually called ''smoothing''. The algorithm makes use of the principle of [[dynamic programming]] to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward-backward algorithm''.
 
The term ''forward-backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward-backward manner. In this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.
 
== Overview ==
 
In the first pass, the forward-backward algorithm computes a set of forward probabilities which provide, for all <math>k \in \{1, \dots, t\}</math>, the probability of ending up in any particular state given the first <math>k</math> observations in the sequence, i.e. <math>P(X_k\ |\ o_{1:k})</math>. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point <math>k</math>, i.e. <math>P(o_{k+1:t}\ |\ X_k)</math>. These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
 
:<math>P(X_k\ |\ o_{1:t}) = P(X_k\ |\ o_{1:k}, o_{k+1:t}) \propto P(o_{k+1:t}\ |\ X_k) P(X_k\ |\ o_{1:k})</math>
 
The last step follows from an application of [[Bayes' rule]] and the [[conditional independence]] of <math>o_{k+1:t}</math> and <math>o_{1:k}</math> given <math>X_k</math>.
 
As outlined above, the algorithm involves three steps:
 
# computing forward probabilities
# computing backward probabilities
# computing smoothed values
The forward-backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see [[Viterbi algorithm]]).
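The three steps above can be sketched in Python with NumPy. This is a minimal illustration, not a reference implementation: the function name, the matrix conventions (row-stochastic transition matrix, one observation-probability column per event), and the example numbers in the test are assumptions chosen for the sketch.

```python
import numpy as np

def forward_backward(T, B, pi, obs):
    """Posterior marginals P(X_k | o_{1:t}) of a hidden Markov model.

    T[i, j] = P(X_k = j | X_{k-1} = i), B[i, o] = P(o | X = i),
    pi = distribution of the initial state X_0, obs = observation indices.
    """
    n, S = len(obs), T.shape[0]
    f = np.zeros((n, S))            # scaled forward probabilities
    b = np.ones((n, S))             # scaled backward probabilities
    c = np.zeros(n)                 # scaling factors
    prev = pi
    for k, o in enumerate(obs):     # step 1: forward pass
        f[k] = B[:, o] * (prev @ T)
        c[k] = f[k].sum()
        f[k] /= c[k]
        prev = f[k]
    for k in range(n - 2, -1, -1):  # step 2: backward pass
        b[k] = T @ (B[:, obs[k + 1]] * b[k + 1]) / c[k + 1]
    g = f * b                       # step 3: combine and renormalize
    return g / g.sum(axis=1, keepdims=True)
```

Each row of the returned array is a smoothed distribution <math>P(X_k\ |\ o_{1:t})</math>; because every row is renormalized, the scaling factors cancel exactly as in the derivation below.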
 
== Forward probabilities ==
The following description takes as its base matrices of probability values rather than probability distributions. We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows.
The transition probabilities <math>\mathbf{P}(X_t\mid X_{t-1})</math> of a given random variable <math>X_t</math> representing all possible states in the hidden Markov model are represented by the matrix <math>\mathbf{T}</math>, where the row index <math>i</math> represents the start state and the column index <math>j</math> represents the target state. The example below represents a system in which the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
The emission probabilities are collected in an observation matrix <math>\mathbf{B}</math>, where the row index represents the state and the column index represents the observed event:
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
For each observation in the sequence, the probabilities of that observation in each state are placed on the diagonal of a matrix <math>\mathbf{O_t}</math>. For an observation of the first event this gives:
:<math>\mathbf{O_1} = \begin{pmatrix}
0.9 & 0.0 \\
0.0 & 0.2
\end{pmatrix}
</math>
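Constructing <math>\mathbf{O_t}</math> amounts to placing one column of <math>\mathbf{B}</math> on a diagonal. A minimal sketch using the matrices above (the variable names are illustrative):

```python
import numpy as np

B = np.array([[0.9, 0.1],   # observation probabilities, one row per state
              [0.2, 0.8]])

# O_1 is a diagonal matrix holding, for each state, the probability of
# the first observed event (column 0 of B)
O_1 = np.diag(B[:, 0])
```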
The forward probabilities describe the joint probability of the observations up to time <math>t</math> and the state at time <math>t</math>, given the initial state distribution <math>\mathbf{\pi}</math>:
 
:<math>
\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi} )
</math>
 
To avoid numerical underflow, the forward vector is normalized at each step by a scaling factor <math>c_t</math> chosen so that its entries sum to one. The scaled values then satisfy:
 
:<math>
\mathbf{\hat{f}_{0:t}}(i) =
\frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} =
\frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi} )}{\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi})} =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_t, \mathbf{\pi} )
</math>
We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.
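This relationship can be checked numerically: running the scaled and unscaled recursions side by side, the product of the scaling factors equals the total probability of the observed sequence. A sketch under the matrix conventions of this article (the observation sequence is assumed for illustration):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 0, 0]

f_unscaled = pi        # joint probabilities P(o_1..o_t, X_t | pi)
f_hat = pi             # scaled (normalized) forward vector
scale = []             # scaling factors c_t
for o in obs:
    f_unscaled = B[:, o] * (f_unscaled @ T)
    f_hat = B[:, o] * (f_hat @ T)
    c = f_hat.sum()
    scale.append(c)
    f_hat = f_hat / c

likelihood = f_unscaled.sum()   # P(o_1, ..., o_t | pi)
```

The final `f_hat` is the filtered distribution over states, while `np.prod(scale)` recovers `likelihood`.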
 
== Backward probabilities ==
A similar procedure can be constructed to find backward probabilities. These provide the probabilities:
 
:<math>
\mathbf{b_{t:T}}(i) = \mathbf{P}(o_{t+1}, o_{t+2}, \dots, o_{T} | X_t=x_i )
</math>
 
As in the forward pass, these values can be scaled using the same factors <math>c_s</math>:
 
:<math>
\mathbf{\hat{b}_{t:T}}(i) =
\frac{\mathbf{b_{t:T}}(i)}{\prod_{s=t+1}^T c_s}
</math>
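The backward recursion can be sketched analogously to the forward pass. This is an illustration under the same assumed matrix conventions; the scaling factors <math>c_s</math> are the ones produced by the forward pass:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 0, 0]
n = len(obs)

# forward pass, run here only to obtain the scaling factors c_t
f, c = pi, []
for o in obs:
    f = B[:, o] * (f @ T)
    c.append(f.sum())
    f = f / c[-1]

# backward pass: b_hat[t](i) is the scaled P(o_{t+1:T} | X_t = x_i)
b_hat = [None] * n
b_hat[n - 1] = np.ones(len(pi))          # b_{T:T} = 1
for t in range(n - 2, -1, -1):
    b_hat[t] = T @ (B[:, obs[t + 1]] * b_hat[t + 1]) / c[t + 1]
```

Multiplying `b_hat[t]` back by the product of the scaling factors from <math>t+1</math> to <math>T</math> recovers the unscaled backward probabilities.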
The smoothed values, the posterior marginals of the states given the complete observation sequence, are then obtained by combining the scaled forward and backward probabilities:
 
:<math>
\mathbf{\gamma_t}(i) =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_T, \mathbf{\pi}) =
\frac{ \mathbf{P}(o_1, o_2, \dots, o_T, X_t=x_i | \mathbf{\pi} ) }{ \mathbf{P}(o_1, o_2, \dots, o_T | \mathbf{\pi} ) } =
\mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)
</math>

== Example ==

The following example applies the algorithm to a concrete sequence of five observations, using the transition matrix from above:
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
and the observation matrix:
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
 
We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella} which we will represent in our calculations as:
 
:<math>
\mathbf{O_1} = \mathbf{O_2} = \mathbf{O_4} = \mathbf{O_5} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix} \qquad
\mathbf{O_3} = \begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}
</math>
 
Note that <math>\mathbf{O_3}</math> differs from the others because of the "no umbrella" observation.
 
In computing the forward probabilities we begin with:
:<math>
\mathbf{f_{0:0}} = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}
</math>

the prior state distribution, and then compute the scaled forward vectors:
 
:<math>
(\mathbf{\hat{f}_{0:1}})^T =
c_1^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}=
c_1^{-1}\begin{pmatrix}0.4500 \\ 0.1000\end{pmatrix}=
\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:2}})^T =
c_2^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}=
c_2^{-1}\begin{pmatrix}0.5645 \\ 0.0745\end{pmatrix}=
\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:3}})^T =
c_3^{-1}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}=
c_3^{-1}\begin{pmatrix}0.0653 \\ 0.2772\end{pmatrix}=
\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:4}})^T =
c_4^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}=
c_4^{-1}\begin{pmatrix}0.3386 \\ 0.1247\end{pmatrix}=
\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:5}})^T =
c_5^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}=
c_5^{-1}\begin{pmatrix}0.5331 \\ 0.0815\end{pmatrix}=
\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
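The five scaled forward vectors above can be reproduced numerically. A sketch using NumPy, with the observation sequence of the example:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Os = [np.diag(B[:, o]) for o in (0, 0, 1, 0, 0)]   # O_1 ... O_5

f = np.array([0.5, 0.5])       # prior f_{0:0}
history = []
for O in Os:
    f = O @ T @ f              # O_t T f_{0:t-1}; T is symmetric here,
                               # so T and its transpose coincide
    f = f / f.sum()            # divide by the scaling factor c_t
    history.append(f)
```

`history` then holds the five vectors displayed above, up to rounding.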
== Performance ==

Several enhancements to the general forward-backward algorithm are known which allow the computation to take place in constant space. In addition, as the observation sequence grows, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm [[#RusselNorvig03|Russell & Norvig 2003 pp. 552]].
 
== Pseudocode ==
<pre>
ForwardBackward(guessState, sequenceIndex):
    if sequenceIndex is past the end of the sequence, return 1
    if (guessState, sequenceIndex) has been seen before, return saved result
    result = 0
    for each neighboring state n:
        result = result + (transition probability from guessState to n
                           given observation element at sequenceIndex) ×
                           ForwardBackward(n, sequenceIndex + 1)
    save result for (guessState, sequenceIndex)
    return result
</pre>
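The memoized recursion above can be translated into a runnable sketch. The model matrices and observation sequence below are the ones from the example in this article; `lru_cache` plays the role of the "save result" step, and the function name mirrors the pseudocode:

```python
from functools import lru_cache
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation probabilities
obs = [0, 0, 1, 0, 0]                    # the example observation sequence

@lru_cache(maxsize=None)
def forward_backward(guess_state, sequence_index):
    """Probability of the remaining observations obs[sequence_index:],
    given that the state just before them is guess_state."""
    if sequence_index == len(obs):       # past the end of the sequence
        return 1.0
    # sum over neighboring states n: transition x emission x recursion
    return sum(T[guess_state, n] * B[n, obs[sequence_index]]
               * forward_backward(n, sequence_index + 1)
               for n in range(T.shape[0]))
```

Summing over a uniform prior, `sum(0.5 * forward_backward(i, 0) for i in (0, 1))` gives the total probability of the observation sequence.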
 
== References ==
 
* [[Lawrence Rabiner|Lawrence R. Rabiner]], [http://www.caip.rutgers.edu/~lrr/Reprints/tutorial%20on%20hmm%20and%20applications.pdf A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition]. ''Proceedings of the [[IEEE]]'', 77 (2), p.&nbsp;257&ndash;286, February 1989.
*{{cite journal |author=Lawrence R. Rabiner, B. H. Juang|title=An introduction to hidden Markov models|journal=IEEE ASSP Magazine |month=January |pages=4–15 |year=1986}}