Forward–backward algorithm

In computer science, the forward-backward algorithm is an algorithm for computing the probability of a particular observation sequence. It operates in the context of hidden Markov models.

Overview

The algorithm comprises three steps:

computing forward probabilities
computing backward probabilities
computing smoothed values

The forward and backward steps are often called "forward message pass" and "backward message pass". The wording originates from the way the algorithm processes the given observation sequence. First the algorithm moves forward starting with the first observation in the sequence and going to the last, and then returning back to the first. At each single observation in the sequence probabilities to be used for calculations at the next observation are computed. During the backward pass the algorithm simultaneously performs the smoothing step. This step allows the algorithm to take into account any past observations of output for computing more accurate results.

Formal description

The following description takes as its base matrices of probability values rather than probability distributions. We transform the probability distributions related to a given hidden Markov model into matrix notation as follows. The transition probabilities $\mathbf {P} (X_{t}\mid X_{t-1})$ of a given random variable $X_{t}$ representing all possible states in the hidden Markov model will be represented by the matrix $\mathbf {T}$ where the row index, i, will represent the start state and the column index, j, represents the target state. For example, in the example below $\mathbf {T}$ would be defined as:

$\mathbf {T} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}$

Similarly, we will represent the probabilities of a new state, given some observed states as evidence, in an observation matrix $\mathbf {O_{t}}$ where each diagonal entry contains the probability of a new state given the particular observed state at t as evidence. Note that t indicates a particular observation in the observation sequence. All other entries in the matrix will be zero. For example, in the example below $\mathbf {O}$ the first observed evidence (t=1) is "umbrella". Therefore $\mathbf {O_{1}}$ would be defined as:

$\mathbf {O_{1}} ={\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}$

This allows a very elegant description of how to compute the next forward probability. Let the set of forward probabilities be stored in yet another matrix $\mathbf {f_{1:t+1}}$ . Where 1:t+1 indicates that the computed probabilities are depending on all forward probabilities from 1 to t+1 including the current one which we will describe with $\mathbf {f_{1:t}}$ . Hence, $\mathbf {f_{1:t+1}}$ is equal to the multiplication of the transposed transformation matrix with the current forward probabilities and the observation matrix for the next evidence observation. Finally the resulting matrix requires normalization, i.e. the resulting values in the matrix are divided by the sum of all values in the matrix. The normalization factor is described with $\alpha$ . Therefore the resulting forward computation is described by the following general formula:

$\mathbf {f_{1:t+1}} =\alpha \mathbf {O_{t+1}} \mathbf {T^{T}} \mathbf {f_{1:t}}$

We can describe the backward computation $\mathbf {b_{k+1:t}}$ that starts at the end of the sequence in a similar manner. Let the end of the sequence be described by the index k, starting at 0. Therefore running from k down to t=0 and calculating each backward probability can be described by the following formula:

$\mathbf {b_{k+1:t}} =\mathbf {T} \mathbf {O_{k+1}} \mathbf {b_{k+2:t}}$

Note that we use the non-transposed matrix of $\mathbf {T}$ and that the order of the terms has changed. Also note that as a final product we have not a usual matrix multiplication, but a point product. This operation multiplies each value in one matrix with the corresponding value of the other. Finally note that the description in Russel & Norvig 2003 pp. 550 excludes the point product, though the procedure is required earlier.

The third and final step involves the computation of smoothed probabilities $\mathbf {sv_{k}}$ . These are the point product of the backward and forward probabilities for each k. Therefore the formula is defined as:

$\mathbf {sv_{k}} =\alpha \mathbf {b_{k+1:t}} \times \mathbf {f_{1:k}}$

The example in the following section will show numerically the calculation of these matrices.

Example

This example takes as its basis the umbrella world in Russel & Norvig 2003 pp. 540. The example uses a longer observation sequence {umbrella, umbrella, no umbrella, umbrella, umbrella} than the example described in the book. Assume the probabilities for rain given an umbrella observation are 90% if an umbrella is observed and 20% if no umbrella is observed. Furthermore, assume that the probability of rain in t+1 given the weather in t to be a 70% chance that the weather will stay the same, and a 30% chance that the weather will change. The following matrices can be extracted from the umbrella world given the above observations:

$\mathbf {O_{1}} ={\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}~~\mathbf {O_{2}} ={\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}~~\mathbf {O_{3}} ={\begin{pmatrix}0.1&0.0\\0.0&0.8\end{pmatrix}}~~\mathbf {O_{4}} ={\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}~~\mathbf {O_{5}} ={\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}~~\mathbf {T} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}~~\mathbf {T^{T}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}$

Note that $\mathbf {O_{3}}$ differs from the others because of the "no umbrella" observation. Also note that $\mathbf {T}$ and $\mathbf {T^{T}}$ are equal because they are symmetrical. Before we can start to compute forward messages we need to describe two special values, the first forward probability and the k+2 backward probability. The first forward probability at t=0 is defined by the prior of the random variable. The k+2 backward probability is defined by the "true" matrix. Therefore it follows:

$\mathbf {f_{1:0}} ={\begin{pmatrix}0.5\\0.5\end{pmatrix}}$

$\mathbf {b_{k+2:t}} ={\begin{pmatrix}1.0\\1.0\end{pmatrix}}$

Now we will first iterate through all t and compute the forward probabilities. The following matrices appear in the same order as described in the forward formula above. Some calculations may be less accurate due to the limited numbers of decimals used in this example.

$\mathbf {f_{1:1}} =\alpha {\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.5000\\0.5000\end{pmatrix}}=\alpha {\begin{pmatrix}0.4500\\0.1000\end{pmatrix}}={\begin{pmatrix}0.8182\\0.1818\end{pmatrix}}$

$\mathbf {f_{1:2}} =\alpha {\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.8182\\0.1818\end{pmatrix}}=\alpha {\begin{pmatrix}0.5645\\0.0745\end{pmatrix}}={\begin{pmatrix}0.8834\\0.1165\end{pmatrix}}$

$\mathbf {f_{1:3}} =\alpha {\begin{pmatrix}0.1&0.0\\0.0&0.8\end{pmatrix}}{\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.8834\\0.1165\end{pmatrix}}=\alpha {\begin{pmatrix}0.0653\\0.2772\end{pmatrix}}={\begin{pmatrix}0.1906\\0.8093\end{pmatrix}}$

$\mathbf {f_{1:4}} =\alpha {\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.1906\\0.8093\end{pmatrix}}=\alpha {\begin{pmatrix}0.3386\\0.1247\end{pmatrix}}={\begin{pmatrix}0.7308\\0.2691\end{pmatrix}}$

$\mathbf {f_{1:5}} =\alpha {\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.7308\\0.2691\end{pmatrix}}=\alpha {\begin{pmatrix}0.5331\\0.0815\end{pmatrix}}={\begin{pmatrix}0.8673\\0.1326\end{pmatrix}}$

Now that we have defined the forward probabilities, we continue to compute the backward probabilities. Again the matrices appear in the order as in the backward formula above.

=> Someone should recompute the final results because the formula for the backward message was previously wrong and was corrected above and in the following equations, but the result was not recomputed.

$\mathbf {b_{5:5}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}1.0000\\1.0000\end{pmatrix}}={\begin{pmatrix}0.5984\\0.0543\end{pmatrix}}={\begin{pmatrix}0.9168\\0.0831\end{pmatrix}}$

$\mathbf {b_{4:5}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.9168\\0.0831\end{pmatrix}}={\begin{pmatrix}0.7308\\0.2691\end{pmatrix}}={\begin{pmatrix}0.8593\\0.1407\end{pmatrix}}$

$\mathbf {b_{3:5}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.1&0.0\\0.0&0.8\end{pmatrix}}{\begin{pmatrix}0.8593\\0.1407\end{pmatrix}}={\begin{pmatrix}0.0178\\0.0845\end{pmatrix}}={\begin{pmatrix}0.1739\\0.8260\end{pmatrix}}$

$\mathbf {b_{2:5}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.1739\\0.8260\end{pmatrix}}={\begin{pmatrix}0.1405\\0.0189\end{pmatrix}}={\begin{pmatrix}0.8814\\0.1185\end{pmatrix}}$

$\mathbf {b_{1:5}} ={\begin{pmatrix}0.7&0.3\\0.3&0.7\end{pmatrix}}{\begin{pmatrix}0.9&0.0\\0.0&0.2\end{pmatrix}}{\begin{pmatrix}0.8814\\0.1185\end{pmatrix}}={\begin{pmatrix}0.4600\\0.0462\end{pmatrix}}={\begin{pmatrix}0.9087\\0.0912\end{pmatrix}}$

Finally we will compute the smoothed probability values. The ordering of the matrices follows the smoothed values formula above.

$\mathbf {sv_{5}} =\alpha {\begin{pmatrix}1.0000\\1.0000\end{pmatrix}}\times {\begin{pmatrix}0.8673\\0.1326\end{pmatrix}}=\alpha {\begin{pmatrix}0.8673\\0.1326\end{pmatrix}}={\begin{pmatrix}0.8673\\0.1326\end{pmatrix}}$

$\mathbf {sv_{4}} =\alpha {\begin{pmatrix}0.9168\\0.0831\end{pmatrix}}\times {\begin{pmatrix}0.7308\\0.2691\end{pmatrix}}=\alpha {\begin{pmatrix}0.6699\\0.0223\end{pmatrix}}={\begin{pmatrix}0.9677\\0.322\end{pmatrix}}$

$\mathbf {sv_{3}} =\alpha {\begin{pmatrix}0.8593\\0.1407\end{pmatrix}}\times {\begin{pmatrix}0.1906\\0.8093\end{pmatrix}}=\alpha {\begin{pmatrix}0.1637\\0.1138\end{pmatrix}}={\begin{pmatrix}0.5899\\0.4101\end{pmatrix}}$

$\mathbf {sv_{2}} =\alpha {\begin{pmatrix}0.1739\\0.8260\end{pmatrix}}\times {\begin{pmatrix}0.8834\\0.1165\end{pmatrix}}=\alpha {\begin{pmatrix}0.1536\\0.0962\end{pmatrix}}={\begin{pmatrix}0.6148\\0.3852\end{pmatrix}}$

$\mathbf {sv_{1}} =\alpha {\begin{pmatrix}0.8814\\0.1185\end{pmatrix}}\times {\begin{pmatrix}0.8182\\0.1818\end{pmatrix}}=\alpha {\begin{pmatrix}0.7211\\0.0215\end{pmatrix}}={\begin{pmatrix}0.9710\\0.0289\end{pmatrix}}$

Performance

The brute-force procedure for the solution of this problem is the generation of all possible sequences of observed events and hidden states with their probabilities using the two transition matrices. The joint probability of two sequences, given the model, is calculated by multiplying the corresponding probabilities. The brute force procedure has time complexity $O(T\cdot N^{T})$ , where $T$ is the length of sequences and $N$ is the number of symbols in the state alphabet. This is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward-backward algorithm has time complexity $O(N^{2}T)\,$ . Several enhancements are known to the general forward-backward algorithm which allow for computations to take place in constant space. In addition, as t grows, algorithms have been developed to compute $\mathbf {f_{1:t+1}}$ efficiently through online smoothing such as the fixed-lag smoothing (FLS) algorithm Russel & Norvig 2003 pp. 552.

Pseudocode

ForwardBackward(guessState, sequenceIndex):
    if sequenceIndex is past the end of the sequence, return 1
    if (guessState, sequenceIndex) has been seen before, return saved result
    result = 0
    for each neighboring state n:
        result = result + (transition probability from guessState to n given observation element at sequenceIndex)*ForwardBackward(n, sequenceIndex+1)
    save result for (guessState, sequenceIndex)
    return result

References

Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257–286, February 1989.
Rabiner L. R., Juang B. H. (1986). "An introduction to hidden Markov models". IEEE ASSP Magazine: 4–15. {{cite journal}}: Unknown parameter |month= ignored (help)
Charniak Eugene (1993). Statistical Language Learning. MIT Press. ISDN 978-0-262-53141-2. {{cite book}}: Unknown parameter |address= ignored (|___location= suggested) (help)

Russel S., Norvig P. (2003). Artificial Intelligence A Modern Approach 2nd Edition. Pearson Education. ISBN 0-13-080302-2. {{cite book}}: Unknown parameter |address= ignored (|___location= suggested) (help)

External links

An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm (spreadsheet and article with step-by-step walk-through)
Tutorial of Hidden Markov Models including the forward-backward algorithm
Collection of AI algorithms implemented in Java (including HMM and the forward-backward algorithm)

This computer science article is a stub. You can help Wikipedia by expanding it.