{{Short description|Inference algorithm for hidden Markov models}}
{{Inline|date=April 2018}}
 
The '''forward–backward algorithm''' is an [[Statistical_inference | inference]] [[algorithm]] for [[hidden Markov model]]s which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:T}:= o_1,\dots,o_T</math>, i.e. it computes, for all hidden state variables <math>X_t \in \{X_1, \dots, X_T\}</math>, the distribution <math>P(X_t\ |\ o_{1:T})</math>. This inference task is usually called '''smoothing'''. The algorithm makes use of the principle of [[dynamic programming]] to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.
 
The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.
 
==Overview==
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all <math>t \in \{1, \dots, T\}</math>, the probability of ending up in any particular state given the first <math>t</math> observations in the sequence, i.e. <math>P(X_t\ |\ o_{1:t})</math>. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point <math>t</math>, i.e. <math>P(o_{t+1:T}\ |\ X_t)</math>. These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
 
:<math>P(X_t\ |\ o_{1:T}) = P(X_t\ |\ o_{1:t}, o_{t+1:T}) \propto P(o_{t+1:T}\ |\ X_t) P( X_t | o_{1:t})</math>
 
The last step follows from an application of [[Bayes' rule]] and the [[conditional independence]] of <math>o_{t+1:T}</math> and <math>o_{1:t}</math> given <math>X_t</math>.
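Written out, applying Bayes' rule while conditioning on <math>o_{1:t}</math> throughout gives

:<math>P(X_t\ |\ o_{1:t}, o_{t+1:T}) = \frac{P(o_{t+1:T}\ |\ X_t, o_{1:t})\, P(X_t\ |\ o_{1:t})}{P(o_{t+1:T}\ |\ o_{1:t})} = \frac{P(o_{t+1:T}\ |\ X_t)\, P(X_t\ |\ o_{1:t})}{P(o_{t+1:T}\ |\ o_{1:t})},</math>

where the second equality uses the conditional independence, and the denominator, which does not depend on <math>X_t</math>, acts as the normalization constant.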
 
As outlined above, the algorithm involves three steps:
 
# computing forward probabilities
# computing backward probabilities
# computing smoothed values.
 
The forward and backward steps may also be called "forward message pass" and "backward message pass" - these terms are due to the ''message-passing'' used in general [[belief propagation]] approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.
 
The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see [[Viterbi algorithm]]).
 
==Forward probabilities==
The following description will use matrices of probability values instead of probability distributions. However, it is important to note that the forward-backward algorithm can generally be applied to both continuous and discrete probability models.
 
We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows.
The transition probabilities <math>\mathbf{P}(X_t\mid X_{t-1})</math> of a given random variable <math>X_t</math> representing all possible states in the hidden Markov model will be represented by the matrix <math>\mathbf{T}</math> where the column index <math>j</math> will represent the target state and the row index <math>i</math> represents the start state. A transition from row-vector state <math>\mathbf{\pi_t}</math> to the incremental row-vector state <math>\mathbf{\pi_{t+1}}</math> is written as <math>\mathbf{\pi_{t+1}} = \mathbf{\pi_{t}} \mathbf{T}</math>. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:
 
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
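A minimal sketch of this transition step in Python (using NumPy; the variable names are illustrative only, not part of any standard API):
<syntaxhighlight lang="python">
import numpy as np

# Transition matrix: row i is the current state, column j the next state.
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

pi = np.array([0.5, 0.5])  # state distribution as a row vector
pi_next = pi @ T           # distribution after one transition step
print(pi_next)             # [0.5 0.5]: the uniform distribution is stationary for this T
</syntaxhighlight>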
 
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
 
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
 
provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1 while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2 and event 2 has an 80% chance of occurring. Given an arbitrary row-vector describing the state of the system (<math>\mathbf{\pi}</math>), the probability of observing event j is then:
 
:<math>\mathbf{P}(O = j)=\sum_{i} \pi_i B_{i,j}</math>
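For example, if the state of the system is uniformly distributed, <math>\mathbf{\pi} = (0.5,\ 0.5)</math>, the probability of observing event 1 is <math>0.5 \cdot 0.9 + 0.5 \cdot 0.2 = 0.55</math>.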
 
The probability of a given state leading to the observed event j can be represented in matrix form by multiplying the state row-vector (<math>\mathbf{\pi}</math>) with an observation matrix (<math>\mathbf{O_j} = \mathrm{diag}(B_{*,o_j})</math>) containing only diagonal entries. Continuing the above example, the observation matrix for event 1 would be:
 
:<math>\mathbf{O_1} = \begin{pmatrix}
0.9 & 0.0 \\
0.0 & 0.2
\end{pmatrix}
</math>
 
This allows us to calculate the new unnormalized state probability vector <math>\mathbf{\pi '}</math> through Bayes' rule, weighting by the likelihood that each element of <math>\mathbf{\pi}</math> generated event 1:
 
:<math>
\mathbf{\pi '} = \mathbf{\pi} \mathbf{O_1}
</math>
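Continuing the NumPy sketch above (names again illustrative), the observation update is a diagonal weighting followed by normalization:
<syntaxhighlight lang="python">
import numpy as np

O1 = np.diag([0.9, 0.2])  # observation matrix for event 1
pi = np.array([0.5, 0.5])

pi_unnorm = pi @ O1                    # unnormalized: [0.45, 0.1]
pi_post = pi_unnorm / pi_unnorm.sum()  # normalized: approx. [0.8182, 0.1818]
</syntaxhighlight>
Here the result coincides with <math>\mathbf{\hat{f}_{0:1}}</math> in the example below, because the uniform prior is left unchanged by this transition matrix.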
 
We can now make this general procedure specific to our series of observations. Assuming an initial state vector <math>\mathbf{\pi}_0</math>, (which can be optimized as a parameter through repetitions of the forward-backward procedure), we begin with <math>\mathbf{f_{0:0}} = \mathbf{\pi}_0</math>, then updating the state distribution and weighting by the likelihood of the first observation:
 
:<math>
\mathbf{f_{0:1}} = \mathbf{\pi}_0 \mathbf{T} \mathbf{O_{o_1}}
</math>
 
This process can be carried forward with additional observations using:
 
:<math>
\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_{o_t}}
</math>
 
This value is the forward unnormalized [[probability vector]]. The i'th entry of this vector provides:
 
:<math>
\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )
</math>
 
Typically, we will normalize the probability vector at each step so that its entries sum to 1. A scaling factor is thus introduced at each step such that:
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_{o_t}}
</math>
 
where <math>\mathbf{\hat{f}_{0:t-1}}</math> represents the scaled vector from the previous step and <math>c_t</math> represents the scaling factor that causes the resulting vector's entries to sum to 1. The product of the scaling factors is the total probability for observing the given events irrespective of the final states:
 
:<math>
\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0) = \prod_{s=1}^t c_s
</math>
 
This allows us to interpret the scaled probability vector as:
 
:<math>
\mathbf{\hat{f}_{0:t}}(i) =
\frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} =
\frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )}{\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0)} =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_t, \mathbf{\pi}_0 )
</math>
 
We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.
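The complete scaled forward pass can be sketched as follows (NumPy; <code>forward_pass</code> and its argument names are illustrative, with <code>obs</code> holding column indices into the event matrix <math>\mathbf{B}</math>):
<syntaxhighlight lang="python">
import numpy as np

def forward_pass(pi0, T, B, obs):
    """Scaled forward pass: filtered state distributions and scaling factors."""
    f = pi0.copy()
    fs, cs = [], []
    for o in obs:
        f = (f @ T) * B[:, o]  # transition step, then observation weighting
        c = f.sum()            # scaling factor c_t
        f = f / c              # normalized filtered distribution f_hat_{0:t}
        fs.append(f)
        cs.append(c)
    return np.array(fs), np.array(cs)
</syntaxhighlight>
The sequence likelihood is then <code>np.prod(cs)</code>, or <code>np.log(cs).sum()</code> for a numerically safer log-likelihood.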
 
==Backward probabilities==
A similar procedure can be constructed to find backward probabilities. These intend to provide the probabilities:
 
:<math>
\mathbf{b_{t:T}}(i) = \mathbf{P}(o_{t+1}, o_{t+2}, \dots, o_{T} | X_t=x_i )
</math>
 
That is, we now want to assume that we start in a particular state (<math>X_t=x_i</math>), and we are now interested in the probability of observing all future events from this state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with:
 
:<math>
\mathbf{b_{T:T}} = [1\ 1\ 1\ \dots]^T
</math>
 
Notice that we are now using a [[Row and column vectors|column vector]] while the forward probabilities used row vectors. We can then work backwards using:
 
:<math>
\mathbf{b_{t-1:T}} = \mathbf{T}\mathbf{O_t}\mathbf{b_{t:T}}
</math>
 
While we could normalize this vector as well so that its entries sum to one, this is not usually done. Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector). However, it is more common to scale this vector using the same <math>c_t</math> constants used in the forward probability calculations. <math>\mathbf{b_{T:T}}</math> is not scaled, but subsequent operations use:
 
:<math>
\mathbf{\hat{b}_{t-1:T}} = c_t^{-1} \mathbf{T}\mathbf{O_t}\mathbf{\hat{b}_{t:T}}
</math>
 
where <math>\mathbf{\hat{b}_{t:T}}</math> represents the previous, scaled vector. The result is that the scaled probability vector is related to the backward probabilities by:
 
:<math>
\mathbf{\hat{b}_{t:T}}(i) =
\frac{\mathbf{b_{t:T}}(i)}{\prod_{s=t+1}^T c_s}
</math>
 
This is useful because it allows us to find the total probability of being in each state at a given time, t, by multiplying these values:
 
:<math>
\mathbf{\gamma_t}(i) =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_T, \mathbf{\pi}_0) =
\frac{ \mathbf{P}(o_1, o_2, \dots, o_T, X_t=x_i | \mathbf{\pi}_0 ) }{ \mathbf{P}(o_1, o_2, \dots, o_T | \mathbf{\pi}_0 ) } =
\frac{ \mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i) }{ \prod_{s=1}^T c_s } =
\mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)
</math>
 
To understand this, we note that <math>\mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i)</math> provides the probability for observing the given events in a way that passes through state <math>x_i</math> at time t. This probability includes the forward probabilities covering all events up to time t as well as the backward probabilities which include all future events. This is the numerator we are looking for in our equation, and we divide by the total probability of the observation sequence to normalize this value and extract only the probability that <math>X_t=x_i</math>. These values are sometimes called the "smoothed values" as they combine the forward and backward probabilities to compute a final probability.
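Under the same illustrative conventions, a sketch of the scaled backward pass and of the smoothing step (reusing the <code>forward_pass</code> function from the sketch above) might read:
<syntaxhighlight lang="python">
import numpy as np

def backward_pass(T, B, obs, cs):
    """Scaled backward pass, reusing the forward scaling factors."""
    b = np.ones(B.shape[0])  # b_{T:T} = [1 1 ... 1]^T is left unscaled
    bs = [b]
    for o, c in zip(reversed(obs), reversed(cs)):
        b = (T @ (B[:, o] * b)) / c  # b_{t-1:T} = T O_t b_{t:T}, scaled by c_t
        bs.append(b)
    bs.reverse()  # bs[t] now holds b_hat_{t:T}
    return np.array(bs)

def smooth(pi0, T, B, obs):
    """Posterior marginals gamma_1 ... gamma_T as the rows of the result."""
    fs, cs = forward_pass(pi0, T, B, obs)
    bs = backward_pass(T, B, obs, cs)
    return fs * bs[1:]  # elementwise product; each row already sums to 1
</syntaxhighlight>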
 
The values <math>\mathbf{\gamma_t}(i)</math> thus provide the probability of being in each state at time t. As such, they are useful for determining the most probable state at any time. The term "most probable state" is somewhat ambiguous. While the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence. This is because the probabilities for each point are calculated independently of each other. They do not take into account the transition probabilities between states, and it is thus possible to get states at two moments (t and t+1) that are both most probable at those time points but which have very little probability of occurring together, i.e. <math> \mathbf{P}(X_t=x_i,X_{t+1}=x_j) \neq \mathbf{P}(X_t=x_i) \mathbf{P}(X_{t+1}=x_j) </math>. The most probable sequence of states that produced an observation sequence can be found using the [[Viterbi algorithm]].
 
==Example==
This example takes as its basis the umbrella world in [[#RussellNorvig10|Russell & Norvig 2010 Chapter 15 pp. 567]] in which we would like to infer the weather given observation of another person either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:
 
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
 
We also assume each state generates one of two possible events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:
 
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
 
We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella} which we will represent in our calculations as:
 
:<math>
\mathbf{O_1} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_2} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_3} = \begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}~~\mathbf{O_4} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_5} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}
</math>
 
Note that <math>\mathbf{O_3}</math> differs from the others because of the "no umbrella" observation.
 
In computing the forward probabilities we begin with:
 
:<math>
\mathbf{f_{0:0}}= \begin{pmatrix} 0.5 & 0.5 \end{pmatrix}
</math>
 
which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose of the matrix so that the calculations below are easier to read. Our calculations are then written in the form:
 
:<math>
(\mathbf{\hat{f}_{0:t}})^T = c_t^{-1}\mathbf{O_t}(\mathbf{T})^T(\mathbf{\hat{f}_{0:t-1}})^T
</math>
 
instead of:
 
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>
 
Notice that the [[transformation matrix]] is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
 
:<math>
(\mathbf{\hat{f}_{0:1}})^T =
c_1^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}=
c_1^{-1}\begin{pmatrix}0.4500 \\ 0.1000\end{pmatrix}=
\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:2}})^T =
c_2^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}=
c_2^{-1}\begin{pmatrix}0.5645 \\ 0.0745\end{pmatrix}=
\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:3}})^T =
c_3^{-1}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}=
c_3^{-1}\begin{pmatrix}0.0653 \\ 0.2772\end{pmatrix}=
\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:4}})^T =
c_4^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}=
c_4^{-1}\begin{pmatrix}0.3386 \\ 0.1247\end{pmatrix}=
\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\hat{f}_{0:5}})^T =
c_5^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}=
c_5^{-1}\begin{pmatrix}0.5331 \\ 0.0815\end{pmatrix}=
\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
 
For the backward probabilities, we start with:
 
:<math>
\mathbf{b_{5:5}} = \begin{pmatrix} 1.0 \\ 1.0\end{pmatrix}
</math>
 
We are then able to compute (using the observations in reverse order and normalizing with different constants):
 
:<math>
\mathbf{\hat{b}_{4:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=\alpha\begin{pmatrix}0.6900 \\ 0.4100\end{pmatrix}=\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}
</math>
 
:<math>
\mathbf{\hat{b}_{3:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=\alpha\begin{pmatrix}0.4175 \\ 0.2215\end{pmatrix}=\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}
</math>
 
:<math>
\mathbf{\hat{b}_{2:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=\alpha\begin{pmatrix}0.1289 \\ 0.2138\end{pmatrix}=\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}
</math>
 
:<math>
\mathbf{\hat{b}_{1:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=\alpha\begin{pmatrix}0.2745 \\ 0.1889\end{pmatrix}=\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}
</math>
 
:<math>
\mathbf{\hat{b}_{0:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=\alpha\begin{pmatrix}0.3976 \\ 0.2170\end{pmatrix}=\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>
 
Finally, we will compute the smoothed probability values. These results must also be scaled so that their entries sum to 1 because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.
 
:<math>
(\mathbf{\gamma_0})^T = \alpha\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}\circ \begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}=\alpha\begin{pmatrix}0.3235 \\ 0.1765\end{pmatrix}=\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\gamma_1})^T = \alpha\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}\circ \begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=\alpha\begin{pmatrix}0.4846 \\ 0.0741\end{pmatrix}=\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\gamma_2})^T = \alpha\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}\circ \begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=\alpha\begin{pmatrix}0.3324 \\ 0.0728\end{pmatrix}=\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\gamma_3})^T = \alpha\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}\circ \begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=\alpha\begin{pmatrix}0.1246 \\ 0.2806\end{pmatrix}=\begin{pmatrix}0.3075 \\ 0.6925 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\gamma_4})^T = \alpha\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}\circ \begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=\alpha\begin{pmatrix}0.4584 \\ 0.1003\end{pmatrix}=\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>
 
:<math>
(\mathbf{\gamma_5})^T = \alpha\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}\circ \begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=\alpha\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}=\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
 
Notice that the value of <math>\mathbf{\gamma_0}</math> is equal to <math>\mathbf{\hat{b}_{0:5}}</math> and that <math>\mathbf{\gamma_5}</math> is equal to <math>\mathbf{\hat{f}_{0:5}}</math>. This follows naturally because both <math>\mathbf{\hat{f}_{0:5}}</math> and <math>\mathbf{\hat{b}_{0:5}}</math> begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, <math>\mathbf{\gamma_0}</math> will only be equal to <math>\mathbf{\hat{b}_{0:5}}</math> when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case <math>\mathbf{\hat{b}_{0:5}}</math> needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.
 
The calculations above reveal that the most probable weather state on every day except for the third one was "rain". They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at <math>\mathbf{\gamma_5}</math> quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow as well as the probability of observing an umbrella.
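Readers who want to reproduce the numbers above can do so with a short script that reuses the illustrative <code>forward_pass</code> and <code>backward_pass</code> sketches from the earlier sections:
<syntaxhighlight lang="python">
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])  # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])  # event matrix
pi0 = np.array([0.5, 0.5])              # uniform prior
obs = [0, 0, 1, 0, 0]  # umbrella, umbrella, no umbrella, umbrella, umbrella

fs, cs = forward_pass(pi0, T, B, obs)
bs = backward_pass(T, B, obs, cs)
print(np.round(fs * bs[1:], 4))  # smoothed marginals gamma_1 ... gamma_5
# Expected to match the values above up to rounding, e.g.
# gamma_1 = (0.8673, 0.1327) and gamma_3 = (0.3075, 0.6925).
</syntaxhighlight>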
 
==Performance==
The forward–backward algorithm runs with time complexity <math> O(S^2 T) </math> in space <math> O(S T) </math>, where <math>T</math> is the length of the time sequence and <math>S</math> is the number of symbols in the state alphabet.<ref>[[#RussellNorvig10|Russell & Norvig 2010 pp. 579]]</ref> The algorithm can also run in constant space with time complexity <math> O(S^2 T^2) </math> by recomputing values at each step.<ref>[[#RussellNorvig10|Russell & Norvig 2010 pp. 575]]</ref> For comparison, a [[Brute-force search|brute-force procedure]] would generate all possible <math>S^T</math> state sequences and calculate the joint probability of each state sequence with the observed series of events, which would have [[time complexity]] <math> O(T \cdot S^T) </math>. Brute force is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high.
 
An enhancement to the general forward-backward algorithm, called the [[Island algorithm]], trades smaller memory usage for longer running time, taking <math> O(S^2 T \log T) </math> time and <math> O(S \log T) </math> memory. Furthermore, it is possible to invert the process model to obtain an <math>O(S)</math> space, <math>O(S^2 T)</math> time algorithm, although the inverted process may not exist or be [[ill-conditioned]].<ref>{{cite journal |last1=Binder |first1=John |last2=Murphy |first2=Kevin |last3=Russell |first3=Stuart |title=Space-efficient inference in dynamic probabilistic networks |journal=Int'l, Joint Conf. On Artificial Intelligence |date=1997 |url=https://www.cs.ubc.ca/~murphyk/Papers/ijcai97.pdf |access-date=8 July 2020}}</ref>
 
In addition, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing such as the fixed-lag smoothing (FLS) algorithm.<ref>[[#RussellNorvig10|Russell & Norvig 2010 Figure 15.6 pp. 580]]</ref>
 
==Pseudocode==
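The pseudocode below sketches the backward step as a memoized recursion over (state, sequence position) pairs; the forward step can be written analogously.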
 
 '''algorithm''' forward_backward '''is'''
     '''input:''' guessState
            int ''sequenceIndex''
     '''output:''' ''result''
     
     '''if''' ''sequenceIndex'' is past the end of the sequence '''then'''
         '''return''' 1
     '''if''' (''guessState'', ''sequenceIndex'') has been seen before '''then'''
         '''return''' saved result
     
     ''result'' := 0
     
     '''for each''' neighboring state n:
         ''result'' := result + (transition probability from ''guessState'' to 
                                 n given observation element at ''sequenceIndex'')
                                 × Backward(n, ''sequenceIndex'' + 1)
     
     save result for (''guessState'', ''sequenceIndex'')
     
     '''return''' ''result''
 
==Python example==
Given an HMM (just as in the [[Viterbi algorithm]]) represented in the [[Python programming language]]:
<syntaxhighlight lang="python">
states = ("Healthy", "Fever")
end_state = "E"
 
observations = ("normal", "cold", "dizzy")
 
start_probability = {"Healthy": 0.6, "Fever": 0.4}
 
transition_probability = {
    "Healthy": {"Healthy": 0.69, "Fever": 0.3, "E": 0.01},
    "Fever": {"Healthy": 0.4, "Fever": 0.59, "E": 0.01},
}
 
emission_probability = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
</syntaxhighlight>
 
We can write the implementation of the forward-backward algorithm like this:
<syntaxhighlight lang="python">
def fwd_bkw(observations, states, start_prob, trans_prob, emm_prob, end_st):
    """Forward–backward algorithm."""
    # Forward part of the algorithm
    fwd = []
    for i, observation_i in enumerate(observations):
        f_curr = {}
        for st in states:
            if i == 0:
                # base case for the forward part
                prev_f_sum = start_prob[st]
            else:
                prev_f_sum = sum(f_prev[k] * trans_prob[k][st] for k in states)

            f_curr[st] = emm_prob[st][observation_i] * prev_f_sum

        fwd.append(f_curr)
        f_prev = f_curr

    p_fwd = sum(f_curr[k] * trans_prob[k][end_st] for k in states)

    # Backward part of the algorithm
    bkw = []
    for i, observation_i_plus in enumerate(reversed(observations[1:] + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                # base case for backward part
                b_curr[st] = trans_prob[st][end_st]
            else:
                b_curr[st] = sum(trans_prob[st][l] * emm_prob[l][observation_i_plus] * b_prev[l] for l in states)

        bkw.insert(0, b_curr)
        b_prev = b_curr

    p_bkw = sum(start_prob[l] * emm_prob[l][observations[0]] * b_curr[l] for l in states)

    # Merging the two parts
    posterior = []
    for i in range(len(observations)):
        posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

    assert p_fwd == p_bkw
    return fwd, bkw, posterior
</syntaxhighlight>
 
The function <code>fwd_bkw</code> takes the following arguments:
<code>observations</code> is the sequence of observations, e.g. <code>("normal", "cold", "dizzy")</code>;
<code>states</code> is the set of hidden states;
<code>start_prob</code> is the start probability;
<code>trans_prob</code> are the transition probabilities;
<code>emm_prob</code> are the emission probabilities;
and <code>end_st</code> is the end state.
 
For simplicity of code, we assume that the observation sequence <code>observations</code> is non-empty and that <code>trans_prob[i][j]</code> and <code>emm_prob[i][j]</code> are defined for all states ''i'', ''j''.
 
In the running example, the forward-backward algorithm is used as follows:
<syntaxhighlight lang="python">
def example():
    return fwd_bkw(
        observations,
        states,
        start_probability,
        transition_probability,
        emission_probability,
        end_state,
    )
</syntaxhighlight>
<syntaxhighlight lang="pycon">
>>> for line in example():
...     print(*line)
...
{'Healthy': 0.3, 'Fever': 0.04000000000000001} {'Healthy': 0.0892, 'Fever': 0.03408} {'Healthy': 0.007518, 'Fever': 0.028120319999999997}
{'Healthy': 0.0010418399999999998, 'Fever': 0.00109578} {'Healthy': 0.00249, 'Fever': 0.00394} {'Healthy': 0.01, 'Fever': 0.01}
{'Healthy': 0.8770110375573259, 'Fever': 0.1229889624426741} {'Healthy': 0.623228030950954, 'Fever': 0.3767719690490461} {'Healthy': 0.2109527048413057, 'Fever': 0.7890472951586943}
</syntaxhighlight>
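The three lines of output contain, in order, the forward messages, the backward messages, and the smoothed posterior marginals for the three observations.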
 
==See also==
* [[Baum–Welch algorithm]]
* [[Viterbi algorithm]]
* [[BCJR algorithm]]
 
==References==
{{reflist}}
* [[Lawrence Rabiner|Lawrence R. Rabiner]], A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. ''Proceedings of the [[IEEE]]'', 77 (2), p.&nbsp;257–286, February 1989. [https://dx.doi.org/10.1109/5.18626 10.1109/5.18626]
* {{cite journal |author=Lawrence R. Rabiner, B. H. Juang|title=An introduction to hidden Markov models|journal=IEEE ASSP Magazine |date=January 1986 |pages=4–15}}
* {{cite book | author = Eugene Charniak|title = Statistical Language Learning|publisher = MIT Press| ___location=Cambridge, Massachusetts|year = 1993|isbn=978-0-262-53141-2}}
* <cite id=RussellNorvig10>{{cite book |author=Stuart Russell and Peter Norvig |title=Artificial Intelligence: A Modern Approach |edition=3rd |publisher=Pearson Education/Prentice-Hall |___location=Upper Saddle River, New Jersey |year=2010 |isbn=978-0-13-604259-4}}</cite>
 
==External links==
* [http://www.cs.jhu.edu/~jason/papers/#eisner-2002-tnlp An interactive spreadsheet for teaching the forward–backward algorithm] (spreadsheet and article with step-by-step walk-through)
* [https://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html Tutorial of hidden Markov models including the forward–backward algorithm]
* [http://code.google.com/p/aima-java/ Collection of AI algorithms implemented in Java] (including HMM and the forward–backward algorithm)
 
{{DEFAULTSORT:Forward-backward algorithm}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Dynamic programming]]
[[Category:Error detection and correction]]
[[Category:Machine learning algorithms]]
[[Category:Markov models]]
 