Forward–backward algorithm: Difference between revisions

Content deleted Content added
No edit summary
Bender the Bot (talk | contribs)
m External links: HTTP to HTTPS for Brown University
 
(49 intermediate revisions by 33 users not shown)
Line 1:
{{Short description|Inference algorithm for hidden Markov models}}
The '''forward–backward algorithm''' is an [[Statistical_inference | inference]] [[algorithm]] for [[hidden Markov models]] which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:t}:= o_1,\dots,o_t</math>, i.e. it computes, for all hidden state variables <math>X_k \in \{X_1, \dots, X_t\}</math>, the distribution <math>P(X_k\ |\ o_{1:t})</math>. This inference task is usually called ''smoothing''. The algorithm makes use of the principle of [[dynamic programming]] to compute efficiently the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.
{{Inline|date=April 2018}}
 
The '''forward–backward algorithm''' is an [[Statistical_inference | inference]] [[algorithm]] for [[hidden Markov modelsmodel]]s which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:tT}:= o_1,\dots,o_to_T</math>, i.e. it computes, for all hidden state variables <math>X_kX_t \in \{X_1, \dots, X_tX_T\}</math>, the distribution <math>P(X_kX_t\ |\ o_{1:tT})</math>. This inference task is usually called '''smoothing'''. The algorithm makes use of the principle of [[dynamic programming]] to compute efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.
The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer but to one specific instance of this class.
 
The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer butonly to one specific instance of this class.
 
==Overview ==
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all <math>kt \in \{1, \dots, tT\}</math>, the probability of ending up in any particular state given the first <math>kt</math> observations in the sequence, i.e. <math>P(X_kX_t\ |\ o_{1:kt})</math>. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point <math>kt</math>, i.e. <math>P(o_{kt+1:tT}\ |\ X_kX_t)</math>. These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
 
:<math>P(X_kX_t\ |\ o_{1:tT}) = P(X_kX_t\ |\ o_{1:kt}, o_{kt+1:tT}) \propto P(o_{kt+1:tT}\ |\ X_kX_t) P( \X_t | o_{1:kt}, X_k)</math>
 
The last step follows from an application of the [[Bayes' rule]] and the [[conditional independence]] of <math>o_{kt+1:tT}</math> and <math>o_{1:kt}</math> given <math>X_kX_t</math>.
 
As outlined above, the algorithm involves three steps:
Line 21 ⟶ 24:
 
==Forward probabilities==
The following description will use matrices of probability values ratherinstead thanof probability distributions. However, althoughit inis generalimportant to note that the forward-backward algorithm can generally be applied to both continuous as well asand discrete probability models.
 
We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows.
The transition probabilities <math>\mathbf{P}(X_t\mid X_{t-1})</math> of a given random variable <math>X_t</math> representing all possible states in the hidden Markov model will be represented by the matrix <math>\mathbf{T}</math> where the column index <math>ij</math> will represent the target state and the row index <math>ji</math> represents the start state. A transition from row-vector state <math>\mathbf{\pi_t}</math> to the incremental row-vector state <math>\mathbf{\pi_{t+1}}</math> is written as <math>\mathbf{\pi_{t+1}} = \mathbf{\pi_{t}} \mathbf{T}</math>. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:
 
:<math>\mathbf{T} = \begin{pmatrix}
Line 32 ⟶ 35:
</math>
 
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
 
:<math>\mathbf{B} = \begin{pmatrix}
Line 42 ⟶ 45:
provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1 while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2 and event 2 has an 80% chance of occurring. Given an arbitrary row-vector describing the state of the system (<math>\mathbf{\pi}</math>), the probability of observing event j is then:
 
:<math>\mathbf{P}(O = j)=\sum_{i} \pi_i b_B_{i,j}</math>
 
ThisThe probability of a given state leading to the observed event j can be represented in matrix form by multiplying the state row-vector (<math>\mathbf{\pi}</math>) bywith an observation matrix (<math>\mathbf{O_j} = \mathrm{diag}(b_B_{*,o_j})</math>) containing only diagonal entries. Each entry is the probability of the observed event given each state. Continuing the above example, anthe observation ofmatrix for event 1 would be:
 
:<math>\mathbf{O_1} = \begin{pmatrix}
Line 58 ⟶ 61:
</math>
 
We can now make this general procedure specific to our series of observations. Assuming an initial state vector <math>\mathbf{\pi_0pi}_0</math>, (which can be optimized as a parameter through repetitions of the forward-backbackward procedure), we begin with <math>\mathbf{f_{0:0}} = \mathbf{\pi}_0</math>, then updating the state distribution and weighting by the likelihood of the first observation:
 
:<math>
\mathbf{f_{0:01}} = \mathbf{\pi_0pi}_0 \mathbf{T} \mathbf{O_{o(0)o_1}}
</math>
 
Line 67 ⟶ 70:
 
:<math>
\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_{o(t)o_t}}
</math>
 
This value is the forward unnormalized [[probability vector]]. The i'th entry of this vector provides:
 
:<math>
\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )
</math>
 
Line 79 ⟶ 82:
 
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{f_\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_{o(t)o_t}}
</math>
 
Line 85 ⟶ 88:
 
:<math>
\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0) = \prod_{s=1}^t c_s
</math>
 
Line 93 ⟶ 96:
\mathbf{\hat{f}_{0:t}}(i) =
\frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} =
\frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )}{\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0)} =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_t, \mathbf{\pi}_0 )
</math>
 
Line 112 ⟶ 115:
</math>
 
Notice that we are now using a [[Row and column vectors|column vector]] while the forward probabilities used row vectors. We can then work backwards using:
 
:<math>
Line 135 ⟶ 138:
:<math>
\mathbf{\gamma_t}(i) =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_T, \mathbf{\pi}_0) =
\frac{ \mathbf{P}(o_1, o_2, \dots, o_T, X_t=x_i | \mathbf{\pi}_0 ) }{ \mathbf{P}(o_1, o_2, \dots, o_T | \mathbf{\pi}_0 ) } =
\frac{ \mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i) }{ \prod_{s=1}^T c_s } =
\mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)
Line 143 ⟶ 146:
To understand this, we note that <math>\mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i)</math> provides the probability for observing the given events in a way that passes through state <math>x_i</math> at time t. This probability includes the forward probabilities covering all events up to time t as well as the backward probabilities which include all future events. This is the numerator we are looking for in our equation, and we divide by the total probability of the observation sequence to normalize this value and extract only the probability that <math>X_t=x_i</math>. These values are sometimes called the "smoothed values" as they combine the forward and backward probabilities to compute a final probability.
 
The values <math>\mathbf{\gamma_t}(i)</math> thus provide the probability of being in each state at time t. As such, they are useful for determining the most probable state at any time. It should be noted, however, that theThe term "most probable state" is somewhat ambiguous. While the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence. This is because the probabilities for each point are calculated independently of each other. They do not take into account the transition probabilities between states, and it is thus possible to get states at two moments (t and t+1) that are both most probable at those time points but which have very little probability of occurring together, i.e. <math> \mathbf{P}(X_t=x_i,X_{t+1}=x_j) \neq \mathbf{P}(X_t=x_i) \mathbf{P}(X_{t+1}=x_j) </math>. The most probable sequence of states that produced an observation sequence can be found using the [[Viterbi algorithm]].
 
==Example ==
This example takes as its basis the umbrella world in [[#RussellNorvig10|Russell & Norvig 2010 Chapter 15 pp. 566567]] in which we would like to infer the weather given observation of aanother manperson either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:
 
:<math>\mathbf{T} = \begin{pmatrix}
Line 154 ⟶ 157:
</math>
 
We also assume each state generates 2one of two possible events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:
 
:<math>\mathbf{B} = \begin{pmatrix}
Line 179 ⟶ 182:
 
:<math>
(\mathbf{\hat{f}_{0:t}})^T = cc_t^{-1}\mathbf{O_t}(\mathbf{T})^T(\mathbf{\hat{f}_{0:t-1}})^T
</math>
 
Line 185 ⟶ 188:
 
:<math>
\mathbf{\hat{f}_{0:t}} = cc_t^{-1}\mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>
 
Notice that the [[transformation matrix]] is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
 
:<math>
Line 225 ⟶ 228:
</math>
 
For the backward probabilities, we start with:
 
:<math>
Line 253 ⟶ 256:
</math>
 
Finally, we will compute the smoothed probability values. These result alsoresults must also be scaled so that its entries sum to 1 because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.
 
:<math>
Line 281 ⟶ 284:
Notice that the value of <math>\mathbf{\gamma_0}</math> is equal to <math>\mathbf{\hat{b}_{0:5}}</math> and that <math>\mathbf{\gamma_5}</math> is equal to <math>\mathbf{\hat{f}_{0:5}}</math>. This follows naturally because both <math>\mathbf{\hat{f}_{0:5}}</math> and <math>\mathbf{\hat{b}_{0:5}}</math> begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, <math>\mathbf{\gamma_0}</math> will only be equal to <math>\mathbf{\hat{b}_{0:5}}</math> when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case <math>\mathbf{\hat{b}_{0:5}}</math> needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.
 
The calculations above reveal that the most probable weather state on every day except for the third one was "rain.". They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at <math>\mathbf{\gamma_5}</math> quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow as well as the probability of observing an umbrella.
 
==Performance ==
The brute-forceforward–backward procedurealgorithm forruns thewith solutiontime ofcomplexity this<math> problemO(S^2 isT) the</math> generationin ofspace all<math> possibleO(S T) </math>, where <math>N^T</math> stateis sequencesthe andlength calculatingof the jointtime probabilitysequence ofand each<math>S</math> stateis sequencethe withnumber of symbols in the observedstate seriesalphabet.<ref>[[#RussellNorvig10|Russell of& eventsNorvig 2010 pp. 579]]</ref> This approachThe hasalgorithm can also run in constant space with [[time complexity]] <math> O(T \cdot NS^2 T^2) </math>, whereby recomputing values at each step.<mathref>T[[#RussellNorvig10|Russell & Norvig 2010 pp. 575]]</mathref> isFor thecomparison, lengtha of[[Brute-force sequencessearch|brute-force andprocedure]] would generate all possible <math>NS^T</math> isstate sequences and calculate the numberjoint probability of symbolseach instate sequence with the stateobserved alphabetseries of events, which would have [[time complexity]] <math> O(T \cdot S^T) </math>. ThisBrute force is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward–backward algorithm has time complexity <math> O(N^2 T)\, </math>.
 
An enhancement to the general forward-backward algorithm, called the [[Island algorithm]], trades smaller memory usage for longer running time, taking <math> O(NS^2 T \log T)\, </math> time and <math> O(N^2S \log T)\, </math> memory. OnFurthermore, ait computeris withpossible anto unlimitedinvert numberthe ofprocess processors,model thisto canobtain bean reduced<math>O(S)</math> tospace, <math> O(NS^2 T)\, </math> totaltime timealgorithm, whilealthough stillthe takinginverted onlyprocess <math>may O(N^2not \logexist T)\,or </math>be memory[[ill-conditioned]].<ref>J.{{cite journal |last1=Binder, K.|first1=John |last2=Murphy and S.|first2=Kevin |last3=Russell. |first3=Stuart |title=Space-Efficientefficient Inferenceinference in Dynamicdynamic Probabilistic Networks.probabilistic networks |journal=Int'l, Joint Conf. onOn Artificial Intelligence, |date=1997 |url=https://www.cs.ubc.ca/~murphyk/Papers/ijcai97.pdf |access-date=8 July 2020}}</ref>
 
In addition, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing such as the fixed-lag smoothing (FLS) algorithm .<ref>[[#RussellNorvig10|Russell & Norvig 2010 Figure 15.6 pp. 580]].</ref>
 
==Pseudocode==
<pre>
Backward(guessState, sequenceIndex):
if sequenceIndex is past the end of the sequence, return 1
if (guessState, sequenceIndex) has been seen before, return saved result
result = 0
for each neighboring state n:
result = result + (transition probability from guessState to
n given observation element at sequenceIndex)
* Backward(n, sequenceIndex+1)
save result for (guessState, sequenceIndex)
return result
</pre>
 
'''algorithm''' forward_backward '''is'''
==Python example==
'''input:''' guessState
Given HMM (just like in [[Viterbi algorithm]]) represented in the [[Python programming language]]:
int ''sequenceIndex''
<source lang="python">
'''output:''' ''result''
states = ('Healthy', 'Fever')
end_state = 'E'
'''if''' ''sequenceIndex'' is past the end of the sequence, return 1'''then'''
observations = ('normal', 'cold', 'dizzy')
'''return''' 1
'''if''' (''guessState'', ''sequenceIndex'') has been seen before, return saved result'''then'''
'''return''' saved result
''result'' := 0
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
'''for each''' neighboring state n:
transition_probability = {
''result'' := result + (transition probability from ''guessState'' to
'Healthy' : {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
n given observation element at ''sequenceIndex'')
'Fever' : {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
* × Backward(n, ''sequenceIndex'' + 1)
}
Backward save result for (''guessState'', ''sequenceIndex''):
'''return''' ''result''
 
==Python example==
Given HMM (just like in [[Viterbi algorithm]]) represented in the [[Python programming language]]:
<sourcesyntaxhighlight lang="python">
states = ('"Healthy'", '"Fever'")
end_state = '"E'"
 
observations = ('"normal'", '"cold'", '"dizzy'")
 
start_probability = {'"Healthy'": 0.6, '"Fever'": 0.4}
 
transition_probability = {
'Healthy' "Healthy": {'"Healthy'": 0.69, '"Fever'": 0.3, '"E'": 0.01},
'Fever' "Fever": {'"Healthy'": 0.4, '"Fever'": 0.59, '"E'": 0.01},
}
 
emission_probability = {
'Healthy' "Healthy": {'"normal'": 0.5, '"cold'": 0.4, '"dizzy'": 0.1},
'Fever' "Fever": {'"normal'": 0.1, '"cold'": 0.3, '"dizzy'": 0.6},
}
</syntaxhighlight>
</source>
 
We can write the implementation of the forward-backward algorithm like this:
<sourcesyntaxhighlight lang="python">
def fwd_bkw(observations, states, start_prob, trans_prob, emm_prob, end_st):
# forward part of the"""Forward–backward algorithm."""
# Forward part of the algorithm
fwd = []
f_prev = {}
for i, observation_i in enumerate(observations):
f_curr = {}
Line 338 ⟶ 350:
prev_f_sum = start_prob[st]
else:
prev_f_sum = sum(f_prev[k] * trans_prob[k][st] for k in states)
 
f_curr[st] = emm_prob[st][observation_i] * prev_f_sum
Line 347 ⟶ 359:
p_fwd = sum(f_curr[k] * trans_prob[k][end_st] for k in states)
 
# backwardBackward part of the algorithm
bkw = []
for i, observation_i_plus in enumerate(reversed(observations[1:] + (None,))):
b_prev = {}
for i, observation_i_plus in enumerate(reversed(observations[1:]+(None,))):
b_curr = {}
for st in states:
Line 364 ⟶ 375:
p_bkw = sum(start_prob[l] * emm_prob[l][observations[0]] * b_curr[l] for l in states)
 
# mergingMerging the two parts
posterior = []
for i in range(len(observations)):
Line 371 ⟶ 382:
assert p_fwd == p_bkw
return fwd, bkw, posterior
</syntaxhighlight>
 
</source>
 
The function <code>fwd_bkw</code> takes the following arguments:
Line 384 ⟶ 394:
 
In the running example, the forward-backward algorithm is used as follows:
<sourcesyntaxhighlight lang="python">
def example():
return fwd_bkw(
observations,
states,
start_probability,
transition_probability,
emission_probability,
end_state),
} )
</source>
</syntaxhighlight>
<source lang="python">
<syntaxhighlight lang="pycon">
 
>>> for line in example():
... print(*line)
Line 401 ⟶ 412:
{'Healthy': 0.0010418399999999998, 'Fever': 0.00109578} {'Healthy': 0.00249, 'Fever': 0.00394} {'Healthy': 0.01, 'Fever': 0.01}
{'Healthy': 0.8770110375573259, 'Fever': 0.1229889624426741} {'Healthy': 0.623228030950954, 'Fever': 0.3767719690490461} {'Healthy': 0.2109527048413057, 'Fever': 0.7890472951586943}
</syntaxhighlight>
</source>
 
==See also ==
* [[Baum-WelchBaum–Welch algorithm]]
* [[Viterbi algorithm]]
* [[BCJR algorithm]]
Line 410 ⟶ 421:
== References==
{{reflist}}
* [[Lawrence Rabiner|Lawrence R. Rabiner]], A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. ''Proceedings of the [[IEEE]]'', 77 (2), p.&nbsp;257–286, February 1989. [httphttps://dx.doi.org/10.1109/5.18626 10.1109/5.18626]
* {{cite journal |author=Lawrence R. Rabiner, B. H. Juang|title=An introduction to hidden Markov models|journal=IEEE ASSP Magazine |date=January 1986 |pages=4–15}}
* {{cite book | author = Eugene Charniak|title = Statistical Language Learning|publisher = MIT Press| publication-place___location=Cambridge, Massachusetts|year = 1993|isbn=978-0-262-53141-2}}
* <cite id = RussellNorvig10>{{cite book | author = Stuart Russell and Peter Norvig|title = Artificial Intelligence A Modern Approach 3rd Edition|publisher = Pearson Education/Prentice-Hall|publication-place___location = Upper Saddle River, New Jersey|year = 2010|isbn=978-0-13-604259-4}}</cite>
<cite id = RussellNorvig10>
*{{cite book | author = Stuart Russell and Peter Norvig|title = Artificial Intelligence A Modern Approach 3rd Edition|publisher = Pearson Education/Prentice-Hall|publication-place = Upper Saddle River, New Jersey|year = 2010|isbn=978-0-13-604259-4}}
 
==External links ==
* [http://www.cs.jhu.edu/~jason/papers/#eisner-2002-tnlp An interactive spreadsheet for teaching the forward–backward algorithm] (spreadsheet and article with step-by-step walk-through)
* [httphttps://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html Tutorial of hidden Markov models including the forward–backward algorithm]
* [http://code.google.com/p/aima-java/ Collection of AI algorithms implemented in Java] (including HMM and the forward–backward algorithm)
 
{{DEFAULTSORT:Forward-backward algorithm}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Dynamic programming]]
[[Category:Error detection and correction]]