{{Short description|Inference algorithm for hidden Markov models}}
{{Inline|date=April 2018}}
 
The '''forward–backward algorithm''' is an [[statistical inference|inference]] [[algorithm]] for [[hidden Markov model]]s, which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:T}:= o_1,\dots,o_T</math>, i.e. it computes, for all hidden state variables <math>X_t \in \{X_1, \dots, X_T\}</math>, the distribution <math>P(X_t\ |\ o_{1:T})</math>. This inference task is usually called '''smoothing'''. The algorithm makes use of the principle of [[dynamic programming]] to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.
 
The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.
 
==Overview==
 
==Forward probabilities==
The following description uses matrices of probability values rather than probability distributions, although in general the forward–backward algorithm can be applied to both continuous and discrete probability models.
 
We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows.
</math>
 
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
 
:<math>\mathbf{B} = \begin{pmatrix}
:<math>\mathbf{P}(O = j)=\sum_{i} \pi_i B_{i,j}</math>
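As an illustration, this row-vector/matrix product can be sketched in Python with NumPy, using hypothetical values for <math>\mathbf{\pi}</math> and <math>\mathbf{B}</math> (the numbers below are assumptions for the sketch, not taken from this article):

```python
import numpy as np

# Hypothetical two-state model (illustrative values):
pi = np.array([0.5, 0.5])        # initial state distribution, a row vector
B = np.array([[0.9, 0.1],        # B[i, j] = P(event j | state i)
              [0.2, 0.8]])

# P(O = j) = sum_i pi_i * B[i, j], i.e. the row vector pi times the matrix B
p_obs = pi @ B
print(p_obs)
```

Each entry of `p_obs` is the marginal probability of observing event ''j'' under the prior state distribution.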
 
The probability of a given state leading to the observed event j can be represented in matrix form by multiplying the state row vector (<math>\mathbf{\pi}</math>) with an observation matrix (<math>\mathbf{O_j} = \mathrm{diag}(B_{*,o_j})</math>) containing only diagonal entries. Each diagonal entry is the probability of the observed event given a particular state. Continuing the above example, the observation matrix for event 1 would be:
 
:<math>\mathbf{O_1} = \begin{pmatrix}
</math>
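Constructing such an observation matrix amounts to placing one column of <math>\mathbf{B}</math> on a diagonal. A minimal sketch, again with assumed example values for <math>\mathbf{B}</math>:

```python
import numpy as np

# Hypothetical event matrix (illustrative values, not the article's numbers):
# B[i, j] = P(event j | state i)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def observation_matrix(B, j):
    """Return O_j = diag(B[:, j]): a diagonal matrix that weights each
    state by the likelihood of the observed event j."""
    return np.diag(B[:, j])

O_0 = observation_matrix(B, 0)  # observation matrix for event 0
```

Multiplying a state row vector by `O_0` rescales each state's probability by how likely that state is to emit event 0.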
 
We can now make this general procedure specific to our series of observations. Assuming an initial state vector <math>\mathbf{\pi}_0</math> (which can be optimized as a parameter through repetitions of the forward–backward procedure), we begin with <math>\mathbf{f_{0:0}} = \mathbf{\pi}_0</math>, then update the state distribution and weight by the likelihood of the first observation:
 
:<math>
\mathbf{f_{0:1}} = \mathbf{\pi}_0 \mathbf{T} \mathbf{O_{o_1}}
</math>
 
 
:<math>
\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_{o_t}}
</math>
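The unnormalized forward recursion above is a plain chain of matrix products. A sketch in Python, under assumed example values for <math>\mathbf{T}</math>, <math>\mathbf{B}</math> and <math>\mathbf{\pi}_0</math> (these parameters are illustrative, not the article's worked example):

```python
import numpy as np

# Illustrative model parameters (assumed for this sketch)
T = np.array([[0.7, 0.3],        # state transition matrix
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],        # B[i, j] = P(event j | state i)
              [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])       # initial state distribution

def forward_unnormalized(obs):
    """Compute the unnormalized forward vectors f_{0:t} = f_{0:t-1} T O_{o_t}
    for an observation sequence given as a list of event indices."""
    f = pi0
    history = []
    for o in obs:
        f = f @ T @ np.diag(B[:, o])   # one forward step
        history.append(f)
    return history
```

The entries of each vector shrink geometrically with ''t'', which is why the scaled variant below is normally used in practice.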
 
This value is the forward unnormalized [[probability vector]]. The i'th entry of this vector provides:
 
:<math>
 
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_{o_t}}
</math>
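The scaled recursion keeps each forward vector summing to 1 and records the scaling constants <math>c_t</math> for reuse in the backward pass. A sketch under the same assumed (illustrative) parameters as before:

```python
import numpy as np

# Illustrative parameters (assumed for this sketch)
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])

def forward_scaled(obs):
    """Forward pass with scaling: each f-hat vector is normalized to sum
    to 1, and the scaling constants c_t are returned for later use."""
    f_hat = pi0
    scales, history = [], []
    for o in obs:
        f = f_hat @ T @ np.diag(B[:, o])
        c = f.sum()                # c_t, the normalizing constant
        f_hat = f / c
        scales.append(c)
        history.append(f_hat)
    return history, scales
```

The product of the <math>c_t</math>'s gives the total likelihood of the observation sequence, which would otherwise underflow for long sequences.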
 
</math>
 
Notice that we are now using a [[Row and column vectors|column vector]] while the forward probabilities used row vectors. We can then work backwards using:
 
:<math>
</math>
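The backward pass can be sketched the same way, now multiplying matrices onto a column vector from the right end of the sequence. The parameters below are assumptions for illustration; the transition matrix is chosen symmetric, so (as the text notes for its own example) the transpose equals the original matrix:

```python
import numpy as np

# Illustrative parameters (assumed; T is symmetric, so T == T.T)
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def backward(obs):
    """Backward pass on column vectors: starting from a vector of ones at
    the final step, apply b_t = T O_{o_{t+1}} b_{t+1} going backwards."""
    b = np.ones(T.shape[0])        # column vector of ones at time T
    history = [b]
    for o in reversed(obs[1:]):    # consume observations back to front
        b = T @ np.diag(B[:, o]) @ b
        history.insert(0, b)       # prepend so history is in time order
    return history
```

Each `history[t]` is proportional to the probability of the future observations given the state at time ''t''.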
 
Notice that the [[transformation matrix]] is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
 
:<math>
</math>
 
For the backward probabilities, we start with:
 
:<math>
</math>
 
Finally, we compute the smoothed probability values. These results must also be scaled so that their entries sum to 1, because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.
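The smoothing step itself is just an elementwise product of the forward and backward vectors at each time step, rescaled to sum to 1. A minimal sketch with assumed placeholder values (in practice these vectors come from the two passes):

```python
import numpy as np

# Illustrative forward and backward vectors for one time step (assumed
# values standing in for the outputs of the two passes)
f_hat = np.array([0.8182, 0.1818])   # scaled forward probabilities at time t
b_t = np.array([0.31, 0.59])         # values proportional to the backward
                                     # probabilities at time t

# Smoothed posterior: elementwise product, then an extra normalization
gamma = f_hat * b_t
gamma = gamma / gamma.sum()
```

After the final rescaling, `gamma[i]` approximates <math>P(X_t = i \mid o_{1:T})</math>.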
 
:<math>
 
==Performance==
The forward–backward algorithm runs with [[time complexity]] <math> O(S^2 T) </math> in space <math> O(S T) </math>, where <math>T</math> is the length of the time sequence and <math>S</math> is the number of symbols in the state alphabet.<ref>[[#RussellNorvig10|Russell & Norvig 2010 pp. 579]]</ref> The algorithm can also run in constant space with time complexity <math> O(S^2 T^2) </math> by recomputing values at each step.<ref>[[#RussellNorvig10|Russell & Norvig 2010 pp. 575]]</ref> For comparison, a [[Brute-force search|brute-force]] procedure would generate all possible <math>S^T</math> state sequences and calculate the joint probability of each state sequence with the observed series of events, which would have time complexity <math> O(T \cdot S^T) </math>. Brute force is intractable for realistic problems, as the number of possible hidden node sequences is typically extremely high.
 
An enhancement to the general forward–backward algorithm, called the [[Island algorithm]], trades smaller memory usage for longer running time, taking <math> O(S^2 T \log T) </math> time and <math> O(S \log T) </math> memory. Furthermore, it is possible to invert the process model to obtain an <math>O(S)</math> space, <math>O(S^2 T)</math> time algorithm, although the inverted process may not exist or may be [[ill-conditioned]].<ref>{{cite journal |last1=Binder |first1=John |last2=Murphy |first2=Kevin |last3=Russell |first3=Stuart |title=Space-efficient inference in dynamic probabilistic networks |journal=Int'l, Joint Conf. On Artificial Intelligence |date=1997 |url=https://www.cs.ubc.ca/~murphyk/Papers/ijcai97.pdf |access-date=8 July 2020}}</ref>
 
In addition, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm.<ref>[[#RussellNorvig10|Russell & Norvig 2010 Figure 15.6 pp. 580]]</ref>
 
==Pseudocode==
Given an HMM (just like in the [[Viterbi algorithm]]) represented in the [[Python programming language]]:
<syntaxhighlight lang="python">
states = ("Healthy", "Fever")
end_state = "E"

observations = ("normal", "cold", "dizzy")

start_probability = {"Healthy": 0.6, "Fever": 0.4}

transition_probability = {
    "Healthy": {"Healthy": 0.69, "Fever": 0.3, "E": 0.01},
    "Fever": {"Healthy": 0.4, "Fever": 0.59, "E": 0.01},
}

emission_probability = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
</syntaxhighlight>
 
                prev_f_sum = start_prob[st]
            else:
                prev_f_sum = sum(f_prev[k] * trans_prob[k][st] for k in states)

            f_curr[st] = emm_prob[st][observation_i] * prev_f_sum
    # Backward part of the algorithm
    bkw = []
    for i, observation_i_plus in enumerate(reversed(observations[1:] + (None,))):
        b_curr = {}
        for st in states:
    assert p_fwd == p_bkw
    return fwd, bkw, posterior
</syntaxhighlight>
 
<syntaxhighlight lang="python">
def example():
    return fwd_bkw(
        observations,
        states,
        start_probability,
        transition_probability,
        emission_probability,
        end_state,
    )
</syntaxhighlight>
<syntaxhighlight lang="pycon">
==References==
{{reflist}}
* [[Lawrence Rabiner|Lawrence R. Rabiner]], A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. ''Proceedings of the [[IEEE]]'', 77 (2), pp.&nbsp;257–286, February 1989. [https://dx.doi.org/10.1109/5.18626 10.1109/5.18626]
* {{cite journal |author=Lawrence R. Rabiner, B. H. Juang|title=An introduction to hidden Markov models|journal=IEEE ASSP Magazine |date=January 1986 |pages=4–15}}
* {{cite book | author = Eugene Charniak|title = Statistical Language Learning|publisher = MIT Press| ___location=Cambridge, Massachusetts|year = 1993|isbn=978-0-262-53141-2}}
* <cite id=RussellNorvig10>{{cite book |author=Stuart Russell and Peter Norvig |title=Artificial Intelligence: A Modern Approach |edition=3rd |publisher=Pearson Education/Prentice-Hall |___location=Upper Saddle River, New Jersey |year=2010 |isbn=978-0-13-604259-4}}</cite>
 
==External links==
* [http://www.cs.jhu.edu/~jason/papers/#eisner-2002-tnlp An interactive spreadsheet for teaching the forward–backward algorithm] (spreadsheet and article with step-by-step walk-through)
* [https://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html Tutorial of hidden Markov models including the forward–backward algorithm]
* [http://code.google.com/p/aima-java/ Collection of AI algorithms implemented in Java] (including HMM and the forward–backward algorithm)
 
{{DEFAULTSORT:Forward-backward algorithm}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Dynamic programming]]
[[Category:Error detection and correction]]