{{Short description|Inference algorithm for hidden Markov models}}
The '''forward–backward algorithm''' is an [[Inference (artificial intelligence)|inference]] [[algorithm]] for [[hidden Markov model]]s which computes the [[Posterior probability|posterior]] [[Marginal distribution|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:T} := o_1,\dots,o_T</math>, i.e. it computes, for all hidden state variables <math>X_t \in \{X_1, \dots, X_T\}</math>, the distribution <math>P(X_t\ |\ o_{1:T})</math>. This inference task is usually called ''smoothing''. The algorithm makes use of the principle of [[dynamic programming]] to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.
The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.
==Overview==
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all <math>t \in \{1, \dots, T\}</math>, the probability of ending up in any particular state given the first <math>t</math> observations in the sequence, i.e. <math>P(X_t\ |\ o_{1:t})</math>. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point <math>t</math>, i.e. <math>P(o_{t+1:T}\ |\ X_t)</math>. These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
:<math>P(X_t\ |\ o_{1:T}) = P(X_t\ |\ o_{1:t}, o_{t+1:T}) \propto P(o_{t+1:T}\ |\ X_t) P( X_t | o_{1:t})</math>
The last step follows from an application of [[Bayes' rule]] and the [[conditional independence]] of <math>o_{t+1:T}</math> and <math>o_{1:t}</math> given <math>X_t</math>.
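Written out, this is a short derivation (with the normalizing constant left implicit):
:<math>
P(X_t\ |\ o_{1:T}) = \frac{P(o_{t+1:T}\ |\ X_t, o_{1:t})\, P(X_t\ |\ o_{1:t})}{P(o_{t+1:T}\ |\ o_{1:t})} = \frac{P(o_{t+1:T}\ |\ X_t)\, P(X_t\ |\ o_{1:t})}{P(o_{t+1:T}\ |\ o_{1:t})}
</math>
where the first equality is Bayes' rule and the second uses the conditional independence; the denominator does not depend on <math>X_t</math>, which gives the stated proportionality.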
As outlined above, the algorithm involves three steps:
# computing forward probabilities
# computing backward probabilities
# computing smoothed values.
The forward and backward steps may also be called "forward message pass" and "backward message pass", terms borrowed from the ''message passing'' used in general [[belief propagation]] approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.
The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see [[Viterbi algorithm]]).
==Forward probabilities==
The following description will use matrices of probability values instead of probability distributions. However, it is important to note that the forward-backward algorithm can generally be applied to both continuous and discrete probability models.
We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows.
The transition probabilities <math>\mathbf{P}(X_t\mid X_{t-1})</math> of a given random variable <math>X_t</math> representing all possible states in the hidden Markov model will be represented by the matrix <math>\mathbf{T}</math> where the column index <math>j</math> will represent the target state and the row index <math>i</math> represents the start state. A transition from row-vector state <math>\mathbf{\pi_t}</math> to the incremental row-vector state <math>\mathbf{\pi_{t+1}}</math> is written as <math>\mathbf{\pi_{t+1}} = \mathbf{\pi_{t}} \mathbf{T}</math>. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
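Multiplying a state row-vector by <math>\mathbf{T}</math> advances it one step; for example, starting with certainty in state 1:
:<math>
\mathbf{\pi_{t+1}} = \begin{pmatrix}1.0 & 0.0\end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix} = \begin{pmatrix}0.7 & 0.3\end{pmatrix}
</math>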
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1 while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2 and event 2 has an 80% chance of occurring. Given a state row-vector <math>\mathbf{\pi}</math>, the probability of observing event <math>j</math> is then:
:<math>\mathbf{P}(O = j)=\sum_{i} \pi_i b_{i,j}</math>
where <math>b_{i,j}</math> is an entry of <math>\mathbf{B}</math>. The probability of a given state after observing event <math>j</math> can be represented in matrix form by multiplying the state row-vector by an observation matrix containing only diagonal entries. Continuing the above example, the observation matrix for event 1 would be:
:<math>\mathbf{O_1} = \begin{pmatrix}
0.9 & 0.0 \\
0.0 & 0.2
\end{pmatrix}
</math>
This allows us to calculate the new unnormalized state probabilities <math>\mathbf{\pi'}</math>, weighting each entry of <math>\mathbf{\pi}</math> by the likelihood that the corresponding state generated event 1:
:<math>
\mathbf{\pi'} = \mathbf{\pi} \mathbf{O_1}
</math>
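For example, if the state distribution is uniform, <math>\mathbf{\pi} = \begin{pmatrix}0.5 & 0.5\end{pmatrix}</math>, and event 1 is observed:
:<math>
\mathbf{\pi'} = \begin{pmatrix}0.5 & 0.5\end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix} = \begin{pmatrix}0.45 & 0.10\end{pmatrix}
</math>
which, once normalized, assigns probability <math>0.45/0.55 \approx 0.818</math> to state 1.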
We can now make this general procedure specific to our series of observations. Assuming an initial state vector <math>\mathbf{\pi}_0</math> (which can be optimized as a parameter through repetitions of the forward–backward procedure), we begin with <math>\mathbf{f_{0:0}} = \mathbf{\pi}_0</math>, then update the state distribution and weight by the likelihood of the first observation:
:<math>
\mathbf{f_{0:1}} = \mathbf{\pi}_0 \mathbf{T} \mathbf{O_1}
</math>
This process can be carried forward with additional observations using:
:<math>
\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>
This value is the forward unnormalized [[probability vector]]. The i'th entry of this vector provides:
:<math>
\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )
</math>
Typically, we will normalize the probability vector at each step so that its entries sum to 1, introducing a scaling factor at each step:
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>
where <math>c_t</math> is the scaling factor chosen so that the entries of the resulting vector sum to 1. The product of the scaling factors gives the total probability of observing the given events irrespective of the final state:
:<math>
\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0) = \prod_{s=1}^t c_s
</math>
This allows us to interpret the scaled probability vector as:
:<math>
\mathbf{\hat{f}_{0:t}}(i) =
\frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} =
\frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi}_0 )}{\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}_0)} =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_t, \mathbf{\pi}_0 )
</math>
We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.
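As an illustration, here is a minimal [[NumPy]] sketch of the scaled forward pass described above; the helper name <code>forward</code> and the encoding of observations as column indices are our own choices, not part of any standard library:
<syntaxhighlight lang="python">
import numpy as np

def forward(pi0, T, B, observations):
    """Scaled forward pass: returns the scaled vectors f_hat and scaling factors c_t."""
    f = pi0.copy()                       # f_{0:0} = pi_0, a row vector
    f_hats, cs = [], []
    for o in observations:
        f = (f @ T) * B[:, o]            # f_{0:t} = f_{0:t-1} T O_t
        c = f.sum()                      # scaling factor c_t
        f = f / c                        # scaled vector f_hat_{0:t}
        f_hats.append(f)
        cs.append(c)
    return np.array(f_hats), np.array(cs)

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event matrix
obs = [0, 0, 1, 0, 0]                    # column indices of the observed events
f_hats, cs = forward(np.array([0.5, 0.5]), T, B, obs)
print(f_hats[-1])       # state distribution given all observations so far
print(np.prod(cs))      # total probability of the observation sequence
</syntaxhighlight>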
==Backward probabilities==
A similar procedure can be constructed to find backward probabilities. These provide the probabilities:
:<math>
\mathbf{b_{t:T}}(i) = \mathbf{P}(o_{t+1}, o_{t+2}, \dots, o_{T} | X_t=x_i )
</math>
That is, we now assume that we start in a particular state and ask for the probability of observing all future events from that state. Since the initial state is taken as given (with probability 1), we begin with:
:<math>
\mathbf{b_{T:T}} = [1\ 1\ \dots\ 1]^T
</math>
Notice that we are now using a [[Row and column vectors|column vector]] while the forward probabilities used row vectors. We can then work backwards using:
:<math>
\mathbf{b_{t-1:T}} = \mathbf{T} \mathbf{O_t} \mathbf{b_{t:T}}
</math>
While we could normalize this vector as well, it is more common to scale it using the same <math>c_t</math> constants from the forward calculation. <math>\mathbf{b_{T:T}}</math> is not scaled, but subsequent steps use:
:<math>
\mathbf{\hat{b}_{t-1:T}} = c_t^{-1} \mathbf{T} \mathbf{O_t} \mathbf{\hat{b}_{t:T}}
</math>
where <math>\mathbf{\hat{b}_{t:T}}</math> represents the previous, scaled vector. The scaled vector is related to the backward probabilities by:
:<math>
\mathbf{\hat{b}_{t:T}}(i) =
\frac{\mathbf{b_{t:T}}(i)}{\prod_{s=t+1}^T c_s}
</math>
This is useful because it allows us to find the probability of being in each state at a given time <math>t</math> by multiplying these values:
:<math>
\mathbf{\gamma_t}(i) =
\mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_T, \mathbf{\pi}_0) =
\frac{ \mathbf{P}(o_1, o_2, \dots, o_T, X_t=x_i | \mathbf{\pi}_0 ) }{ \mathbf{P}(o_1, o_2, \dots, o_T | \mathbf{\pi}_0 ) } =
\frac{ \mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i) }{ \prod_{s=1}^T c_s } =
\mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)
</math>
To understand this, we note that <math>\mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i)</math> provides the probability for observing the given events in a way that passes through state <math>x_i</math> at time <math>t</math>. Dividing by the total probability of the observation sequence (the product of the scaling factors) normalizes this into the conditional probability above.
The values <math>\mathbf{\gamma_t}(i)</math> thus provide the probability of being in each state at time t. As such, they are useful for determining the most probable state at any time.
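Continuing the sketch above (same assumed names and observation encoding), the backward pass and the smoothed values can be written as:
<syntaxhighlight lang="python">
def backward(T, B, observations):
    """Backward pass, normalizing each vector so its entries sum to 1."""
    b = np.ones(T.shape[0])              # b_{T:T} = [1 ... 1]^T
    b_hats = [b]
    for o in reversed(observations):
        b = T @ (B[:, o] * b)            # b_{t-1:T} = T O_t b_{t:T}
        b = b / b.sum()                  # proportional to b_hat_{t-1:T}
        b_hats.insert(0, b)
    return np.array(b_hats)

b_hats = backward(T, B, obs)
# gamma_t is proportional to the elementwise product f_hat_{0:t} * b_hat_{t:T}
for f, b in zip(f_hats, b_hats[1:]):
    g = f * b
    print(g / g.sum())                   # smoothed state distribution at time t
</syntaxhighlight>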
==Example==
This example takes as its basis the umbrella world in [[#RussellNorvig10|Russell & Norvig 2010]], in which we would like to infer the weather given observations of another person either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:
:<math>\mathbf{T} = \begin{pmatrix}
0.7 & 0.3 \\
0.3 & 0.7
\end{pmatrix}
</math>
We also assume each state generates one of two possible events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for observing these events in each state are given by the matrix:
:<math>\mathbf{B} = \begin{pmatrix}
0.9 & 0.1 \\
0.2 & 0.8
\end{pmatrix}
</math>
We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella} which we will represent in our calculations as:
:<math>
\mathbf{O_1} = \mathbf{O_2} = \mathbf{O_4} = \mathbf{O_5} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}
\qquad
\mathbf{O_3} = \begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}
</math>
Note that <math>\mathbf{O_3}</math> differs from the others because of the "no umbrella" observation.
In computing the forward probabilities we begin with:
:<math>
\mathbf{f_{0:0}} = \begin{pmatrix} 0.5 & 0.5 \end{pmatrix}
</math>
which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose of the matrix so that the calculations below are easier to read. Our calculations are then written in the form:
:<math>
(\mathbf{\hat{f}_{0:t}})^T = c_t^{-1}\mathbf{O_t}(\mathbf{T})^T(\mathbf{\hat{f}_{0:t-1}})^T
</math>
instead of:
:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>
Notice that the [[transformation matrix]] is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
:<math>
(\mathbf{\hat{f}_{0:1}})^T =
c_1^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}=
c_1^{-1}\begin{pmatrix}0.4500 \\ 0.1000\end{pmatrix}=
\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}
</math>
:<math>
(\mathbf{\hat{f}_{0:2}})^T =
c_2^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}=
c_2^{-1}\begin{pmatrix}0.5645 \\ 0.0745\end{pmatrix}=
\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}
</math>
:<math>
(\mathbf{\hat{f}_{0:3}})^T =
c_3^{-1}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}=
c_3^{-1}\begin{pmatrix}0.0653 \\ 0.2772\end{pmatrix}=
\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}
</math>
:<math>
(\mathbf{\hat{f}_{0:4}})^T =
c_4^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}=
c_4^{-1}\begin{pmatrix}0.3386 \\ 0.1247\end{pmatrix}=
\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}
</math>
:<math>
(\mathbf{\hat{f}_{0:5}})^T =
c_5^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}=
c_5^{-1}\begin{pmatrix}0.5331 \\ 0.0815\end{pmatrix}=
\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
For the backward probabilities, we start with:
:<math>
\mathbf{b_{5:5}} = \begin{pmatrix} 1.0 \\ 1.0\end{pmatrix}
</math>
We can then compute (using the observation matrices in reverse order and normalizing at each step so the entries sum to 1):
:<math>
\mathbf{\hat{b}_{4:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=
\alpha\begin{pmatrix}0.6900 \\ 0.4100\end{pmatrix}=
\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}
</math>
:<math>
\mathbf{\hat{b}_{3:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=
\alpha\begin{pmatrix}0.4175 \\ 0.2215\end{pmatrix}=
\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}
</math>
:<math>
\mathbf{\hat{b}_{2:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=
\alpha\begin{pmatrix}0.1289 \\ 0.2137\end{pmatrix}=
\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}
</math>
:<math>
\mathbf{\hat{b}_{1:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=
\alpha\begin{pmatrix}0.2745 \\ 0.1889\end{pmatrix}=
\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}
</math>
:<math>
\mathbf{\hat{b}_{0:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=
\alpha\begin{pmatrix}0.3976 \\ 0.2170\end{pmatrix}=
\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>
Finally, we will compute the smoothed probability values. These must also be rescaled so that their entries sum to 1, because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier; <math>\alpha</math> below denotes this final normalization:
:<math>
(\mathbf{\gamma_0})^T = \alpha\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}\circ \begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}=
\alpha\begin{pmatrix}0.3235 \\ 0.1765\end{pmatrix}=
\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>
:<math>
(\mathbf{\gamma_1})^T = \alpha\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}\circ \begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=
\alpha\begin{pmatrix}0.4846 \\ 0.0741\end{pmatrix}=
\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
:<math>
(\mathbf{\gamma_2})^T = \alpha\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}\circ \begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=
\alpha\begin{pmatrix}0.3324 \\ 0.0728\end{pmatrix}=
\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>
:<math>
(\mathbf{\gamma_3})^T = \alpha\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}\circ \begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=
\alpha\begin{pmatrix}0.1246 \\ 0.2806\end{pmatrix}=
\begin{pmatrix}0.3075 \\ 0.6925 \end{pmatrix}
</math>
:<math>
(\mathbf{\gamma_4})^T = \alpha\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}\circ \begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=
\alpha\begin{pmatrix}0.4584 \\ 0.1003\end{pmatrix}=
\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>
:<math>
(\mathbf{\gamma_5})^T = \alpha\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}\circ \begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=
\alpha\begin{pmatrix}0.8673 \\ 0.1327\end{pmatrix}=
\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
Notice that the value of <math>\mathbf{\gamma_0}</math> is equal to <math>\mathbf{\hat{b}_{0:5}}</math> and that <math>\mathbf{\gamma_5}</math> is equal to <math>\mathbf{\hat{f}_{0:5}}</math>. This follows naturally because both <math>\mathbf{\hat{f}_{0:5}}</math> and <math>\mathbf{\hat{b}_{0:5}}</math> begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, <math>\mathbf{\gamma_0}</math> will only be equal to <math>\mathbf{\hat{b}_{0:5}}</math> when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case <math>\mathbf{\hat{b}_{0:5}}</math> needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.
The calculations above reveal that the most probable weather state on every day except for the third one was "rain". They tell us more than this, however, as they also quantify the confidence in each state at each time: for example, the smoothed probability of rain on day 3 is about 31%, even though "no rain" remains the most likely state there.
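These values can be reproduced with the sketch from the previous sections (same assumed helper names):
<syntaxhighlight lang="python">
f_hats, cs = forward(np.array([0.5, 0.5]), T, B, [0, 0, 1, 0, 0])
b_hats = backward(T, B, [0, 0, 1, 0, 0])
g0 = np.array([0.5, 0.5]) * b_hats[0]    # gamma_0 combines the prior with b_hat_{0:5}
print(g0 / g0.sum())                     # ~ [0.6469 0.3531]
for f, b in zip(f_hats, b_hats[1:]):     # gamma_1 ... gamma_5
    g = f * b
    print(g / g.sum())                   # ~ [0.8673 0.1327], [0.8204 0.1796], ...
</syntaxhighlight>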
==Performance==
The forward–backward algorithm runs in <math> O(S^2 T) </math> time and <math> O(S T) </math> space, where <math>T</math> is the length of the observation sequence and <math>S</math> is the number of hidden states. It can also run in constant space with time complexity <math> O(S^2 T^2) </math> by recomputing values when needed. For comparison, a brute-force procedure would generate all <math>S^T</math> possible state sequences and calculate the joint probability of each with the observed series of events, which is intractable for realistic problems.
An enhancement to the general forward-backward algorithm, called the [[Island algorithm]], trades smaller memory usage for longer running time, taking <math> O(S^2 T \log T) </math> time and <math> O(S \log T) </math> memory. Furthermore, it is possible to invert the process model to obtain an <math>O(S)</math> space, <math>O(S^2 T)</math> time algorithm, although the inverted process may not exist or be [[ill-conditioned]].<ref>{{cite journal |last1=Binder |first1=John |last2=Murphy |first2=Kevin |last3=Russell |first3=Stuart |title=Space-efficient inference in dynamic probabilistic networks |journal=Int'l, Joint Conf. On Artificial Intelligence |date=1997 |url=https://www.cs.ubc.ca/~murphyk/Papers/ijcai97.pdf |access-date=8 July 2020}}</ref>
In addition, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing such as the fixed-lag smoothing (FLS) algorithm.<ref>[[#RussellNorvig10|Russell & Norvig 2010 Figure 15.6 pp. 580]]</ref>
==Pseudocode==
 '''algorithm''' forward_backward '''is'''
     '''input:''' ''guessState''
            int ''sequenceIndex''
     '''output:''' ''result''
 
     '''if''' ''sequenceIndex'' is past the end of the sequence '''then'''
         '''return''' 1
     '''if''' (''guessState'', ''sequenceIndex'') has been seen before '''then'''
         '''return''' saved result
 
     ''result'' := 0
 
     '''for each''' neighboring state n:
         ''result'' := ''result'' + (transition probability from ''guessState'' to
                                     n given observation element at ''sequenceIndex'')
                                     × Backward(n, ''sequenceIndex'' + 1)
 
     save ''result'' for (''guessState'', ''sequenceIndex'')
 
     '''return''' ''result''
==Python example==
Given an HMM (just like in the [[Viterbi algorithm]] article) represented in the [[Python (programming language)|Python programming language]]:
<syntaxhighlight lang="python">
states = ("Healthy", "Fever")
end_state = "E"
observations = ("normal", "cold", "dizzy")
start_probability = {"Healthy": 0.6, "Fever": 0.4}
transition_probability = {
    "Healthy": {"Healthy": 0.69, "Fever": 0.3, "E": 0.01},
    "Fever": {"Healthy": 0.4, "Fever": 0.59, "E": 0.01},
}
emission_probability = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
</syntaxhighlight>
We can write the implementation of the forward-backward algorithm like this:
<syntaxhighlight lang="python">
def fwd_bkw(observations, states, start_prob, trans_prob, emm_prob, end_st):
    """Forward–backward algorithm."""
    # Forward part of the algorithm
    fwd = []
    for i, observation_i in enumerate(observations):
        f_curr = {}
        for st in states:
            if i == 0:
                # base case for the forward part
                prev_f_sum = start_prob[st]
            else:
                prev_f_sum = sum(f_prev[k] * trans_prob[k][st] for k in states)
            f_curr[st] = emm_prob[st][observation_i] * prev_f_sum
        fwd.append(f_curr)
        f_prev = f_curr
    p_fwd = sum(f_curr[k] * trans_prob[k][end_st] for k in states)

    # Backward part of the algorithm
    bkw = []
    for i, observation_i_plus in enumerate(reversed(observations[1:] + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                # base case for backward part
                b_curr[st] = trans_prob[st][end_st]
            else:
                b_curr[st] = sum(
                    trans_prob[st][l] * emm_prob[l][observation_i_plus] * b_prev[l] for l in states
                )
        bkw.insert(0, b_curr)
        b_prev = b_curr
    p_bkw = sum(start_prob[l] * emm_prob[l][observations[0]] * b_curr[l] for l in states)

    # Merging the two parts
    posterior = []
    for i in range(len(observations)):
        posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

    # Both passes must give the same total probability (up to floating-point error)
    assert abs(p_fwd - p_bkw) < 1e-12
    return fwd, bkw, posterior
</syntaxhighlight>
The function <code>fwd_bkw</code> takes the following arguments:
<code>observations</code> is the sequence of observations, e.g. <code>("normal", "cold", "dizzy")</code>;
<code>states</code> is the set of hidden states;
<code>start_prob</code> is the start probability;
<code>trans_prob</code> are the transition probabilities;
and <code>emm_prob</code> are the emission probabilities.
For simplicity of code, we assume that the observation sequence is non-empty and that <code>trans_prob[i][j]</code> and <code>emm_prob[i][j]</code> are defined for all states <code>i</code>, <code>j</code>.
In the running example, the forward-backward algorithm is used as follows:
<syntaxhighlight lang="python">
def example():
    return fwd_bkw(
        observations,
        states,
        start_probability,
        transition_probability,
        emission_probability,
        end_state,
    )
</syntaxhighlight>
<syntaxhighlight lang="pycon">
>>> for line in example():
... print(*line)
...
{'Healthy': 0.3, 'Fever': 0.04000000000000001} {'Healthy': 0.0892, 'Fever': 0.03408} {'Healthy': 0.007518, 'Fever': 0.028120319999999997}
{'Healthy': 0.0010418399999999998, 'Fever': 0.00109578} {'Healthy': 0.00249, 'Fever': 0.00394} {'Healthy': 0.01, 'Fever': 0.01}
{'Healthy': 0.8770110375573259, 'Fever': 0.1229889624426741} {'Healthy': 0.623228030950954, 'Fever': 0.3767719690490461} {'Healthy': 0.2109527048413057, 'Fever': 0.7890472951586943}
</syntaxhighlight>
==See also==
* [[Baum–Welch algorithm]]
* [[Viterbi algorithm]]
* [[BCJR algorithm]]
==References==
{{reflist}}
* [[Lawrence Rabiner|Lawrence R. Rabiner]], ''A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition''. ''Proceedings of the [[IEEE]]'', 77 (2), pp. 257–286, February 1989. {{doi|10.1109/5.18626}}
* {{cite journal |author=Lawrence R. Rabiner, B. H. Juang |title=An introduction to hidden Markov models |journal=IEEE ASSP Magazine |date=January 1986 |volume=3 |issue=1 |pages=4–16 |doi=10.1109/MASSP.1986.1165342}}
* {{cite book |author=Eugene Charniak |title=Statistical Language Learning |publisher=MIT Press |___location=Cambridge, Massachusetts |year=1993}}
* <cite id=RussellNorvig10>{{cite book |author=Stuart Russell and Peter Norvig |title=Artificial Intelligence: A Modern Approach |edition=3rd |publisher=Pearson Education/Prentice-Hall |___location=Upper Saddle River, New Jersey |year=2010 |isbn=978-0-13-604259-4}}</cite>
==External links==
* [http://www.cs.jhu.edu/~jason/papers/#eisner-2002-tnlp An interactive spreadsheet for teaching the forward–backward algorithm] (spreadsheet and article with step-by-step walk-through)
* [https://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html Tutorial of hidden Markov models including the forward–backward algorithm]
* [http://code.google.com/p/aima-java/ Collection of AI algorithms implemented in Java] (including HMM and the forward–backward algorithm)
{{DEFAULTSORT:Forward-backward algorithm}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Dynamic programming]]
[[Category:Error detection and correction]]
[[Category:Machine learning algorithms]]
[[Category:Markov models]]