For example, if the critic is estimating the state-value function <math>V^{\pi_\theta}(s)</math>, then it can be learned by any value function approximation method. Let the critic be a function approximator <math>V_\phi(s)</math> with parameters <math>\phi</math>.
The simplest example is one-step temporal difference learning (TD(0)), which trains the critic to minimize the one-step TD error:<math display="block">\delta_t = R_t + \gamma V_\phi(S_{t+1}) - V_\phi(S_t)</math>The critic parameters are then updated by a semi-gradient step that reduces this error, <math>\phi \leftarrow \phi + \alpha \delta_t \nabla_\phi V_\phi(S_t)</math>.
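The following is a minimal sketch of this update for a linear critic <math>V_\phi(s) = \phi^\top x(s)</math>; the feature map <code>features</code> and the toy transitions are assumptions made purely for illustration, not part of any particular implementation.
<syntaxhighlight lang="python">
import numpy as np

def features(state, dim=4):
    # Hypothetical fixed random feature map for a discrete state (illustrative only).
    rng = np.random.default_rng(state)
    return rng.standard_normal(dim)

def td0_update(phi, transition, gamma=0.99, alpha=0.01):
    """Semi-gradient one-step TD update of the critic parameters phi."""
    s, r, s_next, done = transition
    v_s = phi @ features(s)
    v_next = 0.0 if done else phi @ features(s_next)
    delta = r + gamma * v_next - v_s          # one-step TD error delta_t
    return phi + alpha * delta * features(s)  # phi <- phi + alpha * delta_t * grad V_phi(S_t)

phi = np.zeros(4)
for s, r, s_next, done in [(0, 1.0, 1, False), (1, 0.0, 2, True)]:  # toy transitions
    phi = td0_update(phi, (s, r, s_next, done))
</syntaxhighlight>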
Similarly, if the critic is estimating the action-value function <math>Q^{\pi_\theta}</math>, then it can be learned by [[Q-learning]] or [[State–action–reward–state–action|SARSA]]. In SARSA, the critic maintains an estimate of the Q-function, parameterized by <math>\phi</math> and denoted <math>Q_\phi(s, a)</math>. The temporal difference error is then calculated as <math>\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i,A_i)</math>, and the critic is updated by<math display="block">\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i)</math>
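Analogously, a sketch of the SARSA-style critic update for a linear <math>Q_\phi(s, a) = \phi^\top x(s, a)</math> follows; the state–action feature map <code>sa_features</code> and the sample tuple are again illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def sa_features(state, action, dim=6):
    # Hypothetical fixed random feature map for a discrete state-action pair (illustrative only).
    rng = np.random.default_rng(1000 * state + action)
    return rng.standard_normal(dim)

def sarsa_critic_update(phi, sample, gamma=0.99, alpha=0.01):
    """One SARSA critic update from an on-policy sample (S_i, A_i, R_i, S_{i+1}, A_{i+1})."""
    s, a, r, s_next, a_next = sample
    q_sa = phi @ sa_features(s, a)
    q_next = phi @ sa_features(s_next, a_next)
    delta = r + gamma * q_next - q_sa               # TD error delta_i
    return phi + alpha * delta * sa_features(s, a)  # phi <- phi + alpha * delta_i * grad Q_phi(S_i, A_i)

phi = np.zeros(6)
phi = sarsa_critic_update(phi, (0, 1, 1.0, 2, 0))  # toy on-policy sample
</syntaxhighlight>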
== Variants ==