Actor-critic algorithm

For example, if the critic is estimating the state-value function <math>V^{\pi_\theta}(s)</math>, then it can be learned by any value function approximation method. Let the critic be a function approximator <math>V_\phi(s)</math> with parameters <math>\phi</math>.
 
The simplest example is one-step temporal difference learning, TD(0), which trains the critic to minimize the TD error:<math display="block">\delta_t = R_t + \gamma V_\phi(S_{t+1}) - V_\phi(S_t)</math>The critic parameters are updated by gradient descent on the squared TD error:<math display="block">\phi \leftarrow \phi - \frac{\alpha}{2} \nabla_\phi (\delta_t)^2 = \phi + \alpha \delta_t \nabla_\phi V_\phi(S_t)</math>where <math>\alpha</math> is the learning rate. Note that the gradient is taken only with respect to the <math>\phi</math> in <math>V_\phi(S_t)</math>: the bootstrap target <math>R_t + \gamma V_\phi(S_{t+1})</math> is a moving target and is treated as a constant, so no gradient flows through <math>V_\phi(S_{t+1})</math>. Overlooking this is a common source of error in implementations that use [[automatic differentiation]], which require "stopping the gradient" at that term.
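The following is a minimal illustrative sketch of this update in PyTorch, assuming a linear critic over a small state-feature vector; the feature values, reward, and hyperparameters are placeholders. The call to <code>detach()</code> is what stops the gradient through the bootstrap target.

<syntaxhighlight lang="python">
import torch

# Hypothetical linear critic V_phi(s) = phi . s over 4 state features.
phi = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([phi], lr=0.01)
gamma = 0.99

def V(s):
    return torch.dot(phi, s)

# One sampled transition (illustrative values).
s = torch.tensor([1.0, 0.0, 0.5, 0.2])
r = 1.0
s_next = torch.tensor([0.0, 1.0, 0.3, 0.1])

# TD(0) target with the gradient stopped: only V(s) contributes to the gradient.
target = r + gamma * V(s_next).detach()
delta = target - V(s)

# Gradient descent on (half) the squared TD error.
loss = 0.5 * delta ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
</syntaxhighlight>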
 
Similarly, if the critic is estimating the action-value function <math>Q^{\pi_\theta}(s,a)</math>, then it can be learned by [[Q-learning]] or [[State–action–reward–state–action|SARSA]].
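As a sketch, a SARSA-style update for a tabular action-value critic might look as follows; the state and action counts, step size, and transition values are illustrative assumptions rather than part of any particular algorithm specification.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical tabular critic Q[s, a] for a small discrete problem.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy SARSA update of the critic for one transition."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Example transition (illustrative values); the actor's policy would supply a_next.
sarsa_update(s=0, a=1, r=1.0, s_next=2, a_next=0)
</syntaxhighlight>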