The simplest example is one-step temporal difference (TD) learning, also known as TD(0), which trains the critic to minimize the one-step TD error:<math display="block">\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i)</math>The critic parameters are updated by gradient descent on the squared TD error:<math display="block">\phi \leftarrow \phi - \alpha \nabla_\phi (\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)</math>where <math>\alpha</math> is the learning rate. Note that the gradient is taken only with respect to the <math>\phi</math> in <math>V_\phi(S_i)</math>, not the <math>\phi</math> in <math>\gamma V_\phi(S_{i+1})</math>, since the latter is treated as a fixed target rather than differentiated through. This is a common source of error in implementations that use [[automatic differentiation]], and requires "stopping the gradient" at that point.
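A minimal sketch of this update in PyTorch-style Python may clarify where the gradient must be stopped; the names used here (<code>value_net</code>, <code>optimizer</code>, <code>td_critic_update</code>) are illustrative assumptions, not part of any particular library's API:
<syntaxhighlight lang="python">
import torch

# Minimal sketch (assumed names): one-step TD critic update with a stopped gradient.
def td_critic_update(value_net, optimizer, s, r, s_next, gamma=0.99):
    v = value_net(s)                            # V_phi(S_i); gradient flows through this term
    with torch.no_grad():                       # "stop the gradient": the target is a constant
        target = r + gamma * value_net(s_next)  # R_i + gamma * V_phi(S_{i+1})
    loss = (target - v).pow(2).mean()           # squared TD error
    optimizer.zero_grad()
    loss.backward()                             # gradient w.r.t. the phi in V_phi(S_i) only
    optimizer.step()
</syntaxhighlight>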
Similarly, if the critic is estimating the action-value function <math>Q^{\pi_\theta}</math>, then it can be learned by [[Q-learning]] or [[State–action–reward–state–action|SARSA]]. In SARSA, the critic maintains an estimate of the Q-function, parameterized by <math>\phi</math>, denoted as <math>Q_\phi(s, a)</math>. The temporal difference error is then calculated as <math>\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i,A_i)</math>, and the critic is updated by<math display="block">\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i)</math>The advantage critic can be trained by learning both a Q-function <math>Q_\phi(s,a)</math> and a state-value function <math>V_\phi(s)</math>, then letting <math>A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)</math>. However, it is more common to train just a state-value function <math>V_\phi(s)</math>, then estimate the advantage by an <math>n</math>-step return:<ref name=":0" /><math display="block">A_\phi(S_i,A_i) \approx \sum_{j=0}^{n-1} \gamma^{j}R_{i+j} + \gamma^{n}V_\phi(S_{i+n}) - V_\phi(S_i)</math>Here, <math>n</math> is a positive integer. The higher <math>n</math> is, the lower the bias in the advantage estimate, but at the price of higher variance.
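As a concrete illustration of the <math>n</math>-step estimate above, the following sketch computes it from stored rewards and value predictions for a single trajectory; the function and variable names are assumptions made for illustration only:
<syntaxhighlight lang="python">
# Illustrative sketch (assumed names): n-step advantage estimate
#   A(S_i, A_i) ≈ sum_{j=0}^{n-1} gamma^j R_{i+j} + gamma^n V(S_{i+n}) - V(S_i)
def n_step_advantage(rewards, values, i, n, gamma=0.99):
    # rewards[j] = R_j, values[j] = V_phi(S_j); requires i + n < len(values)
    n_step_return = sum(gamma**j * rewards[i + j] for j in range(n))
    return n_step_return + gamma**n * values[i + n] - values[i]
</syntaxhighlight>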
== Variants ==
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |last=Haarnoja |first=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |doi=10.48550/arXiv.1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
* '''Deep Deterministic Policy Gradient (DDPG)''': Specialized for continuous action spaces.<ref>{{Citation |last=Lillicrap |first=Timothy P. |title=Continuous control with deep reinforcement learning |date=2019-07-05 |url=https://arxiv.org/abs/1509.02971 |doi=10.48550/arXiv.1509.02971 |last2=Hunt |first2=Jonathan J. |last3=Pritzel |first3=Alexander |last4=Heess |first4=Nicolas |last5=Erez |first5=Tom |last6=Tassa |first6=Yuval |last7=Silver |first7=David |last8=Wierstra |first8=Daan}}</ref>
* '''Generalized Advantage Estimation (GAE)''': introduces a hyperparameter <math>\lambda</math> that smoothly interpolates between Monte Carlo returns (<math>\lambda = 1</math>, high variance, no bias) and one-step TD learning (<math>\lambda = 0</math>, low variance, high bias). It uses an exponentially decaying average of <math>n</math>-step advantage estimates, with <math>\lambda</math> being the decay strength (see the sketch after this list).<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref>
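The exponentially decaying average used by GAE can be computed recursively over a trajectory, as in the following sketch; the variable names and the convention that <code>values</code> carries one extra bootstrap entry are assumptions for illustration, not a definitive implementation:
<syntaxhighlight lang="python">
# Illustrative sketch (assumed names): GAE over a single trajectory.
# values has length len(rewards) + 1, with values[-1] = V_phi(S_T) used for bootstrapping.
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # exponentially decaying sum
        advantages[t] = gae
    return advantages
</syntaxhighlight>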
== See also ==