{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
{{Orphan|date=January 2025}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[Policy gradient method|policy gradient methods]], and value-based RL algorithms such as [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref name=":1">{{Cite journal |last1=Arulkumaran |first1=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |issn=1053-5888}}</ref>
 
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last1=Konda |first1=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy, while others are off-policy; some apply only to continuous action spaces, some only to discrete action spaces, and some to both.
 
AC algorithms are one of the main families of algorithms used in modern RL.<ref name=":1" />
 
== Overview ==
 
The actor-critic method can be understood as an improvement over pure policy gradient methods such as REINFORCE. It belongs to the family of [[policy gradient method]]s but addresses their high variance issue by incorporating a value function approximator (the critic), which plays the role of a baseline. The actor uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value Q-function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
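
For illustration, the two components can be sketched as two separately parameterized functions. The following is a minimal sketch assuming a discrete action space, linear function approximation, and NumPy; the class names <code>Actor</code> and <code>Critic</code> and their fields are illustrative, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

class Actor:
    """Policy pi(a|s): softmax over one linear score per action."""
    def __init__(self, n_features, n_actions, rng):
        self.theta = np.zeros((n_actions, n_features))  # policy parameters
        self.rng = rng

    def probs(self, s):
        logits = self.theta @ s
        logits -= logits.max()          # numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def sample(self, s):
        return self.rng.choice(len(self.theta), p=self.probs(s))

class Critic:
    """State-value estimate V(s) as a linear function of state features."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)   # value-function parameters

    def value(self, s):
        return self.w @ s

rng = np.random.default_rng(0)
actor = Actor(n_features=4, n_actions=2, rng=rng)
critic = Critic(n_features=4)
s = rng.normal(size=4)   # stand-in state feature vector
a = actor.sample(s)      # the actor chooses an action
v = critic.value(s)      # the critic evaluates the state
</syntaxhighlight>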
 
The policy gradient objective is the expected discounted return<math display="block">J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t R_t \Big| S_0 = s_0\right],</math>where <math>\gamma</math> is the discount factor, <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is a given starting state, and <math>T</math> is the time-horizon (which can be infinite).
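
For a single sampled trajectory, the quantity inside the expectation is simply the discounted sum of rewards. A minimal sketch (the helper name <code>discounted_return</code> is illustrative):

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * R_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# J(theta) is the expectation of this quantity over trajectories
# generated by following the policy pi_theta from s_0.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*1.0 = 1.81
</syntaxhighlight>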
 
The goal of the policy gradient method is to maximize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] on the policy gradient <math>\nabla_\theta J(\theta)</math>.
 
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[Unbiased estimator|unbiased estimators]] of the policy gradient:<math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear combination of the following terms:
 
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>: never used.
* <math display="inline">\gamma^j\sum_{j \leq i\leq T} (\gamma^{i-j} R_i)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^j \sum_{j \leq i\leq T} (\gamma^{i-j} R_i) - b(S_j) </math>: used by the REINFORCE with baseline algorithm.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: 1-step TD learning.
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)''': Uses the advantage function instead of TD error.<ref name=":0">{{Citation |last=Mnih |first=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |doi=10.48550/arXiv.1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
 
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the n-step TD learning ones.
 
== Variants ==
 
* '''Advantage Actor-Critic (A2C)''': Uses the advantage function instead of TD error.<ref name=":0" />
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |last=Haarnoja |first=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |doi=10.48550/arXiv.1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
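
As a simplified illustration of how these variants differ in their objectives, the per-step loss typically minimized by an A2C-style implementation combines a policy term weighted by the advantage, a value-regression term, and an optional entropy bonus that encourages exploration; SAC goes further by folding the entropy term into the learned (soft) value functions themselves. The names and coefficients below are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
def a2c_step_loss(log_prob_a, advantage, value, value_target,
                  entropy, value_coef=0.5, entropy_coef=0.01):
    """Per-step A2C-style loss (to be minimized, e.g. averaged over a batch).

    log_prob_a:   log pi_theta(a|s) of the action actually taken
    advantage:    estimate of A(s, a), treated as a constant w.r.t. theta
    value:        critic output V(s)
    value_target: bootstrapped return the critic regresses toward
    entropy:      entropy of pi_theta(.|s)
    """
    policy_loss = -log_prob_a * advantage            # gradient ascent on the policy objective
    value_loss = value_coef * (value - value_target) ** 2
    entropy_bonus = -entropy_coef * entropy          # subtract entropy to maximize it
    return policy_loss + value_loss + entropy_bonus
</syntaxhighlight>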