{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
{{Orphan|date=January 2025}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[Policy gradient method|policy gradient methods]], and value-based RL algorithms such as [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref name=":1">{{Cite journal |last1=Arulkumaran |first1=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |issn=1053-5888}}</ref>
 
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last1=Konda |first1=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy, while others are off-policy; some apply only to continuous action spaces, some only to discrete action spaces, and some to both.
 
AC algorithms are one of the main families of algorithms used in modern RL.<ref name=":1" />
 
== Overview ==
 
The actor-critic method can be understood as an improvement over pure policy gradient methods such as REINFORCE. It belongs to the family of [[policy gradient method]]s but addresses their high variance issue by incorporating a value function approximator (the critic), which plays the role of a baseline. The actor uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value Q-function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
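
For illustration, the two components can be sketched as two separately parameterized functions. The following is a minimal sketch assuming a discrete action space, linear function approximation, and NumPy; the class names <code>Actor</code> and <code>Critic</code> and their fields are illustrative, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

class Actor:
    """Policy pi(a|s): softmax over one linear score per action."""
    def __init__(self, n_features, n_actions, rng):
        self.theta = np.zeros((n_actions, n_features))  # policy parameters
        self.rng = rng

    def probs(self, s):
        logits = self.theta @ s
        logits -= logits.max()          # numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def sample(self, s):
        return self.rng.choice(len(self.theta), p=self.probs(s))

class Critic:
    """State-value estimate V(s) as a linear function of state features."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)   # value-function parameters

    def value(self, s):
        return self.w @ s

rng = np.random.default_rng(0)
actor = Actor(n_features=4, n_actions=2, rng=rng)
critic = Critic(n_features=4)
s = rng.normal(size=4)   # stand-in state feature vector
a = actor.sample(s)      # the actor chooses an action
v = critic.value(s)      # the critic evaluates the state
</syntaxhighlight>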
 
The policy gradient objective is the expected discounted return<math display="block">J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t R_t \Big| S_0 = s_0\right],</math>where <math>\gamma</math> is the discount factor, <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is a given starting state, and <math>T</math> is the time-horizon (which can be infinite).
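
For a single sampled trajectory, the quantity inside the expectation is simply the discounted sum of rewards. A minimal sketch (the helper name <code>discounted_return</code> is illustrative):

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * R_t for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# J(theta) is the expectation of this quantity over trajectories
# generated by following the policy pi_theta from s_0.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*1.0 = 1.81
</syntaxhighlight>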
 
The goal of the policy gradient method is to maximize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] on the policy gradient <math>\nabla_\theta J(\theta)</math>.
 
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[Unbiased estimator|unbiased estimators]] of the policy gradient:<math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear combination of the following terms:
 
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>: never used.
* <math display="inline">\gamma^j\sum_{j \leq i\leq T} (\gamma^{i-j} R_i)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^j \sum_{j \leq i\leq T} (\gamma^{i-j} R_i) - b(S_j) </math>: used by the REINFORCE with baseline algorithm.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: 1-step TD learning.
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)''': Uses the advantage function instead of TD error.<ref name=":0">{{Citation |last=Mnih |first=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |doi=10.48550/arXiv.1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
 
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the n-step TD learning ones.
 
== Variants ==
 
* '''Advantage Actor-Critic (A2C)''': Uses the advantage function instead of TD error.<ref name=":0" />
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |last=Haarnoja |first=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |doi=10.48550/arXiv.1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
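
As a simplified illustration of how these variants differ in their objectives, the per-step loss typically minimized by an A2C-style implementation combines a policy term weighted by the advantage, a value-regression term, and an optional entropy bonus that encourages exploration; SAC goes further by folding the entropy term into the learned (soft) value functions themselves. The names and coefficients below are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
def a2c_step_loss(log_prob_a, advantage, value, value_target,
                  entropy, value_coef=0.5, entropy_coef=0.01):
    """Per-step A2C-style loss (to be minimized, e.g. averaged over a batch).

    log_prob_a:   log pi_theta(a|s) of the action actually taken
    advantage:    estimate of A(s, a), treated as a constant w.r.t. theta
    value:        critic output V(s)
    value_target: bootstrapped return the critic regresses toward
    entropy:      entropy of pi_theta(.|s)
    """
    policy_loss = -log_prob_a * advantage            # gradient ascent on the policy objective
    value_loss = value_coef * (value - value_target) ** 2
    entropy_bonus = -entropy_coef * entropy          # subtract entropy to maximize it
    return policy_loss + value_loss + entropy_bonus
</syntaxhighlight>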