{{Short description|Reinforcement learning algorithms that combine policy and value estimation}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[policy gradient method]]s, and value-based RL algorithms such as value iteration, [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref>{{Cite journal |last1=Arulkumaran |first1=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |arxiv=1708.05866 |bibcode=2017ISPM...34...26A |issn=1053-5888}}</ref>
 
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last1=Konda |first1=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy and some are off-policy; some apply to discrete action spaces, some to continuous action spaces, and some to both.
 
== Overview ==
Actor-critic methods can be understood as an improvement over pure policy gradient methods like REINFORCE, obtained by introducing a baseline to reduce the variance of the gradient estimate.
 
=== Actor ===
The '''actor''' uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value Q-function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
 
The actor is a parameterized function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot | s)</math>.
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a | s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a | s) da = 1</math>.
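
For a discrete action space, one concrete way to parameterize <math>\pi_\theta</math> is a small neural network whose outputs pass through a softmax, so that the action probabilities sum to one. The following is a minimal sketch; the use of PyTorch, the layer sizes, and the state/action dimensions are illustrative assumptions, not part of the algorithm.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DiscreteActor(nn.Module):
    """Parameterized policy pi_theta(. | s) for a discrete action space (illustrative sketch)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(state)                                # unnormalized scores
        return torch.distributions.Categorical(logits=logits)   # softmax is applied internally

# Usage: sample an action and keep its log-probability for the policy gradient.
actor = DiscreteActor(state_dim=4, n_actions=2)   # dimensions chosen only for illustration
state = torch.randn(4)
dist = actor(state)
action = dist.sample()
log_prob = dist.log_prob(action)                  # log pi_theta(a | s)
assert torch.isclose(dist.probs.sum(), torch.tensor(1.0))   # probabilities sum to 1
</syntaxhighlight>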
 
The goal of policy optimization is to improve the actor. That is, to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]
</math>where <math>\gamma</math> is the discount factor, <math>r_t</math> is the reward at step <math>t</math>, and <math>T</math> is the time horizon.
The goal of the policy gradient method is to optimize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] on the policy gradient <math>\nabla J(\theta)</math>.
 
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[unbiased estimator]]s of the policy gradient:<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear sum of the following:
 
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>: never used.
* <math display="inline">\gamma^j\sum_{j \leq i\leq T} (\gamma^{i-j} R_i)</math>: used by the '''REINFORCE''' algorithm.
* <math display="inline">\gamma^j \sum_{j \leq i\leq T} (\gamma^{i-j} R_i) - b(S_j) </math>: used by the '''REINFORCE with baseline''' algorithm. Here <math>b</math> is an arbitrary function.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: 1-step[[Temporal difference learning|TD(1) learning]].
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |lastlast1=Mnih |firstfirst1=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |doi=10.48550/arXiv.1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD(2) learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref name="arxiv.org">{{Citation |lastlast1=Schulman |firstfirst1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the TD(n-step TD) learning onesterms.
 
=== Critic ===
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD learning.
In the unbiased estimators given above, certain functions such as <math>V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}</math> appear. These are approximated by the '''critic'''. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the n-step TD learning ones.
 
For example, if the critic is estimating the state-value function <math>V^{\pi_\theta}(s)</math>, then it can be learned by any value function approximation method. Let the critic be a function approximator <math>V_\phi(s)</math> with parameters <math>\phi</math>.
 
The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error:<math display="block">\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i)</math>The critic parameters are updated by gradient descent on the squared TD error:<math display="block">\phi \leftarrow \phi - \alpha \nabla_\phi (\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)</math>where <math>\alpha</math> is the learning rate. Note that the gradient is taken with respect to the <math>\phi</math> in <math>V_\phi(S_i)</math> only, since the <math>\phi</math> in <math>\gamma V_\phi(S_{i+1})</math> constitutes a moving target, and the gradient is not taken with respect to that. This is a common source of error in implementations that use [[automatic differentiation]], and requires "stopping the gradient" at that point.
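
In frameworks with [[automatic differentiation]], "stopping the gradient" is typically done by excluding the bootstrapped target from the computation graph. The following sketch implements the semi-gradient TD(1) update above; the use of PyTorch, the critic architecture, the state dimension, and the learning rate are illustrative assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# A generic state-value approximator V_phi(s); architecture and sizes are illustrative.
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(critic.parameters(), lr=1e-3)
gamma = 0.99

def td1_critic_step(s, r, s_next):
    """Semi-gradient TD(1) update: the target r + gamma * V_phi(s') is treated as a
    constant (stop-gradient), so only V_phi(s) is differentiated.
    s, s_next: state tensors of shape (4,); r: scalar reward."""
    v = critic(s)
    with torch.no_grad():                 # "stop the gradient" at the bootstrapped target
        target = r + gamma * critic(s_next)
    delta = target - v                    # TD error delta_i
    loss = delta.pow(2).mean()            # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return delta.detach()
</syntaxhighlight>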
 
Similarly, if the critic is estimating the action-value function <math>Q^{\pi_\theta}</math>, then it can be learned by [[Q-learning]] or [[State–action–reward–state–action|SARSA]]. In SARSA, the critic maintains an estimate of the Q-function, parameterized by <math>\phi</math>, denoted as <math>Q_\phi(s, a)</math>. The temporal difference error is then calculated as <math>\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i,A_i)</math>. The critic is then updated by<math display="block">\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i)</math>The advantage critic can be trained by training both a Q-function <math>Q_\phi(s,a)</math> and a state-value function <math>V_\phi(s)</math>, then letting <math>A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)</math>. However, it is more common to train just a state-value function <math>V_\phi(s)</math> and then estimate the advantage by<ref name=":0" /><math display="block">A_\phi(S_i,A_i) \approx \sum_{j\in 0:n-1} \gamma^{j}R_{i+j} + \gamma^{n}V_\phi(S_{i+n}) - V_\phi(S_i)</math>Here, <math>n</math> is a positive integer. The higher <math>n</math> is, the lower the bias in the advantage estimate, but at the price of higher variance.
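
A sketch of this n-step advantage estimate follows, under the assumption (made only for illustration) that the rewards and critic values along one collected trajectory are already available as plain Python lists.

<syntaxhighlight lang="python">
def n_step_advantage(rewards, values, i, n, gamma):
    """Estimate A(S_i, A_i) with an n-step return:
    sum_{j=0}^{n-1} gamma^j R_{i+j} + gamma^n V(S_{i+n}) - V(S_i).

    `rewards[t]` is R_t and `values[t]` is the critic's estimate V_phi(S_t)."""
    g = sum(gamma**j * rewards[i + j] for j in range(n))   # n-step discounted reward sum
    g += gamma**n * values[i + n]                          # bootstrapped tail value
    return g - values[i]                                   # subtract the baseline V(S_i)

# Example with a 3-step return (indices must satisfy i + n <= len(values) - 1).
rewards = [1.0, 0.0, 1.0, 1.0]
values  = [0.5, 0.4, 0.6, 0.7, 0.2]
adv = n_step_advantage(rewards, values, i=0, n=3, gamma=0.99)
</syntaxhighlight>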
 
The '''Generalized Advantage Estimation (GAE)''' introduces a hyperparameter <math>\lambda</math> that smoothly interpolates between Monte Carlo returns (<math>\lambda = 1</math>, high variance, no bias) and 1-step TD learning (<math>\lambda = 0</math>, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns with <math>\lambda</math> being the decay strength.<ref name="arxiv.org"/>
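
In practice this estimate is usually computed with a backward recursion over the TD errors, <math display="inline">\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}</math>, rather than by summing n-step returns directly. The following sketch assumes the rewards and critic values for one trajectory are already available as Python lists; the default values of <math>\gamma</math> and <math>\lambda</math> are illustrative.

<syntaxhighlight lang="python">
def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    `rewards[t]` is R_t for t = 0..T-1 and `values[t]` is V_phi(S_t) for t = 0..T
    (one extra bootstrap value for the final state). Uses the recursion
    A_t = delta_t + gamma * lam * A_{t+1}, with
    delta_t = R_t + gamma * V(S_{t+1}) - V(S_t)."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        gae = delta + gamma * lam * gae                          # exponentially decayed sum
        advantages[t] = gae
    return advantages
</syntaxhighlight>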
 
== Variants ==
 
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |last1=Haarnoja |first1=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |doi=10.48550/arXiv.1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
* '''Deep Deterministic Policy Gradient (DDPG)''': Specialized for continuous action spaces.<ref>{{Citation |last1=Lillicrap |first1=Timothy P. |title=Continuous control with deep reinforcement learning |date=2019-07-05 |url=https://arxiv.org/abs/1509.02971 |doi=10.48550/arXiv.1509.02971 |last2=Hunt |first2=Jonathan J. |last3=Pritzel |first3=Alexander |last4=Heess |first4=Nicolas |last5=Erez |first5=Tom |last6=Tassa |first6=Yuval |last7=Silver |first7=David |last8=Wierstra |first8=Daan}}</ref>
 
== See also ==
== References ==
{{Reflist|30em}}
* {{Cite journal |last1=Konda |first1=Vijay R. |last2=Tsitsiklis |first2=John N. |date=January 2003 |title=On Actor-Critic Algorithms |url=http://epubs.siam.org/doi/10.1137/S0363012901385691 |journal=SIAM Journal on Control and Optimization |language=en |volume=42 |issue=4 |pages=1143–1166 |doi=10.1137/S0363012901385691 |issn=0363-0129 |url-access=subscription}}
* {{Cite book |last1=Sutton |first1=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=2 |series=Adaptive computation and machine learning series |___location=Cambridge, Massachusetts}}
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |___location=Belmont, Massachusetts}}
* {{Cite book |last=Szepesvári |first=Csaba |title=Algorithms for Reinforcement Learning |date=2010 |publisher=Springer International Publishing |isbn=978-3-031-00423-0 |edition=1 |series=Synthesis Lectures on Artificial Intelligence and Machine Learning |___location=Cham}}
* {{Cite journal |last1=Grondman |first1=Ivo |last2=Busoniu |first2=Lucian |last3=Lopes |first3=Gabriel A. D. |last4=Babuska |first4=Robert |date=November 2012 |title=A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients |journal=IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews |volume=42 |issue=6 |pages=1291–1307 |doi=10.1109/TSMCC.2012.2218595 |bibcode=2012ITHMS..42.1291G |issn=1094-6977 |url=https://hal.science/hal-00756747 }}
 
{{Artificial intelligence navbox}}
[[Category:Reinforcement learning]]
[[Category:Machine learning algorithms]]