Actor-critic algorithm

{{Short description|Reinforcement learning algorithms that combine policy and value estimation}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[policy gradient method]]s, and value-based RL algorithms such as value iteration, [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref>{{Cite journal |last1=Arulkumaran |first1=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=https://ieeexplore.ieee.org/document/8103164 |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |arxiv=1708.05866 |bibcode=2017ISPM...34...26A |issn=1053-5888}}</ref>
 
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last1=Konda |first1=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy and others are off-policy. Some apply only to continuous or only to discrete action spaces, while others work in both cases.
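
The interaction between the two components can be illustrated with a minimal sketch of a one-step, on-policy actor-critic update. It assumes a tabular softmax actor and a tabular state-value critic. The function name <code>actor_critic_step</code>, the step sizes, and the two-action toy problem are hypothetical choices made for illustration and are not taken from the cited sources.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """Apply one actor-critic update in place and return the TD error.

    theta[s] is a list of action preferences (the actor) and V[s] is the
    estimated state value (the critic). Both tables are illustrative
    assumptions.
    """
    # Critic: one-step TD error, bootstrapping from V[s_next] unless terminal.
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += lr_critic * delta

    # Actor: policy-gradient step weighted by the critic's TD error.
    # For a softmax policy, d/d theta[s][b] of log pi(a|s) is 1[a == b] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += lr_actor * delta * grad_log
    return delta


# Toy problem (assumed): a single state with two actions; action 1 pays 1.
rng = random.Random(0)
theta = {0: [0.0, 0.0]}
V = {0: 0.0}
for _ in range(2000):
    probs = softmax(theta[0])
    a = rng.choices([0, 1], weights=probs, k=1)[0]
    r = 1.0 if a == 1 else 0.0
    actor_critic_step(theta, V, 0, a, r, None, True)  # every episode ends after one step
final_probs = softmax(theta[0])
```

In this sketch the critic's TD error plays the role of the evaluation signal: a positive error raises the preference for the action just taken and a negative error lowers it.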
Line 9:
 
=== Actor ===
The '''actor''' uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value Q-function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
 
The actor is a parameterized function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes the state of the environment <math>s</math> as its argument and produces a [[probability distribution]] <math>\pi_\theta(\cdot | s)</math> over actions.
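
As a concrete illustration, the following sketch implements one possible parameterization: a linear-softmax actor over a discrete action set. The function name <code>policy</code>, the layout of <code>theta</code> as one weight vector per action, and the example numbers are assumptions made here for illustration.

```python
import math
import random


def policy(theta, features):
    """Return pi_theta(. | s) for a linear-softmax actor.

    theta holds one weight vector per action and features is the feature
    vector of the state s. Both layouts are illustrative assumptions.
    """
    logits = [sum(w * x for w, x in zip(weights, features)) for weights in theta]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


theta = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]  # 3 actions, 2 state features
state = [1.0, 2.0]
probs = policy(theta, state)  # one probability per action
action = random.choices(range(len(probs)), weights=probs, k=1)[0]  # sample A ~ pi_theta(. | s)
```

For continuous action spaces the same idea applies, with the actor instead outputting the parameters of a density, for example the mean and variance of a Gaussian.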
Line 39:
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: [[Temporal difference learning|TD(1) learning]].
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |last1=Mnih |first1=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |arxiv=1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(2) learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimation)'''.<ref name="arxiv.org">{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is an exponentially weighted average of the TD(n) learning terms, with weights <math>(1-\lambda)\lambda^{n-1}</math> that sum to 1.
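
The n-step and λ-weighted estimators in the list above can be sketched in code for a finite recorded trajectory. The sketch omits the common prefactor <math>\gamma^j</math> and computes the λ-weighted estimator with the backward recursion <math>\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}</math>, where <math>\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)</math>. The function names, the list layout (<code>values</code> has one more entry than <code>rewards</code>), and the example numbers are assumptions made for illustration.

```python
def n_step_advantage(rewards, values, j, n, gamma):
    """n-step TD advantage estimate at time j.

    rewards[t] is R_t and values[t] is the critic's estimate V(S_t), so
    len(values) == len(rewards) + 1. The last entry of values is the
    bootstrap value, which is 0 if the trajectory ended in a terminal state.
    """
    n = min(n, len(rewards) - j)  # truncate at the end of the trajectory
    ret = sum(gamma ** k * rewards[j + k] for k in range(n))
    return ret + gamma ** n * values[j + n] - values[j]


def gae(rewards, values, gamma, lam):
    """Lambda-weighted advantage estimates for every step of a trajectory."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running  # backward recursion
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # last entry: value of the terminal state
gamma = 0.9
one_step = gae(rewards, values, gamma, lam=0.0)      # reduces to the TD(1) terms
full_return = gae(rewards, values, gamma, lam=1.0)   # reduces to the longest n-step terms
mixed = gae(rewards, values, gamma, lam=0.95)
```

Setting <code>lam</code> to 0 recovers the one-step TD term, and setting it to 1 recovers the n-step term that runs to the end of the trajectory. Intermediate values trade bias against variance.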
 
=== Critic ===
Line 66:
 
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |last1=Haarnoja |first1=Tuomas |title=Soft Actor-Critic Algorithms and Applications |date=2019-01-29 |url=https://arxiv.org/abs/1812.05905 |arxiv=1812.05905 |last2=Zhou |first2=Aurick |last3=Hartikainen |first3=Kristian |last4=Tucker |first4=George |last5=Ha |first5=Sehoon |last6=Tan |first6=Jie |last7=Kumar |first7=Vikash |last8=Zhu |first8=Henry |last9=Gupta |first9=Abhishek}}</ref>
* '''Deep Deterministic Policy Gradient (DDPG)''': Specialized for continuous action spaces.<ref>{{Citation |last1=Lillicrap |first1=Timothy P. |title=Continuous control with deep reinforcement learning |date=2019-07-05 |url=https://arxiv.org/abs/1509.02971 |arxiv=1509.02971 |last2=Hunt |first2=Jonathan J. |last3=Pritzel |first3=Alexander |last4=Heess |first4=Nicolas |last5=Erez |first5=Tom |last6=Tassa |first6=Yuval |last7=Silver |first7=David |last8=Wierstra |first8=Daan}}</ref>
 
== See also ==
Line 76:
== References ==
{{Reflist|30em}}
* {{Cite journal |last1=Konda |first1=Vijay R. |last2=Tsitsiklis |first2=John N. |date=January 2003 |title=On Actor-Critic Algorithms |url=http://epubs.siam.org/doi/10.1137/S0363012901385691 |journal=SIAM Journal on Control and Optimization |language=en |volume=42 |issue=4 |pages=1143–1166 |doi=10.1137/S0363012901385691 |issn=0363-0129|url-access=subscription }}
* {{Cite book |last1=Sutton |first1=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=2 |series=Adaptive computation and machine learning series |location=Cambridge, Massachusetts}}
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |location=Belmont, Massachusetts}}
* {{Cite book |last=Szepesvári |first=Csaba |title=Algorithms for Reinforcement Learning |date=2010 |publisher=Springer International Publishing |isbn=978-3-031-00423-0 |edition=1 |series=Synthesis Lectures on Artificial Intelligence and Machine Learning |location=Cham}}
* {{Cite journal |last1=Grondman |first1=Ivo |last2=Busoniu |first2=Lucian |last3=Lopes |first3=Gabriel A. D. |last4=Babuska |first4=Robert |date=November 2012 |title=A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients |url=http://ieeexplore.ieee.org/document/6392457/ |journal=IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) |volume=42 |issue=6 |pages=1291–1307 |doi=10.1109/TSMCC.2012.2218595 |bibcode=2012ITHMS..42.1291G |issn=1094-6977 |url=https://hal.science/hal-00756747 }}
{{Artificial intelligence navbox}}
 
[[Category:Reinforcement learning]]
[[Category:Machine learning algorithms]]