{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[Policy gradient method|policy gradient methods]], and value-based RL algorithms such as value iteration, [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref>{{Cite journal |
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |
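As a concrete illustration, the interplay between the two components can be sketched with a tabular actor and critic on a toy problem. The two-state environment, learning rates, and step count below are hypothetical, chosen only for the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy MDP: two states, two actions; only action 1 in state 0
# is rewarded, and the next state is uniformly random.
def step(state, action, rng):
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    return reward, rng.integers(2)

rng = np.random.default_rng(0)
theta = np.zeros((2, 2))   # actor: policy logits per state
V = np.zeros(2)            # critic: state-value estimates
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.1

state = 0
for _ in range(5000):
    probs = softmax(theta[state])
    action = rng.choice(2, p=probs)
    reward, next_state = step(state, action, rng)
    # Critic evaluates the action via the one-step TD error.
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha_critic * td_error
    # Actor shifts its policy in the direction the critic approves of:
    # gradient of log pi(a|s) for a softmax policy is onehot(a) - probs.
    grad_log = -probs
    grad_log[action] += 1.0
    theta[state] += alpha_actor * td_error * grad_log
    state = next_state

print(softmax(theta[0])[1])  # probability of the rewarded action in state 0
```

The critic's TD error serves as the learning signal for both components: the critic uses it to improve its value estimates, and the actor uses it as an estimate of the advantage of the action just taken.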
== Overview ==
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}( S_{j+1}) - V^{\pi_\theta}( S_{j})\right)</math>: [[Temporal difference learning|TD(1) learning]].
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>: the '''Q Actor-Critic'''.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(2) learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |
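Each of the estimators above can be computed from a sampled trajectory together with the critic's value estimates. A minimal sketch of the n-step TD error (the reward and value numbers are illustrative, not from any experiment):

```python
import numpy as np

# Hypothetical sampled trajectory and critic estimates (illustrative numbers).
rewards = np.array([1.0, 0.0, 0.5, 1.0])       # R_j, R_{j+1}, ...
values  = np.array([0.8, 0.6, 0.7, 0.9, 0.5])  # V(S_j), V(S_{j+1}), ...
gamma = 0.9

def td_n_error(rewards, values, j, n, gamma):
    """n-step TD error:
    sum_{k<n} gamma^k R_{j+k} + gamma^n V(S_{j+n}) - V(S_j)."""
    g = sum(gamma**k * rewards[j + k] for k in range(n))
    return g + gamma**n * values[j + n] - values[j]

# TD(1): one-step error R_j + gamma*V(S_{j+1}) - V(S_j)
print(td_n_error(rewards, values, 0, 1, gamma))  # 1.0 + 0.9*0.6 - 0.8 = 0.74
```

Larger n folds in more sampled rewards and bootstraps from the critic further in the future, trading lower bias for higher variance.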
=== Critic ===
</math>, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with <math>\lambda</math> as the decay strength.<ref>{{Citation |
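In practice the λ-weighted sum is not evaluated term by term: the generalized advantage estimate can be accumulated in a single backward pass over a trajectory, using the recursion <math>A_t = \delta_t + \gamma\lambda A_{t+1}</math> with <math>\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)</math>. A minimal sketch (illustrative numbers):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Backward-pass GAE: A_t = delta_t + gamma*lam*A_{t+1},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Hypothetical trajectory; values includes a bootstrap V(S_T) at the end.
rewards = np.array([1.0, 0.0, 0.5])
values  = np.array([0.8, 0.6, 0.7, 0.0])
# lam=0 recovers the one-step TD errors; lam=1 the full discounted-return error.
print(gae(rewards, values, gamma=0.9, lam=0.0))
```

Setting <math>\lambda=0</math> reduces each entry to its one-step TD error (high bias, low variance), while <math>\lambda=1</math> sums all discounted TD errors (low bias, high variance), matching the trade-off described above.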
== Variants ==
* '''Asynchronous Advantage Actor-Critic (A3C)''': [[Parallel computing|Parallel and asynchronous]] version of A2C.<ref name=":0" />
* '''Soft Actor-Critic (SAC)''': Incorporates entropy maximization for improved exploration.<ref>{{Citation |
* '''Deep Deterministic Policy Gradient (DDPG)''': Specialized for continuous action spaces.<ref>{{Citation |
== See also ==
== References ==
{{Reflist|30em}}
* {{Cite journal |
* {{Cite book |
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |___location=Belmont, Massachusetts}}
* {{Cite book |last=Szepesvári |first=Csaba |title=Algorithms for Reinforcement Learning |date=2010 |publisher=Springer International Publishing |isbn=978-3-031-00423-0 |edition=1 |series=Synthesis Lectures on Artificial Intelligence and Machine Learning |___location=Cham}}