Fixed reference date error(s) (see CS1 errors: dates for details) and AWB general fixes
{{Short description|Reinforcement learning algorithms that combine policy and value estimation}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last1=Konda |first1=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy and others are off-policy. Some apply only to continuous or only to discrete action spaces, while others work in both cases.
The goal of the policy gradient method is to maximize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] along the policy gradient <math>\nabla J(\theta)</math>.
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear sum of the following:
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: '''Advantage Actor-Critic (A2C)'''.<ref name=":0">{{Citation |last1=Mnih |first1=Volodymyr |title=Asynchronous Methods for Deep Reinforcement Learning |date=2016-06-16 |url=https://arxiv.org/abs/1602.01783 |arxiv=1602.01783 |last2=Badia |first2=Adrià Puigdomènech |last3=Mirza |first3=Mehdi |last4=Graves |first4=Alex |last5=Lillicrap |first5=Timothy P. |last6=Harley |first6=Tim |last7=Silver |first7=David |last8=Kavukcuoglu |first8=Koray}}</ref>
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(2) learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref name="arxiv.org">{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is an exponentially weighted average of the TD(n) learning terms, with weights <math display="inline">(1-\lambda)\lambda^{n-1}</math> that sum to one.
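As an illustration of the TD(n) term in the list above, the following Python sketch computes the bracketed quantity for a recorded trajectory. The function name <code>n_step_advantage</code> and the list layout (<code>values[t]</code> holding a critic estimate of <math>V^{\pi_\theta}(S_t)</math>, with one more entry than <code>rewards</code>) are assumptions made for this example. The leading factor <math>\gamma^j</math> is left to the caller.

```python
def n_step_advantage(rewards, values, j, n, gamma):
    """Bracketed TD(n) term starting at step j.

    Computes sum_{k=0}^{n-1} gamma^k R_{j+k} + gamma^n V(S_{j+n}) - V(S_j).
    rewards[t] is R_t and values[t] is an estimate of V(S_t); both are
    plain Python lists, and values needs an entry at index j + n.
    """
    if j + n > len(rewards) or j + n >= len(values):
        raise ValueError("trajectory too short for the requested n")
    # Discounted sum of the n observed rewards.
    discounted_rewards = sum(gamma ** k * rewards[j + k] for k in range(n))
    # Bootstrap from the critic at step j + n, subtract the baseline V(S_j).
    return discounted_rewards + gamma ** n * values[j + n] - values[j]


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]  # last entry: value after the final step
one_step = n_step_advantage(rewards, values, j=0, n=1, gamma=0.5)
two_step = n_step_advantage(rewards, values, j=0, n=2, gamma=0.5)
```

With <code>n=1</code> this reduces to the one-step TD error, and larger <code>n</code> relies more on observed rewards and less on the critic.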
=== Critic ===
In the unbiased estimators given above, certain functions such as <math>V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}</math> appear. These are approximated by the '''critic'''. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.
For example, if the critic is estimating the state-value function <math>V^{\pi_\theta}(s)</math>, then it can be learned by any value function approximation method. Let the critic be a function approximator <math>V_\phi(s)</math> with parameters <math>\phi</math>.
The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error:<math display="block">\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i)</math>The critic parameters are updated by gradient descent on half the squared TD error:<math display="block">\phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}(\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)</math>where <math>\alpha</math> is the learning rate. Note that the gradient is taken with respect to the <math>\phi</math> in <math>V_\phi(S_i)</math> only: the <math>\phi</math> in <math>\gamma V_\phi(S_{i+1})</math> is part of a moving target and is treated as a constant. This is a common source of error in implementations that use [[automatic differentiation]], which must "stop the gradient" at that point.
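The update can be sketched in Python for a linear critic <math>V_\phi(s) = \phi \cdot x(s)</math>. The linear form, the feature vectors <code>x_s</code> and <code>x_next</code>, and the function name <code>td_update</code> are assumptions made for this example. For a linear critic, <math>\nabla_\phi V_\phi(S_i)</math> is the feature vector of <math>S_i</math>.

```python
def td_update(phi, x_s, x_next, reward, gamma, alpha):
    """One semi-gradient TD step for a linear critic V_phi(s) = phi . x(s).

    phi, x_s and x_next are equal-length lists of floats.
    Returns the updated parameters and the TD error delta.
    """
    v_s = sum(p * x for p, x in zip(phi, x_s))
    v_next = sum(p * x for p, x in zip(phi, x_next))
    # The bootstrap target reward + gamma * v_next is treated as a constant:
    # no gradient flows through v_next (the "stop gradient").
    delta = reward + gamma * v_next - v_s
    # For a linear critic the gradient of V_phi(S_i) w.r.t. phi is x_s.
    new_phi = [p + alpha * delta * x for p, x in zip(phi, x_s)]
    return new_phi, delta


phi = [0.5, 2.0]
new_phi, delta = td_update(phi, x_s=[1.0, 0.0], x_next=[0.0, 1.0],
                           reward=1.0, gamma=0.9, alpha=0.5)
```

In this usage example only the first parameter changes. The second parameter affects the target through <code>x_next</code>, but because the target is held constant it receives no update.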
</math>, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns with <math>
\lambda
</math> being the decay strength.<ref name="arxiv.org"/>
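The effect of <math>\lambda</math> can be illustrated with a short Python sketch. It uses the backward recursion <math>\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}</math>, a common way to compute the estimate on a finite trajectory. The function name <code>gae</code>, the list layout (<code>values</code> has one more entry than <code>rewards</code>), and the truncation at the end of the trajectory are assumptions made for this example.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for one finite trajectory.

    rewards[t] is R_t; values[t] estimates V(S_t), with len(values)
    equal to len(rewards) + 1 so the last step can bootstrap.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the estimate of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]
low_variance = gae(rewards, values, gamma=0.5, lam=0.0)  # one-step TD errors
low_bias = gae(rewards, values, gamma=0.5, lam=1.0)      # full discounted returns minus V
```

Setting <code>lam=0.0</code> returns the one-step TD errors, which depend heavily on the critic. Setting <code>lam=1.0</code> returns the discounted sum of all remaining rewards minus the value baseline, which depends on the critic only through the baseline and the final bootstrap value.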
== Variants ==
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |location=Belmont, Massachusetts}}
* {{Cite book |last=Szepesvári |first=Csaba |title=Algorithms for Reinforcement Learning |date=2010 |publisher=Springer International Publishing |isbn=978-3-031-00423-0 |edition=1 |series=Synthesis Lectures on Artificial Intelligence and Machine Learning |location=Cham}}
* {{Cite journal |last=Grondman |first=Ivo |last2=Busoniu |first2=Lucian |last3=Lopes |first3=Gabriel A. D. |last4=Babuska |first4=Robert |date=November 2012
[[Category:Reinforcement learning]]