Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline: subtracting the critic's value estimate from the sampled return leaves the gradient estimate unbiased while reducing its variance.
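A minimal numerical sketch of the baseline idea (illustrative only; the quantities below are synthetic stand-ins rather than outputs of a real rollout): the policy-gradient update weights each grad-log-probability term by a scalar, and replacing the raw return <math>G_t</math> by <math>G_t - V(s_t)</math> typically shrinks the variance of those weights.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic stand-ins: "state_values" plays the role of the critic's V(s_t),
# and "returns" plays the role of noisy sampled returns G_t.
rng = np.random.default_rng(0)
n = 10_000
state_values = rng.normal(loc=10.0, scale=3.0, size=n)
returns = state_values + rng.normal(scale=1.0, size=n)

reinforce_weights = returns                    # REINFORCE weight: G_t
actor_critic_weights = returns - state_values  # actor-critic weight: G_t - V(s_t)

print("variance with no baseline:  ", reinforce_weights.var())
print("variance with V(s) baseline:", actor_critic_weights.var())
</syntaxhighlight>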
=== Actor ===
The '''actor''' uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or some combination thereof.
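A small tabular sketch (the tables below are made up for illustration) of how these three quantities relate for a fixed policy: <math>V^\pi(s) = \textstyle\sum_a \pi(a|s) Q^\pi(s,a)</math> and <math>A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)</math>, so the advantage averaged under the policy is zero in every state.

<syntaxhighlight lang="python">
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(1)

pi = rng.random((n_states, n_actions))      # actor: pi(a|s), rows normalized below
pi /= pi.sum(axis=1, keepdims=True)

Q = rng.normal(size=(n_states, n_actions))  # a (fabricated) Q(s, a) table
V = (pi * Q).sum(axis=1)                    # V(s) = sum_a pi(a|s) Q(s, a)
A = Q - V[:, None]                          # A(s, a) = Q(s, a) - V(s)

# The expected advantage under the policy is zero for every state.
print((pi * A).sum(axis=1))                 # approximately [0, 0, 0, 0]
</syntaxhighlight>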
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a | s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a | s) da = 1</math>.
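A short numerical check of the two normalization conditions (the softmax parametrization and the Gaussian density used below are common choices assumed for illustration, not mandated by the text):

<syntaxhighlight lang="python">
import numpy as np

# Discrete actions: a softmax over per-action logits guarantees sum_a pi(a|s) = 1.
logits = np.array([2.0, -1.0, 0.5])
pi_discrete = np.exp(logits - logits.max())
pi_discrete /= pi_discrete.sum()
print(pi_discrete.sum())                      # 1.0

# Continuous actions: the policy is a density, e.g. Gaussian with mean mu and
# standard deviation sigma, which integrates to 1 over the action space.
mu, sigma = 0.3, 0.8
a = np.linspace(-10.0, 10.0, 200_001)
density = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print(np.sum(density) * (a[1] - a[0]))        # approximately 1.0
</syntaxhighlight>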
The goal of policy optimization is to improve the actor. That is, to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">
J(\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{T} \gamma^t r_t]
</math>where <math>\gamma \in [0,1]</math> is the discount factor, <math>r_t</math> is the reward at step <math>t</math>, and <math>T</math> is the time horizon of the episode.
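A Monte Carlo estimate of <math>J(\theta)</math> simply averages the discounted return over sampled episodes. In the sketch below the per-episode rewards are random placeholders standing in for actual rollouts of <math>\pi_\theta</math>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.99

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t for one episode
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Fabricated episodes of varying length; in practice these come from rollouts.
episodes = [rng.normal(size=int(rng.integers(5, 15))) for _ in range(1000)]
J_hat = np.mean([discounted_return(r, gamma) for r in episodes])
print("estimated J(theta):", J_hat)
</syntaxhighlight>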
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the TD(n) learning terms.
=== Critic ===
In the unbiased estimators given above, the functions <math>V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}</math> appear; these are unknown and must themselves be estimated, which is the role of the '''critic'''.
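How the critic is fitted is not specified here; one common choice (an assumption in the sketch below, not prescribed by this section) is temporal-difference learning, which moves <math>V(s)</math> toward the bootstrapped target <math>r + \gamma V(s')</math> after each observed transition:

<syntaxhighlight lang="python">
import numpy as np

def td0_update(V, s, r, s_next, done, gamma=0.99, alpha=0.1):
    # Tabular TD(0): nudge V(s) toward the one-step bootstrapped target.
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = np.zeros(5)                                      # value table for a 5-state toy problem
V = td0_update(V, s=0, r=1.0, s_next=1, done=False)  # one transition (s=0 -> s'=1, r=1)
print(V)                                             # only V[0] has moved
</syntaxhighlight>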
== Variants ==