Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline: subtracting the critic's value estimate from the sampled return leaves the gradient estimate unbiased while reducing its variance.
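A minimal numerical sketch of the baseline idea (illustrative only; the quantities below are synthetic stand-ins rather than outputs of a real rollout): the policy-gradient update weights each grad-log-probability term by a scalar, and replacing the raw return <math>G_t</math> by <math>G_t - V(s_t)</math> typically shrinks the variance of those weights.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic stand-ins: "state_values" plays the role of the critic's V(s_t),
# and "returns" plays the role of noisy sampled returns G_t.
rng = np.random.default_rng(0)
n = 10_000
state_values = rng.normal(loc=10.0, scale=3.0, size=n)
returns = state_values + rng.normal(scale=1.0, size=n)

reinforce_weights = returns                    # REINFORCE weight: G_t
actor_critic_weights = returns - state_values  # actor-critic weight: G_t - V(s_t)

print("variance with no baseline:  ", reinforce_weights.var())
print("variance with V(s) baseline:", actor_critic_weights.var())
</syntaxhighlight>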
=== Actor ===
The '''actor''' uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or some combination thereof.
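A small tabular sketch (the tables below are made up for illustration) of how these three quantities relate for a fixed policy: <math>V^\pi(s) = \textstyle\sum_a \pi(a|s) Q^\pi(s,a)</math> and <math>A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)</math>, so the advantage averaged under the policy is zero in every state.

<syntaxhighlight lang="python">
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(1)

pi = rng.random((n_states, n_actions))      # actor: pi(a|s), rows normalized below
pi /= pi.sum(axis=1, keepdims=True)

Q = rng.normal(size=(n_states, n_actions))  # a (fabricated) Q(s, a) table
V = (pi * Q).sum(axis=1)                    # V(s) = sum_a pi(a|s) Q(s, a)
A = Q - V[:, None]                          # A(s, a) = Q(s, a) - V(s)

# The expected advantage under the policy is zero for every state.
print((pi * A).sum(axis=1))                 # approximately [0, 0, 0, 0]
</syntaxhighlight>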
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a | s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a | s) da = 1</math>.
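A short numerical check of the two normalization conditions (the softmax parametrization and the Gaussian density used below are common choices assumed for illustration, not mandated by the text):

<syntaxhighlight lang="python">
import numpy as np

# Discrete actions: a softmax over per-action logits guarantees sum_a pi(a|s) = 1.
logits = np.array([2.0, -1.0, 0.5])
pi_discrete = np.exp(logits - logits.max())
pi_discrete /= pi_discrete.sum()
print(pi_discrete.sum())                      # 1.0

# Continuous actions: the policy is a density, e.g. Gaussian with mean mu and
# standard deviation sigma, which integrates to 1 over the action space.
mu, sigma = 0.3, 0.8
a = np.linspace(-10.0, 10.0, 200_001)
density = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print(np.sum(density) * (a[1] - a[0]))        # approximately 1.0
</syntaxhighlight>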
The goal of policy optimization is to improve the actor. That is, to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">
J(\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{T} \gamma^t r_t]
</math>where <math>\gamma \in [0,1]</math> is the discount factor, <math>r_t</math> is the reward at step <math>t</math>, and <math>T</math> is the time horizon of the episode.
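A Monte Carlo estimate of <math>J(\theta)</math> simply averages the discounted return over sampled episodes. In the sketch below the per-episode rewards are random placeholders standing in for actual rollouts of <math>\pi_\theta</math>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.99

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t for one episode
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Fabricated episodes of varying length; in practice these come from rollouts.
episodes = [rng.normal(size=int(rng.integers(5, 15))) for _ in range(1000)]
J_hat = np.mean([discounted_return(r, gamma) for r in episodes])
print("estimated J(theta):", J_hat)
</syntaxhighlight>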
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(n) learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the TD(n) learning terms.
=== Critic ===
In the unbiased estimators given above, the functions <math>V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}</math> appear; these are unknown and must themselves be estimated, which is the role of the '''critic'''.
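How the critic is fitted is not specified here; one common choice (an assumption in the sketch below, not prescribed by this section) is temporal-difference learning, which moves <math>V(s)</math> toward the bootstrapped target <math>r + \gamma V(s')</math> after each observed transition:

<syntaxhighlight lang="python">
import numpy as np

def td0_update(V, s, r, s_next, done, gamma=0.99, alpha=0.1):
    # Tabular TD(0): nudge V(s) toward the one-step bootstrapped target.
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = np.zeros(5)                                      # value table for a 5-state toy problem
V = td0_update(V, s=0, r=1.0, s_next=1, done=False)  # one transition (s=0 -> s'=1, r=1)
print(V)                                             # only V[0] has moved
</syntaxhighlight>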
== Variants ==