Policy gradient method: Difference between revisions

 
=== Actor-critic methods ===
If <math display="inline">b_t</math> is chosen well, such that <math display="inline">b_t(S_j) \approx \mathbb{E}\left[\sum_{i \in j:T} (\gamma^i R_i) \Big| S_j\right] = \gamma^j V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_t}}(S_j)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T} \nabla_\theta\ln\pi_\theta(A_j| S_j)\left(\sum_{i \in j:T} (\gamma^i R_i) - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}(S_j)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
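A minimal sketch of this estimator, assuming a tabular softmax policy (the actor) and a state-value table <code>v</code> (the critic) used as the baseline; the function name and tabular setting are illustrative, not from the sources above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_estimate(theta, v, states, actions, rewards, gamma=0.99):
    """One sample of the baseline-corrected policy gradient.

    theta: (n_states, n_actions) policy logits (the actor)
    v:     (n_states,) learned value estimates (the critic / baseline)
    Returns the gradient estimate and the discounted returns-to-go,
    which can serve as regression targets when updating the critic.
    """
    T = len(rewards)
    # returns-to-go: G_j = sum_{i >= j} gamma^(i-j) R_i, so that the
    # article's term  sum_{i in j:T} gamma^i R_i - gamma^j V(S_j)
    # equals gamma^j * (G_j - V(S_j))
    returns = np.zeros(T)
    g = 0.0
    for j in reversed(range(T)):
        g = rewards[j] + gamma * g
        returns[j] = g
    grad = np.zeros_like(theta)
    for j, (s, a) in enumerate(zip(states, actions)):
        pi = softmax(theta[s])
        glog = -pi          # gradient of log softmax w.r.t. the logits...
        glog[a] += 1.0      # ...for the action actually taken
        # subtracting the baseline v[s] reduces variance without adding bias
        grad[s] += gamma**j * glog * (returns[j] - v[s])
    return grad, returns
```

In practice the critic is then updated toward the returned targets (or TD targets) after each policy update, since the value function tracks the changing policy.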
 
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot A^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J(\theta)</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_j</math> is any linear sum of the following terms:
 
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>: never used.
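All members of this family share the same score-function form, differing only in the choice of <math display="inline">\Psi_j</math>. A minimal sketch, assuming a tabular softmax policy (the function name and setting are illustrative):

```python
import numpy as np

def psi_gradient(theta, states, actions, psi):
    """Generic score-function estimator:
        sum_j grad log pi(A_j | S_j) * Psi_j
    for a tabular softmax policy with logits theta (n_states, n_actions).
    Any of the Psi_j choices listed above (total reward, reward-to-go,
    advantage, ...) can be plugged in without changing this code.
    """
    grad = np.zeros_like(theta)
    for s, a, w in zip(states, actions, psi):
        p = np.exp(theta[s] - theta[s].max())
        p /= p.sum()
        g = -p              # gradient of log softmax w.r.t. the logits...
        g[a] += 1.0         # ...for the chosen action
        grad[s] += w * g
    return grad
```

The unbiasedness of baseline subtraction follows because the score function averages to zero under the policy: summing <code>pi(a) * psi_gradient(theta, [s], [a], [c])</code> over all actions gives exactly zero for any constant <code>c</code>.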
=== Other methods ===
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
 
== Natural policy gradient ==
The natural policy gradient method is a variant of the policy gradient method, proposed by ([[Sham Kakade|Kakade]], 2001).<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> The key idea is that the standard policy gradient methods, given above, involve optimizing <math>J(\theta)</math> by taking its gradient <math>\nabla_\theta J(\theta)</math>. However, this gradient depends on the particular choice of the coordinate <math>\theta</math>. So, for example, if we were to rescale the coordinates to <math>\theta' = 2\theta </math>, then we would obtain a new policy gradient <math>\nabla_{\theta'} J(\theta') = \frac 12 \nabla_\theta J(\theta) </math>.
 
Thus, the policy gradient method is "unnatural" in the geometric sense, since its updates depend on the choice of coordinates. A "natural" policy gradient would change it so that the policy updates are [[coordinate-free]].
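The coordinate dependence can be checked numerically; the sketch below uses a toy smooth objective standing in for <math>J(\theta)</math> (the objective and helper are illustrative, not part of the method):

```python
import numpy as np

def grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# An arbitrary smooth objective standing in for J(theta).
J = lambda th: np.sin(th[0]) + th[0] * th[1]

theta = np.array([0.4, -1.3])
g_theta = grad(J, theta)

# Reparametrize: theta' = 2 * theta, so J'(theta') = J(theta' / 2).
J2 = lambda tp: J(tp / 2.0)
g_theta2 = grad(J2, 2 * theta)

# By the chain rule, the gradient in the new coordinates is halved,
# so a plain gradient step depends on which parametrization we chose.
assert np.allclose(g_theta2, 0.5 * g_theta, atol=1e-5)
```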
 
== See also ==