Policy gradient method
 
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_t</math> is chosen well, such that <math display="inline">b_t(S_j) \approx \mathbb{E}\left[\sum_{i \in j:T} \gamma^i R_i \,\Big|\, S_j\right] = \gamma^j V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_t}}(S_j)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T} \nabla_\theta\ln\pi_\theta(A_j| S_j)\left(\sum_{i \in j:T} (\gamma^i R_i) - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}(S_j)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
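
In code, the weight attached to each <math display="inline">\nabla_\theta\ln\pi_\theta(A_j| S_j)</math> term could be computed as follows (a minimal numpy sketch; the critic's value estimates are taken as given, and the function name is illustrative):

```python
import numpy as np

def reinforce_with_baseline_weights(rewards, values, gamma=0.99):
    """Per-step weights (sum_{i>=j} gamma^i R_i) - gamma^j V(S_j) that
    multiply grad log pi(A_j|S_j) in the baseline-corrected estimate."""
    rewards = np.asarray(rewards, float)
    values = np.asarray(values, float)          # critic's estimates V(S_j)
    disc = gamma ** np.arange(len(rewards))     # gamma^j for each step j
    # reverse cumulative sum gives sum_{i>=j} gamma^i R_i for every j
    tail = np.cumsum((disc * rewards)[::-1])[::-1]
    return tail - disc * values
```

A separate regression step, e.g. minimizing the squared error between the critic's outputs and the observed discounted returns, would keep the critic tracking the current policy as it updates.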
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimation)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is an exponentially weighted average of the n-step TD estimates, with weights <math display="inline">(1-\lambda)\lambda^{n-1}</math> summing to one.
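
Under the definitions above, the GAE can be computed by the standard backward recursion over TD residuals <math display="inline">\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)</math> (a sketch; <code>last_value</code> is assumed to bootstrap the value of the state after the truncation point):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimate A_t = sum_l (gamma*lam)^l delta_{t+l},
    computed backward from the end of the trajectory."""
    rewards = np.asarray(rewards, float)
    values = np.asarray(values, float)
    v_next = np.append(values[1:], last_value)      # V(S_{t+1}) per step
    deltas = rewards + gamma * v_next - values      # TD residuals delta_t
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):          # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting <code>lam=0</code> recovers the one-step TD residual, while <code>lam=1</code> recovers the Monte Carlo advantage, matching the limits of the list above.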
 
=== Other methods ===
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
 
== Natural policy gradient ==
{{Anchor|NPG}}
 
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
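
For example, one common approximation solves <math display="inline">F(\theta)x = \nabla_\theta J(\theta)</math> with the conjugate gradient method, which needs only Fisher-vector products rather than the explicit inverse (a sketch with a toy 2×2 Fisher matrix standing in for the real one):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=50, tol=1e-10):
    """Solve F x = g given only a Fisher-vector product fvp(v) = F @ v,
    avoiding explicit inversion of F."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x starts at 0)
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

F = np.array([[2.0, 0.3], [0.3, 1.0]])   # toy Fisher matrix (SPD)
g = np.array([1.0, 1.0])                 # vanilla policy gradient
x = conjugate_gradient(lambda v: F @ v, g)   # natural-gradient direction
```

In practice the Fisher-vector product is formed from samples without ever materializing <math>F(\theta)</math>, which is what makes this tractable for neural network policies.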
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a trust region constraint on policy updates.<ref name=":3" /> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.
 
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.
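
The backtracking line search can be sketched as follows (a toy illustration with a softmax policy over three actions; the surrogate objective and proposed step are placeholders, not TRPO's full estimation machinery):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def trpo_line_search(theta, full_step, surrogate, max_kl=0.01, backtracks=10):
    """Shrink the proposed (natural-gradient) step until the KL constraint
    holds and the surrogate objective actually improves."""
    old_p = softmax(theta)
    old_obj = surrogate(theta)
    step = full_step.copy()
    for _ in range(backtracks):
        new_theta = theta + step
        if kl(old_p, softmax(new_theta)) <= max_kl and surrogate(new_theta) > old_obj:
            return new_theta     # accepted: inside trust region, improved
        step *= 0.5              # halve the step and retry
    return theta                 # no acceptable step found
```

The halving loop is what mitigates the Taylor-approximation error: the accepted update is guaranteed to satisfy the KL constraint exactly, not merely its quadratic approximation.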
 
== Proximal Policy Optimization (PPO) ==
{{Anchor|PPO}}
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
 
 
Specifically, instead of maximizing the surrogate advantage<math display="block">
\mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a)\right]</math>PPO maximizes a clipped version of it,<math display="block">\mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\min\left(\rho\, A^{\pi_{\theta_t}}(s, a),\ \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon)\, A^{\pi_{\theta_t}}(s, a)\right)\right]</math>where <math display="inline">\rho = \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}</math> is the probability ratio and <math display="inline">\epsilon</math> is a small clipping hyperparameter.
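
The clipped surrogate <math display="inline">\min(\rho A, \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon) A)</math> could be evaluated as follows (a sketch; <code>ratio</code> holds the probability ratios <math display="inline">\rho</math> and <code>adv</code> the advantage estimates for a batch):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Mean of min(rho*A, clip(rho, 1-eps, 1+eps)*A) over a batch."""
    ratio = np.asarray(ratio, float)
    adv = np.asarray(adv, float)
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # min is pessimistic: it removes the incentive to push rho outside
    # the [1-eps, 1+eps] band in whichever direction increases the objective
    return float(np.mean(np.minimum(ratio * adv, clipped)))
```

Because the objective is flat in <math display="inline">\rho</math> outside the clipping band, gradient ascent on it never rewards moving the new policy far from the old one, which is the first-order substitute for TRPO's explicit KL constraint.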
 
=== Group Relative Policy Optimization (GRPO) ===
{{Anchor|GRPO}}
 
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i</math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G}</math> from the policy <math>\pi_{\theta_t}</math>, then calculates the group-relative advantage<ref>{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma</math> are the mean and standard deviation of the group's rewards <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G})</math>. That is, it is the [[standard score]] of each reward within its group.
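
A sketch of the group-relative advantage computation (taking the <math>G</math> scalar rewards for one state as input, and assuming they are not all identical, since the standard deviation appears in the denominator):

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard score of each sampled action's reward within its group
    of G samples drawn for the same state/prompt."""
    r = np.asarray(rewards, float)
    return (r - r.mean()) / r.std()   # degenerate if all rewards are equal
```

Each group's advantages therefore have mean 0 and standard deviation 1 by construction, replacing the learned critic's baseline with a purely empirical one.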