=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_t</math> is chosen well, such that <math display="inline">b_t(S_j) \approx \mathbb{E}_{\pi_{\theta_t}}\left[\sum_{i \in j:T} \gamma^i R_i \,\Big|\, S_j\right] = \gamma^j V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease the variance of the gradient estimate. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_t}}(S_j)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T} \nabla_\theta\ln\pi_\theta(A_j| S_j)\left(\sum_{i \in j:T} (\gamma^i R_i) - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}(S_j)</math> updates as well, so the baseline should also be updated. A common approach is to train a separate function that estimates the value function and use it as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
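The following is an illustrative sketch of such an actor–critic update in PyTorch. The linear networks, learning rates, and single-episode update are simplifying assumptions for exposition, and the global <math>\gamma^j</math> weighting of the objective above is dropped, as is common in practice.

<syntaxhighlight lang="python">
import torch

# Illustrative actor-critic setup (shapes and hyperparameters are assumptions).
policy = torch.nn.Linear(4, 2)    # actor: state -> action logits
value = torch.nn.Linear(4, 1)     # critic: state -> estimated V(s), used as the baseline
opt_actor = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(value.parameters(), lr=1e-3)
gamma = 0.99

def update(states, actions, rewards):
    """One REINFORCE-with-baseline update from a single episode.
    states: (T, 4) float tensor, actions: (T,) long tensor, rewards: (T,) float tensor."""
    T = rewards.shape[0]
    # Discounted returns-to-go: G_j = sum_{i >= j} gamma^{i-j} R_i
    returns = torch.zeros(T)
    running = 0.0
    for j in reversed(range(T)):
        running = rewards[j] + gamma * running
        returns[j] = running

    baselines = value(states).squeeze(-1)   # critic output b(S_j) ≈ V(S_j)
    logp = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)

    # Policy gradient with the (detached) critic as baseline.
    actor_loss = -(logp * (returns - baselines.detach())).sum()
    # Regress the critic toward the observed returns to keep the baseline up to date.
    critic_loss = ((baselines - returns) ** 2).mean()

    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
</syntaxhighlight>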
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the n-step TD learning ones.
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
== Natural policy gradient ==
{{Anchor|NPG}}
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
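A common workaround is to avoid forming or inverting <math>F(\theta)</math> altogether and instead approximately solve <math>F(\theta)\,x = \nabla_\theta J(\theta)</math> with the conjugate gradient method, which only requires Fisher–vector products. The following sketch is illustrative; the function names and iteration budget are assumptions rather than part of any particular library.

<syntaxhighlight lang="python">
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a function computing F v,
    so that F is never built or inverted explicitly."""
    x = np.zeros_like(g)
    r = g.copy()           # residual g - F x (x = 0 initially)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x   # approximates F^{-1} g, the natural gradient direction
</syntaxhighlight>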
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.
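As an illustration of the line-search step, the sketch below shrinks a proposed natural-gradient step until the KL constraint is satisfied and the surrogate objective improves. The interfaces (<code>surrogate</code>, <code>kl_divergence</code>) and constants are assumptions for exposition, not a reference implementation of TRPO.

<syntaxhighlight lang="python">
import numpy as np

def trpo_line_search(theta, step, surrogate, kl_divergence,
                     max_kl=0.01, backtrack=0.8, max_iters=10):
    """Backtracking line search of the kind used in TRPO: shrink the proposed
    natural-gradient step until the KL constraint holds and the surrogate improves."""
    base = surrogate(theta)
    for i in range(max_iters):
        candidate = theta + (backtrack ** i) * step
        if kl_divergence(theta, candidate) <= max_kl and surrogate(candidate) > base:
            return candidate
    return theta   # reject the update if no acceptable step size is found
</syntaxhighlight>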
== Proximal Policy Optimization (PPO) ==
{{Anchor|PPO}}
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
Specifically, instead of maximizing the surrogate advantage<math display="block">
=== Group Relative Policy Optimization (GRPO) ===
{{Anchor|GRPO}}
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref>{{Citation |last1=Shao |first1=Zhihong |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K. |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G}) </math>. That is, it is the [[standard score]] of the rewards.
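A minimal sketch of this computation follows; the small constant added to the denominator to avoid division by zero is an implementation detail, not part of the formula above.

<syntaxhighlight lang="python">
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard score of the rewards of G sampled actions for the same state,
    used by GRPO in place of a learned value baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. group_relative_advantages([1.0, 0.0, 0.0, 1.0]) -> approximately [ 1., -1., -1.,  1.]
</syntaxhighlight>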