Policy gradient method

* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained as an exponentially decaying average of the n-step TD learning estimates.
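
The exponentially decaying average above is equivalent to the usual recursive form of GAE, which accumulates one-step TD errors backward along a trajectory. A minimal sketch (the function name and list-based inputs are illustrative, not from any particular library):

<syntaxhighlight lang="python">
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    `rewards[t]` is R_t and `values[t]` approximates V(S_t);
    `values` has one extra entry for the bootstrap value V(S_T).
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = R_t + gamma * V(S_{t+1}) - V(S_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially decaying sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
</syntaxhighlight>

Setting λ = 0 recovers the one-step TD error (low variance, high bias), while λ = 1 recovers the Monte Carlo advantage estimate (high variance, low bias).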
 
== Natural policy gradient ==
{{Anchor|GRPO}}
 
The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |url=https://arxiv.org/abs/2402.03300 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of the group's rewards <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G}) </math>. That is, the advantage is the [[standard score]] of each reward within its group.
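
The group-relative advantage is just a per-group standardization of rewards. A minimal sketch (the function name is illustrative; `rewards` holds the <math>G</math> reward values for one state's sampled actions):

<syntaxhighlight lang="python">
def group_relative_advantages(rewards):
    """Standard score of each reward within its group of G samples."""
    G = len(rewards)
    mu = sum(rewards) / G
    var = sum((r - mu) ** 2 for r in rewards) / G
    sigma = var ** 0.5
    # Guard against a zero-variance group (all sampled rewards equal):
    # every action is then equally good, so assign zero advantage.
    if sigma == 0.0:
        return [0.0] * G
    return [(r - mu) / sigma for r in rewards]
</syntaxhighlight>

Because the baseline is estimated empirically from the group rather than by a learned critic, no value network needs to be trained or stored, at the cost of sampling several actions per state.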
 
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">