Revision as of 09:49, 25 January 2025, by Cosmia Nebula (→Proximal Policy Optimization (PPO): reasoning). Revision as of 00:11, 27 January 2025, by Cosmia Nebula (→See also: GRPO).
Line 303: \theta_t </math> is necessary.

=== Group Relative Policy Optimization (GRPO) ===
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s</math>, it samples multiple actions <math>a_1, \dots, a_G</math> from the policy <math>\pi_{\theta_t}</math>, then calculates the group-relative advantage<ref>{{Citation |last=Shao |first=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |url=https://arxiv.org/abs/2402.03300 |publisher=arXiv |doi=10.48550/arXiv.2402.03300 |id=arXiv:2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_i) = \frac{r(s, a_i) - \mu}{\sigma}</math>where <math>\mu, \sigma</math> are the mean and standard deviation of the rewards <math>r(s, a_1), \dots, r(s, a_G)</math>. That is, the advantage of each action is the [[standard score]] of its reward within the group. Then, it maximizes the PPO objective, averaged over all actions:<math display="block">\max_\theta \frac{1}{G} \sum_{i=1}^G \mathbb{E}_{(s, a_1, \dots, a_G) \sim \pi_{\theta_t}}\left[ \begin{cases} \min \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) > 0 \\ \max \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) < 0 \end{cases} \right]</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed better than the other actions tried at that state, and less likely to respond with one that performed worse.

== See also ==
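As an illustration of the Group Relative Policy Optimization update described above, the following is a minimal sketch of the group-relative advantage and the clipped objective for a single state. The function names, the default clipping value of 0.2, the use of the population standard deviation, and the convention of returning zero advantages when all rewards in a group are equal are assumptions made for this example; they are not taken from the cited paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standard score of each reward within its group of G sampled actions.

    Assumption: population standard deviation. If every reward is equal
    (sigma == 0), all advantages are set to 0 to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """One summand of the clipped objective for a single sampled action.

    ratio is pi_theta(a|s) / pi_theta_t(a|s).
    """
    if advantage > 0:
        return min(ratio, 1 + clip_eps) * advantage
    # advantage <= 0; the term is zero when the advantage is exactly zero
    return max(ratio, 1 - clip_eps) * advantage


def grpo_objective(logp_new, logp_old, rewards, clip_eps=0.2):
    """Average clipped term over the G actions sampled at one state.

    logp_new[i] and logp_old[i] are log-probabilities of action i under
    the current policy and the sampling policy (hypothetical inputs).
    """
    advantages = group_relative_advantages(rewards)
    terms = [
        clipped_term(math.exp(ln - lo), adv, clip_eps)
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)
```

In practice the objective would be computed with an automatic differentiation library so that its gradient with respect to the policy parameters can be taken; this sketch evaluates the scalar value only.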

Policy gradient method: Difference between revisions