Content deleted Content added
Citation bot (talk | contribs) Add: arxiv, authors 1-1. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Cosmia Nebula | #UCB_webform |
Citation bot (talk | contribs) Removed URL that duplicated identifier. | Use this bot. Report bugs. | Suggested by Abductive | Category:Reinforcement learning | #UCB_Category 12/14 |
||
Line 156:
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}( S_{j+2}) - V^{\pi_\theta}( S_{j})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20
== Natural policy gradient ==
Line 311:
{{Anchor|GRPO}}
The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculate the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
|