=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_t</math> is chosen well, such that <math display="inline">b_t(S_j) \approx \mathbb{E}_{\pi_{\theta_t}}\left[\sum_{i \in j:T} \gamma^i R_i \,\Big|\, S_j\right] = \gamma^j V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease the variance of the gradient estimate. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_t}}(S_j)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T} \nabla_\theta\ln\pi_\theta(A_j| S_j)\left(\sum_{i \in j:T} (\gamma^i R_i) - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}(S_j)</math> updates as well, so the baseline should also be updated. A common approach is to train a separate function that estimates the value function and use it as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
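The following is an illustrative sketch of such an actor–critic update in PyTorch. The linear networks, learning rates, and single-episode update are simplifying assumptions for exposition, and the global <math>\gamma^j</math> weighting of the objective above is dropped, as is common in practice.

<syntaxhighlight lang="python">
import torch

# Illustrative actor-critic setup (shapes and hyperparameters are assumptions).
policy = torch.nn.Linear(4, 2)    # actor: state -> action logits
value = torch.nn.Linear(4, 1)     # critic: state -> estimated V(s), used as the baseline
opt_actor = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(value.parameters(), lr=1e-3)
gamma = 0.99

def update(states, actions, rewards):
    """One REINFORCE-with-baseline update from a single episode.
    states: (T, 4) float tensor, actions: (T,) long tensor, rewards: (T,) float tensor."""
    T = rewards.shape[0]
    # Discounted returns-to-go: G_j = sum_{i >= j} gamma^{i-j} R_i
    returns = torch.zeros(T)
    running = 0.0
    for j in reversed(range(T)):
        running = rewards[j] + gamma * running
        returns[j] = running

    baselines = value(states).squeeze(-1)   # critic output b(S_j) ≈ V(S_j)
    logp = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)

    # Policy gradient with the (detached) critic as baseline.
    actor_loss = -(logp * (returns - baselines.detach())).sum()
    # Regress the critic toward the observed returns to keep the baseline up to date.
    critic_loss = ((baselines - returns) ** 2).mean()

    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
</syntaxhighlight>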
* <math display="inline">\gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^j \sum_{n=1}^\infty \frac{\lambda^{n-1}}{1-\lambda}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}( S_{j+n}) - V^{\pi_\theta}( S_{j})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last=Schulman |first=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |url=https://arxiv.org/abs/1506.02438 |doi=10.48550/arXiv.1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying sum of the n-step TD learning ones.
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
== Natural policy gradient ==
{{Anchor|NPG}}
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
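A common workaround is to avoid forming or inverting <math>F(\theta)</math> altogether and instead approximately solve <math>F(\theta)\,x = \nabla_\theta J(\theta)</math> with the conjugate gradient method, which only requires Fisher–vector products. The following sketch is illustrative; the function names and iteration budget are assumptions rather than part of any particular library.

<syntaxhighlight lang="python">
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a function computing F v,
    so that F is never built or inverted explicitly."""
    x = np.zeros_like(g)
    r = g.copy()           # residual g - F x (x = 0 initially)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x   # approximates F^{-1} g, the natural gradient direction
</syntaxhighlight>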
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.
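As an illustration of the line-search step, the sketch below shrinks a proposed natural-gradient step until the KL constraint is satisfied and the surrogate objective improves. The interfaces (<code>surrogate</code>, <code>kl_divergence</code>) and constants are assumptions for exposition, not a reference implementation of TRPO.

<syntaxhighlight lang="python">
import numpy as np

def trpo_line_search(theta, step, surrogate, kl_divergence,
                     max_kl=0.01, backtrack=0.8, max_iters=10):
    """Backtracking line search of the kind used in TRPO: shrink the proposed
    natural-gradient step until the KL constraint holds and the surrogate improves."""
    base = surrogate(theta)
    for i in range(max_iters):
        candidate = theta + (backtrack ** i) * step
        if kl_divergence(theta, candidate) <= max_kl and surrogate(candidate) > base:
            return candidate
    return theta   # reject the update if no acceptable step size is found
</syntaxhighlight>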
== Proximal Policy Optimization (PPO) ==
{{Anchor|PPO}}
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
Specifically, instead of maximizing the surrogate advantage<math display="block">
=== Group Relative Policy Optimization (GRPO) ===
{{Anchor|GRPO}}
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculates the group-relative advantage<ref>{{Citation |last1=Shao |first1=Zhihong |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K. |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s_i, a_{i,1}), \dots, r(s_i, a_{i,G}) </math>. That is, it is the [[standard score]] of the rewards.
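A minimal sketch of this computation follows; the small constant added to the denominator to avoid division by zero is an implementation detail, not part of the formula above.

<syntaxhighlight lang="python">
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard score of the rewards of G sampled actions for the same state,
    used by GRPO in place of a learned value baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. group_relative_advantages([1.0, 0.0, 0.0, 1.0]) -> approximately [ 1., -1., -1.,  1.]
</syntaxhighlight>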