Content deleted Content added
→See also: GRPO |
I fixed the formula so it would express "the total reward from time $ |
||
(33 intermediate revisions by 8 users not shown) | |||
Line 6:
== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>
\gamma
</math> is the [[discount factor]], <math>
Line 31 ⟶ 29:
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{
\Big|S_0 = s_0
\right]</math> which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{
\Big|S_0 = s_0 \right]
</math>
Line 49 ⟶ 47:
}}
{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=
{{Math proof|title=Proof of the lemma|proof=
Use the [[reparameterization trick#REINFORCE estimator|reparameterization trick]].
Line 84 ⟶ 82:
\end{aligned}
</math>
}}
{{Math proof|title=Proof of the two identities|proof=▼
▲{{Math proof|title=Proof|proof=
Applying the [[reparameterization trick#REINFORCE estimator|reparameterization trick]],
Line 106 ⟶ 102:
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{
</math>where the index <math>
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_t
=== Algorithm ===
Line 115 ⟶ 111:
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_t}</math> as the policy function.
# Compute the policy gradient estimation: <math>
# Update the policy by gradient ascent: <math>\theta_{
Here, <math>\
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE
=== REINFORCE with baseline ===
A common way for reducing variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
The algorithm uses the modified gradient estimator<math display="block">
\frac 1N \sum_{
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq
\cdot Q^{\pi_\theta}(
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq
\cdot A^{\pi_\theta}(
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J_\theta</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq
\cdot \
\Big|S_0 = s_0 \right]</math> where <math display="inline">\
* <math display="inline">\sum_{0 \leq
* <math display="inline">\gamma^
* <math display="inline">\gamma^
* <math display="inline">\gamma^
* <math display="inline">\gamma^
* <math display="inline">\gamma^
Some more possible <math display="inline">\
* <math display="inline">\gamma^
* <math display="inline">\gamma^
* <math display="inline">\gamma^
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>▼
== Natural policy gradient ==
{{Anchor|NPG}}
The natural policy gradient method is a variant of the policy gradient method, proposed by [[Sham Kakade]] in 2001.<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> Unlike standard policy gradient methods, which depend on the choice of parameters <math>\theta</math> (making updates coordinate-dependent), the natural policy gradient aims to provide a [[coordinate-free]] update, which is geometrically "natural".
=== Motivation ===
Standard policy gradient updates <math>\theta_{
\begin{cases}
\max_{\theta_{
\|\theta_{
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{
\max_{\theta_{
\bar{D}_{KL}(\pi_{\theta_{
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the problem into a problem in [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
▲
The natural gradient descent is theoretically optimal, ''if'' the objective is truly a quadratic function, but this is only an approximation. TRPO's line search and KL constraint attempts to restrict the solution to within a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice.
=== Formulation ===
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{
\end{cases}
</math>where
* <math>L(\
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\
The surrogate advantage <math>L(\
</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, '''<math>
\nabla_\theta L(\
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_i</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
\begin{aligned}
L(\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{
</math>So far, this is essentially the same as natural gradient method. However, TRPO improves upon it by two modifications:
Line 225 ⟶ 224:
x
</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement.
\theta_{
</math> until
L(
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
== Proximal Policy Optimization (PPO) ==
{{Anchor|PPO}}
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />▼
▲A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\
</math> under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
\min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0
\\
\max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a)
\end{cases}
\right]
</math> and PPO maximizes the surrogate advantage by stochastic gradient descent, as usual.
In words, gradient-ascending the new surrogate advantage function means that, at some state <math>
Line 303 ⟶ 304:
\theta_t
</math> is necessary.
If there is a reference policy <math>
\pi_{\text{ref}}
</math> that the trained policy should not diverge too far from, then additional KL divergence penalty can be added:<math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right) \right]</math>where <math>
\beta
</math> adjusts the strength of the penalty. This has been used in training [[Reasoning language model|reasoning language models]] with [[reinforcement learning from human feedback]].<ref name="summarizationpaper">{{cite journal |author=Nisan Stiennon |author2=Long Ouyang |author3=Jeffrey Wu |author4=Daniel Ziegler |author5=Ryan Lowe |author6=Chelsea Voss |author7=Alec Radford |author8=Dario Amodei |author9=Paul F. Christiano |date=2020 |title=Learning to summarize with human feedback |url=https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html |journal=Advances in Neural Information Processing Systems |language=en |volume=33}}</ref> The KL divergence penalty term can be estimated with lower variance using the equivalent form (see [[f-divergence]] for details):<ref name=":1" /><math display="block">-\beta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\log\left(\frac{\pi_{\theta}(a|s)}{\pi_{\text{ref}}(a|s)}\right)
+ \frac{\pi_{\text{ref}}(a|s)} {\pi_{\theta}(a|s)}
-1
\right]</math>
=== Group Relative Policy Optimization (GRPO) ===
{{Anchor|GRPO}}
The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s_i </math>, it samples multiple actions <math>a_{i,1}, \dots, a_{i,G} </math> from the policy <math>\pi_{\theta_t} </math>, then calculate the group-relative advantage<ref>{{Citation |last=Shao |first=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |url=https://arxiv.org/abs/2402.03300 |publisher=arXiv |doi=10.48550/arXiv.2402.03300 |id=arXiv:2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s_{i}, a_{i,j}) = \frac{r(s_i, a_{i,j}) - \mu}{\sigma} </math>where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, it is the [[standard score]] of the rewards.▼
▲The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
Line 312 ⟶ 325:
\min \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i) > 0
\\
\max \left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_t}(a_i|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a_i) & \text{ if } A^{\pi_{\theta_t}}(s, a_i)
\end{cases}
\right]
</math>Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy. GRPO was first proposed in the context of training [[reasoning language model|reasoning language models]] by researchers at [[DeepSeek]].<ref name=":1" />
== See also ==
Line 334 ⟶ 349:
* {{Cite web |last=Weng |first=Lilian |date=2018-04-08 |title=Policy Gradient Algorithms |url=https://lilianweng.github.io/posts/2018-04-08-policy-gradient/ |access-date=2025-01-25 |website=lilianweng.github.io |language=en}}
* {{Cite web |title=Vanilla Policy Gradient — Spinning Up documentation |url=https://spinningup.openai.com/en/latest/algorithms/vpg.html |access-date=2025-01-25 |website=spinningup.openai.com}}
{{Artificial intelligence navbox}}
[[Category:Reinforcement learning]]
[[Category:Machine learning algorithms]]
|