== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot \mid s)</math> over actions.
If the action space is discrete, then <math>\sum_{a \in \text{Actions}} \pi_\theta(a \mid s) = 1</math>. If the action space is continuous, then <math>\int_{a \in \text{Actions}} \pi_\theta(a \mid s) \,\mathrm{d}a = 1</math>.
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>\gamma</math> is the [[discount factor]], <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is the starting state, and <math>T</math> is the time horizon (which can be infinite).
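For a concrete reading of this objective, the inner discounted sum can be evaluated directly from a list of per-step rewards. The following NumPy sketch is illustrative only and not part of the formal definition:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Total discounted reward sum_t gamma^t * R_t for one episode."""
    weights = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.dot(weights, rewards))
```

For example, a reward of 1 at each of three steps with <math>\gamma = 0.5</math> gives <math>1 + 0.5 + 0.25 = 1.75</math>.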
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \sum_{0 \leq i \leq T} \gamma^i R_i
\Big|S_0 = s_0
\right]</math> which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \sum_{j \leq i \leq T} \gamma^i R_i
\Big|S_0 = s_0 \right]
</math>
}}
{{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=Proofs}}
{{Math proof|title=Proof of the lemma|proof=
Use the [[reparameterization trick#REINFORCE estimator|reparameterization trick]].
\end{aligned}
</math>
}}
{{Math proof|title=Proof of the two identities|proof=
Applying the [[reparameterization trick#REINFORCE estimator|reparameterization trick]],
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_{j,n} \mid S_{j,n}) \sum_{j \leq i \leq T} \gamma^i R_{i,n} \right]
</math>where the index <math>n</math> ranges over the <math>N</math> rollout trajectories sampled from <math>\pi_\theta</math>.
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_j \mid S_j)</math> is the direction in parameter space that increases the probability of taking action <math>A_j</math> in state <math>S_j</math>; the estimator weights each such direction by the discounted reward that followed the action.
=== Algorithm ===
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_t}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_t \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_{\theta_t}(A_{j,n} \mid S_{j,n}) \sum_{j \leq i \leq T} \gamma^i R_{i,n} \right]</math>
# Update the policy by gradient ascent: <math>\theta_{t+1} \leftarrow \theta_t + \alpha_t g_t</math>
Here, <math>\alpha_t</math> is the learning rate at update step <math>t</math>.
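The gradient-estimation step above can be sketched as follows. This is an illustrative NumPy fragment: the score function is supplied through a hypothetical caller-provided callback `grad_log_pi(s, a)` rather than a real policy network.

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, gamma):
    """Monte Carlo policy gradient estimate with the causality trick.

    trajectories: list of rollouts, each a list of (state, action, reward).
    grad_log_pi(s, a): hypothetical callback returning the score
                       d/dtheta ln pi_theta(a|s) for the current policy.
    """
    grads = []
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        g = 0.0
        for j, (s, a, _) in enumerate(traj):
            # reward-to-go: sum over i >= j of gamma^i * R_i
            rtg = sum(gamma ** i * rewards[i] for i in range(j, len(rewards)))
            g = g + grad_log_pi(s, a) * rtg
        grads.append(g)
    # average over the N sampled trajectories
    return sum(grads) / len(grads)
```

With `grad_log_pi` returning vectors (one entry per parameter), the same code produces the full gradient estimate.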
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE have been introduced under the title of '''variance reduction''' to address this issue.
=== REINFORCE with baseline ===
A common way for reducing variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \left(\sum_{j \leq i \leq T} \gamma^i R_i - \gamma^j b(S_j)\right)
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
The algorithm uses the modified gradient estimator<math display="block">
\frac 1N \sum_{n=1}^N \left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_{\theta_t}(A_{j,n} \mid S_{j,n}) \left(\sum_{j \leq i \leq T} \gamma^i R_{i,n} - \gamma^j b(S_{j,n})\right)\right]
</math>and the original REINFORCE algorithm is the special case where <math>b \equiv 0</math>.
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b</math> is chosen well, such that <math>b(S_j) \approx V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease variance in the gradient estimation. Using the value function as the baseline gives<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j) \left(\sum_{j \leq i \leq T} \gamma^i R_i - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}</math> updates as well, so the critic must be trained concurrently with the actor. This gives rise to the family of '''actor-critic methods''', in which the actor is the policy and the critic is an estimate of the value function.
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \gamma^j Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \gamma^j A^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J_\theta</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j \mid S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_j</math> is any linear sum of the following terms:
* <math display="inline">\sum_{0 \leq i \leq T} \gamma^i R_i</math>: the total reward of the trajectory, as in the original REINFORCE.
* <math display="inline">\gamma^j \sum_{j \leq i \leq T} \gamma^{i-j} R_i</math>: the reward following action <math>A_j</math> (the causality trick).
* <math display="inline">\gamma^j \left(\sum_{j \leq i \leq T} \gamma^{i-j} R_i - b(S_j)\right)</math>: the reward following the action, minus a baseline.
* <math display="inline">\gamma^j Q^{\pi_\theta}(S_j, A_j)</math>: the Q-function as critic.
* <math display="inline">\gamma^j A^{\pi_\theta}(S_j, A_j)</math>: the advantage function as critic.
* <math display="inline">\gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)\right)</math>: the one-step [[temporal difference learning|TD residual]].
Some more possible <math display="inline">\Psi_j</math> are as follows:
* <math display="inline">\gamma^j \left(R_j + \gamma R_{j+1} + \cdots + \gamma^{n-1} R_{j+n-1} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j)\right)</math>: the <math>n</math>-step TD residual.
* <math display="inline">\gamma^j \left(\sum_{j \leq i \leq T} \gamma^{i-j} R_i - V^{\pi_\theta}(S_j)\right)</math>: the Monte Carlo advantage estimate.
* <math display="inline">\gamma^j \sum_{n=0}^{\infty} (\gamma\lambda)^n \delta_{j+n}</math> with <math>\delta_i = R_i + \gamma V^{\pi_\theta}(S_{i+1}) - V^{\pi_\theta}(S_i)</math>: the Generalized Advantage Estimate, which interpolates between the one-step TD residual and the Monte Carlo estimate via <math>\lambda \in [0,1]</math>.
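One widely used estimator in this family, the generalized advantage estimator of Schulman et al., accumulates discounted one-step TD residuals. A minimal sketch, assuming a `values` array that includes a terminal value (taken to be 0 at episode end):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for one episode.

    rewards: R_0 .. R_{T-1}; values: V(S_0) .. V(S_T) (length T+1,
    with the terminal entry assumed 0 for a finished episode).
    """
    T = len(rewards)
    # one-step TD residuals: delta_t = R_t + gamma*V(S_{t+1}) - V(S_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        # recursive form: A_t = delta_t + gamma*lambda*A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv
```

With <math>\lambda = 0</math> this reduces to the one-step TD residual; with <math>\lambda = 1</math> it recovers the Monte Carlo advantage estimate.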
== Natural policy gradient ==
=== Motivation ===
Standard policy gradient updates <math>\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t) \\
\|\theta_{t+1} - \theta_t\| \leq \epsilon
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| \leq \epsilon</math> depends on how the policy is parameterized: the same constraint corresponds to different sets of allowed policies under different parameterizations. The natural policy gradient instead constrains the change in the policies themselves:<math display="block">\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t) \\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_t}</math>:<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) = \mathbb{E}_{s \sim \pi_{\theta_t}}\left[D_{KL}\big(\pi_{\theta_{t+1}}(\cdot \mid s) \,\|\, \pi_{\theta_t}(\cdot \mid s)\big)\right]</math>
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the problem into a problem in [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{t+1} = \theta_t + \alpha F(\theta_t)^{-1} \nabla_\theta J(\theta_t)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
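One standard approximation is to avoid forming or inverting <math>F</math> at all, and instead solve <math>Fx = g</math> with the [[conjugate gradient method]], which needs only Fisher-vector products. An illustrative NumPy sketch:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F @ v,
    avoiding explicit storage or inversion of the Fisher matrix F."""
    x = np.zeros_like(g)
    r = g.copy()      # residual g - F x (x starts at 0)
    p = r.copy()      # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

In a neural-network setting, `fvp` would be implemented with automatic differentiation (a Hessian-vector product of the averaged KL), so the full matrix <math>F</math> is never materialized.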
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO stabilizes policy updates by explicitly enforcing the KL divergence constraint between successive policies, rather than relying on a fixed step size.
=== Formulation ===
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t) \\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_t}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a)\right]</math> is the '''surrogate advantage''', measuring the performance of the new policy on states and actions sampled from the old one;
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} \Psi^{\pi_{\theta_t}}(s, a)\right]</math>where <math>\Psi</math> is any linear sum of the previously mentioned terms.
The surrogate advantage <math>L(\theta, \theta_t)</math> is designed to align with the policy gradient. Specifically, when <math>\theta = \theta_t</math>, '''the gradient of the surrogate advantage <math display="inline">
\nabla_\theta L(\theta, \theta_t)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, the equality no longer holds in general, and the surrogate advantage is only a local approximation of the true objective.
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t) \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t)
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_t)\big|_{\theta = \theta_t}</math> is the policy gradient;
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t})\big|_{\theta = \theta_t}</math> is the Fisher information matrix.
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}}\, F^{-1} g
</math>So far, this is essentially the same as natural gradient method. However, TRPO improves upon it by two modifications:
* Use the [[conjugate gradient method]] to solve for <math display="inline">x</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement, testing<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{x^T F x}}\, x
</math> for <math>j = 0, 1, 2, \dots</math> until both <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \leq \epsilon</math> and <math display="inline">
L(\theta_{t+1}, \theta_t) \geq 0
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
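The backtracking rule can be sketched as follows. This is illustrative: `kl_fn` and `surr_fn` stand for hypothetical evaluations of the averaged KL divergence and the surrogate advantage at a candidate parameter vector.

```python
def backtracking_step(theta, x, step0, kl_fn, surr_fn, eps, alpha=0.5, max_iters=10):
    """Shrink the step by alpha^j until the KL constraint and
    non-negative surrogate improvement both hold."""
    for j in range(max_iters):
        theta_new = theta + (alpha ** j) * step0 * x
        if kl_fn(theta_new) <= eps and surr_fn(theta_new) >= 0.0:
            return theta_new
    return theta  # no acceptable step found: keep the old parameters
```

Here `step0` plays the role of <math display="inline">\sqrt{2\epsilon / (x^T F x)}</math> and `x` the conjugate-gradient solution of <math>Fx = g</math>.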
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math> under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
\min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)},\, 1+\epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) > 0 \\
\max\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)},\, 1-\epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
</math> and PPO maximizes the surrogate advantage by stochastic gradient ascent, as usual.
In words, gradient-ascending the new surrogate advantage function means that, at some state <math>s</math>, if the advantage of an action is positive, the policy increases the probability of that action, but by no more than a factor of <math>1+\epsilon</math> over the old policy; if the advantage is negative, the policy decreases the probability, but by no more than a factor of <math>1-\epsilon</math>. This keeps the new policy close to the old one without an explicit KL constraint.
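The per-sample objective is straightforward to evaluate. The sketch below uses the equivalent form `min(r·A, clip(r, 1−ε, 1+ε)·A)`, which reproduces both cases at once:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO-clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the probability ratio pi_theta / pi_theta_t."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # min picks the clipped branch exactly when clipping binds:
    # A > 0: min(r, 1+eps)*A;  A < 0: max(r, 1-eps)*A
    return np.minimum(ratio * adv, clipped * adv)
```

For example, a ratio of 2 with advantage +1 and <math>\epsilon = 0.2</math> is clipped to 1.2, while a ratio of 0.5 with advantage −1 yields −0.8.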
{{Anchor|GRPO}}
The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator <math>V</math>. Instead, for each state <math>s </math>, it samples multiple actions <math>a_1, \dots, a_G </math> from the policy <math>\pi_{\theta_t} </math>, then calculate the group-relative advantage<ref name=":1">{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref><math display="block">A^{\pi_{\theta_t}}(s, a_{j}) = \frac{r(s, a_{j}) - \mu}{\sigma} </math> where <math>\mu, \sigma </math> are the mean and standard deviation of <math>r(s, a_1), \dots, r(s, a_G) </math>. That is, it is the [[standard score]] of the rewards.
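The group-relative advantage is simply the standard score of the sampled rewards within a group; for illustration:

```python
import numpy as np

def group_relative_advantage(rewards):
    """Standard score of each sampled action's reward within its group:
    (r_j - mean) / std, using the population standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()
```

Actions whose reward is above the group mean receive positive advantage, and those below receive negative advantage, without any learned value function.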
Then, it maximizes the PPO objective, averaged over all actions:<math display="block">
\max_\theta \frac{1}{G} \sum_{j=1}^G \mathbb{E}_{s,\, a_{1:G} \sim \pi_{\theta_t}}\left[ \min\left( \frac{\pi_\theta(a_j|s)}{\pi_{\theta_t}(a_j|s)} A^{\pi_{\theta_t}}(s, a_j),\ \operatorname{clip}\left(\frac{\pi_\theta(a_j|s)}{\pi_{\theta_t}(a_j|s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_t}}(s, a_j) \right) \right]
</math>