Policy gradient method

* <math>L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that, in general, other surrogate advantages are possible:<math display="block">L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}\Psi^{\pi_{\theta_t}}(s, a) \right]</math>where <math>\Psi</math> is any linear combination of the previously mentioned quantities. Indeed, OpenAI recommended using the generalized advantage estimate (GAE) instead of the plain advantage <math>A^{\pi_{\theta_t}}</math>.
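As a concrete illustration, the surrogate advantage can be estimated from a batch of sampled log-probabilities under the new and old policies, together with advantage estimates. The following NumPy sketch is illustrative; the function and variable names are not taken from any particular library:

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Monte Carlo estimate of L(theta_t, theta): the mean
    importance-weighted advantage over sampled (s, a) pairs."""
    ratios = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_{theta_t}(a|s)
    return float(np.mean(ratios * advantages))

# When theta = theta_t, every ratio is 1, so the estimate reduces
# to the mean of the advantage estimates.
logp_old = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 0.25])
L = surrogate_advantage(logp_old, logp_old, adv)  # mean(adv) = 0.25
```

The advantages passed in could be plain advantage estimates or GAE values; the estimator itself is the same either way.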
 
The surrogate advantage <math>L(\theta_t, \theta)</math> is then maximized under the trust-region constraint. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient used in the line search.
 
== Proximal Policy Optimization (PPO) ==
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
 
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math>under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
\min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0
\\
\max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
</math>and PPO maximizes this clipped surrogate objective by stochastic gradient ascent, as usual.
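The case-by-case objective above can equivalently be written as <math>\min\left(r A, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon) A\right)</math>, where <math>r</math> is the probability ratio. A minimal NumPy sketch of this form (names are illustrative, not tied to a specific library):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch PPO-Clip objective: mean of min(r*A, clip(r, 1-eps, 1+eps)*A).
    This equals the case form above: min(r, 1+eps)*A when A > 0 and
    max(r, 1-eps)*A when A < 0."""
    r = np.exp(logp_new - logp_old)  # probability ratios
    unclipped = r * advantages
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# A ratio of 2.0 with positive advantage is clipped at 1 + eps = 1.2.
L = ppo_clip_objective(np.log([0.5]), np.log([0.25]), np.array([1.0]))  # 1.2
```

In practice one performs stochastic gradient ascent on this objective, for example by minimizing its negation with an automatic-differentiation framework.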
 
In words, gradient-ascending the new surrogate objective means the following. Consider a sampled state–action pair <math>(s, a)</math>. If its advantage is positive, <math>A^{\pi_{\theta_t}}(s, a) > 0</math>, then the gradient pushes <math>\theta</math> in the direction that increases the probability of taking action <math>a</math> in state <math>s</math>. However, as soon as <math>\theta</math> has changed so much that <math>\pi_\theta(a | s) \geq (1 + \epsilon) \pi_{\theta_t}(a | s)</math>, the objective is clipped and the gradient stops pointing in that direction. Symmetrically, if the advantage is negative, the gradient decreases the probability of <math>a</math> only until <math>\pi_\theta(a | s) \leq (1 - \epsilon) \pi_{\theta_t}(a | s)</math>.
 
Thus, PPO avoids taking overly large parameter updates, keeping the new policy close to the old one.
 
== See also ==