Policy gradient method: Difference between revisions

unify notation
Line 195:
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}\Psi^{\pi_{\theta_t}}(s, a) \right]</math>where <math>\Psi</math> is any linear combination of the previously mentioned quantities. Indeed, OpenAI recommended using the generalized advantage estimator, instead of the plain advantage <math>A^{\pi_\theta}</math>.
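As a concrete illustration of the surrogate advantage, it can be estimated by importance-weighting samples drawn from the old policy. The sketch below uses a hypothetical toy setup (a single state, three actions, softmax policies, and hand-picked advantage values, none of which come from the article) and checks the Monte Carlo estimate against the exact sum over actions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_t = np.array([0.2, -0.1, 0.3])   # old policy parameters (toy values)
theta   = np.array([0.4,  0.0, 0.1])   # candidate new parameters (toy values)
adv     = np.array([1.0, -0.5, 0.2])   # A^{pi_theta_t}(s, a) for each action (toy values)

pi_old = softmax(theta_t)
pi_new = softmax(theta)

# Monte Carlo estimate of the surrogate advantage:
# L(theta, theta_t) = E_{a ~ pi_old}[ pi_new(a)/pi_old(a) * A(a) ]
actions = rng.choice(3, size=100_000, p=pi_old)
L_mc = np.mean(pi_new[actions] / pi_old[actions] * adv[actions])

# Exact value for this single-state toy: sum_a pi_new(a) * A(a)
L_exact = np.sum(pi_new * adv)
print(L_mc, L_exact)  # the two estimates agree closely
```

The importance ratio lets the expectation be taken under the old policy's samples while scoring the new policy, which is what makes the objective usable with off-policy data from <math>\pi_{\theta_t}</math>.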
 
The surrogate advantage <math>L(\theta, \theta_t)</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, <math>\nabla_\theta L(\theta, \theta_t)</math> equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
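This gradient identity can be checked numerically in the same hypothetical single-state softmax toy used above (the parameters and advantages are illustrative, not from the article): a finite-difference gradient of the surrogate at <math>\theta = \theta_t</math> matches the score-function form of the policy gradient.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_t = np.array([0.2, -0.1, 0.3])   # toy parameters
adv     = np.array([1.0, -0.5, 0.2])   # toy advantages for a single state

def L(theta):
    # Exact surrogate for this toy:
    # sum_a pi_old(a) * pi_theta(a)/pi_old(a) * A(a) = sum_a pi_theta(a) * A(a)
    return np.sum(softmax(theta) * adv)

# Central finite-difference gradient of L at theta = theta_t
eps = 1e-6
grad_L = np.array([
    (L(theta_t + eps * e) - L(theta_t - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

# Policy gradient: E_{a ~ pi}[ grad_theta log pi(a) * A(a) ];
# for softmax, grad_theta log pi(a) = e_a - pi.
pi = softmax(theta_t)
grad_J = np.array([pi[a] * (np.eye(3)[a] - pi) * adv[a] for a in range(3)]).sum(axis=0)

print(np.allclose(grad_L, grad_J, atol=1e-5))  # True
```

Away from <math>\theta = \theta_t</math> the two gradients diverge, which is why <math>L</math> is only a "surrogate" and why the update must stay inside a trust region.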
Line 211:
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_t) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t}</math> is the Fisher information matrix.
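With <math>g</math> and <math>F</math> in hand, the quadratic subproblem has the closed-form solution <math>\theta_{t+1} = \theta_t + \sqrt{2\epsilon / (x^T F x)}\, x</math> where <math>F x = g</math>. The sketch below solves it in the same hypothetical single-state softmax toy; the small damping term is a common practical addition (the softmax Fisher matrix is singular), and all numeric values are illustrative:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_t = np.array([0.2, -0.1, 0.3])   # toy parameters
adv     = np.array([1.0, -0.5, 0.2])   # toy advantages
eps_tr  = 0.01                         # trust region radius epsilon (hypothetical)
damping = 1e-3                         # keeps F invertible; softmax Fisher is singular

pi = softmax(theta_t)

# Policy gradient for a single-state softmax policy:
# g_k = sum_a pi(a) (delta_ak - pi_k) A(a) = pi_k (A_k - E[A])
g = pi * (adv - pi @ adv)

# Fisher information matrix for softmax: F = diag(pi) - pi pi^T (plus damping)
F = np.diag(pi) - np.outer(pi, pi) + damping * np.eye(3)

# Solve F x = g (exactly here; large models use conjugate gradient instead)
x = np.linalg.solve(F, g)

# Step to the trust-region boundary
theta_next = theta_t + np.sqrt(2 * eps_tr / (x @ F @ x)) * x

# Under the quadratic approximation, KL(pi_old || pi_new) should be near eps_tr
pi_new = softmax(theta_next)
kl = np.sum(pi * np.log(pi / pi_new))
print(kl)
```

The step length is scaled so that the quadratic KL estimate sits exactly on the boundary <math>\epsilon</math>; the exact KL differs only by higher-order terms, which is what the subsequent line search corrects for.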
 
Line 230:
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}}\, x,\quad \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}}\, x,\quad \theta_t + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}}\, x,\quad \dots
</math>until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and improves the surrogate advantage: <math>
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
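The backtracking line search can be sketched as follows, again in the hypothetical single-state softmax toy (the value of <math>\alpha</math> and all other numbers are illustrative): the proposed step is repeatedly shrunk by <math>\alpha</math> until the exact KL constraint holds and the surrogate does not decrease.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_t = np.array([0.2, -0.1, 0.3])   # toy parameters
adv     = np.array([1.0, -0.5, 0.2])   # toy advantages
eps_tr  = 0.01                         # trust region radius (hypothetical)
alpha   = 0.8                          # backtracking coefficient (hypothetical)

pi_old = softmax(theta_t)

def surrogate(theta):
    # L(theta, theta_t) = sum_a pi_theta(a) A(a) for this single-state toy
    return np.sum(softmax(theta) * adv)

def kl(theta):
    return np.sum(pi_old * np.log(pi_old / softmax(theta)))

# Full natural-gradient step from the quadratic approximation
g = pi_old * (adv - pi_old @ adv)
F = np.diag(pi_old) - np.outer(pi_old, pi_old) + 1e-3 * np.eye(3)
x = np.linalg.solve(F, g)
full_step = np.sqrt(2 * eps_tr / (x @ F @ x)) * x

# Backtracking: shrink the step until both acceptance tests pass
L_old = surrogate(theta_t)
theta_next = theta_t
for j in range(20):
    candidate = theta_t + alpha**j * full_step
    if kl(candidate) <= eps_tr and surrogate(candidate) >= L_old:
        theta_next = candidate
        break

print(kl(theta_next) <= eps_tr, surrogate(theta_next) >= L_old)
```

Because the acceptance tests use the exact KL divergence and surrogate rather than their Taylor approximations, the accepted step is guaranteed to satisfy the original constrained problem even when the quadratic model is inaccurate.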
 
Line 239:
 
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math>under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[