unify notation
Line 195:
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t) \\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_t)</math> is the surrogate advantage.
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_t) = \dots</math>
The surrogate advantage <math>L(\theta, \theta_t)</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, the gradient '''<math>\nabla_\theta L(\theta, \theta_t)</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
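The gradient identity at <math>\theta = \theta_t</math> can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the text: a single-state problem with a softmax policy over three actions, made-up rewards, and the commonly used importance-weighted form <math>L(\theta, \theta_t) = \mathbb{E}_{a \sim \pi_{\theta_t}}\left[\tfrac{\pi_\theta(a)}{\pi_{\theta_t}(a)} A(a)\right]</math> for the surrogate. The helper names <code>surrogate</code>, <code>policy_gradient</code> and <code>numerical_grad</code> are hypothetical.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over actions, parameterized by logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def surrogate(theta, theta_t, adv):
    """Assumed surrogate: E_{a ~ pi_theta_t}[pi_theta(a) / pi_theta_t(a) * adv(a)]."""
    pi, pi_t = softmax(theta), softmax(theta_t)
    return np.sum(pi_t * (pi / pi_t) * adv)

def policy_gradient(theta, adv):
    """E_{a ~ pi_theta}[grad log pi_theta(a) * adv(a)] for a softmax policy."""
    pi = softmax(theta)
    grad_log = np.eye(len(theta)) - pi  # row a holds grad_theta log pi(a) = e_a - pi
    return (pi * adv) @ grad_log

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Made-up numbers, for illustration only.
rewards = np.array([1.0, 0.0, -0.5])
theta_t = np.array([0.2, -0.1, 0.4])
adv = rewards - softmax(theta_t) @ rewards  # advantage under pi_theta_t

grad_L = numerical_grad(lambda th: surrogate(th, theta_t, adv), theta_t)
grad_J = policy_gradient(theta_t, adv)
print(np.allclose(grad_L, grad_J, atol=1e-6))
```

Evaluating <code>grad_L</code> at any point other than <code>theta_t</code> generally gives a different vector from <code>policy_gradient</code> at that point, which is the sense in which the objective is only a surrogate.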
Line 211:
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_t) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t}</math> is the Fisher information matrix.
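As a rough numerical illustration of these expansions, the sketch below compares the exact surrogate and KL divergence with their first- and second-order approximations for a small parameter change. It assumes a single-state softmax policy with made-up rewards and the importance-weighted surrogate; for that policy the Fisher information matrix with respect to the logits is <math>\operatorname{diag}(\pi) - \pi \pi^T</math>. None of these specifics come from the text above.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over actions, parameterized by logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Made-up numbers, for illustration only.
theta_t = np.array([0.2, -0.1, 0.4])
rewards = np.array([1.0, 0.0, -0.5])
pi_t = softmax(theta_t)
adv = rewards - pi_t @ rewards  # advantage under pi_theta_t

# g: gradient of the surrogate at theta_t (equals the policy gradient there).
g = pi_t * adv - pi_t * (pi_t @ adv)
# F: Fisher information matrix of a softmax policy with respect to its logits.
F = np.diag(pi_t) - np.outer(pi_t, pi_t)

# A small update theta = theta_t + delta.
delta = 1e-3 * np.array([1.0, -2.0, 0.5])
pi = softmax(theta_t + delta)

L_exact = np.sum(pi_t * (pi / pi_t) * adv)   # assumed surrogate form
kl_exact = np.sum(pi * np.log(pi / pi_t))    # KL(pi_theta || pi_theta_t)

L_linear = g @ delta                         # first-order approximation
kl_quadratic = 0.5 * delta @ F @ delta       # second-order approximation

print(L_exact, L_linear)
print(kl_exact, kl_quadratic)
```

The two pairs of numbers agree closely for a small <code>delta</code> and drift apart as it grows, which is why the approximations are only trusted within the trust region.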
Line 230:
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math>until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and improves the surrogate advantage: <math>L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
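The backtracking search can be sketched as follows. This is a minimal illustration under several assumptions, not a reference implementation: a single-state softmax policy, made-up rewards, the importance-weighted surrogate, a direct linear solve for <math>x</math> in <math>F x = g</math> rather than an iterative solver, and a small damping term added to <math>F</math> because the softmax Fisher matrix is singular. The function <code>trpo_step</code> and its parameters are hypothetical names.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over actions, parameterized by logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def mean_kl(theta, theta_t):
    """KL(pi_theta || pi_theta_t) for a single-state softmax policy."""
    pi, pi_t = softmax(theta), softmax(theta_t)
    return float(np.sum(pi * np.log(pi / pi_t)))

def surrogate(theta, theta_t, adv):
    """Assumed surrogate: E_{a ~ pi_theta_t}[pi_theta(a) / pi_theta_t(a) * adv(a)]."""
    pi, pi_t = softmax(theta), softmax(theta_t)
    return float(np.sum(pi_t * (pi / pi_t) * adv))

def trpo_step(theta_t, adv, eps=0.01, alpha=0.5, max_backtracks=10, damping=1e-3):
    """One update: solve F x = g, then backtrack along x until the step is accepted."""
    pi_t = softmax(theta_t)
    g = pi_t * adv - pi_t * (pi_t @ adv)  # policy gradient at theta_t
    # Damping is an assumption of this sketch; it makes the singular F invertible.
    F = np.diag(pi_t) - np.outer(pi_t, pi_t) + damping * np.eye(len(pi_t))
    x = np.linalg.solve(F, g)
    full_step = np.sqrt(2 * eps / (x @ F @ x)) * x
    L_old = surrogate(theta_t, theta_t, adv)
    for j in range(max_backtracks):
        theta_new = theta_t + alpha**j * full_step  # shrink the step by alpha each try
        if mean_kl(theta_new, theta_t) <= eps and surrogate(theta_new, theta_t, adv) > L_old:
            return theta_new
    return theta_t  # no acceptable step found: keep the old parameters

# Made-up numbers, for illustration only.
rewards = np.array([1.0, 0.0, -0.5])
theta_t = np.array([0.2, -0.1, 0.4])
adv = rewards - softmax(theta_t) @ rewards
theta_next = trpo_step(theta_t, adv)
print(mean_kl(theta_next, theta_t), surrogate(theta_next, theta_t, adv))
```

Both acceptance checks use the exact KL divergence and surrogate rather than their Taylor approximations, which is the point of the line search: it guards against steps where the quadratic model is inaccurate.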
Line 239:
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t)
</math>under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[