Policy gradient method: Difference between revisions

\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| \leq \epsilon</math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is averaged over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This ensures updates are invariant to invertible affine parameter transformations.
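As an illustrative sketch, the averaged KL divergence can be estimated by sampling states under the current policy and averaging the per-state categorical KL divergences. The tabular softmax policy representation and all names below are assumptions for illustration, not part of the article:

```python
import numpy as np

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def averaged_kl(theta_new, theta_old, states):
    """Monte Carlo estimate of the averaged KL divergence
    D̄_KL(pi_new || pi_old) over states visited under pi_old.
    theta_* has shape (n_states, n_actions); states are sampled indices."""
    kls = []
    for s in states:
        p_new = softmax(theta_new[s])  # pi_new(. | s)
        p_old = softmax(theta_old[s])  # pi_old(. | s)
        kls.append(np.sum(p_new * (np.log(p_new) - np.log(p_old))))
    return np.mean(kls)
```

The estimate is zero when the two policies coincide and non-negative otherwise, matching the defining properties of the KL divergence.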
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
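For a low-dimensional policy whose Fisher matrix can be formed explicitly, the natural gradient update with this KL-derived step size admits a minimal numerical sketch (the function name and default tolerance are illustrative assumptions):

```python
import numpy as np

def natural_gradient_step(theta, grad_J, fisher, epsilon=0.01):
    """One natural policy gradient update. The step size alpha is chosen so
    the quadratic KL approximation (1/2) d^T F d equals epsilon."""
    # Solve F w = grad_J instead of forming F^{-1} explicitly.
    nat_grad = np.linalg.solve(fisher, grad_J)
    # alpha = sqrt(2 * epsilon / (grad_J^T F^{-1} grad_J))
    alpha = np.sqrt(2 * epsilon / (grad_J @ nat_grad))
    return theta + alpha * nat_grad
```

By construction, the resulting step <math>d = \theta_{t+1} - \theta_t</math> satisfies <math display="inline">\frac{1}{2} d^T F d = \epsilon</math>, saturating the approximate KL constraint.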
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:

* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Instead of computing <math>F(\theta)^{-1} </math>, iteratively solve for <math>w_t</math> in <math>\nabla J(\theta_t) = F(\theta_t) w_t</math>, using the previous step's <math>w_{t-1}</math> as an initial guess, then update by <math display="inline">\theta_{t+1} = \theta_t + \alpha w_t</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.

== Trust Region Policy Optimization (TRPO) ==
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a trust region constraint on policy updates.<ref name=":3" /> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.

TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from the Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.

=== Formulation ===
Like the natural policy gradient, TRPO iteratively updates the policy parameters by solving a constrained optimization problem, specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta_t, \theta)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
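The surrogate advantage can be estimated from samples collected under the old policy by importance-weighting advantages with the probability ratio. The following sketch assumes per-sample log-probabilities and advantage estimates are already available (all names are illustrative):

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Monte Carlo estimate of L(theta_t, theta) =
    E_{s,a ~ pi_old}[ pi_new(a|s) / pi_old(a|s) * A^{pi_old}(s, a) ]."""
    ratios = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    return np.mean(ratios * advantages)
```

When the new and old policies coincide, every ratio equals one and the estimate reduces to the mean advantage, consistent with <math>L(\theta_t, \theta_t)</math>.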
 
The surrogate advantage <math>L(\theta_t, \theta)</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, the gradient <math>\nabla_\theta L(\theta_t, \theta)</math> equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right]
</math>As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta_t, \theta) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta_t, \theta) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>H = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t} </math> is the Fisher information matrix.
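The identification of the KL Hessian with the Fisher information matrix can be checked numerically. For a single-state softmax policy, the Fisher matrix has the closed form <math>\operatorname{diag}(p) - p p^T</math>; the setup below (finite differences, parameter values) is purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(theta, theta_t):
    """KL(pi_theta || pi_theta_t) for a single-state softmax policy."""
    p, q = softmax(theta), softmax(theta_t)
    return np.sum(p * (np.log(p) - np.log(q)))

def fisher(theta):
    """Closed-form Fisher matrix of a softmax policy: diag(p) - p p^T."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def kl_hessian_fd(theta_t, h=1e-4):
    """Finite-difference Hessian of the KL divergence at theta = theta_t."""
    n = theta_t.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (kl(theta_t + e_i + e_j, theta_t)
                       - kl(theta_t + e_i, theta_t)
                       - kl(theta_t + e_j, theta_t)
                       + kl(theta_t, theta_t)) / h**2
    return H
```

The finite-difference Hessian agrees with the closed-form Fisher matrix up to discretization error, illustrating why <math>H</math> above is the Fisher information matrix.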
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T H^{-1} g}} H^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
 
* Use the [[conjugate gradient method]] to solve <math>Hx = g</math> for <math>x</math> iteratively, without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement:<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{g^T H^{-1} g}} H^{-1} g,
</math>where <math>\alpha \in (0,1)</math> is a backtracking coefficient, and <math>j</math> is the smallest non-negative integer satisfying the KL constraint and a positive surrogate advantage.
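Both modifications admit a compact numerical sketch: a conjugate-gradient solver that needs only Hessian-vector products, followed by a backtracking loop over the step size. The function names, the toy surrogate/KL callables, and the default constants are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - H x
    p = r.copy()   # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, hvp, surrogate, kl, epsilon=0.01,
              backtrack=0.5, max_j=10):
    """One TRPO update: natural direction via conjugate gradient, then
    backtracking until the KL constraint holds and the surrogate improves."""
    x = conjugate_gradient(hvp, g)  # x ~ H^{-1} g
    # Full step saturates the quadratic KL constraint (note x^T H x = g^T H^{-1} g).
    full_step = np.sqrt(2 * epsilon / (x @ hvp(x))) * x
    for j in range(max_j):
        theta_new = theta + backtrack**j * full_step
        if kl(theta_new) <= epsilon and surrogate(theta_new) > 0:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```

On a quadratic toy problem, the accepted step satisfies the KL budget while yielding a positive surrogate advantage.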
 
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
These methods address the trade-off between inversion complexity and policy update stability, making natural policy gradients feasible in large-scale applications.
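PPO's clipped-ratio idea can be sketched directly on sampled data; the clip coefficient of 0.2 below is a common choice in implementations, not a value specified in this article:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: mean of min(r*A, clip(r)*A),
    which removes the incentive to push the ratio r far outside
    [1 - clip_eps, 1 + clip_eps]."""
    r = np.exp(logp_new - logp_old)  # probability ratio pi_new / pi_old
    unclipped = r * advantages
    clipped = np.clip(r, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

At the old policy the ratio is one and the objective reduces to the mean advantage; once the ratio leaves the clipping interval on positive-advantage samples, the objective stops growing, which is what removes the need for the Fisher matrix machinery above.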
 
== See also ==