Policy gradient method: Difference between revisions

\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{t+1} - \theta_t\| \leq \epsilon</math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{t+1}} J(\theta_t) + (\theta_{t+1} - \theta_t)^T \nabla_\theta J(\theta_t)\\
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is averaged over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This ensures updates are invariant to invertible affine parameter transformations.
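As an illustrative sketch, the averaged KL divergence can be estimated by sampling states under the current policy and averaging the per-state categorical KL divergences. The tabular softmax policy representation and all names below are assumptions for illustration, not part of the article:

```python
import numpy as np

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def averaged_kl(theta_new, theta_old, states):
    """Monte Carlo estimate of the averaged KL divergence
    D̄_KL(pi_new || pi_old) over states visited under pi_old.
    theta_* has shape (n_states, n_actions); states are sampled indices."""
    kls = []
    for s in states:
        p_new = softmax(theta_new[s])  # pi_new(. | s)
        p_old = softmax(theta_old[s])  # pi_old(. | s)
        kls.append(np.sum(p_new * (np.log(p_new) - np.log(p_old))))
    return np.mean(kls)
```

The estimate is zero when the two policies coincide and non-negative otherwise, matching the defining properties of the KL divergence.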
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \approx \frac{1}{2} (\theta_{t+1} - \theta_t)^T F(\theta_t) (\theta_{t+1} - \theta_t)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_t))^T F(\theta_t)^{-1} \nabla_\theta J(\theta_t)}}</math>.
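For a low-dimensional policy whose Fisher matrix can be formed explicitly, the natural gradient update with this KL-derived step size admits a minimal numerical sketch (the function name and default tolerance are illustrative assumptions):

```python
import numpy as np

def natural_gradient_step(theta, grad_J, fisher, epsilon=0.01):
    """One natural policy gradient update. The step size alpha is chosen so
    the quadratic KL approximation (1/2) d^T F d equals epsilon."""
    # Solve F w = grad_J instead of forming F^{-1} explicitly.
    nat_grad = np.linalg.solve(fisher, grad_J)
    # alpha = sqrt(2 * epsilon / (grad_J^T F^{-1} grad_J))
    alpha = np.sqrt(2 * epsilon / (grad_J @ nat_grad))
    return theta + alpha * nat_grad
```

By construction, the resulting step <math>d = \theta_{t+1} - \theta_t</math> satisfies <math display="inline">\frac{1}{2} d^T F d = \epsilon</math>, saturating the approximate KL constraint.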
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations:

* [[Conjugate gradient method]] to compute <math display="inline">F(\theta)^{-1} \nabla_\theta J(\theta) </math> without explicit inversion.<ref name=":3" />
* Instead of computing <math>F(\theta)^{-1} </math>, iteratively solve for <math>w_t</math> in <math>\nabla J(\theta_t) = F(\theta_t) w_t</math>, using the previous step's <math>w_{t-1}</math> as an initial guess, then update by <math display="inline">\theta_{t+1} = \theta_t + \alpha w_t</math> with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{w_t^T F(\theta_t)w_t }}</math>.

== Trust Region Policy Optimization (TRPO) ==
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a trust region constraint on policy updates.<ref name=":3" /> Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.

TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from the Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.

=== Formulation ===
Like the natural policy gradient, TRPO iteratively updates the policy parameters by solving a constrained optimization problem, specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta_t, \theta)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
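The surrogate advantage can be estimated from samples collected under the old policy by importance-weighting advantages with the probability ratio. The following sketch assumes per-sample log-probabilities and advantage estimates are already available (all names are illustrative):

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Monte Carlo estimate of L(theta_t, theta) =
    E_{s,a ~ pi_old}[ pi_new(a|s) / pi_old(a|s) * A^{pi_old}(s, a) ]."""
    ratios = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    return np.mean(ratios * advantages)
```

When the new and old policies coincide, every ratio equals one and the estimate reduces to the mean advantage, consistent with <math>L(\theta_t, \theta_t)</math>.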
 
The surrogate advantage <math>L(\theta_t, \theta)</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, the gradient <math>\nabla_\theta L(\theta_t, \theta)</math> equals the policy gradient derived from the advantage function:
<math display="block">
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{j=0}^T \nabla_\theta \ln \pi_\theta(A_j | S_j) \cdot A^{\pi_\theta}(S_j, A_j) \Big| S_0 = s_0 \right]
</math>As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta_t, \theta) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta_t, \theta) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>H = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t} </math> is the Fisher information matrix.
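The identification of the KL Hessian with the Fisher information matrix can be checked numerically. For a single-state softmax policy, the Fisher matrix has the closed form <math>\operatorname{diag}(p) - p p^T</math>; the setup below (finite differences, parameter values) is purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(theta, theta_t):
    """KL(pi_theta || pi_theta_t) for a single-state softmax policy."""
    p, q = softmax(theta), softmax(theta_t)
    return np.sum(p * (np.log(p) - np.log(q)))

def fisher(theta):
    """Closed-form Fisher matrix of a softmax policy: diag(p) - p p^T."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def kl_hessian_fd(theta_t, h=1e-4):
    """Finite-difference Hessian of the KL divergence at theta = theta_t."""
    n = theta_t.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (kl(theta_t + e_i + e_j, theta_t)
                       - kl(theta_t + e_i, theta_t)
                       - kl(theta_t + e_j, theta_t)
                       + kl(theta_t, theta_t)) / h**2
    return H
```

The finite-difference Hessian agrees with the closed-form Fisher matrix up to discretization error, illustrating why <math>H</math> above is the Fisher information matrix.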
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T H^{-1} g}} H^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
 
* Use the [[conjugate gradient method]] to solve <math>Hx = g</math> for <math>x</math> iteratively, without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement:<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\epsilon}{g^T H^{-1} g}} H^{-1} g,
</math>where <math>\alpha \in (0,1)</math> is a backtracking coefficient, and <math>j</math> is the smallest non-negative integer satisfying the KL constraint and a positive surrogate advantage.
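Both modifications admit a compact numerical sketch: a conjugate-gradient solver that needs only Hessian-vector products, followed by a backtracking loop over the step size. The function names, the toy surrogate/KL callables, and the default constants are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - H x
    p = r.copy()   # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, hvp, surrogate, kl, epsilon=0.01,
              backtrack=0.5, max_j=10):
    """One TRPO update: natural direction via conjugate gradient, then
    backtracking until the KL constraint holds and the surrogate improves."""
    x = conjugate_gradient(hvp, g)  # x ~ H^{-1} g
    # Full step saturates the quadratic KL constraint (note x^T H x = g^T H^{-1} g).
    full_step = np.sqrt(2 * epsilon / (x @ hvp(x))) * x
    for j in range(max_j):
        theta_new = theta + backtrack**j * full_step
        if kl(theta_new) <= epsilon and surrogate(theta_new) > 0:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```

On a quadratic toy problem, the accepted step satisfies the KL budget while yielding a positive surrogate advantage.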
 
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
These methods address the trade-off between inversion complexity and policy update stability, making natural policy gradients feasible in large-scale applications.
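PPO's clipped-ratio idea can be sketched directly on sampled data; the clip coefficient of 0.2 below is a common choice in implementations, not a value specified in this article:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: mean of min(r*A, clip(r)*A),
    which removes the incentive to push the ratio r far outside
    [1 - clip_eps, 1 + clip_eps]."""
    r = np.exp(logp_new - logp_old)  # probability ratio pi_new / pi_old
    unclipped = r * advantages
    clipped = np.clip(r, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

At the old policy the ratio is one and the objective reduces to the mean advantage; once the ratio leaves the clipping interval on positive-advantage samples, the objective stops growing, which is what removes the need for the Fisher matrix machinery above.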
 
== See also ==