Policy gradient method
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem, stated coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_i)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)} A^{\pi_{\theta_i}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_i}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that, in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)}\Psi^{\pi_{\theta_i}}(s, a) \right]</math>where <math>\Psi</math> is any linear combination of the previously mentioned terms. Indeed, OpenAI recommended using the generalized advantage estimate instead of the plain advantage <math>A^{\pi_\theta}</math>.
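As an illustrative sketch (not from the original TRPO paper), the surrogate advantage can be estimated by Monte Carlo from samples drawn under the old policy; all array names below are assumptions:

```python
import numpy as np

# Hypothetical sketch: sample estimate of L(theta, theta_i).
# `logp_new` and `logp_old` hold log pi_theta(a|s) and log pi_{theta_i}(a|s)
# for (s, a) pairs sampled under the old policy; `adv` holds the advantage
# estimates A^{pi_{theta_i}}(s, a) (e.g. from GAE).
def surrogate_advantage(logp_new, logp_old, adv):
    ratio = np.exp(logp_new - logp_old)  # importance weight pi_theta / pi_{theta_i}
    return np.mean(ratio * adv)          # empirical mean of ratio * A
```

At <math>\theta = \theta_i</math> every importance weight is 1, so the estimate reduces to the mean sampled advantage.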
 
The surrogate advantage is designed so that, at <math>\theta = \theta_i</math>, its gradient '''<math>
\nabla_\theta L(\theta, \theta_i)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_i)</math>However, when <math>\theta \neq \theta_i</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
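This gradient equality can be checked numerically on a toy example. The following sketch (all names hypothetical) uses a single-state softmax policy with frozen advantages, computing both the surrogate and the expectation exactly rather than by sampling:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Exact surrogate L(theta, theta_i) for a one-state softmax policy:
# sum over actions of pi_old(a) * (pi_new(a) / pi_old(a)) * A(a),
# with the advantages A frozen at the old policy.
def surrogate(theta, theta_i, A):
    p_old = softmax(theta_i)
    p_new = softmax(theta)
    return np.sum(p_old * (p_new / p_old) * A)

# Exact policy gradient: sum over actions of pi(a) * grad log pi(a) * A(a).
# For a softmax policy, grad_theta log pi(a) = e_a - pi.
def policy_gradient(theta, A):
    p = softmax(theta)
    return sum(p[a] * (np.eye(len(p))[a] - p) * A[a] for a in range(len(p)))
```

A finite-difference gradient of `surrogate` at <math>\theta = \theta_i</math> matches `policy_gradient` evaluated there, illustrating the claimed equality.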
 
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_i</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_i) &\approx g^T (\theta - \theta_i), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) &\approx \frac{1}{2} (\theta - \theta_i)^T F (\theta - \theta_i),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_i) \big|_{\theta = \theta_i}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) \big|_{\theta = \theta_i}</math> is the Fisher information matrix.
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}} F^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
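The quadratic-approximation update can be sketched as follows (an illustration, not a reference implementation; it assumes the gradient <math>g</math> and Fisher matrix <math>F</math> have already been estimated, and inverts <math>F</math> directly rather than iteratively):

```python
import numpy as np

# Hypothetical sketch of theta_{i+1} = theta_i + sqrt(2*eps / (g^T F^{-1} g)) * F^{-1} g.
def natural_gradient_step(theta, g, F, eps):
    x = np.linalg.solve(F, g)            # x = F^{-1} g (explicit solve, for clarity)
    step = np.sqrt(2.0 * eps / (g @ x))  # scale so the quadratic KL term equals eps
    return theta + step * x
```

By construction, the step <math>d = \theta_{i+1} - \theta_i</math> satisfies <math>\tfrac{1}{2} d^T F d = \epsilon</math>, i.e. it lands exactly on the trust-region boundary of the quadratic approximation.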
 
* Use the [[conjugate gradient method]] to solve for <math>x = F^{-1} g</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it shrinks the step size until the update both satisfies the KL constraint and improves the surrogate advantage. That is, it tests each of the following candidate solutions<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> until it finds one that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_{i}}) \leq \epsilon </math> and improves the surrogate advantage: <math>
L(\theta_{i+1}, \theta_i) \geq L(\theta_i, \theta_i)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
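These two modifications can be sketched together as follows (a minimal illustration, not the reference implementation; in practice the Fisher-vector product is computed from samples via automatic differentiation rather than an explicit matrix, and `L` and `kl` stand for estimates of the surrogate advantage and averaged KL divergence):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g iteratively using only Fisher-vector products `fvp`."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at 0)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, g, fvp, L, kl, eps, alpha=0.8, max_backtracks=10):
    """Backtracking line search along the natural-gradient direction."""
    x = conjugate_gradient(fvp, g)
    full_step = np.sqrt(2.0 * eps / (x @ fvp(x))) * x
    L_old = L(theta)
    for k in range(max_backtracks):
        candidate = theta + (alpha ** k) * full_step
        # accept the first candidate that satisfies both conditions
        if kl(candidate) <= eps and L(candidate) >= L_old:
            return candidate
    return theta                      # no acceptable step found; keep old policy
```

On a quadratic toy problem where the surrogate is linear and the KL is exactly quadratic, the accepted step stays inside the trust region while improving the surrogate.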