Policy gradient method: Difference between revisions

</math>As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
\mathcal{L}(\theta_t, \theta) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T H (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta \mathcal{L}(\theta_t, \theta) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>H = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t} </math> is the Fisher information matrix.
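Explicitly, writing <math>s = \theta - \theta_t</math>, the Taylor-expanded problem is the constrained maximization<math display="block">
\max_{s} \; g^T s \quad \text{subject to} \quad \frac{1}{2} s^T H s \le \delta.
</math>By the method of Lagrange multipliers, the maximizer is proportional to <math>H^{-1} g</math>, scaled so that the constraint holds with equality.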
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\delta}{g^T H^{-1} g}} H^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves on it with two modifications:
 
* Use the [[conjugate gradient method]] to solve for <math>x</math> in <math>Hx = g</math> iteratively, without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, the step size is backtracked until both the KL constraint and a positive surrogate advantage are achieved:<math display="block">
\theta_{t+1} = \theta_t + \alpha^j \sqrt{\frac{2\delta}{x^T H x}} x,
</math>where <math>\alpha \in (0,1)</math> is a backtracking coefficient, and <math>j</math> is the smallest nonnegative integer such that the KL constraint is satisfied and the surrogate advantage is positive.
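Under the quadratic approximation, both modifications can be sketched in a few lines of NumPy. This is a toy illustration only: an explicit positive-definite matrix stands in for the Hessian-vector products of the averaged KL divergence, and the gradient, trust-region size, and backtracking coefficient are arbitrary values, not taken from any particular implementation.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian-vector products hvp(v) = H @ v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at zero)
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        a = rs / (p @ Hp)
        x += a * p
        r -= a * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy stand-ins (illustrative): an explicit positive-definite H and gradient g.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
delta = 0.01                  # trust-region size
alpha = 0.8                   # backtracking coefficient

x = conjugate_gradient(lambda v: H @ v, g)          # x approximates H^{-1} g
full_step = np.sqrt(2 * delta / (x @ H @ x)) * x    # natural-gradient step

def acceptable(s):
    # Quadratic KL estimate within the trust region (tiny float tolerance),
    # plus a positive linearized surrogate advantage g^T s.
    return 0.5 * s @ H @ s <= delta + 1e-12 and g @ s > 0

j = 0                         # smallest j whose shrunken step is accepted
while not acceptable(alpha ** j * full_step):
    j += 1
step = alpha ** j * full_step
```

In an actual TRPO implementation, the acceptance test re-evaluates the exact KL divergence and surrogate advantage on the sampled trajectories rather than their quadratic and linear approximations, and the Hessian-vector products are computed by automatic differentiation instead of an explicit matrix.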