Revision as of 21:18, 28 January 2025 by Cosmia Nebula (→Motivation; Tag: Visual edit)
Revision as of 21:20, 28 January 2025 by Cosmia Nebula (unify notation; Tag: 2017 wikitext editor)
Line 195:
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}\Psi^{\pi_{\theta_t}}(s, a) \right]</math>where <math>\Psi</math> is any linear sum of the previously mentioned type. Indeed, OpenAI recommended using the Generalized Advantage Estimate, instead of the plain advantage <math>A^{\pi_\theta}</math>.

The surrogate advantage <math>L(\theta, \theta_t)</math> is designed to align with the policy gradient <math>\nabla_\theta J(\theta)</math>. Specifically, when <math>\theta = \theta_t</math>, '''<math>\nabla_\theta L(\theta, \theta_t)</math>''' equals the policy gradient derived from the advantage function: <math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
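The claim that the gradient of the surrogate advantage at <math>\theta = \theta_t</math> equals the policy gradient can be checked numerically. The following is a minimal sketch, not part of the article: it assumes a softmax policy over three actions in a single state, with hypothetical logits and action values, and takes the expectation over actions exactly. All function names are illustrative.

```python
import math

def softmax(logits):
    """Softmax policy pi_theta(a) for a single state with discrete actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def surrogate(theta, theta_t, adv):
    """L(theta, theta_t): expectation over a ~ pi_{theta_t} of the ratio times the advantage."""
    pi, pi_t = softmax(theta), softmax(theta_t)
    return sum(p_t * (p / p_t) * a for p, p_t, a in zip(pi, pi_t, adv))

def policy_gradient(theta_t, adv):
    """E_{a ~ pi}[grad ln pi(a) * A(a)], using d ln pi(a) / d theta_k = 1[a == k] - pi(k)."""
    pi = softmax(theta_t)
    n = len(theta_t)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * adv[a] for a in range(n))
        for k in range(n)
    ]

def finite_diff_grad(f, theta, h=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Hypothetical numbers chosen only for illustration.
theta_t = [0.2, -0.5, 1.0]
q_values = [1.0, 0.0, 2.0]
pi_t = softmax(theta_t)
v = sum(p * q for p, q in zip(pi_t, q_values))
adv = [q - v for q in q_values]  # advantage A = Q - V under pi_{theta_t}

grad_L = finite_diff_grad(lambda th: surrogate(th, theta_t, adv), theta_t)
grad_J = policy_gradient(theta_t, adv)
print(grad_L)
print(grad_J)
```

The two printed vectors should agree up to finite-difference error. Evaluating the gradient of the surrogate at any other point generally gives a different vector from the policy gradient at that point, which is the sense in which the quantity is only a surrogate.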
Line 211:
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_t) &\approx g^T (\theta - \theta_t), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_t) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t}</math> is the Fisher information matrix.
Line 230:
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \dots
</math>until a <math>\theta_{t+1}</math> is found that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon </math> and does not decrease the surrogate advantage, <math> L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t) </math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
Line 239:
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math>under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[

Policy gradient method: Difference between revisions