Policy gradient method

* <math>L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that, in general, other surrogate advantages are possible:<math display="block">L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}\Psi^{\pi_{\theta_t}}(s, a) \right]</math>where <math>\Psi</math> is any linear combination of the previously mentioned quantities. Indeed, OpenAI recommended using the generalized advantage estimate (GAE) instead of the plain advantage <math>A^{\pi_{\theta_t}}</math>.
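As a concrete illustration, the surrogate advantage can be estimated from a batch of sampled log-probabilities under the new and old policies, together with advantage estimates. The following NumPy sketch is illustrative; the function and variable names are not taken from any particular library:

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Monte Carlo estimate of L(theta_t, theta): the mean
    importance-weighted advantage over sampled (s, a) pairs."""
    ratios = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_{theta_t}(a|s)
    return float(np.mean(ratios * advantages))

# When theta = theta_t, every ratio is 1, so the estimate reduces
# to the mean of the advantage estimates.
logp_old = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 0.25])
L = surrogate_advantage(logp_old, logp_old, adv)  # mean(adv) = 0.25
```

The advantages passed in could be plain advantage estimates or GAE values; the estimator itself is the same either way.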
 
The surrogate advantage <math>L(\theta_t, \theta)</math> is then maximized under the trust-region constraint. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient used in the line search.
 
== Proximal Policy Optimization (PPO) ==
A further improvement is [[proximal policy optimization]] (PPO), which avoids even computing <math>F(\theta)</math> and <math>F(\theta)^{-1}</math> via a first-order approximation using clipped probability ratios.<ref name=":0" />
 
Specifically, instead of maximizing the surrogate advantage<math display="block">
\max_\theta L(\theta_t, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)} A^{\pi_{\theta_t}}(s, a) \right]
</math>under a KL divergence constraint, it directly inserts the constraint into the surrogate advantage:<math display="block">
\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_t}}\left[
\begin{cases}
\min \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) > 0
\\
\max \left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_t}}(s, a) & \text{ if } A^{\pi_{\theta_t}}(s, a) < 0
\end{cases}
\right]
</math>and PPO maximizes this clipped surrogate objective by stochastic gradient ascent, as usual.
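The case-by-case objective above can equivalently be written as <math>\min\left(r A, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon) A\right)</math>, where <math>r</math> is the probability ratio. A minimal NumPy sketch of this form (names are illustrative, not tied to a specific library):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch PPO-Clip objective: mean of min(r*A, clip(r, 1-eps, 1+eps)*A).
    This equals the case form above: min(r, 1+eps)*A when A > 0 and
    max(r, 1-eps)*A when A < 0."""
    r = np.exp(logp_new - logp_old)  # probability ratios
    unclipped = r * advantages
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# A ratio of 2.0 with positive advantage is clipped at 1 + eps = 1.2.
L = ppo_clip_objective(np.log([0.5]), np.log([0.25]), np.array([1.0]))  # 1.2
```

In practice one performs stochastic gradient ascent on this objective, for example by minimizing its negation with an automatic-differentiation framework.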
 
In words, gradient-ascending the new surrogate objective means the following. Consider a sampled state–action pair <math>(s, a)</math>. If its advantage is positive, <math>A^{\pi_{\theta_t}}(s, a) > 0</math>, then the gradient pushes <math>\theta</math> in the direction that increases the probability of taking action <math>a</math> in state <math>s</math>. However, as soon as <math>\theta</math> has changed so much that <math>\pi_\theta(a | s) \geq (1 + \epsilon) \pi_{\theta_t}(a | s)</math>, the objective is clipped and the gradient stops pointing in that direction. Symmetrically, if the advantage is negative, the gradient decreases the probability of <math>a</math> only until <math>\pi_\theta(a | s) \leq (1 - \epsilon) \pi_{\theta_t}(a | s)</math>.
 
Thus, PPO avoids taking overly large parameter updates, keeping the new policy close to the old one.
 
== See also ==