Policy gradient method: Difference between revisions

</math> has changed so much that <math>
\pi_\theta(a | s) \geq (1 + \epsilon) \pi_{\theta_t}(a | s)
</math>, then the gradient should no longer push the policy further in that direction; and similarly if <math>
A^{\pi_{\theta_t}} (s, a) < 0
</math>. Thus, PPO avoids pushing the parameter update too hard and avoids changing the policy too much.
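The clipping mechanism can be illustrated with a minimal sketch (the function name <code>clipped_surrogate</code> is illustrative, not from any particular library): once the probability ratio exceeds <math>1 + \epsilon</math> with a positive advantage, the objective becomes flat, so its gradient with respect to the ratio is zero.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective for a single (s, a) pair.

    ratio = pi_theta(a|s) / pi_theta_t(a|s);
    advantage = A^{pi_theta_t}(s, a).
    """
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

# With a positive advantage, the objective grows with the ratio only
# inside the [1 - eps, 1 + eps] band; beyond it, the value is capped,
# so further increases in the ratio contribute no gradient.
print(clipped_surrogate(1.1, 2.0))  # 1.1 * 2.0 = 2.2 (inside the band)
print(clipped_surrogate(1.5, 2.0))  # capped at (1 + 0.2) * 2.0 = 2.4
```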
 
To be more precise, to update <math>
\theta_t
</math> to <math>
\theta_{t+1}
</math> requires multiple update steps on the same batch of data. It would initialize <math>
\theta = \theta_t
</math>, then repeatedly apply gradient descent (such as the [[Adam optimizer]]) to update <math>
\theta
</math> until the surrogate advantage has stabilized. It would then set <math>
\theta_{t+1}
</math> to the resulting <math>
\theta
</math>, and repeat on the next batch of data.
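This inner loop can be sketched on a toy problem (a single state with a two-action softmax policy; the names and the finite-difference gradient are illustrative, and real implementations use automatic differentiation and an optimizer such as Adam):

```python
import numpy as np

def policy(theta):
    """Toy two-action softmax policy over a single state."""
    e = np.exp([theta, 0.0])
    return e / e.sum()

def clipped_surrogate(theta, theta_t, a, adv, eps=0.2):
    """PPO clipped surrogate for one sampled action a with advantage adv."""
    ratio = policy(theta)[a] / policy(theta_t)[a]
    return min(ratio * adv, float(np.clip(ratio, 1 - eps, 1 + eps)) * adv)

def ppo_inner_loop(theta_t, a, adv, lr=0.5, n_steps=20, eps=0.2):
    """Multiple gradient-ascent steps on the same batch, starting at theta_t."""
    theta = theta_t
    for _ in range(n_steps):
        # Finite-difference gradient of the surrogate (for illustration only).
        h = 1e-6
        g = (clipped_surrogate(theta + h, theta_t, a, adv, eps)
             - clipped_surrogate(theta - h, theta_t, a, adv, eps)) / (2 * h)
        theta += lr * g  # ascent on the surrogate advantage
    return theta

theta_t = 0.0  # parameters of the current policy pi_{theta_t}
theta_next = ppo_inner_loop(theta_t, a=0, adv=1.0)
# Once the ratio crosses 1 + eps, the gradient vanishes and theta stops moving:
print(policy(theta_next)[0] / policy(theta_t)[0])
```

With a positive advantage, the updates push the sampled action's probability up until the ratio crosses the <math>1 + \epsilon</math> bound, after which the gradient is zero and the remaining steps leave <math>\theta</math> unchanged.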
 
During this inner loop, the first update to <math>
\theta
</math> would not hit the <math>
1 - \epsilon, 1 + \epsilon
</math> bounds, but as <math>
\theta
</math> is updated further and further away from <math>
\theta_t
</math>, it eventually starts hitting the bounds. Whenever a bound is hit, the corresponding gradient term becomes zero, and thus PPO avoids updating <math>
\theta
</math> too far away from <math>
\theta_t
</math>.
 
This is important, because the surrogate loss assumes that the state–action pair <math>
s, a
</math> is sampled from the distribution the agent would encounter by running the policy <math>
\pi_{\theta_t}
</math>, whereas the policy gradient theorem requires on-policy samples. So, as <math>
\theta
</math> changes, the surrogate loss becomes more and more ''off''-policy. This is why keeping <math>
\theta
</math> ''proximal'' to <math>
\theta_t
</math> is necessary.
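The role of the probability ratio can be seen in a small importance-sampling sketch (the numbers are illustrative): reweighting samples drawn from the old policy by the ratio recovers, in expectation, the value under the new policy, which is why the surrogate remains valid as long as the two policies stay close.

```python
import numpy as np

# One state, two actions; old and new policies (illustrative numbers).
pi_old = np.array([0.5, 0.5])
pi_new = np.array([0.7, 0.3])
f = np.array([1.0, -1.0])  # e.g. a per-action advantage estimate

# Expectation of f under the new policy, computed directly:
on_policy = (pi_new * f).sum()

# The same expectation estimated from old-policy samples,
# reweighted by the probability ratio pi_new / pi_old:
ratio = pi_new / pi_old
off_policy = (pi_old * ratio * f).sum()

print(on_policy, off_policy)  # equal (both 0.4) up to floating point
```

The correction is exact in expectation, but with finite samples its variance grows as the ratios drift away from 1, which is the practical reason for keeping the new policy proximal to the old one.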
 
== See also ==