Policy gradient method: Difference between revisions

</math> has changed so much that <math>
\pi_\theta(a | s) \geq (1 + \epsilon) \pi_{\theta_t}(a | s)
</math>, then the gradient should no longer push the policy further in that direction; and similarly if <math>
A^{\pi_{\theta_t}} (s, a) < 0
</math>. Thus, PPO avoids pushing the parameter update too hard and avoids changing the policy too much.
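The clipping mechanism can be illustrated with a minimal sketch (the function name <code>clipped_surrogate</code> is illustrative, not from any particular library): once the probability ratio exceeds <math>1 + \epsilon</math> with a positive advantage, the objective becomes flat, so its gradient with respect to the ratio is zero.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective for a single (s, a) pair.

    ratio = pi_theta(a|s) / pi_theta_t(a|s);
    advantage = A^{pi_theta_t}(s, a).
    """
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

# With a positive advantage, the objective grows with the ratio only
# inside the [1 - eps, 1 + eps] band; beyond it, the value is capped,
# so further increases in the ratio contribute no gradient.
print(clipped_surrogate(1.1, 2.0))  # 1.1 * 2.0 = 2.2 (inside the band)
print(clipped_surrogate(1.5, 2.0))  # capped at (1 + 0.2) * 2.0 = 2.4
```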
 
To be more precise, to update <math>
\theta_t
</math> to <math>
\theta_{t+1}
</math> requires multiple update steps on the same batch of data. It would initialize <math>
\theta = \theta_t
</math>, then repeatedly apply gradient descent (such as the [[Adam optimizer]]) to update <math>
\theta
</math> until the surrogate advantage has stabilized. It would then set <math>
\theta_{t+1}
</math> to the resulting <math>
\theta
</math>, and repeat on the next batch of data.
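This inner loop can be sketched on a toy problem (a single state with a two-action softmax policy; the names and the finite-difference gradient are illustrative, and real implementations use automatic differentiation and an optimizer such as Adam):

```python
import numpy as np

def policy(theta):
    """Toy two-action softmax policy over a single state."""
    e = np.exp([theta, 0.0])
    return e / e.sum()

def clipped_surrogate(theta, theta_t, a, adv, eps=0.2):
    """PPO clipped surrogate for one sampled action a with advantage adv."""
    ratio = policy(theta)[a] / policy(theta_t)[a]
    return min(ratio * adv, float(np.clip(ratio, 1 - eps, 1 + eps)) * adv)

def ppo_inner_loop(theta_t, a, adv, lr=0.5, n_steps=20, eps=0.2):
    """Multiple gradient-ascent steps on the same batch, starting at theta_t."""
    theta = theta_t
    for _ in range(n_steps):
        # Finite-difference gradient of the surrogate (for illustration only).
        h = 1e-6
        g = (clipped_surrogate(theta + h, theta_t, a, adv, eps)
             - clipped_surrogate(theta - h, theta_t, a, adv, eps)) / (2 * h)
        theta += lr * g  # ascent on the surrogate advantage
    return theta

theta_t = 0.0  # parameters of the current policy pi_{theta_t}
theta_next = ppo_inner_loop(theta_t, a=0, adv=1.0)
# Once the ratio crosses 1 + eps, the gradient vanishes and theta stops moving:
print(policy(theta_next)[0] / policy(theta_t)[0])
```

With a positive advantage, the updates push the sampled action's probability up until the ratio crosses the <math>1 + \epsilon</math> bound, after which the gradient is zero and the remaining steps leave <math>\theta</math> unchanged.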
 
During this inner loop, the first update to <math>
\theta
</math> would not hit the <math>
1 - \epsilon, 1 + \epsilon
</math> bounds, but as <math>
\theta
</math> is updated further and further away from <math>
\theta_t
</math>, it eventually starts hitting the bounds. Whenever a bound is hit, the corresponding gradient term becomes zero, and thus PPO avoids updating <math>
\theta
</math> too far away from <math>
\theta_t
</math>.
 
This is important, because the surrogate loss assumes that the state–action pair <math>
s, a
</math> is sampled from the distribution the agent would encounter by running the policy <math>
\pi_{\theta_t}
</math>, whereas the policy gradient theorem requires on-policy samples. So, as <math>
\theta
</math> changes, the surrogate loss becomes more and more ''off''-policy. This is why keeping <math>
\theta
</math> ''proximal'' to <math>
\theta_t
</math> is necessary.
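The role of the probability ratio can be seen in a small importance-sampling sketch (the numbers are illustrative): reweighting samples drawn from the old policy by the ratio recovers, in expectation, the value under the new policy, which is why the surrogate remains valid as long as the two policies stay close.

```python
import numpy as np

# One state, two actions; old and new policies (illustrative numbers).
pi_old = np.array([0.5, 0.5])
pi_new = np.array([0.7, 0.3])
f = np.array([1.0, -1.0])  # e.g. a per-action advantage estimate

# Expectation of f under the new policy, computed directly:
on_policy = (pi_new * f).sum()

# The same expectation estimated from old-policy samples,
# reweighted by the probability ratio pi_new / pi_old:
ratio = pi_new / pi_old
off_policy = (pi_old * ratio * f).sum()

print(on_policy, off_policy)  # equal (both 0.4) up to floating point
```

The correction is exact in expectation, but with finite samples its variance grows as the ratios drift away from 1, which is the practical reason for keeping the new policy proximal to the old one.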
 
== See also ==