D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where the KL divergence between the two policies is averaged over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This ensures that updates are invariant to invertible affine parameter transformations.
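In practice the state expectation cannot be computed exactly, so the averaged KL divergence is typically estimated by Monte Carlo over states visited while running the old policy. A minimal sketch for discrete action spaces, assuming hypothetical helpers `policy_new` and `policy_old` that map a state to a vector of action probabilities:

```python
import numpy as np

def categorical_kl(p, q):
    """D_KL(p || q) for two categorical distributions given as probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def average_kl(policy_new, policy_old, states):
    """Monte Carlo estimate of E_{s ~ pi_old}[ D_KL(pi_new(.|s) || pi_old(.|s)) ].

    `states` is assumed to be a sample of states collected by running the
    old policy; `policy_new(s)` / `policy_old(s)` are assumed to return
    action-probability vectors (illustrative names, not a fixed API).
    """
    return float(np.mean([categorical_kl(policy_new(s), policy_old(s))
                          for s in states]))
```

The estimate is zero when the two policies agree on every sampled state, and the constraint <math>D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon</math> is then checked against this sample average.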
=== Fisher information approximation ===