D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where the KL divergence between the two policies is averaged over the state distribution under policy <math>\pi_{\theta_t}</math>. That is,<math display="block">D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) := \mathbb E_{s \sim \pi_{\theta_t}}[D_{KL}( \pi_{\theta_{t+1}}(\cdot | s) \| \pi_{\theta_{t}}(\cdot | s) )]</math>This ensures that updates are invariant to invertible affine parameter transformations.
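In practice the state expectation cannot be computed exactly, so the averaged KL divergence is typically estimated by Monte Carlo over states visited while running the old policy. A minimal sketch for discrete action spaces, assuming hypothetical helpers `policy_new` and `policy_old` that map a state to a vector of action probabilities:

```python
import numpy as np

def categorical_kl(p, q):
    """D_KL(p || q) for two categorical distributions given as probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def average_kl(policy_new, policy_old, states):
    """Monte Carlo estimate of E_{s ~ pi_old}[ D_KL(pi_new(.|s) || pi_old(.|s)) ].

    `states` is assumed to be a sample of states collected by running the
    old policy; `policy_new(s)` / `policy_old(s)` are assumed to return
    action-probability vectors (illustrative names, not a fixed API).
    """
    return float(np.mean([categorical_kl(policy_new(s), policy_old(s))
                          for s in states]))
```

The estimate is zero when the two policies agree on every sampled state, and the constraint <math>D_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_{t}}) \leq \epsilon</math> is then checked against this sample average.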
=== Fisher information approximation ===