Policy gradient method: Difference between revisions

The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{i\in 0:T} \gamma^i R_i \Big| S_0 = s_0 \right]
</math>where <math>
\gamma