Content deleted Content added
natural policy gradient |
|||
Line 11:
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{
</math>where <math>
\gamma
|