Actor-critic algorithm

The goal of the policy gradient method is to optimize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] on the policy gradient <math>\nabla J(\theta)</math>.
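
A minimal illustrative sketch of this update (not from the article): the Python snippet below performs gradient ascent on <math>J(\theta)</math> for a softmax policy using a score-function (REINFORCE-style) estimate of the gradient, on a hypothetical two-armed bandit. The bandit reward means, step size, and iteration count are arbitrary assumptions chosen for illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: expected reward of each action (illustrative only).
TRUE_MEANS = np.array([1.0, 2.0])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    # Gradient of log softmax(theta)[a] with respect to theta:
    # one_hot(a) - softmax(theta)
    p = softmax(theta)
    g = -p
    g[a] += 1.0
    return g

theta = np.zeros(2)   # policy parameters
alpha = 0.1           # step size for gradient ascent

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)                # sample an action from the policy
    r = rng.normal(TRUE_MEANS[a], 1.0)    # sampled reward used as the weighting term
    # Unbiased score-function estimate of grad J(theta), applied as gradient *ascent*.
    theta += alpha * grad_log_pi(theta, a) * r

print("learned action probabilities:", softmax(theta))  # should favour the higher-mean action
</syntaxhighlight>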
 
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[Unbiased estimator|unbiased estimators]] of the policy gradient:<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear sum of the following: