Actor-critic algorithm

The goal of the policy gradient method is to optimize <math>J(\theta)</math> by [[Gradient descent|gradient ascent]] on the policy gradient <math>\nabla J(\theta)</math>.
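
A minimal illustrative sketch of this update (not from the article): the Python snippet below performs gradient ascent on <math>J(\theta)</math> for a softmax policy using a score-function (REINFORCE-style) estimate of the gradient, on a hypothetical two-armed bandit. The bandit reward means, step size, and iteration count are arbitrary assumptions chosen for illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: expected reward of each action (illustrative only).
TRUE_MEANS = np.array([1.0, 2.0])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    # Gradient of log softmax(theta)[a] with respect to theta:
    # one_hot(a) - softmax(theta)
    p = softmax(theta)
    g = -p
    g[a] += 1.0
    return g

theta = np.zeros(2)   # policy parameters
alpha = 0.1           # step size for gradient ascent

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)                # sample an action from the policy
    r = rng.normal(TRUE_MEANS[a], 1.0)    # sampled reward used as the weighting term
    # Unbiased score-function estimate of grad J(theta), applied as gradient *ascent*.
    theta += alpha * grad_log_pi(theta, a) * r

print("learned action probabilities:", softmax(theta))  # should favour the higher-mean action
</syntaxhighlight>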
 
As detailed on the [[Policy gradient method#Actor-critic methods|policy gradient method]] page, there are many [[Unbiased estimator|unbiased estimators]] of the policy gradient:<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math>where <math display="inline">\Psi_j</math> is a linear sum of the following: