Revision as of 16:43, 22 June 2025 by 2a02:2455:17f2:1c00:921b:eff:fef8:85f3 (no edit summary). Latest revision as of 20:12, 9 July 2025 by 14.203.192.131, with edit summary: I fixed the formula so it would express "the total reward from time $
Line 102: }} {{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block"> \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} \gamma^{\tau-t} R_{\tau,n} \right] </math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta</math>.
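
The estimator above can be illustrated with a minimal NumPy sketch. It assumes the per-step score vectors <math>\nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})</math> have already been computed and stored in an array; the function names `reward_to_go` and `policy_gradient_estimate` and the array layout are hypothetical choices made for illustration, not part of the article. The code follows the formula exactly as written, with rewards discounted relative to the current step <math>t</math>.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """Return G with G[t] = sum over tau >= t of gamma**(tau - t) * rewards[tau]."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    running = 0.0
    # Accumulate from the last step backwards: G[t] = R[t] + gamma * G[t+1].
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def policy_gradient_estimate(score_grads, rewards, gamma):
    """Monte Carlo policy gradient estimate over N rollouts.

    score_grads: array of shape (N, T+1, d), the gradient of log pi at each step
                 (assumed precomputed; hypothetical layout).
    rewards:     array of shape (N, T+1), the reward at each step.
    """
    score_grads = np.asarray(score_grads, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Discounted reward-to-go for every trajectory, shape (N, T+1).
    returns = np.stack([reward_to_go(r, gamma) for r in rewards])
    # Inner sum over t: weight each score vector by its reward-to-go.
    per_trajectory = np.einsum("ntd,nt->nd", score_grads, returns)
    # Outer average over the N rollouts.
    return per_trajectory.mean(axis=0)
```

The backward recursion computes each inner sum over <math>\tau</math> in linear time rather than re-summing the tail of the trajectory for every <math>t</math>.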

Policy gradient method: Difference between revisions