Policy gradient method: Difference between revisions

Revision as of 15:51, 24 May 2025 by 2001:9e8:2d75:e600:c1d0:9d69:6873:b26a: →Actor-critic methods: fixed small error (Tags: Mobile edit, Mobile web edit)
Latest revision as of 20:12, 9 July 2025 by 14.203.192.131: I fixed the formula so it would express "the total reward from time $
(One intermediate revision by one other user not shown)
Line 102:
}} {{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block"> \nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right] </math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories sampled using the policy <math>\pi_\theta </math>.

Line 117:
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. Because the returns <math>R(\tau)</math> can vary significantly between trajectories, the resulting updates can have high variance. Many variants of REINFORCE have been introduced under the heading of '''[[variance reduction]]'''.

=== REINFORCE with baseline ===
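As a point of reference before any baseline is introduced, the plain Monte Carlo estimator given above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the article's reference implementation: it assumes a tabular softmax policy whose logits are stored in <code>theta[s][a]</code>, finite trajectories given as lists of <code>(state, action, reward)</code> tuples, and the hypothetical helper names <code>discounted_rewards_to_go</code>, <code>grad_log_softmax</code> and <code>reinforce_gradient</code>, none of which appear in the article.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def discounted_rewards_to_go(rewards, gamma):
    """Return G_t = sum over tau >= t of gamma^(tau - t) * R_tau for every t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each G_t reuses G_{t+1}: G_t = R_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def grad_log_softmax(logits, action):
    """Gradient of ln pi(action | s) with respect to the logits of state s.

    For a softmax policy this is one_hot(action) - pi(. | s).
    """
    probs = softmax(logits)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def reinforce_gradient(theta, trajectories, gamma):
    """Average over N trajectories of sum_t grad ln pi(A_t | S_t) * G_t.

    Assumption: theta[s][a] holds the logit of action a in state s, and each
    trajectory is a list of (state, action, reward) tuples.
    """
    total = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        returns = discounted_rewards_to_go([r for _, _, r in traj], gamma)
        for (s, a, _), g in zip(traj, returns):
            # Only the logits of the visited state receive a nonzero gradient.
            for j, d in enumerate(grad_log_softmax(theta[s], a)):
                total[s][j] += d * g
    n = len(trajectories)
    return [[x / n for x in row] for row in total]


# Two hand-written rollouts in a 2-state, 2-action problem (made-up data).
theta = [[0.0, 0.0], [0.0, 0.0]]
trajectories = [
    [(0, 1, 1.0), (1, 0, 0.0), (0, 1, 2.0)],
    [(0, 0, 0.0), (1, 1, 1.0)],
]
grad = reinforce_gradient(theta, trajectories, gamma=0.5)
```

Because every term is weighted by a raw sampled return, a single unusually large or small return shifts the whole estimate, which is the source of variance that the methods in this section aim to reduce.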