Policy gradient method: Difference between revisions

Revision as of 15:12, 15 May 2025: Migolan (talk | contribs), 6 edits, →Formulation
Latest revision as of 20:12, 9 July 2025: 14.203.192.131 (talk), I fixed the formula so it would express "the total reward from time $
(2 intermediate revisions by 2 users not shown)
Line 102:
}} {{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block"> \nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right] </math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories sampled using the policy <math>\pi_\theta </math>.

Line 117:
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of '''[[variance reduction]]'''.

=== REINFORCE with baseline ===

Line 128:
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_i</math> is chosen well, such that <math display="inline">b_i(S_t) \approx \sum_{\tau \in t:T} (\gamma^\tau R_\tau) = \gamma^t V^{\pi_{\theta_i}}(S_t)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_i}}(S_t)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - \gamma^t V^{\pi_\theta}(S_t)\right) \Big| S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_i}</math> updates, the value function <math>V^{\pi_{\theta_i}}(S_t)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline.
This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
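
The scalar that multiplies each <math>\nabla_\theta\ln\pi_\theta(A_t\mid S_t)</math> term can be sketched in a few lines of Python. This is only an illustration, not part of any particular library: the function names <code>rewards_to_go</code> and <code>baselined_weights</code>, the sample rewards, and the critic outputs in <code>values</code> are all hypothetical. The sketch relies on the identity <math>\sum_{\tau \in t:T} \gamma^\tau R_\tau - \gamma^t V(S_t) = \gamma^t\left(G_t - V(S_t)\right)</math>, where <math>G_t = \sum_{\tau \in t:T} \gamma^{\tau-t} R_\tau</math> is the discounted reward from time <math>t</math> onward.

```python
from typing import List, Sequence


def rewards_to_go(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = sum over tau >= t of gamma**(tau - t) * R_tau, for every t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the next one: G_t = R_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def baselined_weights(
    rewards: Sequence[float], values: Sequence[float], gamma: float
) -> List[float]:
    """Return gamma**t * (G_t - V(S_t)) for every t.

    `values` stands in for the critic's estimates V(S_t); how they are
    obtained is outside the scope of this sketch.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    returns = rewards_to_go(rewards, gamma)
    return [
        gamma ** t * (g_t - v_t)
        for t, (g_t, v_t) in enumerate(zip(returns, values))
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]   # hypothetical rewards R_0, R_1, R_2
    values = [1.0, 0.5, 1.0]    # hypothetical critic outputs V(S_0..S_2)
    gamma = 0.5
    print(rewards_to_go(rewards, gamma))              # [1.5, 1.0, 2.0]
    print(baselined_weights(rewards, values, gamma))  # [0.5, 0.25, 0.25]
```

In an actual update, each weight would multiply the corresponding score <math>\nabla_\theta\ln\pi_\theta(A_t\mid S_t)</math>, and the products would be averaged over the sampled trajectories.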