Revision as of 11:57, 15 May 2025 by Migolan (no edit summary) → Revision as of 12:32, 15 May 2025 by Migolan (no edit summary)
Line 111:
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_i}</math> as the policy function.
# Compute the policy gradient estimate: <math>g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) \right]</math>
# Update the policy by gradient ascent: <math>\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i</math>
Here, <math>\alpha_i</math> is the learning rate at update step <math>i</math>.

== Variance reduction ==
Line 123:
\Big|S_0 = s_0 \right]</math> for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.

The algorithm uses the modified gradient estimator<math display="block">g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\left(\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) - b_i(S_{t,n})\right) \right]</math>and the original REINFORCE algorithm is the special case where <math>b_i \equiv 0</math>.

=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_i</math> is chosen well, such that <math display="inline">b_i(S_t) \approx \mathbb{E}_{\pi_{\theta_i}}\left[\sum_{\tau \in t:T} (\gamma^\tau R_\tau) \Big| S_t\right] = \gamma^t V^{\pi_{\theta_i}}(S_t)</math>, this can significantly decrease the variance of the gradient estimate.
That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_i}}(S_t)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - \gamma^t V^{\pi_\theta}(S_t)\right) \Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_i}</math> updates, the value function <math>V^{\pi_{\theta_i}}(S_t)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.

The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t\mid S_t) \cdot Q^{\pi_\theta}(S_t, A_t) \Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
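
As an illustration of the baseline-corrected weight <math display="inline">\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - \gamma^t b(S_t)</math> that multiplies each score term, the following is a minimal Python sketch for a single trajectory. The function names <code>discounted_rewards_to_go</code> and <code>baseline_weights</code> are hypothetical and chosen only for this example, and the baseline values are assumed to be supplied by the caller rather than learned by a critic.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return G[t] = sum over tau >= t of gamma**tau * rewards[tau]."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last time step backwards.
    for t in reversed(range(len(rewards))):
        running += gamma ** t * rewards[t]
        out[t] = running
    return out


def baseline_weights(rewards, baseline_values, gamma):
    """Return Psi[t] = G[t] - gamma**t * b(S_t) for one trajectory.

    baseline_values[t] is the (assumed, caller-supplied) value b(S_t).
    """
    g = discounted_rewards_to_go(rewards, gamma)
    return [g[t] - gamma ** t * baseline_values[t] for t in range(len(g))]


# Illustrative data: a three-step trajectory with gamma = 0.5.
rewards = [1.0, 0.0, 2.0]
# Exact values along this particular trajectory: V(S_t) = sum gamma**(tau-t) R_tau.
exact_values = [1.5, 1.0, 2.0]
weights = baseline_weights(rewards, exact_values, 0.5)
```

In this deterministic toy trajectory the baseline equals the realized return at every step, so every weight is zero. With a stochastic environment or policy the weights would instead fluctuate around zero, which is the source of the variance reduction.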
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t\mid S_t) \cdot A^{\pi_\theta}(S_t, A_t) \Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J(\theta)</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq t \leq T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t) \cdot \Psi_t \Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_t</math> is any linear sum of the following terms:

* <math display="inline">\sum_{0 \leq \tau\leq T} (\gamma^\tau R_\tau)</math>: never used.
* <math display="inline">\gamma^t\sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^t \sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau) - b(S_t) </math>: used by the REINFORCE with baseline algorithm.
* <math display="inline">\gamma^t \left(R_t + \gamma V^{\pi_\theta}( S_{t+1}) - V^{\pi_\theta}( S_{t})\right)</math>: 1-step TD learning.
* <math display="inline">\gamma^t Q^{\pi_\theta}(S_t, A_t)</math>.
* <math display="inline">\gamma^t A^{\pi_\theta}(S_t, A_t)</math>.

Some more possible <math display="inline">\Psi_t</math> are as follows, with very similar proofs.

* <math display="inline">\gamma^t \left(R_t + \gamma R_{t+1} + \gamma^2 V^{\pi_\theta}( S_{t+2}) - V^{\pi_\theta}( S_{t})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^t \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_{t})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^t \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1}\cdot \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_{t})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is an exponentially weighted average of the n-step TD terms, with weights <math display="inline">(1-\lambda)\lambda^{n-1}</math> that sum to 1.

== Natural policy gradient ==

Policy gradient method: Difference between revisions