Policy gradient method

}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
 
 
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_i}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) \right]</math>
# Update the policy by gradient ascent: <math>\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i</math>
 
Here, <math>\alpha_i</math> is the learning rate at update step <math>i</math>.
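The steps above can be sketched for a tabular softmax policy. This is a toy illustration, not a standard API: the trajectory format, the logit table <code>theta[s, a]</code>, and the function names are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_gradient(trajectories, theta, gamma=0.99):
    """Monte Carlo policy gradient estimate g_i for a tabular softmax
    policy with logits theta[s, a] (hypothetical toy parameterization).
    Each trajectory is a list of (state, action, reward) triples."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        T = len(traj)
        for t, (s, a, _) in enumerate(traj):
            # discounted reward sum from step t, weighted gamma^tau
            # as in the update rule above
            ret = sum(gamma ** tau * traj[tau][2] for tau in range(t, T))
            p = softmax(theta[s])
            grad_log = -p               # d(log pi(a|s)) / d(theta[s, :])
            grad_log[a] += 1.0
            g[s] += grad_log * ret
    return g / len(trajectories)
```

A gradient ascent step is then simply <code>theta += alpha_i * reinforce_gradient(trajectories, theta)</code>.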
 
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of '''[[variance reduction]]'''.
 
=== REINFORCE with baseline ===
A common way for reducing variance is the '''REINFORCE with baseline''' algorithm, based on the following identity:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t| S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - b(S_t)\right)
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
 
The algorithm uses the modified gradient estimator<math display="block">g_i \leftarrow
\frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}| S_{t,n})\left(\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) - b_i(S_{t,n})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b_i \equiv 0</math>.
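To see the variance-reduction effect concretely, the following sketch computes one gradient sample per trajectory in the same hypothetical tabular-softmax setup (names and data format are illustrative assumptions), so the spread of the samples with and without a baseline can be compared directly.

```python
import numpy as np

def grad_log_softmax(theta_s, a):
    """Gradient of log pi(a|s) with respect to the logits theta_s."""
    p = np.exp(theta_s - theta_s.max())
    p /= p.sum()
    g = -p
    g[a] += 1.0
    return g

def per_trajectory_grads(trajectories, theta, baseline, gamma=1.0):
    """One REINFORCE-with-baseline gradient sample per trajectory;
    baseline[s] plays the role of b_i(s)."""
    samples = []
    for traj in trajectories:
        g = np.zeros_like(theta)
        T = len(traj)
        for t, (s, a, _) in enumerate(traj):
            ret = sum(gamma ** tau * traj[tau][2] for tau in range(t, T))
            g[s] += grad_log_softmax(theta[s], a) * (ret - baseline[s])
        samples.append(g)
    return samples
```

In a two-armed bandit where only one action is rewarded, subtracting <math>b(s) = 0.5</math> leaves the mean gradient unchanged but shrinks the spread of the per-trajectory samples.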
 
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_i</math> is chosen well, such that <math display="inline">b_i(S_t) \approx \mathbb E\left[\sum_{\tau \in t:T} \gamma^\tau R_\tau \Big| S_t\right] = \gamma^t V^{\pi_{\theta_i}}(S_t)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_i}}(S_t)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t| S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - \gamma^t V^{\pi_\theta}(S_t)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_i}</math> updates, the value function <math>V^{\pi_{\theta_i}}(S_t)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
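A minimal sketch of training a tabular critic toward observed returns, i.e. Monte Carlo regression targets; the function name and learning-rate choice are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def critic_update(V, trajectory, gamma=0.99, lr=0.1):
    """Move tabular value estimates V[s] toward the observed discounted
    return-to-go from each visited state; V then serves as the baseline."""
    T = len(trajectory)
    for t, (s, _, _) in enumerate(trajectory):
        ret = sum(gamma ** (tau - t) * trajectory[tau][2] for tau in range(t, T))
        V[s] += lr * (ret - V[s])   # running average toward the return
    return V
```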
 
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot Q^{\pi_\theta}(S_t, A_t)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
 
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot A^{\pi_\theta}(S_t, A_t)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J(\theta)</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot \Psi_t
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_t</math> is any linear sum of the following terms:
 
* <math display="inline">\sum_{0 \leq \tau\leq T} (\gamma^\tau R_\tau)</math>: never used.
* <math display="inline">\gamma^t\sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^t \sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau) - b(S_t) </math>: used by the REINFORCE with baseline algorithm.
* <math display="inline">\gamma^t \left(R_t + \gamma V^{\pi_\theta}( S_{t+1}) - V^{\pi_\theta}( S_t)\right)</math>: 1-step TD learning.
* <math display="inline">\gamma^t Q^{\pi_\theta}(S_t, A_t)</math>.
* <math display="inline">\gamma^t A^{\pi_\theta}(S_t, A_t)</math>.
 
Some more possible <math display="inline">\Psi_t</math> are as follows, with very similar proofs.
 
* <math display="inline">\gamma^t \left(R_t + \gamma R_{t+1} + \gamma^2 V^{\pi_\theta}( S_{t+2}) - V^{\pi_\theta}( S_t)\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^t \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_t)\right)</math>: n-step TD learning.
* <math display="inline">\gamma^t \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_t)\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying weighted average of the n-step TD learning ones.
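In practice the exponentially decaying sum defining GAE telescopes into a backward recursion over 1-step TD residuals. A sketch of that recursion (without the overall <math display="inline">\gamma^t</math> weighting shown in the list); the array convention that <code>values</code> carries one extra bootstrap entry for the final state is an assumption:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimate via TD residuals
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), combined as
    A_t = sum_l (gamma * lam)^l * delta_{t+l}.
    `values` must have length len(rewards) + 1."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting λ = 0 recovers the 1-step TD residual, and λ = 1 recovers the full discounted return minus the value baseline.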
 
== Natural policy gradient ==
 
=== Motivation ===
Standard policy gradient updates <math>\theta_{i+1} = \theta_i + \alpha \nabla_\theta J(\theta_i)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\|\theta_{i+1} - \theta_i\|\leq \alpha \cdot \|\nabla_\theta J(\theta_i)\|
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{i+1} - \theta_i\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_i}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) := \mathbb E_{s \sim \pi_{\theta_i}}[D_{KL}( \pi_{\theta_{i+1}}(\cdot | s) \| \pi_{\theta_i}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
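The averaged KL constraint can be estimated from sampled states. A sketch for tabular policies stored as per-state probability arrays (the array shapes and names are assumptions for illustration):

```python
import numpy as np

def mean_kl(pi_old, pi_new, states):
    """Estimate D_bar_KL(pi_new || pi_old): the per-state KL divergence,
    averaged over states sampled while running the old policy.
    pi_old, pi_new: arrays of shape (num_states, num_actions)."""
    kls = [np.sum(pi_new[s] * np.log(pi_new[s] / pi_old[s])) for s in states]
    return float(np.mean(kls))
```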
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \approx \frac{1}{2} (\theta_{i+1} - \theta_i)^T F(\theta_i) (\theta_{i+1} - \theta_i)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the problem into a problem in [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{i+1} = \theta_i + \alpha F(\theta_i)^{-1} \nabla_\theta J(\theta_i)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_i))^T F(\theta_i)^{-1} \nabla_\theta J(\theta_i)}}</math>.
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
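A sketch of the update for a single softmax distribution, where the Fisher matrix has the closed form <math display="inline">\operatorname{diag}(p) - p p^T</math>; the damping term is a common practical assumption, not part of the formulas above:

```python
import numpy as np

def softmax_fisher(theta):
    """Fisher matrix E[s s^T] of one softmax distribution, where
    s = grad log pi(a) = e_a - p; this equals diag(p) - p p^T."""
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

def natural_gradient_step(g, F, epsilon=0.01, damping=1e-8):
    """Natural gradient direction F^{-1} g, with the step size chosen so
    the quadratic KL approximation (1/2) d^T F d equals epsilon."""
    F = F + damping * np.eye(len(g))   # regularize the singular/ill-conditioned F
    x = np.linalg.solve(F, g)          # x = F^{-1} g
    alpha = np.sqrt(2.0 * epsilon / (g @ x))
    return alpha * x
```

By construction the returned step <math display="inline">d</math> satisfies <math display="inline">\tfrac12 d^T F d \approx \epsilon</math>, i.e. it sits on the boundary of the quadratic trust region.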
 
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
 
'''Trust Region Policy Optimization''' (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a [[trust region]] constraint on policy updates.<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> Developed by Schulman et al. in 2015, TRPO improves upon the natural policy gradient method, ensuring stable policy improvements by limiting the KL divergence between successive policies.
 
TRPO builds on the natural policy gradient by incorporating a trust region constraint. The natural gradient descent is theoretically optimal, ''if'' the objective is truly a quadratic function, but this is only an approximation. TRPO's line search and KL constraint attempt to restrict the solution to within a "trust region" in which this approximation does not break down. This makes TRPO more robust in practice, particularly for high-dimensional policies.
 
=== Formulation ===
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_i)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)} A^{\pi_{\theta_i}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_i}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)}\Psi^{\pi_{\theta_i}}(s, a) \right]</math>where <math>\Psi</math> is any linear sum of the previously mentioned terms. Indeed, OpenAI recommended using the generalized advantage estimate, instead of the plain advantage <math>A^{\pi_\theta}</math>.
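The surrogate advantage is straightforward to estimate from samples. A tabular-softmax sketch (the parameterization and the sample format are illustrative assumptions):

```python
import numpy as np

def softmax_probs(theta):
    """Row-wise softmax over action logits theta[s, a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def surrogate_advantage(theta_new, theta_old, samples, advantages):
    """L(theta, theta_i): mean of importance ratio times advantage over
    (state, action) pairs collected under the old policy."""
    p_new = softmax_probs(theta_new)
    p_old = softmax_probs(theta_old)
    ratios = np.array([p_new[s, a] / p_old[s, a] for s, a in samples])
    return float(np.mean(ratios * advantages))
```

At <math display="inline">\theta = \theta_i</math> every ratio equals 1, so the estimate reduces to the mean sampled advantage, which is zero in expectation under the old policy.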
 
The surrogate advantage <math>L(\theta, \theta_i)</math> is designed so that, at <math>\theta = \theta_i</math>, its gradient '''<math display="inline">
\nabla_\theta L(\theta, \theta_i)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_i)</math>However, when <math>\theta \neq \theta_i</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
 
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_i</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_i) &\approx g^T (\theta - \theta_i), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) &\approx \frac{1}{2} (\theta - \theta_i)^T F (\theta - \theta_i),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_i) \big|_{\theta = \theta_i}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) \big|_{\theta = \theta_i}</math> is the Fisher information matrix.
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}} F^{-1} g.
</math>So far, this is essentially the same as natural gradient method. However, TRPO improves upon it by two modifications:
 
* Use the [[conjugate gradient method]] to solve for <math display="inline">x = F^{-1}g</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement. That is, it tests each of the following test-solutions<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> until it finds one that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon </math> and improves the surrogate advantage: <math>
L(\theta_{i+1}, \theta_i) \geq L(\theta_i, \theta_i)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
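The two modifications can be sketched together. The toy surrogate and KL callables in the usage below are stand-ins for the real sampled estimates, and the function names are illustrative:

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products mvp(v) = F v,
    avoiding explicit inversion of F."""
    x = np.zeros_like(g)
    r = g.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        step = rr / (p @ Fp)
        x += step * p
        r -= step * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def backtracking_line_search(theta, full_step, surrogate, mean_kl, epsilon,
                             shrink=0.5, max_tries=10):
    """Shrink the step geometrically until both the KL constraint and
    surrogate improvement hold; keep the old policy if nothing qualifies."""
    L_old = surrogate(theta)
    coef = 1.0
    for _ in range(max_tries):
        candidate = theta + coef * full_step
        if mean_kl(candidate) <= epsilon and surrogate(candidate) >= L_old:
            return candidate
        coef *= shrink
    return theta
```

For a quadratic system the conjugate gradient loop recovers <math display="inline">F^{-1}g</math> exactly in at most as many iterations as there are dimensions, which is why it is preferred over forming and inverting <math display="inline">F</math> for neural-network-sized parameter vectors.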