Policy gradient method: Difference between revisions

}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
 
 
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_i}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) \right]</math>
# Update the policy by gradient ascent: <math>\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i</math>

Here, <math>\alpha_i</math> is the learning rate at update step <math>i</math>.
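The loop above can be sketched for a toy single-state problem. Everything beyond the estimator itself is an illustrative assumption: the categorical softmax policy, the trajectory format (lists of (action, reward) pairs), and the toy data.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectories, gamma=0.99):
    """Step 2 of the loop: Monte Carlo estimate of the policy gradient.

    theta: logits of a categorical policy (toy single-state case).
    trajectories: list of rollouts, each a list of (action, reward) pairs.
    """
    g = np.zeros_like(theta)
    for traj in trajectories:
        for t, (a, _) in enumerate(traj):
            # gradient of log softmax: one_hot(a) - pi_theta
            grad_log_pi = -softmax(theta)
            grad_log_pi[a] += 1.0
            # discounted return from time t onward (gamma^tau convention)
            ret = sum(gamma**tau * r for tau, (_, r) in enumerate(traj) if tau >= t)
            g += grad_log_pi * ret
    return g / len(trajectories)

# Step 3: gradient ascent with learning rate alpha_i
theta = np.zeros(2)
trajectories = [[(1, 1.0)], [(0, 0.0)]]    # action 1 earned reward, action 0 did not
theta = theta + 0.1 * reinforce_gradient(theta, trajectories)
```

After one update, the logit of the rewarded action exceeds that of the unrewarded one, so the policy shifts probability toward it.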
 
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of '''[[variance reduction]]'''.
 
=== REINFORCE with baseline ===
\Big|S_0 = s_0 \right]</math>for any function <math>b: \text{States} \to \R</math>. This can be proven by applying the previous lemma.
 
The algorithm uses the modified gradient estimator<math display="block">g_i \leftarrow
\frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}| S_{t,n})\left(\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) - b_i(S_{t,n})\right) \right]</math> and the original REINFORCE algorithm is the special case where <math>b_i \equiv 0</math>.
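A one-step toy experiment makes the effect concrete (the fixed uniform policy, the reward distribution centred at 10, and the constant baseline are all assumptions for illustration): subtracting a baseline close to the expected return leaves the estimator's mean essentially unchanged while shrinking its spread.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])              # fixed uniform policy over two actions

def grad_estimate(b):
    """Single-sample gradient estimate with baseline b (one step, one state)."""
    a = rng.choice(2, p=pi)
    r = 10.0 + rng.normal()            # reward centred far from zero
    grad_log_pi = -pi.copy()           # gradient of log softmax at uniform logits
    grad_log_pi[a] += 1.0
    return grad_log_pi * (r - b)

plain = np.array([grad_estimate(0.0) for _ in range(2000)])     # b = 0: plain REINFORCE
based = np.array([grad_estimate(10.0) for _ in range(2000)])    # b = expected reward
# Both estimators have (near-)zero mean, but the baseline cuts the spread sharply.
print(plain.std(axis=0).max(), based.std(axis=0).max())
```

Because the rewards hover around 10, the plain estimator's per-sample magnitude is dominated by that offset; the baseline removes it without biasing the gradient.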
 
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_i</math> is chosen well, such that <math display="inline">b_i(S_t) \approx \mathbb E\left[\sum_{\tau \in t:T} \gamma^\tau R_\tau \Big| S_t\right] = \gamma^t V^{\pi_{\theta_i}}(S_t)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_i}}(S_t)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t| S_t)\left(\sum_{\tau \in t:T} (\gamma^\tau R_\tau) - \gamma^t V^{\pi_\theta}(S_t)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_i}</math> updates, the value function <math>V^{\pi_{\theta_i}}(S_t)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
 
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot Q^{\pi_\theta}(S_t, A_t)
\Big|S_0 = s_0 \right]</math> by a similar argument using the tower law.
 
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \gamma^t \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot A^{\pi_\theta}(S_t, A_t)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J(\theta)</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq t \leq T} \nabla_\theta\ln\pi_\theta(A_t| S_t)
\cdot \Psi_t
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_t</math> is any linear sum of the following terms:
 
* <math display="inline">\sum_{0 \leq \tau\leq T} (\gamma^\tau R_\tau)</math>: never used.
* <math display="inline">\gamma^t\sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau)</math>: used by the REINFORCE algorithm.
* <math display="inline">\gamma^t \sum_{t \leq \tau\leq T} (\gamma^{\tau-t} R_\tau) - b(S_t) </math>: used by the REINFORCE with baseline algorithm.
* <math display="inline">\gamma^t \left(R_t + \gamma V^{\pi_\theta}( S_{t+1}) - V^{\pi_\theta}( S_{t})\right)</math>: 1-step TD learning.
* <math display="inline">\gamma^t Q^{\pi_\theta}(S_t, A_t)</math>.
* <math display="inline">\gamma^t A^{\pi_\theta}(S_t, A_t)</math>.
 
Some more possible <math display="inline">\Psi_t</math> are as follows, with very similar proofs.
 
* <math display="inline">\gamma^t \left(R_t + \gamma R_{t+1} + \gamma^2 V^{\pi_\theta}( S_{t+2}) - V^{\pi_\theta}( S_{t})\right)</math>: 2-step TD learning.
* <math display="inline">\gamma^t \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_{t})\right)</math>: n-step TD learning.
* <math display="inline">\gamma^t \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{t+k} + \gamma^n V^{\pi_\theta}( S_{t+n}) - V^{\pi_\theta}( S_{t})\right)</math>: TD(λ) learning, also known as '''GAE (generalized advantage estimate)'''.<ref>{{Citation |last1=Schulman |first1=John |title=High-Dimensional Continuous Control Using Generalized Advantage Estimation |date=2018-10-20 |arxiv=1506.02438 |last2=Moritz |first2=Philipp |last3=Levine |first3=Sergey |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter}}</ref> This is obtained by an exponentially decaying average of the n-step TD learning ones.
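In implementations, the λ-weighted combination above is usually computed not term by term but through the one-step TD errors <math>\delta_t = R_t + \gamma V(S_{t+1}) - V(S_t)</math>, since it telescopes into <math>\sum_{k\ge 0}(\gamma\lambda)^k \delta_{t+k}</math>. A minimal sketch follows; as is common in practice, the <math>\gamma^t</math> prefactor from the list above is dropped, and the terminal value is assumed to be supplied as the last entry of <code>values</code>.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates via the backward TD-error recursion.

    values must have len(rewards) + 1 entries; the last one is the value of
    the state after the final reward (0 for a terminal state).
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * lam * acc   # sum_k (gamma * lam)^k * delta_{t+k}
        adv[t] = acc
    return adv
```

With λ = 0 this reduces to the 1-step TD error, and with λ = 1 to the full Monte Carlo advantage, matching the two extremes of the list above.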
 
== Natural policy gradient ==
 
=== Motivation ===
Standard policy gradient updates <math>\theta_{i+1} = \theta_i + \alpha \nabla_\theta J(\theta_i)</math> solve a constrained optimization problem:<math display="block">
\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\|\theta_{i+1} - \theta_i\|\leq \alpha \cdot \|\nabla_\theta J(\theta_i)\|
\end{cases}
</math>
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint <math>\|\theta_{i+1} - \theta_i\| </math> introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a [[Kullback–Leibler divergence]] (KL) constraint:<math display="block">\begin{cases}
\max_{\theta_{i+1}} J(\theta_i) + (\theta_{i+1} - \theta_i)^T \nabla_\theta J(\theta_i)\\
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}</math>where the KL divergence between two policies is '''averaged''' over the state distribution under policy <math>\pi_{\theta_i}</math>. That is,<math display="block">\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) := \mathbb E_{s \sim \pi_{\theta_i}}[D_{KL}( \pi_{\theta_{i+1}}(\cdot | s) \| \pi_{\theta_i}(\cdot | s) )]</math> This ensures updates are invariant to invertible affine parameter transformations.
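For concreteness, the averaged KL can be estimated from states sampled under the old policy. In this sketch the categorical policies (callables returning probability vectors) and the sampled state list are hypothetical stand-ins.

```python
import numpy as np

def kl_categorical(p, q):
    """D_KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def averaged_kl(policy_new, policy_old, states):
    """Monte Carlo estimate of the state-averaged KL divergence."""
    return float(np.mean([kl_categorical(policy_new(s), policy_old(s))
                          for s in states]))
```

Identical policies give an averaged KL of exactly zero, and the quantity grows as the new policy's action distributions drift from the old ones.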
 
=== Fisher information approximation ===
For small <math>\epsilon</math>, the KL divergence is approximated by the [[Fisher information metric]]:<math display="block">
\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \approx \frac{1}{2} (\theta_{i+1} - \theta_i)^T F(\theta_i) (\theta_{i+1} - \theta_i)
</math>where <math>F(\theta)</math> is the [[Fisher information matrix]] of the policy, defined as:<math display="block">
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\left[ \nabla_\theta \ln \pi_\theta(a|s) \left(\nabla_\theta \ln \pi_\theta(a|s)\right)^T \right]
</math> This transforms the problem into a problem in [[quadratic programming]], yielding the natural policy gradient update:<math display="block">
\theta_{i+1} = \theta_i + \alpha F(\theta_i)^{-1} \nabla_\theta J(\theta_i)
</math>The step size <math>\alpha</math> is typically adjusted to maintain the KL constraint, with <math display="inline">\alpha \approx \sqrt{\frac{2\epsilon}{(\nabla_\theta J(\theta_i))^T F(\theta_i)^{-1} \nabla_\theta J(\theta_i)}}</math>.
 
Inverting <math>F(\theta)</math> is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations often use approximations.
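In practice the update direction is obtained by solving the linear system <math>F x = \nabla_\theta J</math> rather than forming <math>F^{-1}</math> explicitly. A sketch, in which the small damping term added to <math>F</math> is an assumption for numerical stability rather than part of the formula above:

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, epsilon=0.01, damping=1e-8):
    """One natural-gradient update without explicitly inverting F."""
    F = fisher + damping * np.eye(len(theta))    # damping keeps F well-conditioned
    x = np.linalg.solve(F, grad)                 # x = F^{-1} g via a linear solve
    alpha = np.sqrt(2.0 * epsilon / (grad @ x))  # step size saturating the KL bound
    return theta + alpha * x
```

When <math>F</math> is the identity, this reduces to an ordinary gradient step whose length is set by the KL radius.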
 
== Trust Region Policy Optimization (TRPO) ==
{{Anchor|TRPO}}
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified in a coordinate-free way:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_i)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_i}) \leq \epsilon
\end{cases}
</math>where
* <math>L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)} A^{\pi_{\theta_i}}(s, a) \right]</math> is the '''surrogate advantage''', measuring the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_i}</math>.
* <math>\epsilon</math> is the trust region radius.
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \theta_i) = \mathbb{E}_{s, a \sim \pi_{\theta_i}}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_i}(a|s)}\Psi^{\pi_{\theta_i}}(s, a) \right]</math>where <math>\Psi</math> is any linear sum of the previously mentioned type. Indeed, OpenAI recommended using the Generalized Advantage Estimate, instead of the plain advantage <math>A^{\pi_\theta}</math>.
 
The surrogate advantage <math>L(\theta, \theta_i)</math> approximates the true objective: at <math>\theta = \theta_i</math>, its gradient <math>\nabla_\theta L(\theta, \theta_i)</math> equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_i)</math>However, when <math>\theta \neq \theta_i</math>, this is not necessarily true. Thus it is a "surrogate" of the real objective.
 
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_i</math>:<math display="block">
\begin{aligned}
L(\theta, \theta_i) &\approx g^T (\theta - \theta_i), \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) &\approx \frac{1}{2} (\theta - \theta_i)^T F (\theta - \theta_i),
\end{aligned}
</math>
where:
* <math>g = \nabla_\theta L(\theta, \theta_i) \big|_{\theta = \theta_i}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_i}) \big|_{\theta = \theta_i}</math> is the Fisher information matrix.
 
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}} F^{-1} g.
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it by two modifications:
 
* Use the [[conjugate gradient method]] to solve for <math>x \approx F^{-1} g</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it backtracks the step size to ensure the KL constraint and policy improvement. That is, it tests each of the following test-solutions<math display="block">
\theta_{i+1} = \theta_i + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_i + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> until it finds one that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{i+1}} \| \pi_{\theta_i}) \leq \epsilon </math> and yields an improvement <math>
L(\theta_{i+1}, \theta_i) \geq L(\theta_i, \theta_i)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.
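The backtracking step can be sketched as follows. The callables <code>surrogate</code> and <code>kl</code>, and the fallback of returning the old parameters when no candidate is accepted, are illustrative assumptions.

```python
import numpy as np

def trpo_line_search(theta_i, x, F, surrogate, kl, epsilon=0.01,
                     alpha=0.8, max_backtracks=10):
    """Backtracking line search along the (approximate) natural-gradient direction x."""
    beta = np.sqrt(2.0 * epsilon / (x @ F @ x))  # largest step under the quadratic KL model
    L_old = surrogate(theta_i)
    for k in range(max_backtracks):
        theta_new = theta_i + (alpha ** k) * beta * x
        # accept the first candidate that respects the trust region and improves L
        if kl(theta_new) <= epsilon and surrogate(theta_new) >= L_old:
            return theta_new
    return theta_i   # no acceptable step found: keep the old parameters
```

With a linear surrogate and an exact quadratic KL, the search accepts the first shrunken step that falls inside the trust region.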