Policy gradient method: Difference between revisions

 
=== Actor-critic methods ===
If <math display="inline">b_t</math> is chosen well, such that <math display="inline">b_t(S_j) \approx \mathbb{E}\left[\sum_{i \in j:T} (\gamma^i R_i) \Big| S_j\right] = \gamma^j V^{\pi_{\theta_t}}(S_j)</math>, this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the '''value function''' <math>V^{\pi_{\theta_t}}(S_j)</math> as possible, approaching the ideal of:<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{j\in 0:T} \nabla_\theta\ln\pi_\theta(A_j| S_j)\left(\sum_{i \in j:T} (\gamma^i R_i) - \gamma^j V^{\pi_\theta}(S_j)\right)
\Big|S_0 = s_0 \right]</math>Note that, as the policy <math>\pi_{\theta_t}</math> updates, the value function <math>V^{\pi_{\theta_t}}(S_j)</math> updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
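A minimal sketch of this estimator, assuming a tabular softmax policy (the actor) and a state-value table <code>v</code> (the critic) used as the baseline; the function name and tabular setting are illustrative, not from the sources above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_estimate(theta, v, states, actions, rewards, gamma=0.99):
    """One sample of the baseline-corrected policy gradient.

    theta: (n_states, n_actions) policy logits (the actor)
    v:     (n_states,) learned value estimates (the critic / baseline)
    Returns the gradient estimate and the discounted returns-to-go,
    which can serve as regression targets when updating the critic.
    """
    T = len(rewards)
    # returns-to-go: G_j = sum_{i >= j} gamma^(i-j) R_i, so that the
    # article's term  sum_{i in j:T} gamma^i R_i - gamma^j V(S_j)
    # equals gamma^j * (G_j - V(S_j))
    returns = np.zeros(T)
    g = 0.0
    for j in reversed(range(T)):
        g = rewards[j] + gamma * g
        returns[j] = g
    grad = np.zeros_like(theta)
    for j, (s, a) in enumerate(zip(states, actions)):
        pi = softmax(theta[s])
        glog = -pi          # gradient of log softmax w.r.t. the logits...
        glog[a] += 1.0      # ...for the action actually taken
        # subtracting the baseline v[s] reduces variance without adding bias
        grad[s] += gamma**j * glog * (returns[j] - v[s])
    return grad, returns
```

In practice the critic is then updated toward the returned targets (or TD targets) after each policy update, since the value function tracks the changing policy.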
 
The '''Q-function''' <math>Q^\pi</math> can also be used as the critic, since<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot Q^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>
Subtracting the value function as a baseline, we find that the '''advantage function''' <math>A^{\pi}(S,A) = Q^{\pi}(S,A) - V^{\pi}(S)</math> can be used as the critic as well:<math display="block">\nabla_\theta J(\theta)= E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \gamma^j \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot A^{\pi_\theta}(S_j, A_j)
\Big|S_0 = s_0 \right]</math>In summary, there are many unbiased estimators for <math display="inline">\nabla_\theta J(\theta)</math>, all in the form of: <math display="block">\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j)
\cdot \Psi_j
\Big|S_0 = s_0 \right]</math> where <math display="inline">\Psi_j</math> is any linear sum of the following terms:
 
* <math display="inline">\sum_{0 \leq i\leq T} (\gamma^i R_i)</math>: never used.
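All members of this family share the same score-function form, differing only in the choice of <math display="inline">\Psi_j</math>. A minimal sketch, assuming a tabular softmax policy (the function name and setting are illustrative):

```python
import numpy as np

def psi_gradient(theta, states, actions, psi):
    """Generic score-function estimator:
        sum_j grad log pi(A_j | S_j) * Psi_j
    for a tabular softmax policy with logits theta (n_states, n_actions).
    Any of the Psi_j choices listed above (total reward, reward-to-go,
    advantage, ...) can be plugged in without changing this code.
    """
    grad = np.zeros_like(theta)
    for s, a, w in zip(states, actions, psi):
        p = np.exp(theta[s] - theta[s].max())
        p /= p.sum()
        g = -p              # gradient of log softmax w.r.t. the logits...
        g[a] += 1.0         # ...for the chosen action
        grad[s] += w * g
    return grad
```

The unbiasedness of baseline subtraction follows because the score function averages to zero under the policy: summing <code>pi(a) * psi_gradient(theta, [s], [a], [c])</code> over all actions gives exactly zero for any constant <code>c</code>.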
=== Other methods ===
Other important examples of policy gradient methods include [[Trust region policy optimization|Trust Region Policy Optimization]] (TRPO)<ref name=":3">{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |___location=Lille, France |publisher=JMLR.org |pages=1889–1897}}</ref> and [[Proximal policy optimization|Proximal Policy Optimization]] (PPO).<ref name=":0">{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}</ref>
 
== Natural policy gradient ==
The natural policy gradient method is a variant of the policy gradient method, proposed by ([[Sham Kakade|Kakade]], 2001).<ref>{{Cite journal |last=Kakade |first=Sham M |date=2001 |title=A Natural Policy Gradient |url=https://proceedings.neurips.cc/paper_files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=14}}</ref> The key idea is that the standard policy gradient methods, given above, involve optimizing <math>J(\theta)</math> by taking its gradient <math>\nabla_\theta J(\theta)</math>. However, this gradient depends on the particular choice of the coordinate <math>\theta</math>. So, for example, if we were to rescale the coordinates to <math>\theta' = 2\theta </math>, then we would obtain a new policy gradient <math>\nabla_{\theta'} J(\theta') = \frac 12 \nabla_\theta J(\theta) </math>.
 
Thus, the policy gradient method is "unnatural" in the geometric sense, since its updates depend on the choice of coordinates. A "natural" policy gradient would change it so that the policy updates are [[coordinate-free]].
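The coordinate dependence can be checked numerically; the sketch below uses a toy smooth objective standing in for <math>J(\theta)</math> (the objective and helper are illustrative, not part of the method):

```python
import numpy as np

def grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# An arbitrary smooth objective standing in for J(theta).
J = lambda th: np.sin(th[0]) + th[0] * th[1]

theta = np.array([0.4, -1.3])
g_theta = grad(J, theta)

# Reparametrize: theta' = 2 * theta, so J'(theta') = J(theta' / 2).
J2 = lambda tp: J(tp / 2.0)
g_theta2 = grad(J2, 2 * theta)

# By the chain rule, the gradient in the new coordinates is halved,
# so a plain gradient step depends on which parametrization we chose.
assert np.allclose(g_theta2, 0.5 * g_theta, atol=1e-5)
```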
 
== See also ==