Policy gradient method

 
== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot \mid s)</math>.
 
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a \mid s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a \mid s) \,\mathrm{d}a = 1</math>.
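For the discrete case, the normalization constraint is satisfied automatically by a softmax parameterization. The following is a minimal illustrative sketch (the linear-softmax form and all names are assumptions, not from the article) of a policy <math>\pi_\theta(\cdot \mid s)</math> over a discrete action space:

```python
import numpy as np

# Illustrative sketch: a linear-softmax policy pi_theta(a | s) for a
# discrete action space. theta holds one weight vector per action; the
# softmax guarantees sum_a pi_theta(a | s) = 1 by construction.
def softmax_policy(theta, s):
    """Return the action distribution pi_theta(. | s)."""
    logits = theta @ s                    # one logit per action
    z = np.exp(logits - logits.max())     # subtract max for stability
    return z / z.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))           # 3 actions, 4 state features
s = rng.normal(size=4)
probs = softmax_policy(theta, s)
print(probs.sum())                        # sums to 1 up to rounding
```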
 
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>\gamma</math> is the [[discount factor]], <math>R_t</math> is the reward at step <math>t</math>, <math>s_0</math> is the initial state, and <math>T</math> is the time horizon.
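The objective can be estimated by averaging the discounted return over sampled episodes. A minimal sketch, with made-up reward sequences standing in for rollouts of some policy <math>\pi_\theta</math>:

```python
import numpy as np

# Illustrative sketch: J(theta) is estimated by averaging the discounted
# return sum_t gamma^t * R_t over sampled episodes. The reward lists here
# are made-up data standing in for actual rollouts.
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

episodes = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # R_0..R_T per episode
gamma = 0.9
J_hat = np.mean([discounted_return(r, gamma) for r in episodes])
print(J_hat)   # (1.81 + 1.71) / 2 = 1.76
```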
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t \mid S_t)\; \sum_{t \in 0:T} (\gamma^t R_t)
\Big|S_0 = s_0
\right]</math>which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_t\mid S_t)\sum_{\tau \in t:T} (\gamma^\tau R_\tau)
\Big|S_0 = s_0 \right]
</math>
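The causality trick replaces the full discounted return with the reward-to-go <math>\sum_{\tau \in t:T} \gamma^\tau R_\tau</math>, since an action cannot influence rewards received before it was taken. A short sketch (illustrative names) computing this quantity for every step of an episode by a single backward pass:

```python
import numpy as np

# Sketch of the reward-to-go used by the "causality trick": for each step
# t it returns sum_{tau >= t} gamma^tau * R_tau (note the gamma^tau
# weighting, matching the expression in the article), via one backward pass.
def rewards_to_go(rewards, gamma):
    out = np.zeros(len(rewards))
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = gamma**t * rewards[t] + acc
        out[t] = acc
    return out

print(rewards_to_go([1.0, 1.0, 1.0], 0.9))   # [2.71, 1.71, 0.81]
```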
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
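The estimator above can be sketched in code. The linear-softmax policy and all names below are illustrative assumptions; for that policy the score function has the closed form <math>\nabla_\theta \ln \pi_\theta(a \mid s) = (\mathbf{1}_a - \pi_\theta(\cdot \mid s)) \otimes s</math>:

```python
import numpy as np

# Illustrative sketch of the Monte Carlo policy-gradient estimator for a
# linear-softmax policy (an assumption, not the article's setup).
def policy(theta, s):
    logits = theta @ s
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score(theta, s, a):
    """Closed-form score: grad_theta ln pi(a | s) = (onehot(a) - pi) outer s."""
    p = policy(theta, s)
    onehot = np.eye(len(p))[a]
    return np.outer(onehot - p, s)

def pg_estimate(theta, trajectories, gamma):
    """trajectories: list of rollouts, each a list of (s, a, r) tuples."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        for t, (s, a, _) in enumerate(traj):
            # reward-to-go: sum_{tau >= t} gamma^tau * R_tau
            rtg = sum(gamma**tau * rewards[tau] for tau in range(t, len(traj)))
            g += score(theta, s, a) * rtg
    return g / len(trajectories)
```

Averaging over more trajectories <math>N</math> reduces the variance of the estimate without introducing bias.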
 
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_t \mid S_t)</math> can be interpreted as the direction in parameter space that increases the probability of taking action <math>A_t</math> in state <math>S_t</math>. The policy gradient is then a [[weighted average]] of these directions, weighted by the reward signals: if taking a certain action in a certain state is associated with high reward, that direction is strongly reinforced, and vice versa.
 
=== Algorithm ===
 
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_i}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_i \leftarrow \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_{\theta_i}\ln\pi_{\theta_i}(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^\tau R_{\tau,n}) \right]</math>
# Update the policy by gradient ascent: <math>\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i</math>
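The loop above can be sketched end to end on a toy problem. The two-armed bandit below (a single state, <math>T = 0</math>) is an illustrative assumption, not from the article; arm 1 pays 1 and arm 0 pays 0, so gradient ascent should drive the probability of arm 1 toward 1:

```python
import numpy as np

# Runnable sketch of the algorithm above on a two-armed bandit with a
# softmax policy over logits theta (all names and values illustrative).
rng = np.random.default_rng(0)
theta = np.zeros(2)                      # policy logits
alpha, N = 0.5, 32                       # step size, rollouts per update

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for step in range(50):
    g = np.zeros_like(theta)
    for _ in range(N):                   # 1. rollout N one-step trajectories
        a = rng.choice(2, p=pi(theta))
        r = float(a == 1)                # arm 1 pays 1, arm 0 pays 0
        g += (np.eye(2)[a] - pi(theta)) * r   # 2. score * reward
    theta += alpha * g / N               # 3. gradient ascent
print(pi(theta)[1])                      # close to 1 after training
```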