Content deleted Content added
m →Overview: Partial notation fix |
|||
Line 6:
== Overview ==
In policy-based RL, the actor is a parameterized policy function <math>\pi_\theta</math>, where <math>\theta</math> are the parameters of the actor. The actor takes as argument the state of the environment <math>s</math> and produces a [[probability distribution]] <math>\pi_\theta(\cdot
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a
The goal of policy optimization is to find some <math>\theta</math> that maximizes the expected episodic reward <math>J(\theta)</math>:<math display="block">J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t\in 0:T} \gamma^t R_t \Big| S_0 = s_0 \right]</math>where <math>
\gamma
</math> is the [[discount factor]], <math>
Line 31 ⟶ 29:
=== Policy gradient ===
The '''REINFORCE algorithm''' was the first policy gradient method.<ref>{{Cite journal |last=Williams |first=Ronald J. |date=May 1992 |title=Simple statistical gradient-following algorithms for connectionist reinforcement learning |url=http://link.springer.com/10.1007/BF00992696 |journal=Machine Learning |language=en |volume=8 |issue=3–4 |pages=229–256 |doi=10.1007/BF00992696 |issn=0885-6125}}</ref> It is based on the identity for the policy gradient<math display="block">\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[
\sum_{
\Big|S_0 = s_0
\right]</math>which can be improved via the "causality trick"<ref>{{Cite journal |last1=Sutton |first1=Richard S |last2=McAllester |first2=David |last3=Singh |first3=Satinder |last4=Mansour |first4=Yishay |date=1999 |title=Policy Gradient Methods for Reinforcement Learning with Function Approximation |url=https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref><math display="block">
\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta}\left[\sum_{
\Big|S_0 = s_0 \right]
</math>
Line 106 ⟶ 104:
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{
</math>where the index <math>
The [[Score (statistics)|score function]] <math>\nabla_\theta \ln \pi_\theta (A_t
=== Algorithm ===
Line 115 ⟶ 113:
# Rollout <math>N</math> trajectories in the environment, using <math>\pi_{\theta_t}</math> as the policy function.
# Compute the policy gradient estimation: <math>g_t \leftarrow \frac 1N \sum_{
# Update the policy by gradient ascent: <math>\theta_{t+1} \leftarrow \theta_t + \alpha_t g_t</math>
|