I fixed the formula so it would express "the total reward from time $ |
(3 intermediate revisions by 3 users not shown)
Line 102:
}}
{{hidden end}}Thus, we have an [[unbiased estimator]] of the policy gradient:<math display="block">
\nabla_\theta J(\theta) \approx \frac 1N \sum_{n=1}^N \left[\sum_{t\in 0:T} \nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})\sum_{\tau \in t:T} (\gamma^{\tau-t} R_{\tau ,n}) \right]
</math>where the index <math>n</math> ranges over <math>N</math> rollout trajectories using the policy <math>\pi_\theta </math>.
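The estimator above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names and the input layout (one array of score vectors <math>\nabla_\theta\ln\pi_\theta(A_{t,n}\mid S_{t,n})</math> per rollout) are assumptions made for the example, and computing the score vectors themselves is left to whatever policy parameterization is in use.

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma):
    """Return G with G[t] = sum over tau >= t of gamma**(tau - t) * rewards[tau]."""
    G = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Work backwards so each entry reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_gradient(score_trajs, reward_trajs, gamma):
    """Average over rollouts of sum_t score[t] * G[t].

    score_trajs[n][t] is assumed to hold grad_theta ln pi_theta(A_t | S_t)
    for rollout n (a hypothetical input layout chosen for this sketch).
    """
    per_traj = []
    for scores, rewards in zip(score_trajs, reward_trajs):
        G = discounted_rewards_to_go(rewards, gamma)
        scores = np.asarray(scores, dtype=float)
        per_traj.append((scores * G[:, None]).sum(axis=0))
    return np.mean(per_traj, axis=0)
```

Computing the inner sum backwards in time costs <math>O(T)</math> per rollout rather than <math>O(T^2)</math>.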
Line 117:
== Variance reduction ==
REINFORCE is an '''on-policy''' algorithm, meaning that the trajectories used for the update must be sampled from the current policy <math>\pi_\theta</math>. This can lead to high variance in the updates, as the returns <math>R(\tau)</math> can vary significantly between trajectories. Many variants of REINFORCE
=== REINFORCE with baseline ===
Line 128:
=== Actor-critic methods ===
{{Main|Actor-critic algorithm}}
If <math display="inline">b_i</math> is chosen well, such that <math display="inline">b_i(S_t) \approx \sum_{\tau \in t:T} (\gamma^\tau R_\tau) = \gamma^
\Big|S_0 = s_0 \right]</math> Note that, as the policy <math>\pi_{\theta_i}</math> is updated, the value function <math>V^{\pi_{\theta_i}}(S_t)</math> changes as well, so the baseline should be updated along with it. One common approach is to train a separate function that estimates the value function and to use it as the baseline. This is one of the [[actor-critic method]]s, where the policy function is the actor and the value function is the critic.
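As a rough illustration of using a learned baseline, the sketch below fits a linear function of state features to observed returns by least squares and subtracts its prediction from each return. The linear model, the feature matrix, and both function names are assumptions made for this example; in practice the critic is often a neural network trained by other means.

```python
import numpy as np

def fit_linear_baseline(features, returns):
    """Least-squares fit of returns ~ features @ w (a stand-in for a critic)."""
    w, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return w

def baselined_weights(features, returns, w):
    """Return minus baseline: the weights that multiply the score terms."""
    return returns - features @ w
```

The closer the fitted baseline is to the true expected return from each state, the smaller these weights become in magnitude, which is what reduces the variance of the gradient estimate.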
Line 192:
Like natural policy gradient, TRPO iteratively updates the policy parameters <math>\theta</math> by solving a constrained optimization problem specified coordinate-free:<math display="block">
\begin{cases}
\max_{\theta} L(\theta, \theta_t)\\
\bar{D}_{KL}(\pi_{\theta} \| \pi_{\theta_{t}}) \leq \epsilon
\end{cases}
</math>where
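* <math>L(\theta, \theta_t) = \mathbb{E}_{(s, a) \sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a | s)}{\pi_{\theta_t}(a | s)} A^{\pi_{\theta_t}}(s, a)\right]</math> is the surrogate advantage, which measures the performance of <math>\pi_\theta</math> relative to the old policy <math>\pi_{\theta_t}</math>.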
* <math>\epsilon</math> is the trust region radius.
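The two quantities in this problem are usually estimated from samples collected under the old policy. The sketch below assumes the common importance-weighted form of the surrogate advantage (probability ratio times advantage, averaged over samples) and assumes discrete action distributions for the KL term. The function names and argument layout are hypothetical.

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, advantages):
    """Sample mean of (pi_new / pi_old) * advantage.

    logp_new, logp_old: log-probabilities of the sampled actions under the
    candidate and old policies; advantages: advantage estimates under the
    old policy. All three are 1-D arrays of equal length.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(advantages)))

def mean_kl(p, q):
    """Average over states of KL(p || q) for categorical distributions.

    p, q: arrays of shape (num_states, num_actions) with strictly
    positive rows that sum to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

To match the argument order written in the constraint above, the call would be `mean_kl(probs_new, probs_old)`. When the candidate policy equals the old one, the ratio is 1 everywhere and the KL term is 0.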
Note that in general, other surrogate advantages are possible:<math display="block">L(\theta, \
The surrogate advantage <math>L(\theta, \theta_t)
Line 204:
\nabla_\theta L(\theta, \theta_t)
</math>''' equals the policy gradient derived from the advantage function:
<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[\nabla_\theta \ln \pi_\theta(a | s) \cdot A^{\pi_\theta}(s, a) \right] = \nabla_\theta L(\theta, \theta_t)</math>However, when <math>\theta \neq \theta_t</math>, this equality does not necessarily hold.
As with natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and KL divergence using Taylor expansions around <math>\theta_t</math>:<math display="block">
\begin{aligned}
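L(\theta, \theta_t) &\approx g^T (\theta - \theta_t) \\
\bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) &\approx \frac{1}{2} (\theta - \theta_t)^T F (\theta - \theta_t)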
\end{aligned}
</math>
where:
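* <math>g = \nabla_\theta L(\theta, \theta_t) \big|_{\theta = \theta_t}</math> is the policy gradient.
* <math>F = \nabla_\theta^2 \bar{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\theta_t}) \big|_{\theta = \theta_t}</math> is the Fisher information matrix.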
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^T F^{-1} g}} F^{-1} g
</math>So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it with two modifications:
Line 225:
</math> in <math>Fx = g</math> iteratively without explicit matrix inversion.
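* Use [[backtracking line search]] to ensure the trust-region constraint is satisfied. Specifically, it shrinks the step size until both the KL constraint and policy improvement hold. That is, it tests each of the following candidate solutions in turn<math display="block">
\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \theta_t + \alpha^2 \sqrt{\frac{2\epsilon}{x^T F x}} x, \; \dots
</math> until it finds one that both satisfies the KL constraint <math>\bar{D}_{KL}(\pi_{\theta_{t+1}} \| \pi_{\theta_t}) \leq \epsilon</math> and does not decrease the surrogate advantage: <math display="block">
L(\theta_{t+1}, \theta_t) \geq L(\theta_t, \theta_t)
</math>. Here, <math>\alpha \in (0,1)</math> is the backtracking coefficient.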
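The two modifications can be sketched together as below. This is an illustrative outline under stated assumptions, not the reference TRPO implementation: `fvp`, `surrogate`, and `kl` are hypothetical callables supplied by the user (a Fisher-vector product <math>v \mapsto Fv</math>, the surrogate advantage at a candidate <math>\theta</math>, and the mean KL divergence at a candidate <math>\theta</math>), and the iteration limits are arbitrary defaults.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    g = np.asarray(g, dtype=float)
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0 initially
    p = r.copy()          # current search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        step = rs / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def backtracking_step(theta, x, fvp, surrogate, kl, eps, alpha=0.5, max_tries=10):
    """Try the full step, then alpha, alpha**2, ... times it.

    Returns the first candidate that satisfies kl(candidate) <= eps and
    surrogate(candidate) >= surrogate(theta); returns theta unchanged if
    no candidate is accepted.
    """
    curvature = x @ fvp(x)
    if curvature <= 0:
        return theta      # degenerate direction, keep the old parameters
    full = np.sqrt(2.0 * eps / curvature) * x
    base = surrogate(theta)
    for j in range(max_tries):
        cand = theta + (alpha ** j) * full
        if kl(cand) <= eps and surrogate(cand) >= base:
            return cand
    return theta
```

Because the conjugate gradient routine only calls `fvp`, the matrix <math>F</math> never has to be formed or inverted explicitly, which is the point of the first modification.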