In '''model-free''' deep reinforcement learning algorithms, a policy <math>\pi(a|s)</math> is learned without explicitly modeling the forward dynamics of the environment. A policy can be optimized to maximize returns by directly estimating the policy gradient,<ref name="williams1992"/> but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and have been widely applied.<ref name="schulman2015trpo"/><ref name="schulman2017ppo"/> Another class of model-free deep reinforcement learning algorithms relies on [[dynamic programming]], inspired by [[temporal difference learning]] and [[Q-learning]]. In discrete action spaces, these algorithms usually learn a neural network Q-function <math>Q(s, a)</math> that estimates the expected future returns of taking action <math>a</math> from state <math>s</math>.<ref name="DQN1"/> In continuous action spaces, these algorithms often learn both a value estimate and a policy.<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>
{| class="wikitable sortable" style="font-size: 96%;"
!Algorithm || class=unsortable|Description || class=unsortable|Model || Policy || class=unsortable |Action Space || class=unsortable |State Space ||Operator
|-
! scope="row" | [[Q-learning#Deep Q-learning|DQN]]
| Deep Q-Network || Model-Free || Off-policy || Discrete || Continuous || Q-value
|-
! scope="row" | [[Deep Deterministic Policy Gradient|DDPG]]
| Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Asynchronous Advantage Actor-Critic Algorithm|A3C]]
| Asynchronous Advantage Actor-Critic Algorithm || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
! scope="row" | [[Trust Region Policy Optimization|TRPO]]
| Trust Region Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Proximal Policy Optimization|PPO]]
| Proximal Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Twin Delayed Deep Deterministic Policy Gradient|TD3]]
| Twin Delayed Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Soft Actor-Critic|SAC]]
| Soft Actor-Critic || Model-Free || Off-policy || Continuous || Continuous || Advantage
|-
!scope="row" |[[Distributional Soft Actor-Critic|DSAC]]
|Distributional Soft Actor-Critic ||Model-free ||Off-policy ||Continuous ||Continuous ||Value distribution
|}
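The following minimal sketch illustrates the temporal-difference update that value-based methods such as DQN perform, in which the estimate <math>Q(s, a)</math> is moved toward the bootstrapped target <math>r + \gamma \max_{a'} Q(s', a')</math>. It is not taken from any of the cited implementations: the toy problem sizes, the single weight matrix used as the function approximator, and the example transition are assumed placeholders. Practical implementations replace the weight matrix with a deep neural network and add an experience replay buffer and a separate target network.

<syntaxhighlight lang="python">
# Sketch of the Q-learning (temporal-difference) update used by DQN-style methods.
# The "network" here is a single weight matrix over one-hot states: Q(s, a) = W[s, a].
import numpy as np

n_states, n_actions = 4, 2   # toy problem sizes (assumed)
gamma, lr = 0.99, 0.1        # discount factor and learning rate

W = np.zeros((n_states, n_actions))  # simplest possible function approximator

def q_values(state):
    """Return the vector of Q(s, a) estimates for all actions in this state."""
    return W[state]

def td_update(state, action, reward, next_state, done):
    """One temporal-difference step: regress Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + gamma * np.max(q_values(next_state))
    td_error = target - W[state, action]
    W[state, action] += lr * td_error  # gradient step on the squared TD error (target held fixed)
    return td_error

# Hypothetical transition (s=0, a=1, r=1.0, s'=2, non-terminal):
print(td_update(0, 1, 1.0, 2, False))
</syntaxhighlight>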
== Research ==