== Algorithms ==
[[File:Challenges and Tricks of Deep RL.jpg|thumb|Challenges and tricks in deep reinforcement learning algorithms]]
 
Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each with its own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.

Deep reinforcement learning algorithms can start from an untrained policy and achieve superhuman performance in many complex tasks, including Atari games, StarCraft and the game of Go. Mainstream DRL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional SAC (DSAC). Each of these algorithms incorporates one or more of the techniques illustrated above to alleviate particular training challenges.<ref name="Li-2023"/>
 
In '''model-based''' deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by [[supervised learning]] with a neural network. Actions are then obtained through [[model predictive control]] applied to the learned model. Since the true environment dynamics usually diverge from the learned dynamics, the agent re-plans frequently as it carries out actions in the environment. The selected actions may be optimized using [[Monte Carlo methods]] such as the [[cross-entropy method]], or by combining model learning with model-free methods.
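A minimal sketch of this planning loop is shown below. It assumes a hypothetical learned forward model <code>dynamics_model(s, a)</code> and a reward function <code>reward_fn(s, a, s_next)</code>; it illustrates cross-entropy-method planning over the learned model rather than any particular published implementation.

<syntaxhighlight lang="python">
import numpy as np

def cem_plan(dynamics_model, reward_fn, state, horizon=15, n_samples=500,
             n_elites=50, n_iters=5, action_dim=2):
    """Plan an action sequence with the cross-entropy method over a
    learned dynamics model, returning only the first action (MPC)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        actions = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = state
            for t in range(horizon):
                s_next = dynamics_model(s, actions[i, t])  # learned forward model
                returns[i] += reward_fn(s, actions[i, t], s_next)
                s = s_next
        # Refit the sampling distribution to the highest-return sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first action only, then re-plan
</syntaxhighlight>

Returning only the first planned action and re-planning at the next step is what compensates for the mismatch between the learned and true dynamics described above.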
 
In '''model-free''' deep reinforcement learning algorithms, a policy <math>\pi(a|s)</math> is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient,<ref name="williams1992"/> but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and are widely applied.<ref name="schulman2015trpo"/><ref name="schulman2017ppo"/> Another class of model-free deep reinforcement learning algorithms relies on [[dynamic programming]], inspired by [[temporal difference learning]] and [[Q-learning]]. In discrete action spaces, these algorithms usually learn a neural network Q-function <math>Q(s, a)</math> that estimates the future returns of taking action <math>a</math> from state <math>s</math>.<ref name="DQN1"/> In continuous action spaces, these algorithms often learn both a value estimate and a policy.<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>
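In this notation, the basic policy-gradient estimator and the Q-learning regression target that these two families build on can be written as, with <math>\theta</math> the policy parameters, <math>G_t</math> the return from time step <math>t</math>, and <math>\gamma</math> the discount factor:

<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right], \qquad y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a').</math>

The high variance of the first estimator and the instability of bootstrapping the second with a neural network are the issues that the more stable algorithms cited above are designed to mitigate.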
 
== Research ==