== Algorithms ==
[[File:Challenges and Tricks of Deep RL.jpg|thumb|Challenges and tricks in deep reinforcement learning algorithms]]
 
Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each with its own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.

Deep reinforcement learning algorithms can start from an untrained policy and achieve superhuman performance in many complex tasks, including Atari games, StarCraft and the game of Go. Mainstream DRL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional SAC (DSAC). Each of these algorithms incorporates one or more of the techniques illustrated above to alleviate particular training challenges.<ref name="Li-2023"/>
 
In '''model-based''' deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by [[supervised learning]] with a neural network. Actions are then obtained through [[model predictive control]] applied to the learned model. Since the true environment dynamics usually diverge from the learned dynamics, the agent re-plans frequently as it carries out actions in the environment. The selected actions may be optimized using [[Monte Carlo methods]] such as the [[cross-entropy method]], or by combining model learning with model-free methods.
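A minimal sketch of this planning loop is shown below. It assumes a hypothetical learned forward model <code>dynamics_model(s, a)</code> and a reward function <code>reward_fn(s, a, s_next)</code>; it illustrates cross-entropy-method planning over the learned model rather than any particular published implementation.

<syntaxhighlight lang="python">
import numpy as np

def cem_plan(dynamics_model, reward_fn, state, horizon=15, n_samples=500,
             n_elites=50, n_iters=5, action_dim=2):
    """Plan an action sequence with the cross-entropy method over a
    learned dynamics model, returning only the first action (MPC)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        actions = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = state
            for t in range(horizon):
                s_next = dynamics_model(s, actions[i, t])  # learned forward model
                returns[i] += reward_fn(s, actions[i, t], s_next)
                s = s_next
        # Refit the sampling distribution to the highest-return sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first action only, then re-plan
</syntaxhighlight>

Returning only the first planned action and re-planning at the next step is what compensates for the mismatch between the learned and true dynamics described above.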
 
In '''model-free''' deep reinforcement learning algorithms, a policy <math>\pi(a|s)</math> is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient,<ref name="williams1992"/> but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and are widely applied.<ref name="schulman2015trpo"/><ref name="schulman2017ppo"/> Another class of model-free deep reinforcement learning algorithms relies on [[dynamic programming]], inspired by [[temporal difference learning]] and [[Q-learning]]. In discrete action spaces, these algorithms usually learn a neural network Q-function <math>Q(s, a)</math> that estimates the future returns of taking action <math>a</math> from state <math>s</math>.<ref name="DQN1"/> In continuous action spaces, these algorithms often learn both a value estimate and a policy.<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>
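In this notation, the basic policy-gradient estimator and the Q-learning regression target that these two families build on can be written as, with <math>\theta</math> the policy parameters, <math>G_t</math> the return from time step <math>t</math>, and <math>\gamma</math> the discount factor:

<math display="block">\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right], \qquad y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a').</math>

The high variance of the first estimator and the instability of bootstrapping the second with a neural network are the issues that the more stable algorithms cited above are designed to mitigate.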
 
== Research ==