Deep reinforcement learning

=== Reinforcement learning ===
[[File:Markov_diagram_v2.svg|alt=Diagram explaining the loop recurring in reinforcement learning algorithms|thumb|Diagram of the loop recurring in reinforcement learning algorithms]]
[[Reinforcement learning]] is a process in which an agent learns to make decisions through trial and error. The problem is often modeled mathematically as a [[Markov decision process]] (MDP): at every timestep the agent is in a state <math>s</math>, takes an action <math>a</math>, receives a scalar reward, and transitions to the next state <math>s'</math> according to the environment dynamics <math>p(s'|s, a)</math>. The agent attempts to learn a policy <math>\pi(a|s)</math>, a map from observations to actions, in order to maximize its return (the expected sum of rewards). In reinforcement learning (as opposed to [[optimal control]]), the algorithm has access to the dynamics <math>p(s'|s, a)</math> only through sampling.
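
The following is a minimal sketch of this interaction loop, using a hypothetical two-state MDP and tabular Q-learning (the non-deep precursor of the algorithms discussed below); the transition table, learning rate and exploration rate are illustrative assumptions rather than values taken from any particular source:

<syntaxhighlight lang="python">
import random

# Hypothetical toy MDP used only for illustration: 2 states, 2 actions.
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)],
        1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
}

def step(state, action):
    """Sample the next state and reward from p(s'|s, a); the agent never
    reads the transition table directly, it only observes samples."""
    probs, next_states, rewards = zip(*transitions[state][action])
    i = random.choices(range(len(probs)), weights=probs)[0]
    return next_states[i], rewards[i]

# Tabular Q-learning: estimate action values from sampled transitions.
q = {(s, a): 0.0 for s in transitions for a in (0, 1)}
alpha, gamma, epsilon = 0.1, 0.95, 0.1   # illustrative hyperparameters

state = 0
for t in range(10000):
    # epsilon-greedy policy pi(a|s) derived from the current Q estimates
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: q[(state, a)])
    next_state, reward = step(state, action)
    # One-step temporal-difference update toward r + gamma * max_a' Q(s', a')
    target = reward + gamma * max(q[(next_state, a)] for a in (0, 1))
    q[(state, action)] += alpha * (target - q[(state, action)])
    state = next_state

print(q)  # learned action values for each (state, action) pair
</syntaxhighlight>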
 
 
== Algorithms ==
 
[[File:Challenges and Tricks of Deep RL.jpg|thumb|Challenges and tricks in deep reinforcement learning algorithms <ref name="Li-2023"/>]]
Deep reinforcement learning (DRL) was once assumed to be a simple matter of combining tabular reinforcement learning with deep neural networks, with algorithm design a trivial step. In practice, DRL is fundamentally more complicated because it inherits serious challenges from both reinforcement learning and deep learning. Several of these challenges, including non-i.i.d. sequential data, easy divergence, value overestimation, and sample inefficiency, are particularly damaging if left untreated. A number of empirical but effective tricks have been proposed to address these issues, and they form the basis of many advanced DRL algorithms. These tricks include experience replay (ExR), parallel exploration (PEx), separated target network (STN), delayed policy update (DPU), constrained policy update (CPU), clipped actor criterion (CAC), double Q-functions (DQF), bounded double Q-functions (BDQ), distributional return function (DRF), entropy regularization (EnR), and soft value function (SVF).<ref name="Li-2023">{{cite book |last1=Li |first1=Shengbo |title=Reinforcement Learning for Sequential Decision and Optimal Control |date=2023 |publisher=Springer Verlag |___location=Singapore |isbn=978-9-811-97783-1 |pages=1–460 |doi=10.1007/978-981-19-7784-8 |s2cid=257928563 |edition=First |url=https://link.springer.com/book/10.1007/978-981-19-7784-8}}</ref>
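
The following is a schematic sketch of two of these tricks, experience replay (ExR) and a separated target network (STN), in the spirit of DQN-style training; the linear Q-function, randomly generated transitions and hyperparameters are placeholders for illustration and do not reproduce any particular implementation:

<syntaxhighlight lang="python">
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Experience replay (ExR): store transitions and sample them i.i.d.,
    breaking the temporal correlation of sequential data."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = map(np.array, zip(*batch))
        return s, a, r, s2, d

# Illustrative linear Q-function Q(s, .) = W s, standing in for a deep network.
n_features, n_actions = 4, 2
weights = np.zeros((n_actions, n_features))          # online network
target_weights = weights.copy()                      # separated target network (STN)

def q_values(w, states):
    return states @ w.T                              # shape: (batch, n_actions)

buffer = ReplayBuffer()
gamma, lr, sync_every, batch_size = 0.99, 0.01, 100, 32

for step in range(1000):
    # (real environment interaction omitted; fake transitions are pushed instead)
    s = np.random.randn(n_features)
    buffer.push(s, np.random.randint(n_actions), np.random.randn(),
                np.random.randn(n_features), False)
    if len(buffer) < batch_size:
        continue
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    # Bootstrapped targets are computed with the frozen target network,
    # which stabilizes learning compared with chasing a moving target.
    targets = rewards + gamma * (1 - dones) * q_values(target_weights, next_states).max(axis=1)
    preds = q_values(weights, states)[np.arange(batch_size), actions]
    grad = (preds - targets)[:, None] * states       # gradient of the squared TD error
    for i, a in enumerate(actions):
        weights[a] -= lr * grad[i]
    if step % sync_every == 0:
        target_weights = weights.copy()              # periodic hard update of the target
</syntaxhighlight>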
 
Deep reinforcement learning algorithms can start from a randomly initialized policy and reach superhuman performance in many complex tasks, including Atari games, StarCraft and Go. Mainstream DRL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC) and Distributional SAC (DSAC). Each of these algorithms employs one or more of the tricks above to alleviate particular challenges.<ref name="Li-2023"/>
 
{| class="wikitable sortable" style="font-size: 96%;"
!Algorithm || class=unsortable|Description || class=unsortable|Model || Policy || class=unsortable |Action Space || class=unsortable |State Space ||Operator
|-
! scope="row" | [[Q-learning#Deep Q-learning|DQN]]
| Deep Q Network || Model-Free || Off-policy || Discrete || Continuous || Q-value
|-
! scope="row" | [[Deep Deterministic Policy Gradient|DDPG]]
| Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Asynchronous Advantage Actor-Critic Algorithm|A3C]]
| Asynchronous Advantage Actor-Critic Algorithm || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
! scope="row" | [[Trust Region Policy Optimization|TRPO]]
| Trust Region Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Proximal Policy Optimization|PPO]]
| Proximal Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Twin Delayed Deep Deterministic Policy Gradient|TD3]]
| Twin Delayed Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Soft Actor-Critic|SAC]]
| Soft Actor-Critic || Model-Free || Off-policy || Continuous || Continuous || Advantage
|-
!scope="row" |[[Distributional Soft Actor-Critic|DSAC]]
|Distributional Soft Actor-Critic ||Model-free ||Off-policy ||Continuous ||Continuous ||Value distribution
|}
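
As a concrete example of how one of these tricks appears inside an algorithm from the table, the following sketch computes the clipped surrogate objective maximized by PPO (the clipped actor criterion); the probabilities and advantages in the usage example are made-up numbers:

<syntaxhighlight lang="python">
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped actor criterion (CAC) as used in PPO: the probability ratio
    pi_new(a|s) / pi_old(a|s) is clipped so that a single update cannot move
    the policy too far from the policy that collected the data."""
    ratio = np.exp(logp_new - logp_old)               # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum gives a pessimistic (lower-bound) objective.
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative usage: positive advantages favour raising the probability of the
# sampled actions, but only up to the clipping range.
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
logp_new = np.log(np.array([0.3, 0.4, 0.2]))
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clipped_objective(logp_new, logp_old, advantages))
</syntaxhighlight>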
 
Various techniques exist for training policies with deep reinforcement learning algorithms, each with its own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.
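
The following is a minimal sketch of this distinction in the style of the classical Dyna-Q algorithm rather than of any specific deep RL method: the same value update is applied both to real transitions (model-free learning) and to transitions replayed from a learned forward model (a simple form of model-based learning); the toy chain environment and hyperparameters are illustrative assumptions:

<syntaxhighlight lang="python">
import random

# Hypothetical deterministic chain of 5 states; the goal is state 4.
def env_step(state, action):             # action: -1 (left) or +1 (right)
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
model = {}                                # learned forward model: (s, a) -> (s', r)
alpha, gamma = 0.1, 0.9

state = 0
for t in range(2000):
    action = random.choice((-1, 1))
    next_state, reward = env_step(state, action)      # real experience (model-free part)
    target = reward + gamma * max(q[(next_state, a)] for a in (-1, 1))
    q[(state, action)] += alpha * (target - q[(state, action)])
    model[(state, action)] = (next_state, reward)     # fit the forward model from data
    # Model-based part: a few extra updates on transitions "imagined" by the model.
    for _ in range(5):
        (s, a), (s2, r) = random.choice(list(model.items()))
        t2 = r + gamma * max(q[(s2, b)] for b in (-1, 1))
        q[(s, a)] += alpha * (t2 - q[(s, a)])
    state = 0 if next_state == 4 else next_state      # restart the episode at the goal
</syntaxhighlight>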