In '''model-free''' deep reinforcement learning algorithms, a policy <math>\pi(a|s)</math> is learned without explicitly modeling the forward dynamics of the environment. A policy can be optimized to maximize returns by directly estimating the policy gradient,<ref name="williams1992"/> but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and have been widely applied.<ref name="schulman2015trpo"/><ref name="schulman2017ppo"/> Another class of model-free deep reinforcement learning algorithms relies on [[dynamic programming]], inspired by [[temporal difference learning]] and [[Q-learning]]. In discrete action spaces, these algorithms usually learn a neural network Q-function <math>Q(s, a)</math> that estimates the expected future returns of taking action <math>a</math> from state <math>s</math>.<ref name="DQN1"/> In continuous action spaces, these algorithms often learn both a value estimate and a policy.<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>
{| class="wikitable sortable" style="font-size: 96%;"
!Algorithm || class=unsortable|Description || class=unsortable|Model || Policy || class=unsortable |Action Space || class=unsortable |State Space ||Operator
|-
! scope="row" | [[Q-learning#Deep Q-learning|DQN]]
| Deep Q-Network || Model-Free || Off-policy || Discrete || Continuous || Q-value
|-
! scope="row" | [[Deep Deterministic Policy Gradient|DDPG]]
| Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Asynchronous Advantage Actor-Critic Algorithm|A3C]]
| Asynchronous Advantage Actor-Critic Algorithm || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
! scope="row" | [[Trust Region Policy Optimization|TRPO]]
| Trust Region Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Proximal Policy Optimization|PPO]]
| Proximal Policy Optimization || Model-Free || On-policy || Continuous or Discrete || Continuous || Advantage
|-
! scope="row" | [[Twin Delayed Deep Deterministic Policy Gradient|TD3]]
| Twin Delayed Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
! scope="row" | [[Soft Actor-Critic|SAC]]
| Soft Actor-Critic || Model-Free || Off-policy || Continuous || Continuous || Advantage
|-
!scope="row" |[[Distributional Soft Actor-Critic|DSAC]]
|Distributional Soft Actor-Critic ||Model-free ||Off-policy ||Continuous ||Continuous ||Value distribution
|}
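The following minimal sketch illustrates the temporal-difference update that value-based methods such as DQN perform, in which the estimate <math>Q(s, a)</math> is moved toward the bootstrapped target <math>r + \gamma \max_{a'} Q(s', a')</math>. It is not taken from any of the cited implementations: the toy problem sizes, the single weight matrix used as the function approximator, and the example transition are assumed placeholders. Practical implementations replace the weight matrix with a deep neural network and add an experience replay buffer and a separate target network.

<syntaxhighlight lang="python">
# Sketch of the Q-learning (temporal-difference) update used by DQN-style methods.
# The "network" here is a single weight matrix over one-hot states: Q(s, a) = W[s, a].
import numpy as np

n_states, n_actions = 4, 2   # toy problem sizes (assumed)
gamma, lr = 0.99, 0.1        # discount factor and learning rate

W = np.zeros((n_states, n_actions))  # simplest possible function approximator

def q_values(state):
    """Return the vector of Q(s, a) estimates for all actions in this state."""
    return W[state]

def td_update(state, action, reward, next_state, done):
    """One temporal-difference step: regress Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + gamma * np.max(q_values(next_state))
    td_error = target - W[state, action]
    W[state, action] += lr * td_error  # gradient step on the squared TD error (target held fixed)
    return td_error

# Hypothetical transition (s=0, a=1, r=1.0, s'=2, non-terminal):
print(td_update(0, 1, 1.0, 2, False))
</syntaxhighlight>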
== Research ==