{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based RL algorithms such as [[Policy gradient method|policy gradient methods]], and value-based RL algorithms such as value iteration, [[Q-learning]], [[State–action–reward–state–action|SARSA]], and [[Temporal difference learning|TD learning]].<ref>{{Cite journal |last=Arulkumaran |first=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |issn=1053-5888}}</ref>
An AC algorithm consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last=Konda |first=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy, while others are off-policy; some apply only to discrete action spaces, some only to continuous ones, and some to both.
=== Critic ===
In the unbiased estimators given above, certain functions such as <math>V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}</math> appear. These are not known exactly and must be estimated by the critic.
For example, if the critic is estimating the state-value function <math>V^{\pi_\theta}(s)</math>, then it can be learned by any value function approximation method. Let the critic be a function approximator <math>V_\phi(s)</math> with parameters <math>\phi</math>.
The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error:<math display="block">\delta_t = R_t + \gamma V_\phi(S_{t+1}) - V_\phi(S_t)</math>The critic parameters are then updated by semi-gradient descent on the squared TD error, treating the bootstrap target <math>R_t + \gamma V_\phi(S_{t+1})</math> as a constant:<math display="block">\phi \leftarrow \phi + \alpha \delta_t \nabla_\phi V_\phi(S_t)</math>where <math>\alpha</math> is the learning rate.
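As an illustration, a single such critic update for a linear value function <math>V_\phi(s) = \phi^\top x(s)</math> can be sketched as follows; the feature map <code>feat</code>, the argument names, and the hyperparameter values are illustrative assumptions rather than part of the algorithm itself.

<syntaxhighlight lang="python">
import numpy as np

def td_critic_update(phi, feat, s, r, s_next, done, gamma=0.99, alpha=0.01):
    """One semi-gradient TD update of a linear critic V_phi(s) = phi . feat(s)."""
    v_s = np.dot(phi, feat(s))                               # V_phi(S_t)
    v_next = 0.0 if done else np.dot(phi, feat(s_next))      # V_phi(S_{t+1}); zero at episode end
    delta = r + gamma * v_next - v_s                         # TD error delta_t
    # For a linear critic, grad_phi V_phi(S_t) is simply the feature vector feat(S_t)
    return phi + alpha * delta * feat(s)
</syntaxhighlight>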
Similarly, if the critic is estimating the action-value function <math>Q^{\pi_\theta}(s,a)</math>, then it can be learned by [[Q-learning]] or [[State–action–reward–state–action|SARSA]].
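A corresponding sketch for a tabular action-value critic updated by SARSA is shown below; the array <code>q</code> indexed by discrete states and actions, together with the hyperparameter values, is an assumption made for illustration.

<syntaxhighlight lang="python">
def sarsa_critic_update(q, s, a, r, s_next, a_next, gamma=0.99, alpha=0.1):
    """One SARSA update of a tabular critic q[s, a] from a (S, A, R, S', A') transition."""
    delta = r + gamma * q[s_next, a_next] - q[s, a]   # TD error for the sampled transition
    q[s, a] += alpha * delta
    return q
</syntaxhighlight>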
== Variants ==