{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
{{Orphan|date=January 2025}}
 
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based and value-based methods. It consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last=Konda |first=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy while others are off-policy, and some handle only continuous action spaces, only discrete action spaces, or both.
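
The division of labour between the two components can be illustrated with a minimal sketch: a one-step tabular actor-critic on a hypothetical two-state toy problem. The dynamics, step sizes, and variable names below are illustrative assumptions, not part of any standard implementation.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.05

theta = np.zeros((n_states, n_actions))   # actor: softmax preferences per (state, action)
V = np.zeros(n_states)                    # critic: estimated value of each state

def policy(s):
    """Softmax distribution over actions in state s."""
    prefs = theta[s] - theta[s].max()     # shift for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Toy dynamics: action 1 usually switches states, action 0 usually stays.
    Reward is 1 whenever the next state is state 1."""
    switch = rng.random() < (0.9 if a == 1 else 0.1)
    s_next = 1 - s if switch else s
    return s_next, float(s_next == 1)

s = 0
for _ in range(5000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)        # actor chooses the action
    s_next, r = step(s, a)

    # Critic evaluates the transition via the one-step TD error.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta

    # Actor moves its parameters along the policy gradient, weighted by the TD error.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0                 # gradient of log softmax w.r.t. theta[s]
    theta[s] += alpha_pi * delta * grad_log_pi

    s = s_next

print(policy(0), policy(1))  # should lean towards switching from state 0 and staying in state 1
</syntaxhighlight>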
 
Actor-critic algorithms are one of the main families of algorithms used in modern RL.<ref>{{Cite journal |last=Arulkumaran |first=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |issn=1053-5888}}</ref>
 
== Overview ==
 
The actor-critic method belongs to the family of [[policy gradient method|policy gradient methods]], but addresses their high variance by incorporating a value function approximator (the critic). The actor uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
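
For instance, the advantage can be expressed in terms of the other two value estimates and used as the critic's signal in the policy gradient:

<math display="block">A(s,a) = Q(s,a) - V(s), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(a|s)\, A(s,a)\right],</math>

which has lower variance than weighting by the full return, since the state value <math>V(s)</math> acts as a baseline.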
 
The actor is a parameterized function <math>\pi_\theta</math>, where <math>\theta</math> denotes its parameters. It takes the state of the environment <math>s</math> as input and produces a [[probability distribution]] over actions <math>\pi_\theta(\cdot | s)</math>.
 
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a | s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a | s) da = 1</math>.
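
Both normalization conditions can be checked numerically with a short sketch; the softmax preferences and Gaussian parameters below are arbitrary illustrative values.

<syntaxhighlight lang="python">
import numpy as np

# Discrete case: a softmax over action preferences yields probabilities
# that sum to 1 over the finite action set.
theta = np.array([0.5, -1.2, 2.0])           # illustrative preferences for 3 actions
probs = np.exp(theta) / np.exp(theta).sum()
print(probs, probs.sum())                    # probabilities, and their sum (1.0)

# Continuous case: a Gaussian policy with mean mu and standard deviation sigma
# defines a density over a one-dimensional action that integrates to 1.
mu, sigma = 0.3, 0.8                         # illustrative parameters
a = np.linspace(-10.0, 10.0, 100_001)
density = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print((density * (a[1] - a[0])).sum())       # Riemann sum, approximately 1.0
</syntaxhighlight>
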
== References ==
{{Reflist|30em}}
* {{Cite journal |last=Konda |first=Vijay R. |last2=Tsitsiklis |first2=John N. |date=January 2003 |title=On Actor-Critic Algorithms |url=http://epubs.siam.org/doi/10.1137/S0363012901385691 |journal=SIAM Journal on Control and Optimization |language=en |volume=42 |issue=4 |pages=1143–1166 |doi=10.1137/S0363012901385691 |issn=0363-0129}}
* {{Cite book |last=Sutton |first=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=2 |series=Adaptive computation and machine learning series |___location=Cambridge, Massachusetts}}
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |___location=Belmont, Massachusetts}}