{{Short description|Reinforcement learning algorithm that combines policy and value estimation}}
{{Orphan|date=January 2025}}
 
The '''actor-critic algorithm''' (AC) is a family of [[reinforcement learning]] (RL) algorithms that combine policy-based and value-based methods. It consists of two main components: an "'''actor'''" that determines which actions to take according to a policy function, and a "'''critic'''" that evaluates those actions according to a value function.<ref>{{Cite journal |last=Konda |first=Vijay |last2=Tsitsiklis |first2=John |date=1999 |title=Actor-Critic Algorithms |url=https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=12}}</ref> Some AC algorithms are on-policy while others are off-policy, and some handle only continuous action spaces, only discrete action spaces, or both.
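
The division of labour between the two components can be illustrated with a minimal sketch: a one-step tabular actor-critic on a hypothetical two-state toy problem. The dynamics, step sizes, and variable names below are illustrative assumptions, not part of any standard implementation.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.05

theta = np.zeros((n_states, n_actions))   # actor: softmax preferences per (state, action)
V = np.zeros(n_states)                    # critic: estimated value of each state

def policy(s):
    """Softmax distribution over actions in state s."""
    prefs = theta[s] - theta[s].max()     # shift for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Toy dynamics: action 1 usually switches states, action 0 usually stays.
    Reward is 1 whenever the next state is state 1."""
    switch = rng.random() < (0.9 if a == 1 else 0.1)
    s_next = 1 - s if switch else s
    return s_next, float(s_next == 1)

s = 0
for _ in range(5000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)        # actor chooses the action
    s_next, r = step(s, a)

    # Critic evaluates the transition via the one-step TD error.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta

    # Actor moves its parameters along the policy gradient, weighted by the TD error.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0                 # gradient of log softmax w.r.t. theta[s]
    theta[s] += alpha_pi * delta * grad_log_pi

    s = s_next

print(policy(0), policy(1))  # should lean towards switching from state 0 and staying in state 1
</syntaxhighlight>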
 
Actor-critic algorithms are one of the main families of algorithms used in modern RL.<ref>{{Cite journal |last=Arulkumaran |first=Kai |last2=Deisenroth |first2=Marc Peter |last3=Brundage |first3=Miles |last4=Bharath |first4=Anil Anthony |date=November 2017 |title=Deep Reinforcement Learning: A Brief Survey |url=http://ieeexplore.ieee.org/document/8103164/ |journal=IEEE Signal Processing Magazine |volume=34 |issue=6 |pages=26–38 |doi=10.1109/MSP.2017.2743240 |issn=1053-5888}}</ref>
 
== Overview ==
 
The actor-critic method belongs to the family of [[policy gradient method|policy gradient methods]], but addresses their high variance by incorporating a value function approximator (the critic). The actor uses a policy function <math>\pi(a|s)</math>, while the critic estimates either the [[value function]] <math>V(s)</math>, the action-value function <math>Q(s,a)</math>, the advantage function <math>A(s,a)</math>, or any combination thereof.
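
For instance, the advantage can be expressed in terms of the other two value estimates and used as the critic's signal in the policy gradient:

<math display="block">A(s,a) = Q(s,a) - V(s), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(a|s)\, A(s,a)\right],</math>

which has lower variance than weighting by the full return, since the state value <math>V(s)</math> acts as a baseline.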
 
The actor is a parameterized function <math>\pi_\theta</math>, where <math>\theta</math> denotes its parameters. It takes the state of the environment <math>s</math> as input and produces a [[probability distribution]] over actions <math>\pi_\theta(\cdot | s)</math>.
 
If the action space is discrete, then <math>\sum_{a} \pi_\theta(a | s) = 1</math>. If the action space is continuous, then <math>\int_{a} \pi_\theta(a | s) da = 1</math>.
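
Both normalization conditions can be checked numerically with a short sketch; the softmax preferences and Gaussian parameters below are arbitrary illustrative values.

<syntaxhighlight lang="python">
import numpy as np

# Discrete case: a softmax over action preferences yields probabilities
# that sum to 1 over the finite action set.
theta = np.array([0.5, -1.2, 2.0])           # illustrative preferences for 3 actions
probs = np.exp(theta) / np.exp(theta).sum()
print(probs, probs.sum())                    # probabilities, and their sum (1.0)

# Continuous case: a Gaussian policy with mean mu and standard deviation sigma
# defines a density over a one-dimensional action that integrates to 1.
mu, sigma = 0.3, 0.8                         # illustrative parameters
a = np.linspace(-10.0, 10.0, 100_001)
density = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
print((density * (a[1] - a[0])).sum())       # Riemann sum, approximately 1.0
</syntaxhighlight>
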
== References ==
{{Reflist|30em}}
* {{Cite journal |last=Konda |first=Vijay R. |last2=Tsitsiklis |first2=John N. |date=January 2003 |title=On Actor-Critic Algorithms |url=http://epubs.siam.org/doi/10.1137/S0363012901385691 |journal=SIAM Journal on Control and Optimization |language=en |volume=42 |issue=4 |pages=1143–1166 |doi=10.1137/S0363012901385691 |issn=0363-0129}}
* {{Cite book |last=Sutton |first=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=2 |series=Adaptive computation and machine learning series |___location=Cambridge, Massachusetts}}
* {{Cite book |last=Bertsekas |first=Dimitri P. |title=Reinforcement learning and optimal control |date=2019 |publisher=Athena Scientific |isbn=978-1-886529-39-7 |edition=2 |___location=Belmont, Massachusetts}}