Markov decision process

{{main|Reinforcement learning}}
 
[[Reinforcement learning]] uses MDPs where the probabilities and rewards are unknown.<ref>{{cite journal|author1=Shoham, Y.|author2= Powers, R.|author3= Grenager, T. |year=2003|title= Multi-agent reinforcement learning: a critical survey |pages= 1–13|journal= Technical Report, Stanford University|url=http://jmvidal.cse.sc.edu/library/shoham03a.pdf|access-date=2018-12-12}}</ref>
 
===Reinforcement learning for discrete MDPs===
For the purpose of this section, it is useful to define a further function, which corresponds to taking the action <math>a</math> and then continuing optimally (or according to whatever policy one currently has):
:<math>\ Q(s,a) = \sum_{s'} P_a(s,s') (R_a(s,s') + \gamma V(s')).\ </math>
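
The following is a minimal illustrative sketch (not part of the standard presentation) of evaluating this function when the model is known, assuming the transition probabilities and rewards are stored as NumPy arrays indexed as <code>P[a, s, s']</code> and <code>R[a, s, s']</code>:

<syntaxhighlight lang="python">
import numpy as np

def q_value(P, R, V, gamma, s, a):
    """Q(s, a) = sum over s' of P_a(s, s') * (R_a(s, s') + gamma * V(s')).

    Assumed (hypothetical) array layout: P[a, s, s'] is the probability of
    moving from state s to s' under action a, and R[a, s, s'] is the
    immediate reward for that transition; V is a value estimate per state.
    """
    return np.sum(P[a, s] * (R[a, s] + gamma * V))

# Illustrative example with 2 states and 2 actions (numbers are arbitrary).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.4, 0.6]]])
R = np.zeros((2, 2, 2))
R[0, 0, 1] = 1.0          # reward for reaching state 1 from state 0 under action 0
V = np.array([0.0, 1.0])  # current estimate of the value function
print(q_value(P, R, V, gamma=0.9, s=0, a=0))
</syntaxhighlight>

In reinforcement learning proper, the arrays <code>P</code> and <code>R</code> are not available, and <math>Q(s,a)</math> is instead estimated from observed transitions.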