Markov decision process

{{Short description|Mathematical model}}
In mathematics, a '''Markov decision process''' ('''MDP''') is a [[discrete-time]] [[stochastic]] [[Optimal control theory|control]] process. It provides a mathematical framework for modeling [[decision making]] in situations where outcomes are partly [[Randomness#In mathematics|random]] and partly under the control of a decision maker. MDPs are useful for studying [[optimization problem]]s solved via [[dynamic programming]]. MDPs were known at least as early as the 1950s;<ref>{{cite journal|first=R.|last=Bellman|author-link=Richard E. Bellman|url=http://www.iumj.indiana.edu/IUMJ/FULLTEXT/1957/6/56038|title=A Markovian Decision Process|journal=Journal of Mathematics and Mechanics|volume=6|year=1957|issue=5|pages=679–684|jstor=24900506}}</ref> a core body of research on Markov decision processes resulted from [[Ronald A. Howard|Ronald Howard]]'s 1960 book, ''Dynamic Programming and Markov Processes''.<ref>{{cite book|first=Ronald A.|last=Howard|title=Dynamic Programming and Markov Processes|publisher=The M.I.T. Press|year=1960|url=http://web.mit.edu/dimitrib/www/dpchapter.pdf}}</ref> They are used in many disciplines, including [[robotics]], [[automatic control]], [[economics]] and [[manufacturing]]. The name of MDPs comes from the Russian mathematician [[Andrey Markov]] as they are an extension of [[Markov chain]]s.
 
At each time step, the process is in some state <math>s</math>, and the decision maker may choose any action <math>a</math> that is available in state <math>s</math>. The process responds at the next time step by randomly moving into a new state <math>s'</math> and giving the decision maker a corresponding reward <math>R_a(s,s')</math>. The probability of moving into each possible next state <math>s'</math> depends only on the current state <math>s</math> and the chosen action <math>a</math>; this transition probability is written <math>P_a(s,s')</math>.
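
Concretely, one such time step can be simulated as in the following minimal sketch. It assumes a small finite MDP whose transition probabilities <math>P_a(s,s')</math> and rewards <math>R_a(s,s')</math> are stored in plain dictionaries; the names <code>P</code>, <code>R</code>, and <code>step</code> are illustrative rather than standard.

<syntaxhighlight lang="python">
import random

# A toy finite MDP stored in plain dictionaries (illustrative names):
#   P[s][a] -> list of (next_state, probability) pairs, i.e. P_a(s, s')
#   R[a][(s, s_next)] -> reward R_a(s, s') for the transition s -> s' under a
P = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("s1", 1.0)], "stay": [("s1", 1.0)]},
}
R = {
    "go":   {("s0", "s1"): 1.0, ("s0", "s0"): 0.0, ("s1", "s1"): 0.0},
    "stay": {("s0", "s0"): 0.0, ("s1", "s1"): 0.0},
}

def step(s, a):
    """One MDP time step: sample s' from P_a(s, .) and return it with R_a(s, s')."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[a][(s, s_next)]

# Taking action "go" in state "s0" moves to "s1" with probability 0.8
# (reward 1.0) and stays in "s0" with probability 0.2 (reward 0.0).
s_next, reward = step("s0", "go")
print(s_next, reward)
</syntaxhighlight>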