{{Short description|Mathematical model}}
'''Markov decision process''' ('''MDP'''), also called a [[Stochastic dynamic programming|stochastic dynamic program]] or stochastic control problem, is a model for [[sequential decision making]] when [[Outcome (probability)|outcomes]] are uncertain.<ref>{{Cite book |last=Puterman |first=Martin L. |title=Markov decision processes: discrete stochastic dynamic programming |date=1994 |publisher=Wiley |isbn=978-0-471-61977-2 |series=Wiley series in probability and mathematical statistics. Applied probability and statistics section |___location=New York}}</ref>
An MDP is a [[discrete-time]] [[stochastic]] [[Optimal control theory|control]] process: it provides a mathematical framework for modeling [[decision making]] in situations where outcomes are partly [[Randomness#In mathematics|random]] and partly under the control of a decision maker, and is useful for studying [[optimization problem]]s solved via [[dynamic programming]]. MDPs were known at least as early as the 1950s;<ref>{{cite journal|first=R.|last=Bellman|author-link=Richard E. Bellman|url=http://www.iumj.indiana.edu/IUMJ/FULLTEXT/1957/6/56038|title=A Markovian Decision Process|journal=Journal of Mathematics and Mechanics|volume=6|year=1957|issue=5|pages=679–684|jstor=24900506}}</ref> a core body of research on Markov decision processes resulted from [[Ronald A. Howard|Ronald Howard]]'s 1960 book, ''Dynamic Programming and Markov Processes''.<ref>{{cite book|first=Ronald A.|last=Howard|title=Dynamic Programming and Markov Processes|publisher=The M.I.T. Press|year=1960}}</ref>
 
Originating from [[operations research]] in the 1950s,<ref>{{Cite journal |last=Schneider |first=S. |last2=Wagner |first2=D. H. |date=1957-02-26 |title=Error detection in redundant systems |url=https://dl.acm.org/doi/10.1145/1455567.1455587 |journal=Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability |series=IRE-AIEE-ACM '57 (Western) |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=115–121 |doi=10.1145/1455567.1455587 |isbn=978-1-4503-7861-1}}</ref><ref>{{Cite journal |last=Bellman |first=Richard |date=1958-09-01 |title=Dynamic programming and stochastic control processes |url=https://linkinghub.elsevier.com/retrieve/pii/S0019995858800030 |journal=Information and Control |volume=1 |issue=3 |pages=228–239 |doi=10.1016/S0019-9958(58)80003-0 |issn=0019-9958}}</ref> MDPs have since gained recognition in a variety of fields, including [[robotics]], [[ecology]], [[economics]], [[Health care|healthcare]], and [[telecommunications]].
At each time step, the process is in some state <math>s</math>, and the decision maker may choose any action <math>a</math> that is available in state <math>s</math>. The process responds at the next time step by randomly moving into a new state <math>s'</math>, and giving the decision maker a corresponding reward <math>R_a(s,s')</math>.
 
==Background==
The [[probability]] that the process moves into its new state <math>s'</math> is influenced by the chosen action. Specifically, it is given by the [[state transition function]] <math>P_a(s,s')</math>. Thus, the next state <math>s'</math> depends on the current state <math>s</math> and the decision maker's action <math>a</math>. But given <math>s</math> and <math>a</math>, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the [[Markov property]].
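The step dynamics described above can be sketched in a few lines of Python. The two-state, two-action example below, and all of its probabilities and rewards, are invented for illustration and are not part of the formal definition:

```python
import random

# Toy MDP (states, actions, and all numbers are illustrative assumptions).
# P[a][s] is the distribution P_a(s, ·) over next states s';
# R[a][(s, s')] is the reward R_a(s, s').
P = {
    "wait":   {"low": {"low": 0.9, "high": 0.1},
               "high": {"low": 0.5, "high": 0.5}},
    "invest": {"low": {"low": 0.4, "high": 0.6},
               "high": {"low": 0.2, "high": 0.8}},
}
R = {
    "wait":   {("low", "low"): 0.0, ("low", "high"): 1.0,
               ("high", "low"): 0.0, ("high", "high"): 2.0},
    "invest": {("low", "low"): -1.0, ("low", "high"): 3.0,
               ("high", "low"): -1.0, ("high", "high"): 4.0},
}

def step(s, a, rng=random):
    """Sample s' from P_a(s, ·) and return (s', R_a(s, s'))."""
    next_states, probs = zip(*P[a][s].items())
    s_next = rng.choices(next_states, weights=probs)[0]
    return s_next, R[a][(s, s_next)]

# One simulated trajectory: only the current (s, a) pair matters for the
# next state, never the earlier history -- the Markov property.
s = "low"
for a in ["invest", "wait", "invest"]:
    s, r = step(s, a)
```

Note that `step` consults only the current state and chosen action; no history is stored anywhere, which is exactly the conditional-independence statement above.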
 
The name "Markov decision process" comes from its connection to '''[[Markov chain|Markov chains]]''', a mathematical concept developed by the Russian mathematician [[Andrey Markov]]. A Markov chain is a sequence of states where the probability of moving to the next state depends only on the current state and not on the sequence of events that preceded it. This property is known as the '''[[Markov property]]''' or memorylessness.
Markov decision processes are an extension of [[Markov chain]]s; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process reduces to a Markov chain.
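This reduction can be checked directly: with a single action and a reward that is always zero, sampling a step of the MDP and sampling a step of the corresponding Markov chain are the same procedure. The two-state weather chain below is an invented illustration:

```python
import random

# A one-action, zero-reward MDP is just a Markov chain whose transition
# matrix is P_wait. (Illustrative two-state example.)
P_wait = {"sunny": {"sunny": 0.8, "rainy": 0.2},
          "rainy": {"sunny": 0.4, "rainy": 0.6}}

def mdp_step(s, a, rng):
    """MDP step: the only action is "wait", and every reward is zero."""
    assert a == "wait"
    states, probs = zip(*P_wait[s].items())
    return rng.choices(states, weights=probs)[0], 0.0

def chain_step(s, rng):
    """Plain Markov-chain step with the same transition matrix."""
    states, probs = zip(*P_wait[s].items())
    return rng.choices(states, weights=probs)[0]

# With identically seeded generators, the two processes coincide step for step.
rng1, rng2 = random.Random(42), random.Random(42)
s_mdp, reward = mdp_step("sunny", "wait", rng1)
s_chain = chain_step("sunny", rng2)
```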
 
In an MDP, an agent makes decisions that influence the transitions between states. Each action taken in a particular state leads to a probability distribution over the next possible states, just as in a Markov chain; unlike in a simple Markov chain, however, the agent can choose its actions so as to optimize an objective, usually the maximization of some cumulative reward.
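A common form of the cumulative objective is the discounted return <math>G = \sum_t \gamma^t r_t</math>. The discount factor <math>\gamma</math> and the reward sequence below are illustrative assumptions; the objective is introduced formally in the Definition section:

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# (gamma and the reward sequence are illustrative assumptions).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0])  # 1 + 0.9 + 0.81 = 2.71
```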
 
The "Markov" in "Markov decision process" refers to the underlying structure of [[Transition system|state transitions]] that still follow the Markov property. The process is called a "decision process" because it involves making decisions that influence these state transitions, extending the concept of a Markov chain into the realm of decision-making under uncertainty.
 
==Definition==