{{Short description|Mathematical model}}
A '''Markov decision process''' ('''MDP'''), also called a [[Stochastic dynamic programming|stochastic dynamic program]] or stochastic control problem, is a model for [[sequential decision making]] when [[Outcome (probability)|outcomes]] are uncertain.<ref>{{Cite book |last=Puterman |first=Martin L. |title=Markov decision processes: discrete stochastic dynamic programming |date=1994 |publisher=Wiley |isbn=978-0-471-61977-2 |series=Wiley series in probability and mathematical statistics. Applied probability and statistics section |___location=New York}}</ref>
Originating from [[operations research]] in the 1950s,<ref>{{Cite journal |last=Schneider |first=S. |last2=Wagner |first2=D. H. |date=1957-02-26 |title=Error detection in redundant systems |url=https://dl.acm.org/doi/10.1145/1455567.1455587 |journal=Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability |series=IRE-AIEE-ACM '57 (Western) |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=115–121 |doi=10.1145/1455567.1455587 |isbn=978-1-4503-7861-1}}</ref><ref>{{Cite journal |last=Bellman |first=Richard |date=1958-09-01 |title=Dynamic programming and stochastic control processes |url=https://linkinghub.elsevier.com/retrieve/pii/S0019995858800030 |journal=Information and Control |volume=1 |issue=3 |pages=228–239 |doi=10.1016/S0019-9958(58)80003-0 |issn=0019-9958}}</ref> MDPs have since gained recognition in a variety of fields, including [[robotics]], [[ecology]], [[economics]], [[Health care|healthcare]], and [[telecommunications]].
==Background==
The name "Markov decision process" comes from its connection to '''[[Markov chain|Markov chains]]''', a mathematical concept developed by the Russian mathematician [[Andrey Markov]]. A Markov chain is a sequence of states where the probability of moving to the next state depends only on the current state and not on the sequence of events that preceded it. This property is known as the '''[[Markov property]]''' or memorylessness.
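The memorylessness described above can be illustrated with a minimal simulation. The two-state weather chain and its transition probabilities below are illustrative inventions, not taken from any source; the only thing the sketch demonstrates is that the next state is sampled from a distribution determined by the current state alone.

```python
import random

# Illustrative two-state weather chain. Each state maps to a list of
# (next_state, probability) pairs; the probabilities are made up.
transitions = {
    "sunny": [("sunny", 0.9), ("rainy", 0.1)],
    "rainy": [("sunny", 0.5), ("rainy", 0.5)],
}

def step(state):
    # The next state depends only on the current state (Markov property).
    next_states, probs = zip(*transitions[state])
    return random.choices(next_states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = step(state)  # history of earlier states plays no role
```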
An MDP builds on the idea of a Markov chain but adds the element of decision-making. In an MDP, an agent makes decisions that influence the transitions between states. Each decision (or action) taken in a particular state leads to a probability distribution over the next possible states, as in a Markov chain. Unlike in a simple Markov chain, however, the agent can actively choose actions to optimize a certain objective, usually the maximization of some cumulative reward.
The "Markov" in "Markov decision process" refers to the underlying structure of [[Transition system|state transitions]] that still follow the Markov property. The process is called a "decision process" because it involves making decisions that influence these state transitions, extending the concept of a Markov chain into the realm of decision-making under uncertainty.
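The interplay of actions, stochastic transitions, and rewards described above can be sketched as follows. This is a hypothetical toy MDP, not a standard example: the states <code>s0</code>/<code>s1</code>, the actions, and all probabilities and rewards are invented for illustration. Each (state, action) pair indexes a distribution over (next state, reward) outcomes, and an agent accumulates reward as it acts.

```python
import random

# Illustrative MDP: P[(state, action)] is a list of
# (next_state, probability, reward) triples. All numbers are made up.
P = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 0.7, 5.0), ("s0", 0.3, 0.0)],
    ("s1", "stay"): [("s1", 1.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 0.0)],
}

def step(state, action):
    # The agent chooses the action; nature samples the outcome.
    outcomes = P[(state, action)]
    i = random.choices(range(len(outcomes)),
                       weights=[p for _, p, _ in outcomes])[0]
    next_state, _, reward = outcomes[i]
    return next_state, reward

# An agent following a fixed policy (always "go") accumulates reward.
total, state = 0.0, "s0"
for _ in range(10):
    state, r = step(state, "go")
    total += r
```

The distinction from a plain Markov chain is visible in the dictionary keys: fixing a policy (a rule mapping each state to an action) collapses the MDP back into an ordinary Markov chain over the states.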
==Definition==