{{technical|date=June 2012}}
'''Stochastic approximation''' methods are a family of [[iterative method|iterative methods]] typically used for [[root-finding algorithm|root-finding]] problems or for [[Mathematical optimization|optimization]] problems.
In a nutshell, stochastic approximation algorithms deal with a function of the form <math display="inline"> f(\theta) = \operatorname E_{\xi} [F(\theta,\xi)] </math>, the [[expected value]] of a function depending on a random variable <math display="inline">\xi</math>. The goal is to recover properties of <math display="inline">f</math> without evaluating it directly, using only random samples of <math display="inline">F(\theta,\xi)</math>.
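As a minimal sketch of this setting (the quadratic integrand and Gaussian noise below are invented for illustration), the function <math display="inline">f</math> is never evaluated directly; only noisy samples <math display="inline">F(\theta,\xi)</math> are available, and averaging them recovers <math display="inline">f(\theta)</math>:

```python
import random

def F(theta, xi):
    # Hypothetical integrand: with xi ~ N(0, 1),
    # f(theta) = E[F(theta, xi)] = theta**2 + 1,
    # since E[2*theta*xi] = 0 and E[xi**2] = 1.
    return theta**2 + 2 * theta * xi + xi**2

def noisy_sample(theta):
    # Only samples of F(theta, xi) are observable; f itself is not.
    return F(theta, random.gauss(0.0, 1.0))

random.seed(0)
# A Monte Carlo average of the samples approximates f(3.0) = 10.
estimate = sum(noisy_sample(3.0) for _ in range(100_000)) / 100_000
```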
These applications range from [[stochastic optimization]] methods and algorithms,
to online forms of the [[Expectation–maximization algorithm|EM algorithm]], and reinforcement learning
via [[Temporal difference learning|temporal differences]], among others.
Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.<ref>{{cite web|last1=Le Ny|first1=Jerome|title=Introduction to Stochastic Approximation Algorithms|url=http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf|website=Polytechnique Montreal|publisher=Teaching Notes|accessdate=16 November 2016}}</ref>
::<math>\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math>
Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved that <math display="inline">\theta_n</math> converges in <math display="inline">L^2</math> (and hence also in probability) to <math display="inline">\theta</math>, provided that:
* <math display="inline">N(\theta)</math> is uniformly bounded,
* <math display="inline">M(\theta)</math> is nondecreasing,
* <math display="inline">M'(\theta^*)</math> exists and is positive, and
* the step sizes satisfy <math display="inline">\sum_{n} a_n = \infty</math> and <math display="inline">\sum_{n} a_n^2 < \infty</math>.
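The recursion above can be sketched as follows. This is a hedged illustration, not the canonical implementation: the linear measurement model <math display="inline">M(\theta) = 2\theta</math>, the noise level, and the target level <math display="inline">\alpha = 4</math> are all invented, so the root being sought is <math display="inline">\theta^* = 2</math>:

```python
import random

def noisy_measurement(theta):
    # Hypothetical N(theta): the unknown M(theta) = 2*theta observed
    # with additive Gaussian noise. The root of M(theta) = 4 is 2.
    return 2.0 * theta + random.gauss(0.0, 1.0)

alpha = 4.0
theta = 0.0
random.seed(1)
for n in range(1, 50_001):
    a_n = 1.0 / n  # satisfies sum a_n = inf and sum a_n**2 < inf
    # Robbins-Monro update: theta_{n+1} = theta_n - a_n * (N(theta_n) - alpha)
    theta -= a_n * (noisy_measurement(theta) - alpha)
# theta now approximates the root theta* = 2
```

Note that the step sizes <math display="inline">a_n = 1/n</math> shrink fast enough to suppress the noise, but slowly enough that their sum diverges and the iterate can travel any distance from its starting point.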
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being <math display="inline">\operatorname E[f(\theta_n) - f^*] = O(1/n)</math>, where <math display="inline">f^*</math> is the minimal value of <math display="inline">f(\theta)</math> over <math display="inline">\theta \in \Theta</math>.<ref name="jsacks">{{Cite journal | last1 = Sacks | first1 = J. | title = Asymptotic Distribution of Stochastic Approximation Procedures | doi = 10.1214/aoms/1177706619 | journal = The Annals of Mathematical Statistics | volume = 29 | issue = 2 | pages = 373–405 | year = 1958 | jstor = 2237335| pmid = | pmc = }}</ref><ref name="NJLS">{{Cite journal | last1 = Nemirovski | first1 = A. | authorlink1 = Arkadi Nemirovski| last2 = Juditsky | first2 = A. | last3 = Lan | first3 = G. | last4 = Shapiro | first4 = A. | title = Robust Stochastic Approximation Approach to Stochastic Programming | doi = 10.1137/070704277 | journal = SIAM Journal on Optimization | volume = 19 | issue = 4 | pages = 1574 | year = 2009 | pmid = | pmc = }}</ref>
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin showed that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>.
===Subsequent developments and Polyak–Ruppert Averaging===
'''''A1)''''' <math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math>
Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> yet with a more robust step size policy.<ref name="pj" /> Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin for general convex stochastic optimization problems.
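A sketch of the averaging scheme, under invented assumptions (the quadratic objective <math display="inline">f(\theta) = (\theta - 2)^2/2</math>, its noisy gradient, and the exponent <math display="inline">\alpha = 0.6</math> are all chosen for illustration): the algorithm runs the basic recursion with the longer steps <math display="inline">a_n = n^{-\alpha}</math> and reports the running average of the iterates rather than the last iterate.

```python
import random

def noisy_gradient(theta):
    # Hypothetical noisy gradient of f(theta) = (theta - 2)**2 / 2,
    # whose minimizer is theta* = 2.
    return (theta - 2.0) + random.gauss(0.0, 1.0)

random.seed(2)
theta = 0.0
running_sum = 0.0
N = 50_000
for n in range(1, N + 1):
    a_n = n ** -0.6  # longer steps a_n = n^{-alpha} with 0 < alpha < 1
    theta -= a_n * noisy_gradient(theta)
    running_sum += theta
theta_bar = running_sum / N  # Polyak-Ruppert average of the iterates
# theta_bar approximates the minimizer theta* = 2, with less
# sensitivity to the step-size constant than the last iterate alone
```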
A more general result is given in Chapter 11 of Kushner and Yin<ref>{{Cite book|url=https://www.springer.com/us/book/9780387008943|title=Stochastic Approximation and Recursive Algorithms and Applications|website=www.springer.com|access-date=2016-05-16|isbn=9780387008943|last1=Kushner|first1=Harold|last2=George Yin|first2=G.|date=2003-07-17}}</ref> by defining interpolated time <math display="inline">t_n=\sum_{i=0}^{n-1}a_i</math>, interpolated process <math display="inline">\theta^n(\cdot)</math> and interpolated normalized process <math display="inline">U^n(\cdot)</math> as
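The interpolated timescale can be illustrated with a short sketch. The step sizes and iterate values below are stand-ins invented for demonstration; the point is only the construction of <math display="inline">t_n = \sum_{i=0}^{n-1} a_i</math> and a piecewise-constant interpolation of the iterates on that timescale:

```python
import bisect

# Hypothetical step sizes a_i and iterates theta_i (stand-in values).
a = [1.0 / (i + 1) for i in range(10)]
theta = [2.0 - 2.0 ** -i for i in range(11)]

# Interpolated time: t_0 = 0 and t_n = sum_{i=0}^{n-1} a_i.
t = [0.0]
for a_i in a:
    t.append(t[-1] + a_i)

def interpolated_process(s):
    # Piecewise-constant interpolation: equals theta_n on [t_n, t_{n+1}).
    n = bisect.bisect_right(t, s) - 1
    return theta[min(n, len(theta) - 1)]

# interpolated_process(0.0) == theta[0]; interpolated_process(1.5) == theta[2]
```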
</math> almost surely.
Here are some intuitive explanations of these conditions. Suppose <math>H(\theta_n, X_{n+1})</math> is a uniformly bounded random variable. If the step sizes are summable, i.e. <math display="block">\sum_{n} \epsilon_n < \infty,
</math> then <math display="block">\theta_n - \theta_0 = -\sum_{i=0}^{n-1}\epsilon_i H(\theta_i, X_{i+1})
</math>is a bounded sequence, so the iteration cannot converge to <math>\theta^*</math> if the initial guess <math>\theta_0</math> is too far away from <math>\theta^*</math>.
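This boundedness is easy to demonstrate numerically (the dynamics below are invented: the bounded updates are random signs and the summable steps are <math display="inline">\epsilon_n = 1/(n+1)^2</math>, whose tail sum is <math display="inline">\pi^2/6 - 1 < 0.65</math>):

```python
import random

random.seed(3)
theta0 = 0.0
theta = theta0
total_step = 0.0
for n in range(1, 10_001):
    eps_n = 1.0 / (n + 1) ** 2       # summable: sum eps_n < inf
    H = random.choice([-1.0, 1.0])   # uniformly bounded update, |H| <= 1
    theta -= eps_n * H
    total_step += eps_n
# |theta - theta0| <= sum eps_n < pi**2/6 - 1 < 0.65, so the iterates
# can never leave a fixed neighborhood of the initial guess.
```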