Stochastic approximation

{{technical|date=June 2012}}
'''Stochastic approximation''' methods are a family of [[Iterative methods|iterative methods]] typically used for [[Root-finding|root-finding]] problems or for [[optimization problem|optimization]] problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating [[Extremum|extreme values]] of functions which cannot be computed directly, but only estimated via noisy observations.
 
In a nutshell, stochastic approximation algorithms deal with a function of the form <math display="inline"> f(\theta) = \operatorname E_{\xi} [F(\theta,\xi)] </math>
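As a concrete sketch of this setup, consider a hypothetical quadratic example (the choice <math display="inline">F(\theta,\xi) = (\theta-\xi)^2</math> with <math display="inline">\xi \sim N(1,1)</math> and the step-size constants below are illustrative assumptions, not taken from the literature cited here): <math display="inline">f(\theta) = \operatorname E[(\theta-\xi)^2]</math> cannot be evaluated exactly, but unbiased gradient samples <math display="inline">2(\theta-\xi)</math> are cheap to draw.

```python
import random

# Illustrative (hypothetical) setup: F(theta, xi) = (theta - xi)^2 with
# xi ~ N(1, 1), so f(theta) = E[F(theta, xi)] = (theta - 1)^2 + 1,
# minimized at theta* = 1. Only noisy gradient samples are used.
def stochastic_gradient(theta, xi):
    return 2.0 * (theta - xi)  # d/dtheta F(theta, xi)

def sgd(n_steps=10000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_steps + 1):
        xi = rng.gauss(1.0, 1.0)  # one noisy observation per step
        a_n = 0.5 / n             # steps with sum a_n = inf, sum a_n^2 < inf
        theta -= a_n * stochastic_gradient(theta, xi)
    return theta

print(sgd())  # close to the true minimizer 1.0
```

With this particular step size the recursion reduces to a running average of the samples <math display="inline">\xi_n</math>, which makes the convergence to the minimizer transparent.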
These applications range from [[stochastic optimization]] methods and algorithms,
to online forms of the [[Expectation–maximization algorithm| EM algorithm]], reinforcement learning
via [[Temporal difference learning|temporal differences]], [[Deep learning|deep learning]], and others.<ref>{{cite journal |last1=Toulis |first1=Panos |first2=Edoardo |last2=Airoldi|title=Scalable estimation strategies based on stochastic approximations: classical results and new insights |journal=Statistics and Computing |volume=25 |issue=4 |year=2015 |pages=781–795|url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484776/|doi=10.1007/s11222-015-9560-y}}</ref>
Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.<ref>{{cite web|last1=Le Ny|first1=Jerome|title=Introduction to Stochastic Approximation Algorithms|url=http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf|website=Polytechnique Montreal|publisher=Teaching Notes|accessdate=16 November 2016}}</ref>
 
::<math>\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math>
 
Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved <ref name="rm" /><sup>, Theorem 2</sup> that <math>\theta_n</math> [[convergence of random variables|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta</math>, and Blum<ref name=":0">{{Cite journal|last=Blum|first=Julius R.|date=1954-06-01|title=Approximation Methods which Converge with Probability one|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=2|pages=382–386|doi=10.1214/aoms/1177728794|issn=0003-4851}}</ref> later proved the convergence is actually with probability one, provided that:
* <math display="inline">N(\theta)</math> is uniformly bounded,
* <math display="inline">M(\theta)</math> is nondecreasing,
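The recursion <math display="inline">\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math> can be sketched in a few lines. In the following hypothetical example, <math display="inline">M(\theta) = 2\theta + 3</math> (nondecreasing), the target level is <math display="inline">\alpha = 5</math>, and the observation noise is bounded, so the root is <math display="inline">\theta^* = 1</math>; these choices are for illustration only.

```python
import random

# Hypothetical example: M(theta) = 2*theta + 3, target alpha = 5,
# so the sought root is theta* = 1. We can only observe
# N(theta) = M(theta) + bounded noise.
def robbins_monro(alpha=5.0, n_steps=5000, seed=1):
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_steps + 1):
        noise = rng.uniform(-1.0, 1.0)     # bounded observation noise
        N_obs = 2.0 * theta + 3.0 + noise  # noisy measurement N(theta_n)
        a_n = 1.0 / (n + 1)                # positive steps with sum a_n = inf
        theta -= a_n * (N_obs - alpha)     # theta_{n+1} = theta_n - a_n (N(theta_n) - alpha)
    return theta

print(robbins_monro())  # close to the root theta* = 1.0
```

The diminishing steps average out the noise while still moving <math display="inline">\theta_n</math> toward the root, mirroring the conditions on <math display="inline">a_n</math> in the convergence theorem.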
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being <math display="inline">\operatorname E[f(\theta_n) - f^*] = O(1/n)</math>, where <math display="inline">f^*</math> is the minimal value of <math display="inline">f(\theta)</math> over <math display="inline">\theta \in \Theta</math>.<ref name="jsacks">{{Cite journal | last1 = Sacks | first1 = J. | title = Asymptotic Distribution of Stochastic Approximation Procedures | doi = 10.1214/aoms/1177706619 | journal = The Annals of Mathematical Statistics | volume = 29 | issue = 2 | pages = 373–405 | year = 1958 | jstor = 2237335| pmid = | pmc = }}</ref><ref name="NJLS">{{Cite journal | last1 = Nemirovski | first1 = A. | authorlink1 = Arkadi Nemirovski| last2 = Juditsky | first2 = A. | last3 = Lan | first3 = G. | last4 = Shapiro | first4 = A. | title = Robust Stochastic Approximation Approach to Stochastic Programming | doi = 10.1137/070704277 | journal = SIAM Journal on Optimization | volume = 19 | issue = 4 | pages = 1574 | year = 2009 | pmid = | pmc = }}</ref>
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin<ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley-Intersci. Ser. Discrete Math.'' '''15''', John Wiley, New York (1983).</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>, and that this rate cannot be improved.
 
===Subsequent developments and Polyak–Ruppert Averaging===
'''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math>
 
Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> yet with a more robust step size policy.<ref name="pj" /> Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin <ref name="NY">On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' '''2939''', (1978 (Russian)), Soviet Math. Dokl. '''19''' (1978 (English)).</ref> for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate <math display="inline">O(1/\sqrt{n})</math>.
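The longer-step-plus-averaging idea can be sketched as follows (the quadratic objective, the exponent <math display="inline">\alpha = 0.7</math>, and the step constant are hypothetical choices for illustration, not prescriptions from the cited work):

```python
import random

# Hypothetical setup: minimize f(theta) = E[(theta - xi)^2] with xi ~ N(2, 1),
# so theta* = 2. "Longer" steps a_n = c * n^(-alpha), alpha in (0, 1), are
# combined with Polyak-Ruppert averaging of the iterates.
def polyak_ruppert(n_steps=20000, seed=2):
    rng = random.Random(seed)
    theta = 0.0
    running_sum = 0.0
    for n in range(1, n_steps + 1):
        xi = rng.gauss(2.0, 1.0)
        a_n = 0.5 * n ** -0.7              # decays slower than 1/n
        theta -= a_n * 2.0 * (theta - xi)  # noisy gradient step
        running_sum += theta
    return running_sum / n_steps           # averaged iterate theta_bar_n

print(polyak_ruppert())  # close to theta* = 2.0
```

The individual iterates bounce around the minimizer because the steps shrink slowly, but the average of the iterates smooths out this variability, which is the point of the Polyak–Ruppert modification.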
 
A more general result is given in Chapter 11 of Kushner and Yin<ref>{{Cite book|url=https://www.springer.com/us/book/9780387008943|title=Stochastic Approximation and Recursive Algorithms and Applications|publisher=Springer|access-date=2016-05-16|isbn=9780387008943|last1=Kushner|first1=Harold|last2=George Yin|first2=G.|date=2003-07-17}}</ref> by defining interpolated time <math display="inline">t_n=\sum_{i=0}^{n-1}a_i</math>, interpolated process <math display="inline">\theta^n(\cdot)</math> and interpolated normalized process <math display="inline">U^n(\cdot)</math> as
</math> almost surely.
 
Here is some intuition behind these conditions. Suppose <math>H(\theta_n, X_{n+1})</math> is a uniformly bounded random variable. If C2) is not satisfied, i.e. <math>\sum_{n=0}^{\infty}\epsilon_n < \infty
</math>, then<math display="block">\theta_n - \theta_0 = -\sum_{i=0}^{n-1}\epsilon_i H(\theta_i, X_{i+1})
</math>is a bounded sequence, so the iteration cannot converge to <math>\theta^*