Stochastic approximation

{{technical|date=June 2012}}
'''Stochastic approximation''' methods are a family of [[Iterative methods|iterative methods]] typically used for [[Root-finding|root-finding]] problems or for [[optimization problem|optimization]] problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating [[Extremum|extreme values]] of functions which cannot be computed directly, but only estimated via noisy observations.
 
In a nutshell, stochastic approximation algorithms deal with a function of the form <math display="inline"> f(\theta) = \operatorname E_{\xi} [F(\theta,\xi)] </math>
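As a concrete sketch of this setup, consider a hypothetical quadratic example (the choice <math display="inline">F(\theta,\xi) = (\theta-\xi)^2</math> with <math display="inline">\xi \sim N(1,1)</math> and the step-size constants below are illustrative assumptions, not taken from the literature cited here): <math display="inline">f(\theta) = \operatorname E[(\theta-\xi)^2]</math> cannot be evaluated exactly, but unbiased gradient samples <math display="inline">2(\theta-\xi)</math> are cheap to draw.

```python
import random

# Illustrative (hypothetical) setup: F(theta, xi) = (theta - xi)^2 with
# xi ~ N(1, 1), so f(theta) = E[F(theta, xi)] = (theta - 1)^2 + 1,
# minimized at theta* = 1. Only noisy gradient samples are used.
def stochastic_gradient(theta, xi):
    return 2.0 * (theta - xi)  # d/dtheta F(theta, xi)

def sgd(n_steps=10000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_steps + 1):
        xi = rng.gauss(1.0, 1.0)  # one noisy observation per step
        a_n = 0.5 / n             # steps with sum a_n = inf, sum a_n^2 < inf
        theta -= a_n * stochastic_gradient(theta, xi)
    return theta

print(sgd())  # close to the true minimizer 1.0
```

With this particular step size the recursion reduces to a running average of the samples <math display="inline">\xi_n</math>, which makes the convergence to the minimizer transparent.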
These applications range from [[stochastic optimization]] methods and algorithms,
to online forms of the [[Expectation–maximization algorithm| EM algorithm]], reinforcement learning
via [[Temporal difference learning|temporal differences]], [[Deep learning|deep learning]], and others.<ref>{{cite journal |last1=Toulis |first1=Panos |first2=Edoardo |last2=Airoldi|title=Scalable estimation strategies based on stochastic approximations: classical results and new insights |journal=Statistics and Computing |volume=25 |issue=4 |year=2015 |pages=781–795|url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484776/|doi=10.1007/s11222-015-9560-y}}</ref>
Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.<ref>{{cite web|last1=Le Ny|first1=Jerome|title=Introduction to Stochastic Approximation Algorithms|url=http://www.professeurs.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf|website=Polytechnique Montreal|publisher=Teaching Notes|accessdate=16 November 2016}}</ref>
 
::<math>\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math>
 
Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins|Robbins]] and Monro proved <ref name="rm" /><sup>, Theorem 2</sup> that <math>\theta_n</math> [[convergence of random variables|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta</math>, and Blum<ref name=":0">{{Cite journal|last=Blum|first=Julius R.|date=1954-06-01|title=Approximation Methods which Converge with Probability one|journal=The Annals of Mathematical Statistics|language=EN|volume=25|issue=2|pages=382–386|doi=10.1214/aoms/1177728794|issn=0003-4851}}</ref> later proved the convergence is actually with probability one, provided that:
* <math display="inline">N(\theta)</math> is uniformly bounded,
* <math display="inline">M(\theta)</math> is nondecreasing,
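The recursion <math display="inline">\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math> can be sketched in a few lines. In the following hypothetical example, <math display="inline">M(\theta) = 2\theta + 3</math> (nondecreasing), the target level is <math display="inline">\alpha = 5</math>, and the observation noise is bounded, so the root is <math display="inline">\theta^* = 1</math>; these choices are for illustration only.

```python
import random

# Hypothetical example: M(theta) = 2*theta + 3, target alpha = 5,
# so the sought root is theta* = 1. We can only observe
# N(theta) = M(theta) + bounded noise.
def robbins_monro(alpha=5.0, n_steps=5000, seed=1):
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, n_steps + 1):
        noise = rng.uniform(-1.0, 1.0)     # bounded observation noise
        N_obs = 2.0 * theta + 3.0 + noise  # noisy measurement N(theta_n)
        a_n = 1.0 / (n + 1)                # positive steps with sum a_n = inf
        theta -= a_n * (N_obs - alpha)     # theta_{n+1} = theta_n - a_n (N(theta_n) - alpha)
    return theta

print(robbins_monro())  # close to the root theta* = 1.0
```

The diminishing steps average out the noise while still moving <math display="inline">\theta_n</math> toward the root, mirroring the conditions on <math display="inline">a_n</math> in the convergence theorem.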
===Complexity results===
#If <math display="inline">f(\theta)</math> is twice continuously differentiable, and strongly convex, and the minimizer of <math display="inline">f(\theta)</math> belongs to the interior of <math display="inline">\Theta</math>, then the Robbins–Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, being <math display="inline">\operatorname E[f(\theta_n) - f^*] = O(1/n)</math>, where <math display="inline">f^*</math> is the minimal value of <math display="inline">f(\theta)</math> over <math display="inline">\theta \in \Theta</math>.<ref name="jsacks">{{Cite journal | last1 = Sacks | first1 = J. | title = Asymptotic Distribution of Stochastic Approximation Procedures | doi = 10.1214/aoms/1177706619 | journal = The Annals of Mathematical Statistics | volume = 29 | issue = 2 | pages = 373–405 | year = 1958 | jstor = 2237335| pmid = | pmc = }}</ref><ref name="NJLS">{{Cite journal | last1 = Nemirovski | first1 = A. | authorlink1 = Arkadi Nemirovski| last2 = Juditsky | first2 = A. | last3 = Lan | first3 = G. | last4 = Shapiro | first4 = A. | title = Robust Stochastic Approximation Approach to Stochastic Programming | doi = 10.1137/070704277 | journal = SIAM Journal on Optimization | volume = 19 | issue = 4 | pages = 1574 | year = 2009 | pmid = | pmc = }}</ref>
# Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin<ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley-Intersci. Ser. Discrete Math.'' '''15''', John Wiley, New York (1983).</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>, and that this rate cannot be improved.
 
===Subsequent developments and Polyak–Ruppert Averaging===
'''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math>
 
Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> yet with a more robust step size policy.<ref name="pj" /> Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin <ref name="NY">On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' '''2939''', (1978 (Russian)), Soviet Math. Dokl. '''19''' (1978 (English)).</ref> for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate <math display="inline">O(1/\sqrt{n})</math>.
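The longer-step-plus-averaging idea can be sketched as follows (the quadratic objective, the exponent <math display="inline">\alpha = 0.7</math>, and the step constant are hypothetical choices for illustration, not prescriptions from the cited work):

```python
import random

# Hypothetical setup: minimize f(theta) = E[(theta - xi)^2] with xi ~ N(2, 1),
# so theta* = 2. "Longer" steps a_n = c * n^(-alpha), alpha in (0, 1), are
# combined with Polyak-Ruppert averaging of the iterates.
def polyak_ruppert(n_steps=20000, seed=2):
    rng = random.Random(seed)
    theta = 0.0
    running_sum = 0.0
    for n in range(1, n_steps + 1):
        xi = rng.gauss(2.0, 1.0)
        a_n = 0.5 * n ** -0.7              # decays slower than 1/n
        theta -= a_n * 2.0 * (theta - xi)  # noisy gradient step
        running_sum += theta
    return running_sum / n_steps           # averaged iterate theta_bar_n

print(polyak_ruppert())  # close to theta* = 2.0
```

The individual iterates bounce around the minimizer because the steps shrink slowly, but the average of the iterates smooths out this variability, which is the point of the Polyak–Ruppert modification.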
 
A more general result is given in Chapter 11 of Kushner and Yin<ref>{{Cite book|url=https://www.springer.com/us/book/9780387008943|title=Stochastic Approximation and Recursive Algorithms and Applications|publisher=Springer|access-date=2016-05-16|isbn=9780387008943|last1=Kushner|first1=Harold|last2=George Yin|first2=G.|date=2003-07-17}}</ref> by defining interpolated time <math display="inline">t_n=\sum_{i=0}^{n-1}a_i</math>, interpolated process <math display="inline">\theta^n(\cdot)</math> and interpolated normalized process <math display="inline">U^n(\cdot)</math> as
</math> almost surely.
 
Here is some intuition behind these conditions. Suppose <math>H(\theta_n, X_{n+1})</math> is a uniformly bounded random variable. If C2) is not satisfied, i.e. <math>\sum_{n=0}^{\infty}\epsilon_n < \infty
</math>, then<math display="block">\theta_n - \theta_0 = -\sum_{i=0}^{n-1}\epsilon_i H(\theta_i, X_{i+1})
</math>is a bounded sequence, so the iteration cannot converge to <math>\theta^*