Stochastic approximation: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 14:55, 25 November 2023 edit 212.186.132.66 (talk) →Application in stochastic optimization: Link ← Previous edit		Latest revision as of 08:32, 27 January 2025 edit undo 134.10.78.35 (talk) →Robbins–Monro algorithm Tag: Visual edit
(14 intermediate revisions by 5 users not shown)
Line 11: ==Robbins–Monro algorithm== The Robbins–Monro algorithm, introduced in 1951 by [[Herbert Robbins]] and [[John U. Monro#Personal life\|Sutton Monro]],<ref name="rm">{{Cite journal \| last1 = Robbins \| first1 = H. \| author-link = Herbert Robbins\| last2 = Monro \| first2 = S. \| doi = 10.1214/aoms/1177729586 \| title = A Stochastic Approximation Method \| journal = The Annals of Mathematical Statistics \| volume = 22 \| issue = 3 \| pages = 400 \| year = 1951 \| doi-access = free }}</ref> presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function <math display="inline">M(\theta)</math>, and a constant <math display="inline">\alpha</math>, such that the equation <math display="inline">M(\theta) = \alpha</math> has a unique root at <math display="inline">\theta^.</math>. It is assumed that while we cannot directly observe the function <math display="inline">M(\theta),</math>, we can instead obtain measurements of the random variable <math display="inline">N(\theta)</math> where <math display="inline">\operatorname E[N(\theta)] = M(\theta)</math>. The structure of the algorithm is to then generate iterates of the form: ::<math display="block">\theta_{n+1}=\theta_n - a_n(N(\theta_n) - \alpha)</math> Here, <math>a_1, a_2, \dots</math> is a sequence of positive step sizes. [[Herbert Robbins\|Robbins]] and Monro proved<ref name="rm" /><sup>, Theorem 2</sup> that <math>\theta_n</math> [[convergence of random variables\|converges]] in <math>L^2</math> (and hence also in probability) to <math>\theta^</math>, and Blum<ref name=":0">{{Cite journal\|last=Blum\|first=Julius R.\|date=1954-06-01\|title=Approximation Methods which Converge with Probability one\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=25\|issue=2\|pages=382–386\|doi=10.1214/aoms/1177728794\|issn=0003-4851\|doi-access=free}}</ref> later proved the convergence is actually with probability one, provided that: Line 20: * <math display="inline">M'(\theta^)</math> exists and is positive, and The sequence <math display="inline">a_n</math> satisfies the following requirements: ~~:: <math>\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math>~~ <math display="block">\qquad \sum^{\infty}_{n=0}a_n = \infty \quad \mbox{ and } \quad \sum^{\infty}_{n=0}a^2_n < \infty \quad </math> A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form: <math display="inline">a_n=a/n</math>, for <math display="inline"> a > 0 </math>. Other series, such as <math>a_n = \frac{1}{n \ln n}, \frac{1}{n \ln n \ln\ln n}, \dots</math> are possible but in order to average out the noise in <math display="inline">N(\theta)</math>, the above condition must be met. === Example === Consider the problem of estimating the mean <math>\theta^</math> of a probability distribution from a stream of independent samples <math>X_1, X_2, \dots</math>. Let <math>N(\theta) := \theta - X</math>, then the unique solution to <math display="inline">\operatorname E[N(\theta)] = 0</math> is the desired mean <math>\theta^</math>. The RM algorithm gives us<math display="block">\theta_{n+1}=\theta_n - a_n(\theta_n - X_n) </math>This is equivalent to [[stochastic gradient descent]] with loss function <math>L(\theta) = \frac 12 \\|X - \theta\\|^2 </math>. It is also equivalent to a weighted average:<math display="block">\theta_{n+1}=(1-a_n)\theta_n + a_n X_n </math>In general, if there exists some function <math>L</math> such that <math>\nabla L(\theta) = N(\theta) - \alpha </math>, then the Robbins–Monro algorithm is equivalent to stochastic gradient descent with loss function <math>L(\theta)</math>. However, the RM algorithm does not require <math>L</math> to exist in order to converge. ===Complexity results=== Line 31 ⟶ 36: While the Robbins–Monro algorithm is theoretically able to achieve <math display="inline"> O(1/n)</math> under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily due to the fact that the algorithm is very sensitive to the choice of the step size sequence, and the supposed asymptotically optimal step size policy can be quite harmful in the beginning.<ref name="NJLS" /><ref name="jcsbook">[https://books.google.com/books?id=f66OIvvkKnAC&q=%22Robbins-Monro%22 Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control], J.C. Spall, ''John Wiley'' ''Hoboken, NJ'', (2003).</ref> Chung (1954)<ref>{{Cite journal\|last=Chung\|first=K. L.\|date=1954-09-01\|title=On a Stochastic Approximation Method\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=25\|issue=3\|pages=463–483\|doi=10.1214/aoms/1177728716\|issn=0003-4851\|doi-access=free}}</ref> and Fabian (1968)<ref>{{Cite journal\|last=Fabian\|first=Vaclav\|date=1968-08-01\|title=On Asymptotic Normality in Stochastic Approximation\|journal=The Annals of Mathematical Statistics\|language=EN\|volume=39\|issue=4\|pages=1327–1332\|doi=10.1214/aoms/1177698258\|issn=0003-4851\|doi-access=free}}</ref> showed that we would achieve optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> with <math display="inline">a_n=\bigtriangledown^2f(\theta^)^{-1}/n</math> (or <math display="inline">a_n=\frac{1}{(nM'(\theta^))}</math>). Lai and Robbins<ref>{{Cite journal\|last1=Lai\|first1=T. L.\|last2=Robbins\|first2=Herbert\|date=1979-11-01\|title=Adaptive Design and Stochastic Approximation\|journal=The Annals of Statistics\|language=EN\|volume=7\|issue=6\|pages=1196–1221\|doi=10.1214/aos/1176344840\|issn=0090-5364\|doi-access=free}}</ref><ref>{{Cite journal\|last1=Lai\|first1=Tze Leung\|last2=Robbins\|first2=Herbert\|date=1981-09-01\|title=Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes\|journal=Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete\|language=en\|volume=56\|issue=3\|pages=329–360\|doi=10.1007/BF00536178\|s2cid=122109044\|issn=0044-3719\|doi-access=free}}</ref> designed adaptive procedures to estimate <math display="inline">M'(\theta^)</math> such that <math display="inline">\theta_n</math> has minimal asymptotic variance. However the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991)<ref>{{Cite journal\|last=Polyak\|first=B T\|date=~~1990-01-01~~1991\|title=New stochastic approximation type procedures. (In Russian.)\|url=https://www.researchgate.net/publication/236736759\|journal=Automation and Remote Control\|volume=7\|issue=7}}</ref> and Ruppert (1988)<ref>{{Cite ~~journal~~report\|last=Ruppert\|first=D.David\|title=Efficient estimators from a slowly converging robbins-monro process\|url=https://www.researchgate.net/publication/242608650\|type=Technical Report 781\|publisher=Cornell University School of Operations Research and Industrial Engineering\|year=1988}}</ref> independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky<ref name="pj">{{Cite journal \| last1 = Polyak \| first1 = B. T. \| last2 = Juditsky \| first2 = A. B. \| doi = 10.1137/0330046 \| title = Acceleration of Stochastic Approximation by Averaging \| journal = SIAM Journal on Control and Optimization \| volume = 30 \| issue = 4 \| pages = 838 \| year = 1992 }}</ref> also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps, and averaging of the iterates. The algorithm would have the following structure:<math display="block"> \theta_{n+1} - \theta_n = a_n(\alpha - N(\theta_n)), \qquad \bar{\theta}_n = \frac{1}{n} \sum^{n-1}_{i=0} \theta_i </math>The convergence of <math> \bar{\theta}_n </math> to the unique root <math>\theta^</math> relies on the condition that the step sequence <math>\{a_n\}</math> decreases sufficiently slowly. That is '''''A1)''''' ''<math display="block"> a_n \rightarrow 0, \qquad \frac{a_n - a_{n+1}}{a_n} = o(a_n)</math> Therefore, the sequence <math display="inline">a_n = n^{-\alpha}</math> with <math display="inline">0 < \alpha < 1</math> satisfies this restriction, but <math display="inline">\alpha = 1</math> does not, hence the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting modification will result in the same asymptotically optimal convergence rate <math display="inline">O(1/\sqrt{n})</math> yet with a more robust step size policy.<ref name="pj" /> Prior to this, the idea of using longer steps and averaging the iterates had already been proposed by Nemirovski and Yudin<ref name="NY">On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions, A. Nemirovski and D. Yudin, ''Dokl. Akad. Nauk SSR'' '''2939''', (1978 (Russian)), Soviet Math. Dokl. '''19''' (1978 (English)).</ref> for the cases of solving the stochastic optimization problem with continuous convex objectives and for convex-concave saddle point problems. These algorithms were observed to attain the nonasymptotic rate <math display="inline">O(1/\sqrt{n})</math>. Line 43 ⟶ 48: With assumption '''A1)''' and the following '''A2)''' '''''A2)''''' ''There is a Hurwitz matrix <math display="inline">A</math> and a symmetric and positive-definite matrix <math display="inline">\Sigma</math> such that <math display="inline">\{U^n(\cdot)\}</math> converges weakly to <math display="inline">U(\cdot)</math>, where <math display="inline">U(\cdot)</math> is the statisolution to'' <math display="block">dU = AU \, dt +\Sigma^{1/2} \, dw</math>where <math display="inline">w(\cdot)</math> is a standard Wiener process.'' satisfied, and define ''<math display="inline">\bar{V}=(A^{-1})'\Sigma(A')^{-1}</math>''. Then for each ''<math display="inline">t</math>'', Line 52 ⟶ 57: === Application in stochastic optimization === Suppose we want to solve the following stochastic optimization problem<math display="block">g(\theta^) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^</math> if one can generate <math>(X_n)_{n\geq 0}▼ ~~Suppose we want to solve the following stochastic optimization problem~~ ▲<math display="block">g(\theta^) = \min_{\theta\in\Theta}\operatorname{E}[Q(\theta,X)],</math>where <math display="inline">g(\theta) = \operatorname{E}[Q(\theta,X)]</math> is differentiable and convex, then this problem is equivalent to find the root <math>\theta^</math> of <math>\nabla g(\theta) = 0</math>. Here <math>Q(\theta,X)</math> can be interpreted as some "observed" cost as a function of the chosen <math>\theta</math> and random effects <math>X</math>. In practice, it might be hard to get an analytical form of <math>\nabla g(\theta)</math>, Robbins–Monro method manages to generate a sequence <math>(\theta_n)_{n\geq 0}</math> to approximate <math>\theta^</math> if one can generate <math>(X_n)_{n\geq 0} </math> , in which the conditional expectation of <math>X_n Line 71 ⟶ 74: The following result gives sufficient conditions on <math>\theta_n </math> for the algorithm to converge:<ref>{{Cite book\|title=Numerical Methods for stochastic Processes\|last1=Bouleau\|first1=N.\|author-link=Nicolas Bouleau\|last2=Lepingle\|first2=D.\|publisher=John Wiley\|year=1994\|isbn=9780471546412\|___location=New York\|url=https://books.google.com/books?id=9MLL2RN40asC}}</ref> C1) <math>\varepsilon_n \geq 0, \forall\; n\geq 0. </math> Line 99 ⟶ 102: ==Kiefer–Wolfowitz algorithm== The Kiefer–Wolfowitz algorithm was introduced in 1952 by [[Jacob Wolfowitz]] and [[Jack_Kiefer_(statistician)\|Jack Kiefer]],<ref name = "KW">{{Cite journal \| last1 = Kiefer \| first1 = J. \| last2 = Wolfowitz \| first2 = J. \| doi = 10.1214/aoms/1177729392 \| title = Stochastic Estimation of the Maximum of a Regression Function \| journal = The Annals of Mathematical Statistics \| volume = 23 \| issue = 3 \| pages = 462 \| year = 1952 \| doi-access = free }}</ref> and was motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let <math>M(x) </math> be a function which has a maximum at the point <math>\theta </math>. It is assumed that <math>M(x)</math> is unknown; however, certain observations <math>N(x)</math>, where <math>\operatorname E[N(x)] = M(x)</math>, can be made at any point <math>x</math>. The structure of the algorithm follows a gradient-like method, with the iterates being generated as ~~follows:~~ ::<math> x_{n+1} = x_n + a_n \~~bigg~~cdot\left(\frac{N(x_n + c_n) - N(x_n -c_n)}{2 c_n} \~~bigg~~right) </math> where <math>N(x_n+c_n)</math> and <math>N(x_n-c_n)</math> are independent,. ~~and~~At every step, the gradient of <math>M(x)</math> is approximated ~~using~~akin ~~finite~~to ~~differences~~a [[Finite difference#Basic types\|central difference method]] with <math>h=2c_n</math>. ~~The~~So the sequence <math>\{c_n\}</math> specifies the sequence of finite difference widths used for the gradient approximation, while the sequence <math>\{a_n\}</math> specifies a sequence of positive step sizes taken along that direction. Kiefer and Wolfowitz proved that, if <math>M(x)</math> satisfied certain regularity conditions, then <math>x_n</math> will converge to <math>\theta</math> in probability as <math>n\to\infty </math>, and later Blum<ref name=":0" /> in 1954 showed <math>x_n</math> converges to <math>\theta</math> almost surely, provided that: * <math>\operatorname{Var}(N(x))\le S<\infty</math> for all <math>x</math>. Line 129 ⟶ 136: stochastic approximation to [[Robust statistics\|robust]] [[estimation]].<ref>{{Cite journal \| last1 = Martin \| first1 = R. \| last2 = Masreliez \| first2 = C. \| doi = 10.1109/TIT.1975.1055386 \| title = Robust estimation via stochastic approximation \| journal = IEEE Transactions on Information Theory \| volume = 21 \| issue = 3 \| pages = 263 \| year = 1975 }}</ref> The main tool for analyzing stochastic approximations algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by [[Aryeh Dvoretzky]] published in ~~the proceedings of the third Berkeley symposium on mathematical statistics and probability,~~ 1956.<ref>{{~~Cite~~cite journal\|last=Dvoretzky\|first=Aryeh\|date=1956-01-01\|title=On Stochastic Approximation\|url=http://projecteuclid.org/euclid.bsmsp/1200501645\|language=EN\|publisher=The Regents of the University of California}}</ref>conference \| last = Dvoretzky \| first = Aryeh \| author-link = Aryeh Dvoretzky \| editor-last = Neyman \| editor-first = Jerzy \| editor-link = Jerzy Neyman \| contribution = On stochastic approximation \| contribution-url = https://projecteuclid.org/euclid.bsmsp/1200501645 \| mr = 84911 \| pages = 39–55 \| publisher = University of California Press \| title = Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I \| year = 1956}}</ref> ==See also==