# Conversely, in the general convex case, where both smoothness and strong convexity are lacking, Nemirovski and Yudin<ref name="NYcomp">Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin, ''Wiley-Intersci. Ser. Discrete Math.'' '''15''', John Wiley, New York (1983).</ref> have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is <math display="inline">O(1/\sqrt{n})</math>; they have also proven that this rate cannot be improved.
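As a toy illustration of the nonsmooth convex regime (a hypothetical sketch, not taken from the cited work), the subgradient method with the step size <math display="inline">c/\sqrt{n}</math> applied to <math display="inline">f(x)=|x|</math>, which is convex but not smooth at the minimizer, drives the best objective value seen so far toward zero at roughly the <math display="inline">O(1/\sqrt{n})</math> rate:

```python
import math

def subgradient(x):
    # A subgradient of f(x) = |x|; f is convex but not differentiable at 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def subgradient_method(x0, c, n_iters):
    """Subgradient method with the classical nonsmooth step size c / sqrt(n).

    Returns the best objective value f(x) = |x| observed over the run,
    which is the quantity the O(1/sqrt(n)) guarantee applies to.
    """
    x = x0
    best = abs(x)
    for n in range(1, n_iters + 1):
        x -= (c / math.sqrt(n)) * subgradient(x)
        best = min(best, abs(x))  # track the best value so far
    return best

# The best value after n steps decays on the order of 1/sqrt(n).
print(subgradient_method(5.0, 1.0, 10_000))
```

The starting point `x0 = 5.0` and constant `c = 1.0` are arbitrary choices for the illustration; the iterates overshoot and oscillate around the minimizer, which is why the guarantee is stated for the best (or averaged) value rather than the last iterate.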
===Subsequent developments and Polyak–Ruppert averaging===
While the Robbins–Monro algorithm is theoretically able to achieve <math display="inline"> O(1/n)</math> under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly in practice. This is primarily because the algorithm is very sensitive to the choice of the step size sequence, and the supposedly asymptotically optimal step size policy can be quite harmful in the beginning.<ref name="NJLS" /><ref name="jcsbook">[https://books.google.com/books?id=f66OIvvkKnAC&printsec=frontcover#v=onepage&q=%22Robbins-Monro%22&f=false Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control], J.C. Spall, John Wiley, Hoboken, NJ (2003).</ref>
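This step-size sensitivity can be seen on a toy root-finding problem (a hypothetical example, not from the cited references): Robbins–Monro with steps <math display="inline">a_n = a/n</math> applied to <math display="inline">M(\theta) = \theta - 1 = 0</math>, observed with additive Gaussian noise. When the constant <math display="inline">a</math> is too small relative to the slope of <math display="inline">M</math>, the iterates crawl toward the root:

```python
import random

def robbins_monro(a, theta0, n_iters, rng):
    """Robbins-Monro iteration theta <- theta - (a/n) * H(theta, X_n)
    for the toy problem M(theta) = theta - 1 = 0 (root theta* = 1),
    where each observation carries additive N(0, 1) noise."""
    theta = theta0
    for n in range(1, n_iters + 1):
        noisy_obs = (theta - 1.0) + rng.gauss(0.0, 1.0)  # H(theta_n, X_n)
        theta -= (a / n) * noisy_obs
    return theta

rng = random.Random(0)  # fixed seed so the run is reproducible
slow = robbins_monro(0.1, 10.0, 10_000, rng)  # step constant far too small
good = robbins_monro(1.0, 10.0, 10_000, rng)  # well-scaled step constant
print(abs(slow - 1.0), abs(good - 1.0))
```

With `a = 0.1` the deterministic part of the error decays only like <math display="inline">n^{-0.1}</math>, so after 10,000 iterations the iterate is still far from the root, while `a = 1.0` leaves only noise of order <math display="inline">1/\sqrt{n}</math>; this mirrors the text's point that the "optimal" <math display="inline">1/n</math> schedule is fragile when its constant is mis-scaled.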
With assumption '''A1)''' and the following '''A2)'''
'''''A2)''''' ''There is a Hurwitz matrix <math display="inline">A</math> and a symmetric and positive-definite matrix <math display="inline">\Sigma</math> such that <math display="inline">\{U^n(\cdot)\}</math> converges weakly to <math display="inline">U(\cdot)</math>, where <math display="inline">U(\cdot)</math> is the stationary solution to''<math display="block">dU = AU\,dt + \Sigma^{1/2}\,dw,</math>''where <math display="inline">w(\cdot)</math> is a standard Wiener process,''
satisfied, and define ''<math display="inline">\bar{V}=A^{-1}\Sigma(A')^{-1}</math>''. Then for each ''<math display="inline">t</math>'',
''<math display="block">\hat{U}^n(t)\stackrel{\mathcal{D}}{\longrightarrow}\mathcal{N}(0,V_t),\quad \text{where } V_t=\bar{V}/t+O(1/t^2).</math>''
The success of the averaging idea stems from the time scale separation between the original sequence ''<math display="inline">\{\theta_n\}</math>'' and the averaged sequence ''<math display="inline">\{\Theta_n\}</math>'', with the time scale of the former being faster.
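A minimal numerical sketch of the averaging idea (a toy setup with assumed parameters, not from the source): the raw iterate is run on the faster time scale with the longer step <math display="inline">a_n = n^{-0.6}</math>, and the slowly-varying Polyak–Ruppert average <math display="inline">\Theta_n = \tfrac{1}{n}\sum_{i=1}^n \theta_i</math> is computed alongside it, again on the toy problem <math display="inline">M(\theta)=\theta-1=0</math> with Gaussian observation noise:

```python
import random

def rm_with_averaging(theta0, n_iters, rng):
    """Robbins-Monro with step a_n = n**(-0.6) (faster time scale) plus
    Polyak-Ruppert averaging of the iterates (slower time scale),
    on the toy problem M(theta) = theta - 1 = 0 with N(0, 1) noise."""
    theta = theta0
    running_sum = 0.0
    for n in range(1, n_iters + 1):
        noisy_obs = (theta - 1.0) + rng.gauss(0.0, 1.0)
        theta -= noisy_obs / n ** 0.6  # longer-than-1/n step
        running_sum += theta
    # Return the last raw iterate and the averaged iterate.
    return theta, running_sum / n_iters

rng = random.Random(0)  # fixed seed for reproducibility
last, avg = rm_with_averaging(10.0, 50_000, rng)
print(abs(last - 1.0), abs(avg - 1.0))
```

The raw iterate keeps fluctuating with a variance of order <math display="inline">a_n</math>, while the average smooths those fluctuations out and concentrates near the root with variance of order <math display="inline">1/n</math>, illustrating the two time scales described above.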