===Implicit updates (ISGD)===
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the [[learning rate]] {{mvar|η}}: fast convergence requires large learning rates, but these can induce numerical instability. The problem can be largely solved by considering ''implicit updates'', whereby the stochastic gradient is evaluated at the next iterate rather than the current one:
:<math>w^{new} := w^{old} - \eta \nabla Q_i(w^{new}).</math>
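For illustration, in the least-squares case with <math>Q_i(w) = \tfrac{1}{2}(y_i - x_i^\mathsf{T} w)^2</math> the implicit equation can be solved in closed form, giving <math>w^{new} = w^{old} + \tfrac{\eta}{1 + \eta \|x_i\|^2}\,(y_i - x_i^\mathsf{T} w^{old})\, x_i</math>. The sketch below applies this update; the least-squares setting and all names are illustrative choices rather than a cited implementation.

<syntaxhighlight lang="python">
import numpy as np

def isgd_least_squares(X, y, eta=0.5, epochs=10, seed=0):
    """Implicit SGD for the least-squares loss Q_i(w) = (y_i - x_i.w)^2 / 2.

    The implicit update w_new = w_old - eta * grad Q_i(w_new) has the
    closed-form solution
        w_new = w_old + eta / (1 + eta * ||x_i||^2) * (y_i - x_i.w_old) * x_i,
    so the effective step size shrinks automatically for large eta or
    large ||x_i||.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # one pass over shuffled data
            x_i, y_i = X[i], y[i]
            residual = y_i - x_i @ w
            w = w + (eta / (1.0 + eta * (x_i @ x_i))) * residual * x_i
    return w

# Synthetic usage: recover w_true = [1, -2, 0.5] from noisy observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
print(isgd_least_squares(X, y, eta=5.0))   # remains stable for a large eta
</syntaxhighlight>

Because the effective step size is <math>\eta/(1 + \eta\|x_i\|^2)</math>, the update stays bounded even for learning rates that would make the explicit update diverge.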
===Second-order methods===
It is known that a stochastic analogue of the standard (deterministic) Newton-Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation. A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was given by Byrd, Hansen, Nocedal, and Singer.<ref>R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer (2016), "A Stochastic Quasi-Newton Method for Large-Scale Optimization," SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008–1031.</ref> However, directly computing the required Hessian matrices may be infeasible in practice. Practical and theoretically sound second-order versions of SGD that do not require direct Hessian information have also been developed, based on simultaneous perturbations.<ref>J. C. Spall (2000), "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method," IEEE Transactions on Automatic Control, vol. 45, pp. 1839–1853. http://dx.doi.org/10.1109/TAC.2000.880982</ref><ref>J. C. Spall (2009), "Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1216–1229. http://dx.doi.org/10.1109/TAC.2009.2019793</ref><ref>S. Bhatnagar, H. L. Prasad, and L. A. Prashanth (2013), Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. Springer.</ref> (A less efficient method based on finite differences rather than simultaneous perturbations was given by Ruppert.<ref>D. Ruppert (1985), "A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure," The Annals of Statistics, vol. 13, no. 1, pp. 236–245.</ref>) These Hessian-free methods rely only on values of the summands in the empirical risk function or values of their gradients (i.e., the standard SGD inputs); in particular, second-order optimality is asymptotically achievable without directly computing the Hessian matrices of the summands.
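As a rough illustration of the simultaneous-perturbation idea, the sketch below estimates the Hessian from differences of stochastic gradients taken along random ±1 perturbation directions and takes a regularized Newton-type step. It is a simplified illustration rather than the exact algorithms of the cited papers; the function <code>grad_Q</code>, the gain sequences, and the eigenvalue clipping are assumed choices.

<syntaxhighlight lang="python">
import numpy as np

def spsa_second_order_sgd(grad_Q, n_summands, w0, iters=2000,
                          a=1.0, c=0.1, eps=0.1, seed=0):
    """Newton-type stochastic approximation in the spirit of adaptive
    simultaneous perturbation: the Hessian is estimated from differences
    of stochastic gradients and never computed directly.

    grad_Q(w, i) must return the gradient of the i-th summand Q_i at w.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    d = w.size
    H_bar = np.eye(d)                            # running Hessian estimate
    for k in range(1, iters + 1):
        i = rng.integers(n_summands)             # sample one summand
        delta = rng.choice([-1.0, 1.0], size=d)  # Rademacher perturbation
        # Per-iteration Hessian estimate from a gradient difference,
        # symmetrized as in simultaneous-perturbation methods.
        dg = grad_Q(w + c * delta, i) - grad_Q(w - c * delta, i)
        H_hat = np.outer(dg / (2.0 * c), 1.0 / delta)
        H_hat = 0.5 * (H_hat + H_hat.T)
        H_bar = (k * H_bar + H_hat) / (k + 1.0)  # average the estimates
        # Force positive definiteness by clipping eigenvalues, then take
        # a damped Newton-type step using the current stochastic gradient.
        vals, vecs = np.linalg.eigh(H_bar)
        H_pd = (vecs * np.clip(vals, eps, None)) @ vecs.T
        w = w - (a / k) * np.linalg.solve(H_pd, grad_Q(w, i))
    return w

# Usage on a toy quadratic empirical risk: Q_i(w) = (y_i - x_i.w)^2 / 2.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
grad = lambda w, i: -(y[i] - X[i] @ w) * X[i]
print(spsa_second_order_sgd(grad, len(y), np.zeros(2)))  # near [2.0, -1.0]
</syntaxhighlight>

Averaging the per-iteration estimates and clipping their eigenvalues keeps the Newton-type step well defined even though any single estimate may be indefinite.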
== Notes ==