==History==
{{See also|Perceptron#History|label 1=History of Perceptron}}
According to various sources,<ref name="dreyfus1990">[[Stuart Dreyfus]] (1990). Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure. J. Guidance, Control and Dynamics, 1990. </ref><ref name="mizutani2000">Eiji Mizutani, [[Stuart Dreyfus]], Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como Italy, July 2000. [http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf Online] </ref><ref name="schmidhuber2015">[[Jürgen Schmidhuber]] (2015). Deep learning in neural networks: An overview. Neural Networks 61 (2015): 85-117. [http://arxiv.org/abs/1404.7828 ArXiv] </ref><ref name="scholarpedia2015">[[Jürgen Schmidhuber]] (2015). Deep Learning. Scholarpedia, 10(11):32832. [http://www.scholarpedia.org/article/Deep_Learning#Backpropagation Section on Backpropagation]</ref>
the basics of continuous backpropagation were derived in the context of [[control theory]] by [[Henry J. Kelley]]<ref name="kelley1960">[[Henry J. Kelley]] (1960). Gradient theory of optimal flight paths. Ars Journal, 30(10), 947-954. [http://arc.aiaa.org/doi/abs/10.2514/8.5282?journalCode=arsj Online]</ref> in 1960 and by [[Arthur E. Bryson]] in 1961,<ref name="bryson1961">[[Arthur E. Bryson]] (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.</ref> using principles of [[dynamic programming]]. In 1962, [[Stuart Dreyfus]] published a simpler derivation based only on the [[chain rule]].<ref name="dreyfus1962">[[Stuart Dreyfus]] (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1), 30-45. [https://www.researchgate.net/publication/256244271_The_numerical_solution_of_variational_problems Online]</ref> [[Vapnik]] cites this work<ref>Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550</ref> in his book on [[Support Vector Machines]]. In 1969, [[Arthur E. Bryson]] and [[Yu-Chi Ho]] described it as a multi-stage dynamic system optimization method.<ref>{{cite book|title=Artificial Intelligence: A Modern Approach|author1=[[Stuart J. Russell|Stuart Russell]]|author2=[[Peter Norvig]]|quote=The most popular method for learning in multilayer networks is called Back-propagation.|page=578}}</ref><ref>{{cite book|title=Applied optimal control: optimization, estimation, and control|author1=Arthur Earl Bryson|author2=Yu-Chi Ho|year=1969|pages=481|publisher=Blaisdell Publishing Company or Xerox College Publishing}}</ref>
In 1970, [[Seppo Linnainmaa]] published the general method for [[automatic differentiation]] (AD) of discrete connected networks of nested [[differentiable]] functions.<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6-7.</ref><ref name="lin1976">[[Seppo Linnainmaa]] (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146-160.</ref> This corresponds to the modern version of backpropagation, which is efficient even when the networks are sparse.<ref name="grie2012">Griewank, Andreas (2012). Who Invented the Reverse Mode of Differentiation?. Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012), 389-400.</ref><ref name="grie2008">Griewank, Andreas and Walther, A. (2008). Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM.</ref><ref name="schmidhuber2015"/><ref name="scholarpedia2015"/>
In 1973, [[Stuart Dreyfus]] used backpropagation to adapt [[parameter]]s of controllers in proportion to error gradients.<ref name="dreyfus1973">[[Stuart Dreyfus]] (1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385.</ref> In 1974, [[Paul Werbos]] mentioned the possibility of applying this principle to [[artificial neural networks]],<ref name="werbos1974">[[Paul Werbos]] (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.</ref> and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.<ref name="werbos1982">[[Paul Werbos]] (1982). Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization (pp. 762-770). Springer Berlin Heidelberg. [http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf Online]</ref><ref name="scholarpedia2015"/>
In 1986, [[David E. Rumelhart]], [[Geoffrey E. Hinton]] and [[Ronald J. Williams]] showed through computer experiments that this method can generate useful internal representations of incoming data in hidden layers of neural networks.<ref name=Rumelhart1986>{{cite journal|last=Rumelhart|first=David E.|author2=Hinton, Geoffrey E.|author3=Williams, Ronald J.|title=Learning representations by back-propagating errors|journal=Nature|date=8 October 1986|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0}}</ref>
<ref name=Alpaydin2010>{{cite book|last=Alpaydın|first=Ethem|title=Introduction to machine learning|year=2010|publisher=MIT Press|___location=Cambridge, Mass.|isbn=978-0-262-01243-0|edition=2nd|page=250}}</ref> In 1993, Eric A. Wan was the first<ref name="schmidhuber2015"/> to win an international pattern recognition contest through backpropagation.<ref name="wan1993">Eric A. Wan (1993). Time series prediction by using a connectionist network with internal delay lines. In Santa Fe Institute Studies in the Sciences of Complexity Proceedings (Vol. 15, p. 195). Addison-Wesley Publishing Co.</ref>
During the 2000s backpropagation fell out of favour, but it returned in the 2010s, when modern computing power such as [[GPU]]s made it possible to train much larger networks. For example, as of 2013 top speech recognisers use backpropagation-trained neural networks.{{citation needed|date=September 2015}}
==Notes==