Backpropagation: Difference between revisions

→History: lots of missing references on the history of backpropagation
Line 161:
{{See also|Perceptron#History|label 1=History of Perceptron}}
 
According to various sources,<ref name="dreyfus1990">[[Stuart Dreyfus]] (1990). Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure. J. Guidance, Control and Dynamics, 1990.</ref><ref name="mizutani2000">Eiji Mizutani, [[Stuart Dreyfus]], Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy, July 2000. [http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf Online]</ref><ref name="schmidhuber2015">[[Jürgen Schmidhuber]] (2015). Deep learning in neural networks: An overview. Neural Networks 61 (2015): 85-117. [http://arxiv.org/abs/1404.7828 ArXiv]</ref><ref name="scholarpedia2015">[[Jürgen Schmidhuber]] (2015). Deep Learning. Scholarpedia, 10(11):32832. [http://www.scholarpedia.org/article/Deep_Learning#Backpropagation Section on Backpropagation]</ref>
the basics of continuous backpropagation were derived in the context of [[control theory]] by [[Henry J. Kelley]]<ref name="kelley1960">[[Henry J. Kelley]] (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947-954. [http://arc.aiaa.org/doi/abs/10.2514/8.5282?journalCode=arsj Online]</ref> in 1960 and by [[Arthur E. Bryson]] in 1961,<ref name="bryson1961">[[Arthur E. Bryson]] (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.</ref> using principles of [[dynamic programming]]. In 1962, [[Stuart Dreyfus]] published a simpler derivation based only on the [[chain rule]].<ref name="dreyfus1962">[[Stuart Dreyfus]] (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1), 30-45. [https://www.researchgate.net/publication/256244271_The_numerical_solution_of_variational_problems Online]</ref> [[Vapnik]] cites a 1963 paper by Bryson, Denham and Dreyfus<ref>Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550.</ref> as the first publication of the backpropagation algorithm in his book on [[Support Vector Machines]]. [[Arthur E. Bryson]] and [[Yu-Chi Ho]] described it as a multi-stage dynamic system optimization method in 1969.<ref>{{cite book|title=Artificial Intelligence A Modern Approach|author1=[[Stuart J. Russell|Stuart Russell]]|author2=[[Peter Norvig]]|quote=The most popular method for learning in multilayer networks is called Back-propagation. It was first invented in 1969 by Bryson and Ho, but was largely ignored until the mid-1980s.|page=578}}</ref><ref>{{cite book|title=Applied optimal control: optimization, estimation, and control|authors=Arthur Earl Bryson, Yu-Chi Ho|year=1969|pages=481|publisher=Blaisdell Publishing Company or Xerox College Publishing}}</ref>
 
In 1970, [[Seppo Linnainmaa]] published the general method for [[automatic differentiation]] (AD) of discrete connected networks of nested [[differentiable]] functions.<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6-7.</ref><ref name="lin1976">[[Seppo Linnainmaa]] (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146-160.</ref> This corresponds to the modern version of backpropagation, which is efficient even when the networks are sparse.<ref name="grie2012">Griewank, Andreas (2012). Who Invented the Reverse Mode of Differentiation?. Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012), 389-400.</ref><ref name="grie2008">Griewank, Andreas and Walther, Andrea (2008). Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM.</ref><ref name="schmidhuber2015"/><ref name="scholarpedia2015"/>
 
In 1973, [[Stuart Dreyfus]] used backpropagation to adapt [[parameter]]s of controllers in proportion to error gradients.<ref name="dreyfus1973">[[Stuart Dreyfus]] (1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4):383–385.</ref> In 1974, [[Paul Werbos]] mentioned the possibility of applying this principle to [[artificial neural networks]],<ref name="werbos1974">[[Paul Werbos]] (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.</ref> and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.<ref name="werbos1982">[[Paul Werbos]] (1982). Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization (pp. 762-770). Springer Berlin Heidelberg. [http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf Online]</ref><ref name="scholarpedia2015"/>
 
In 1986, [[David E. Rumelhart]], [[Geoffrey E. Hinton]] and [[Ronald J. Williams]] showed through computer experiments that this method can generate useful internal representations of incoming data in hidden layers of neural networks.<ref name=Rumelhart1986>{{cite journal|last=Rumelhart|first=David E.|author2=Hinton, Geoffrey E.|author3=Williams, Ronald J.|title=Learning representations by back-propagating errors|journal=Nature|date=8 October 1986|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0}}</ref><ref name=Alpaydin2010>{{cite book|last=Alpaydın|first=Ethem|title=Introduction to machine learning|year=2010|publisher=MIT Press|___location=Cambridge, Mass.|isbn=978-0-262-01243-0|edition=2nd|quote=...and hence the name ''backpropagation'' was coined (Rumelhart, Hinton, and Williams 1986a).|page=250}}</ref> In 1993, Eric A. Wan was the first<ref name="schmidhuber2015"/> to win an international pattern recognition contest through backpropagation.<ref name="wan1993">Eric A. Wan (1993). Time series prediction by using a connectionist network with internal delay lines. In Santa Fe Institute Studies in the Sciences of Complexity: Proceedings (Vol. 15, p. 195). Addison-Wesley Publishing Co.</ref>
Together, these results led to a “renaissance” in the field of artificial neural network research. During the 2000s backpropagation fell out of favour, but it returned in the 2010s, now able to train much larger networks using modern computing hardware such as [[GPU]]s. For example, as of 2013, leading speech recognisers use backpropagation-trained neural networks.{{citation needed|date=September 2015}}
 
==Notes==