Deep learning: Difference between revisions

{{machine learning bar}}
 
'''Deep learning''' (also known as '''deep structured learning''' or '''hierarchical learning''') is the application of [[artificial neural networks]] (ANNs) with more than one [[Multilayer perceptron#Layers|hidden layer]] to learning tasks. Deep learning is part of a broader family of [[machine learning]] methods based on [[learning representation|learning data representation]]s, as opposed to task-specific algorithms. Learning can be [[Supervised learning|supervised]], [[Semi-supervised learning|semi-supervised]] or [[Unsupervised learning|unsupervised]].<ref name="BENGIO2012" /><ref name="SCHIDHUB" /><ref name="NatureBengio">{{cite journal |last1=Bengio |first1=Yoshua |last2=LeCun |first2= Yann| last3=Hinton | first3= Geoffrey|year=2015 |title=Deep Learning |journal=Nature |volume=521 |pages=436–444 |doi=10.1038/nature14539}}</ref><ref name="scholarpedia"/>
 
Some representations are loosely based on an interpretation of information processing and communication patterns in a biological [[nervous system]], such as [[neural coding]], which attempts to define a relationship between various stimuli and associated neuronal responses in the [[brain]].<ref>{{cite journal|year=1996|title=Emergence of simple-cell receptive field properties by learning a sparse code for natural images|journal=Nature|volume=381|issue=6583|pages=607–609|doi=10.1038/381607a0|pmid=8637596|last1=Olshausen|first1=B. A.|bibcode=1996Natur.381..607O}}</ref> Research attempts to create efficient systems to learn these representations from large-scale, unlabeled data sets.
 
Deep learning architectures such as [[#Deep_neural_networks|deep neural network]]s, [[deep belief network]]s and [[recurrent neural networks]] have been applied to fields including [[computer vision]], [[automatic speech recognition|speech recognition]], [[natural language processing]], audio recognition, social network filtering, [[machine translation]] and [[bioinformatics]] where they produced results comparable to and in some cases superior<ref name=":9" /> to human experts.<ref name="krizhevsky2012" />{{toclimit|3}}
 
== Definitions ==
These definitions have in common (1) multiple layers of nonlinear processing units and (2) the supervised or unsupervised learning of feature representations in each layer, with the layers forming a hierarchy from low-level to high-level features.<ref name="BOOK2014" />{{rp|page=200}} The composition of a layer of nonlinear processing units used in a deep learning algorithm depends on the problem to be solved. Layers that have been used in deep learning include hidden layers of an [[artificial neural network]] and sets of complicated [[propositional formula]]s.<ref name="BENGIODEEP">{{cite journal|last=Bengio|first=Yoshua|year=2009|title=Learning Deep Architectures for AI|url=http://sanghv.com/download/soft/machine%20learning,%20artificial%20intelligence,%20mathematics%20ebooks/ML/learning%20deep%20architectures%20for%20AI%20%282009%29.pdf|journal=Foundations and Trends in Machine Learning|volume=2|issue=1|pages=1–127|doi=10.1561/2200000006}}</ref> They may also include latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines.
 
Deep learning algorithms transform their inputs through more layers than shallow learning algorithms. At each layer, the signal is transformed by a processing unit, like an artificial neuron, whose parameters are iteratively adjusted through training.<ref name="SCHIDHUB" />
 
* Credit assignment path (CAP) – A chain of transformations from input to output. CAPs describe potentially causal connections between input and output.
* CAP depth – for a feedforward neural network, the depth of the CAPs (thus of the network) is the number of hidden layers plus one (as the output layer is also parameterized), but for [[recurrent neural network]]s, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
* Deep/shallow – No universally agreed upon threshold of depth divides shallow learning from deep learning, but most researchers in the field agree that deep learning has multiple nonlinear layers (CAP > 2).
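The depth definitions above can be expressed as a tiny helper; this is an illustrative sketch (the function names are assumptions, not from any cited work), encoding the feedforward case and the common CAP > 2 convention:

```python
def cap_depth(num_hidden_layers: int) -> int:
    """Credit assignment path depth of a feedforward network:
    the number of hidden layers plus one, since the output
    layer is also parameterized."""
    return num_hidden_layers + 1

def is_deep(num_hidden_layers: int) -> bool:
    """Common (but not universally agreed) convention:
    'deep' means CAP depth greater than 2."""
    return cap_depth(num_hidden_layers) > 2

# One hidden layer gives CAP depth 2 (shallow); two or more is deep.
```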
Deep learning adds the assumption that these layers of factors correspond to levels of abstraction or composition. Varying numbers of layers and layer sizes can provide different amounts of abstraction.<ref name="BENGIO2012">{{cite journal|last2=Courville|first2=A.|last3=Vincent|first3=P.|year=2013|title=Representation Learning: A Review and New Perspectives|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=35|issue=8|pages=1798–1828|arxiv=1206.5538|doi=10.1109/tpami.2013.50|last1=Bengio|first1=Y.}}</ref>
 
Deep learning exploits this idea of hierarchical explanatory factors where higher level, more abstract concepts are learned from the lower level ones.<ref name="ivak1965" /><ref name="ivak1971" />
 
Deep learning architectures are often constructed with a [[greedy algorithm|greedy]] layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features are useful for improving performance.<ref name="BENGIO2012" />
 
For [[supervised learning]] tasks, deep learning methods obviate [[feature engineering]] by translating the data into compact intermediate representations akin to [[Principal Component Analysis|principal components]], and derive layered structures that remove redundancy in representation.<ref name="SCHMID1992" /><ref name="BOOK2014" />
 
Deep learning algorithms can be applied to [[unsupervised learning]] tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors<ref name="SCHMID1992" /><ref name="scholarpedia"/> and [[deep belief network]]s.<ref name="BENGIO2012" /><ref name="SCHOLARDBNS">{{cite journal | last1 = Hinton | first1 = G.E. | year = 2009| title = Deep belief networks | url= | journal = Scholarpedia | volume = 4 | issue = 5| page = 5947 | doi=10.4249/scholarpedia.5947}}</ref>
 
== Interpretations ==
The [[probabilistic]] interpretation<ref name="MURPHY" /> derives from the field of [[machine learning]]. It features inference,<ref name="BOOK2014" /><ref name="BENGIODEEP" /><ref name="BENGIO2012" /><ref name="SCHIDHUB" /><ref name="SCHOLARDBNS" /><ref name="MURPHY" /> as well as the [[optimization]] concepts of [[training]] and [[test (assessment)|testing]], related to fitting and [[generalization]], respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a [[cumulative distribution function]].<ref name="MURPHY" /> The probabilistic interpretation led to the introduction of [[dropout (neural networks)|dropout]] as a [[Regularization (mathematics)|regularizer]] in neural networks.<ref name="DROPOUT">{{cite arXiv |last1=Hinton |first1=G. E. |last2=Srivastava| first2 =N.|last3=Krizhevsky| first3=A.| last4 =Sutskever| first4=I.| last5=Salakhutdinov| first5=R.R.|eprint=1207.0580 |class=math.LG |title=Improving neural networks by preventing co-adaptation of feature detectors |date=2012}}</ref>
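As a rough illustration of dropout acting as a regularizer (a minimal sketch, not the exact formulation of the cited Hinton et al. paper), each unit's activation is zeroed with probability p during training, and the survivors are rescaled so that expected activations match test-time behavior ("inverted dropout"):

```python
import numpy as np

def dropout(activations, p=0.5, rng=None, training=True):
    """Inverted dropout: during training, zero each unit with
    probability p and scale survivors by 1/(1-p), so the expected
    activation is unchanged; at test time, return the input as-is."""
    if not training or p == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p  # keep with probability 1-p
    return activations * mask / (1.0 - p)
```

Randomly removing units during training prevents co-adaptation of features, which is the regularizing effect the cited paper describes.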
 
The probabilistic interpretation was introduced in the early days of neural networks by researchers including [[John Hopfield]], [[Bernard Widrow]], [[Kumpati S. Narendra]], and popularized in surveys such as the one by [[Christopher Bishop]].<ref name=prml>{{cite book|title=Pattern Recognition and Machine Learning|author=Bishop, Christopher M.|year=2006|publisher=Springer|url=http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf|isbn=978-0-387-31073-2}}</ref>
 
==History==
According to a survey,<ref name="scholarpedia">[[Jürgen Schmidhuber]] (2015). Deep Learning. Scholarpedia, 10(11):32832. [http://www.scholarpedia.org/article/Deep_Learning Online]</ref> the expression ''deep learning'' was introduced to the [[machine learning]] community by [[Rina Dechter]] in 1986,<ref name="dechter1986">[[Rina Dechter]] (1986). Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory. [https://www.researchgate.net/publication/221605378_Learning_While_Searching_in_Constraint-Satisfaction-Problems Online]</ref> and to [[artificial neural networks]] by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.<ref name="aizenberg2000">Igor Aizenberg, Naum N. Aizenberg, Joos P.L. Vandewalle (2000). Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media.</ref>

In 2005, Faustino Gomez and [[Jürgen Schmidhuber]] published a paper on ''learning deep'' [[Partially observable Markov decision process|POMDPs]]<ref name="LearningDeepPOMDPs">F. Gomez and J. Schmidhuber. Co-evolving recurrent neurons learn deep memory POMDPs. Proc. GECCO, Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005.</ref> through neural networks for [[reinforcement learning]]. In 2006, a publication by [[Geoff Hinton|Hinton]], Osindero and Teh<ref name=hinton06>{{Cite journal | last1 = Hinton | first1 = G. E. |authorlink1=Geoff Hinton| last2 = Osindero | first2 = S. | last3 = Teh | first3 = Y. W. | doi = 10.1162/neco.2006.18.7.1527 | title = A Fast Learning Algorithm for Deep Belief Nets | journal = [[Neural Computation]]| volume = 18 | issue = 7 | pages = 1527–1554 | year = 2006 | pmid = 16764513 | url = http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf}}</ref><ref name=bengio2012>{{cite arXiv |last=Bengio |first=Yoshua |author-link=Yoshua Bengio |eprint=1206.5533 |title=Practical recommendations for gradient-based training of deep architectures |class=cs.LG|year=2012 }}</ref> drew attention by showing how a many-layered [[feedforward neural network]] could be effectively pre-trained one layer at a time, treating each layer in turn as an [[unsupervised learning|unsupervised]] [[restricted Boltzmann machine]], then fine-tuning it using [[supervised learning|supervised]] [[backpropagation]].<ref name="HINTON2007">G. E. Hinton., "Learning multiple layers of representation," ''Trends in Cognitive Sciences'', 11, pp. 428–434, 2007.</ref> The original paper referred to ''learning'' for ''deep belief nets'', which was subsequently abbreviated to "deep learning" and popularized as such from circa 2010 onwards. A Google Ngram chart shows that usage of the term has taken off since 2000.<ref name="DLchart">Google Ngram chart of the usage of the expression "deep learning" posted by Jürgen Schmidhuber (2015) [https://plus.google.com/100849856540000067209/posts/7N6z251w2Wd?pid=6127540521703625346&oid=100849856540000067209 Online]</ref> The underlying concepts and many techniques, however, date to earlier decades.
 
The first general, working learning algorithm for supervised, deep, feedforward, multilayer [[perceptron]]s was published by [[Alexey Grigorevich Ivakhnenko|Ivakhnenko]] and Lapa in 1965.<ref name="ivak1965">{{cite book|first=A. G. |last=Ivakhnenko|title=Cybernetic Predicting Devices|url={{google books |plainurl=y |id=FhwVNQAACAAJ}}|year=1973|publisher=CCM Information Corporation}}</ref> A 1971 paper described a deep network with eight layers trained by the [[group method of data handling]] algorithm.<ref name="ivak1971">{{Cite journal|last=Ivakhnenko|first=Alexey|date=1971|title=Polynomial theory of complex systems|journal=IEEE Transactions on Systems, Man and Cybernetics (4)|pages=364–378|doi=10.1109/TSMC.1971.4308320}}</ref>
 
These ideas were implemented in a computer identification system called "Alpha", which demonstrated the learning process.{{Citation needed|date=August 2017}}
Other deep learning working architectures, specifically those built from [[artificial neural networks]] (ANN), began with the [[Neocognitron]] introduced by [[Kunihiko Fukushima|Fukushima]] in 1980.<ref name="FUKU1980">{{cite journal | last1 = Fukushima | first1 = K. | year = 1980 | title = Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position | url= | journal = Biol. Cybern. | volume = 36 | issue = | pages = 193–202 | doi=10.1007/bf00344251 | pmid=7370364}}</ref> ANNs date back even further. The challenge was how to train networks with multiple layers. In 1989, [[Yann LeCun|LeCun]] et al. applied the standard [[backpropagation]] algorithm, which had been around as the reverse mode of [[automatic differentiation]] since 1970,<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. 
Helsinki, 6-7.</ref><ref name="grie2012">{{Cite journal|last=Griewank|first=Andreas|date=2012|title=Who Invented the Reverse Mode of Differentiation?|url=http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf|journal=Documenta Matematica, Extra Volume ISMP|pages=389–400}}</ref><ref name="WERBOS1974">{{Cite journal|last=Werbos|first=P.|date=1974|title=Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences |url=https://www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences |journal=Harvard University |deadurl=no |accessdate=12 June 2017}}</ref><ref name="werbos1982">{{Cite book|url=ftp://ftp.idsia.ch/pub/juergen/habilitation.pdf|title=System modeling and optimization|last=Werbos|first=Paul|publisher=Springer|year=1982|pages=762–770|chapter=Applications of advances in nonlinear sensitivity analysis}}</ref> to a deep neural network with the purpose of recognizing handwritten [[ZIP code]]s on mail. While the algorithm worked, the training time was an impractical 3 days.<ref name="LECUN1989">LeCun ''et al.'', "Backpropagation Applied to Handwritten Zip Code Recognition," ''Neural Computation'', 1, pp. 541–551, 1989.</ref>
 
 
By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model. Weng ''et al.'' suggested that a human brain does not use a monolithic 3-D object model and in 1992 they published Cresceptron,<ref name="Weng1992">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronIJCNN1992.pdf Cresceptron: a self-organizing neural network which grows adaptively]," ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol I, pp. 576-581, June, 1992.</ref><ref name="Weng1993">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronICCV1993.pdf Learning recognition and segmentation of 3-D objects from 2-D images]," ''Proc. 4th International Conf. Computer Vision'', Berlin, Germany, pp. 121-128, May, 1993.</ref><ref name="Weng1997">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronIJCV.pdf Learning recognition and segmentation using the Cresceptron]," ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105-139, Nov. 1997.</ref> a method for performing 3-D object recognition directly from cluttered scenes. Cresceptron is a cascade of layers similar to [[Neocognitron]]. But while Neocognitron required a human programmer to hand-merge features, Cresceptron automatically learned an open number of unsupervised features in each layer, where each feature is represented by a [[Convolution|convolution kernel]]. Cresceptron segmented each learned object from a cluttered scene through back-analysis through the network. [[Max pooling]], now often adopted by deep neural networks (e.g. [[ImageNet]] tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization.
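The 2×2 max pooling operation described above reduces each non-overlapping 2×2 block of a feature map to its maximum value, halving the spatial resolution. A minimal NumPy sketch (an illustration assuming even dimensions, not Cresceptron's actual implementation):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by taking the maximum over each
    non-overlapping 2x2 block, reducing position resolution (2x2)->1."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even dimensions"
    # blocks[i, a, j, b] == feature_map[2*i + a, 2*j + b]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [2, 0, 7, 8]])
# max_pool_2x2(x) -> [[4, 1], [2, 8]]
```

Discarding exact positions within each block is what gives the better generalization to small shifts that the text mentions.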
 
In 1992, Schmidhuber used unsupervised pre-training for deep hierarchies of data-compressing [[recurrent neural network]]s, and showed its benefits for speeding up supervised learning.<ref name="SCHMID1992">J. Schmidhuber., "Learning complex, extended sequences using the principle of history compression," ''Neural Computation'', 4, pp. 234–242, 1992.</ref><ref name="scholarpedia"/>
 
In 1994, André C. P. L. F. de Carvalho, together with Fairhurst and Bisset, published experimental results of a multi-layer [[Boolean algebra|boolean]] neural network, also known as a weightless neural network, composed of a self-organising feature extraction neural network module followed by a classification neural network module, which were independently trained.<ref>{{Cite journal |title=An integrated Boolean neural network for pattern classification |journal=Pattern Recognition Letters |date=1994-08-08 |pages=807–813 |volume=15 |issue=8 |doi=10.1016/0167-8655(94)90009-4 |first=Andre C. L. F. |last1=de Carvalho |first2 = Mike C. |last2=Fairhurst |first3=David |last3 = Bisset}}</ref>
 
In 1995, [[Brendan Frey|Frey]] demonstrated that it was possible to train a network containing six fully connected layers and several hundred hidden units using the [[wake-sleep algorithm]], co-developed with [[Peter Dayan|Dayan]] and [[Geoffrey Hinton|Hinton]].<ref>{{Cite journal|title = The wake-sleep algorithm for unsupervised neural networks |journal = Science|date = 1995-05-26|pages = 1158–1161|volume = 268|issue = 5214|doi = 10.1126/science.7761831|first = Geoffrey E.|last = Hinton|first2 = Peter|last2 = Dayan|first3 = Brendan J.|last3 = Frey|first4 = Radford|last4 = Neal}}</ref> However, training took two days. Many factors contributed to the slow speed, including the [[vanishing gradient problem]] analyzed in 1991 by [[Sepp Hochreiter|Hochreiter]].<ref name="HOCH1991">S. Hochreiter., "[http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen]," ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991.</ref><ref name="HOCH2001">{{cite book|url={{google books |plainurl=y |id=NWOcMVA64aAC}}|title=A Field Guide to Dynamical Recurrent Networks|last=Hochreiter|first=S.|last2=et al.|date=15 January 2001|publisher=John Wiley & Sons|isbn=978-0-7803-5369-5|chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies|editor-last2=Kremer|editor-first2=Stefan C.|editor-first1=John F.|editor-last1=Kolen}}</ref>
 
 
Simpler models that use task-specific handcrafted features such as [[Gabor filter]]s and [[support vector machine]]s (SVMs) were a popular choice in the 1990s and 2000s, because of ANNs' computational cost and a lack of understanding of how the brain wires its biological networks.