Recurrent neural network: Difference between revisions

many additional sources on recursive neural networks, automatic differentiation, history compression, LSTM
 
In [[reinforcement learning]] settings, no teacher provides target signals for the RNN; instead, a [[fitness function]] or [[reward function]] is occasionally used to evaluate the RNN's performance, which influences its input stream through output units connected to actuators that affect the environment. Again, compare the section on training algorithms below.
 
===Recursive neural networks===
 
A [[recursive neural network]]<ref>{{cite journal|doi=10.1109/ICNN.1996.548916|title=Learning task-dependent distributed representations by backpropagation through structure|last1=Goller|first1=C.|last2=Küchler|first2=A.|journal=Proceedings of International Conference on Neural Networks (ICNN'96), IEEE, 1996}}</ref> is created by applying the same set of weights [[recursion|recursively]] over a differentiable graph-like structure, by traversing the structure in [[topological sort|topological order]]. Such networks are typically also trained by the reverse mode of [[automatic differentiation]].<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6-7.</ref><ref name="grie2008">Griewank, Andreas and Walther, A.. Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM, 2008.</ref>
They were introduced to learn [[distributed representation|distributed representations]] of structure, such as [[mathematical logic|logical terms]].
A special case of recursive neural networks is the RNN itself, whose structure corresponds to a linear chain. Recursive neural networks have been applied to [[natural language processing]].<ref>{{cite journal|last1=Socher|first1=Richard|last2=Lin|first2=Cliff|last3=Ng|first3=Andrew Y.|last4=Manning|first4=Christopher D.|title=Parsing Natural Scenes and Natural Language with Recursive Neural Networks|journal=The 28th International Conference on Machine Learning (ICML 2011)}}</ref> The Recursive Neural Tensor Network uses a tensor-based composition function for all nodes in the tree.<ref>{{cite journal|last1=Socher|first1=Richard|last2=Perelygin|first2=Alex|last3=Y. Wu|first3=Jean|last4=Chuang|first4=Jason|last5=D. Manning|first5=Christopher|last6=Y. Ng|first6=Andrew|last7=Potts|first7=Christopher|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|journal=EMNLP 2013|url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}</ref>
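The core operation, one shared weight matrix applied at every internal node of a tree with children processed first (a topological order), can be sketched as follows. This is a minimal NumPy illustration; the dimensionality, weight scale, and toy vocabulary are assumptions for the example, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # dimensionality of node representations
W = rng.normal(0, 0.1, (d, 2 * d))   # one weight matrix shared by all nodes
b = np.zeros(d)

def compose(left, right):
    """Apply the shared weights to a pair of child representations."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def encode(tree, leaf_vecs):
    """Recursively encode a binary tree given as nested tuples of leaf ids.

    Children are encoded before their parent, i.e. the graph is traversed
    in topological order.
    """
    if isinstance(tree, tuple):
        return compose(encode(tree[0], leaf_vecs), encode(tree[1], leaf_vecs))
    return leaf_vecs[tree]

# Hypothetical leaf embeddings for a toy parse tree.
leaf_vecs = {w: rng.normal(0, 0.1, d) for w in ["the", "cat", "sat"]}
vec = encode((("the", "cat"), "sat"), leaf_vecs)
```

A linear-chain tree reduces this to an ordinary RNN step applied left to right, which is the "special case" mentioned above.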
 
===Hopfield network===
The [[echo state network]] (ESN) is a recurrent neural network with a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that is changed and trained. ESNs are good at reproducing certain time series.<ref>H. Jaeger. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80, 2004.</ref> A variant for [[Action potential|spiking neuron]]s is known as the [[liquid state machine]].<ref>W. Maass, T. Natschläger, and H. Markram. A fresh look at real-time computation in generic recurrent neural circuits. Technical report, Institute for Theoretical Computer Science, TU Graz, 2002.</ref>
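A minimal ESN sketch follows: the sparse random reservoir is fixed, and only the linear readout `W_out` is fitted, here by ordinary least squares. All sizes, the 5% connectivity, and the spectral-radius scaling are illustrative assumptions rather than a particular published recipe.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_res = 1, 200

# Fixed random reservoir, ~5% connectivity; these weights are never trained.
W_res = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))  # spectral radius < 1
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))

def run_reservoir(inputs):
    """Drive the reservoir with an input sequence and collect its states."""
    x, states = np.zeros(n_res), []
    for u_t in inputs:
        x = np.tanh(W_res @ x + W_in @ u_t)
        states.append(x)
    return np.array(states)

# Train only the linear readout to predict the next sample of a sine wave.
u = np.sin(np.arange(400) * 0.1).reshape(-1, 1)
X, Y = run_reservoir(u[:-1]), u[1:]
washout = 50  # discard the initial transient before fitting the readout
W_out = np.linalg.lstsq(X[washout:], Y[washout:], rcond=None)[0]
pred = X @ W_out
```

Keeping the spectral radius below 1 is the usual way to encourage the echo state property, i.e. that the reservoir state asymptotically depends only on the input history, not on its initial condition.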
 
===Neural history compressor===
 
The [[vanishing gradient problem]]<ref name="hochreiter1991">[[Sepp Hochreiter]] (1991), [http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen], Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber.</ref>
of [[automatic differentiation]] or [[backpropagation]] in neural networks
was partially overcome in 1992 by an early generative model called the neural history compressor, implemented as an unsupervised stack of recurrent neural networks (RNNs).<ref name="schmidhuber1992">[[Jürgen Schmidhuber]]. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242. [ftp://ftp.idsia.ch/pub/juergen/chunker.pdf Online]</ref> The RNN at the input level learns to predict its next input from the previous input history. Only unpredictable inputs of some RNN in the hierarchy become inputs to the next higher level RNN which therefore recomputes its internal state only rarely. Each higher level RNN thus learns a compressed representation of the information in the RNN below. This is done such that the input sequence can be precisely reconstructed from the sequence representation at the highest level. The system effectively minimises the description length or the negative [[logarithm]] of the probability of the data.<ref name="scholarpedia2015pre">[[Jürgen Schmidhuber]] (2015). Deep Learning. Scholarpedia, 10(11):32832. [http://www.scholarpedia.org/article/Deep_Learning#Fundamental_Deep_Learning_Problem_and_Unsupervised_Pre-Training_of_RNNs_and_FNNs Section on Unsupervised Pre-Training of RNNs and FNNs]</ref>
If there is a lot of learnable predictability in the incoming data sequence, then the highest level RNN can use supervised learning to easily classify even deep sequences with very long time intervals between important events. In 1993, such a system already solved a "Very Deep Learning" task that requires more than 1000 subsequent layers in an RNN unfolded in time.<ref name="schmidhuber1993">[[Jürgen Schmidhuber]] (1993). Habilitation thesis, TUM, 1993. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN. [ftp://ftp.idsia.ch/pub/juergen/habilitation.pdf Online]</ref>
 
It is also possible to distill the entire RNN hierarchy into only two RNNs called the "conscious" chunker (higher level) and the "subconscious" automatizer (lower level).<ref name="schmidhuber1992"/> Once the chunker has learned to predict and compress inputs that are still unpredictable by the automatizer, then the automatizer can be forced in the next learning phase to predict or imitate through special additional units the hidden units of the more slowly changing chunker. This makes it easy for the automatizer to learn appropriate, rarely changing memories across very long time intervals. This in turn helps the automatizer to make many of its once unpredictable inputs predictable, such that the chunker can focus on the remaining still unpredictable events, to compress the data even further.<ref name="schmidhuber1992"/>
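The compression principle, forwarding only inputs that the lower level failed to predict, can be illustrated with a deliberately simplified toy in which a trivial fixed predictor stands in for the lower-level RNN. This is an illustration of the idea only, not Schmidhuber's trained system.

```python
def compress(seq, predict):
    """Forward (position, symbol) pairs only where the predictor was wrong.

    Together with the predictor and the sequence length, these "surprises"
    suffice to reconstruct the original sequence exactly.
    """
    surprises, prev = [], None
    for i, s in enumerate(seq):
        if predict(prev) != s:
            surprises.append((i, s))
        prev = s
    return surprises

# Assumed stand-in predictor: guess that the previous symbol repeats.
seq = "aaaabaaaacaaaa"
higher_level_input = compress(seq, lambda prev: prev)
print(higher_level_input)  # [(0, 'a'), (4, 'b'), (5, 'a'), (9, 'c'), (10, 'a')]
```

The higher level receives 5 events instead of 14 symbols, so it updates its state only rarely, exactly the behaviour the hierarchy exploits.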
 
===Long short term memory network===
 
Numerous researchers now use a [[deep learning]] RNN called
the [[Long short term memory]] (LSTM) network, published by [[Sepp Hochreiter|Hochreiter]] & [[Jürgen Schmidhuber|Schmidhuber]] in 1997.<ref name=lstm>[[Sepp Hochreiter|Hochreiter, Sepp]]; and [[Jürgen Schmidhuber|Schmidhuber, Jürgen]]; ''Long Short-Term Memory'', Neural Computation, 9(8):1735–1780, 1997</ref> Unlike traditional RNNs, it does not suffer from the [[vanishing gradient problem]] (compare the section on training algorithms below).
LSTM is normally augmented by recurrent gates called forget gates.<ref name="gers2002">Felix Gers, Nicholas Schraudolph, and [[Jürgen Schmidhuber]] (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3:115–143.</ref> LSTM RNNs prevent backpropagated errors from vanishing or exploding.<ref name="hochreiter1991"/> Instead, errors can flow backwards through unlimited numbers of virtual layers in an LSTM RNN unfolded in time. That is, LSTM can learn "Very Deep Learning" tasks<ref name="schmidhuber2015">[[Jürgen Schmidhuber]] (2015). Deep learning in neural networks: An overview. Neural Networks 61 (2015): 85-117. [http://arxiv.org/abs/1404.7828 ArXiv]</ref> that require memories of events that happened thousands or even millions of discrete time steps earlier. Problem-specific LSTM-like topologies can be evolved.<ref name="bayer2009">Justin Bayer, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber (2009). Evolving memory cell structures for sequence learning. Proceedings of ICANN (2), pp. 755–764.</ref>
LSTM works even when there are long delays between significant events, and it can handle signals that mix low- and high-frequency components.
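The gating mechanism described above can be sketched as a single forward step of a textbook-style LSTM cell. Weight shapes, sizes, and the forget-gate bias initialization are illustrative assumptions, not the exact 1997 formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One step of a vanilla LSTM cell with input, forget, and output gates."""
    d = h.size
    z = W @ np.concatenate([x, h]) + b  # all four pre-activations at once
    i = sigmoid(z[:d])          # input gate: how much new content to write
    f = sigmoid(z[d:2*d])       # forget gate: how much old cell state to keep
    o = sigmoid(z[2*d:3*d])     # output gate: how much of the cell to expose
    g = np.tanh(z[3*d:])        # candidate cell update
    c = f * c + i * g           # additive update: errors flow through this sum
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W = rng.normal(0, 0.1, (4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
b[n_hid:2*n_hid] = 1.0          # common trick: bias the forget gate open

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):   # run over a short random sequence
    h, c = lstm_step(x, h, c, W, b)
```

The additive cell update `c = f * c + i * g` is the key: with the forget gate near 1, the backpropagated error through `c` is multiplied by values near 1 rather than by a squashing function's derivative, which is why the gradient neither vanishes nor explodes.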
 
Today, many applications use stacks of LSTM RNNs<ref name="fernandez2007">Santiago Fernandez, Alex Graves, and [[Jürgen Schmidhuber]] (2007). Sequence labelling in structured domains with hierarchical recurrent neural networks. Proceedings of IJCAI.</ref> and train them by Connectionist Temporal Classification (CTC)<ref name="graves2006">Alex Graves, Santiago Fernandez, Faustino Gomez, and [[Jürgen Schmidhuber]] (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. Proceedings of ICML’06, pp. 369–376.</ref> to find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences. CTC achieves both alignment and recognition. Around 2007, LSTM started to revolutionise [[speech recognition]], outperforming traditional models in certain speech applications.<ref name="fernandez2007keyword">Santiago Fernandez, Alex Graves, and Jürgen Schmidhuber (2007). An application of recurrent neural networks to discriminative keyword spotting. Proceedings of ICANN (2), pp. 220–229.</ref> In 2009, CTC-trained LSTM was the first RNN to win pattern recognition contests, when it won several competitions in connected [[handwriting recognition]].<ref>Graves, Alex; and Schmidhuber, Jürgen; ''Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks'', in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), ''Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC'', Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552</ref><ref name="schmidhuber2015"/> In 2014, the Chinese search giant Baidu used CTC-trained RNNs to outperform previous results on the Switchboard Hub5'00 speech recognition benchmark, without using any traditional speech processing methods.<ref name="hannun2014">Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, [[Andrew Ng]] (2014). Deep Speech: Scaling up end-to-end speech recognition. [http://arxiv.org/abs/1412.5567 arXiv:1412.5567]</ref>
LSTM also improved large-vocabulary speech recognition,<ref name="sak2014">Hasim Sak and Andrew Senior and Francoise Beaufays (2014). Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling. Proceedings of Interspeech 2014.</ref><ref name="liwu2015">Xiangang Li, Xihong Wu (2015). Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition [http://arxiv.org/abs/1410.4281 arXiv:1410.4281]</ref> text-to-speech synthesis, also for Google Android,<ref name="zen2015">Heiga Zen and Hasim Sak (2015). Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis. In Proceedings of ICASSP, pp. 4470-4474.</ref><ref name="schmidhuber2015"/> and photo-real talking heads.<ref name="fan2015">Bo Fan, Lijuan Wang, Frank K. Soong, and Lei Xie (2015). Photo-Real Talking Head with Deep Bidirectional LSTM. In Proceedings of ICASSP 2015.</ref> In 2015, Google's speech recognition reportedly achieved a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through [[Google Voice]] to all smartphone users.<ref name="sak2015">Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk (September 2015): [http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html Google voice search: faster and more accurate.]</ref>
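CTC's key ingredient is its many-to-one collapse map, usually written B, which merges repeated symbols and then deletes a special blank symbol, so that many framewise alignments share one label sequence. A minimal sketch (the choice of blank symbol here is arbitrary):

```python
def ctc_collapse(path, blank="-"):
    """CTC's many-to-one map B: merge adjacent repeats, then delete blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

# Several framewise alignments map to the same label sequence:
print(ctc_collapse("--hh-e-ll-ll-oo-"))  # "hello"
print(ctc_collapse("h-eell--llo"))       # "hello"
```

Training maximizes the total probability of all alignments that collapse to the correct label sequence, which is what lets CTC learn alignment and recognition jointly from unsegmented data.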
 
LSTM has also become very popular in the field of [[natural language processing]].
Unlike previous models based on HMMs and similar concepts, LSTM can learn to recognise [[context-sensitive languages]].<ref name="gers2001">Felix A. Gers and [[Jürgen Schmidhuber]]. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE TNN 12(6):1333–1340, 2001.</ref> LSTM improved machine translation,<ref name="sutskever2014">Ilya Sutskever, Oriol Vinyals, and Quoc V. Le (2014). Sequence to Sequence Learning with Neural Networks. [http://arxiv.org/abs/1409.3215 arXiv]</ref> language modeling<ref name="vinyals2016">Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu (2016). Exploring the Limits of Language Modeling. [http://arxiv.org/abs/1602.02410 arXiv]</ref> and multilingual language processing.<ref name="gillick2015">Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya (2015). Multilingual Language Processing From Bytes. [http://arxiv.org/abs/1512.00103 arXiv]</ref> LSTM combined with [[convolutional neural network]]s (CNNs) also improved automatic image captioning<ref name="vinyals2015">Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan (2015). Show and Tell: A Neural Image Caption Generator. [http://arxiv.org/abs/1411.4555 arXiv]</ref> and many other applications.
 
===Bi-directional RNN===
===Hierarchical RNN===
 
There are many instances of hierarchical RNN whose elements are connected in various ways to decompose hierarchical behavior into useful subprograms.<ref name="schmidhuber1992"/><ref>R.W. Paine, J. Tani, "How hierarchical control self-organizes in artificial adaptive systems," Adaptive Behavior, 13(3), 211-225, 2005.</ref>
 
===Recurrent multilayer perceptron===
 
There is also an online hybrid between BPTT and RTRL with intermediate complexity,<ref>[[Jürgen Schmidhuber|J. Schmidhuber]]. A fixed size storage O(n<sup>3</sup>) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.</ref><ref>R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.</ref> and there are variants for continuous time.<ref>B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.</ref>
A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events.<ref name="hochreiter1991"/><ref>[[Sepp Hochreiter|S. Hochreiter]], Y. Bengio, P. Frasconi, and [[Jürgen Schmidhuber|J. Schmidhuber]]. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.</ref> The [[Long short term memory]] architecture, together with a BPTT/RTRL hybrid learning method, was introduced in an attempt to overcome these problems.<ref name="lstm" />
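The exponential decay (and its mirror image, explosion) is easy to demonstrate numerically: backpropagating through T steps of a linear recurrence multiplies the error by T copies of the recurrent Jacobian. A sketch with assumed sizes and spectral radii:

```python
import numpy as np

# With spectral radius < 1 the product of Jacobians shrinks exponentially
# in the time lag; with spectral radius > 1 it explodes.
rng = np.random.default_rng(1)
n = 10
W = rng.normal(size=(n, n))
rho = max(abs(np.linalg.eigvals(W)))
W_small = 0.8 * W / rho   # spectral radius 0.8
W_large = 1.2 * W / rho   # spectral radius 1.2

grad_small = grad_large = np.eye(n)
for _ in range(50):       # a 50-step time lag between cause and error signal
    grad_small = grad_small @ W_small
    grad_large = grad_large @ W_large
print(np.linalg.norm(grad_small))  # vanishingly small
print(np.linalg.norm(grad_large))  # huge
```

For nonlinear RNNs the squashing functions' derivatives enter the same product, which is why the gradient norm either collapses or blows up over long lags unless the architecture, like LSTM, keeps an effective Jacobian near the identity.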
 
===Global optimization methods===