Time delay neural network

[[File:TDNN Diagram.png|thumb|right|TDNN diagram]]
 
'''Time delay neural network''' ('''TDNN''')<ref name="phoneme detection">[[Alex Waibel|Alexander Waibel]], Toshiyuki Hanazawa, [[Geoffrey Hinton]], Kiyohito Shikano, Kevin J. Lang, ''[http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf Phoneme Recognition Using Time-Delay Neural Networks]'', IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.</ref> is a multilayer [[artificial neural network]] architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.
 
Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern (such as speech), the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.
== History ==
The TDNN was first proposed to classify [[phonemes]] in speech signals for automatic [[speech recognition]], where the automatic determination of precise segments or feature boundaries is difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features independently of their position in time, it improved performance over static classification.<ref name="phoneme detection" /><ref name=":0">Alexander Waibel, ''[http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf Phoneme Recognition Using Time-Delay Neural Networks]'', SP87-100, Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE), December 1987, Tokyo, Japan.</ref> It was also applied to two-dimensional signals (time-frequency patterns in speech<ref name=":1">John B. Hampshire and Alexander Waibel, ''[http://papers.nips.cc/paper/213-connectionist-architectures-for-multi-speaker-phoneme-recognition.pdf Connectionist Architectures for Multi-Speaker Phoneme Recognition]'', Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.</ref> and coordinate-space patterns in OCR<ref name=":2">Stefan Jaeger, Stefan Manke, Juergen Reichert, Alexander Waibel, ''[https://www.researchgate.net/profile/Stefan_Jaeger/publication/220163530_Online_handwriting_recognition_the_NPen_recognizer_Int_J_Doc_Anal_Recognit_3169-180/links/0c96051af3e6133ed0000000.pdf Online handwriting recognition: the NPen++ recognizer]'', International Journal on Document Analysis and Recognition Vol. 3, Issue 3, March 2001.</ref>).
 
==== Max pooling ====
In 1990, Yamaguchi et al. introduced the concept of max pooling, combining it with TDNNs to realize a speaker-independent isolated-word recognition system.<ref name=Yamaguchi111990>{{cite conference |title=A Neural Network for Speaker-Independent Isolated Word Recognition |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |location=Kobe, Japan |conference=First International Conference on Spoken Language Processing (ICSLP 90)|url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html}}</ref>
 
== Overview ==
The time delay neural network, like other neural networks, operates with multiple interconnected layers of [[perceptron]]s, and is implemented as a [[feedforward neural network]]. All neurons (at each layer) of a TDNN receive inputs from the outputs of neurons at the layer below, but with two differences:
 
=== Example ===
 
In the case of a speech signal, inputs are spectral coefficients over time.
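
As a minimal sketch of how such an input might be processed, the following Python code applies a single time-delay unit to a short sequence of frames. The function name <code>tdnn_unit</code>, the toy two-coefficient frames, the weights and the use of a maximum over time are all illustrative assumptions, not taken from the cited papers.

```python
import math


def tdnn_unit(frames, weights, bias):
    """Apply one time-delay unit to a sequence of feature frames.

    frames  : list of T frames, each a list of F spectral coefficients
    weights : list of D delay taps, each a list of F weights; the same
              weights are reused at every window position (shared over time)
    bias    : scalar bias
    Returns T - D + 1 sigmoid activations, one per window position.
    """
    n_delays = len(weights)
    outputs = []
    for t in range(len(frames) - n_delays + 1):
        total = bias
        for d in range(n_delays):
            for f, w in enumerate(weights[d]):
                total += w * frames[t + d][f]
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def make_signal(onset, length=8):
    """Hypothetical toy input: silence except for a two-frame event."""
    frames = [[0.0, 0.0] for _ in range(length)]
    frames[onset] = [1.0, 0.0]
    frames[onset + 1] = [0.0, 1.0]
    return frames


# Illustrative weights that respond to the two-frame event above.
weights = [[4.0, -4.0], [-4.0, 4.0]]
bias = -4.0

early = tdnn_unit(make_signal(1), weights, bias)
late = tdnn_unit(make_signal(5), weights, bias)

# The activation pattern moves with the event, so a score pooled over time
# (here simply the maximum) does not depend on where the event occurs.
print(early.index(max(early)), late.index(max(late)))
print(max(early) == max(late))
```

Because the unit's weights are shared across all window positions, shifting the input in time only shifts the unit's output, which is what allows classification without prior segmentation.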
 
=== Implementation ===
The precise architecture of TDNNs (time-delays, number of layers) is mostly determined by the designer depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specifically for each application. Work has also been done to create adaptable time-delay TDNNs<ref>Christian Woehler and Joachim K. Anlauf, ''[https://pdfs.semanticscholar.org/9a0a/08e4d9a4cea6fa035555f2ee54bdae673614.pdf An adaptable time-delay neural-network algorithm for image sequence analysis]'', IEEE Transactions on Neural Networks 10.6 (1999): 1531-1536</ref> where this manual tuning is eliminated.
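
When choosing delays by hand, it can help to check how much input context a unit at the top layer ends up seeing. The Python sketch below adds up per-layer time offsets to obtain that overall window. The helper <code>total_context</code> and the offsets used are hypothetical, chosen only for illustration.

```python
def total_context(layer_offsets):
    """Return (lo, hi): the earliest and latest input frame, relative to
    time t, that can influence a top-layer unit at time t.

    layer_offsets holds one list of integer time offsets per layer, from
    bottom to top; [-2, -1, 0, 1, 2] means the layer reads t-2 .. t+2.
    """
    lo = sum(min(offsets) for offsets in layer_offsets)
    hi = sum(max(offsets) for offsets in layer_offsets)
    return lo, hi


# Illustrative (assumed) configuration: a dense window at the bottom layer
# and sparser, wider offsets at the higher layers.
layers = [
    [-2, -1, 0, 1, 2],
    [-1, 2],
    [-3, 3],
    [-7, 2],
]

lo, hi = total_context(layers)
print(f"output at time t depends on input frames t{lo:+d} .. t{hi:+d}")
```

With these assumed offsets the top layer covers 13 frames of past and 9 frames of future input, even though no single layer looks further than 7 frames away.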
 
=== State of the art ===
TDNN-based phoneme recognizers compared favourably with HMM-based phone models in early comparisons.<ref name="phoneme detection" /><ref name=":3" /> Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to a 50% reduction in word error over [[Mixture model|GMM]]-based acoustic models.<ref name=":4">Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, ''[https://pdfs.semanticscholar.org/ced2/11de5412580885279090f44968a428f1710b.pdf A time delay neural network architecture for efficient modeling of long temporal contexts]'', Proceedings of Interspeech 2015</ref><ref name=":5">David Snyder, Daniel Garcia-Romero, Daniel Povey, ''[http://danielpovey.com/files/2015_asru_tdnn_ubm.pdf A Time-Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition]'', Proceedings of ASRU 2015.</ref> While the different layers of TDNNs are intended to learn features of increasing context width, they still model only local contexts. When longer-distance relationships and pattern sequences have to be processed, learning states and state sequences is important, and TDNNs can be combined with other modelling techniques.<ref name=":6">Patrick Haffner, Alexander Waibel, ''[http://papers.nips.cc/paper/580-multi-state-time-delay-networks-for-continuous-speech-recognition.pdf Multi-State Time Delay Neural Networks for Continuous Speech Recognition]'', Advances in Neural Information Processing Systems, 1992, Morgan Kaufmann.</ref><ref name=":1" /><ref name=":2" />
 
== Applications ==
 
=== Speech recognition ===
TDNNs for speech recognition were introduced in 1987<ref name=":0" /> and initially focused on shift-invariant phoneme recognition. Speech lends itself nicely to TDNNs, as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible. By scanning a sound over past and future, the TDNN is able to construct a model for the key elements of that sound in a time-shift-invariant manner. This is particularly useful as sounds are smeared out through reverberation.<ref name=":4" /><ref name=":5" /> Large phonetic TDNNs can be constructed modularly through pre-training and combining smaller networks.<ref name=":3" />
 
=== Large vocabulary speech recognition ===
 
Large vocabulary speech recognition requires recognizing sequences of phonemes that make up words, subject to the constraints of a large pronunciation vocabulary. Integration of TDNNs into large vocabulary speech recognizers is possible by introducing state transitions and search between the phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network (MS-TDNN) can be trained discriminatively at the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification.<ref name=":6" /><ref name=":7">Christoph Bregler, Hermann Hild, Stefan Manke, Alexander Waibel, ''[http://isl.anthropomatik.kit.edu/cmu-kit/downloads/Improving_Connected_Letter_Recognition_by_Lipreading.pdf Improving Connected Letter Recognition by Lipreading]'', IEEE Proceedings International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993.</ref><ref name=":2" />
 
=== Speaker independence ===
 
Two-dimensional variants of the TDNNs were proposed for speaker independence.<ref name=":1" /> Here, shift-invariance is applied to the time ''as well as'' to the frequency axis in order to learn hidden features that are independent of precise location in time and in frequency (the latter being due to speaker variability).
 
=== Reverberation ===
 
One of the persistent problems in speech recognition is recognizing speech when it is corrupted by echo and reverberation (as is the case in large rooms and with distant microphones). Reverberation can be viewed as corrupting speech with delayed versions of itself. In general, however, it is difficult to de-reverberate a signal, as the impulse response function (and thus the convolutional noise experienced by the signal) is not known for any arbitrary space. The TDNN was shown to be effective at recognizing speech robustly despite different levels of reverberation.<ref name=":4" /><ref name=":5" />
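
The view of reverberation as a sum of delayed copies can be sketched as a convolution of the clean signal with a room impulse response. The short Python example below is only illustrative. The function <code>reverberate</code> and the sample values are assumptions, and real impulse responses are far longer.

```python
def reverberate(signal, impulse_response):
    """Convolve a signal with an impulse response (direct form).

    Each nonzero tap h at index k adds a copy of the signal that is
    delayed by k samples and scaled by h.
    """
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# Hypothetical values: a direct path plus two weaker echoes.
dry = [1.0, 0.5, 0.0, 0.0]
impulse_response = [1.0, 0.0, 0.5, 0.25]  # echoes after 2 and 3 samples

wet = reverberate(dry, impulse_response)
print(wet)  # [1.0, 0.5, 0.5, 0.5, 0.125, 0.0, 0.0]
```

Recovering <code>dry</code> from <code>wet</code> requires knowing <code>impulse_response</code>, which is the quantity that is generally unknown for an arbitrary room.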
 
=== Lip-reading – audio-visual speech ===
 
TDNNs were also successfully used in early demonstrations of audio-visual speech, where the sounds of speech are complemented by visually reading lip movement.<ref name=":7" /> Here, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused nicely in a neural net.
 
=== Handwriting recognition ===
 
TDNNs have been used effectively in compact and high-performance [[handwriting recognition]] systems. Shift-invariance was also adapted to spatial patterns (x/y-axes) in image offline handwriting recognition.<ref name=":2" />
 
=== Video analysis ===
 
Video has a temporal dimension that makes a TDNN well suited to analysing motion patterns. An example of this analysis is a combination of vehicle detection and pedestrian recognition.<ref>Christian Woehler and Joachim K. Anlauf, ''[https://www.sciencedirect.com/science/article/pii/S0262885601000403 Real-time object recognition on image sequences with the adaptable time delay neural network algorithm—applications for autonomous vehicles]'', Image and Vision Computing 19.9 (2001): 593-618.</ref> When examining videos, successive images are fed into the TDNN as input, where each image is the next frame in the video. The strength of the TDNN comes from its ability to examine objects shifted forward and backward in time, so that an object remains detectable as time advances. If an object can be recognized in this manner, an application can anticipate where that object will be found in the future and act accordingly.
 
=== Image recognition ===
Two-dimensional TDNNs were later applied to other image-recognition tasks under the name of "[[Convolutional neural network|Convolutional Neural Networks]]", where shift-invariant training is applied to the x/y axes of an image.
 
=== Common libraries ===
* TDNNs can be implemented in virtually all machine-learning frameworks using one-dimensional [[convolutional neural network]]s, due to the equivalence of the methods.
* [[Matlab]]: The neural network toolbox has explicit functionality designed to produce a time delay neural network given the step size of time delays and an optional training function. The default training algorithm is a supervised-learning back-propagation algorithm that updates filter weights based on Levenberg-Marquardt optimization. The function is <code>timedelaynet(delays, hidden_layers, train_fnc)</code> and returns a time-delay neural network architecture that a user can train and provide inputs to.<ref>''"[https://www.mathworks.com/help/deeplearning/time-series-and-dynamic-systems.html Time Series and Dynamic Systems - MATLAB & Simulink]".'' mathworks.com. Retrieved 21 June 2016.</ref>
* The [[Kaldi (software)|Kaldi ASR Toolkit]] has an implementation of TDNNs with several optimizations for speech recognition.<ref>Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, Sanjeev Khudanpur, ''[http://danielpovey.com/files/2015_asru_aspire.pdf JHU ASpIRE system: Robust LVCSR with TDNNs i-vector Adaptation and RNN-LMs]'', Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2015.</ref>
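
The equivalence with one-dimensional convolution mentioned in the first item can be illustrated without any framework. The Python sketch below computes the same layer in two ways: by splicing delayed frames and applying a weight row per unit, and by sliding a kernel along the time axis. All names (<code>splice_then_project</code>, <code>conv1d</code>) and values are hypothetical, the layer is linear for brevity, and the comparison assumes contiguous offsets.

```python
def splice_then_project(frames, offsets, weight_rows):
    """Time-delay view: concatenate the frames at the given offsets, then
    apply one weight row per output unit to the spliced vector."""
    start = max(0, -min(offsets))
    stop = len(frames) - max(0, max(offsets))
    outputs = []
    for t in range(start, stop):
        spliced = [x for o in offsets for x in frames[t + o]]
        outputs.append([sum(w * x for w, x in zip(row, spliced))
                        for row in weight_rows])
    return outputs


def conv1d(frames, kernel):
    """Convolution view: kernel[u][k][f] is the weight of output unit u at
    tap k for input feature f; the kernel slides along the time axis."""
    taps = len(kernel[0])
    n_feats = len(frames[0])
    outputs = []
    for t in range(len(frames) - taps + 1):
        outputs.append([
            sum(unit[k][f] * frames[t + k][f]
                for k in range(taps) for f in range(n_feats))
            for unit in kernel
        ])
    return outputs


# Assumed toy data: 5 frames of 2 features, 2 output units, offsets -1..+1.
frames = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 0.0], [1.0, 1.0]]
offsets = [-1, 0, 1]
weight_rows = [
    [1.0, 0.0, -1.0, 2.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0, -1.0, 1.0],
]

# Reshape each flat weight row into (taps x features) to obtain the kernel.
n_feats = len(frames[0])
kernel = [[row[k * n_feats:(k + 1) * n_feats] for k in range(len(offsets))]
          for row in weight_rows]

tdnn_out = splice_then_project(frames, offsets, weight_rows)
conv_out = conv1d(frames, kernel)
print(tdnn_out == conv_out)
```

The two functions differ only in how the same weights are indexed, which is why a one-dimensional convolution layer in a general-purpose framework can stand in for a TDNN layer.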
 
== See also ==
* [[Convolutional neural network]]{{snd}}a convolutional neural net where the convolution is performed along the time axis of the data is very similar to a TDNN.
* [[Recurrent neural networks]]{{snd}}a recurrent neural network also handles temporal data, albeit in a different manner. Instead of a time-varied input, RNNs maintain internal hidden layers to keep track of past (and in the case of bi-directional RNNs, future) inputs.
 
== References ==
{{reflist}}