Time delay neural network

{{Short description|Neural network architecture}}
[[File:TDNN Diagram.png|thumb|right|TDNN diagram]]
 
'''Time delay neural network''' ('''TDNN''')<ref name="phoneme detection">{{cite journal |doi=10.1109/29.21701 |url=http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf |title=Phoneme recognition using time-delay neural networks |date=1989 |last1=Waibel |first1=A. |last2=Hanazawa |first2=T. |last3=Hinton |first3=G. |last4=Shikano |first4=K. |last5=Lang |first5=K.J. |journal=IEEE Transactions on Acoustics, Speech, and Signal Processing |volume=37 |issue=3 |pages=328–339}}</ref> is a multilayer [[artificial neural network]] architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network. It is essentially a one-dimensional [[convolutional neural network]] (CNN).
 
Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern (such as speech), the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.
 
== History ==
The TDNN was introduced in the late 1980s and applied to a task of [[phoneme]] classification for automatic [[speech recognition]] in speech signals where the automatic determination of precise segments or feature boundaries was difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features independently of their position in time, it improved performance over static classification.<ref name="phoneme detection" /><ref name=":0">Alexander Waibel, [https://isl.iar.kit.edu/downloads/Pheome_Recognition_Using_Time-Delay_Neural_Networks_SP87-100_6.pdf Phoneme Recognition Using Time-Delay Neural Networks], Proceedings of the Institute of Electrical, Information and Communication Engineers (IEICE), December 1987, Tokyo, Japan.</ref> It was also applied to two-dimensional signals (time-frequency patterns in speech<ref name=":1">{{cite journal |author=John B. Hampshire |author2=Alex Waibel |url=https://www.researchgate.net/publication/391319411 |title=Connectionist Architectures for Multi-Speaker Phoneme Recognition |date=1990 |journal=Advances in Neural Information Processing Systems |volume=2 |pages=203–210}}</ref> and coordinate-space patterns in OCR<ref name=":2">{{cite journal |url=https://www.researchgate.net/publication/220163530 |doi=10.1007/PL00013559 |title=Online handwriting recognition: The NPen++ recognizer |date=2001 |last1=Jaeger |first1=S. |last2=Manke |first2=S. |last3=Reichert |first3=J. |last4=Waibel |first4=A. |journal=International Journal on Document Analysis and Recognition |volume=3 |issue=3 |pages=169–180}}</ref>).
 
[[Kunihiko Fukushima]] published the [[neocognitron]] in 1980.<ref name="intro">{{cite journal |last=Fukushima |first=Kunihiko |year=1980 |title=Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position |url=https://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |url-status=live |journal=Biological Cybernetics |volume=36 |issue=4 |pages=193–202 |doi=10.1007/BF00344251 |pmid=7370364 |s2cid=206775608 |archive-url=https://web.archive.org/web/20140603013137/http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |archive-date=3 June 2014 |access-date=16 November 2013}}</ref> [[Max pooling]] appears in a 1982 publication on the neocognitron<ref>{{Cite journal |last1=Fukushima |first1=Kunihiko |last2=Miyake |first2=Sei |date=1982-01-01 |title=Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position |url=https://www.sciencedirect.com/science/article/abs/pii/0031320382900243 |journal=Pattern Recognition |volume=15 |issue=6 |pages=455–469 |doi=10.1016/0031-3203(82)90024-3 |bibcode=1982PatRe..15..455F |issn=0031-3203|url-access=subscription }}</ref> and was in the 1989 publication in [[LeNet|LeNet-5]].<ref>{{Cite journal |last1=LeCun |first1=Yann |last2=Boser |first2=Bernhard |last3=Denker |first3=John |last4=Henderson |first4=Donnie |last5=Howard |first5=R. |last6=Hubbard |first6=Wayne |last7=Jackel |first7=Lawrence |date=1989 |title=Handwritten Digit Recognition with a Back-Propagation Network |url=https://proceedings.neurips.cc/paper/1989/hash/53c3bce66e43be4f209556518c2fcb54-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=2}}</ref>
 
In 1990, Yamaguchi et al. used max pooling in TDNNs in order to realize a speaker independent isolated word recognition system.<ref name="Yamaguchi111990">{{cite conference |title=A Neural Network for Speaker-Independent Isolated Word Recognition |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |___location=Kobe, Japan |conference=First International Conference on Spoken Language Processing (ICSLP 90) |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |access-date=2019-09-04 |archive-date=2021-03-07 |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |url-status=dead }}</ref>
 
== Overview ==
The Time Delay Neural Network, like other neural networks, operates with multiple interconnected layers of [[perceptron]]s, and is implemented as a [[feedforward neural network]]. All neurons (at each layer) of a TDNN receive inputs from the outputs of neurons at the layer below but with two differences:
 
=== Architecture ===
# Unlike regular [[Multilayer perceptron|multi-layer perceptrons]], all units in a TDNN, at each layer, obtain inputs from a contextual ''window'' of outputs from the layer below. For time-varying signals (e.g. speech), each unit has connections not only to the outputs from units below but also to the time-delayed (past) outputs from these same units. This models the units' temporal pattern/trajectory. For two-dimensional signals (e.g. time-frequency patterns or images), a 2-D context window is observed at each layer. Higher layers have inputs from wider context windows than lower layers and thus generally model coarser levels of abstraction.
# Shift-invariance is achieved by explicitly removing position dependence during [[backpropagation]] training. This is done by making time-shifted copies of a network across the dimension of invariance (here: time). The error gradient is then computed by backpropagation through all these networks from an overall target vector, but before performing the weight update, the error gradients associated with shifted copies are averaged and thus shared and constrained to be equal. Thus, all position dependence from backpropagation training through the shifted copies is removed and the copied networks learn the most salient hidden features shift-invariantly, i.e. independently of their precise position in the input data. Shift-invariance is also readily extended to multiple dimensions by imposing similar weight-sharing across copies that are shifted along multiple dimensions.<ref name=":1" /><ref name=":2" />
In modern terms, the TDNN design is a one-dimensional [[convolutional neural network]] in which the convolution runs across the time dimension. The original design has exactly three layers.
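The averaging of error gradients across time-shifted copies can be illustrated with a minimal toy sketch (pure Python, hypothetical numbers; not the original implementation). A single shared weight is applied by every shifted copy, per-copy gradients are averaged, and one update is made, which keeps all copies identical:

```python
def train_step(w, xs, target, lr=0.1):
    """One update of a weight shared by all shifted copies.

    xs holds the input value seen by each time-shifted copy; the per-copy
    squared-error gradients are averaged before the single weight update.
    """
    grads = []
    for x in xs:                            # each shifted copy of the network
        y = w * x                           # that copy's output
        grads.append(2 * (y - target) * x)  # d/dw of (y - target)^2 for this copy
    g = sum(grads) / len(grads)             # averaging ties the copies together
    return w - lr * g

w = 0.5
for _ in range(100):
    w = train_step(w, xs=[1.0, 1.0, 1.0], target=0.8)
print(round(w, 3))  # converges toward 0.8, since every copy sees x == 1
```

Because the averaged gradient is applied once to the shared weight, no copy can specialize to a particular position, which is exactly the weight sharing of a 1-D convolution.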
 
The input to the network is a continuous speech signal, preprocessed into a 2D array (a [[mel scale]] [[spectrogram]]). One dimension is time, at 10 ms per frame, and the other is frequency, with 16 mel coefficients. The time dimension can be arbitrarily long, but the original experiment considered only very short utterances of single syllables such as "baa", "daa", and "gaa", so each speech signal could be as short as 15 frames (150 ms).
 
In detail, they processed a voice signal as follows:
 
* Input speech is sampled at 12 kHz, [[Window function#Hann and Hamming windows|Hamming-windowed]].
* Its [[Fast Fourier transform|FFT]] is computed every 5 ms.
* The mel scale coefficients are computed from the power spectrum by taking log energies in each mel scale energy band.
* Adjacent coefficients in time are smoothed, resulting in one frame every 10 ms.
* For each signal, a human manually detects the onset of the vowel, and the signal is trimmed to the 7 frames before and the 7 frames after it, leaving just 15 frames in total, centered on the onset of the vowel.
* The coefficients are normalized by subtracting the mean, then scaling, so that the signals fall between -1 and +1.
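The last step above can be sketched as follows (a rough sketch only: the exact scaling used in the paper is an assumption here; this version subtracts the mean and divides by the largest absolute deviation so values land in [-1, +1]):

```python
def normalize(coeffs):
    """Center a frame's coefficients at zero, then scale into [-1, 1]."""
    mean = sum(coeffs) / len(coeffs)
    centered = [c - mean for c in coeffs]
    peak = max(abs(c) for c in centered) or 1.0  # guard against all-zero frames
    return [c / peak for c in centered]

frame = [2.0, 4.0, 6.0, 12.0]   # hypothetical log mel energies
result = normalize(frame)
print(result)  # roughly [-0.666..., -0.333..., 0.0, 1.0]
```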
 
The first layer of the TDNN is a 1D convolutional layer. The layer contains 8 kernels of shape <math>3 \times 16 </math>. It outputs a tensor of shape <math>8 \times 13</math>.
 
The second layer of the TDNN is a 1D convolutional layer. The layer contains 3 kernels of shape <math>5 \times 8</math>. It outputs a tensor of shape <math>3 \times 9</math>.
 
The third layer of the TDNN is not a convolutional layer. Instead, it is simply a fixed layer with 3 neurons. Let the output from the second layer be <math>x_{i,j}</math> where <math>i \in 1:3</math> and <math>j \in 1:9</math>. The <math>i</math>-th neuron in the third layer computes <math>\sigma(\sum_{j\in 1:9} x_{i,j})</math>, where <math>\sigma</math> is the [[sigmoid function]]. Essentially, it can be thought of as a convolution layer with 3 kernels of shape <math>1 \times 9</math>.
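The three layers described above can be sketched in plain Python with random weights (a shape check only, not the trained network): a valid 1-D convolution along time with sigmoid units reproduces the <math>8 \times 13</math> and <math>3 \times 9</math> intermediate shapes and the 3 output scores.

```python
import math
import random

def conv1d_time(x, kernels):
    """x: [channels][time]; kernels: [n_out][width][channels].
    Valid 1-D convolution along the time axis with sigmoid activation."""
    n_ch, T = len(x), len(x[0])
    out = []
    for k in kernels:
        width = len(k)
        row = []
        for t in range(T - width + 1):
            s = sum(k[d][c] * x[c][t + d] for d in range(width) for c in range(n_ch))
            row.append(1 / (1 + math.exp(-s)))
        out.append(row)
    return out

def rand_kernels(n_out, width, n_in):
    return [[[random.uniform(-0.1, 0.1) for _ in range(n_in)]
             for _ in range(width)] for _ in range(n_out)]

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(15)] for _ in range(16)]  # 16 mel bands x 15 frames

h1 = conv1d_time(x, rand_kernels(8, 3, 16))            # layer 1: 8 x 13
h2 = conv1d_time(h1, rand_kernels(3, 5, 8))            # layer 2: 3 x 9
y = [1 / (1 + math.exp(-sum(row))) for row in h2]      # layer 3: 3 phoneme scores
print(len(h1), len(h1[0]), len(h2), len(h2[0]), len(y))  # 8 13 3 9 3
```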
 
It was trained on ~800 samples for 20,000–50,000 [[backpropagation]] steps. Each step was computed in a [[Batch processing|batch]] over the entire training dataset, i.e. not [[Stochastic gradient descent|stochastically]]. Training required the use of an [[Alliant Computer Systems|Alliant supercomputer]] with 4 processors.
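The full-batch update described above (as opposed to stochastic updates on single samples) can be sketched on a one-parameter least-squares toy problem (hypothetical data, not the original task):

```python
def batch_gd(data, w=0.0, lr=0.01, steps=1000):
    """One weight, squared error; every step uses the whole dataset."""
    for _ in range(steps):
        # gradient of the squared error averaged over ALL samples (full batch)
        g = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * g
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy data on the line y = 2x
w_fit = batch_gd(data)
print(round(w_fit, 3))  # 2.0
```

A stochastic variant would instead compute `g` from one sample per step; the 1989 experiments used the full-batch form.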
 
=== Example ===
 
=== Implementation ===
The precise architecture of TDNNs (time-delays, number of layers) is mostly determined by the designer depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specific to each application. Work has also been done to create adaptable time-delay TDNNs<ref>{{cite journal |doi=10.1109/72.809100 |s2cid=16813677 |title=An adaptable time-delay neural-network algorithm for image sequence analysis |date=1999 |last1=Wöhler |first1=C. |last2=Anlauf |first2=J.K. |journal=IEEE Transactions on Neural Networks |volume=10 |issue=6 |pages=1531–1536 |pmid=18252656}}</ref> where this manual tuning is eliminated.
 
=== State of the art ===
TDNN-based phoneme recognizers compared favourably in early comparisons with HMM-based phone models.<ref name="phoneme detection" /><ref name=":3" /> Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to 50% word error reduction over [[Mixture model|GMM]]-based acoustic models.<ref name=":4">{{cite book |doi=10.21437/Interspeech.2015-647 |doi-access=free |s2cid=8536162 |chapter=A time delay neural network architecture for efficient modeling of long temporal contexts |title=Interspeech 2015 |date=2015 |last1=Peddinti |first1=Vijayaditya |last2=Povey |first2=Daniel |last3=Khudanpur |first3=Sanjeev |pages=3214–3218}}</ref><ref name=":5">David Snyder, Daniel Garcia-Romero, Daniel Povey, ''[http://danielpovey.com/files/2015_asru_tdnn_ubm.pdf Time Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition]'', Proceedings of ASRU 2015.</ref> While the different layers of TDNNs are intended to learn features of increasing context width, they still model only local contexts.
When longer-distance relationships and pattern sequences have to be processed, learning states and state-sequences is important and TDNNs can be combined with other modelling techniques.<ref name=":6">{{Cite journal |last1=Haffner |first1=Patrick |last2=Waibel |first2=Alex |date=1991 |title=Multi-State Time Delay Neural Networks for Continuous Speech Recognition |url=https://proceedings.neurips.cc/paper_files/paper/1991/hash/069d3bb002acd8d7dd095917f9efe4cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=NIPS |volume=4 |pages=135–142}}</ref><ref name=":1" /><ref name=":2" /> TDNN architectures have also been adapted to [[Spiking neural network|Spiking Neural Networks]], leading to state-of-the-art results while lending themselves to energy-efficient [[Neuromorphic chip|hardware implementations]].<ref>{{Cite journal |last1=D'Agostino |first1=Simone |last2=Moro |first2=Filippo |last3=Torchet |first3=Tristan |last4=Demirağ |first4=Yiğit |last5=Grenouillet |first5=Laurent |last6=Castellani |first6=Niccolò |last7=Indiveri |first7=Giacomo |last8=Vianello |first8=Elisa |last9=Payvand |first9=Melika |date=2024-04-24 |title=DenRAM: neuromorphic dendritic architecture with RRAM for efficient temporal processing with delays |url=https://www.nature.com/articles/s41467-024-47764-w |journal=Nature Communications |language=en |volume=15 |issue=1 |pages=3446 |doi=10.1038/s41467-024-47764-w |issn=2041-1723 |pmc=11043378}}</ref>
 
== Applications ==
 
=== Large vocabulary speech recognition ===
Large vocabulary speech recognition requires recognizing sequences of phonemes that make up words, subject to the constraints of a large pronunciation vocabulary. Integration of TDNNs into large vocabulary speech recognizers is possible by introducing state transitions and search between the phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network (MS-TDNN) can be trained discriminatively from the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification.<ref name=":6" /><ref name=":7">{{cite book |doi=10.1109/ICASSP.1993.319179 |chapter=Improving connected letter recognition by lipreading |title=IEEE International Conference on Acoustics, Speech, and Signal Processing |date=1993 |last1=Bregler |first1=C. |last2=Hild |first2=H. |last3=Manke |first3=S. |last4=Waibel |first4=A. |volume=1 |pages=557–560 |isbn=0-7803-0946-4}}</ref><ref name=":2" />
 
=== Speaker independence ===
 
=== Handwriting recognition ===
TDNNs have been used effectively in compact and high-performance [[handwriting recognition]] systems.<ref>{{Cite journal |last=Guyon |first=I. |last2=Albrecht |first2=P. |last3=Le Cun |first3=Y. |last4=Denker |first4=J. |last5=Hubbard |first5=W. |date=1991-01-01 |title=Design of a neural network character recognizer for a touch terminal |url=https://www.sciencedirect.com/science/article/pii/003132039190081F |journal=Pattern Recognition |volume=24 |issue=2 |pages=105–119 |doi=10.1016/0031-3203(91)90081-F |issn=0031-3203|url-access=subscription }}</ref> Shift-invariance was also adapted to spatial patterns (x/y-axes) in image offline handwriting recognition.<ref name=":2" />
 
=== Video analysis ===
Video has a temporal dimension that makes a TDNN an ideal solution to analysing motion patterns. An example of this analysis is a combination of vehicle detection and pedestrian recognition.<ref>{{cite journal |doi=10.1016/S0262-8856(01)00040-3 |title=Real-time object recognition on image sequences with the adaptable time delay neural network algorithm — applications for autonomous vehicles |date=2001 |last1=Wöhler |first1=C. |last2=Anlauf |first2=J. K. |journal=Image and Vision Computing |volume=19 |issue=9–10 |pages=593–618}}</ref> When examining videos, subsequent images are fed into the TDNN as input, where each image is the next frame in the video. The strength of the TDNN comes from its ability to examine objects as they shift forward and backward in time, so that an object remains detectable as its position in time changes. If an object can be recognized in this manner, an application can anticipate where that object will be found in the future and act accordingly.
 
=== Image recognition ===
* TDNNs can be implemented in virtually all machine-learning frameworks using one-dimensional [[convolutional neural network]]s, due to the equivalence of the methods.
* [[Matlab]]: The neural network toolbox has explicit functionality designed to produce a time delay neural network given the step size of time delays and an optional training function. The default training algorithm is a supervised back-propagation algorithm that updates filter weights based on Levenberg–Marquardt optimization. The function is timedelaynet(delays, hidden_layers, train_fnc) and returns a time-delay neural network architecture that a user can train and provide inputs to.<ref>''"[https://www.mathworks.com/help/deeplearning/time-series-and-dynamic-systems.html Time Series and Dynamic Systems - MATLAB & Simulink]".'' mathworks.com. Retrieved 21 June 2016.</ref>
* The [[Kaldi (software)|Kaldi ASR Toolkit]] has an implementation of TDNNs with several optimizations for speech recognition.<ref>{{cite book |doi=10.1109/ASRU.2015.7404842 |chapter-url=http://danielpovey.com/files/2015_asru_aspire.pdf |chapter=JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs |title=2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) |date=2015 |last1=Peddinti |first1=Vijayaditya |last2=Chen |first2=Guoguo |last3=Manohar |first3=Vimal |last4=Ko |first4=Tom |last5=Povey |first5=Daniel |last6=Khudanpur |first6=Sanjeev |pages=539–546 |isbn=978-1-4799-7291-3}}</ref>
 
== See also ==
 
== References ==
{{reflist}}
 
[[Category:Neural network architectures]]
[[Category:1987 in artificial intelligence]]