 
== Overview ==
The Time Delay Neural Network, like other neural networks, operates with multiple interconnected layers of [[perceptron]]s, and is implemented as a [[feedforward neural network]]. All neurons (at each layer) of a TDNN receive inputs from the outputs of neurons at the layer below but with two differences:
 
=== Architecture ===
# Unlike regular [[Multilayer perceptron|Multi-Layer perceptrons]], all units in a TDNN, at each layer, obtain inputs from a contextual ''window'' of outputs from the layer below. For time-varying signals (e.g. speech), each unit has connections not only to the outputs of units below but also to the time-delayed (past) outputs of those same units. This models the units' temporal pattern/trajectory. For two-dimensional signals (e.g. time-frequency patterns or images), a 2-D context window is observed at each layer. Higher layers have inputs from wider context windows than lower layers and thus generally model coarser levels of abstraction.
#: In modern terms, the TDNN design is a 1-D [[convolutional neural network]] in which the convolution runs along the time dimension. The original design had exactly three layers.
# Shift-invariance is achieved by explicitly removing position dependence during [[backpropagation]] training. This is done by making time-shifted copies of a network across the dimension of invariance (here: time). The error gradient is then computed by backpropagation through all these networks from an overall target vector, but before performing the weight update, the error gradients associated with shifted copies are averaged and thus shared and constrained to be equal. Thus, all position dependence from backpropagation training through the shifted copies is removed and the copied networks learn the most salient hidden features shift-invariantly, i.e. independent of their precise position in the input data. Shift-invariance is also readily extended to multiple dimensions by imposing similar weight-sharing across copies that are shifted along multiple dimensions.<ref name=":1" /><ref name=":2" />
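The gradient-averaging scheme described above can be illustrated with a toy one-dimensional example. The model here (a single shared 3-frame kernel with a squared-error loss) is a hypothetical stand-in, not the original network: each time-shifted "copy" contributes its own gradient, and the copies' gradients are averaged before the shared weights are updated, so the learned kernel carries no position dependence.

```python
import numpy as np

# Toy illustration of weight sharing across time shifts: one kernel w is
# applied at every shift, and the per-shift gradients are averaged (tied)
# before a single update, which is what makes the feature shift-invariant.
rng = np.random.default_rng(1)
w = rng.standard_normal(3) * 0.1          # shared 3-frame kernel (hypothetical)
x = rng.standard_normal(10)               # toy 1-D input signal
target = 0.5                              # toy scalar target
lr = 0.1

grads = []
for t in range(len(x) - len(w) + 1):      # one "copy" of the network per shift
    window = x[t:t + len(w)]
    y = w @ window                        # that copy's output
    grads.append(2 * (y - target) * window)  # dL/dw for squared error

w -= lr * np.mean(grads, axis=0)          # averaged gradient = shared update
```

Because every shifted copy contributes equally to the averaged gradient, the update is identical to training a single convolutional kernel, which is why the TDNN is described as a 1-D convolutional network.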
 
The input to the network is a continuous speech signal, preprocessed into a 2D array (a [[mel scale]] [[spectrogram]]). One dimension is time, at 10 ms per frame; the other is frequency, with 16 coefficients per frame. The time dimension can in principle be arbitrarily long, but the original experiment considered only very short speech signals pronouncing single words like "baa", "daa", "gaa". Because of this, each speech signal was only 15 frames (150 ms) long.
 
In detail, the voice signal was processed as follows:
 
* Input speech is sampled at 12 kHz, [[Window function#Hann and Hamming windows|Hamming-windowed]].
* Its [[Fast Fourier transform|FFT]] is computed every 5 ms.
* The mel scale coefficients are computed from the power spectrum by taking log energies in each mel scale energy band.
* Adjacent coefficients in time are smoothed together, resulting in one frame every 10 ms.
* For each signal, the onset of the vowel is manually marked by a human, and the speech signal is trimmed to the 7 frames before and 7 frames after this onset, leaving just 15 frames in total, centered at the onset of the vowel.
* The coefficients are normalized by subtracting the mean, then scaling, so that the signals fall between -1 and +1.
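The steps above can be sketched in NumPy. This is a rough, hypothetical rendering of the pipeline: the 256-sample analysis window is an assumption, and equal-width log-energy bands stand in for true mel-scale filter banks.

```python
import numpy as np

# Sketch of the preprocessing pipeline (assumptions: 256-sample window,
# equal-width bands approximating the 16 mel-scale energy bands).
fs = 12_000                         # input sampled at 12 kHz
frame_len = 256                     # analysis window length (assumption)
hop = int(0.005 * fs)               # FFT computed every 5 ms -> 60 samples

signal = np.random.default_rng(2).standard_normal(3000)  # toy audio stand-in
window = np.hamming(frame_len)      # Hamming window

frames = []
for start in range(0, len(signal) - frame_len, hop):
    power = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
    bands = np.array_split(power, 16)               # crude 16-band split
    frames.append(np.log([b.sum() + 1e-10 for b in bands]))  # log energies
frames = np.array(frames)                           # (num_5ms_frames, 16)

# smooth adjacent frames in time -> one frame every 10 ms
n = len(frames) // 2
frames = 0.5 * (frames[0:2 * n:2] + frames[1:2 * n:2])

# normalize: subtract the mean, then scale so values fall in [-1, +1]
frames -= frames.mean()
frames /= np.abs(frames).max()
```

In the original setup the resulting frames would then be trimmed to the 15 frames around the manually marked vowel onset.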
 
The first layer of the TDNN is a 1D convolutional layer. The layer contains 8 kernels of shape <math>3 \times 16 </math>. It outputs a tensor of shape <math>8 \times 13</math>.
 
The second layer of the TDNN is a 1D convolutional layer. The layer contains 3 kernels of shape <math>5 \times 8</math>. It outputs a tensor of shape <math>3 \times 9</math>.
 
The third layer of the TDNN is not a convolutional layer. Instead, it is simply a fixed layer with 3 neurons. Let the output from the second layer be <math>x_{i,j}</math> where <math>i \in 1:3</math> and <math>j \in 1:9</math>. The <math>i</math>-th neuron in the third layer computes <math>\sigma(\sum_{j\in 1:9} x_{i,j})</math>, where <math>\sigma</math> is the [[sigmoid function]]. Essentially, it can be thought of as a convolution layer with 3 kernels of shape <math>1 \times 9</math>.
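The layer shapes described above can be checked with a minimal NumPy forward pass. The random weights are purely illustrative, and applying a sigmoid after each convolutional layer is an assumption about the activation placement.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, kernels):
    """Valid 1-D convolution over time.
    x: (channels, time); kernels: (out_channels, width, channels)."""
    out_ch, width, _ = kernels.shape
    T = x.shape[1] - width + 1
    y = np.zeros((out_ch, T))
    for k in range(out_ch):
        for t in range(T):
            y[k, t] = np.sum(kernels[k] * x[:, t:t + width].T)
    return y

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((16, 15))   # 16 mel bands x 15 frames
W1 = rng.standard_normal((8, 3, 16)) * 0.1    # layer 1: 8 kernels of 3x16
W2 = rng.standard_normal((3, 5, 8)) * 0.1     # layer 2: 3 kernels of 5x8

h1 = sigmoid(conv1d(spectrogram, W1))         # (8, 13): 15 - 3 + 1 = 13
h2 = sigmoid(conv1d(h1, W2))                  # (3, 9):  13 - 5 + 1 = 9
out = sigmoid(h2.sum(axis=1))                 # layer 3: sigma(sum_j x_ij), (3,)
```

The three outputs correspond to the three target classes, and the time widths 13 and 9 follow from valid convolution: an input of length <math>T</math> and a kernel of width <math>k</math> give <math>T - k + 1</math> output positions.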
 
It was trained on ~800 samples for 20,000–50,000 [[backpropagation]] steps. Each step was computed as a [[Batch processing|batch]] over the entire training dataset, i.e. not [[Stochastic gradient descent|stochastic]]. Training required the use of an [[Alliant Computer Systems|Alliant supercomputer]] with 4 processors.
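The full-batch update rule, in which every gradient step is computed over the entire training set, can be contrasted with stochastic updates in a toy example. The linear-regression model and synthetic data here are hypothetical stand-ins, not the original task.

```python
import numpy as np

# Toy full-batch gradient descent: each step uses the ENTIRE training set,
# unlike stochastic gradient descent, which samples a subset per step.
rng = np.random.default_rng(3)
X = rng.standard_normal((800, 4))            # ~800 training samples (toy)
w_true = np.array([1.0, -2.0, 0.5, 3.0])     # synthetic target weights
y = X @ w_true

w = np.zeros(4)
lr = 0.1
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)    # one gradient over all 800 samples
    w -= lr * grad                           # one weight update per full pass
```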
 
=== Example ===