== Overview ==
=== Architecture ===
In modern terms, the TDNN is a 1D [[convolutional neural network]] in which the convolution runs along the time dimension. The original design has exactly 3 layers.
The input to the network is a continuous speech signal, preprocessed into a 2D array (a [[mel scale]] [[spectrogram]]). One dimension is time, at 10 ms per frame; the other is frequency, with 16 mel-scale coefficients. While the time dimension can in principle be arbitrarily long, the original experiment considered only short utterances of single syllables such as "baa", "daa", and "gaa", so each speech signal was just 15 frames long (150 ms).
In detail, they processed a voice signal as follows:
* Input speech is sampled at 12 kHz, [[Window function#Hann and Hamming windows|Hamming-windowed]].
* Its [[Fast Fourier transform|FFT]] is computed every 5 ms.
* The mel scale coefficients are computed from the power spectrum by taking log energies in each mel scale energy band.
* Adjacent coefficients in time are smoothed together, resulting in one frame every 10 ms.
* For each signal, a human manually detects the onset of the vowel, and the signal is trimmed to the 7 frames before and the 7 frames after that point, leaving 15 frames centered on the vowel onset.
* The coefficients are normalized by subtracting the mean, then scaling, so that the signals fall between -1 and +1.
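The preprocessing pipeline above can be sketched as follows. This is a minimal NumPy illustration of the steps, not the original implementation: the 10 ms analysis-window length and the use of contiguous FFT-bin slices in place of a proper triangular mel filterbank are simplifying assumptions.

```python
import numpy as np

fs = 12_000                      # 12 kHz sampling rate
frame_len = int(0.010 * fs)      # analysis window length (an assumption)
hop = int(0.005 * fs)            # FFT computed every 5 ms

# stand-in waveform in place of a real recording
signal = np.random.default_rng(0).standard_normal(fs // 4)

# Hamming-windowed power spectrum every 5 ms
window = np.hamming(frame_len)
frames = [signal[i:i + frame_len] * window
          for i in range(0, len(signal) - frame_len, hop)]
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

# 16 band log energies (illustrative shortcut: each "band" is a
# contiguous slice of FFT bins rather than a mel-scale triangular filter)
bands = np.array_split(np.arange(power.shape[1]), 16)
logmel = np.log(np.stack([power[:, b].sum(axis=1) for b in bands], axis=1) + 1e-10)

# smooth adjacent 5 ms frames together -> one frame per 10 ms
n = len(logmel) // 2
smoothed = (logmel[0::2][:n] + logmel[1::2][:n]) / 2

# normalize: subtract the mean, then scale into [-1, +1]
centered = smoothed - smoothed.mean()
normalized = centered / np.abs(centered).max()
print(normalized.shape)
```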
The first layer of the TDNN is a 1D convolutional layer. The layer contains 8 kernels of shape <math>3 \times 16 </math>. It outputs a tensor of shape <math>8 \times 13</math>.
The second layer of the TDNN is a 1D convolutional layer. The layer contains 3 kernels of shape <math>5 \times 8</math>. It outputs a tensor of shape <math>3 \times 9</math>.
The third layer of the TDNN is not a convolutional layer. Instead, it is simply a fixed layer with 3 neurons. Let the output from the second layer be <math>x_{i,j}</math> where <math>i \in 1:3</math> and <math>j \in 1:9</math>. The <math>i</math>-th neuron in the third layer computes <math>\sigma(\sum_{j\in 1:9} x_{i,j})</math>, where <math>\sigma</math> is the [[sigmoid function]]. Essentially, it can be thought of as a convolution layer with 3 kernels of shape <math>1 \times 9</math>.
It was trained on ~800 samples for 20,000--50,000 [[backpropagation]] steps. Each step was computed as a [[Batch processing|batch]] over the entire training dataset, i.e. not [[Stochastic gradient descent|stochastic]]. Training required an [[Alliant Computer Systems|Alliant supercomputer]] with 4 processors.
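The full-batch update scheme, as opposed to stochastic gradient descent, can be illustrated on a toy stand-in model (logistic regression here, not the TDNN itself; the data, learning rate, and step count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((800, 16))     # ~800 training samples
y = (X[:, 0] > 0).astype(float)        # toy binary labels

w = np.zeros(16)
lr = 0.1
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # forward pass on the WHOLE dataset
    w -= lr * X.T @ (p - y) / len(y)   # one full-batch gradient step

acc = ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean()
print(acc)
```

Each iteration uses the gradient averaged over all 800 samples, so every step moves in the exact direction of the training loss gradient; a stochastic variant would instead use one sample (or a small subset) per step.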
=== Example ===