return y
</syntaxhighlight>For multilayered [[Recurrent neural network|recurrent neural networks]] (RNN), BatchNorm is usually applied only to the input-to-hidden part of the recurrence, not the hidden-to-hidden part.<ref name=":4">{{Cite journal |last=Laurent |first=Cesar |last2=Pereyra |first2=Gabriel |last3=Brakel |first3=Philemon |last4=Zhang |first4=Ying |last5=Bengio |first5=Yoshua |date=2016-03 |title=Batch normalized recurrent neural networks |url=http://ieeexplore.ieee.org/document/7472159/ |publisher=IEEE |pages=2657–2661 |doi=10.1109/ICASSP.2016.7472159 |isbn=978-1-4799-9988-0}}</ref> Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)}) </math>where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)}) </math>There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence. Then frame-wise BatchNorm means normalizing over <math>b</math>:<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2
\end{aligned}
</math>while sequence-wise BatchNorm means normalizing over both <math>b</math> and <math>t</math>:<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T (h_{b,t}^{(l)} - \mu^{(l)})^2
\end{aligned}
</math>Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the smaller sequences are padded with zeroes to match the size of the longest sequence of the batch. In such setups, frame-wise is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates.<ref name=":4" />
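The two batch definitions differ only in which axes the statistics are averaged over. A minimal NumPy sketch (variable names such as <code>mu_frame</code> are ours, for illustration only):<syntaxhighlight lang="python">
import numpy as np

# h: hidden states of one layer, shape (B, T, H) =
# (sentences in batch, tokens per sentence, hidden units)
rng = np.random.default_rng(0)
B, T, H = 4, 5, 3
h = rng.normal(size=(B, T, H))

# Frame-wise: one (mu, sigma^2) pair per time step t,
# averaged over the batch axis only.
mu_frame = h.mean(axis=0, keepdims=True)      # shape (1, T, H)
var_frame = h.var(axis=0, keepdims=True)

# Sequence-wise: a single (mu, sigma^2) pair shared by all
# time steps, averaged over both batch and time axes.
mu_seq = h.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, H)
var_seq = h.var(axis=(0, 1), keepdims=True)

eps = 1e-5
h_frame = (h - mu_frame) / np.sqrt(var_frame + eps)
h_seq = (h - mu_seq) / np.sqrt(var_seq + eps)
</syntaxhighlight>With padded variable-length sequences, the frame-wise averages at late time steps would be taken over few unpadded entries, which is the estimation problem described above.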
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{Citation |last=Cooijmans |first=Tim |title=Recurrent Batch Normalization |date=2017-02-28 |url=https://arxiv.org/abs/1603.09025 |publisher=arXiv |doi=10.48550/arXiv.1603.09025 |id=arXiv:1603.09025 |last2=Ballas |first2=Nicolas |last3=Laurent |first3=César |last4=Gülçehre |first4=Çağlar |last5=Courville |first5=Aaron}}</ref>
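A sketch of one such batch-normalized LSTM step, in the spirit of the cited approach: the input-to-hidden and hidden-to-hidden contributions are normalized separately, as is the cell state before the output gate. The function names, scalar <code>gammas</code>, and random initialization below are illustrative assumptions, not the paper's exact formulation (which uses learned per-unit parameters and separate statistics per time step).<syntaxhighlight lang="python">
import numpy as np

def bn(x, gamma, eps=1e-5):
    # Normalize over the batch axis, then rescale by gamma.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn_lstm_step(x_t, h_prev, c_prev, Wx, Wh, b, gammas):
    # Normalize the two contributions separately before the gates.
    g_x, g_h, g_c = gammas
    z = bn(x_t @ Wx, g_x) + bn(h_prev @ Wh, g_h) + b   # (B, 4n)
    i, f, o, g = np.split(z, 4, axis=1)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(bn(c_t, g_c))           # normalize cell state too
    return h_t, c_t

# Illustrative usage with random data:
rng = np.random.default_rng(1)
B, D, n = 4, 3, 2                       # batch, input dim, hidden dim
x_t = rng.normal(size=(B, D))
h_prev = rng.normal(size=(B, n))
c_prev = rng.normal(size=(B, n))
Wx = rng.normal(size=(D, 4 * n))
Wh = rng.normal(size=(n, 4 * n))
b = np.zeros(4 * n)
h_t, c_t = bn_lstm_step(x_t, h_prev, c_prev, Wx, Wh, b, gammas=(0.1, 0.1, 0.1))
</syntaxhighlight>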
=== Improvements ===