Normalization (machine learning): Difference between revisions

Revision as of 09:33, 11 November 2024 by OAbot (Open access bot: arxiv updated in citation with #oabot), compared with the latest revision as of 00:53, 27 August 2025 by Citation bot (Added bibcode. Removed URL that duplicated identifier. Removed parameters.)
(16 intermediate revisions by 7 users not shown)
{{Main\|Batch normalization}}
'''Batch normalization''' ('''BatchNorm''')<ref name=":0">{{Cite journal \|last1=Ioffe \|first1=Sergey \|last2=Szegedy \|first2=Christian \|date=2015-06-01 \|title=Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift \|url=https://proceedings.mlr.press/v37/ioffe15.html \|journal=Proceedings of the 32nd International Conference on Machine Learning \|language=en \|publisher=PMLR \|pages=448–456\|arxiv=1502.03167 }}</ref> operates on the activations of a layer for each mini-batch.

Consider a simple feedforward network, defined by chaining together modules:

<math display="block">x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots</math>

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. <math>x^{(0)}</math> is the input vector, <math>x^{(1)}</math> is the output vector from the first module, etc.

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after <math>x^{(l)}</math>; then the network would operate accordingly:

<math display="block">\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots</math>

The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.

Concretely, suppose we have a batch of inputs <math>x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}</math>, fed all at once into the network. We would obtain in the middle of the network some vectors:

<math display="block">x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}</math>

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

<math display="block">
\begin{aligned}
\mu^{(l)}_i &= \frac 1B \sum_{b=1}^B x^{(l)}_{(b), i} \\
(\sigma^{(l)}_i)^2 &= \frac{1}{B} \sum_{b=1}^B (x_{(b),i}^{(l)} - \mu_i^{(l)})^2
\end{aligned}
</math>

where <math>i</math> indexes the coordinates of the vectors, and <math>b</math> indexes the elements of the batch. In other words, we are considering the <math>i</math>-th coordinate of each vector in the batch, and computing the mean and variance of these numbers.
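As a worked illustration of the coordinate-wise statistics above, the following minimal NumPy sketch uses a made-up batch of <math>B = 2</math> vectors with two coordinates each. The array values and variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical batch: B = 2 vectors with 2 coordinates each
# (rows index the batch element b, columns index the coordinate i).
x = np.array([[1.0, 2.0],
              [3.0, 6.0]])

# Coordinate-wise mean and biased (1/B) variance, taken over the batch axis.
mu = x.mean(axis=0)
var = x.var(axis=0)

print(mu)   # [2. 4.]
print(var)  # [1. 4.]
```

Note that <code>np.var</code> divides by <math>B</math> by default, which matches the <math>\frac{1}{B}</math> factor in the variance formula.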
It then normalizes each coordinate to have zero mean and unit variance:

<math display="block">\hat{x}^{(l)}_{(b), i} = \frac{x^{(l)}_{(b), i} - \mu^{(l)}_i}{\sqrt{(\sigma^{(l)}_i)^2 + \epsilon}}</math>

The <math>\epsilon</math> is a small positive constant such as <math>10^{-9}</math> added to the variance for numerical stability, to avoid [[division by zero]].

Finally, it applies a linear transformation:

<math display="block">y^{(l)}_{(b), i} = \gamma_i \hat{x}^{(l)}_{(b), i} + \beta_i</math>

Here, <math>\gamma</math> and <math>\beta</math> are parameters inside the BatchNorm module. They are learnable parameters, typically trained by [[gradient descent]].

The following is a [[Python (programming language)\|Python]] implementation of BatchNorm:

<syntaxhighlight lang="python3">
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)  # shape (N,)
    # The remaining steps are not shown in this diff; they are
    # reconstructed here from the formulas above.
    var = np.var(x, axis=0)  # shape (N,)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta
</syntaxhighlight>

=== Interpretation ===

<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization, if this is beneficial.<ref name=":1">{{Cite book \|last1=Goodfellow \|first1=Ian \|title=Deep learning \|last2=Bengio \|first2=Yoshua \|last3=Courville \|first3=Aaron \|date=2016 \|publisher=The MIT Press \|isbn=978-0-262-03561-3 \|series=Adaptive computation and machine learning \|location=Cambridge, Massachusetts \|chapter=8.7.1. 
Batch Normalization}}</ref>

BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.<ref>{{Cite journal \|last1=Desjardins \|first1=Guillaume \|last2=Simonyan \|first2=Karen \|last3=Pascanu \|first3=Razvan \|last4=Kavukcuoglu \|first4=Koray \|date=2015 \|title=Natural Neural Networks \|url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. 
\|volume=28}}</ref><ref name=":1" />

It is claimed in the original publication that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal \|last1=Xu \|first1=Jingjing \|last2=Sun \|first2=Xu \|last3=Zhang \|first3=Zhiyuan \|last4=Zhao \|first4=Guangxiang \|last5=Lin \|first5=Junyang \|date=2019 \|title=Understanding and Improving Layer Normalization \|url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=32 \|arxiv=1911.07013}}</ref><ref>{{Cite journal \|last1=Awais \|first1=Muhammad \|last2=Bin Iqbal \|first2=Md. Tauhid \|last3=Bae \|first3=Sung-Ho \|date=November 2021 \|title=Revisiting Internal Covariate Shift for Batch Normalization \|journal=IEEE Transactions on Neural Networks and Learning Systems \|volume=32 \|issue=11 \|pages=5082–5092 \|doi=10.1109/TNNLS.2020.3026784 \|issn=2162-237X \|pmid=33095717\|bibcode=2021ITNNL..32.5082A }}</ref> and detractors.<ref>{{Cite journal \|last1=Bjorck \|first1=Nils \|last2=Gomes \|first2=Carla P \|last3=Selman \|first3=Bart \|last4=Weinberger \|first4=Kilian Q \|date=2018 \|title=Understanding Batch Normalization \|url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31 \|arxiv=1806.02375}}</ref><ref>{{Cite journal \|last1=Santurkar \|first1=Shibani \|last2=Tsipras \|first2=Dimitris \|last3=Ilyas \|first3=Andrew \|last4=Madry \|first4=Aleksander \|date=2018 \|title=How Does Batch Normalization Help Optimization? \|url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. 
\|volume=31}}</ref>

=== Special cases ===

The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it would be canceled by the subsequent mean subtraction, so it is of the form <math>\mathrm{BN}(Wx)</math>. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.<ref name=":0" />

For [[convolutional neural network]]s (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same [[Kernel (image processing)\|kernel]] as if they are different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web \|title=BatchNorm2d — PyTorch 2.4 documentation \|url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html \|access-date=2024-09-26 \|website=pytorch.org}}</ref><ref>{{Cite book \|last1=Zhang \|first1=Aston \|title=Dive into deep learning \|last2=Lipton \|first2=Zachary \|last3=Li \|first3=Mu \|last4=Smola \|first4=Alexander J. \|date=2024 \|publisher=Cambridge University Press \|isbn=978-1-009-38943-3 \|location=Cambridge New York Port Melbourne New Delhi Singapore \|chapter=8.5. 
Batch Normalization \|chapter-url=https://d2l.ai/chapter_convolutional-modern/batch-norm.html}}</ref>

Concretely, suppose we have a 2-dimensional convolutional layer defined by:

<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>

where:

* <math>x^{(l)}_{h, w, c}</math> is the activation of the neuron at position <math>(h, w)</math> in the <math>c</math>-th channel of the <math>l</math>-th layer.
* <math>K^{(l)}_{\Delta h, \Delta w, c, c'}</math> is a kernel tensor. Each channel <math>c</math> corresponds to a kernel <math>K^{(l)}_{h'-h, w'-w, c, c'}</math>, with indices <math>\Delta h, \Delta w, c'</math>.
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.

In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch. That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not per ''activation'' <math>x^{(l)}_{h, w, c}</math>:

<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b), h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W (x_{(b), h, w, c}^{(l)} - \mu_c^{(l)})^2
\end{aligned}
</math>

where <math>B</math> is the batch size, <math>H</math> is the height of the feature map, and <math>W</math> is the width of the feature map.
That is, even though there are only <math>B</math> data points in a batch, all <math>BHW</math> outputs from the kernel in this batch are treated equally.<ref name=":0" />

Subsequently, normalization and the linear transform are also done per kernel:

<math display="block">
\begin{aligned}
\hat{x}^{(l)}_{(b), h, w, c} &= \frac{x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}} \\
y^{(l)}_{(b), h, w, c} &= \gamma_c \hat{x}^{(l)}_{(b), h, w, c} + \beta_c
\end{aligned}
</math>

Similar considerations apply for BatchNorm for ''n''-dimensional convolutions.

The following is a Python implementation of BatchNorm for 2D convolutions:

<syntaxhighlight lang="python3">
import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    # The intermediate steps are not shown in this diff; they are
    # reconstructed here from the per-kernel formulas above.
    var = np.var(x, axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    y = gamma * x_hat + beta
    return y
</syntaxhighlight>

For multilayered [[Recurrent neural network\|recurrent neural networks]] (RNNs), BatchNorm is usually applied only for the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book \|last1=Laurent \|first1=Cesar \|last2=Pereyra \|first2=Gabriel \|last3=Brakel \|first3=Philemon \|last4=Zhang \|first4=Ying \|last5=Bengio \|first5=Yoshua \|chapter=Batch normalized recurrent neural networks \|date=March 2016 \|title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) \|publisher=IEEE \|pages=2657–2661 \|doi=10.1109/ICASSP.2016.7472159 \|arxiv=1510.01378 \|isbn=978-1-4799-9988-0}}</ref> Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies

<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)})</math>

where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes

<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)})</math>

There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence.
Then frame-wise BatchNorm means normalizing over <math>b</math>:

<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2
\end{aligned}
</math>

and sequence-wise means normalizing over <math>(b, t)</math>:

<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T (h_{b,t}^{(l)} - \mu^{(l)})^2
\end{aligned}
</math>

Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates.<ref name=":4" />

It is also possible to apply BatchNorm to [[Long short-term memory\|LSTMs]].<ref>{{cite arXiv \| eprint=1603.09025 \| last1=Cooijmans \| first1=Tim \| last2=Ballas \| first2=Nicolas \| last3=Laurent \| first3=César \| last4=Gülçehre \| first4=Çağlar \| last5=Courville \| first5=Aaron \| title=Recurrent Batch Normalization \| date=2016 \| class=cs.LG }}</ref>

=== Improvements ===

BatchNorm has been very popular and there were many attempted improvements. Some examples include:<ref name=":3">{{cite arXiv \| eprint=1906.03548 \| last1=Summers \| first1=Cecilia \| last2=Dinneen \| first2=Michael J. 
\| title=Four Things Everyone Should Know to Improve Batch Normalization \| date=2019 \| class=cs.LG }}</ref>

* ghost batching: randomly partition a batch into sub-batches and perform BatchNorm separately on each;
* weight decay on <math>\gamma</math> and <math>\beta</math>;
* and combining BatchNorm with GroupNorm.

A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, but during inference, the mean and variance are frozen at values accumulated during training (usually as an [[exponential moving average]]). This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference:<ref name=":3" />{{Pg\|location=Eq. 3}}

<math display="block">
\begin{aligned}
\mu &= \alpha E[x] + (1 - \alpha) \mu_{x, \text{ train}} \\
\sigma^2 &= (\alpha E[x^2] + (1 - \alpha) \mu_{x^2, \text{ train}}) - \mu^2
\end{aligned}
</math>

where <math>\alpha</math> is a hyperparameter to be optimized on a validation set.

Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.<ref>{{cite arXiv \| eprint=2102.06171 \| last1=Brock \| first1=Andrew \| last2=De \| first2=Soham \| last3=Smith \| first3=Samuel L. \| last4=Simonyan \| first4=Karen \| title=High-Performance Large-Scale Image Recognition Without Normalization \| date=2021 \| class=cs.CV }}</ref>

== Layer normalization ==

'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite arXiv \|last1=Ba \|first1=Jimmy Lei \|last2=Kiros \|first2=Jamie Ryan \|last3=Hinton \|first3=Geoffrey E. \|date=2016 \|title=Layer Normalization \|class=stat.ML \|eprint=1607.06450}}</ref> is a popular alternative to BatchNorm.
Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of [[Transformer (deep learning architecture)\|transformer]] models.

For a given data input and layer, LayerNorm computes the mean (<math>\mu</math>) and variance (<math>\sigma^2</math>) over all the neurons in the layer. Similar to BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:

<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>

where:

<math display="block">\mu = \frac 1D \sum_{i=1}^D x_i, \quad \sigma^2 = \frac 1D \sum_{i=1}^D (x_i - \mu)^2</math>

and the index <math>i</math> ranges over the neurons in that layer.

=== Examples ===

For example, in a CNN, a LayerNorm applies to all activations in a layer. In the previous notation, we have:

<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C x^{(l)}_{h, w, c} \\
(\sigma^{(l)})^2 &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C (x_{h, w, c}^{(l)} - \mu^{(l)})^2 \\
\hat{x}^{(l)}_{h,w,c} &= \frac{x^{(l)}_{h,w,c} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}} \\
y^{(l)}_{h,w,c} &= \gamma^{(l)} \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}
\end{aligned}
</math>

Notice that the batch index <math>b</math> is removed, while the channel index <math>c</math> is added.
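The CNN example above can be sketched in NumPy as follows. This is a minimal illustration rather than a library API: the function name <code>layernorm_cnn</code>, the scalar <code>gamma</code> and <code>beta</code> defaults, and the random test activations are all assumptions made for the example.

```python
import numpy as np

def layernorm_cnn(x, gamma=1.0, beta=0.0, epsilon=1e-9):
    """LayerNorm for a single sample's activations x of shape (H, W, C).

    gamma and beta are scalars here, matching the per-layer parameters in
    the formula above (hypothetical defaults, for illustration only).
    """
    mu = x.mean()    # mean over all H*W*C activations of this sample
    var = x.var()    # biased variance over the same set
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

# Made-up activations for one sample: H = 4, W = 4, C = 3.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 4, 3))
y = layernorm_cnn(x)
```

Because no batch axis is involved, the result for one sample does not depend on what else is in the batch, which is why LayerNorm is insensitive to batch size.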
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)\|transformers]],<ref>{{cite arXiv \|last1=Phuong \|first1=Mary \|title=Formal Algorithms for Transformers \|date=2022-07-19 \|eprint=2207.09238 \|last2=Hutter \|first2=Marcus\|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)} \in \mathbb{R}^{D}</math>, where <math>D</math> is the dimension of the hidden vector, then LayerNorm will be applied with:

<math display="block">\hat{x_{i}}^{(t)} = \frac{x_i^{(t)} - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}, \quad y_i^{(t)} = \gamma_i \hat{x_i}^{(t)} + \beta_i</math>

where:

<math display="block">\mu^{(t)} = \frac 1D \sum_{i=1}^D x_i^{(t)}, \quad (\sigma^{(t)})^2 = \frac 1D \sum_{i=1}^D (x_i^{(t)} - \mu^{(t)})^2</math>

=== Root mean square layer normalization ===

'''Root mean square layer normalization''' ('''RMSNorm'''):<ref>{{cite arXiv \|last1=Zhang \|first1=Biao \|title=Root Mean Square Layer Normalization \|date=2019-10-16 \|eprint=1910.07467 \|last2=Sennrich 
\|first2=Rico\|class=cs.LG }}</ref>

<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{j=1}^D x_j^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>

Essentially, it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>. It is also called '''L2 normalization'''. It is a special case of '''Lp normalization''', or '''power normalization''':

<math display="block">
\hat{x_i} = \frac{x_i}{\left(\frac 1D \sum_{j=1}^D \|x_j\|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>

where <math>p > 0</math> is a constant.

=== Adaptive ===

'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal \|last1=Perez \|first1=Ethan \|last2=Strub \|first2=Florian \|last3=De Vries \|first3=Harm \|last4=Dumoulin \|first4=Vincent \|last5=Courville \|first5=Aaron \|date=2018-04-29 \|title=FiLM: Visual Reasoning with a General Conditioning Layer \|url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 \|journal=Proceedings of the AAAI Conference on Artificial Intelligence \|volume=32 \|issue=1 \|doi=10.1609/aaai.v32i1.11671 \|issn=2374-3468\|arxiv=1709.07871 }}</ref> and has been used effectively in [[Diffusion model\|diffusion]] transformers (DiTs).<ref>{{Cite journal \|last1=Peebles \|first1=William \|last2=Xie \|first2=Saining \|date=2023 \|title=Scalable Diffusion Models with Transformers \|url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html \|language=en \|pages=4195–4205 \|arxiv=2212.09748}}</ref> For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a [[multilayer perceptron]] into <math>\gamma, \beta</math>, which are then 
applied in the LayerNorm module of a transformer.

== Weight normalization ==

'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv \|last1=Salimans \|first1=Tim \|title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks \|date=2016-06-03 \|eprint=1602.07868 \|last2=Kingma \|first2=Diederik P.\|class=cs.LG }}</ref> is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.

One example is '''spectral normalization''', which divides weight matrices by their [[spectral norm]]. The spectral normalization is used in [[Generative adversarial network\|generative adversarial networks]] (GANs) such as the [[Wasserstein GAN]].<ref>{{cite arXiv \|eprint=1802.05957 \|class=cs.LG \|first1=Takeru \|last1=Miyato \|first2=Toshiki \|last2=Kataoka \|title=Spectral Normalization for Generative Adversarial Networks \|date=2018-02-16 \|last3=Koyama \|first3=Masanori \|last4=Yoshida \|first4=Yuichi}}</ref> The spectral radius can be efficiently computed by the following algorithm:

{{blockquote\|'''INPUT''' matrix <math>W</math> and initial guess <math>x</math>

Iterate <math>x \mapsto \frac{1}{\\|Wx\\|_2}Wx</math> to convergence <math>x^*</math>. 
This is the eigenvector of <math>W</math> with eigenvalue <math>\\|W\\|_s</math>.

'''RETURN''' <math>x^*, \\|Wx^*\\|_2</math>}}

By reassigning <math>W_i \leftarrow \frac{W_i}{\\|W_i\\|_s}</math> after each update of the discriminator, we can upper-bound <math>\\|W_i\\|_s \leq 1</math>, and thus upper-bound <math>\\|D \\|_L</math>.

The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>. Then, at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess for the algorithm. Since <math>W_i(t+1)</math> is very close to <math>W_i(t)</math>, so is <math>x^*_i(t)</math> to <math>x^*_i(t+1)</math>, thus allowing rapid convergence.

== CNN-specific normalization ==

=== Response normalization ===

{{Anchor\|Local response normalization}}'''Local response normalization'''<ref>{{Cite journal \|last1=Krizhevsky \|first1=Alex \|last2=Sutskever \|first2=Ilya \|last3=Hinton \|first3=Geoffrey E \|date=2012 \|title=ImageNet Classification with Deep Convolutional Neural Networks \|url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by:

<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math>

where <math>a_{x,y}^i</math> is the activation of the neuron at location <math>(x,y)</math> and channel <math>i</math>.
That is, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels. <math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set.

It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book \|last1=Jarrett \|first1=Kevin \|last2=Kavukcuoglu \|first2=Koray \|last3=Ranzato \|first3=Marc' Aurelio \|last4=LeCun \|first4=Yann \|chapter=What is the best multi-stage architecture for object recognition? \|date=September 2009 \|pages=2146–2153 \|title=2009 IEEE 12th International Conference on Computer Vision \|chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 \|publisher=IEEE \|doi=10.1109/iccv.2009.5459469\|isbn=978-1-4244-4420-5 }}</ref>

<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>

where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on location <math>(x,y)</math> and channel <math>j</math>.
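The local response normalization formula given earlier in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name <code>local_response_norm</code> and the channel-first array layout are hypothetical, the default hyperparameter values are placeholders rather than recommendations, and <math>n/2</math> is rounded down to an integer.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across adjacent channels.

    a has shape (N, H, W): N channels of an H x W feature map.
    Assumption: n / 2 in the formula is taken as integer division.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)        # first channel in the window
        hi = min(N - 1, i + n // 2)    # last channel in the window (inclusive)
        # Sum of squared activations at each pixel over the channel window.
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Tiny made-up input: 3 channels of a 1 x 1 feature map, all ones.
out = local_response_norm(np.ones((3, 1, 1)), k=1.0, n=2, alpha=1.0, beta=1.0)
```

In this toy input the middle channel has two neighbours and the edge channels have one, so the middle channel is suppressed more strongly.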
The hyperparameters <math>k, n, \alpha, \beta</math>, and the size of the small window, are picked by using a validation set.

Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where they were used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite book \|last1=Lyu \|first1=Siwei \|last2=Simoncelli \|first2=Eero P. \|chapter=Nonlinear image representation using divisive normalization \|date=2008 \|title=2008 IEEE Conference on Computer Vision and Pattern Recognition \|volume=2008 \|pages=1–8 \|doi=10.1109/CVPR.2008.4587821 \|issn=1063-6919 \|pmc=4207373 \|pmid=25346590\|isbn=978-1-4244-2242-5 }}</ref>

Both kinds of local normalization were obviated by batch normalization, which is a more global form of normalization.<ref>{{Cite journal \|last1=Ortiz \|first1=Anthony \|last2=Robinson \|first2=Caleb \|last3=Morris \|first3=Dan \|last4=Fuentes \|first4=Olac \|last5=Kiekintveld \|first5=Christopher \|last6=Hassan \|first6=Md Mahmudulla \|last7=Jojic \|first7=Nebojsa \|date=2020 \|title=Local Context Normalization: Revisiting Local Normalization \|url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html \|pages=11276–11285\|arxiv=1912.05845 }}</ref>

Response normalization reappeared in ConvNeXt V2 as '''global response normalization'''.<ref>{{Cite journal \|last1=Woo \|first1=Sanghyun \|last2=Debnath \|first2=Shoubhik \|last3=Hu \|first3=Ronghang \|last4=Chen \|first4=Xinlei \|last5=Liu \|first5=Zhuang \|last6=Kweon \|first6=In So \|last7=Xie \|first7=Saining \|date=2023 \|title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders 
\|url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html \|language=en \|pages=16133–16142\|arxiv=2301.00808 }}</ref>

=== Group normalization ===
'''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal \|last1=Wu \|first1=Yuxin \|last2=He \|first2=Kaiming \|date=2018 \|title=Group Normalization \|url=https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html \|pages=3–19}}</ref> is a technique that is also used solely for CNNs. It can be understood as the LayerNorm for CNN applied once per channel group.

Suppose at a layer <math>l</math>, there are channels <math>1, 2, \dots, C</math>, then they are partitioned into groups <math>g_1, g_2, \dots, g_G</math>. Then, LayerNorm is applied to each group.

=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is also only used for CNNs.<ref>{{cite arXiv \|last1=Ulyanov \|first1=Dmitry \|title=Instance Normalization: The Missing Ingredient for Fast Stylization \|date=2017-11-06 \|eprint=1607.08022 \|last2=Vedaldi \|first2=Andrea \|last3=Lempitsky \|first3=Victor\|class=cs.CV }}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:

<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{h, w, c} \\
&\;\;\vdots
\end{aligned}
</math>

=== Adaptive instance normalization ===
'''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with CNNs, rather than for CNNs in general.<ref>{{Cite journal \|last1=Huang \|first1=Xun \|last2=Belongie \|first2=Serge \|date=2017 
\|title=Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization \|url=https://openaccess.thecvf.com/content_iccv_2017/html/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.html \|pages=1501–1510\|arxiv=1703.06868 }}</ref>

In the AdaIN method of style transfer, we take a CNN and two input images, one for '''content''' and one for '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, AdaIN is applied.

Let <math>x^{(l), \text{ content}}</math> be the activation in the content image, and <math>x^{(l), \text{ style}}</math> be the activation in the style image. Then, AdaIN first computes the per-channel mean and variance of the activations of the style image <math>x^{(l), \text{ style}}</math>, then uses the resulting standard deviation and mean as the <math>\gamma, \beta</math> for InstanceNorm on <math>x^{(l), \text{ content}}</math>. Note that <math>x^{(l), \text{ style}}</math> itself remains unchanged. Explicitly, we have:<math display="block">
\begin{aligned}
y^{(l), \text{ content}}_{h,w,c} &= \sigma^{(l), \text{ style}}_c \left( \frac{x^{(l), \text{ content}}_{h,w,c} - \mu^{(l), \text{ content}}_c}{\sqrt{(\sigma^{(l), \text{ content}}_c)^2 + \epsilon}} \right) + \mu^{(l), \text{ style}}_c.
\end{aligned}
</math>

== Transformers ==
Some normalization methods were designed for use in [[Transformer (deep learning architecture)\|transformers]].

The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization\|hyperparameter tuning]] and a "warm-up" in [[learning rate]], in which the learning rate starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv \| eprint=1906.01787 \| last1=Wang \| first1=Qiang \| last2=Li \| first2=Bei \| last3=Xiao \| first3=Tong \| last4=Zhu \| first4=Jingbo \| last5=Li \| first5=Changliang \| last6=Wong \| first6=Derek F. \| last7=Chao \| first7=Lidia S. \| title=Learning Deep Transformer Models for Machine Translation \| date=2019 \| class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up, leading to faster convergence.<ref name="auto1">{{cite arXiv \|eprint=2002.04745 \|class=cs.LG \|first1=Ruibin \|last1=Xiong \|first2=Yunchang \|last2=Yang \|title=On Layer Normalization in the Transformer Architecture \|date=2020-06-29 \|last3=He \|first3=Di \|last4=Zheng \|first4=Kai \|last5=Zheng \|first5=Shuxin \|last6=Xing \|first6=Chen \|last7=Zhang \|first7=Huishuai \|last8=Lan \|first8=Yanyan \|last9=Wang \|first9=Liwei \|last10=Liu \|first10=Tie-Yan}}</ref>

'''FixNorm'''<ref>{{cite arXiv \| eprint=1710.01329 \| last1=Nguyen \| first1=Toan Q. 
\| last2=Chiang \| first2=David \| title=Improving Lexical Choice in Neural Machine Translation \| date=2017 \| class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal \|last1=Nguyen \|first1=Toan Q. \|last2=Salazar \|first2=Julian \|date=2019-11-02 \|title=Transformers without Tears: Improving the Normalization of Self-Attention \|doi=10.5281/zenodo.3525484\|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. The FixNorm method divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. The ScaleNorm method replaces all LayerNorms inside a transformer with division by the L2 norm, followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer).

'''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal \|last1=Henry \|first1=Alex \|last2=Dachapally \|first2=Prudhvi Raj \|last3=Pawar \|first3=Shubham Shantaram \|last4=Chen \|first4=Yuxuan \|date=November 2020 \|editor-last=Cohn \|editor-first=Trevor \|editor2-last=He \|editor2-first=Yulan \|editor3-last=Liu \|editor3-first=Yang \|title=Query-Key Normalization for Transformers \|url=https://aclanthology.org/2020.findings-emnlp.379/ \|journal=Findings of the Association for Computational Linguistics: EMNLP 2020 \|location=Online \|publisher=Association for Computational Linguistics \|pages=4246–4253 \|doi=10.18653/v1/2020.findings-emnlp.379\|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm.
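The L2-based methods described above share one core operation: divide a vector by its L2 norm, then optionally rescale. A minimal sketch in plain Python follows. The function names and the small <code>eps</code> guard against a zero vector are assumptions of this example and are not taken from the cited papers.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Return v divided by its L2 norm.

    eps is a hypothetical safeguard against division by zero.
    """
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def scale_norm(v, g):
    """FixNorm/ScaleNorm-style step: unit-normalize, then multiply by g.

    In a real model g would be a learned scalar; here it is passed in.
    """
    return [g * x for x in l2_normalize(v)]


def qk_cosine_logit(q, k):
    """QKNorm-style step: with unit-norm query and key vectors, their dot
    product is the cosine similarity, so it lies in [-1, 1]."""
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    return sum(a * b for a, b in zip(q_hat, k_hat))


y = scale_norm([3.0, 4.0], g=2.0)  # the output has L2 norm equal to g
```

After <code>scale_norm</code>, the length of the output vector depends only on <code>g</code>, not on the scale of the input, which is the property these methods rely on.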
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv \| eprint=2410.01131 \| last1=Loshchilov \| first1=Ilya \| last2=Hsieh \| first2=Cheng-Ping \| last3=Sun \| first3=Simeng \| last4=Ginsburg \| first4=Boris \| title=nGPT: Normalized Transformer with Representation Learning on the Hypersphere \| date=2024 \| class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.

== Miscellaneous ==

== Further reading ==
{{Cite web \|title=Normalization Layers \|url=https://nn.labml.ai/normalization/index.html \|access-date=2024-08-07 \|website=labml.ai Deep Learning Paper Implementations \|language=en}}

{{Artificial intelligence navbox}}
[[Category:Articles with example Python (programming language) code]]
[[Category:Deep learning]]