Normalization (machine learning): Difference between revisions

Revision as of 21:30, 26 September 2024 by Cosmia Nebula (m →Special cases), compared with the latest revision as of 00:53, 27 August 2025 by Citation bot (Added bibcode. Removed URL that duplicated identifier. Removed parameters.)
(40 intermediate revisions by 8 users not shown)
{{Short description\|Machine learning technique}}
{{Machine learning bar}}

In [[machine learning]], '''normalization''' is a statistical technique with various applications. There are two main forms of normalization, namely ''data normalization'' and ''activation normalization''. Data normalization (or [[feature scaling]]) includes methods that rescale input data so that the [[Feature (machine learning)\|features]] have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is [[Feature scaling#Rescaling (min-max normalization)\|min-max normalization]], where each feature is transformed to have the same range (typically <math>[0,1]</math> or <math>[-1,1]</math>). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.

Activation normalization, on the other hand, is specific to [[deep learning]], and includes methods that rescale the activation of [[Hidden layer\|hidden neurons]] inside [[Neural network (machine learning)\|neural networks]].

Normalization is often used to:

* increase the speed of training convergence,
* reduce sensitivity to variations and feature scales in input data,
* reduce [[overfitting]],
* and produce better model generalization to unseen data.

Normalization techniques are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing [[Regularization (mathematics)\|regularization]], though they are mainly justified by empirical success.<ref>{{Cite book \|last=Huang \|first=Lei \|url=https://link.springer.com/10.1007/978-3-031-14595-7 \|title=Normalization Techniques in Deep Learning \|date=2022 \|publisher=Springer International Publishing \|isbn=978-3-031-14594-0 \|series=Synthesis Lectures on Computer Vision \|location=Cham \|language=en \|doi=10.1007/978-3-031-14595-7}}</ref>

== Batch normalization ==
{{Main\|Batch normalization}}
'''Batch normalization''' ('''BatchNorm''')<ref name=":0">{{Cite journal \|last1=Ioffe \|first1=Sergey \|last2=Szegedy \|first2=Christian \|date=2015-06-01 \|title=Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift \|url=https://proceedings.mlr.press/v37/ioffe15.html \|journal=Proceedings of the 32nd International Conference on Machine Learning \|language=en \|publisher=PMLR \|pages=448–456 \|arxiv=1502.03167}}</ref> operates on the activations of a layer for each mini-batch.
Consider a simple feedforward network, defined by chaining together modules:

<math display="block">x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots</math>

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. <math>x^{(0)}</math> is the input vector, <math>x^{(1)}</math> is the output vector from the first module, etc.

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after <math>x^{(l)}</math>, then the network would operate accordingly:

<math display="block">\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots</math>

The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.

Concretely, suppose we have a batch of inputs <math>x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}</math>, fed all at once into the network. We would obtain in the middle of the network some vectors:

<math display="block">x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}</math>

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

<math display="block">
\begin{aligned}
\mu^{(l)}_i &= \frac 1B \sum_{b=1}^B x^{(l)}_{(b), i} \\
(\sigma^{(l)}_i)^2 &= \frac{1}{B} \sum_{b=1}^B (x_{(b),i}^{(l)} - \mu_i^{(l)})^2
\end{aligned}
</math>

where <math>i</math> indexes the coordinates of the vectors, and <math>b</math> indexes the elements of the batch. In other words, we are considering the <math>i</math>-th coordinate of each vector in the batch, and computing the mean and variance of these numbers.

It then normalizes each coordinate to have zero mean and unit variance:

<math display="block">\hat{x}^{(l)}_{(b), i} = \frac{x^{(l)}_{(b), i} - \mu^{(l)}_i}{\sqrt{(\sigma^{(l)}_i)^2 + \epsilon}}</math>

The <math>\epsilon</math> is a small positive constant such as <math>10^{-9}</math> added to the variance for numerical stability, to avoid [[division by zero]].

Finally, it applies a linear transformation:

<math display="block">y^{(l)}_{(b), i} = \gamma_i \hat{x}^{(l)}_{(b), i} + \beta_i</math>

Here, <math>\gamma</math> and <math>\beta</math> are parameters inside the BatchNorm module. They are learnable parameters, typically trained by [[gradient descent]].

The following is a [[Python (programming language)\|Python]] implementation of BatchNorm:

<syntaxhighlight lang="python3">
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)  # shape (N,)
    var = np.var(x, axis=0)  # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # shape (B, N)

    # Apply the linear transform, following the equation for y above
    y = gamma * x_hat + beta  # shape (B, N)

    return y
</syntaxhighlight>

=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization, if this is beneficial.<ref name=":1">{{Cite book \|last1=Goodfellow \|first1=Ian \|title=Deep learning \|last2=Bengio \|first2=Yoshua \|last3=Courville \|first3=Aaron \|date=2016 \|publisher=The MIT Press \|isbn=978-0-262-03561-3 \|series=Adaptive computation and machine learning \|location=Cambridge, Massachusetts \|chapter=8.7.1.
Batch Normalization}}</ref>

BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.<ref>{{Cite journal \|last1=Desjardins \|first1=Guillaume \|last2=Simonyan \|first2=Karen \|last3=Pascanu \|first3=Razvan \|last4=Kavukcuoglu \|first4=Koray \|date=2015 \|title=Natural Neural Networks \|url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=28}}</ref><ref name=":1" />

It is claimed in the original publication that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal \|last1=Xu \|first1=Jingjing \|last2=Sun \|first2=Xu \|last3=Zhang \|first3=Zhiyuan \|last4=Zhao \|first4=Guangxiang \|last5=Lin \|first5=Junyang \|date=2019 \|title=Understanding and Improving Layer Normalization \|url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc.
\|volume=32 \|arxiv=1911.07013}}</ref><ref>{{Cite journal \|last1=Awais \|first1=Muhammad \|last2=Bin Iqbal \|first2=Md. Tauhid \|last3=Bae \|first3=Sung-Ho \|date=November 2021 \|title=Revisiting Internal Covariate Shift for Batch Normalization \|journal=IEEE Transactions on Neural Networks and Learning Systems \|volume=32 \|issue=11 \|pages=5082–5092 \|doi=10.1109/TNNLS.2020.3026784 \|issn=2162-237X \|pmid=33095717 \|bibcode=2021ITNNL..32.5082A}}</ref> and detractors.<ref>{{Cite journal \|last1=Bjorck \|first1=Nils \|last2=Gomes \|first2=Carla P \|last3=Selman \|first3=Bart \|last4=Weinberger \|first4=Kilian Q \|date=2018 \|title=Understanding Batch Normalization \|url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31 \|arxiv=1806.02375}}</ref><ref>{{Cite journal \|last1=Santurkar \|first1=Shibani \|last2=Tsipras \|first2=Dimitris \|last3=Ilyas \|first3=Andrew \|last4=Madry \|first4=Aleksander \|date=2018 \|title=How Does Batch Normalization Help Optimization? \|url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31}}</ref>

=== Special cases ===
The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it would be canceled by the subsequent mean subtraction, so it is of the form <math>\mathrm{BN}(Wx)</math>.
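The cancellation of the bias can be checked numerically. The following is a minimal sketch, assuming a plain NumPy BatchNorm without the learnable scale and shift; the helper name <code>bn</code>, the array shapes, and the random test data are illustrative assumptions rather than anything taken from the original paper:

```python
import numpy as np

def bn(z, epsilon=1e-9):
    # Normalize each coordinate over the batch axis (axis 0).
    mu = np.mean(z, axis=0)
    var = np.var(z, axis=0)
    return (z - mu) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))   # batch of B=32 inputs with 5 features
W = rng.normal(size=(5, 3))    # linear transform to 3 outputs
b = rng.normal(size=(3,))      # bias, shared by every element of the batch

with_bias = bn(x @ W + b)
without_bias = bn(x @ W)

# The bias shifts the batch mean by exactly b and leaves the variance
# unchanged, so the mean subtraction removes it (up to rounding error).
bias_cancels = np.allclose(with_bias, without_bias)
```

Under these assumptions <code>bias_cancels</code> should evaluate to true, since adding a constant vector to every batch element changes only the batch mean.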
That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.<ref name=":0" />

For [[convolutional neural network]]s (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same [[Kernel (image processing)\|kernel]] as if they are different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web \|title=BatchNorm2d — PyTorch 2.4 documentation \|url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html \|access-date=2024-09-26 \|website=pytorch.org}}</ref><ref>{{Cite book \|last1=Zhang \|first1=Aston \|title=Dive into deep learning \|last2=Lipton \|first2=Zachary \|last3=Li \|first3=Mu \|last4=Smola \|first4=Alexander J. \|date=2024 \|publisher=Cambridge University Press \|isbn=978-1-009-38943-3 \|location=Cambridge New York Port Melbourne New Delhi Singapore \|chapter=8.5. Batch Normalization \|chapter-url=https://d2l.ai/chapter_convolutional-modern/batch-norm.html}}</ref>

Concretely, suppose we have a 2-dimensional convolutional layer defined by:

<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>

where:

* <math>x^{(l)}_{h, w, c}</math> is the activation of the neuron at position <math>(h, w)</math> in the <math>c</math>-th channel of the <math>l</math>-th layer.
* <math>K^{(l)}_{\Delta h, \Delta w, c, c'}</math> is a kernel tensor.
Each channel <math>c</math> corresponds to a kernel <math>K^{(l)}_{h'-h, w'-w, c, c'}</math>, with indices <math>\Delta h, \Delta w, c'</math>.
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.

In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel in the same batch as additional data points in that batch. That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not per ''activation'' <math>x^{(l)}_{h, w, c}</math>:

<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b), h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W (x_{(b), h, w, c}^{(l)} - \mu_c^{(l)})^2
\end{aligned}
</math>

where <math>B</math> is the batch size, <math>H</math> is the height of the feature map, and <math>W</math> is the width of the feature map.
That is, even though there are only <math>B</math> data points in a batch, all <math>BHW</math> outputs from the kernel in this batch are treated equally.<ref name=":0" />

Subsequently, normalization and the linear transform are also done per kernel:

<math display="block">
\begin{aligned}
\hat{x}^{(l)}_{(b), h, w, c} &= \frac{x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}} \\
y^{(l)}_{(b), h, w, c} &= \gamma_c \hat{x}^{(l)}_{(b), h, w, c} + \beta_c
\end{aligned}
</math>

Similar considerations apply for BatchNorm for ''n''-dimensional convolutions.

The following is a Python implementation of BatchNorm for 2D convolutions:

<syntaxhighlight lang="python3">
import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)

    # Normalize, then scale and shift, following the per-channel equations above.
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    y = gamma * x_hat + beta

    return y
</syntaxhighlight>

For multilayered [[Recurrent neural network\|recurrent neural networks]] (RNNs), BatchNorm is usually applied only for the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book \|last1=Laurent \|first1=Cesar \|last2=Pereyra \|first2=Gabriel \|last3=Brakel \|first3=Philemon \|last4=Zhang \|first4=Ying \|last5=Bengio \|first5=Yoshua \|chapter=Batch normalized recurrent neural networks \|date=March 2016 \|title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) \|publisher=IEEE \|pages=2657–2661 \|doi=10.1109/ICASSP.2016.7472159 \|arxiv=1510.01378 \|isbn=978-1-4799-9988-0}}</ref> Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies

<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)})</math>

where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes

<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)})</math>

There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence.
Then frame-wise BatchNorm means normalizing over <math>b</math>:

<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2
\end{aligned}
</math>

and sequence-wise means normalizing over <math>(b, t)</math>:

<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T (h_{b,t}^{(l)} - \mu^{(l)})^2
\end{aligned}
</math>

Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates.<ref name=":4" />

It is also possible to apply BatchNorm to [[Long short-term memory\|LSTMs]].<ref>{{cite arXiv \| eprint=1603.09025 \| last1=Cooijmans \| first1=Tim \| last2=Ballas \| first2=Nicolas \| last3=Laurent \| first3=César \| last4=Gülçehre \| first4=Çağlar \| last5=Courville \| first5=Aaron \| title=Recurrent Batch Normalization \| date=2016 \| class=cs.LG }}</ref>

=== Improvements ===
BatchNorm has been very popular and there were many attempted improvements. Some examples include:<ref name=":3">{{cite arXiv \| eprint=1906.03548 \| last1=Summers \| first1=Cecilia \| last2=Dinneen \| first2=Michael J.
\| title=Four Things Everyone Should Know to Improve Batch Normalization \| date=2019 \| class=cs.LG }}</ref>

* ghost batching: randomly partition a batch into sub-batches and perform BatchNorm separately on each;
* weight decay on <math>\gamma</math> and <math>\beta</math>;
* and combining BatchNorm with GroupNorm.

A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, but during inference, the mean and variance are frozen at values accumulated during training (usually as an [[exponential moving average]] of the batch statistics). This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference:<ref name=":3" />{{Pg\|location=Eq. 3}}

<math display="block">
\begin{aligned}
\mu &= \alpha E[x] + (1 - \alpha) \mu_{x, \text{ train}} \\
\sigma^2 &= (\alpha E[x^2] + (1 - \alpha) \mu_{x^2, \text{ train}}) - \mu^2
\end{aligned}
</math>

where <math>\alpha</math> is a hyperparameter to be optimized on a validation set.

Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.<ref>{{cite arXiv \| eprint=2102.06171 \| last1=Brock \| first1=Andrew \| last2=De \| first2=Soham \| last3=Smith \| first3=Samuel L. \| last4=Simonyan \| first4=Karen \| title=High-Performance Large-Scale Image Recognition Without Normalization \| date=2021 \| class=cs.CV }}</ref>

== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite arXiv \|last1=Ba \|first1=Jimmy Lei \|last2=Kiros \|first2=Jamie Ryan \|last3=Hinton \|first3=Geoffrey E. \|date=2016 \|title=Layer Normalization \|class=stat.ML \|eprint=1607.06450}}</ref> is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample.
Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of [[Transformer (deep learning architecture)\|transformer]] models.

For a given data input and layer, LayerNorm computes the mean (<math>\mu</math>) and variance (<math>\sigma^2</math>) over all the neurons in the layer. Similar to BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:

<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>

where:

<math display="block">\mu = \frac 1D \sum_{i=1}^D x_i, \quad \sigma^2 = \frac 1D \sum_{i=1}^D (x_i - \mu)^2</math>

and the index <math>i</math> ranges over the neurons in that layer.

=== Examples ===
For example, in a CNN, a LayerNorm applies to all activations in a layer. In the previous notation, we have:

<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C x^{(l)}_{h, w, c} \\
(\sigma^{(l)})^2 &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C (x_{h, w, c}^{(l)} - \mu^{(l)})^2 \\
\hat{x}^{(l)}_{h,w,c} &= \frac{x^{(l)}_{h,w,c} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}} \\
y^{(l)}_{h,w,c} &= \gamma^{(l)} \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}
\end{aligned}
</math>

Notice that, compared with BatchNorm for CNNs, the sum over the batch index <math>b</math> is removed, while a sum over the channel index <math>c</math> is added.
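The per-sample statistics above can be sketched in NumPy as follows. This is a minimal illustration, assuming activations stored with shape <code>(B, H, W, C)</code> and scalar <math>\gamma</math> and <math>\beta</math> as in the equations; the function name <code>layernorm_cnn</code> and the test data are hypothetical:

```python
import numpy as np

def layernorm_cnn(x, gamma=1.0, beta=0.0, epsilon=1e-9):
    # x has shape (B, H, W, C). Statistics are taken over (H, W, C)
    # separately for each sample, never across the batch axis.
    mu = np.mean(x, axis=(1, 2, 3), keepdims=True)   # shape (B, 1, 1, 1)
    var = np.var(x, axis=(1, 2, 3), keepdims=True)   # shape (B, 1, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8, 3))
y = layernorm_cnn(x)

# Normalizing the first sample on its own gives the same result as
# normalizing it inside the batch, because no statistic crosses samples.
y_single = layernorm_cnn(x[:1])
```

Because each sample is normalized independently, the output for a sample does not depend on the rest of the batch, which is consistent with LayerNorm being unaffected by batch size.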
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)\|transformers]],<ref>{{cite arXiv \|last1=Phuong \|first1=Mary \|title=Formal Algorithms for Transformers \|date=2022-07-19 \|eprint=2207.09238 \|last2=Hutter \|first2=Marcus \|class=cs.LG}}</ref> LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)} \in \mathbb{R}^{D}</math>, where <math>D</math> is the dimension of the hidden vector, then LayerNorm will be applied with:

<math display="block">\hat{x_{i}}^{(t)} = \frac{x_i^{(t)} - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}, \quad y_i^{(t)} = \gamma_i \hat{x_i}^{(t)} + \beta_i</math>

where:

<math display="block">\mu^{(t)} = \frac 1D \sum_{i=1}^D x_i^{(t)}, \quad (\sigma^{(t)})^2 = \frac 1D \sum_{i=1}^D (x_i^{(t)} - \mu^{(t)})^2</math>

=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm'''):<ref>{{cite arXiv \|last1=Zhang
\|first1=Biao \|title=Root Mean Square Layer Normalization \|date=2019-10-16 \|eprint=1910.07467 \|last2=Sennrich \|first2=Rico \|class=cs.LG}}</ref>

<math display="block">\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{j=1}^D x_j^2}}, \quad y_i = \gamma \hat{x_i} + \beta</math>

Essentially, it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>. It is also called '''L2 normalization'''. It is a special case of '''Lp normalization''', or '''power normalization''':

<math display="block">\hat{x_i} = \frac{x_i}{\left(\frac 1D \sum_{j=1}^D \|x_j\|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x_i} + \beta</math>

where <math>p > 0</math> is a constant.

=== Adaptive ===
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal \|last1=Chen \|first1=Zhao \|last2=Badrinarayanan \|first2=Vijay \|last3=Lee \|first3=Chen-Yu \|last4=Rabinovich \|first4=Andrew \|date=2018-07-03 \|title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks \|url=https://proceedings.mlr.press/v80/chen18a.html \|journal=Proceedings of the 35th International Conference on Machine Learning \|language=en \|publisher=PMLR \|pages=794–803}}</ref> normalizes gradient vectors during backpropagation.
'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal \|last1=Perez \|first1=Ethan \|last2=Strub \|first2=Florian \|last3=De Vries \|first3=Harm \|last4=Dumoulin \|first4=Vincent \|last5=Courville \|first5=Aaron \|date=2018-04-29 \|title=FiLM: Visual Reasoning with a General Conditioning Layer \|url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 \|journal=Proceedings of the AAAI Conference on Artificial Intelligence \|volume=32 \|issue=1 \|doi=10.1609/aaai.v32i1.11671 \|issn=2374-3468 \|arxiv=1709.07871}}</ref> and has been used effectively in [[Diffusion model\|diffusion]] transformers (DiTs).<ref>{{Cite journal \|last1=Peebles \|first1=William \|last2=Xie \|first2=Saining \|date=2023 \|title=Scalable Diffusion Models with Transformers \|url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html \|language=en \|pages=4195–4205 \|arxiv=2212.09748}}</ref> For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a [[multilayer perceptron]] into <math>\gamma, \beta</math>, which are then applied in the LayerNorm module of a transformer.

== Weight normalization ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv \|last1=Salimans \|first1=Tim \|title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks \|date=2016-06-03 \|eprint=1602.07868 \|last2=Kingma \|first2=Diederik P. \|class=cs.LG}}</ref> is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.

One example is '''spectral normalization''', which divides weight matrices by their [[spectral norm]]. The spectral normalization is used in [[generative adversarial network]]s (GANs) such as the [[Wasserstein GAN]].<ref>{{cite arXiv \|eprint=1802.05957 \|class=cs.LG \|first1=Takeru \|last1=Miyato \|first2=Toshiki \|last2=Kataoka \|title=Spectral Normalization for Generative Adversarial Networks \|date=2018-02-16 \|last3=Koyama \|first3=Masanori \|last4=Yoshida \|first4=Yuichi}}</ref> The spectral radius can be efficiently computed by the following algorithm:

{{blockquote\|'''INPUT''' matrix <math>W</math> and initial guess <math>x</math>

Iterate <math>x \mapsto \frac{1}{\\|Wx\\|_2}Wx</math> to convergence <math>x^*</math>. This is the eigenvector of <math>W</math> with eigenvalue <math>\\|W\\|_s</math>.

'''RETURN''' <math>x^*, \\|Wx^*\\|_2</math>}}

By reassigning <math>W_i \leftarrow \frac{W_i}{\\|W_i\\|_s}</math> for each weight matrix <math>W_i</math> of the discriminator <math>D</math> after each update, we can upper-bound <math>\\|W_i\\|_s \leq 1</math>, and thus upper-bound the Lipschitz norm <math>\\|D \\|_L</math>.

The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>. Then, at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess for the algorithm. Since <math>W_i(t+1)</math> is very close to <math>W_i(t)</math>, so is <math>x^*_i(t)</math> to <math>x^*_i(t+1)</math>, thus allowing rapid convergence.

== CNN-specific normalization ==
There are some activation normalization techniques that are only used for CNNs.
=== ~~Local response~~Response normalization === {{Anchor\|Local response normalization}}'''Local response normalization'''<ref>{{Cite journal \|~~last~~last1=Krizhevsky \|~~first~~first1=Alex \|last2=Sutskever \|first2=Ilya \|last3=Hinton \|first3=Geoffrey E \|date=2012 \|title=ImageNet Classification with Deep Convolutional Neural Networks \|url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by: <math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math> where <math>a_{x,y}^i</math> is the activation of the neuron at location <math>(x,y)</math> and channel <math>i</math>, and <math>N</math> is the number of channels in the layer. ~~In words~~I.e., each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels. <math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set. It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book \|last1=Jarrett \|first1=Kevin \|last2=Kavukcuoglu \|first2=Koray \|last3=Ranzato \|first3=Marc' Aurelio \|last4=LeCun \|first4=Yann \|chapter=What is the best multi-stage architecture for object recognition? 
\|date=September 2009 \|pages=2146–2153 \|title=2009 IEEE 12th International Conference on Computer Vision \|chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 \|publisher=IEEE \|doi=10.1109/iccv.2009.5459469\|isbn=978-1-4244-4420-5 }}</ref> <math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math> where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on location <math>(x,y)</math> and channel <math>j</math>. The ~~numbers~~hyperparameters <math>k, n, \alpha, \beta</math>, and the size of the small window, are ~~hyperparameters~~ picked by using a validation set. Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where they were used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite book \|last1=Lyu \|first1=Siwei \|last2=Simoncelli \|first2=Eero P. \|chapter=Nonlinear image representation using divisive normalization \|date=2008 \|title=2008 IEEE Conference on Computer Vision and Pattern Recognition \|volume=2008 \|pages=1–8 \|doi=10.1109/CVPR.2008.4587821 \|issn=1063-6919 \|pmc=4207373 \|pmid=25346590\|isbn=978-1-4244-2242-5 }}</ref> It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite journal \|last=Jarrett \|first=Kevin \|last2=Kavukcuoglu \|first2=Koray \|last3=Ranzato \|first3=Marc' Aurelio \|last4=LeCun \|first4=Yann \|date=September 2009 \|title=What is the best multi-stage architecture for object recognition? 
\|url=http://dx.doi.org/10.1109/iccv.2009.5459469 \|journal=2009 IEEE 12th International Conference on Computer Vision \|publisher=IEEE \|doi=10.1109/iccv.2009.5459469}}</ref><math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on location <math>(x,y)</math> and channel <math>i</math>. The numbers <math>k, n, \alpha, \beta</math>, and the size of the small window, are hyperparameters picked by using a validation set. Both kinds of local normalization were obviated by batch normalization, which is a more global form of normalization.<ref>{{Cite journal \|last1=Ortiz \|first1=Anthony \|last2=Robinson \|first2=Caleb \|last3=Morris \|first3=Dan \|last4=Fuentes \|first4=Olac \|last5=Kiekintveld \|first5=Christopher \|last6=Hassan \|first6=Md Mahmudulla \|last7=Jojic \|first7=Nebojsa \|date=2020 \|title=Local Context Normalization: Revisiting Local Normalization \|url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html \|pages=11276–11285\|arxiv=1912.05845 }}</ref> Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where it was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite journal \|last=Lyu \|first=Siwei \|last2=Simoncelli \|first2=Eero P. \|date=2008 \|title=Nonlinear Image Representation Using Divisive Normalization \|url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207373/ \|journal=Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 
IEEE Computer Society Conference on Computer Vision and Pattern Recognition \|volume=2008 \|pages=1–8 \|doi=10.1109/CVPR.2008.4587821 \|issn=1063-6919 \|pmc=4207373 \|pmid=25346590}}</ref> ~~Both kinds of local normalization were obsoleted by batch normalization, which is a more global form of normalization.~~ Response normalization reappeared in ConvNeXt V2 as '''global response normalization'''.<ref>{{Cite journal \|last1=Woo \|first1=Sanghyun \|last2=Debnath \|first2=Shoubhik \|last3=Hu \|first3=Ronghang \|last4=Chen \|first4=Xinlei \|last5=Liu \|first5=Zhuang \|last6=Kweon \|first6=In So \|last7=Xie \|first7=Saining \|date=2023 \|title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders \|url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html \|language=en \|pages=16133–16142\|arxiv=2301.00808 }}</ref> === Group normalization === '''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal \|~~last~~last1=Wu \|~~first~~first1=Yuxin \|last2=He \|first2=Kaiming \|date=2018 \|title=Group Normalization \|url=https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html \|pages=3–19}}</ref> is a technique ~~only~~also used solely for CNNs. It can be understood as the LayerNorm for CNN applied once per channel group. Suppose at a layer <math>l</math>, there are channels <math>1, 2, \dots, C</math>, then ~~we partition it~~they are partitioned into groups <math>g_1, g_2, \dots, g_G</math>. Then, ~~we apply LayerNorm~~LayerNorm is applied to each group. 
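The grouping step can be illustrated with a short plain-Python sketch. It makes several assumptions not stated in the text: activations are stored as a nested list indexed <code>x[h][w][c]</code>, the groups are contiguous and equally sized, and the learned scale and shift <math>\gamma, \beta</math> are omitted. The function name <code>group_norm</code> is chosen for this example only.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x[h][w][c] separately within each group of channels.

    Statistics are taken over all spatial positions and all channels
    belonging to the same group. Learned scale and shift are omitted.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = channels // num_groups
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for g in range(num_groups):
        group = range(g * size, (g + 1) * size)  # contiguous channel block (assumed)
        values = [x[h][w][c]
                  for h in range(height) for w in range(width) for c in group]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for h in range(height):
            for w in range(width):
                for c in group:
                    out[h][w][c] = (x[h][w][c] - mean) * scale
    return out

# Toy 2 x 2 feature map with 4 channels, split into 2 groups of 2 channels.
x = [[[float(h + 2 * w + 10 * c) for c in range(4)] for w in range(2)]
     for h in range(2)]
y = group_norm(x, num_groups=2)
```

With <code>num_groups</code> equal to the number of channels, every group holds a single channel; with <code>num_groups=1</code>, all channels are normalized together.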
=== Instance normalization === '''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is also only used for CNNs.<ref>{{~~Citation~~cite arXiv \|~~last~~last1=Ulyanov \|~~first~~first1=Dmitry \|title=Instance Normalization: The Missing Ingredient for Fast Stylization \|date=2017-11-06 \|~~url~~eprint=~~http://arxiv.org/abs/1607.08022 \|access-date=2024-08-08 \|doi=10.48550/arXiv.~~1607.08022 \|last2=Vedaldi \|first2=Andrea \|last3=Lempitsky \|first3=Victor\|class=cs.CV }}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel: <math display="block"> \begin{aligned} \mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^Wx^{(l)}_{h, w, c} \\ Line 175 ⟶ 277: === Adaptive instance normalization === '''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with ~~CNN~~CNNs, ~~not~~rather ~~for~~than ~~CNN~~just CNNs in general.<ref>{{Cite journal \|~~last~~last1=Huang \|~~first~~first1=Xun \|last2=Belongie \|first2=Serge \|date=2017 \|title=Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization \|url=https://openaccess.thecvf.com/content_iccv_2017/html/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.html \|pages=1501–1510\|arxiv=1703.06868 }}</ref> In the AdaIN method of style transfer, we take a CNN and two input images, one for '''content''' and one for '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, AdaIn is applied. Let <math>x^{(l), \text{ content}}</math> be the activation in the content image, and <math>x^{(l), \text{ style}}</math> be the activation in the style image. 
Then, AdaIN computes the per-channel mean <math>\mu^{(l), \text{ style}}_c</math> and standard deviation <math>\sigma^{(l), \text{ style}}_c</math> of the style activations <math>x^{(l), \text{ style}}</math>, and uses them as the shift <math>\beta</math> and scale <math>\gamma</math> for InstanceNorm on <math>x^{(l), \text{ content}}</math>. Note that <math>x^{(l), \text{ style}}</math> itself remains unchanged. Explicitly, we have: <math display="block"> \begin{aligned} y^{(l), \text{ content}}_{h,w,c} &= \sigma^{(l), \text{ style}}_c \left( \frac{x^{(l), \text{ content}}_{h,w,c} - \mu^{(l), \text{ content}}_c}{\sqrt{(\sigma^{(l), \text{ content}}_c)^2 + \epsilon}} \right) + \mu^{(l), \text{ style}}_c. \end{aligned} </math> == Transformers == Some normalization methods were designed for use in [[Transformer (deep learning architecture)\|transformers]]. The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization\|hyperparameter tuning]] and a "warm-up" in [[learning rate]], in which the learning rate starts small and gradually increases. 
The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv \| eprint=1906.01787 \| last1=Wang \| first1=Qiang \| last2=Li \| first2=Bei \| last3=Xiao \| first3=Tong \| last4=Zhu \| first4=Jingbo \| last5=Li \| first5=Changliang \| last6=Wong \| first6=Derek F. \| last7=Chao \| first7=Lidia S. \| title=Learning Deep Transformer Models for Machine Translation \| date=2019 \| class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up and converging faster.<ref name="auto1">{{cite arXiv \|eprint=2002.04745 \|class=cs.LG \|first1=Ruibin \|last1=Xiong \|first2=Yunchang \|last2=Yang \|title=On Layer Normalization in the Transformer Architecture \|date=2020-06-29 \|last3=He \|first3=Di \|last4=Zheng \|first4=Kai \|last5=Zheng \|first5=Shuxin \|last6=Xing \|first6=Chen \|last7=Zhang \|first7=Huishuai \|last8=Lan \|first8=Yanyan \|last9=Wang \|first9=Liwei \|last10=Liu \|first10=Tie-Yan}}</ref> '''FixNorm'''<ref>{{cite arXiv \| eprint=1710.01329 \| last1=Nguyen \| first1=Toan Q. \| last2=Chiang \| first2=David \| title=Improving Lexical Choice in Neural Machine Translation \| date=2017 \| class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal \|last1=Nguyen \|first1=Toan Q. \|last2=Salazar \|first2=Julian \|date=2019-11-02 \|title=Transformers without Tears: Improving the Normalization of Self-Attention \|doi=10.5281/zenodo.3525484\|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. The FixNorm method divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. ScaleNorm replaces every LayerNorm inside a transformer with division by the L2 norm followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer). 
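Both FixNorm and ScaleNorm reduce to the same elementary operation on a single vector: divide by the L2 norm, then multiply by a scalar. The sketch below shows only that operation. The function name <code>scale_norm</code> is chosen for this example, and the small constant <code>eps</code> guarding against division by zero is an assumption added here, not something the text specifies. In a real model <math>g</math> would be a trained parameter rather than a fixed number.

```python
import math

def scale_norm(x, g, eps=1e-5):
    """Rescale vector x so that its L2 norm becomes g.

    `eps` is an assumed safeguard: it bounds the divisor away from zero
    so that an all-zero input does not cause a division error.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]

hidden = [3.0, 4.0]            # toy activation vector with L2 norm 5
scaled = scale_norm(hidden, g=2.0)
```

Unlike LayerNorm, this operation does not subtract the mean and uses one scalar instead of a per-coordinate scale and shift, so only the direction of the input vector survives.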
'''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal \|last1=Henry \|first1=Alex \|last2=Dachapally \|first2=Prudhvi Raj \|last3=Pawar \|first3=Shubham Shantaram \|last4=Chen \|first4=Yuxuan \|date=November 2020 \|editor-last=Cohn \|editor-first=Trevor \|editor2-last=He \|editor2-first=Yulan \|editor3-last=Liu \|editor3-first=Yang \|title=Query-Key Normalization for Transformers \|url=https://aclanthology.org/2020.findings-emnlp.379/ \|journal=Findings of the Association for Computational Linguistics: EMNLP 2020 \|location=Online \|publisher=Association for Computational Linguistics \|pages=4246–4253 \|doi=10.18653/v1/2020.findings-emnlp.379\|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm. In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv \| eprint=2410.01131 \| last1=Loshchilov \| first1=Ilya \| last2=Hsieh \| first2=Cheng-Ping \| last3=Sun \| first3=Simeng \| last4=Ginsburg \| first4=Boris \| title=NGPT: Normalized Transformer with Representation Learning on the Hypersphere \| date=2024 \| class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors. == Miscellaneous == '''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal \|last1=Chen \|first1=Zhao \|last2=Badrinarayanan \|first2=Vijay \|last3=Lee \|first3=Chen-Yu \|last4=Rabinovich \|first4=Andrew \|date=2018-07-03 \|title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks \|url=https://proceedings.mlr.press/v80/chen18a.html \|journal=Proceedings of the 35th International Conference on Machine Learning \|language=en \|publisher=PMLR \|pages=794–803 \|arxiv=1711.02257}}</ref> normalizes gradient vectors during backpropagation. 
== See also == Line 190 ⟶ 306: [[Data preprocessing]] * [[Feature scaling]] ~~== Further reading ==~~ * {{Cite web \|title=Normalization Layers \|url=https://nn.labml.ai/normalization/index.html \|access-date=2024-08-07 \|website=labml.ai Deep Learning Paper Implementations \|language=en}} == References == <references /> == Further reading == * {{Cite web \|title=Normalization Layers \|url=https://nn.labml.ai/normalization/index.html \|access-date=2024-08-07 \|website=labml.ai Deep Learning Paper Implementations \|language=en}} {{Artificial intelligence navbox}} [[Category:Articles with example Python (programming language) code]]