{{Short description|
{{Machine learning bar}}
In [[machine learning]], '''normalization''' is a statistical technique with various applications. There are
Activation normalization, on the other hand, is specific to [[deep learning]], and includes methods that rescale the activation of [[Hidden layer|hidden neurons]] inside [[Neural network (machine learning)|neural networks]].
Normalization is often used to:
* increase the speed of training convergence,
* reduce sensitivity to variations and feature scales in input data,
* reduce [[overfitting]],
* and produce better model generalization to unseen data.
Normalization techniques are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing [[Regularization (mathematics)|regularization]], though they are mainly justified by empirical success.<ref>{{Cite book |last=Huang |first=Lei |url=https://link.springer.com/10.1007/978-3-031-14595-7 |title=Normalization Techniques in Deep Learning |date=2022 |publisher=Springer International Publishing |isbn=978-3-031-14594-0 |series=Synthesis Lectures on Computer Vision |location=Cham |language=en |doi=10.1007/978-3-031-14595-7}}</ref>
== Batch normalization ==
{{Main|Batch normalization}}'''Batch normalization''' ('''BatchNorm''')<ref name=":0">{{Cite journal |last1=Ioffe |first1=Sergey |last2=Szegedy |first2=Christian |date=2015-06-01 |title=Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift |url=https://proceedings.mlr.press/v37/ioffe15.html |journal=Proceedings of the 32nd International Conference on Machine Learning |language=en |publisher=PMLR |pages=448–456|arxiv=1502.03167 }}</ref> operates on the activations of a layer for each mini-batch.
Consider a simple feedforward network, defined by chaining together modules:
<math display="block">x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots</math>
where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. <math>x^{(0)}</math> is the input vector, <math>x^{(1)}</math> is the output vector from the first module, etc.
BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after <math>x^{(l)}</math>. The network would then operate as follows:
<math display="block">\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots</math>
The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.
Concretely, suppose we have a batch of inputs <math>x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}</math>, fed all at once into the network. We would obtain in the middle of the network some vectors:
<math display="block">x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}</math>
The BatchNorm module computes the coordinate-wise mean and variance of these vectors:
<math display="block">
\begin{aligned}
\mu^{(l)}_i &= \frac 1B \sum_{b=1}^B x^{(l)}_{(b), i} \\
(\sigma^{(l)}_i)^2 &= \frac{1}{B} \sum_{b=1}^B (x_{(b),i}^{(l)} - \mu_i^{(l)})^2
\end{aligned}
</math>
where <math>i</math> indexes the coordinates of the vectors, and <math>b</math> indexes the elements of the batch. In other words, we are considering the <math>i</math>-th coordinate of each vector in the batch, and computing the mean and variance of these numbers.
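It then normalizes each coordinate to have zero mean and unit variance:
<math display="block">\hat{x}^{(l)}_{(b), i} = \frac{x^{(l)}_{(b), i} - \mu^{(l)}_i}{\sqrt{(\sigma^{(l)}_i)^2 + \epsilon}}</math>
The <math>\epsilon</math> is a small positive constant such as <math>10^{-9}</math> added to the variance for numerical stability, to avoid [[division by zero]].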
Finally, it applies a linear transformation:
<math display="block">y^{(l)}_{(b), i} = \gamma_i \hat{x}^{(l)}_{(b), i} + \beta_i</math>
Here, <math>\gamma</math> and <math>\beta</math> are parameters inside the BatchNorm module. They are learnable parameters, typically trained by [[gradient descent]].
The following is a [[Python (programming language)|Python]] implementation of BatchNorm:
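<syntaxhighlight lang="python3">
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # x has shape (B, N); gamma and beta have shape (N,)
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)  # shape (N,)
    var = np.var(x, axis=0)  # shape (N,)
    # Normalize, then apply the learned scale and shift
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # shape (B, N)
    return gamma * x_hat + beta
</syntaxhighlight>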
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math>
It is claimed in the original publication<ref name=":0" /> that BatchNorm works by reducing internal covariate shift.
=== Special cases ===
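The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it would be canceled by the subsequent mean subtraction, so the layer takes the form <math>\phi(\mathrm{BN}(Wx))</math>.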
For [[convolutional neural network]]s (
Concretely, suppose we have a 2-dimensional convolutional layer defined by:
<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>
where:
* <math>x^{(l)}_{h, w, c}</math> is the activation of the neuron at position <math>(h, w)</math> in the <math>c</math>-th channel of the <math>l</math>-th layer.
* <math>K^{(l)}_{\Delta h, \Delta w, c, c'}</math> is a kernel tensor. Each channel <math>c</math> corresponds to a kernel <math>K^{(l)}_{h'-h, w'-w, c, c'}</math>, with indices <math>\Delta h, \Delta w, c'</math>.
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.
In order to preserve translation invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch. That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not per ''activation'' <math>x^{(l)}_{h, w, c}</math>:
<math display="block"> \begin{aligned}
\mu^{(l)}_c &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b), h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W (x_{(b), h, w, c}^{(l)} - \mu_c^{(l)})^2
\end{aligned}
</math>
where <math>B</math> is the batch size, <math>H</math> is the height of the feature map, and <math>W</math> is the width of the feature map.
That is, even though there are only <math>B</math> data points in a batch, all <math>BHW</math> outputs from the kernel in this batch are treated equally.<ref name=":0" />
Subsequently, normalization and the linear transform are also done per kernel:
<math display="block"> \begin{aligned}
\hat{x}^{(l)}_{(b), h, w, c} &= \frac{x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}} \\
y^{(l)}_{(b), h, w, c} &= \gamma_c \hat{x}^{(l)}_{(b), h, w, c} + \beta_c
\end{aligned}
</math>
Similar considerations apply for BatchNorm for ''n''-dimensional convolutions.
The following is a Python implementation of BatchNorm for 2D convolutions:
<syntaxhighlight lang="python3">
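import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # x has shape (B, H, W, C); gamma and beta have shape (C,)
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)
    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    # Scale and shift the normalized tensor.
    y = gamma * x_hat + beta
    return y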
</syntaxhighlight>
For multilayered [[Recurrent neural network|recurrent neural networks]] (RNN), BatchNorm is usually applied only for the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book |last1=Laurent |first1=Cesar |last2=Pereyra |first2=Gabriel |last3=Brakel |first3=Philemon |last4=Zhang |first4=Ying |last5=Bengio |first5=Yoshua |chapter=Batch normalized recurrent neural networks |date=March 2016 |title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |publisher=IEEE |pages=2657–2661 |doi=10.1109/ICASSP.2016.7472159 |arxiv=1510.01378 |isbn=978-1-4799-9988-0}}</ref> Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies
<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)}) </math>
where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes
<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)}) </math>
There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence. Then frame-wise BatchNorm means normalizing over <math>b</math>:
<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2
\end{aligned}
</math>
and sequence-wise means normalizing over <math>(b, t)</math>:
<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T (h_{b,t}^{(l)} - \mu^{(l)})^2
\end{aligned}
</math>
Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise BatchNorm is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poor estimates of the statistics.<ref name=":4" />
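The difference between the two conventions comes down to which axes the statistics are averaged over. The following is a minimal sketch, assuming hidden states stored as an array of shape <code>(B, T, D)</code> with all sequences of equal length (no padding mask) and statistics kept per coordinate; the function names are illustrative and not taken from the cited paper.

```python
import numpy as np

def framewise_stats(h):
    # h has shape (B, T, D): batch, time step, hidden coordinate.
    # One mean and variance per time step, averaged over the batch only.
    return h.mean(axis=0), h.var(axis=0)  # each of shape (T, D)

def sequencewise_stats(h):
    # One mean and variance shared by all time steps,
    # averaged over both the batch and the time axis.
    return h.mean(axis=(0, 1)), h.var(axis=(0, 1))  # each of shape (D,)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5, 3))
mu_f, var_f = framewise_stats(h)
mu_s, var_s = sequencewise_stats(h)
```

With padded, variable-length sequences, the sums would have to be restricted to unpadded frames, which this sketch does not attempt.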
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{cite arXiv | eprint=1603.09025 | last1=Cooijmans | first1=Tim | last2=Ballas | first2=Nicolas | last3=Laurent | first3=César | last4=Gülçehre | first4=Çağlar | last5=Courville | first5=Aaron | title=Recurrent Batch Normalization | date=2016 | class=cs.LG }}</ref>
=== Improvements ===
BatchNorm has been very popular and there were many attempted improvements. Some examples include:<ref name=":3">{{cite arXiv | eprint=1906.03548 | last1=Summers | first1=Cecilia | last2=Dinneen | first2=Michael J. | title=Four Things Everyone Should Know to Improve Batch Normalization | date=2019 | class=cs.LG }}</ref>
* ghost batching: randomly partition a batch into sub-batches and perform BatchNorm separately on each;
* weight decay on <math>\gamma</math> and <math>\beta</math>;
* and combining BatchNorm with GroupNorm.
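As an illustration of the first item, ghost batching can be sketched as follows. This is a rough sketch only: the function name, the requirement that the batch size be divisible by the number of sub-batches, and the use of a NumPy random generator are assumptions made for the example.

```python
import numpy as np

def ghost_batchnorm(x, gamma, beta, num_splits, epsilon=1e-9, rng=None):
    # x has shape (B, N); B must be divisible by num_splits in this sketch.
    rng = np.random.default_rng() if rng is None else rng
    # Randomly partition the batch indices into equally sized sub-batches.
    perm = rng.permutation(x.shape[0])
    y = np.empty_like(x, dtype=float)
    for idx in np.split(perm, num_splits):
        sub = x[idx]
        # Each sub-batch is normalized with its own statistics.
        mu = sub.mean(axis=0)
        var = sub.var(axis=0)
        y[idx] = gamma * (sub - mu) / np.sqrt(var + epsilon) + beta
    return y
```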
A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, but during inference, they are frozen at the values accumulated during training (usually as an [[exponential moving average]]). This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference:<ref name=":3" />{{Pg|location=Eq. 3}}
<math display="block">
\begin{aligned}
\mu &= \alpha E[x] + (1 - \alpha) \mu_{x, \text{ train}} \\
\sigma^2 &= (\alpha E[x^2] + (1 - \alpha) \mu_{x^2, \text{ train}}) - \mu^2
\end{aligned}
</math>
where <math>\alpha</math> is a hyperparameter to be optimized on a validation set.
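A minimal sketch of this blending step is given below. It assumes that the expectations are taken over the current inference batch and that the second-moment term is the batch mean of <math>x^2</math>; the function and argument names are hypothetical.

```python
import numpy as np

def inference_stats(x, mu_train, mu_sq_train, alpha):
    # x: inference batch of shape (B, N)
    # mu_train, mu_sq_train: training-time moving averages of x and x**2
    mu = alpha * x.mean(axis=0) + (1 - alpha) * mu_train
    second_moment = alpha * (x ** 2).mean(axis=0) + (1 - alpha) * mu_sq_train
    # Variance as second moment minus squared mean
    var = second_moment - mu ** 2
    return mu, var
```

Setting <code>alpha=0</code> recovers the frozen training statistics, while <code>alpha=1</code> uses only the statistics of the current batch.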
Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.<ref>{{cite arXiv | eprint=2102.06171 | last1=Brock | first1=Andrew | last2=De | first2=Soham | last3=Smith | first3=Samuel L. | last4=Simonyan | first4=Karen | title=High-Performance Large-Scale Image Recognition Without Normalization | date=2021 | class=cs.CV }}</ref>
== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite arXiv |last1=Ba |first1=Jimmy Lei |last2=Kiros |first2=Jamie Ryan |last3=Hinton |first3=Geoffrey E. |date=2016 |title=Layer Normalization |class=stat.ML |eprint=1607.06450}}</ref> is a
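For a given data input and layer, LayerNorm computes the mean <math>\mu</math> and variance <math>\sigma^2</math> over all the neurons in the layer, then normalizes and applies a learned scale and shift:
<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>
where
<math display="block">\mu = \frac 1D \sum_{i=1}^D x_i, \quad \sigma^2 = \frac 1D \sum_{i=1}^D (x_i - \mu)^2</math>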
and the index <math>i</math> ranges over the neurons in that layer.
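In the same style as the BatchNorm examples, LayerNorm can be sketched in Python as follows. This is a minimal sketch, assuming the neurons of the layer are laid out along the last axis of the input and that a per-neuron scale and shift are applied.

```python
import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-9):
    # x has shape (..., D); statistics are taken over the last axis only,
    # so every input in a batch is normalized independently of the others.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta
```

Unlike BatchNorm, the result for one input does not depend on the other inputs in the batch, so the same code applies unchanged at training and inference time.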
=== Examples ===
For example, in CNN, a LayerNorm applies to all activations in a layer. In the previous notation, we have:
<math display="block"> \begin{aligned} \mu^{(l)} &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C x^{(l)}_{h, w, c} \\
(\sigma^{(l)})^2 &= \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W\sum_{c=1}^C (x_{h, w, c}^{(l)} - \mu^{(l)})^2 \\
\hat{x}^{(l)}_{h,w,c} &= \frac{x^{(l)}_{h,w,c} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}} \\
y^{(l)}_{h,w,c} &= \gamma^{(l)} \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}
\end{aligned}
</math>
Notice that the batch index <math>b</math> is removed, while the channel index <math>c</math> is added.
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238 |last2=Hutter |first2=Marcus|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)} \in \mathbb{R}^{D}
</math>, where <math>D</math> is the dimension of the hidden vector, then LayerNorm will be applied with:
<math display="block">\hat{x_{i}}^{(t)} = \frac{x_i^{(t)} - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}, \quad y_i^{(t)} = \gamma_i \hat{x_i}^{(t)} + \beta_i</math>
where:
<math display="block">\mu^{(t)} = \frac 1D \sum_{i=1}^D x_i^{(t)}, \quad (\sigma^{(t)})^2 = \frac 1D \sum_{i=1}^D (x_i^{(t)} - \mu^{(t)})^2</math>
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm'''):<ref>{{cite arXiv |last1=Zhang |first1=Biao |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467 |last2=Sennrich |first2=Rico|class=cs.LG }}</ref>
<math display="block"> \hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{j=1}^D x_j^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>
Essentially, it is LayerNorm where we enforce <math>\mu = 0</math> and <math>\epsilon = 0</math>. It is also called '''L2 normalization'''. It is a special case of '''Lp normalization''', or '''power normalization''':
<math display="block">
\hat{x_i} = \frac{x_i}{\left(\frac 1D \sum_{j=1}^D |x_j|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>
where <math>p > 0</math> is a constant.
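Both formulas can be sketched with a single helper, shown below as an illustration. The function names are assumptions, and, following the formulas as written, no <math>\epsilon</math> is added, so an all-zero input would cause a division by zero.

```python
import numpy as np

def lp_norm_layer(x, gamma=1.0, beta=0.0, p=2.0):
    # Divide by the p-th root of the mean of |x_j|^p over the last axis.
    scale = np.mean(np.abs(x) ** p, axis=-1, keepdims=True) ** (1.0 / p)
    return gamma * x / scale + beta

def rmsnorm(x, gamma=1.0, beta=0.0):
    # RMSNorm is the p = 2 case.
    return lp_norm_layer(x, gamma, beta, p=2.0)
```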
=== Adaptive ===
'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal |last1=Perez |first1=Ethan |last2=Strub |first2=Florian |last3=De Vries |first3=Harm |last4=Dumoulin |first4=Vincent |last5=Courville |first5=Aaron |date=2018-04-29 |title=FiLM: Visual Reasoning with a General Conditioning Layer |url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |volume=32 |issue=1 |doi=10.1609/aaai.v32i1.11671 |issn=2374-3468|arxiv=1709.07871 }}</ref> and has been used effectively in [[Diffusion model|diffusion]] transformers (DiTs).<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a [[multilayer perceptron]] into <math>\gamma, \beta</math>, which are then applied in the LayerNorm module of a transformer.
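The idea can be illustrated with a small sketch. Here two linear maps, <code>W_gamma</code> and <code>W_beta</code>, stand in for the multilayer perceptron; these names, and the use of a single linear map rather than a full MLP, are simplifying assumptions for the example and do not describe the DiT implementation.

```python
import numpy as np

def adaptive_layernorm(x, cond, W_gamma, W_beta, epsilon=1e-9):
    # x: activations of shape (D,); cond: conditioning vector of shape (K,)
    # W_gamma, W_beta: hypothetical (D, K) linear maps standing in for the MLP
    gamma = W_gamma @ cond  # scale derived from the conditioning data
    beta = W_beta @ cond    # shift derived from the conditioning data
    x_hat = (x - x.mean()) / np.sqrt(x.var() + epsilon)
    return gamma * x_hat + beta
```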
== Weight normalization ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868 |last2=Kingma |first2=Diederik P.|class=cs.LG }}</ref> is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.
One example is '''spectral normalization''', which divides weight matrices by their [[spectral norm]]. Spectral normalization is used in [[generative adversarial network]]s (GANs) such as the [[Wasserstein GAN]].<ref>{{cite arXiv |eprint=1802.05957 |class=cs.LG |first1=Takeru |last1=Miyato |first2=Toshiki |last2=Kataoka |title=Spectral Normalization for Generative Adversarial Networks |date=2018-02-16 |last3=Koyama |first3=Masanori |last4=Yoshida |first4=Yuichi}}</ref> The spectral norm can be efficiently computed by the following algorithm:
{{blockquote|'''INPUT''' matrix <math>W</math> and initial guess <math>x</math>
Iterate <math>x \mapsto \frac{1}{\|Wx\|_2}Wx</math> to convergence <math>x^*</math>. This is the eigenvector of <math>W</math> with eigenvalue <math>\|W\|_s</math>.
'''RETURN''' <math>x^*, \|Wx^*\|_2</math>}}
By reassigning <math>W_i \leftarrow \frac{W_i}{\|W_i\|_s}</math> after each update of the discriminator <math>D</math>, we can upper-bound <math>\|W_i\|_s \leq 1</math> for each weight matrix <math>W_i</math>, and thus upper-bound the Lipschitz norm <math>\|D \|_L</math> of the discriminator.
The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>. Then, at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess for the algorithm. Since <math>W_i(t+1)</math> is very close to <math>W_i(t)</math>, so is <math>x^*_i(t)</math> to <math>x^*_i(t+1)</math>, thus allowing rapid convergence.
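A power iteration of this kind, including the warm start, can be sketched as follows. As an assumption that goes beyond the algorithm quoted above, the sketch alternates between <math>W</math> and <math>W^T</math> (that is, it iterates on <math>W^T W</math>), so that it also applies to non-symmetric and rectangular weight matrices; the function name and the fixed iteration count are illustrative.

```python
import numpy as np

def spectral_norm(W, num_iters=100, v0=None, rng=None):
    # Power iteration on W^T W. Returns (sigma, v), where sigma estimates the
    # largest singular value of W and v can be reused as the next initial guess.
    rng = np.random.default_rng(0) if rng is None else rng
    v = rng.normal(size=W.shape[1]) if v0 is None else np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)
    for _ in range(num_iters):
        u = W @ v
        u = u / np.linalg.norm(u)
        v = W.T @ u
        v = v / np.linalg.norm(v)
    sigma = np.linalg.norm(W @ v)
    return sigma, v
```

Passing the returned <code>v</code> back in as <code>v0</code> at the next training step corresponds to the warm start described above, in which case a very small <code>num_iters</code> is typically enough.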
== CNN-specific normalization ==
There are some activation normalization techniques that are only used for CNNs.
=== Response normalization ===
{{Anchor|Local response normalization}}'''Local response normalization'''<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by
<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math>
where <math>a_{x,y}^i</math> is the activation of the neuron at location <math>(x,y)</math> and channel <math>i</math>. That is, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.
<math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set. It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc' Aurelio |last4=LeCun |first4=Yann |chapter=What is the best multi-stage architecture for object recognition? |date=September 2009 |pages=2146–2153 |title=2009 IEEE 12th International Conference on Computer Vision |chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 |publisher=IEEE |doi=10.1109/iccv.2009.5459469|isbn=978-1-4244-4420-5 }}</ref> <math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math> where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on location <math>(x,y)</math> and channel <math>i</math>. Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where they were used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite book |last1=Lyu |first1=Siwei |last2=Simoncelli |first2=Eero P. |chapter=Nonlinear image representation using divisive normalization |date=2008 |title=2008 IEEE Conference on Computer Vision and Pattern Recognition |volume=2008 |pages=1–8 |doi=10.1109/CVPR.2008.4587821 |issn=1063-6919 |pmc=4207373 |pmid=25346590|isbn=978-1-4244-2242-5 }}</ref>
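The local response normalization formula can be sketched directly in Python. This is a minimal sketch for a single feature map of shape <code>(H, W, N)</code>; the default hyperparameter values are placeholders for illustration, and <math>n/2</math> is assumed to mean integer division.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (H, W, N): spatial position (x, y) and channel i.
    N = a.shape[-1]
    b = np.empty_like(a, dtype=float)
    for i in range(N):
        # Window of adjacent channels, clipped to the valid range [0, N - 1].
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[..., lo:hi + 1] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b
```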
Both kinds of local normalization were
Response normalization reappeared in ConvNeXt V2 as '''global response normalization'''.<ref>{{Cite journal |last1=Woo |first1=Sanghyun |last2=Debnath |first2=Shoubhik |last3=Hu |first3=Ronghang |last4=Chen |first4=Xinlei |last5=Liu |first5=Zhuang |last6=Kweon |first6=In So |last7=Xie |first7=Saining |date=2023 |title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders |url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html |language=en |pages=16133–16142|arxiv=2301.00808 }}</ref>
=== Group normalization ===
'''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal |last1=Wu |first1=Yuxin |last2=He |first2=Kaiming |date=2018 |title=Group Normalization |url=https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html |pages=3–19}}</ref> is a technique
Suppose at a layer <math>l</math>, there are channels <math>1, 2, \dots, C</math>, then
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is also only used for CNNs.<ref>{{cite arXiv |last1=Ulyanov |first1=Dmitry |title=Instance Normalization: The Missing Ingredient for Fast Stylization |date=2017-11-06 |eprint=1607.08022 |last2=Vedaldi |first2=Andrea |last3=Lempitsky |first3=Victor|class=cs.CV }}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:
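<math display="block"> \begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W (x^{(l)}_{h, w, c} - \mu^{(l)}_c)^2 \\
\hat{x}^{(l)}_{h,w,c} &= \frac{x^{(l)}_{h,w,c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}} \\
y^{(l)}_{h,w,c} &= \gamma^{(l)}_c \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}_c
\end{aligned}
</math>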
=== Adaptive instance normalization ===
'''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with
In the AdaIN method of style transfer, we take a CNN and two input images, one for '''content''' and one for '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, AdaIN is applied.
Let <math>x^{(l), \text{ content}}</math> be the activation in the content image, and <math>x^{(l), \text{ style}}</math> be the activation in the style image. Then, AdaIN first computes the per-channel mean and standard deviation of the activations of the style image <math>x^{(l), \text{ style}}</math>, then uses those as the <math>\beta, \gamma</math> for InstanceNorm on <math>x^{(l), \text{ content}}</math>. Note that <math>x^{(l), \text{ style}}</math> itself remains unchanged. Explicitly, we have:
<math display="block">
\begin{aligned}
y^{(l), \text{ content}}_{h,w,c} &= \sigma^{(l),
\text{ style}}_c \left( \frac{x^{(l), \text{ content}}_{h,w,c} - \mu^{(l), \text{ content}}_c}{\sqrt{(\sigma^{(l), \text{ content}}_c)^2 + \epsilon}} \right) + \mu^{(l), \text{ style}}_c
\end{aligned}
</math>
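The formula above translates into a short Python sketch, given here for a single pair of feature maps of shape <code>(H, W, C)</code> taken from the same layer. The function name is illustrative, and <math>\epsilon</math> is added only in the content normalization, as in the formula.

```python
import numpy as np

def adain(content, style, epsilon=1e-9):
    # content, style: feature maps of shape (H, W, C) from the same layer.
    # Per-channel statistics over the spatial axes, as in InstanceNorm.
    mu_c = content.mean(axis=(0, 1), keepdims=True)
    sigma_c = np.sqrt(content.var(axis=(0, 1), keepdims=True) + epsilon)
    mu_s = style.mean(axis=(0, 1), keepdims=True)
    sigma_s = style.std(axis=(0, 1), keepdims=True)
    # Normalize the content, then rescale and shift with the style statistics.
    return sigma_s * (content - mu_c) / sigma_c + mu_s
```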
== Transformers ==
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|transformers]].
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization|hyperparameter tuning]] and a "warm-up" in [[learning rate]], where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv | eprint=1906.01787 | last1=Wang | first1=Qiang | last2=Li | first2=Bei | last3=Xiao | first3=Tong | last4=Zhu | first4=Jingbo | last5=Li | first5=Changliang | last6=Wong | first6=Derek F. | last7=Chao | first7=Lidia S. | title=Learning Deep Transformer Models for Machine Translation | date=2019 | class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up, leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
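The two configurations differ only in where the LayerNorm sits relative to the residual connection. The following sketch assumes the usual reading of the terms (post-LN normalizes after the residual addition, pre-LN normalizes the input of the sublayer), omits the learnable scale and shift for brevity, and uses <code>sublayer</code> as a placeholder for an attention or feedforward module.

```python
import numpy as np

def ln(x, epsilon=1e-9):
    # LayerNorm over the last axis, without learnable parameters
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

def post_ln_block(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return ln(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path is left untouched.
    return x + sublayer(ln(x))
```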
'''FixNorm'''<ref>{{cite arXiv | eprint=1710.01329 | last1=Nguyen | first1=Toan Q. | last2=Chiang | first2=David | title=Improving Lexical Choice in Neural Machine Translation | date=2017 | class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |doi=10.5281/zenodo.3525484|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. The FixNorm method divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. The ScaleNorm method replaces all LayerNorms inside a transformer by division by the L2 norm, followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm.
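The L2-based operations described here can be sketched as follows. This is a rough illustration only: the small <code>epsilon</code> guarding against division by zero and the function names are assumptions, and in practice <code>g</code> would be a learned parameter rather than a fixed argument.

```python
import numpy as np

def scalenorm(x, g, epsilon=1e-9):
    # Rescale each vector along the last axis so that its L2 norm is (about) g.
    return g * x / (np.linalg.norm(x, axis=-1, keepdims=True) + epsilon)

def qk_normalize(q, k, epsilon=1e-9):
    # Normalize query and key vectors to unit L2 norm before attention.
    q_hat = q / (np.linalg.norm(q, axis=-1, keepdims=True) + epsilon)
    k_hat = k / (np.linalg.norm(k, axis=-1, keepdims=True) + epsilon)
    return q_hat, k_hat
```

After <code>qk_normalize</code>, the dot product of a query with a key is their cosine similarity, so it is bounded between -1 and 1.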
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv | eprint=2410.01131 | last1=Loshchilov | first1=Ilya | last2=Hsieh | first2=Cheng-Ping | last3=Sun | first3=Simeng | last4=Ginsburg | first4=Boris | title=NGPT: Normalized Transformer with Representation Learning on the Hypersphere | date=2024 | class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
== Miscellaneous ==
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803 |arxiv=1711.02257}}</ref> normalizes gradient vectors during backpropagation.
== See also ==
* [[Feature scaling]]
== References ==
<references />
== Further reading ==
* {{Cite web |title=Normalization Layers |url=https://nn.labml.ai/normalization/index.html |access-date=2024-08-07 |website=labml.ai Deep Learning Paper Implementations |language=en}}
{{Artificial intelligence navbox}}
[[Category:Articles with example Python (programming language) code]]