Normalization (machine learning)

 
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization if that is beneficial.<ref name=":1">{{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT Press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |___location=Cambridge, Massachusetts |chapter=8.7.1. Batch Normalization}}</ref>
Because a neural network can always be topped with a final linear layer, BatchNorm can be interpreted as removing the purely linear transformations, so that the layers focus on modelling the nonlinear aspects of the data.<ref>{{Cite journal |last1=Desjardins |first1=Guillaume |last2=Simonyan |first2=Karen |last3=Pascanu |first3=Razvan |last4=kavukcuoglu |first4=koray |date=2015 |title=Natural Neural Networks |url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28}}</ref><ref name=":1" />
 
The original publication claimed that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization |url=https://ieeexplore.ieee.org/document/9238401 |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=32 |issue=11 |pages=5082–5092 |doi=10.1109/TNNLS.2020.3026784 |issn=2162-237X |pmid=33095717}}</ref> and detractors.<ref>{{Cite journal |last1=Bjorck |first1=Nils |last2=Gomes |first2=Carla P |last3=Selman |first3=Bart |last4=Weinberger |first4=Kilian Q |date=2018 |title=Understanding Batch Normalization |url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.02375}}</ref><ref>{{Cite journal |last1=Santurkar |first1=Shibani |last2=Tsipras |first2=Dimitris |last3=Ilyas |first3=Andrew |last4=Madry |first4=Aleksander |date=2018 |title=How Does Batch Normalization Help Optimization? |url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it will be canceled by the subsequent mean subtraction, so the form <math>\mathrm{BN}(Wx)</math> is used. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />
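This ordering can be sketched in numpy (a minimal illustration; the function name and shapes are assumptions, not from the paper):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # linear transform; the bias is omitted,
x = rng.normal(size=(8, 4))     # since mean subtraction would cancel it
h = batchnorm(x @ W, gamma=np.ones(3), beta=np.zeros(3))
y = np.maximum(h, 0.0)          # nonlinearity applied after BatchNorm
```

With <math>\gamma = 1, \beta = 0</math>, each feature of `h` has approximately zero mean and unit variance over the batch.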
 
For [[convolutional neural network]]s (CNNs), BatchNorm must preserve the translation invariance of the network, which means that it must treat all outputs of the same kernel as if they were different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web |title=BatchNorm2d — PyTorch 2.4 documentation |url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html |access-date=2024-09-26 |website=pytorch.org}}</ref><ref>{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |___location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=8.5. Batch Normalization |chapter-url=https://d2l.ai/chapter_convolutional-modern/batch-norm.html}}</ref>
 
Concretely, suppose we have a 2-dimensional convolutional layer defined by<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>where <math>K^{(l)}</math> is the kernel of the layer, <math>b^{(l)}_c</math> is the bias term for channel <math>c</math>, and <math>x^{(l)}_{h, w, c}</math> is the activation at height <math>h</math>, width <math>w</math>, and channel <math>c</math>.
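The per-channel statistics described above can be sketched in numpy (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    # x has shape (N, H, W, C); statistics are shared across the batch and
    # spatial positions, so each channel gets one mean and one variance.
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5, 3))
y = batchnorm2d(x, gamma=np.ones(3), beta=np.zeros(3))
```

Because every spatial position is treated as a data point, the normalization commutes with translations of the input.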
 
== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{cite arXiv |last1=Ba |first1=Jimmy Lei |last2=Kiros |first2=Jamie Ryan |last3=Hinton |first3=Geoffrey E. |date=2016 |title=Layer Normalization |eprint=1607.06450}}</ref> is a common competitor to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of [[Transformer (deep learning architecture)|Transformers]].
 
For a given data input and layer, LayerNorm computes the mean (<math>\mu</math>) and variance (<math>\sigma^2</math>) over all the neurons in the layer. Similar to BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>where, for numerical stability, a small constant <math>\epsilon</math> is added.
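The formula can be sketched in numpy (a minimal illustration; the function name is an assumption):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the features of each sample (last axis),
    # so the output does not depend on the batch size.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
y = layernorm(x, gamma=np.ones(16), beta=np.zeros(16))
```

Each row of `y` is normalized independently, which is why the result is unchanged if rows are added to or removed from the batch.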
 
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|Transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |last2=Hutter |first2=Marcus |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238}}</ref> LayerNorm is applied individually to each timestep.
 
For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)}</math>, then LayerNorm is applied to <math>x^{(t)}</math> using statistics computed over that timestep's features alone.
 
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite arXiv |last1=Zhang |first1=Biao |last2=Sennrich |first2=Rico |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467}}</ref> changes LayerNorm by<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{i=1}^D x_i^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>Essentially, it is LayerNorm with <math>\mu</math> and <math>\epsilon</math> both set to zero.
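A numpy sketch, matching the formula above (the function name is an assumption):

```python
import numpy as np

def rmsnorm(x, gamma, beta=0.0, eps=0.0):
    # Divide by the root mean square of the features; unlike LayerNorm,
    # the mean is not subtracted.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
y = rmsnorm(x, gamma=np.ones(16))
```

With <math>\gamma = 1, \beta = 0</math>, each row of the output has unit root mean square by construction.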
 
== Other normalizations ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |last2=Kingma |first2=Diederik P. |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868}}</ref> is a technique inspired by BatchNorm. It normalizes the weight matrices of a neural network, rather than its activations.
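The reparameterization can be sketched in numpy: each weight vector <math>w</math> is written as <math>w = g \, v / \|v\|</math>, with the length <math>g</math> and direction <math>v</math> learned separately (the function name is an assumption):

```python
import numpy as np

def weightnorm(v, g):
    # Reparameterize each weight vector (row of v) as g * v / ||v||,
    # decoupling its direction from its learned length g.
    norm = np.linalg.norm(v, axis=1, keepdims=True)
    return g[:, None] * v / norm

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 5))
g = np.array([1.0, 2.0, 0.5])
w = weightnorm(v, g)
```

After the reparameterization, the norm of each row of `w` equals the corresponding entry of `g` exactly.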
 
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803 |arxiv=1711.02257}}</ref> normalizes gradient vectors during backpropagation.
 
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data.
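A minimal numpy sketch of the idea, with <math>\gamma, \beta</math> produced from a conditioning vector by linear maps (all names, shapes, and the choice of a linear map are illustrative assumptions):

```python
import numpy as np

def ada_layernorm(x, cond, W_gamma, W_beta, eps=1e-5):
    # gamma and beta come from a separate conditioning vector rather than
    # being learned per-feature constants.
    gamma = cond @ W_gamma
    beta = cond @ W_beta
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # layer activations
cond = rng.normal(size=(4, 6))    # conditioning input
y = ada_layernorm(x, cond, rng.normal(size=(6, 8)), rng.normal(size=(6, 8)))
```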
 
== CNN-specific normalization ==
 
=== Local response normalization ===
'''Local response normalization'''<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math>where <math>a_{x,y}^i</math> is the activation of the neuron at ___location <math>(x,y)</math> and channel <math>i</math>. In words, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.
 
The numbers <math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set.
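The formula can be sketched in numpy (a minimal illustration with the AlexNet hyperparameter values as defaults; the function name and HWC layout are assumptions):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (H, W, N): each channel is suppressed by the squared
    # activations of the same pixel in up to n adjacent channels.
    N = a.shape[-1]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = (a[..., lo:hi + 1] ** 2).sum(axis=-1)
        b[..., i] = a[..., i] / (k + alpha * s) ** beta
    return b

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4, 8))
b = local_response_norm(a)
```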
 
It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc' Aurelio |last4=LeCun |first4=Yann |date=September 2009 |chapter=What is the best multi-stage architecture for object recognition? |title=2009 IEEE 12th International Conference on Computer Vision |chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 |publisher=IEEE |pages=2146–2153 |doi=10.1109/iccv.2009.5459469 |isbn=978-1-4244-4420-5}}</ref><math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on ___location <math>(x,y)</math> and channel <math>j</math>. The numbers <math>k, n, \alpha, \beta</math>, and the size of the small window, are hyperparameters picked by using a validation set.
 
Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where divisive normalization was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite journal |last1=Lyu |first1=Siwei |last2=Simoncelli |first2=Eero P. |date=2008 |title=Nonlinear Image Representation Using Divisive Normalization |url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207373/ |journal=Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition |volume=2008 |pages=1–8 |doi=10.1109/CVPR.2008.4587821 |issn=1063-6919 |pmc=4207373 |pmid=25346590 |isbn=978-1-4244-2242-5}}</ref>
 
Both kinds of local normalization were superseded by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285 |arxiv=1912.05845}}</ref>
 
=== Group normalization ===
'''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal |last1=Wu |first1=Yuxin |last2=He |first2=Kaiming |date=2018 |title=Group Normalization |url=https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html |pages=3–19}}</ref> is a technique used only for CNNs. It can be understood as LayerNorm for CNNs applied once per group of channels.
 
Suppose at a layer <math>l</math> there are channels <math>1, 2, \dots, C</math>; we partition them into groups <math>g_1, \dots, g_G</math> and apply LayerNorm to each group.
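A numpy sketch of this partition-then-normalize scheme (assuming an NHWC layout and contiguous channel groups; the function name is illustrative):

```python
import numpy as np

def groupnorm(x, num_groups, gamma, beta, eps=1e-5):
    # x has shape (N, H, W, C); channels are split into contiguous groups
    # and each group is normalized with its own per-sample statistics.
    N, H, W, C = x.shape
    g = x.reshape(N, H, W, num_groups, C // num_groups)
    mu = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    g = (g - mu) / np.sqrt(var + eps)
    return gamma * g.reshape(N, H, W, C) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 6))
y = groupnorm(x, num_groups=3, gamma=np.ones(6), beta=np.zeros(6))
```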
 
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is only used for CNNs.<ref>{{cite arXiv |last1=Ulyanov |first1=Dmitry |last2=Vedaldi |first2=Andrea |last3=Lempitsky |first3=Victor |title=Instance Normalization: The Missing Ingredient for Fast Stylization |date=2017-11-06 |eprint=1607.08022}}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W \left(x^{(l)}_{h, w, c} - \mu^{(l)}_c\right)^2 \\
\hat{x}^{(l)}_{h, w, c} &= \frac{x^{(l)}_{h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}}
\end{aligned}
</math>
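In numpy, per-channel, per-sample normalization looks like this (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def instancenorm(x, eps=1e-5):
    # Each (sample, channel) pair gets its own mean and variance,
    # computed over the spatial dimensions; x has shape (N, H, W, C).
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 3))
y = instancenorm(x)
```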
 
=== Adaptive instance normalization ===
'''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with CNN, not for CNN in general.<ref>{{Cite journal |last1=Huang |first1=Xun |last2=Belongie |first2=Serge |date=2017 |title=Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization |url=https://openaccess.thecvf.com/content_iccv_2017/html/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.html |pages=1501–1510 |arxiv=1703.06868}}</ref>
 
In the AdaIN method of style transfer, we take a CNN and two input images, one '''content''' and one '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, AdaIN is applied.
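At that layer, AdaIN rescales each channel of the content features to match the corresponding style-feature statistics. A numpy sketch (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    # Shift and scale each content channel so that its mean and standard
    # deviation match those of the corresponding style channel.
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    sd_c = content.std(axis=(1, 2), keepdims=True)
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    sd_s = style.std(axis=(1, 2), keepdims=True)
    return sd_s * (content - mu_c) / (sd_c + eps) + mu_s

rng = np.random.default_rng(0)
c = rng.normal(size=(1, 8, 8, 4))
s = rng.normal(loc=2.0, scale=3.0, size=(1, 8, 8, 4))
out = adain(c, s)
```

The output keeps the spatial structure of the content features while adopting the per-channel statistics of the style features.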