Normalization (machine learning)

 
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization if that is beneficial.<ref name=":1">{{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT Press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |___location=Cambridge, Massachusetts |chapter=8.7.1. Batch Normalization}}</ref>
Because a neural network can always be topped with a final linear layer, BatchNorm can be interpreted as removing the purely linear transformations, so that the layers focus on modelling the nonlinear aspects of the data.<ref>{{Cite journal |last1=Desjardins |first1=Guillaume |last2=Simonyan |first2=Karen |last3=Pascanu |first3=Razvan |last4=kavukcuoglu |first4=koray |date=2015 |title=Natural Neural Networks |url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28}}</ref><ref name=":1" />
 
The original publication claimed that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization |url=https://ieeexplore.ieee.org/document/9238401 |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=32 |issue=11 |pages=5082–5092 |doi=10.1109/TNNLS.2020.3026784 |issn=2162-237X |pmid=33095717}}</ref> and detractors.<ref>{{Cite journal |last1=Bjorck |first1=Nils |last2=Gomes |first2=Carla P |last3=Selman |first3=Bart |last4=Weinberger |first4=Kilian Q |date=2018 |title=Understanding Batch Normalization |url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.02375}}</ref><ref>{{Cite journal |last1=Santurkar |first1=Shibani |last2=Tsipras |first2=Dimitris |last3=Ilyas |first3=Andrew |last4=Madry |first4=Aleksander |date=2018 |title=How Does Batch Normalization Help Optimization? |url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it will be canceled by the subsequent mean subtraction, so the form <math>\mathrm{BN}(Wx)</math> is used. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />
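This ordering can be sketched in numpy (a minimal illustration; the function name and shapes are assumptions, not from the paper):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # linear transform; the bias is omitted,
x = rng.normal(size=(8, 4))     # since mean subtraction would cancel it
h = batchnorm(x @ W, gamma=np.ones(3), beta=np.zeros(3))
y = np.maximum(h, 0.0)          # nonlinearity applied after BatchNorm
```

With <math>\gamma = 1, \beta = 0</math>, each feature of `h` has approximately zero mean and unit variance over the batch.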
 
For [[convolutional neural network]]s (CNNs), BatchNorm must preserve the translation invariance of the network, which means that it must treat all outputs of the same kernel as if they were different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web |title=BatchNorm2d — PyTorch 2.4 documentation |url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html |access-date=2024-09-26 |website=pytorch.org}}</ref><ref>{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |___location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=8.5. Batch Normalization |chapter-url=https://d2l.ai/chapter_convolutional-modern/batch-norm.html}}</ref>
 
Concretely, suppose we have a 2-dimensional convolutional layer defined by<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>where <math>K^{(l)}</math> is the kernel of the layer, <math>b^{(l)}_c</math> is the bias term for channel <math>c</math>, and <math>x^{(l)}_{h, w, c}</math> is the activation at height <math>h</math>, width <math>w</math>, and channel <math>c</math>.
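The per-channel statistics described above can be sketched in numpy (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def batchnorm2d(x, gamma, beta, eps=1e-5):
    # x has shape (N, H, W, C); statistics are shared across the batch and
    # spatial positions, so each channel gets one mean and one variance.
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5, 3))
y = batchnorm2d(x, gamma=np.ones(3), beta=np.zeros(3))
```

Because every spatial position is treated as a data point, the normalization commutes with translations of the input.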
 
== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{cite arXiv |last1=Ba |first1=Jimmy Lei |last2=Kiros |first2=Jamie Ryan |last3=Hinton |first3=Geoffrey E. |date=2016 |title=Layer Normalization |eprint=1607.06450}}</ref> is a common competitor to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of [[Transformer (deep learning architecture)|Transformers]].
 
For a given data input and layer, LayerNorm computes the mean (<math>\mu</math>) and variance (<math>\sigma^2</math>) over all the neurons in the layer. Similar to BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>where, for numerical stability, a small constant <math>\epsilon</math> is added.
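The formula can be sketched in numpy (a minimal illustration; the function name is an assumption):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the features of each sample (last axis),
    # so the output does not depend on the batch size.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
y = layernorm(x, gamma=np.ones(16), beta=np.zeros(16))
```

Each row of `y` is normalized independently, which is why the result is unchanged if rows are added to or removed from the batch.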
 
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|Transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |last2=Hutter |first2=Marcus |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238}}</ref> LayerNorm is applied individually to each timestep.
 
For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)}</math>, then LayerNorm is applied to <math>x^{(t)}</math> using statistics computed over that timestep's features alone.
 
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite arXiv |last1=Zhang |first1=Biao |last2=Sennrich |first2=Rico |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467}}</ref> changes LayerNorm by<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{i=1}^D x_i^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>Essentially, it is LayerNorm with <math>\mu</math> and <math>\epsilon</math> both set to zero.
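A numpy sketch, matching the formula above (the function name is an assumption):

```python
import numpy as np

def rmsnorm(x, gamma, beta=0.0, eps=0.0):
    # Divide by the root mean square of the features; unlike LayerNorm,
    # the mean is not subtracted.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
y = rmsnorm(x, gamma=np.ones(16))
```

With <math>\gamma = 1, \beta = 0</math>, each row of the output has unit root mean square by construction.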
 
== Other normalizations ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |last2=Kingma |first2=Diederik P. |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868}}</ref> is a technique inspired by BatchNorm. It normalizes the weight matrices of a neural network, rather than its activations.
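The reparameterization can be sketched in numpy: each weight vector <math>w</math> is written as <math>w = g \, v / \|v\|</math>, with the length <math>g</math> and direction <math>v</math> learned separately (the function name is an assumption):

```python
import numpy as np

def weightnorm(v, g):
    # Reparameterize each weight vector (row of v) as g * v / ||v||,
    # decoupling its direction from its learned length g.
    norm = np.linalg.norm(v, axis=1, keepdims=True)
    return g[:, None] * v / norm

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 5))
g = np.array([1.0, 2.0, 0.5])
w = weightnorm(v, g)
```

After the reparameterization, the norm of each row of `w` equals the corresponding entry of `g` exactly.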
 
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803 |arxiv=1711.02257}}</ref> normalizes gradient vectors during backpropagation.
 
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data.
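A minimal numpy sketch of the idea, with <math>\gamma, \beta</math> produced from a conditioning vector by linear maps (all names, shapes, and the choice of a linear map are illustrative assumptions):

```python
import numpy as np

def ada_layernorm(x, cond, W_gamma, W_beta, eps=1e-5):
    # gamma and beta come from a separate conditioning vector rather than
    # being learned per-feature constants.
    gamma = cond @ W_gamma
    beta = cond @ W_beta
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # layer activations
cond = rng.normal(size=(4, 6))    # conditioning input
y = ada_layernorm(x, cond, rng.normal(size=(6, 8)), rng.normal(size=(6, 8)))
```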
 
== CNN-specific normalization ==
 
=== Local response normalization ===
'''Local response normalization'''<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math>where <math>a_{x,y}^i</math> is the activation of the neuron at ___location <math>(x,y)</math> and channel <math>i</math>. In words, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.
 
The numbers <math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set.
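The formula can be sketched in numpy (a minimal illustration with the AlexNet hyperparameter values as defaults; the function name and HWC layout are assumptions):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (H, W, N): each channel is suppressed by the squared
    # activations of the same pixel in up to n adjacent channels.
    N = a.shape[-1]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = (a[..., lo:hi + 1] ** 2).sum(axis=-1)
        b[..., i] = a[..., i] / (k + alpha * s) ** beta
    return b

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4, 8))
b = local_response_norm(a)
```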
 
It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc' Aurelio |last4=LeCun |first4=Yann |date=September 2009 |chapter=What is the best multi-stage architecture for object recognition? |title=2009 IEEE 12th International Conference on Computer Vision |chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 |publisher=IEEE |pages=2146–2153 |doi=10.1109/iccv.2009.5459469 |isbn=978-1-4244-4420-5}}</ref><math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>where <math>\bar a_{x, y}^j</math> is the average activation in a small window centered on ___location <math>(x,y)</math> and channel <math>j</math>. The numbers <math>k, n, \alpha, \beta</math>, and the size of the small window, are hyperparameters picked by using a validation set.
 
Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where divisive normalization was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite journal |last1=Lyu |first1=Siwei |last2=Simoncelli |first2=Eero P. |date=2008 |title=Nonlinear Image Representation Using Divisive Normalization |url=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207373/ |journal=Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition |volume=2008 |pages=1–8 |doi=10.1109/CVPR.2008.4587821 |issn=1063-6919 |pmc=4207373 |pmid=25346590 |isbn=978-1-4244-2242-5}}</ref>
 
Both kinds of local normalization were superseded by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285 |arxiv=1912.05845}}</ref>
 
=== Group normalization ===
'''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal |last1=Wu |first1=Yuxin |last2=He |first2=Kaiming |date=2018 |title=Group Normalization |url=https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html |pages=3–19}}</ref> is a technique used only for CNNs. It can be understood as LayerNorm for CNNs applied once per group of channels.
 
Suppose at a layer <math>l</math> there are channels <math>1, 2, \dots, C</math>; we partition them into groups <math>g_1, \dots, g_G</math> and apply LayerNorm to each group.
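A numpy sketch of this partition-then-normalize scheme (assuming an NHWC layout and contiguous channel groups; the function name is illustrative):

```python
import numpy as np

def groupnorm(x, num_groups, gamma, beta, eps=1e-5):
    # x has shape (N, H, W, C); channels are split into contiguous groups
    # and each group is normalized with its own per-sample statistics.
    N, H, W, C = x.shape
    g = x.reshape(N, H, W, num_groups, C // num_groups)
    mu = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    g = (g - mu) / np.sqrt(var + eps)
    return gamma * g.reshape(N, H, W, C) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 6))
y = groupnorm(x, num_groups=3, gamma=np.ones(6), beta=np.zeros(6))
```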
 
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is only used for CNNs.<ref>{{cite arXiv |last1=Ulyanov |first1=Dmitry |last2=Vedaldi |first2=Andrea |last3=Lempitsky |first3=Victor |title=Instance Normalization: The Missing Ingredient for Fast Stylization |date=2017-11-06 |eprint=1607.08022}}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W \left(x^{(l)}_{h, w, c} - \mu^{(l)}_c\right)^2 \\
\hat{x}^{(l)}_{h, w, c} &= \frac{x^{(l)}_{h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}}
\end{aligned}
</math>
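In numpy, per-channel, per-sample normalization looks like this (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def instancenorm(x, eps=1e-5):
    # Each (sample, channel) pair gets its own mean and variance,
    # computed over the spatial dimensions; x has shape (N, H, W, C).
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 3))
y = instancenorm(x)
```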
 
=== Adaptive instance normalization ===
'''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with CNN, not for CNN in general.<ref>{{Cite journal |last1=Huang |first1=Xun |last2=Belongie |first2=Serge |date=2017 |title=Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization |url=https://openaccess.thecvf.com/content_iccv_2017/html/Huang_Arbitrary_Style_Transfer_ICCV_2017_paper.html |pages=1501–1510 |arxiv=1703.06868}}</ref>
 
In the AdaIN method of style transfer, we take a CNN and two input images, one '''content''' and one '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, AdaIN is applied.
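At that layer, AdaIN rescales each channel of the content features to match the corresponding style-feature statistics. A numpy sketch (assuming an NHWC layout; the function name is illustrative):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    # Shift and scale each content channel so that its mean and standard
    # deviation match those of the corresponding style channel.
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    sd_c = content.std(axis=(1, 2), keepdims=True)
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    sd_s = style.std(axis=(1, 2), keepdims=True)
    return sd_s * (content - mu_c) / (sd_c + eps) + mu_s

rng = np.random.default_rng(0)
c = rng.normal(size=(1, 8, 8, 4))
s = rng.normal(loc=2.0, scale=3.0, size=(1, 8, 8, 4))
out = adain(c, s)
```

The output keeps the spatial structure of the content features while adopting the per-channel statistics of the style features.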