Content deleted Content added
m clean up |
Citation bot (talk | contribs) Alter: template type, title. Add: isbn, chapter-url, pages, chapter, arxiv, eprint, authors 1-1. Removed or converted URL. Removed access-date with no URL. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar |
||
Line 43:
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization if that is beneficial.<ref name=":1">{{Cite book |
Because a neural network can always be topped with a linear transform layer on top, BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus purely on modelling the nonlinear aspects of data.<ref>{{Cite journal |
It is claimed in the original publication that BatchNorm works by reducing "internal covariance shift", though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization |url=https://ieeexplore.ieee.org/document/9238401 |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=32 |issue=11 |pages=5082–5092 |doi=10.1109/TNNLS.2020.3026784 |issn=2162-237X |pmid=33095717}}</ref> and detractors.<ref>{{Cite journal |last1=Bjorck |first1=Nils |last2=Gomes |first2=Carla P |last3=Selman |first3=Bart |last4=Weinberger |first4=Kilian Q |date=2018 |title=Understanding Batch Normalization |url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.02375}}</ref><ref>{{Cite journal |last1=Santurkar |first1=Shibani |last2=Tsipras |first2=Dimitris |last3=Ilyas |first3=Andrew |last4=Madry |first4=Aleksander |date=2018 |title=How Does Batch Normalization Help Optimization? |url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
Line 51:
The original paper<ref name=":0" /> recommended to only use BatchNorms after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b </math> does not matter, since will be canceled by the subsequent mean subtraction, so it is of form <math>\mathrm{BN}(Wx)</math>. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />
For [[convolutional neural network]]s (CNN), BatchNorm must preserve the translation-invariance of CNN, which means that it must treat all outputs of the same kernel as if they are different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web |title=BatchNorm2d — PyTorch 2.4 documentation |url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html |access-date=2024-09-26 |website=pytorch.org}}</ref><ref>{{Cite book |
Concretely, suppose we have a 2-dimensional convolutional layer defined by<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>where
Line 97:
== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite journal |last1=Ba |first1=Jimmy Lei |last2=Kiros |first2=Jamie Ryan |last3=Hinton |first3=Geoffrey E. |date=2016 |title=Layer Normalization
For a given data input and layer, LayerNorm computes the mean (<math>\mu</math>) and variance (<math>\sigma^2</math>) over all the neurons in the layer. Similar to BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:<math display="block">\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x_i} + \beta_i</math>where <math>
Line 119:
</math> is added.
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|Transformers]],<ref>{{cite
For example, if the hidden vector in an RNN at timestep <math>
Line 134:
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{i=1}^D x_i^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>Essentially it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>.
== Other normalizations ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |
== CNN-specific normalization ==
Line 149:
=== Local response normalization ===
'''Local response normalization'''<ref>{{Cite journal |
The numbers <math>k, n, \alpha, \beta</math> are hyperparameters picked by using a validation set.
It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite
Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where it was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite journal |
Both kinds of local normalization were obsoleted by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |
=== Group normalization ===
'''Group normalization''' ('''GroupNorm''')<ref>{{Cite journal |
Suppose at a layer <math>l</math>, there are channels <math>1, 2, \dots, C</math>, then we partition it into groups <math>g_1, \dots, g_G</math>. Then, we apply LayerNorm to each group.
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is only used for CNNs.<ref>{{cite
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^Wx^{(l)}_{h, w, c} \\
Line 175:
=== Adaptive instance normalization ===
'''Adaptive instance normalization''' ('''AdaIN''') is a variant of instance normalization, designed specifically for neural style transfer with CNN, not for CNN in general.<ref>{{Cite journal |
In the AdaIN method of style transfer, we take a CNN, and two input images, one '''content''' and one '''style'''. Each image is processed through the same CNN, and at a certain layer <math>l</math>, the AdaIn is applied.
|