Normalization (machine learning)
The original paper<ref name=":0" /> recommended applying BatchNorm only after a linear transform, not after a nonlinear activation. That is, <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it will be canceled by the subsequent mean subtraction, so the layer takes the form <math>\mathrm{BN}(Wx)</math>. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />
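The bias cancellation can be checked numerically. The following sketch (in NumPy, not the original paper's code; the array shapes are illustrative assumptions) shows that normalizing <math>Wx + b</math> over the batch gives the same result as normalizing <math>Wx</math>, because the constant bias is removed by the mean subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))   # batch of 32 inputs with 4 features
W = rng.standard_normal((4, 8))    # linear transform to 8 features
b = rng.standard_normal(8)         # bias term

def batchnorm(z, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0).
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

# BN(Wx + b) equals BN(Wx): adding b shifts the batch mean by exactly b,
# so the subtraction cancels it; the variance is unchanged.
assert np.allclose(batchnorm(x @ W + b), batchnorm(x @ W))
```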
 
For [[Convolutional neural network|convolutional neural networks]] (CNNs), BatchNorm must preserve the translation invariance of the network, which means it must treat all outputs of the same kernel as if they were different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, BatchNorm2D, or per-channel BatchNorm.<ref>{{Cite web |title=BatchNorm2d — PyTorch 2.4 documentation |url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html |access-date=2024-09-26 |website=pytorch.org}}</ref><ref>{{Cite book |last=Zhang |first=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |___location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=8.5. Batch Normalization |chapter-url=https://d2l.ai/chapter_convolutional-modern/batch-norm.html}}</ref>
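A minimal NumPy sketch of per-channel (spatial) BatchNorm, assuming the <code>(N, C, H, W)</code> layout used by PyTorch's <code>BatchNorm2d</code>: one mean and variance is computed per channel, pooled over the batch and both spatial axes, so every output of the same kernel is normalized with the same statistics.

```python
import numpy as np

def batchnorm2d(x, eps=1e-5):
    """Per-channel (spatial) BatchNorm for an (N, C, H, W) array.

    Statistics for each channel are pooled over the batch axis and
    both spatial axes, so all positions produced by the same kernel
    share one mean and one variance.
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # one variance per channel
    return (x - mean) / np.sqrt(var + eps)

# Illustrative shapes: batch of 8 images, 3 channels, 5x5 spatial extent.
x = np.random.default_rng(1).standard_normal((8, 3, 5, 5))
y = batchnorm2d(x)
# After normalization, each channel has approximately zero mean and unit variance.
```

(The learnable scale and shift parameters <math>\gamma, \beta</math> are omitted here for brevity; a full implementation would apply them per channel after normalizing.)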
 
Concretely, suppose we have a 2-dimensional convolutional layer defined by <math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>where