Revision as of 20:32, 26 September 2024 by Cosmia Nebula (→Batch normalization: spatial batchnorm)
Revision as of 21:30, 26 September 2024 by Cosmia Nebula (m →Special cases)
Line 49:
=== Special cases ===
The original paper<ref name=":0" /> recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, ~~something like~~ <math>\phi(\mathrm{BN}(Wx + b))</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Also, the bias <math>b</math> does not matter, since it will be canceled by the subsequent mean subtraction, so the layer has the form <math>\mathrm{BN}(Wx)</math>. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />

For [[Convolutional neural network\|convolutional neural networks]] (CNN), BatchNorm must preserve the translation-invariance of CNN, which means that it must treat all outputs of the same kernel as if they are different data points within a batch.<ref name=":0" /> This is sometimes called Spatial BatchNorm, or BatchNorm2D.<ref>{{Cite web \|title=BatchNorm2d — PyTorch 2.4 documentation \|url=https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html \|access-date=2024-09-26 \|website=pytorch.org}}</ref>

Concretely, suppose we have a 2-dimensional convolutional layer defined by<math display="block">x^{(l)}_{h, w, c} = \sum_{h', w', c'} K^{(l)}_{h'-h, w'-w, c, c'} x_{h', w', c'}^{(l-1)} + b^{(l)}_c</math>where
Line 58:
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.
In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch.
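Both points above can be checked numerically: the bias before a BatchNorm is cancelled by the mean subtraction, and spatial BatchNorm pools its statistics over the batch and both spatial axes. The following is a minimal NumPy sketch, not a reference implementation. The helper name <code>batchnorm</code>, the value of <code>eps</code>, the array sizes, and the (batch, height, width, channel) layout are illustrative assumptions, and the learned scale and shift are omitted.

```python
import numpy as np

def batchnorm(z, axes, eps=1e-5):
    """Hypothetical helper: normalize z to zero mean and unit variance
    over the given axes (no learned scale or shift)."""
    mean = z.mean(axis=axes, keepdims=True)
    var = z.var(axis=axes, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# 1) Bias cancellation for a linear layer: BN(Wx + b) equals BN(Wx),
#    because adding b shifts the batch mean by b and leaves the variance unchanged.
x = rng.normal(size=(32, 8))   # batch of 32 inputs with 8 features (assumed sizes)
W = rng.normal(size=(8, 4))
b = rng.normal(size=(4,))
with_bias = batchnorm(x @ W + b, axes=(0,))
without_bias = batchnorm(x @ W, axes=(0,))
bias_cancels = np.allclose(with_bias, without_bias)

# 2) Spatial BatchNorm: one mean and one variance per channel,
#    pooled over the batch axis and both spatial axes.
fmap = rng.normal(loc=3.0, scale=2.0, size=(16, 5, 5, 3))  # assumed layout (B, H, W, C)
normed = batchnorm(fmap, axes=(0, 1, 2))
channel_means = normed.mean(axis=(0, 1, 2))
channel_vars = normed.var(axis=(0, 1, 2))

print(bias_cancels, channel_means.shape)
```

Note that PyTorch's <code>BatchNorm2d</code> expects a channels-first layout, so the pooled axes would differ there; the channels-last layout here is chosen only to match the index order used in this section.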
That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not per ''activation'' <math>x^{(l+1)}_{h, w, c}</math>:<math display="block">\begin{aligned} \mu^{(l)}_c &= \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b), h, w, c} \\

Normalization (machine learning): Difference between revisions