Normalization (machine learning)

 
=== Special cases ===
The original paper<ref name=":0" /> recommended applying BatchNorm only after a linear transform, not after a nonlinear activation. That is, something like <math>\mathrm{BN}(Wx + b)</math>, not <math>\mathrm{BN}(\phi(Wx + b))</math>. Moreover, the bias <math>b</math> has no effect, since it is canceled by the subsequent mean subtraction, so the form reduces to <math>\mathrm{BN}(Wx)</math>. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.<ref name=":0" />
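The bias cancellation can be checked numerically. The following NumPy sketch (illustrative; the function name <code>batchnorm</code> and the specific shapes are not from the paper) normalizes each feature over the batch dimension and verifies that <math>\mathrm{BN}(Wx+b) = \mathrm{BN}(Wx)</math>:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight matrix of a linear layer
b = rng.normal(size=(4,))     # bias of the linear layer
x = rng.normal(size=(8, 3))   # a batch of 8 input vectors

def batchnorm(z, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0).
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

with_bias = batchnorm(x @ W.T + b)
without_bias = batchnorm(x @ W.T)
# The constant bias shifts the mean, so mean subtraction removes it exactly.
assert np.allclose(with_bias, without_bias)
```

Because the bias is a constant added to every example in the batch, it shifts the batch mean by exactly the same amount and is subtracted back out.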
 
For [[Convolutional neural network|convolutional neural networks]] (CNNs), BatchNorm must preserve the translation invariance of the network, which means that it must treat all outputs of the same kernel as if they were different data points within a batch.<ref name=":0" />
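Concretely, for an activation tensor of shape (batch, channels, height, width), the normalization statistics are shared across the batch and both spatial dimensions, so that every output of a given kernel (channel) is normalized with the same mean and variance. A minimal NumPy sketch (illustrative shapes and names, without the learned scale and shift parameters):

```python
import numpy as np

def batchnorm_cnn(x, eps=1e-5):
    # x has shape (N, C, H, W). Statistics are computed per channel,
    # pooling over the batch (N) and spatial (H, W) dimensions, so all
    # outputs of the same kernel are treated as one set of data points.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(1).normal(size=(2, 3, 4, 4))
y = batchnorm_cnn(x)
```

After normalization, each channel of <code>y</code> has approximately zero mean and unit variance over the pooled dimensions.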
Line 147:
== CNN-specific normalization ==
There are some activation normalization techniques that are only used for CNNs.
 
=== Local response normalization ===
'''Local response normalization'''<ref>{{Cite journal |last=Krizhevsky |first=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=25}}</ref> was used in [[AlexNet]]. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by<math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j\right)^2\right)^\beta}</math>where <math>a_{x,y}^i</math> is the activation of the neuron at ___location <math>(x,y)</math> and channel <math>i</math>, <math>N</math> is the number of channels, and <math>k, n, \alpha, \beta</math> are hyperparameters. In words, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.
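The formula above can be sketched directly in NumPy. The defaults below are the hyperparameters reported in the AlexNet paper (<math>k=2</math>, <math>n=5</math>, <math>\alpha=10^{-4}</math>, <math>\beta=0.75</math>); the function name and the (C, H, W) layout are illustrative choices:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (C, H, W): channels, height, width.
    # Each pixel is divided by a term summing squared activations of the
    # same pixel over a window of n adjacent channels, per the formula.
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo = max(0, i - n // 2)
        hi = min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.ones((5, 2, 2))
b = local_response_norm(a)
```

With <math>k > 1</math>, the denominator exceeds 1, so positive activations are always damped; channels with strongly active neighbors are damped more, giving the lateral-inhibition effect described above.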
 
=== Group normalization ===