Revision as of 21:41, 27 September 2024 by Citation bot (Bots, 5,869,708 edits). Edit summary: Added class. Suggested by Headbomb. #UCB_toolbar
Revision as of 20:34, 8 October 2024 by Cosmia Nebula (Extended confirmed users, 11,305 edits). Edit summary: →Interpretation: improvement. Tag: Visual edit
Line 44:
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization if that is beneficial.<ref name=":1">{{Cite book \|last1=Goodfellow \|first1=Ian \|title=Deep learning \|last2=Bengio \|first2=Yoshua \|last3=Courville \|first3=Aaron \|date=2016 \|publisher=The MIT Press \|isbn=978-0-262-03561-3 \|series=Adaptive computation and machine learning \|location=Cambridge, Massachusetts \|chapter=8.7.1. Batch Normalization}}</ref> ~~Because a neural network can always be topped with a linear transform layer on top,~~ BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus on modelling the nonlinear aspects of the data. This may be beneficial, since a linear transform layer can always be added on top of a neural network.<ref>{{Cite journal \|last1=Desjardins \|first1=Guillaume \|last2=Simonyan \|first2=Karen \|last3=Pascanu \|first3=Razvan \|last4=Kavukcuoglu \|first4=Koray \|date=2015 \|title=Natural Neural Networks \|url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=28}}</ref><ref name=":1" /> The original publication claims that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal \|last1=Xu \|first1=Jingjing \|last2=Sun \|first2=Xu \|last3=Zhang \|first3=Zhiyuan \|last4=Zhao \|first4=Guangxiang \|last5=Lin \|first5=Junyang \|date=2019 \|title=Understanding and Improving Layer Normalization \|url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=32 \|arxiv=1911.07013}}</ref><ref>{{Cite journal \|last1=Awais \|first1=Muhammad \|last2=Bin Iqbal \|first2=Md. Tauhid \|last3=Bae \|first3=Sung-Ho \|date=November 2021 \|title=Revisiting Internal Covariate Shift for Batch Normalization \|url=https://ieeexplore.ieee.org/document/9238401 \|journal=IEEE Transactions on Neural Networks and Learning Systems \|volume=32 \|issue=11 \|pages=5082–5092 \|doi=10.1109/TNNLS.2020.3026784 \|issn=2162-237X \|pmid=33095717}}</ref> and detractors.<ref>{{Cite journal \|last1=Bjorck \|first1=Nils \|last2=Gomes \|first2=Carla P \|last3=Selman \|first3=Bart \|last4=Weinberger \|first4=Kilian Q \|date=2018 \|title=Understanding Batch Normalization \|url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31 \|arxiv=1806.02375}}</ref><ref>{{Cite journal \|last1=Santurkar \|first1=Shibani \|last2=Tsipras \|first2=Dimitris \|last3=Ilyas \|first3=Andrew \|last4=Madry \|first4=Aleksander \|date=2018 \|title=How Does Batch Normalization Help Optimization? \|url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31}}</ref>
Line 95:
return y </syntaxhighlight>
=== Improvements ===
BatchNorm has been very popular, and many improvements have been attempted. Some examples include:<ref name=":3">https://arxiv.org/pdf/1906.03548</ref>
* Ghost batch: Randomly partition a batch into sub-batches and perform BatchNorm separately on each.
* Weight decay on <math>\gamma</math> and <math>\beta</math>.
* Combine BatchNorm with GroupNorm.
A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, but during inference, the mean and variance are frozen at values accumulated during training (usually as an [[exponential moving average]]). This train-test disparity degrades performance.
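The train-test disparity described above can be illustrated with a minimal sketch. It assumes a single scalar feature, omits <math>\gamma</math> and <math>\beta</math>, and uses only the Python standard library. The class name <code>ScalarBatchNorm</code>, the <code>momentum</code> value, and the constant <code>EPS</code> are hypothetical choices made for illustration, not taken from the cited sources.

```python
from statistics import fmean

EPS = 1e-5  # small constant for numerical stability (assumed value)


def batch_stats(xs):
    """Return the mean and (biased) variance of a list of floats."""
    mu = fmean(xs)
    var = fmean([(x - mu) ** 2 for x in xs])
    return mu, var


def normalize(xs, mu, var):
    """Standardize xs with the given statistics (gamma = 1, beta = 0)."""
    return [(x - mu) / (var + EPS) ** 0.5 for x in xs]


class ScalarBatchNorm:
    """Toy BatchNorm for one scalar feature, without gamma and beta."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum  # hypothetical EMA update rate
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, xs):
        # Training: normalize with the statistics of the current batch ...
        mu, var = batch_stats(xs)
        # ... and fold them into the exponential moving averages.
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mu
        self.running_var = (1 - m) * self.running_var + m * var
        return normalize(xs, mu, var)

    def infer(self, xs):
        # Inference: use the frozen running statistics instead.
        return normalize(xs, self.running_mean, self.running_var)


bn = ScalarBatchNorm()
batch = [2.0, 4.0, 6.0, 8.0]
train_out = bn.train_step(batch)  # normalized with batch statistics
infer_out = bn.infer(batch)       # normalized with running statistics
print(fmean(train_out), fmean(infer_out))
```

After a single training step the running averages still lag behind the batch statistics, so the same input is centred near zero in training mode but not in inference mode. In practice the gap shrinks as the averages converge, but it does not vanish when the inference data differ from the training batches.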
The disparity can be decreased by simulating the moving average during inference:<ref name=":3" />{{Pg\|location=Eq. 3}}
<math display="block"> \begin{aligned} \mu &= \alpha E[x] + (1 - \alpha) \mu_{x, \text{train}} \\ \sigma^2 &= (\alpha E[x^2] + (1 - \alpha) \mu_{x^2, \text{train}}) - \mu^2 \end{aligned} </math>
== Layer normalization ==

Normalization (machine learning): Difference between revisions