Normalization (machine learning): Difference between revisions

Line 44:
=== Interpretation ===
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization if that is beneficial.<ref name=":1">{{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT Press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |location=Cambridge, Massachusetts |chapter=8.7.1. Batch Normalization}}</ref>
Because a neural network can always be topped with a linear transform layer, BatchNorm can be interpreted as removing the purely linear transformations from the preceding layers, so that those layers focus on modelling the nonlinear aspects of the data, which may be beneficial.<ref>{{Cite journal |last1=Desjardins |first1=Guillaume |last2=Simonyan |first2=Karen |last3=Pascanu |first3=Razvan |last4=Kavukcuoglu |first4=Koray |date=2015 |title=Natural Neural Networks |url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28}}</ref><ref name=":1" />
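
The "undo" property can be checked numerically. The following is a minimal sketch in plain Python on a batch of scalar activations, not the implementation of any library. The function name <code>batchnorm_1d</code> and the stabilizing constant <code>eps</code> are assumptions made for illustration. Setting <math>\gamma</math> to the batch standard deviation and <math>\beta</math> to the batch mean makes the layer return its input unchanged.

```python
import math


def batchnorm_1d(xs, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta.

    Hypothetical helper for illustration; eps is an assumed stabilizer.
    """
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


eps = 1e-5
xs = [1.0, 2.0, 4.0, 7.0]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# gamma = sqrt(var + eps) and beta = mean cancel the normalization exactly.
recovered = batchnorm_1d(xs, gamma=math.sqrt(var + eps), beta=mean, eps=eps)

# gamma = 1 and beta = 0 leave the batch standardized (zero mean).
standardized = batchnorm_1d(xs, gamma=1.0, beta=0.0, eps=eps)
```

Here <code>recovered</code> matches <code>xs</code> up to floating-point error, while <code>standardized</code> has mean zero.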
 
The original publication claims that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization |url=https://ieeexplore.ieee.org/document/9238401 |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=32 |issue=11 |pages=5082–5092 |doi=10.1109/TNNLS.2020.3026784 |issn=2162-237X |pmid=33095717}}</ref> and detractors.<ref>{{Cite journal |last1=Bjorck |first1=Nils |last2=Gomes |first2=Carla P |last3=Selman |first3=Bart |last4=Weinberger |first4=Kilian Q |date=2018 |title=Understanding Batch Normalization |url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.02375}}</ref><ref>{{Cite journal |last1=Santurkar |first1=Shibani |last2=Tsipras |first2=Dimitris |last3=Ilyas |first3=Andrew |last4=Madry |first4=Aleksander |date=2018 |title=How Does Batch Normalization Help Optimization? |url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
Line 95:
return y
</syntaxhighlight>
 
=== Improvements ===
BatchNorm has been very popular, and many improvements to it have been proposed. Some examples include:<ref name=":3">https://arxiv.org/pdf/1906.03548</ref>
 
* Ghost batch: Randomly partition a batch into sub-batches and perform BatchNorm separately on each.
* Weight decay on <math>\gamma</math> and <math>\beta</math>.
* Combine BatchNorm with GroupNorm.
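
The ghost batch idea in the list above can be sketched as follows. This is a minimal illustration in plain Python on scalar activations, not a reference implementation. The names <code>ghost_batchnorm</code>, <code>ghost_size</code>, <code>eps</code> and <code>rng</code> are assumptions introduced here. The batch indices are shuffled, split into sub-batches, and each sub-batch is normalized with its own mean and variance.

```python
import math
import random


def normalize(xs, eps=1e-5):
    """Standardize a list of scalars using its own mean and biased variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def ghost_batchnorm(xs, ghost_size, gamma=1.0, beta=0.0, eps=1e-5, rng=None):
    """Apply BatchNorm separately on random sub-batches of size ghost_size.

    Hypothetical helper; the last sub-batch is smaller when len(xs) is not
    a multiple of ghost_size.
    """
    rng = rng or random.Random()
    order = list(range(len(xs)))
    rng.shuffle(order)  # random partition of the batch indices
    out = [0.0] * len(xs)
    for start in range(0, len(order), ghost_size):
        idx = order[start:start + ghost_size]
        normed = normalize([xs[i] for i in idx], eps)
        for i, z in zip(idx, normed):
            out[i] = gamma * z + beta  # results return to their original positions
    return out


batch = [0.5, 1.5, -2.0, 3.0, 4.5, -1.0, 2.5, 0.0]
ghost_out = ghost_batchnorm(batch, ghost_size=4, rng=random.Random(0))
```

Because every sub-batch is centred on its own mean, the output still sums to zero overall, but the statistics used for each element are noisier than those of the full batch.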
 
A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, whereas during inference they are frozen at the values accumulated during training (usually as an [[exponential moving average]]). This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference, blending the moments of the inference-time batch with the stored training moments:<ref name=":3" />{{Pg|location=Eq. 3}}<math display="block">
\begin{aligned}
\mu &= \alpha E[x] + (1 - \alpha) \mu_{x, \text{train}} \\
\sigma^2 &= (\alpha E[x^2] + (1 - \alpha) \mu_{x^2, \text{train}}) - \mu^2
\end{aligned}
</math>
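
The blending of statistics can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the cited paper's code. The function name <code>blended_stats</code> and its parameter names are hypothetical, activations are scalars, and the variance is computed from the blended second moment <math>E[x^2]</math> minus the squared blended mean.

```python
def blended_stats(xs, train_mean, train_sq_mean, alpha):
    """Blend inference-batch moments with stored training moments.

    train_mean and train_sq_mean stand for the stored first and second
    moments from training; alpha weights the current batch.
    """
    batch_mean = sum(xs) / len(xs)                    # E[x]
    batch_sq_mean = sum(x * x for x in xs) / len(xs)  # E[x^2]
    mu = alpha * batch_mean + (1 - alpha) * train_mean
    second_moment = alpha * batch_sq_mean + (1 - alpha) * train_sq_mean
    var = second_moment - mu ** 2
    return mu, var


# alpha = 1 uses only the current batch; alpha = 0 uses only training moments.
mu_batch, var_batch = blended_stats([1.0, 3.0], 0.0, 1.0, alpha=1.0)
mu_train, var_train = blended_stats([1.0, 3.0], 0.5, 1.25, alpha=0.0)
```

With <math>\alpha = 0</math> this reduces to the usual frozen-statistics inference, and with <math>\alpha = 1</math> it reduces to normalizing with the statistics of the current batch alone.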
 
== Layer normalization ==