Normalization (machine learning)

<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization, if this is beneficial.<ref name=":1">{{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT Press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |___location=Cambridge, Massachusetts |chapter=8.7.1. Batch Normalization}}</ref> BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.<ref>{{Cite journal |last1=Desjardins |first1=Guillaume |last2=Simonyan |first2=Karen |last3=Pascanu |first3=Razvan |last4=kavukcuoglu |first4=koray |date=2015 |title=Natural Neural Networks |url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28}}</ref><ref name=":1" />
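As a minimal illustration, the following NumPy sketch (function and variable names are illustrative, not from any particular library) standardizes a batch per feature and shows how the affine parameters <math>\gamma</math> and <math>\beta</math> can approximately undo the normalization:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize activations of shape (batch, features), then rescale."""
    mu = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learned affine transformation

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(256, 4))

# With gamma = 1, beta = 0 the output is standardized per feature.
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

# Choosing gamma = std(x), beta = mean(x) approximately restores the input,
# illustrating how the learned parameters can undo the normalization.
y2 = batch_norm(x, gamma=x.std(axis=0), beta=x.mean(axis=0))
```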
 
The original publication claimed that BatchNorm works by reducing internal covariate shift, though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization |url=https://ieeexplore.ieee.org/document/9238401 |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=32 |issue=11 |pages=5082–5092 |doi=10.1109/TNNLS.2020.3026784 |issn=2162-237X |pmid=33095717 |bibcode=2021ITNNL..32.5082A}}</ref> and detractors.<ref>{{Cite journal |last1=Bjorck |first1=Nils |last2=Gomes |first2=Carla P |last3=Selman |first3=Bart |last4=Weinberger |first4=Kilian Q |date=2018 |title=Understanding Batch Normalization |url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.02375}}</ref><ref>{{Cite journal |last1=Santurkar |first1=Shibani |last2=Tsipras |first2=Dimitris |last3=Ilyas |first3=Andrew |last4=Madry |first4=Aleksander |date=2018 |title=How Does Batch Normalization Help Optimization? |url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=31}}</ref>
 
=== Special cases ===
 
return y
</syntaxhighlight>For multilayered [[Recurrent neural network|recurrent neural networks]] (RNN), BatchNorm is usually applied only to the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book |last1=Laurent |first1=Cesar |last2=Pereyra |first2=Gabriel |last3=Brakel |first3=Philemon |last4=Zhang |first4=Ying |last5=Bengio |first5=Yoshua |date=March 2016 |chapter=Batch normalized recurrent neural networks |title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |publisher=IEEE |pages=2657–2661 |doi=10.1109/ICASSP.2016.7472159 |arxiv=1510.01378 |isbn=978-1-4799-9988-0}}</ref> Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)}) </math>where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)}) </math>There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence. Then frame-wise BatchNorm means normalizing over <math>b</math>:<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2 \\
\hat{h}_{b,t}^{(l)} &= \frac{h_{b,t}^{(l)} - \mu_t^{(l)}}{\sqrt{(\sigma_t^{(l)})^2 + \epsilon}}
\end{aligned}
</math>while sequence-wise BatchNorm means normalizing over both <math>b</math> and <math>t</math>. Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where entire sequences are available but have variable lengths. In a batch, shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates.<ref name=":4" />
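The distinction between the two batch definitions can be sketched in NumPy (shapes and variable names are illustrative):

```python
import numpy as np

# Hidden states for a batch of sequences: shape (B, T, D), where
# B = batch size, T = sequence length, D = hidden dimension.
rng = np.random.default_rng(0)
B, T, D = 8, 10, 16
h = rng.normal(size=(B, T, D))

# Frame-wise: separate statistics per time step t, over the batch axis only.
mu_frame = h.mean(axis=0, keepdims=True)      # shape (1, T, D)
var_frame = h.var(axis=0, keepdims=True)
h_frame = (h - mu_frame) / np.sqrt(var_frame + 1e-5)

# Sequence-wise: one set of statistics over both batch and time axes.
mu_seq = h.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, D)
var_seq = h.var(axis=(0, 1), keepdims=True)
h_seq = (h - mu_seq) / np.sqrt(var_seq + 1e-5)
```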
 
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{cite arXiv |eprint=1603.09025 |last1=Cooijmans |first1=Tim |last2=Ballas |first2=Nicolas |last3=Laurent |first3=César |last4=Gülçehre |first4=Çağlar |last5=Courville |first5=Aaron |title=Recurrent Batch Normalization |date=2016 |class=cs.LG}}</ref>
 
=== Improvements ===
 
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite arXiv |last1=Zhang |first1=Biao |last2=Sennrich |first2=Rico |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467 |class=cs.LG}}</ref> changes LayerNorm by:
 
<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{i=1}^D x_i^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>
 
Essentially, it is LayerNorm where we enforce <math>\mu = 0</math> and <math>\epsilon = 0</math>. It is also called '''L2 normalization'''. It is a special case of '''Lp normalization''', or '''power normalization''':<math display="block">
\hat{x_i} = \frac{x_i}{\left(\frac 1D \sum_{i=1}^D |x_i|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>where <math>p > 0</math> is a constant.
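The relationship between the two formulas can be checked numerically; the following NumPy sketch (function names are illustrative) implements Lp normalization and verifies that RMSNorm is the <math>p = 2</math> special case:

```python
import numpy as np

def lp_norm(x, p, gamma=1.0, beta=0.0, eps=1e-8):
    """Power (Lp) normalization of a vector x."""
    denom = np.mean(np.abs(x) ** p) ** (1.0 / p)
    return gamma * x / (denom + eps) + beta   # eps guards division by zero

def rms_norm(x, gamma=1.0, eps=1e-8):
    """RMSNorm: divide by the root mean square, with no mean subtraction."""
    return gamma * x / (np.sqrt(np.mean(x ** 2)) + eps)

x = np.array([1.0, -2.0, 3.0, -4.0])
# RMSNorm agrees with Lp normalization at p = 2 (with gamma = 1, beta = 0).
```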
 
=== Adaptive ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|transformers]].
 
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization|hyperparameter tuning]] and a "warm-up" in [[learning rate]], where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv |eprint=1906.01787 |last1=Wang |first1=Qiang |last2=Li |first2=Bei |last3=Xiao |first3=Tong |last4=Zhu |first4=Jingbo |last5=Li |first5=Changliang |last6=Wong |first6=Derek F. |last7=Chao |first7=Lidia S. |title=Learning Deep Transformer Models for Machine Translation |date=2019 |class=cs.CL}}</ref> was found to be easier to train, requiring no warm-up and leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
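The difference between the two placements can be sketched in NumPy (the sublayer here is a stand-in for an attention or feed-forward module, not an actual transformer implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector along the last axis."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original 2017): normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays an
    # identity, which is often credited for the easier training.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
sublayer = lambda h: np.tanh(h)   # toy stand-in sublayer
y_post = post_ln_block(x, sublayer)
y_pre = pre_ln_block(x, sublayer)
```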
 
'''FixNorm'''<ref>{{cite arXiv |eprint=1710.01329 |last1=Nguyen |first1=Toan Q. |last2=Chiang |first2=David |title=Improving Lexical Choice in Neural Machine Translation |date=2017 |class=cs.CL}}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |doi=10.5281/zenodo.3525484 |arxiv=1910.05895}}</ref> both normalize activation vectors in a transformer. FixNorm divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. ScaleNorm replaces all LayerNorms inside a transformer by division with the L2 norm, then multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |___location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379 |arxiv=2010.04245}}</ref> normalizes query and key vectors to have unit L2 norm.
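All three methods share the same core operation, division by the L2 norm followed by an optional learned scale, sketched here in NumPy (function name and the small `eps` guard are illustrative):

```python
import numpy as np

def l2_scale_norm(x, g=1.0, eps=1e-8):
    """Divide each vector (last axis) by its L2 norm, then scale by g.
    This operation underlies FixNorm (applied to output vectors) and
    ScaleNorm (replacing LayerNorm inside the transformer)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / (norm + eps)

# QKNorm: normalize query and key vectors to unit L2 norm (g = 1), so
# their dot products become cosine similarities, bounded in [-1, 1].
q = l2_scale_norm(np.array([[3.0, 4.0]]))
k = l2_scale_norm(np.array([[0.0, 5.0]]))
scores = q @ k.T   # entries lie in [-1, 1]
```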
 
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv |eprint=2410.01131 |last1=Loshchilov |first1=Ilya |last2=Hsieh |first2=Cheng-Ping |last3=Sun |first3=Simeng |last4=Ginsburg |first4=Boris |title=nGPT: Normalized Transformer with Representation Learning on the Hypersphere |date=2024 |class=cs.LG}}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
 
== Miscellaneous ==