Normalization (machine learning): Difference between revisions

Revision as of 04:59, 14 November 2024 by Perceptron599 (390 edits): Fix typo, general formatting improvements, add wikilinks. Tags: Mobile edit, Mobile web edit.
Latest revision as of 00:53, 27 August 2025 by Citation bot (bot, 5,868,077 edits): Added bibcode. Removed URL that duplicated identifier. Removed parameters. Suggested by Headbomb. Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox. #UCB_webform_linked 923/990
(15 intermediate revisions by 7 users not shown)
Line 78:
<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization, if this is beneficial.<ref name=":1">{{Cite book \|last1=Goodfellow \|first1=Ian \|title=Deep learning \|last2=Bengio \|first2=Yoshua \|last3=Courville \|first3=Aaron \|date=2016 \|publisher=The MIT Press \|isbn=978-0-262-03561-3 \|series=Adaptive computation and machine learning \|location=Cambridge, Massachusetts \|chapter=8.7.1. Batch Normalization}}</ref> BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of the data. This may be beneficial, since a neural network can always be augmented with a linear transformation layer on top.<ref>{{Cite journal \|last1=Desjardins \|first1=Guillaume \|last2=Simonyan \|first2=Karen \|last3=Pascanu \|first3=Razvan \|last4=Kavukcuoglu \|first4=Koray \|date=2015 \|title=Natural Neural Networks \|url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=28}}</ref><ref name=":1" /> The original publication claimed that BatchNorm works by reducing internal covariate shift, though the claim has both supporters<ref>{{Cite journal \|last1=Xu \|first1=Jingjing \|last2=Sun \|first2=Xu \|last3=Zhang \|first3=Zhiyuan \|last4=Zhao \|first4=Guangxiang \|last5=Lin \|first5=Junyang \|date=2019 \|title=Understanding and Improving Layer Normalization \|url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=32 \|arxiv=1911.07013}}</ref><ref>{{Cite journal \|last1=Awais \|first1=Muhammad \|last2=Bin Iqbal \|first2=Md. Tauhid \|last3=Bae \|first3=Sung-Ho \|date=November 2021 \|title=Revisiting Internal Covariate Shift for Batch Normalization \|journal=IEEE Transactions on Neural Networks and Learning Systems \|volume=32 \|issue=11 \|pages=5082–5092 \|doi=10.1109/TNNLS.2020.3026784 \|issn=2162-237X \|pmid=33095717 \|bibcode=2021ITNNL..32.5082A }}</ref> and detractors.<ref>{{Cite journal \|last1=Bjorck \|first1=Nils \|last2=Gomes \|first2=Carla P \|last3=Selman \|first3=Bart \|last4=Weinberger \|first4=Kilian Q \|date=2018 \|title=Understanding Batch Normalization \|url=https://proceedings.neurips.cc/paper/2018/hash/36072923bfc3cf47745d704feb489480-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31 \|arxiv=1806.02375}}</ref><ref>{{Cite journal \|last1=Santurkar \|first1=Shibani \|last2=Tsipras \|first2=Dimitris \|last3=Ilyas \|first3=Andrew \|last4=Madry \|first4=Aleksander \|date=2018 \|title=How Does Batch Normalization Help Optimization? \|url=https://proceedings.neurips.cc/paper/2018/hash/905056c1ac1dad141560467e0a99e1cf-Abstract.html \|journal=Advances in Neural Information Processing Systems \|publisher=Curran Associates, Inc. \|volume=31}}</ref>

=== Special cases ===
Line 95:
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.

To preserve translational invariance, BatchNorm treats all outputs of the same kernel within a batch as additional data points in that batch.
That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not per ''activation'' <math>x^{(l+1)}_{h, w, c}</math>: <math display="block">
Line 136:
    return y
</syntaxhighlight>
For multilayered [[Recurrent neural network\|recurrent neural networks]] (RNN), BatchNorm is usually applied only to the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book \|last1=Laurent \|first1=Cesar \|last2=Pereyra \|first2=Gabriel \|last3=Brakel \|first3=Philemon \|last4=Zhang \|first4=Ying \|last5=Bengio \|first5=Yoshua \|chapter=Batch normalized recurrent neural networks \|date=March 2016 \|title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) \|publisher=IEEE \|pages=2657–2661 \|doi=10.1109/ICASSP.2016.7472159 \|arxiv=1510.01378 \|isbn=978-1-4799-9988-0}}</ref>

Let the hidden state of the <math>l</math>-th layer at time <math>t</math> be <math>h_t^{(l)}</math>. The standard RNN, without normalization, satisfies<math display="block">h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)}) </math>where <math>W^{(l)}, U^{(l)}, b^{(l)}</math> are weights and biases, and <math>\phi</math> is the activation function. Applying BatchNorm, this becomes<math display="block">h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)}) </math>There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let <math>h_{b, t}^{(l)}</math> be the hidden state of the <math>l</math>-th layer for the <math>t</math>-th token of the <math>b</math>-th input sentence.
Then frame-wise BatchNorm means normalizing over <math>b</math>:<math display="block">
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
(\sigma_t^{(l)})^2 &= \frac{1}{B} \sum_{b=1}^B (h_{b,t}^{(l)} - \mu_t^{(l)})^2
\end{aligned}
</math>and sequence-wise means normalizing over <math>(b, t)</math>:<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T h_{b,t}^{(l)} \\
(\sigma^{(l)})^2 &= \frac{1}{BT} \sum_{b=1}^B\sum_{t=1}^T (h_{b,t}^{(l)} - \mu^{(l)})^2
\end{aligned}
</math>Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where entire sequences are available but have variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poor estimates of the statistics.<ref name=":4" />

It is also possible to apply BatchNorm to [[Long short-term memory\|LSTMs]].<ref>{{cite arXiv \| eprint=1603.09025 \| last1=Cooijmans \| first1=Tim \| last2=Ballas \| first2=Nicolas \| last3=Laurent \| first3=César \| last4=Gülçehre \| first4=Çağlar \| last5=Courville \| first5=Aaron \| title=Recurrent Batch Normalization \| date=2016 \| class=cs.LG }}</ref>

=== Improvements ===
Line 183 ⟶ 195:
</math>

Notice that the batch index <math>b</math> is removed, while the channel index <math>c</math> is added.
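
The index bookkeeping in the remark above can be made concrete with a small sketch. It assumes activations stored as nested Python lists indexed <code>x[b][h][w][c]</code>; the helper names <code>mean_var</code>, <code>batchnorm_stats</code> and <code>layernorm_stats</code> are hypothetical, and the affine parameters and the <math>\epsilon</math> term are left out. It only contrasts which indices the statistics are averaged over: BatchNorm keeps the channel index and averages over <math>(b, h, w)</math>, while LayerNorm keeps the batch index and averages over <math>(h, w, c)</math>.

```python
from itertools import product


def mean_var(values):
    """Return the mean and the (biased) variance of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


def batchnorm_stats(x):
    """One (mean, variance) pair per channel c, averaged over (b, h, w)."""
    B, H, W, C = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [
        mean_var([x[b][h][w][c]
                  for b, h, w in product(range(B), range(H), range(W))])
        for c in range(C)
    ]


def layernorm_stats(x):
    """One (mean, variance) pair per sample b, averaged over (h, w, c)."""
    return [
        mean_var([v for row in sample for pixel in row for v in pixel])
        for sample in x
    ]


# Toy activations with B=2, H=1, W=2, C=2 (illustrative values only).
x = [
    [[[1.0, 10.0], [3.0, 30.0]]],
    [[[5.0, 50.0], [7.0, 70.0]]],
]
bn = batchnorm_stats(x)   # list of length C
ln = layernorm_stats(x)   # list of length B
print(len(bn), len(ln))
```

In this sketch the BatchNorm result has one entry per channel and no batch index, whereas the LayerNorm result has one entry per sample and no channel index.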
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)\|transformers]],<ref>{{cite arXiv \|last1=Phuong \|first1=Mary \|title=Formal Algorithms for Transformers \|date=2022-07-19 \|eprint=2207.09238 \|last2=Hutter \|first2=Marcus\|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)} \in \mathbb{R}^{D}
Line 195 ⟶ 207:
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm'''):<ref>{{cite arXiv \|last1=Zhang \|first1=Biao \|title=Root Mean Square Layer Normalization \|date=2019-10-16 \|eprint=1910.07467 \|last2=Sennrich \|first2=Rico\|class=cs.LG }}</ref> <math display="block">
Line 201 ⟶ 213:
</math>

Essentially, it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>. It is also called '''L2 normalization'''. It is a special case of '''Lp normalization''', or '''power normalization''':<math display="block">
\hat{x_i} = \frac{x_i}{\left(\frac 1D \sum_{j=1}^D \|x_j\|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>where <math>p > 0</math> is a constant.

=== Adaptive ===
Line 219 ⟶ 233:
By reassigning <math>W_i \leftarrow \frac{W_i}{\\|W_i\\|_s}</math> after each update of the discriminator, we can upper-bound <math>\\|W_i\\|_s \leq 1</math>, and thus upper-bound <math>\\|D \\|_L</math>.

The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>. Then, at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess for the algorithm. Since <math>W_i(t+1)</math> is very close to <math>W_i(t)</math>, so is <math>x^*_i(t)</math> to <math>x^*_i(t+1)</math>, thus allowing rapid convergence.

== CNN-specific normalization ==
Line 279 ⟶ 293:
Some normalization methods were designed for use in [[Transformer (deep learning architecture)\|transformers]].
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. This configuration was difficult to train, requiring careful [[Hyperparameter optimization\|hyperparameter tuning]] and a "warm-up" of the [[learning rate]], in which the learning rate starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv \| eprint=1906.01787 \| last1=Wang \| first1=Qiang \| last2=Li \| first2=Bei \| last3=Xiao \| first3=Tong \| last4=Zhu \| first4=Jingbo \| last5=Li \| first5=Changliang \| last6=Wong \| first6=Derek F. \| last7=Chao \| first7=Lidia S. \| title=Learning Deep Transformer Models for Machine Translation \| date=2019 \| class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up and leading to faster convergence.<ref name="auto1">{{cite arXiv \|eprint=2002.04745 \|class=cs.LG \|first1=Ruibin \|last1=Xiong \|first2=Yunchang \|last2=Yang \|title=On Layer Normalization in the Transformer Architecture \|date=2020-06-29 \|last3=He \|first3=Di \|last4=Zheng \|first4=Kai \|last5=Zheng \|first5=Shuxin \|last6=Xing \|first6=Chen \|last7=Zhang \|first7=Huishuai \|last8=Lan \|first8=Yanyan \|last9=Wang \|first9=Liwei \|last10=Liu \|first10=Tie-Yan}}</ref>

'''FixNorm'''<ref>{{cite arXiv \| eprint=1710.01329 \| last1=Nguyen \| first1=Toan Q. \| last2=Chiang \| first2=David \| title=Improving Lexical Choice in Neural Machine Translation \| date=2017 \| class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal \|last1=Nguyen \|first1=Toan Q. \|last2=Salazar \|first2=Julian \|date=2019-11-02 \|title=Transformers without Tears: Improving the Normalization of Self-Attention \|doi=10.5281/zenodo.3525484\|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. FixNorm divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. ScaleNorm replaces every LayerNorm inside a transformer with division by the L2 norm, followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer).

'''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal \|last1=Henry \|first1=Alex \|last2=Dachapally \|first2=Prudhvi Raj \|last3=Pawar \|first3=Shubham Shantaram \|last4=Chen \|first4=Yuxuan \|date=November 2020 \|editor-last=Cohn \|editor-first=Trevor \|editor2-last=He \|editor2-first=Yulan \|editor3-last=Liu \|editor3-first=Yang \|title=Query-Key Normalization for Transformers \|url=https://aclanthology.org/2020.findings-emnlp.379/ \|journal=Findings of the Association for Computational Linguistics: EMNLP 2020 \|location=Online \|publisher=Association for Computational Linguistics \|pages=4246–4253 \|doi=10.18653/v1/2020.findings-emnlp.379\|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm.
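
The L2-based normalizations described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation of either paper: the function names <code>l2_normalize</code>, <code>scale_norm</code> and <code>qk_norm_logit</code> are hypothetical, the small <code>eps</code> guard against division by zero is an added assumption, and the learned scalars are passed in as ordinary arguments rather than trained. The multiplicative <code>scale</code> applied to the query-key dot product is likewise an assumption of this sketch; the text above only states that queries and keys are normalized to unit L2 norm.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Divide a vector by its L2 norm (eps is an assumed safety guard)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def scale_norm(v, g):
    """ScaleNorm-style step: L2-normalize, then multiply by a scalar g."""
    return [g * x for x in l2_normalize(v)]


def qk_norm_logit(q, k, scale):
    """Attention logit from unit-norm query and key vectors.

    With both vectors normalized, the dot product is a cosine similarity,
    so the logit lies in [-scale, scale]. `scale` is a hypothetical
    stand-in for a learned scalar.
    """
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    return scale * sum(a * b for a, b in zip(q_hat, k_hat))


y = scale_norm([3.0, 4.0], g=2.0)                    # output has L2 norm g
aligned = qk_norm_logit([1.0, 0.0], [1.0, 0.0], 5.0)
orthogonal = qk_norm_logit([2.0, 0.0], [0.0, 3.0], 5.0)
print(y, aligned, orthogonal)
```

Because the normalized vectors have unit length, the output of the ScaleNorm-style step always has L2 norm equal to the scalar, regardless of the magnitude of the input.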
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv \| eprint=2410.01131 \| last1=Loshchilov \| first1=Ilya \| last2=Hsieh \| first2=Cheng-Ping \| last3=Sun \| first3=Simeng \| last4=Ginsburg \| first4=Boris \| title=nGPT: Normalized Transformer with Representation Learning on the Hypersphere \| date=2024 \| class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.

== Miscellaneous ==
Line 297 ⟶ 311:
== Further reading ==
* {{Cite web \|title=Normalization Layers \|url=https://nn.labml.ai/normalization/index.html \|access-date=2024-08-07 \|website=labml.ai Deep Learning Paper Implementations \|language=en}}

{{Artificial intelligence navbox}}

[[Category:Articles with example Python (programming language) code]]
[[Category:Deep learning]]