<math>\gamma</math> and <math>\beta</math> allow the network to learn to undo the normalization, if this is beneficial.<ref name=":1">{{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT Press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |___location=Cambridge, Massachusetts |chapter=8.7.1. Batch Normalization}}</ref> BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.<ref>{{Cite journal |last1=Desjardins |first1=Guillaume |last2=Simonyan |first2=Karen |last3=Pascanu |first3=Razvan |last4=kavukcuoglu |first4=koray |date=2015 |title=Natural Neural Networks |url=https://proceedings.neurips.cc/paper_files/paper/2015/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=28}}</ref><ref name=":1" />
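The role of the learned parameters can be sketched in a few lines (a minimal NumPy illustration; the function name and signature are mine, not from a reference implementation):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then rescale.

    x: array of shape (batch, features).
    gamma, beta: learned per-feature parameters; setting gamma to the
    feature's standard deviation and beta to its mean would undo the
    normalization entirely, which is what allows the network to learn
    to recover the identity if that is beneficial.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With <code>gamma = 1</code> and <code>beta = 0</code> this reduces to plain standardization of each feature over the batch.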
It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters<ref>{{Cite journal |last1=Xu |first1=Jingjing |last2=Sun |first2=Xu |last3=Zhang |first3=Zhiyuan |last4=Zhao |first4=Guangxiang |last5=Lin |first5=Junyang |date=2019 |title=Understanding and Improving Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32 |arxiv=1911.07013}}</ref><ref>{{Cite journal |last1=Awais |first1=Muhammad |last2=Bin Iqbal |first2=Md. Tauhid |last3=Bae |first3=Sung-Ho |date=November 2021 |title=Revisiting Internal Covariate Shift for Batch Normalization
=== Special cases ===
return y
</syntaxhighlight>For multilayered [[Recurrent neural network|recurrent neural networks]] (RNN), BatchNorm is usually applied only to the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part.<ref name=":4">{{Cite book |last1=Laurent |first1=Cesar |last2=Pereyra |first2=Gabriel |last3=Brakel |first3=Philemon |last4=Zhang |first4=Ying |last5=Bengio |first5=Yoshua |chapter=Batch normalized recurrent neural networks |date=March 2016 |title=2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
\begin{aligned}
\mu_t^{(l)} &= \frac{1}{B} \sum_{b=1}^B h_{b,t}^{(l)} \\
</math>Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where entire sequences are available but have variable lengths. In a batch, shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poor estimates of the statistics.<ref name=":4" />
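The two ways of pooling statistics can be contrasted in a short sketch (a minimal NumPy illustration; the function names are mine, not from the cited paper):

```python
import numpy as np

def frame_wise_norm(h, eps=1e-5):
    # h: (batch, time, features); separate statistics for every time step t,
    # computed over the batch dimension only
    mu = h.mean(axis=0, keepdims=True)        # shape (1, T, D)
    var = h.var(axis=0, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def sequence_wise_norm(h, eps=1e-5):
    # h: (batch, time, features); one set of statistics pooled over
    # both the batch and time dimensions
    mu = h.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, D)
    var = h.var(axis=(0, 1), keepdims=True)
    return (h - mu) / np.sqrt(var + eps)
```

Frame-wise statistics at late time steps are estimated from only the few sequences long enough to reach them, which is the failure mode described above.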
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{
=== Improvements ===
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm'''):<ref>{{cite arXiv |last1=Zhang |first1=Biao |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467 |last2=Sennrich |first2=Rico|class=cs.LG }}</ref>
<math display="block">
Line 213:
</math>
Essentially, it is LayerNorm where we enforce <math>\mu = 0</math> and <math>\epsilon = 0</math>. It is also called '''L2 normalization'''. It is the <math>p = 2</math> special case of '''Lp normalization''', or '''power normalization''':<math display="block">
\hat{x}_i = \frac{x_i}{\left(\frac 1D \sum_{j=1}^D |x_j|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x}_i + \beta
</math>where <math>p > 0</math> is a constant.
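The family can be written compactly (a minimal NumPy sketch; the function name and the scalar <code>gamma</code>/<code>beta</code> defaults are mine for illustration):

```python
import numpy as np

def lp_norm(x, p=2.0, gamma=1.0, beta=0.0, eps=1e-8):
    """Power (Lp) normalization of a vector x; p=2 recovers RMSNorm."""
    # (1/D * sum |x_j|^p)^(1/p); eps guards against division by zero
    denom = np.mean(np.abs(x) ** p) ** (1.0 / p) + eps
    return gamma * (x / denom) + beta
```

With <code>p=2</code> the denominator is the root mean square of the entries, so the function divides by the same quantity as RMSNorm.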
=== Adaptive ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|transformers]].
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, requiring careful [[Hyperparameter optimization|hyperparameter tuning]] and a "warm-up" in [[learning rate]], where it starts small and is gradually increased. The pre-LN convention, proposed several times in 2018,<ref>{{
'''FixNorm'''<ref>{{
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{
== Miscellaneous ==