</math> is added.
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|Transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238 |last2=Hutter |first2=Marcus|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep.
For example, if the hidden vector in an RNN at timestep <math>
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite arXiv |last1=Zhang |first1=Biao |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467 |last2=Sennrich |first2=Rico|class=cs.LG }}</ref> changes LayerNorm by<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{j=1}^D x_j^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>Essentially, it is LayerNorm with <math>\mu</math> and <math>\epsilon</math> fixed at <math>0</math>.
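The formula above can be sketched in a few lines of NumPy. This is an illustrative implementation for a single vector; the function name is ours, and for simplicity <code>gamma</code> and <code>beta</code> are taken as scalars, whereas in practice they are learned per-coordinate parameters.

```python
import numpy as np

def rms_norm(x, gamma, beta):
    # Divide by the root mean square of x instead of subtracting the
    # mean and dividing by the standard deviation (i.e. LayerNorm with
    # mu and epsilon fixed at 0).
    rms = np.sqrt(np.mean(x ** 2))
    return gamma * (x / rms) + beta
```

After normalization (with <code>gamma=1, beta=0</code>), the mean of the squared coordinates is exactly 1.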
== Other normalizations ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868 |last2=Kingma |first2=Diederik P.|class=cs.LG }}</ref> is a technique inspired by BatchNorm. It normalizes the weight matrices of a neural network, rather than its activations.
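A minimal sketch of the reparameterization for a single weight vector, following the cited paper: the weight is written as <math>w = g \, v / \|v\|</math>, decoupling its magnitude <math>g</math> from its direction <math>v</math> (the function name is illustrative).

```python
import numpy as np

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||,
    # so that ||w|| = g regardless of the direction vector v.
    return g * v / np.linalg.norm(v)
```

During training, gradients are taken with respect to <code>g</code> and <code>v</code> separately rather than with respect to <code>w</code> directly.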
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803|arxiv=1711.02257 }}</ref> normalizes gradient vectors during backpropagation.
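As a highly simplified illustration of the idea of balancing gradient magnitudes (the full GradNorm method instead learns per-task loss weights via an auxiliary loss; the function below is ours, not from the paper):

```python
import numpy as np

def balance_gradients(grads):
    # Toy sketch: rescale each task's gradient so that every task
    # contributes a gradient of the same (average) norm. GradNorm
    # proper achieves a similar balancing by learning loss weights.
    norms = [np.linalg.norm(g) for g in grads]
    target = np.mean(norms)
    return [g * (target / n) for g, n in zip(grads, norms)]
```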
It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc' Aurelio |last4=LeCun |first4=Yann |chapter=What is the best multi-stage architecture for object recognition? |date=September 2009 |pages=2146–2153 |title=2009 IEEE 12th International Conference on Computer Vision |chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 |publisher=IEEE |doi=10.1109/iccv.2009.5459469|isbn=978-1-4244-4420-5 }}</ref><math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>where <math>\bar a_{x, y}^j</math> is the average activation in a small spatial window centered on ___location <math>(x,y)</math> in channel <math>j</math>. The constants <math>k, n, \alpha, \beta</math> and the size of the window are hyperparameters chosen using a validation set.
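For concreteness, the channel-window division can be sketched as follows for the activations at a single spatial ___location. This sketch shows the simpler response-normalization variant, i.e. with <math>\bar a_{x,y}^j = 0</math>; the contrast-normalization variant would additionally subtract the local spatial averages. The function name and default hyperparameters are illustrative.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: activations at one spatial ___location, shape (num_channels,).
    # Each channel i is divided by a term summing squared activations
    # over a window of n neighbouring channels.
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = np.sum(a[lo:hi + 1] ** 2)
        b[i] = a[i] / (k + alpha * s) ** beta
    return b
```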
Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where divisive normalization was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite
Both kinds of local normalization were superseded by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285|arxiv=1912.05845 }}</ref>
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is only used for CNNs.<ref>{{cite arXiv |last1=Ulyanov |first1=Dmitry |title=Instance Normalization: The Missing Ingredient for Fast Stylization |date=2017-11-06 |eprint=1607.08022 |last2=Vedaldi |first2=Andrea |last3=Lempitsky |first3=Victor|class=cs.CV }}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^Wx^{(l)}_{h, w, c} \\