== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite arXiv |last1=Ba |first1=Jimmy Lei |last2=Kiros |first2=Jamie Ryan |last3=Hinton |first3=Geoffrey E. |date=2016 |title=Layer Normalization |eprint=1607.06450 |class=stat.ML}}</ref> is a widely used alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Because it does not operate over the batch dimension, its behavior does not depend on batch size. It is a key component of [[Transformer (deep learning architecture)|Transformers]].
 
For a given data input and layer, LayerNorm computes the mean <math>\mu</math> and variance <math>\sigma^2</math> over all the neurons in the layer. As in BatchNorm, learnable parameters <math>\gamma</math> (scale) and <math>\beta</math> (shift) are applied. It is defined by:<math display="block">\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x}_i + \beta_i</math>where <math>\mu = \frac{1}{D} \sum_{i=1}^D x_i</math> and <math>\sigma^2 = \frac{1}{D} \sum_{i=1}^D (x_i - \mu)^2</math> are computed over the <math>D</math> neurons in the layer, and <math>\epsilon</math> is a small constant added for numerical stability.
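The computation can be illustrated with a minimal [[NumPy]] sketch of the formula above (the function name <code>layer_norm</code> and the default <code>eps = 1e-5</code> are illustrative choices, not taken from the cited paper):

<syntaxhighlight lang="python">
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one sample's layer activations x of shape (D,)."""
    mu = x.mean()                          # mean over all D neurons
    var = x.var()                          # population variance over all D neurons
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # per-neuron learnable scale and shift

# Example: a layer with D = 8 neurons
x = np.random.randn(8)
gamma, beta = np.ones(8), np.zeros(8)
y = layer_norm(x, gamma, beta)
</syntaxhighlight>

Because the statistics are taken over the features of a single sample, the result is the same whether the sample is processed alone or as part of a batch.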