== Layer normalization ==
'''Layer normalization''' ('''LayerNorm''')<ref name=":2">{{Cite
For a given data input and layer, LayerNorm computes the mean <math>\mu</math> and variance <math>\sigma^2</math> over all the neurons in the layer. As in BatchNorm, learnable scale and shift parameters <math>\gamma_i</math> and <math>\beta_i</math> are then applied. It is defined by:<math display="block">\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x}_i + \beta_i</math>where <math>
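The computation above can be illustrated with a short sketch in NumPy (an illustrative example, not part of any particular library's implementation; the function name <code>layer_norm</code> and the sample values are chosen for demonstration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and variance are taken over the feature (neuron) dimension
    # of a single input vector, not over the batch.
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Per-neuron learnable scale (gamma) and shift (beta)
    return gamma * x_hat + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With identity gamma/beta, y has approximately zero mean and unit variance
```

Because the statistics are computed per input rather than per batch, the same computation applies at training and inference time, with no dependence on batch size.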