</math> is added.
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|Transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238 |last2=Hutter |first2=Marcus|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep.
For example, if the hidden vector in an RNN at timestep <math>
=== Root mean square layer normalization ===
'''Root mean square layer normalization''' ('''RMSNorm''')<ref>{{cite arXiv |last1=Zhang |first1=Biao |title=Root Mean Square Layer Normalization |date=2019-10-16 |eprint=1910.07467 |last2=Sennrich |first2=Rico|class=cs.LG }}</ref> changes LayerNorm by<math display="block">
\hat{x_i} = \frac{x_i}{\sqrt{\frac 1D \sum_{j=1}^D x_j^2}}, \quad y_i = \gamma \hat{x_i} + \beta
</math>Essentially, it is LayerNorm with <math>\mu</math> and <math>\epsilon</math> fixed at <math>0</math>.
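The formula above can be sketched in a few lines of NumPy. This is an illustrative implementation for a single vector; the function name is ours, and for simplicity <code>gamma</code> and <code>beta</code> are taken as scalars, whereas in practice they are learned per-coordinate parameters.

```python
import numpy as np

def rms_norm(x, gamma, beta):
    # Divide by the root mean square of x instead of subtracting the
    # mean and dividing by the standard deviation (i.e. LayerNorm with
    # mu and epsilon fixed at 0).
    rms = np.sqrt(np.mean(x ** 2))
    return gamma * (x / rms) + beta
```

After normalization (with <code>gamma=1, beta=0</code>), the mean of the squared coordinates is exactly 1.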
== Other normalizations ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868 |last2=Kingma |first2=Diederik P.|class=cs.LG }}</ref> is a technique inspired by BatchNorm. It normalizes the weight matrices of a neural network, rather than its activations.
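A minimal sketch of the reparameterization for a single weight vector, following the cited paper: the weight is written as <math>w = g \, v / \|v\|</math>, decoupling its magnitude <math>g</math> from its direction <math>v</math> (the function name is illustrative).

```python
import numpy as np

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||,
    # so that ||w|| = g regardless of the direction vector v.
    return g * v / np.linalg.norm(v)
```

During training, gradients are taken with respect to <code>g</code> and <code>v</code> separately rather than with respect to <code>w</code> directly.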
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803|arxiv=1711.02257 }}</ref> normalizes gradient vectors during backpropagation.
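As a highly simplified illustration of the idea of balancing gradient magnitudes (the full GradNorm method instead learns per-task loss weights via an auxiliary loss; the function below is ours, not from the paper):

```python
import numpy as np

def balance_gradients(grads):
    # Toy sketch: rescale each task's gradient so that every task
    # contributes a gradient of the same (average) norm. GradNorm
    # proper achieves a similar balancing by learning loss weights.
    norms = [np.linalg.norm(g) for g in grads]
    target = np.mean(norms)
    return [g * (target / n) for g, n in zip(grads, norms)]
```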
It was a variant of the earlier '''local contrast normalization'''.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc' Aurelio |last4=LeCun |first4=Yann |chapter=What is the best multi-stage architecture for object recognition? |date=September 2009 |pages=2146–2153 |title=2009 IEEE 12th International Conference on Computer Vision |chapter-url=http://dx.doi.org/10.1109/iccv.2009.5459469 |publisher=IEEE |doi=10.1109/iccv.2009.5459469|isbn=978-1-4244-4420-5 }}</ref><math display="block">b_{x, y}^i=\frac{a_{x, y}^i}{\left(k+\alpha \sum_{j=\max (0, i-n / 2)}^{\min (N-1, i+n / 2)}\left(a_{x, y}^j - \bar a_{x, y}^j\right)^2\right)^\beta}</math>where <math>\bar a_{x, y}^j</math> is the average activation in a small spatial window centered on ___location <math>(x,y)</math> in channel <math>j</math>. The constants <math>k, n, \alpha, \beta</math> and the size of the window are hyperparameters chosen using a validation set.
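For concreteness, the channel-window division can be sketched as follows for the activations at a single spatial ___location. This sketch shows the simpler response-normalization variant, i.e. with <math>\bar a_{x,y}^j = 0</math>; the contrast-normalization variant would additionally subtract the local spatial averages. The function name and default hyperparameters are illustrative.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: activations at one spatial ___location, shape (num_channels,).
    # Each channel i is divided by a term summing squared activations
    # over a window of n neighbouring channels.
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = np.sum(a[lo:hi + 1] ** 2)
        b[i] = a[i] / (k + alpha * s) ** beta
    return b
```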
Similar methods were called '''divisive normalization''', as they divide activations by a number depending on the activations. They were originally inspired by biology, where divisive normalization was used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.<ref>{{Cite
Both kinds of local normalization were superseded by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285|arxiv=1912.05845 }}</ref>
=== Instance normalization ===
'''Instance normalization''' ('''InstanceNorm'''), or '''contrast normalization''', is a technique first developed for [[neural style transfer]], and is only used for CNNs.<ref>{{cite arXiv |last1=Ulyanov |first1=Dmitry |title=Instance Normalization: The Missing Ingredient for Fast Stylization |date=2017-11-06 |eprint=1607.08022 |last2=Vedaldi |first2=Andrea |last3=Lempitsky |first3=Victor|class=cs.CV }}</ref> It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^Wx^{(l)}_{h, w, c} \\