</math>Essentially it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>.
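As an illustrative sketch (not from the cited sources), RMSNorm amounts to dividing each feature vector by its root mean square and applying a learned gain; the <code>eps</code> argument here is an implementation-convenience assumption, defaulting to the <math>\epsilon = 0</math> of the text:

```python
import numpy as np

def rms_norm(x, gamma, eps=0.0):
    # RMSNorm: divide by the root-mean-square of the features (no mean
    # subtraction, i.e. mu = 0), then apply a learned per-feature gain.
    # eps defaults to 0 as in the text; a small eps guards against
    # division by zero in practice.
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

After normalization, the mean square of the features is exactly 1 (up to the effect of <code>eps</code>).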
=== Adaptive layer norm ===
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748 }}</ref> computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data.
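A minimal sketch of the idea, with hypothetical affine weights: the conditioning vector (for example, a diffusion timestep embedding) is mapped to <math>\gamma</math> and <math>\beta</math>, which then modulate the standard LayerNorm output:

```python
import numpy as np

def ada_layer_norm(x, cond, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    # Standard LayerNorm statistics over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Unlike plain LayerNorm, gamma and beta are predicted from a
    # conditioning vector `cond` (e.g. a timestep embedding) by an
    # affine map; the weight names here are hypothetical placeholders.
    gamma = cond @ W_gamma + b_gamma
    beta = cond @ W_beta + b_beta
    return gamma * x_hat + beta
```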
== Weight normalization ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868 |last2=Kingma |first2=Diederik P.|class=cs.LG }}</ref> is a technique inspired by BatchNorm. It normalizes weight matrices in a neural network, rather than its neural activations.
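The reparameterization of Salimans and Kingma can be sketched as follows (a per-weight-vector version; in a full layer it is applied to each output neuron's weight vector): the direction is carried by <math>v</math> and the magnitude by a learned scalar <math>g</math>, so that <math>w = g \frac{v}{\|v\|_2}</math>:

```python
import numpy as np

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||_2, decoupling
    # its direction (v) from its magnitude (the learned scalar g).
    return g * v / np.linalg.norm(v)
```

By construction, the resulting weight vector always has norm exactly <math>g</math>, regardless of <math>v</math>.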
One example is '''spectral normalization''', which divides weight matrices by their [[spectral norm]]. Spectral normalization is used in [[Generative adversarial network|generative adversarial networks]] (GANs) such as the [[Wasserstein GAN]].<ref>{{cite arXiv |eprint=1802.05957 |class=cs.LG |first1=Takeru |last1=Miyato |first2=Toshiki |last2=Kataoka |title=Spectral Normalization for Generative Adversarial Networks |date=2018-02-16 |last3=Koyama |first3=Masanori |last4=Yoshida |first4=Yuichi}}</ref> The spectral norm can be efficiently approximated by the following power iteration:{{blockquote|'''INPUT''' matrix <math>W</math> and initial guess <math>x</math>
Iterate <math>x \mapsto \frac{1}{\|W^\mathsf{T} Wx\|_2}W^\mathsf{T} Wx</math> to convergence <math>x^*</math>. This is the top right singular vector of <math>W</math>, and <math>\|Wx^*\|_2 = \|W\|_s</math>.
'''RETURN''' <math>x^*, \|Wx^*\|_2</math>}}By reassigning <math>W_i \leftarrow \frac{W_i}{\|W_i\|_s}</math> after each update of the discriminator, we can ensure <math>\|W_i\|_s \leq 1</math>, and thus upper-bound the Lipschitz norm <math>\|D \|_L</math> of the discriminator.
The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>; then at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess. Since <math>W_i(t+1)</math> is close to <math>W_i(t)</math>, <math>x^*_i(t)</math> is close to <math>x^*_i(t+1)</math>, so the iteration converges rapidly.
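A sketch of this procedure, under the stated power-iteration scheme (function and parameter names are illustrative, not from the cited paper): the returned vector can be fed back in as the warm-started initial guess at the next discriminator update:

```python
import numpy as np

def spectral_norm(W, x=None, n_iter=50):
    # Power iteration on W^T W: x converges to the top right singular
    # vector of W, and ||W x||_2 converges to the spectral norm ||W||_s.
    if x is None:
        x = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(n_iter):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    return np.linalg.norm(W @ x), x

def spectrally_normalize(W, x=None):
    # Divide W by its spectral norm so that ||W||_s <= 1.  Returning x
    # lets the caller warm-start (memoize) the next call, since W
    # changes little between discriminator updates.
    sigma, x = spectral_norm(W, x)
    return W / sigma, x
```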
== CNN-specific normalization ==
\end{aligned}
</math>
== Miscellaneous ==
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803 |arxiv=1711.02257 }}</ref> normalizes gradient vectors during backpropagation.
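A minimal sketch of the normalization step only (the full GradNorm method also adaptively reweights task losses by their relative training rates, which is not shown here): each task's gradient is rescaled to a common target norm so that no single task dominates the shared parameters.

```python
import numpy as np

def normalize_gradients(grads, target=None):
    # Rescale each task's gradient vector to a common target norm
    # (by default the mean of the current norms).  This is only the
    # normalization step of GradNorm, not its adaptive loss-weighting.
    norms = [np.linalg.norm(g) for g in grads]
    if target is None:
        target = np.mean(norms)
    return [g * (target / n) for g, n in zip(grads, norms)]
```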
== See also ==