</math>Essentially it is LayerNorm where we enforce <math>\mu, \epsilon = 0</math>.
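As an illustrative sketch (not from the cited sources), RMSNorm amounts to dividing each feature vector by its root mean square and applying a learned gain; the <code>eps</code> argument here is an implementation-convenience assumption, defaulting to the <math>\epsilon = 0</math> of the text:

```python
import numpy as np

def rms_norm(x, gamma, eps=0.0):
    # RMSNorm: divide by the root-mean-square of the features (no mean
    # subtraction, i.e. mu = 0), then apply a learned per-feature gain.
    # eps defaults to 0 as in the text; a small eps guards against
    # division by zero in practice.
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

After normalization, the mean square of the features is exactly 1 (up to the effect of <code>eps</code>).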
=== Adaptive layer norm ===
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748 }}</ref> computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data.
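A minimal sketch of the idea, with hypothetical affine weights: the conditioning vector (for example, a diffusion timestep embedding) is mapped to <math>\gamma</math> and <math>\beta</math>, which then modulate the standard LayerNorm output:

```python
import numpy as np

def ada_layer_norm(x, cond, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    # Standard LayerNorm statistics over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Unlike plain LayerNorm, gamma and beta are predicted from a
    # conditioning vector `cond` (e.g. a timestep embedding) by an
    # affine map; the weight names here are hypothetical placeholders.
    gamma = cond @ W_gamma + b_gamma
    beta = cond @ W_beta + b_beta
    return gamma * x_hat + beta
```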
== Weight normalization ==
'''Weight normalization''' ('''WeightNorm''')<ref>{{cite arXiv |last1=Salimans |first1=Tim |title=Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks |date=2016-06-03 |eprint=1602.07868 |last2=Kingma |first2=Diederik P.|class=cs.LG }}</ref> is a technique inspired by BatchNorm. It normalizes weight matrices in a neural network, rather than its neural activations.
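The reparameterization of Salimans and Kingma can be sketched as follows (a per-weight-vector version; in a full layer it is applied to each output neuron's weight vector): the direction is carried by <math>v</math> and the magnitude by a learned scalar <math>g</math>, so that <math>w = g \frac{v}{\|v\|_2}</math>:

```python
import numpy as np

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||_2, decoupling
    # its direction (v) from its magnitude (the learned scalar g).
    return g * v / np.linalg.norm(v)
```

By construction, the resulting weight vector always has norm exactly <math>g</math>, regardless of <math>v</math>.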
One example is '''spectral normalization''', which divides weight matrices by their [[spectral norm]]. Spectral normalization is used in [[Generative adversarial network|generative adversarial networks]] (GANs) such as the [[Wasserstein GAN]].<ref>{{cite arXiv |eprint=1802.05957 |class=cs.LG |first1=Takeru |last1=Miyato |first2=Toshiki |last2=Kataoka |title=Spectral Normalization for Generative Adversarial Networks |date=2018-02-16 |last3=Koyama |first3=Masanori |last4=Yoshida |first4=Yuichi}}</ref> The spectral norm can be efficiently approximated by the following power iteration:{{blockquote|'''INPUT''' matrix <math>W</math> and initial guess <math>x</math>
Iterate <math>x \mapsto \frac{1}{\|W^\mathsf{T} Wx\|_2}W^\mathsf{T} Wx</math> to convergence <math>x^*</math>. This is the top right singular vector of <math>W</math>, and <math>\|Wx^*\|_2 = \|W\|_s</math>.
'''RETURN''' <math>x^*, \|Wx^*\|_2</math>}}By reassigning <math>W_i \leftarrow \frac{W_i}{\|W_i\|_s}</math> after each update of the discriminator, we can ensure <math>\|W_i\|_s \leq 1</math>, and thus upper-bound the Lipschitz norm <math>\|D \|_L</math> of the discriminator.
The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>; then at step <math>t+1</math>, use <math>x^*_i(t)</math> as the initial guess. Since <math>W_i(t+1)</math> is close to <math>W_i(t)</math>, <math>x^*_i(t)</math> is close to <math>x^*_i(t+1)</math>, so the iteration converges rapidly.
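A sketch of this procedure, under the stated power-iteration scheme (function and parameter names are illustrative, not from the cited paper): the returned vector can be fed back in as the warm-started initial guess at the next discriminator update:

```python
import numpy as np

def spectral_norm(W, x=None, n_iter=50):
    # Power iteration on W^T W: x converges to the top right singular
    # vector of W, and ||W x||_2 converges to the spectral norm ||W||_s.
    if x is None:
        x = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(n_iter):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    return np.linalg.norm(W @ x), x

def spectrally_normalize(W, x=None):
    # Divide W by its spectral norm so that ||W||_s <= 1.  Returning x
    # lets the caller warm-start (memoize) the next call, since W
    # changes little between discriminator updates.
    sigma, x = spectral_norm(W, x)
    return W / sigma, x
```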
== CNN-specific normalization ==
\end{aligned}
</math>
== Miscellaneous ==
'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last1=Chen |first1=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803 |arxiv=1711.02257 }}</ref> normalizes gradient vectors during backpropagation.
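A minimal sketch of the normalization step only (the full GradNorm method also adaptively reweights task losses by their relative training rates, which is not shown here): each task's gradient is rescaled to a common target norm so that no single task dominates the shared parameters.

```python
import numpy as np

def normalize_gradients(grads, target=None):
    # Rescale each task's gradient vector to a common target norm
    # (by default the mean of the current norms).  This is only the
    # normalization step of GradNorm, not its adaptive loss-weighting.
    norms = [np.linalg.norm(g) for g in grads]
    if target is None:
        target = np.mean(norms)
    return [g * (target / n) for g, n in zip(grads, norms)]
```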
== See also ==