'''Gradient normalization''' ('''GradNorm''')<ref>{{Cite journal |last=Chen |first=Zhao |last2=Badrinarayanan |first2=Vijay |last3=Lee |first3=Chen-Yu |last4=Rabinovich |first4=Andrew |date=2018-07-03 |title=GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks |url=https://proceedings.mlr.press/v80/chen18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=794–803}}</ref> normalizes gradient magnitudes across tasks during backpropagation in multitask learning, learning adaptive loss weights so that all tasks train at comparable rates.
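In outline, GradNorm assigns each task loss <math>L_i</math> a learnable weight <math>w_i</math> and measures the per-task gradient norm <math>G_i = \|\nabla_W (w_i L_i)\|_2</math> with respect to a shared set of weights <math>W</math>. The weights are trained by minimizing an auxiliary loss of the form
<math display="block">L_{\text{grad}} = \sum_i \left| G_i - \bar{G} \cdot r_i^{\alpha} \right|</math>
where <math>\bar{G}</math> is the average gradient norm over tasks, <math>r_i</math> is the relative inverse training rate of task <math>i</math>, and <math>\alpha</math> is a hyperparameter controlling the strength of the balancing.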
'''Adaptive layer norm''' ('''adaLN''')<ref>{{Cite journal |last=Peebles |first=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205}}</ref> computes the <math>\gamma, \beta</math> parameters of a LayerNorm not from the layer activation itself, but from other data, such as a conditioning signal.
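For example, a conditioning vector <math>c</math> (such as a diffusion timestep or class embedding) can be mapped by a small network to the scale and shift applied after standardization:
<math display="block">y = \gamma(c) \odot \frac{x - \mu(x)}{\sigma(x)} + \beta(c)</math>
where <math>\mu(x), \sigma(x)</math> are the mean and standard deviation of the activation <math>x</math>, and <math>\gamma(c), \beta(c)</math> are produced from <math>c</math> rather than learned as fixed per-channel parameters.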
== CNN-specific normalization ==