Normalization (machine learning): Difference between revisions

 
=== Adaptive ===
'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal |last1=Perez |first1=Ethan |last2=Strub |first2=Florian |last3=De Vries |first3=Harm |last4=Dumoulin |first4=Vincent |last5=Courville |first5=Aaron |date=2018-04-29 |title=FiLM: Visual Reasoning with a General Conditioning Layer |url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |volume=32 |issue=1 |doi=10.1609/aaai.v32i1.11671 |issn=2374-3468}}</ref> and has been used effectively in the [[diffusion Transformer]] (DiT).<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> For example, in DiT, the conditioning information (such as a text encoding vector) is processed by an MLP into <math>\gamma, \beta</math>, which is then applied in the LayerNorm modules of a Transformer.
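The adaLN computation can be sketched as follows (a minimal illustration in which the conditioning MLP is reduced to a single affine projection; the function and weight names are illustrative, not taken from the cited papers):

```python
import numpy as np

def adaptive_layer_norm(x, cond, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """LayerNorm whose scale (gamma) and shift (beta) are computed
    from a conditioning vector rather than learned as fixed constants."""
    # Project the conditioning vector (e.g. a text encoding) to gamma, beta.
    gamma = cond @ W_gamma + b_gamma
    beta = cond @ W_beta + b_beta
    # Standard LayerNorm statistics over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With fixed <math>\gamma, \beta</math> this reduces to an ordinary LayerNorm; the only change is where the affine parameters come from.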
 
== Weight normalization ==
Both kinds of local normalization were obsoleted by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285|arxiv=1912.05845 }}</ref>
 
Response normalization reappeared in ConvNeXT-2 as '''global response normalization'''.<ref>{{Cite journal |last1=Woo |first1=Sanghyun |last2=Debnath |first2=Shoubhik |last3=Hu |first3=Ronghang |last4=Chen |first4=Xinlei |last5=Liu |first5=Zhuang |last6=Kweon |first6=In So |last7=Xie |first7=Saining |date=2023 |title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders |url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html |language=en |pages=16133–16142|arxiv=2301.00808 }}</ref>
 
=== Group normalization ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|Transformers]].
 
The original 2017 Transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, requiring careful hyperparameter tuning and a learning-rate "warm-up", in which the learning rate starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{Citation |last1=Wang |first1=Qiang |title=Learning Deep Transformer Models for Machine Translation |date=2019-06-04 |url=https://arxiv.org/abs/1906.01787 |access-date=2024-10-18 |arxiv=1906.01787 |last2=Li |first2=Bei |last3=Xiao |first3=Tong |last4=Zhu |first4=Jingbo |last5=Li |first5=Changliang |last6=Wong |first6=Derek F. |last7=Chao |first7=Lidia S.}}</ref> was found to be easier to train, requiring no warm-up and leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
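The difference between the two conventions is only the placement of the LayerNorm relative to the residual connection, which can be sketched as follows (a schematic in which the attention or feedforward sublayer is abstracted as a function argument; the function names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without affine parameters, over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original 2017 Transformer): normalize AFTER the residual add,
    # so the residual stream itself passes through every LayerNorm.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer input; the residual stream is an
    # identity path from input to output, which eases optimization.
    return x + sublayer(layer_norm(x))
```

In pre-LN, gradients flow through the unnormalized identity path, which is one informal account of why it trains without warm-up.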
 
'''FixNorm'''<ref>{{Citation |last1=Nguyen |first1=Toan Q. |title=Improving Lexical Choice in Neural Machine Translation |date=2018-04-17 |url=https://arxiv.org/abs/1710.01329 |access-date=2024-10-18 |arxiv=1710.01329 |last2=Chiang |first2=David}}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |url=https://arxiv.org/abs/1910.05895 |doi=10.5281/zenodo.3525484|arxiv=1910.05895 }}</ref> both normalize activation vectors in a Transformer. FixNorm divides the ''output'' vectors of a Transformer by their L2 norms, then multiplies them by a learned parameter <math>g</math>. ScaleNorm replaces every LayerNorm inside a Transformer with division by the L2 norm followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a Transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |___location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379}}</ref> normalizes query and key vectors to have unit L2 norm.
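The shared operation behind these methods, division by the L2 norm with an optional learned scale, can be sketched as follows (a simplified illustration; the cited papers differ in where the scale is applied and whether it is learned, and the function names here are not from those papers):

```python
import numpy as np

def l2_scale_norm(x, g, eps=1e-5):
    # Divide each vector by its L2 norm, then multiply by a scalar g.
    # With g learned, this is the core of FixNorm (applied to outputs)
    # and ScaleNorm (applied in place of each LayerNorm).
    return g * x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def qk_norm_logits(q, k, scale=1.0):
    # QKNorm-style attention logits: normalize queries and keys to unit
    # L2 norm before the dot product, so each logit is a cosine similarity
    # (times an optional scale).
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return scale * (q @ k.T)
```

Unlike LayerNorm, this involves no mean subtraction and no per-coordinate affine parameters, only a rescaling of each vector to a common length.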
 
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{Citation |last1=Loshchilov |first1=Ilya |title=nGPT: Normalized Transformer with Representation Learning on the Hypersphere |date=2024-10-01 |url=https://arxiv.org/abs/2410.01131 |access-date=2024-10-18 |arxiv=2410.01131 |last2=Hsieh |first2=Cheng-Ping |last3=Sun |first3=Simeng |last4=Ginsburg |first4=Boris}}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
 
== Miscellaneous ==