Normalization (machine learning): Difference between revisions

 
=== Adaptive ===
'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal |last1=Perez |first1=Ethan |last2=Strub |first2=Florian |last3=De Vries |first3=Harm |last4=Dumoulin |first4=Vincent |last5=Courville |first5=Aaron |date=2018-04-29 |title=FiLM: Visual Reasoning with a General Conditioning Layer |url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |volume=32 |issue=1 |doi=10.1609/aaai.v32i1.11671 |issn=2374-3468}}</ref> and has been used effectively in the [[diffusion Transformer]] (DiT).<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> For example, in DiT, the conditioning information (such as a text encoding vector) is processed by an MLP into <math>\gamma, \beta</math>, which is then applied in the LayerNorm modules of a Transformer.
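The adaLN computation can be sketched as follows (a minimal illustration in which the conditioning MLP is reduced to a single affine projection; the function and weight names are illustrative, not taken from the cited papers):

```python
import numpy as np

def adaptive_layer_norm(x, cond, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """LayerNorm whose scale (gamma) and shift (beta) are computed
    from a conditioning vector rather than learned as fixed constants."""
    # Project the conditioning vector (e.g. a text encoding) to gamma, beta.
    gamma = cond @ W_gamma + b_gamma
    beta = cond @ W_beta + b_beta
    # Standard LayerNorm statistics over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With fixed <math>\gamma, \beta</math> this reduces to an ordinary LayerNorm; the only change is where the affine parameters come from.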
 
== Weight normalization ==
Both kinds of local normalization were obsoleted by batch normalization, which is a more global form of normalization.<ref>{{Cite journal |last1=Ortiz |first1=Anthony |last2=Robinson |first2=Caleb |last3=Morris |first3=Dan |last4=Fuentes |first4=Olac |last5=Kiekintveld |first5=Christopher |last6=Hassan |first6=Md Mahmudulla |last7=Jojic |first7=Nebojsa |date=2020 |title=Local Context Normalization: Revisiting Local Normalization |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Ortiz_Local_Context_Normalization_Revisiting_Local_Normalization_CVPR_2020_paper.html |pages=11276–11285|arxiv=1912.05845 }}</ref>
 
Response normalization reappeared in ConvNeXT-2 as '''global response normalization'''.<ref>{{Cite journal |last1=Woo |first1=Sanghyun |last2=Debnath |first2=Shoubhik |last3=Hu |first3=Ronghang |last4=Chen |first4=Xinlei |last5=Liu |first5=Zhuang |last6=Kweon |first6=In So |last7=Xie |first7=Saining |date=2023 |title=ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders |url=https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.html |language=en |pages=16133–16142|arxiv=2301.00808 }}</ref>
 
=== Group normalization ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|Transformers]].
 
The original 2017 Transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, requiring careful hyperparameter tuning and a learning-rate "warm-up", in which the learning rate starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{Citation |last1=Wang |first1=Qiang |title=Learning Deep Transformer Models for Machine Translation |date=2019-06-04 |url=https://arxiv.org/abs/1906.01787 |access-date=2024-10-18 |arxiv=1906.01787 |last2=Li |first2=Bei |last3=Xiao |first3=Tong |last4=Zhu |first4=Jingbo |last5=Li |first5=Changliang |last6=Wong |first6=Derek F. |last7=Chao |first7=Lidia S.}}</ref> was found to be easier to train, requiring no warm-up and leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
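The difference between the two conventions is only the placement of the LayerNorm relative to the residual connection, which can be sketched as follows (a schematic in which the attention or feedforward sublayer is abstracted as a function argument; the function names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without affine parameters, over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original 2017 Transformer): normalize AFTER the residual add,
    # so the residual stream itself passes through every LayerNorm.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer input; the residual stream is an
    # identity path from input to output, which eases optimization.
    return x + sublayer(layer_norm(x))
```

In pre-LN, gradients flow through the unnormalized identity path, which is one informal account of why it trains without warm-up.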
 
'''FixNorm'''<ref>{{Citation |last1=Nguyen |first1=Toan Q. |title=Improving Lexical Choice in Neural Machine Translation |date=2018-04-17 |url=https://arxiv.org/abs/1710.01329 |access-date=2024-10-18 |arxiv=1710.01329 |last2=Chiang |first2=David}}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |url=https://arxiv.org/abs/1910.05895 |doi=10.5281/zenodo.3525484|arxiv=1910.05895 }}</ref> both normalize activation vectors in a Transformer. FixNorm divides the ''output'' vectors of a Transformer by their L2 norms, then multiplies them by a learned parameter <math>g</math>. ScaleNorm replaces every LayerNorm inside a Transformer with division by the L2 norm followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a Transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |___location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379}}</ref> normalizes query and key vectors to have unit L2 norm.
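The shared operation behind these methods, division by the L2 norm with an optional learned scale, can be sketched as follows (a simplified illustration; the cited papers differ in where the scale is applied and whether it is learned, and the function names here are not from those papers):

```python
import numpy as np

def l2_scale_norm(x, g, eps=1e-5):
    # Divide each vector by its L2 norm, then multiply by a scalar g.
    # With g learned, this is the core of FixNorm (applied to outputs)
    # and ScaleNorm (applied in place of each LayerNorm).
    return g * x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def qk_norm_logits(q, k, scale=1.0):
    # QKNorm-style attention logits: normalize queries and keys to unit
    # L2 norm before the dot product, so each logit is a cosine similarity
    # (times an optional scale).
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return scale * (q @ k.T)
```

Unlike LayerNorm, this involves no mean subtraction and no per-coordinate affine parameters, only a rescaling of each vector to a common length.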
 
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{Citation |last1=Loshchilov |first1=Ilya |title=nGPT: Normalized Transformer with Representation Learning on the Hypersphere |date=2024-10-01 |url=https://arxiv.org/abs/2410.01131 |access-date=2024-10-18 |arxiv=2410.01131 |last2=Hsieh |first2=Cheng-Ping |last3=Sun |first3=Simeng |last4=Ginsburg |first4=Boris}}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
 
== Miscellaneous ==