Normalization (machine learning)
</math>Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization one frame at a time. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where entire sequences are available, albeit with variable lengths. In a batch, shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poor statistics estimates.<ref name=":4" />
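The masking logic above can be illustrated with a minimal NumPy sketch (the function name and shapes are illustrative, not from the cited paper): statistics are pooled over all unpadded frames of the batch jointly, so the zero padding does not bias the mean and variance.

```python
import numpy as np

def sequence_wise_batchnorm(x, lengths, eps=1e-5):
    """Sequence-wise BatchNorm over a zero-padded batch (illustrative sketch).

    x: array of shape (batch, time, features), zero-padded along time.
    lengths: array of shape (batch,) giving each sequence's true length.
    """
    batch, time, features = x.shape
    # Mask of valid (unpadded) frames: shape (batch, time, 1)
    mask = (np.arange(time)[None, :] < lengths[:, None])[..., None]
    n = mask.sum()  # total number of valid frames across the batch
    # Per-feature statistics pooled over batch and time, valid frames only
    mean = (x * mask).sum(axis=(0, 1)) / n
    var = (((x - mean) ** 2) * mask).sum(axis=(0, 1)) / n
    # Normalize, then re-zero the padded frames
    return ((x - mean) / np.sqrt(var + eps)) * mask

# Example: two sequences of lengths 3 and 5, with 4 features each
rng = np.random.default_rng(0)
x = np.zeros((2, 5, 4))
lengths = np.array([3, 5])
for i, L in enumerate(lengths):
    x[i, :L] = rng.normal(size=(L, 4))
y = sequence_wise_batchnorm(x, lengths)
```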
 
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{cite arXiv | eprint=1603.09025 | last1=Cooijmans | first1=Tim | last2=Ballas | first2=Nicolas | last3=Laurent | first3=César | last4=Gülçehre | first4=Çağlar | last5=Courville | first5=Aaron | title=Recurrent Batch Normalization | date=2016 | class=cs.LG }}</ref>
 
=== Improvements ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|transformers]].
 
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization|hyperparameter tuning]] and a "warm-up" in [[learning rate]], where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv | eprint=1906.01787 | last1=Wang | first1=Qiang | last2=Li | first2=Bei | last3=Xiao | first3=Tong | last4=Zhu | first4=Jingbo | last5=Li | first5=Changliang | last6=Wong | first6=Derek F. | last7=Chao | first7=Lidia S. | title=Learning Deep Transformer Models for Machine Translation | date=2019 | class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up, leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
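The difference between the two placements can be sketched as follows (a minimal NumPy sketch with a generic <code>sublayer</code> standing in for attention or the feed-forward network; function names are illustrative). In post-LN the normalization sits after the residual addition; in pre-LN it sits on the input to the sublayer, leaving the residual path as an identity, which is commonly credited with easing optimization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original 2017 placement): normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path is an identity.
    return x + sublayer(layer_norm(x))
```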
 
'''FixNorm'''<ref>{{cite arXiv | eprint=1710.01329 | last1=Nguyen | first1=Toan Q. | last2=Chiang | first2=David | title=Improving Lexical Choice in Neural Machine Translation | date=2017 | class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |doi=10.5281/zenodo.3525484|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. FixNorm divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. ScaleNorm replaces all LayerNorms inside a transformer with division by the L2 norm, followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |___location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm.
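ScaleNorm and QKNorm can be sketched in a few lines of NumPy (an illustrative sketch, not the papers' reference code; the scalar <math>g</math> arguments stand in for learned parameters). A side effect of QKNorm worth noting: because the normalized vectors have unit norm, every attention logit is a cosine similarity scaled by <math>g</math>, hence bounded in magnitude by <math>g</math>.

```python
import numpy as np

def scale_norm(x, g, eps=1e-5):
    # ScaleNorm: divide by the L2 norm along the feature axis, then
    # scale by a single learned scalar g (shared across all modules).
    return g * x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def qk_norm_logits(q, k, g, eps=1e-5):
    # QKNorm: normalize query and key vectors to unit L2 norm before the
    # dot product; g is a learned scale. Each logit is g * cos(angle),
    # so logits are bounded in [-g, g].
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    return g * qn @ kn.T
```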
 
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv | eprint=2410.01131 | last1=Loshchilov | first1=Ilya | last2=Hsieh | first2=Cheng-Ping | last3=Sun | first3=Simeng | last4=Ginsburg | first4=Boris | title=NGPT: Normalized Transformer with Representation Learning on the Hypersphere | date=2024 | class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
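The common operation underlying these constraints is projection onto the unit hypersphere, as in this minimal NumPy sketch (shapes and names are illustrative, not from the paper): hidden states are normalized along the feature axis, while weight matrices are normalized column-wise.

```python
import numpy as np

def unit_norm(v, axis=-1, eps=1e-8):
    # Project vectors onto the unit hypersphere along the given axis.
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

# Illustrative shapes: hidden states (batch, d_model) normalized per row,
# and a weight matrix whose *columns* are normalized, as described for nGPT.
h = unit_norm(np.random.default_rng(1).normal(size=(4, 16)))
W = unit_norm(np.random.default_rng(2).normal(size=(16, 16)), axis=0)
```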
 
== Miscellaneous ==