Normalization (machine learning)
</math>Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization one frame at a time. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where entire sequences are available, albeit with variable lengths. In a batch, shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poor statistics estimates.<ref name=":4" />
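The masking logic above can be illustrated with a minimal NumPy sketch (the function name and shapes are illustrative, not from the cited paper): statistics are pooled over all unpadded frames of the batch jointly, so the zero padding does not bias the mean and variance.

```python
import numpy as np

def sequence_wise_batchnorm(x, lengths, eps=1e-5):
    """Sequence-wise BatchNorm over a zero-padded batch (illustrative sketch).

    x: array of shape (batch, time, features), zero-padded along time.
    lengths: array of shape (batch,) giving each sequence's true length.
    """
    batch, time, features = x.shape
    # Mask of valid (unpadded) frames: shape (batch, time, 1)
    mask = (np.arange(time)[None, :] < lengths[:, None])[..., None]
    n = mask.sum()  # total number of valid frames across the batch
    # Per-feature statistics pooled over batch and time, valid frames only
    mean = (x * mask).sum(axis=(0, 1)) / n
    var = (((x - mean) ** 2) * mask).sum(axis=(0, 1)) / n
    # Normalize, then re-zero the padded frames
    return ((x - mean) / np.sqrt(var + eps)) * mask

# Example: two sequences of lengths 3 and 5, with 4 features each
rng = np.random.default_rng(0)
x = np.zeros((2, 5, 4))
lengths = np.array([3, 5])
for i, L in enumerate(lengths):
    x[i, :L] = rng.normal(size=(L, 4))
y = sequence_wise_batchnorm(x, lengths)
```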
 
It is also possible to apply BatchNorm to [[Long short-term memory|LSTMs]].<ref>{{cite arXiv | eprint=1603.09025 | last1=Cooijmans | first1=Tim | last2=Ballas | first2=Nicolas | last3=Laurent | first3=César | last4=Gülçehre | first4=Çağlar | last5=Courville | first5=Aaron | title=Recurrent Batch Normalization | date=2016 | class=cs.LG }}</ref>
 
=== Improvements ===
Some normalization methods were designed for use in [[Transformer (deep learning architecture)|transformers]].
 
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful [[Hyperparameter optimization|hyperparameter tuning]] and a "warm-up" in [[learning rate]], where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{cite arXiv | eprint=1906.01787 | last1=Wang | first1=Qiang | last2=Li | first2=Bei | last3=Xiao | first3=Tong | last4=Zhu | first4=Jingbo | last5=Li | first5=Changliang | last6=Wong | first6=Derek F. | last7=Chao | first7=Lidia S. | title=Learning Deep Transformer Models for Machine Translation | date=2019 | class=cs.CL }}</ref> was found to be easier to train, requiring no warm-up, leading to faster convergence.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref>
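The difference between the two placements can be sketched as follows (a minimal NumPy sketch with a generic <code>sublayer</code> standing in for attention or the feed-forward network; function names are illustrative). In post-LN the normalization sits after the residual addition; in pre-LN it sits on the input to the sublayer, leaving the residual path as an identity, which is commonly credited with easing optimization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original 2017 placement): normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path is an identity.
    return x + sublayer(layer_norm(x))
```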
 
'''FixNorm'''<ref>{{cite arXiv | eprint=1710.01329 | last1=Nguyen | first1=Toan Q. | last2=Chiang | first2=David | title=Improving Lexical Choice in Neural Machine Translation | date=2017 | class=cs.CL }}</ref> and '''ScaleNorm'''<ref>{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |title=Transformers without Tears: Improving the Normalization of Self-Attention |doi=10.5281/zenodo.3525484|arxiv=1910.05895 }}</ref> both normalize activation vectors in a transformer. FixNorm divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter <math>g</math>. ScaleNorm replaces all LayerNorms inside a transformer with division by the L2 norm, followed by multiplication by a learned parameter <math>g'</math> (shared by all ScaleNorm modules of a transformer). '''Query-Key normalization''' ('''QKNorm''')<ref>{{Cite journal |last1=Henry |first1=Alex |last2=Dachapally |first2=Prudhvi Raj |last3=Pawar |first3=Shubham Shantaram |last4=Chen |first4=Yuxuan |date=November 2020 |editor-last=Cohn |editor-first=Trevor |editor2-last=He |editor2-first=Yulan |editor3-last=Liu |editor3-first=Yang |title=Query-Key Normalization for Transformers |url=https://aclanthology.org/2020.findings-emnlp.379/ |journal=Findings of the Association for Computational Linguistics: EMNLP 2020 |___location=Online |publisher=Association for Computational Linguistics |pages=4246–4253 |doi=10.18653/v1/2020.findings-emnlp.379|arxiv=2010.04245 }}</ref> normalizes query and key vectors to have unit L2 norm.
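ScaleNorm and QKNorm can be sketched in a few lines of NumPy (an illustrative sketch, not the papers' reference code; the scalar <math>g</math> arguments stand in for learned parameters). A side effect of QKNorm worth noting: because the normalized vectors have unit norm, every attention logit is a cosine similarity scaled by <math>g</math>, hence bounded in magnitude by <math>g</math>.

```python
import numpy as np

def scale_norm(x, g, eps=1e-5):
    # ScaleNorm: divide by the L2 norm along the feature axis, then
    # scale by a single learned scalar g (shared across all modules).
    return g * x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def qk_norm_logits(q, k, g, eps=1e-5):
    # QKNorm: normalize query and key vectors to unit L2 norm before the
    # dot product; g is a learned scale. Each logit is g * cos(angle),
    # so logits are bounded in [-g, g].
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    return g * qn @ kn.T
```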
 
In '''nGPT''', many vectors are normalized to have unit L2 norm:<ref>{{cite arXiv | eprint=2410.01131 | last1=Loshchilov | first1=Ilya | last2=Hsieh | first2=Cheng-Ping | last3=Sun | first3=Simeng | last4=Ginsburg | first4=Boris | title=NGPT: Normalized Transformer with Representation Learning on the Hypersphere | date=2024 | class=cs.LG }}</ref> hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
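The common operation underlying these constraints is projection onto the unit hypersphere, as in this minimal NumPy sketch (shapes and names are illustrative, not from the paper): hidden states are normalized along the feature axis, while weight matrices are normalized column-wise.

```python
import numpy as np

def unit_norm(v, axis=-1, eps=1e-8):
    # Project vectors onto the unit hypersphere along the given axis.
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

# Illustrative shapes: hidden states (batch, d_model) normalized per row,
# and a weight matrix whose *columns* are normalized, as described for nGPT.
h = unit_norm(np.random.default_rng(1).normal(size=(4, 16)))
W = unit_norm(np.random.default_rng(2).normal(size=(16, 16)), axis=0)
```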
 
== Miscellaneous ==