=== Adaptive ===
'''Adaptive layer norm''' ('''adaLN''') computes the <math>\gamma, \beta</math> in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs,<ref>{{Cite journal |last=Perez |first=Ethan |last2=Strub |first2=Florian |last3=De Vries |first3=Harm |last4=Dumoulin |first4=Vincent |last5=Courville |first5=Aaron |date=2018-04-29 |title=FiLM: Visual Reasoning with a General Conditioning Layer |url=https://ojs.aaai.org/index.php/AAAI/article/view/11671 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |volume=32 |issue=1 |doi=10.1609/aaai.v32i1.11671 |issn=2374-3468}}</ref> and has been used effectively in the [[diffusion Transformer]] (DiT).<ref>{{Cite journal |last1=Peebles |first1=William |last2=Xie |first2=Saining |date=2023 |title=Scalable Diffusion Models with Transformers |url=https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html |language=en |pages=4195–4205 |arxiv=2212.09748}}</ref> For example, in DiT, the conditioning information (such as a text encoding vector) is processed by an MLP into <math>\gamma, \beta</math>, which are then applied in the LayerNorm modules of a Transformer in place of fixed learned parameters.
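The idea can be illustrated with a minimal NumPy sketch. It is not the implementation used in DiT or any particular library: the two-layer ReLU MLP, the helper names <code>init_params</code> and <code>ada_layer_norm</code>, and the tensor shapes are assumptions made for illustration only. The conditioning vector <code>c</code> is mapped to <math>\gamma, \beta</math>, which then scale and shift the normalized activations of every token.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) axis, with no learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_params(rng, d_cond, d_hidden, d_model):
    # Hypothetical parameter container for a two-layer conditioning MLP.
    # The output bias starts at gamma = 1, beta = 0 (an illustrative choice).
    return {
        "w1": rng.normal(scale=0.02, size=(d_cond, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(scale=0.02, size=(d_hidden, 2 * d_model)),
        "b2": np.concatenate([np.ones(d_model), np.zeros(d_model)]),
    }

def ada_layer_norm(x, c, params):
    # x: activations of shape (batch, tokens, d_model)
    # c: conditioning vectors of shape (batch, d_cond)
    h = np.maximum(0.0, c @ params["w1"] + params["b1"])  # ReLU hidden layer
    gamma_beta = h @ params["w2"] + params["b2"]          # (batch, 2 * d_model)
    gamma, beta = np.split(gamma_beta, 2, axis=-1)        # each (batch, d_model)
    # Broadcast the per-example gamma and beta over the token axis.
    return gamma[:, None, :] * layer_norm(x) + beta[:, None, :]

rng = np.random.default_rng(0)
params = init_params(rng, d_cond=4, d_hidden=8, d_model=6)
x = rng.normal(size=(2, 3, 6))   # 2 examples, 3 tokens, 6 features
c = rng.normal(size=(2, 4))      # one conditioning vector per example
y = ada_layer_norm(x, c, params)
print(y.shape)
```

In this sketch <math>\gamma, \beta</math> depend only on <code>c</code>, so two inputs with identical activations but different conditioning are normalized identically and then scaled and shifted differently.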
== Weight normalization ==