Normalization (machine learning): Difference between revisions

Fix typo, general formatting improvements, add wikilinks
Tags: Mobile edit Mobile web edit
Fix typo
Tags: Mobile edit Mobile web edit
Line 95:
* <math>b^{(l)}_c</math> is the bias term for the <math>c</math>-th channel of the <math>l</math>-th layer.
 
To preserve translational invariance, BatchNorm treats all outputs of the same kernel across a batch as additional samples within that batch. That is, it is applied once per ''kernel'' <math>c</math> (equivalently, once per channel <math>c</math>), not once per ''activation'' <math>x^{(l+1)}_{h, w, c}</math>:
 
<math display="block">
\begin{aligned}
\mu^{(l)}_c &= \frac{1}{|B| H W} \sum_{b=1}^{|B|} \sum_{h=1}^{H} \sum_{w=1}^{W} x^{(l)}_{(b), h, w, c} \\
(\sigma^{(l)}_c)^2 &= \frac{1}{|B| H W} \sum_{b=1}^{|B|} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c\right)^2 \\
\hat{x}^{(l)}_{(b), h, w, c} &= \frac{x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}} \\
y^{(l)}_{(b), h, w, c} &= \gamma^{(l)}_c \hat{x}^{(l)}_{(b), h, w, c} + \beta^{(l)}_c
\end{aligned}
</math>
Line 183:
<math display="block">
\begin{aligned}
\mu^{(l)} &= \frac{1}{H W C} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} x^{(l)}_{h, w, c} \\
(\sigma^{(l)})^2 &= \frac{1}{H W C} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \left(x^{(l)}_{h, w, c} - \mu^{(l)}\right)^2 \\
\hat{x}^{(l)}_{h, w, c} &= \frac{x^{(l)}_{h, w, c} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}}
\end{aligned}
</math>
 
Notice that the batch index <math>b</math> is removed, while the channel index <math>c</math> is added.
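The contrast between BatchNorm's per-channel statistics and LayerNorm's per-sample statistics can be sketched in NumPy (tensor shapes and values below are illustrative, not from the article):

```python
import numpy as np

# Hypothetical activation tensor: batch B, height H, width W, channels C.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5, 3))  # (B, H, W, C)
eps = 1e-5

# BatchNorm2D: statistics per channel c, averaged over batch and spatial axes.
mu_c = x.mean(axis=(0, 1, 2))                # shape (C,)
var_c = x.var(axis=(0, 1, 2))                # shape (C,)
x_bn = (x - mu_c) / np.sqrt(var_c + eps)     # broadcasts over (B, H, W)

# LayerNorm on a feature map: statistics per sample, averaged over height,
# width, and channels -- the batch index leaves the average and the channel
# index joins it.
mu = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (B, 1, 1, 1)
var = x.var(axis=(1, 2, 3), keepdims=True)
x_ln = (x - mu) / np.sqrt(var + eps)
```

After either normalization, the averaged-over axes have mean approximately 0 and variance approximately 1; only the choice of axes differs.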
 
In [[recurrent neural network]]s<ref name=":2" /> and [[Transformer (deep learning architecture)|transformers]],<ref>{{cite arXiv |last1=Phuong |first1=Mary |title=Formal Algorithms for Transformers |date=2022-07-19 |eprint=2207.09238 |last2=Hutter |first2=Marcus|class=cs.LG }}</ref> LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep <math>t</math> is <math>x^{(t)} \in \mathbb{R}^{D}</math>, then LayerNorm computes its statistics over the <math>D</math> entries of <math>x^{(t)}</math> alone, independently of all other timesteps.
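Per-timestep application can be sketched as follows (the hidden-state array and its shape are hypothetical):

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    """Normalize one hidden vector across its D features."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

# Hypothetical RNN hidden states: T timesteps, D features each.
h = np.random.default_rng(1).normal(size=(10, 4))  # (T, D)

# LayerNorm is applied to each timestep's vector independently,
# so statistics never mix across time.
h_norm = np.stack([layer_norm(h[t]) for t in range(h.shape[0])])
```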
Line 219:
By reassigning <math>W_i \leftarrow \frac{W_i}{\|W_i\|_s}</math> after each update of the discriminator, we can upper-bound <math>\|W_i\|_s \leq 1</math>, and thus upper-bound <math>\|D \|_L</math>.
 
The algorithm can be further accelerated by [[memoization]]: at step <math>t</math>, store <math>x^*_i(t)</math>; at step <math>t+1</math>, use it as the initial guess for the algorithm. Since <math>W_i(t+1)</math> is very close to <math>W_i(t)</math>, <math>x^*_i(t)</math> is likewise close to <math>x^*_i(t+1)</math>, so the iteration converges rapidly.
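A minimal NumPy sketch of this warm-started power iteration (the function name, shapes, and iteration counts are illustrative, not from the cited work):

```python
import numpy as np

def spectral_normalize(W, u=None, n_iters=1):
    """Estimate the spectral norm of W by power iteration, then rescale W
    so its largest singular value is approximately 1.

    `u` is the memoized singular-vector estimate from the previous training
    step; passing it back in lets very few iterations suffice per step."""
    if u is None:
        u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimate of the spectral norm ||W||_s
    return W / sigma, u        # return u to warm-start the next step

# Illustrative weight matrix; many iterations stand in for a converged u.
W = np.random.default_rng(42).normal(size=(6, 4))
W_sn, u = spectral_normalize(W, n_iters=100)
```

In training, the returned <code>u</code> would be stored alongside the layer and reused at the next update, which is the memoization described above.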
 
== CNN-specific normalization ==