Activation function

Aside from their empirical performance, activation functions also have different mathematical properties:
 
; Nonlinear: When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator.<ref>{{Cite journal|author1-link=George Cybenko|last=Cybenko|first=G.|date=December 1989|title=Approximation by superpositions of a sigmoidal function|journal=Mathematics of Control, Signals, and Systems|language=en|volume=2|issue=4|pages=303–314|doi=10.1007/BF02551274|bibcode=1989MCSS....2..303C |s2cid=3958369|issn=0932-4194|url=https://hal.archives-ouvertes.fr/hal-03753170/file/Cybenko1989.pdf }}</ref> This is known as the [[Universal approximation theorem|Universal Approximation Theorem]]. The identity activation function does not satisfy this property: when multiple layers use the identity activation function, the entire network is equivalent to a single-layer model (as illustrated below the list).
; Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller [[learning rate]]s are typically necessary.{{citation needed|date=January 2016}}
; Continuously differentiable: This property is desirable for enabling gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable and has some issues with gradient-based optimization, but it is still possible). The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it (as illustrated below).<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>
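As a simple illustration of these two points: with the identity activation, two consecutive layers compose into a single affine map, since
<math display="block">W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2),</math>
so the extra layer adds no expressive power without a nonlinearity in between. Conversely, for the binary step function, which outputs 1 for <math>x > 0</math> and 0 otherwise, the derivative is <math>f'(x) = 0</math> for every <math>x \neq 0</math> (and undefined at 0), so the gradients backpropagated through such a unit are zero and gradient descent cannot update the weights feeding it.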
 
=== Other examples ===
Periodic functions can serve as activation functions. Usually the [[Sine wave|sinusoid]] is used, as any periodic function is decomposable into sinusoids by the [[Fourier transform]].<ref>{{Cite journal |last1=Sitzmann |first1=Vincent |last2=Martel |first2=Julien |last3=Bergman |first3=Alexander |last4=Lindell |first4=David |last5=Wetzstein |first5=Gordon |date=2020 |title=Implicit Neural Representations with Periodic Activation Functions |url=https://proceedings.neurips.cc/paper/2020/hash/53c04118df112c13a8c34b38343b9c10-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=7462–7473|arxiv=2006.09661 }}</ref>
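
As a minimal illustration (an informal sketch in Python with [[NumPy]], not code from the cited paper), a fully connected layer with a sinusoidal activation can be written as:

<syntaxhighlight lang="python">
import numpy as np

def sine_layer(x, W, b, omega_0=30.0):
    """Fully connected layer followed by a sinusoidal activation.

    omega_0 is a frequency-scaling factor; 30 is a value commonly used
    in SIREN-style implicit neural representations."""
    return np.sin(omega_0 * (x @ W + b))

# Toy usage: map 2-D coordinates to 16 periodic features.
rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(5, 2))   # five sample points
W = rng.uniform(-1.0, 1.0, size=(2, 16)) / 30  # small random weights
b = np.zeros(16)
print(sine_layer(coords, W, b).shape)          # (5, 16)
</syntaxhighlight>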
 
Quadratic activation maps <math>x \mapsto x^2</math>.<ref>{{Citation |last=Flake |first=Gary William |title=Square Unit Augmented Radially Extended Multilayer Perceptrons |date=1998 |work=Neural Networks: Tricks of the Trade |series=Lecture Notes in Computer Science |volume=1524 |pages=145–163 |editor-last=Orr |editor-first=Genevieve B. |url=https://link.springer.com/chapter/10.1007/3-540-49430-8_8 |access-date=2024-10-05 |place=Berlin, Heidelberg |publisher=Springer |language=en |doi=10.1007/3-540-49430-8_8 |isbn=978-3-540-49430-0 |editor2-last=Müller |editor2-first=Klaus-Robert}}</ref><ref>{{Cite journal |last1=Du |first1=Simon |last2=Lee |first2=Jason |date=2018-07-03 |title=On the Power of Over-parametrization in Neural Networks with Quadratic Activation |url=https://proceedings.mlr.press/v80/du18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1329–1338|arxiv=1803.01206 }}</ref>
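
For instance, ignoring biases, a network with a single hidden layer of quadratic units and a linear output computes
<math display="block">\sum_i a_i \left(w_i^\mathsf{T} x\right)^2 = x^\mathsf{T}\!\left(\sum_i a_i\, w_i w_i^\mathsf{T}\right)\! x,</math>
a quadratic form of the input.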
 
=== Folding activation functions ===
| <math>C^\infty</math>
|-
|Exponential Linear Sigmoid SquasHing (ELiSH)<ref>{{Citation |last1=Basirat |first1=Mina |title=The Quest for the Golden Activation Function |date=2018-08-02 |url=https://arxiv.org/abs/1808.00783v1 |access-date=2024-10-05 |arxiv=1808.00783 |last2=Roth |first2=Peter M.}}</ref>
|
|<math>\begin{cases}
 
== Further reading ==
* {{cite arXiv |last1=Nwankpa |first1=Chigozie |title=Activation Functions: Comparison of trends in Practice and Research for Deep Learning |date=2018-11-08 |eprint=1811.03378 |last2=Ijomah |first2=Winifred |last3=Gachagan |first3=Anthony |last4=Marshall |first4=Stephen |class=cs.LG}}
* {{cite journal |lastlast1=Dubey |firstfirst1=Shiv Ram |last2=Singh |first2=Satish Kumar |last3=Chaudhuri |first3=Bidyut Baran |year=2022 |title=Activation functions in deep learning: A comprehensive survey and benchmark |journal=Neurocomputing |publisher=Elsevier BV |volume=503 |pages=92–108 |doi=10.1016/j.neucom.2022.06.111 |issn=0925-2312 |doi-access=free}}
{{Differentiable computing}}