Activation function

{{About||the formalism used to approximate the influence of an extracellular electrical field on neurons|activating function|a linear system’s transfer function|transfer function}}
{{Machine learning}}
{{Use dmy dates|date=August 2025}}
[[File:Logistic-curve.svg|thumb|Logistic activation function]]
The '''activation function''' of a node in an [[artificial neural network]] is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is ''nonlinear''.<ref>{{Cite web|url=http://didattica.cs.unicam.it/lib/exe/fetch.php?media=didattica:magistrale:kebi:ay_1718:ke-11_neural_networks.pdf|title=Neural Networks, p. 7|last=Hinkelmann|first=Knut|website=University of Applied Sciences Northwestern Switzerland|access-date=6 October 2018|archive-date=6 October 2018|archive-url=https://web.archive.org/web/20181006235506/http://didattica.cs.unicam.it/lib/exe/fetch.php?media=didattica:magistrale:kebi:ay_1718:ke-11_neural_networks.pdf|url-status=dead}}</ref>
 
Modern activation functions include the logistic ([[Sigmoid function|sigmoid]]) function used in the 2012 [[speech recognition]] model developed by [[Geoffrey Hinton|Hinton]] et al.;<ref>{{Cite journal |last1=Hinton |first1=Geoffrey |last2=Deng |first2=Li |last3=Deng |first3=Li |last4=Yu |first4=Dong |last5=Dahl |first5=George |last6=Mohamed |first6=Abdel-rahman |last7=Jaitly |first7=Navdeep |last8=Senior |first8=Andrew |last9=Vanhoucke |first9=Vincent |last10=Nguyen |first10=Patrick |last11=Sainath |first11=Tara|author11-link= Tara Sainath |last12=Kingsbury |first12=Brian |year=2012 |title=Deep Neural Networks for Acoustic Modeling in Speech Recognition |journal=IEEE Signal Processing Magazine |volume=29 |issue=6 |pages=82–97 |doi=10.1109/MSP.2012.2205597|s2cid=206485943 }}</ref> the [[ReLU]] used in the 2012 [[AlexNet]] computer vision model<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |url=https://dl.acm.org/doi/10.1145/3065386 |journal=Communications of the ACM |language=en |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |issn=0001-0782}}</ref><ref>{{Cite journal |last1=King Abdulaziz University |last2=Al-johania |first2=Norah |last3=Elrefaei |first3=Lamiaa |last4=Benha University |date=30 June 2019 |title=Dorsal Hand Vein Recognition by Convolutional Neural Networks: Feature Learning and Transfer Learning Approaches |url=http://www.inass.org/2019/2019063019.pdf |journal=International Journal of Intelligent Engineering and Systems |volume=12 |issue=3 |pages=178–191 |doi=10.22266/ijies2019.0630.19}}</ref> and in the 2015 [[Residual neural network|ResNet]] model; and the smooth version of the ReLU, the [[ReLU#Gaussian-error linear unit (GELU)|GELU]], which was used in the 2018 [[BERT (language model)|BERT]] model.<ref name="ReferenceA">{{Cite arXiv |eprint=1606.08415 |title=Gaussian Error Linear Units (GELUs) |last1=Hendrycks |first1=Dan |last2=Gimpel |first2=Kevin |year=2016 |class=cs.LG}}</ref>
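A minimal sketch of these activations applied to a single node's weighted input, assuming NumPy; the tanh-based GELU approximation and the numeric inputs, weights and bias below are purely illustrative:

<syntaxhighlight lang="python">
import numpy as np

def logistic(x):
    # Sigmoid: maps any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: max(0, x).
    return np.maximum(0.0, x)

def gelu(x):
    # GELU via the widely used tanh approximation of x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# A single node: the activation is applied to the weighted sum of its inputs.
x = np.array([0.5, -1.2, 3.0])   # illustrative inputs
w = np.array([0.4, 0.3, -0.2])   # illustrative weights
b = 0.1                          # illustrative bias
z = w @ x + b
print(logistic(z), relu(z), gelu(z))
</syntaxhighlight>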
 
==Comparison of activation functions==
; Continuously differentiable: This property is desirable for gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable, but gradient-based optimization still works with it in practice). The binary step activation function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it.<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>
 
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the [[softplus]] makes it suitable for predicting variances in [[Autoencoder#Variational autoencoder (VAE)|variational autoencoders]].
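A brief numerical illustration of both points, assuming NumPy (the helper names are specific to this sketch): the binary step provides no usable gradient, while the softplus is strictly positive for every input, as required when predicting a variance.

<syntaxhighlight lang="python">
import numpy as np

def softplus(x):
    # log(1 + e^x): smooth and strictly positive for every real input,
    # so it can parameterize quantities that must be positive, such as variances.
    return np.log1p(np.exp(x))

def binary_step_grad(x):
    # The derivative of the binary step is 0 wherever it is defined,
    # so gradient-based methods receive no learning signal from it.
    return np.zeros_like(x)

xs = np.linspace(-5.0, 5.0, 5)
print(softplus(xs))          # every value is > 0
print(binary_step_grad(xs))  # all zeros: no gradient to follow
</syntaxhighlight>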
 
== Mathematical details ==
The most common activation functions can be divided into three categories: [[ridge function]]s, [[radial function]]s and [[fold function]]s.
 
An activation function <math>f</math> is '''saturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| = 0</math>. It is '''nonsaturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| \neq 0</math>. Non-saturating activation functions, such as [[ReLU]], may be better than saturating activation functions, because they are less likely to suffer from the [[vanishing gradient problem]].<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |s2cid=195908774 |issn=0001-0782|doi-access=free }}</ref>
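A small numerical sketch of this distinction, assuming NumPy: the sigmoid's derivative shrinks towards zero for large inputs, whereas the ReLU's derivative stays at one on the positive side.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid_grad(x):
    # The sigmoid saturates: its derivative s(x) * (1 - s(x)) tends to 0 as |x| grows.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # The ReLU does not saturate for positive inputs: its derivative is 1 for x > 0.
    return (x > 0).astype(float)

xs = np.array([1.0, 10.0, 100.0])
print(sigmoid_grad(xs))  # shrinks rapidly towards 0
print(relu_grad(xs))     # stays at 1
</syntaxhighlight>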
 
=== Ridge activation functions ===
* [[logistic function|Logistic]] activation: <math>\phi(\mathbf v) = (1+\exp(-a-\mathbf v'\mathbf b))^{-1}</math>.
 
In [[Biological neural network|biologically inspired neural networks]], the activation function is usually an abstraction representing the rate of [[action potential]] firing in the cell.<ref>{{Cite journal|last1=Hodgkin|first1=A. L.|last2=Huxley|first2=A. F.|date=28 August 1952|title=A quantitative description of membrane current and its application to conduction and excitation in nerve |journal=The Journal of Physiology|volume=117|issue=4|pages=500–544 |pmc=1392413|pmid=12991237|doi=10.1113/jphysiol.1952.sp004764}}</ref> In its simplest form, this function is [[Binary function|binary]]—that is, either the [[neuron]] is firing or not. Neurons also cannot fire faster than a certain rate, motivating [[sigmoid function|sigmoid]] activation functions whose range is a finite interval.
 
The function looks like <math>\phi(\mathbf v)=U(a + \mathbf v'\mathbf b)</math>, where <math>U</math> is the [[Heaviside step function]].
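A minimal sketch of these two ridge activations, assuming NumPy; the input vector, weight vector and bias below are purely illustrative:

<syntaxhighlight lang="python">
import numpy as np

def heaviside_ridge(v, a, b):
    # Binary ridge activation U(a + v'b), with U the Heaviside step function
    # (taken to be 1 at 0 here; the value at 0 is a convention).
    return np.heaviside(a + v @ b, 1.0)

def logistic_ridge(v, a, b):
    # Logistic ridge activation (1 + exp(-a - v'b))^(-1).
    return 1.0 / (1.0 + np.exp(-a - v @ b))

v = np.array([0.2, -0.7, 1.5])   # illustrative input vector
b = np.array([1.0, 0.5, -0.3])   # illustrative weight vector
a = 0.1                          # illustrative bias
print(heaviside_ridge(v, a, b), logistic_ridge(v, a, b))
</syntaxhighlight>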
Periodic functions can serve as activation functions. Usually the [[Sine wave|sinusoid]] is used, as any periodic function is decomposable into sinusoids by the [[Fourier transform]].<ref>{{Cite journal |last1=Sitzmann |first1=Vincent |last2=Martel |first2=Julien |last3=Bergman |first3=Alexander |last4=Lindell |first4=David |last5=Wetzstein |first5=Gordon |date=2020 |title=Implicit Neural Representations with Periodic Activation Functions |url=https://proceedings.neurips.cc/paper/2020/hash/53c04118df112c13a8c34b38343b9c10-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=7462–7473|arxiv=2006.09661 }}</ref>
 
Quadratic activation maps <math>x \mapsto x^2</math>.<ref>{{Citation |last=Flake |first=Gary William |title=Square Unit Augmented Radially Extended Multilayer Perceptrons |date=1998 |work=Neural Networks: Tricks of the Trade |series=Lecture Notes in Computer Science |volume=1524 |pages=145–163 |editor-last=Orr |editor-first=Genevieve B. |url=https://link.springer.com/chapter/10.1007/3-540-49430-8_8 |access-date=5 October 2024 |place=Berlin, Heidelberg |publisher=Springer |language=en |doi=10.1007/3-540-49430-8_8 |isbn=978-3-540-49430-0 |editor2-last=Müller |editor2-first=Klaus-Robert|url-access=subscription }}</ref><ref>{{Cite journal |last1=Du |first1=Simon |last2=Lee |first2=Jason |date=3 July 2018 |title=On the Power of Over-parametrization in Neural Networks with Quadratic Activation |url=https://proceedings.mlr.press/v80/du18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1329–1338|arxiv=1803.01206 }}</ref>
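A short sketch of the periodic and quadratic activations, assuming NumPy; the frequency scale omega is an illustrative modelling choice rather than part of the definition:

<syntaxhighlight lang="python">
import numpy as np

def sine_activation(x, omega=1.0):
    # Periodic (sinusoidal) activation; omega rescales the input frequency.
    return np.sin(omega * x)

def quadratic_activation(x):
    # Quadratic activation: x -> x^2.
    return x ** 2

xs = np.linspace(-2.0, 2.0, 5)
print(sine_activation(xs))
print(quadratic_activation(xs))
</syntaxhighlight>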
 
=== Folding activation functions ===
{{Main|Fold function}}
Folding activation functions are extensively used in the [[Pooling layer|pooling layers]] in [[convolutional neural network]]s, and in output layers of [[multiclass classification]] networks. These activations perform [[Aggregate function|aggregation]] over the inputs, such as taking the [[mean]], [[minimum]] or [[maximum]]. In multiclass classification the [[Softmax function|softmax]] activation is often used.
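A minimal sketch of two folding activations, assuming NumPy: a numerically stable softmax, which aggregates a vector of scores into class probabilities, and a max aggregation as used in max pooling.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtracting the maximum avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def max_pool(z):
    # Folding activation: collapses all inputs into a single value.
    return np.max(z)

z = np.array([2.0, 1.0, 0.1])   # illustrative class scores / pooling window
print(softmax(z))   # non-negative and sums to 1
print(max_pool(z))
</syntaxhighlight>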
 
=== Table of activation functions ===
| <math>C^\infty</math>
|-
| [[Rectifier (neural networks)#ELU|Exponential linear unit (ELU)]]<ref>{{Cite arXiv|last1=Clevert|first1=Djork-Arné|last2=Unterthiner|first2=Thomas|last3=Hochreiter|first3=Sepp|date=23 November 2015|title=Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)|eprint=1511.07289 |class=cs.LG}}</ref>
| [[File:Activation elu.svg]]
| <math>\begin{cases}
\end{cases}</math>
|-
| Scaled exponential linear unit (SELU)<ref>{{Cite journal |last1=Klambauer |first1=Günter |last2=Unterthiner |first2=Thomas |last3=Mayr |first3=Andreas |last4=Hochreiter |first4=Sepp |date=8 June 2017 |title=Self-Normalizing Neural Networks |journal=Advances in Neural Information Processing Systems |volume=30 |issue=2017 |arxiv=1706.02515}}</ref>
| [[File:Activation selu.png]]
| <math>\lambda \begin{cases}
| <math>C^0</math>
|-
| Parametric rectified linear unit (PReLU)<ref>{{Cite arXiv |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=6 February 2015 |title=Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification |eprint=1502.01852 |class=cs.CV}}</ref>
| [[File:Activation prelu.svg]]
| <math>\begin{cases}
| <math>C^\infty</math>
|-
|Exponential Linear Sigmoid SquasHing (ELiSH)<ref>{{Citation |last1=Basirat |first1=Mina |title=The Quest for the Golden Activation Function |date=2 August 2018 |arxiv=1808.00783 |last2=Roth |first2=Peter M.}}</ref>
|[[File:Elish_activation_function.png|thumb|The ELiSH activation function plotted over the range [-3, 3], with a minimum value of approximately -0.172 at x ≈ -0.881]]
|<math>\begin{cases}
 
== Further reading ==
* {{Citation |last1=Kunc |first1=Vladimír |title=Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks |date=14 February 2024 |arxiv=2402.09092 |last2=Kléma |first2=Jiří}}
* {{cite arXiv |last1=Nwankpa |first1=Chigozie |title=Activation Functions: Comparison of trends in Practice and Research for Deep Learning |date=8 November 2018 |eprint=1811.03378 |last2=Ijomah |first2=Winifred |last3=Gachagan |first3=Anthony |last4=Marshall |first4=Stephen|class=cs.LG }}
* {{cite journal |last1=Dubey |first1=Shiv Ram |last2=Singh |first2=Satish Kumar |last3=Chaudhuri |first3=Bidyut Baran |year=2022 |title=Activation functions in deep learning: A comprehensive survey and benchmark |journal=Neurocomputing |publisher=Elsevier BV |volume=503 |pages=92–108 |doi=10.1016/j.neucom.2022.06.111 |issn=0925-2312 |doi-access=free|arxiv=2109.14545 }}