Activation function

{{About||the formalism used to approximate the influence of an extracellular electrical field on neurons|activating function|a linear system’s transfer function|transfer function}}
{{Machine learning}}
{{Use dmy dates|date=August 2025}}
[[File:Logistic-curve.svg|thumb|Logistic activation function]]
The '''activation function''' of a node in an [[artificial neural network]] is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is ''nonlinear''.<ref>{{Cite web|url=http://didattica.cs.unicam.it/lib/exe/fetch.php?media=didattica:magistrale:kebi:ay_1718:ke-11_neural_networks.pdf|title=Neural Networks, p. 7|last=Hinkelmann|first=Knut|website=University of Applied Sciences Northwestern Switzerland|access-date=6 October 2018|archive-date=6 October 2018|archive-url=https://web.archive.org/web/20181006235506/http://didattica.cs.unicam.it/lib/exe/fetch.php?media=didattica:magistrale:kebi:ay_1718:ke-11_neural_networks.pdf|url-status=dead}}</ref>
 
Modern activation functions include the logistic ([[Sigmoid function|sigmoid]]) function used in the 2012 [[speech recognition]] model developed by [[Geoffrey Hinton|Hinton]] et al.;<ref>{{Cite journal |last1=Hinton |first1=Geoffrey |last2=Deng |first2=Li |last3=Deng |first3=Li |last4=Yu |first4=Dong |last5=Dahl |first5=George |last6=Mohamed |first6=Abdel-rahman |last7=Jaitly |first7=Navdeep |last8=Senior |first8=Andrew |last9=Vanhoucke |first9=Vincent |last10=Nguyen |first10=Patrick |last11=Sainath |first11=Tara|author11-link= Tara Sainath |last12=Kingsbury |first12=Brian |year=2012 |title=Deep Neural Networks for Acoustic Modeling in Speech Recognition |journal=IEEE Signal Processing Magazine |volume=29 |issue=6 |pages=82–97 |doi=10.1109/MSP.2012.2205597|s2cid=206485943 }}</ref> the [[ReLU]] used in the 2012 [[AlexNet]] computer vision model<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |url=https://dl.acm.org/doi/10.1145/3065386 |journal=Communications of the ACM |language=en |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |issn=0001-0782}}</ref><ref>{{Cite journal |last1=King Abdulaziz University |last2=Al-johania |first2=Norah |last3=Elrefaei |first3=Lamiaa |last4=Benha University |date=30 June 2019 |title=Dorsal Hand Vein Recognition by Convolutional Neural Networks: Feature Learning and Transfer Learning Approaches |url=http://www.inass.org/2019/2019063019.pdf |journal=International Journal of Intelligent Engineering and Systems |volume=12 |issue=3 |pages=178–191 |doi=10.22266/ijies2019.0630.19}}</ref> and in the 2015 [[Residual neural network|ResNet]] model; and the smooth version of the ReLU, the [[ReLU#Gaussian-error linear unit (GELU)|GELU]], which was used in the 2018 [[BERT (language model)|BERT]] model.<ref name="ReferenceA">{{Cite arXiv |eprint=1606.08415 |title=Gaussian Error Linear Units (GELUs) |last1=Hendrycks |first1=Dan |last2=Gimpel |first2=Kevin |year=2016 |class=cs.LG}}</ref>
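A minimal sketch of these activations applied to a single node's weighted input, assuming NumPy; the tanh-based GELU approximation and the numeric inputs, weights and bias below are purely illustrative:

<syntaxhighlight lang="python">
import numpy as np

def logistic(x):
    # Sigmoid: maps any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: max(0, x).
    return np.maximum(0.0, x)

def gelu(x):
    # GELU via the widely used tanh approximation of x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# A single node: the activation is applied to the weighted sum of its inputs.
x = np.array([0.5, -1.2, 3.0])   # illustrative inputs
w = np.array([0.4, 0.3, -0.2])   # illustrative weights
b = 0.1                          # illustrative bias
z = w @ x + b
print(logistic(z), relu(z), gelu(z))
</syntaxhighlight>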
 
==Comparison of activation functions==
; Continuously differentiable: This property is desirable for gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable, but gradient-based optimization still works with it in practice). The binary step activation function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it.<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>
 
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the [[softplus]] makes it suitable for predicting variances in [[Autoencoder#Variational autoencoder (VAE)|variational autoencoders]].
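A brief numerical illustration of both points, assuming NumPy (the helper names are specific to this sketch): the binary step provides no usable gradient, while the softplus is strictly positive for every input, as required when predicting a variance.

<syntaxhighlight lang="python">
import numpy as np

def softplus(x):
    # log(1 + e^x): smooth and strictly positive for every real input,
    # so it can parameterize quantities that must be positive, such as variances.
    return np.log1p(np.exp(x))

def binary_step_grad(x):
    # The derivative of the binary step is 0 wherever it is defined,
    # so gradient-based methods receive no learning signal from it.
    return np.zeros_like(x)

xs = np.linspace(-5.0, 5.0, 5)
print(softplus(xs))          # every value is > 0
print(binary_step_grad(xs))  # all zeros: no gradient to follow
</syntaxhighlight>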
 
== Mathematical details ==
The most common activation functions can be divided into three categories: [[ridge function]]s, [[radial function]]s and [[fold function]]s.
 
An activation function <math>f</math> is '''saturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| = 0</math>. It is '''nonsaturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| \neq 0</math>. Non-saturating activation functions, such as [[ReLU]], may be better than saturating activation functions, because they are less likely to suffer from the [[vanishing gradient problem]].<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |s2cid=195908774 |issn=0001-0782|doi-access=free }}</ref>
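A small numerical sketch of this distinction, assuming NumPy: the sigmoid's derivative shrinks towards zero for large inputs, whereas the ReLU's derivative stays at one on the positive side.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid_grad(x):
    # The sigmoid saturates: its derivative s(x) * (1 - s(x)) tends to 0 as |x| grows.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # The ReLU does not saturate for positive inputs: its derivative is 1 for x > 0.
    return (x > 0).astype(float)

xs = np.array([1.0, 10.0, 100.0])
print(sigmoid_grad(xs))  # shrinks rapidly towards 0
print(relu_grad(xs))     # stays at 1
</syntaxhighlight>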
 
=== Ridge activation functions ===
* [[logistic function|Logistic]] activation: <math>\phi(\mathbf v) = (1+\exp(-a-\mathbf v'\mathbf b))^{-1}</math>.
 
In [[Biological neural network|biologically inspired neural networks]], the activation function is usually an abstraction representing the rate of [[action potential]] firing in the cell.<ref>{{Cite journal|last1=Hodgkin|first1=A. L.|last2=Huxley|first2=A. F.|date=28 August 1952|title=A quantitative description of membrane current and its application to conduction and excitation in nerve |journal=The Journal of Physiology|volume=117|issue=4|pages=500–544 |pmc=1392413|pmid=12991237|doi=10.1113/jphysiol.1952.sp004764}}</ref> In its simplest form, this function is [[Binary function|binary]]—that is, either the [[neuron]] is firing or not. Neurons also cannot fire faster than a certain rate, motivating [[sigmoid function|sigmoid]] activation functions whose range is a finite interval.
 
The function looks like <math>\phi(\mathbf v)=U(a + \mathbf v'\mathbf b)</math>, where <math>U</math> is the [[Heaviside step function]].
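A minimal sketch of these two ridge activations, assuming NumPy; the input vector, weight vector and bias below are purely illustrative:

<syntaxhighlight lang="python">
import numpy as np

def heaviside_ridge(v, a, b):
    # Binary ridge activation U(a + v'b), with U the Heaviside step function
    # (taken to be 1 at 0 here; the value at 0 is a convention).
    return np.heaviside(a + v @ b, 1.0)

def logistic_ridge(v, a, b):
    # Logistic ridge activation (1 + exp(-a - v'b))^(-1).
    return 1.0 / (1.0 + np.exp(-a - v @ b))

v = np.array([0.2, -0.7, 1.5])   # illustrative input vector
b = np.array([1.0, 0.5, -0.3])   # illustrative weight vector
a = 0.1                          # illustrative bias
print(heaviside_ridge(v, a, b), logistic_ridge(v, a, b))
</syntaxhighlight>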
Periodic functions can serve as activation functions. Usually the [[Sine wave|sinusoid]] is used, as any periodic function is decomposable into sinusoids by the [[Fourier transform]].<ref>{{Cite journal |last1=Sitzmann |first1=Vincent |last2=Martel |first2=Julien |last3=Bergman |first3=Alexander |last4=Lindell |first4=David |last5=Wetzstein |first5=Gordon |date=2020 |title=Implicit Neural Representations with Periodic Activation Functions |url=https://proceedings.neurips.cc/paper/2020/hash/53c04118df112c13a8c34b38343b9c10-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=7462–7473|arxiv=2006.09661 }}</ref>
 
Quadratic activation maps <math>x \mapsto x^2</math>.<ref>{{Citation |last=Flake |first=Gary William |title=Square Unit Augmented Radially Extended Multilayer Perceptrons |date=1998 |work=Neural Networks: Tricks of the Trade |series=Lecture Notes in Computer Science |volume=1524 |pages=145–163 |editor-last=Orr |editor-first=Genevieve B. |url=https://link.springer.com/chapter/10.1007/3-540-49430-8_8 |access-date=5 October 2024 |place=Berlin, Heidelberg |publisher=Springer |language=en |doi=10.1007/3-540-49430-8_8 |isbn=978-3-540-49430-0 |editor2-last=Müller |editor2-first=Klaus-Robert|url-access=subscription }}</ref><ref>{{Cite journal |last1=Du |first1=Simon |last2=Lee |first2=Jason |date=3 July 2018 |title=On the Power of Over-parametrization in Neural Networks with Quadratic Activation |url=https://proceedings.mlr.press/v80/du18a.html |journal=Proceedings of the 35th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1329–1338|arxiv=1803.01206 }}</ref>
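A short sketch of the periodic and quadratic activations, assuming NumPy; the frequency scale omega is an illustrative modelling choice rather than part of the definition:

<syntaxhighlight lang="python">
import numpy as np

def sine_activation(x, omega=1.0):
    # Periodic (sinusoidal) activation; omega rescales the input frequency.
    return np.sin(omega * x)

def quadratic_activation(x):
    # Quadratic activation: x -> x^2.
    return x ** 2

xs = np.linspace(-2.0, 2.0, 5)
print(sine_activation(xs))
print(quadratic_activation(xs))
</syntaxhighlight>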
 
=== Folding activation functions ===
{{Main|Fold function}}
Folding activation functions are extensively used in the [[Pooling layer|pooling layers]] in [[convolutional neural network]]s, and in output layers of [[multiclass classification]] networks. These activations perform [[Aggregate function|aggregation]] over the inputs, such as taking the [[mean]], [[minimum]] or [[maximum]]. In multiclass classification the [[Softmax function|softmax]] activation is often used.
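A minimal sketch of two folding activations, assuming NumPy: a numerically stable softmax, which aggregates a vector of scores into class probabilities, and a max aggregation as used in max pooling.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtracting the maximum avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def max_pool(z):
    # Folding activation: collapses all inputs into a single value.
    return np.max(z)

z = np.array([2.0, 1.0, 0.1])   # illustrative class scores / pooling window
print(softmax(z))   # non-negative and sums to 1
print(max_pool(z))
</syntaxhighlight>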
 
=== Table of activation functions ===
| <math>C^\infty</math>
|-
| [[Rectifier (neural networks)#ELU|Exponential linear unit (ELU)]]<ref>{{Cite arXiv|last1=Clevert|first1=Djork-Arné|last2=Unterthiner|first2=Thomas|last3=Hochreiter|first3=Sepp|date=23 November 2015|title=Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)|eprint=1511.07289 |class=cs.LG}}</ref>
| [[File:Activation elu.svg]]
| <math>\begin{cases}
\end{cases}</math>
|-
| Scaled exponential linear unit (SELU)<ref>{{Cite journal |last1=Klambauer |first1=Günter |last2=Unterthiner |first2=Thomas |last3=Mayr |first3=Andreas |last4=Hochreiter |first4=Sepp |date=8 June 2017 |title=Self-Normalizing Neural Networks |journal=Advances in Neural Information Processing Systems |volume=30 |issue=2017 |arxiv=1706.02515}}</ref>
| [[File:Activation selu.png]]
| <math>\lambda \begin{cases}
| <math>C^0</math>
|-
| Parametric rectified linear unit (PReLU)<ref>{{Cite arXiv |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=6 February 2015 |title=Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification |eprint=1502.01852 |class=cs.CV}}</ref>
| [[File:Activation prelu.svg]]
| <math>\begin{cases}
| <math>C^\infty</math>
|-
|Exponential Linear Sigmoid SquasHing (ELiSH)<ref>{{Citation |last1=Basirat |first1=Mina |title=The Quest for the Golden Activation Function |date=2 August 2018 |arxiv=1808.00783 |last2=Roth |first2=Peter M.}}</ref>
|[[File:Elish_activation_function.png|thumb|The ELiSH activation function plotted over the range [-3, 3], with a minimum value of approximately -0.172 at x ≈ -0.881]]
|<math>\begin{cases}
 
== Further reading ==
* {{Citation |last1=Kunc |first1=Vladimír |title=Three Decades of Activations: A Comprehensive Survey of 400 Activation Functions for Neural Networks |date=14 February 2024 |arxiv=2402.09092 |last2=Kléma |first2=Jiří}}
* {{cite arXiv |last1=Nwankpa |first1=Chigozie |title=Activation Functions: Comparison of trends in Practice and Research for Deep Learning |date=8 November 2018 |eprint=1811.03378 |last2=Ijomah |first2=Winifred |last3=Gachagan |first3=Anthony |last4=Marshall |first4=Stephen|class=cs.LG }}
* {{cite journal |last1=Dubey |first1=Shiv Ram |last2=Singh |first2=Satish Kumar |last3=Chaudhuri |first3=Bidyut Baran |year=2022 |title=Activation functions in deep learning: A comprehensive survey and benchmark |journal=Neurocomputing |publisher=Elsevier BV |volume=503 |pages=92–108 |doi=10.1016/j.neucom.2022.06.111 |issn=0925-2312 |doi-access=free|arxiv=2109.14545 }}