Activation function

; Continuously differentiable: This property is desirable for gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable, which causes some difficulties for gradient-based optimization, though optimization remains possible). The binary step activation function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it, as illustrated below.<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>
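For illustration, the binary step function and its derivative (where the derivative exists) are
<math display=block>f(v) = \begin{cases} 0 & v < 0 \\ 1 & v \ge 0, \end{cases} \qquad f'(v) = 0 \text{ for all } v \neq 0,</math>
so the factor that the activation contributes to a chain-rule gradient is zero almost everywhere, and a gradient-based update receives no learning signal through it.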
 
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the [[softplus]] makes it suitable for predicting variances in [[Autoencoder#Variational autoencoder (VAE)|variational autoencoders]].
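For example, the softplus is strictly positive on its entire domain,
<math display=block>\operatorname{softplus}(x) = \ln\!\left(1 + e^{x}\right) > 0 \quad \text{for all } x \in \mathbb{R},</math>
so an output unit passed through it always yields a valid (strictly positive) variance parameter.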
 
== Mathematical details ==
The most common activation functions can be divided into three categories: [[ridge function]]s, [[radial function]]s and [[fold function]]s.
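Schematically, writing <math>\phi</math> for a scalar function, <math>\mathbf{a}</math> and <math>\mathbf{c}</math> for parameter vectors and <math>b</math> for a scalar, representatives of the three categories act on an input vector <math>\mathbf{x}</math> as
<math display=block>\text{ridge: } \phi(\mathbf{a}^\mathsf{T}\mathbf{x} + b), \qquad \text{radial: } \phi(\lVert \mathbf{x} - \mathbf{c} \rVert), \qquad \text{fold: } \max_i x_i \text{ (or another aggregation of the } x_i\text{)}.</math>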
 
An activation function <math>f</math> is '''saturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| = 0</math>. It is '''nonsaturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| \neq 0</math>. Nonsaturating activation functions, such as [[ReLU]], may be better than saturating activation functions because they are less likely to suffer from the [[vanishing gradient problem]].<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |s2cid=195908774 |issn=0001-0782|doi-access=free }}</ref>
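For instance, the logistic sigmoid saturates in both directions while the ReLU does not:
<math display=block>\sigma(v) = \frac{1}{1+e^{-v}}, \qquad \sigma'(v) = \sigma(v)\bigl(1-\sigma(v)\bigr) \to 0 \text{ as } |v| \to \infty, \qquad \text{whereas } \frac{d}{dv}\max(0,v) = 1 \text{ for every } v > 0.</math>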
 
=== Ridge activation functions ===
=== Folding activation functions ===
{{Main|Fold function}}
Folding activation functions are extensively used in the [[Pooling layer|pooling layers]] in [[convolutional neural network]]s, and in output layers of [[multiclass classification]] networks. These activations perform [[Aggregate function|aggregation]] over the inputs, such as taking the [[mean]], [[minimum]] or [[maximum]]. In multiclass classification the [[Softmax function|softmax]] activation is often used.
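For example, the softmax folds a vector of <math>K</math> real-valued scores <math>\mathbf{x} = (x_1, \ldots, x_K)</math> into a probability distribution over the <math>K</math> classes:
<math display=block>\operatorname{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \ldots, K.</math>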
 
=== Table of activation functions ===