Activation function

; Continuously differentiable: This property is desirable for gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable, which causes some difficulties for gradient-based optimization, though optimization remains possible). The binary step activation function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient-based methods can make no progress with it, as illustrated below.<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>
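For illustration, the binary step function and its derivative (where the derivative exists) are
<math display=block>f(v) = \begin{cases} 0 & v < 0 \\ 1 & v \ge 0, \end{cases} \qquad f'(v) = 0 \text{ for all } v \neq 0,</math>
so the factor that the activation contributes to a chain-rule gradient is zero almost everywhere, and a gradient-based update receives no learning signal through it.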
 
These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the [[softplus]] makes it suitable for predicting variances in [[Autoencoder#Variational autoencoder (VAE)|variational autoencoders]].
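For example, the softplus is strictly positive on its entire domain,
<math display=block>\operatorname{softplus}(x) = \ln\!\left(1 + e^{x}\right) > 0 \quad \text{for all } x \in \mathbb{R},</math>
so an output unit passed through it always yields a valid (strictly positive) variance parameter.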
 
== Mathematical details ==
The most common activation functions can be divided into three categories: [[ridge function]]s, [[radial function]]s and [[fold function]]s.
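Schematically, writing <math>\phi</math> for a scalar function, <math>\mathbf{a}</math> and <math>\mathbf{c}</math> for parameter vectors and <math>b</math> for a scalar, representatives of the three categories act on an input vector <math>\mathbf{x}</math> as
<math display=block>\text{ridge: } \phi(\mathbf{a}^\mathsf{T}\mathbf{x} + b), \qquad \text{radial: } \phi(\lVert \mathbf{x} - \mathbf{c} \rVert), \qquad \text{fold: } \max_i x_i \text{ (or another aggregation of the } x_i\text{)}.</math>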
 
An activation function <math>f</math> is '''saturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| = 0</math>. It is '''nonsaturating''' if <math>\lim_{|v|\to \infty} |\nabla f(v)| \neq 0</math>. Nonsaturating activation functions, such as [[ReLU]], may be better than saturating activation functions because they are less likely to suffer from the [[vanishing gradient problem]].<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=24 May 2017 |title=ImageNet classification with deep convolutional neural networks |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |s2cid=195908774 |issn=0001-0782|doi-access=free }}</ref>
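For instance, the logistic sigmoid saturates in both directions while the ReLU does not:
<math display=block>\sigma(v) = \frac{1}{1+e^{-v}}, \qquad \sigma'(v) = \sigma(v)\bigl(1-\sigma(v)\bigr) \to 0 \text{ as } |v| \to \infty, \qquad \text{whereas } \frac{d}{dv}\max(0,v) = 1 \text{ for every } v > 0.</math>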
 
=== Ridge activation functions ===
=== Folding activation functions ===
{{Main|Fold function}}
Folding activation functions are extensively used in the [[Pooling layer|pooling layers]] in [[convolutional neural network]]s, and in output layers of [[multiclass classification]] networks. These activations perform [[Aggregate function|aggregation]] over the inputs, such as taking the [[mean]], [[minimum]] or [[maximum]]. In multiclass classification the [[Softmax function|softmax]] activation is often used.
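For example, the softmax folds a vector of <math>K</math> real-valued scores <math>\mathbf{x} = (x_1, \ldots, x_K)</math> into a probability distribution over the <math>K</math> classes:
<math display=block>\operatorname{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \ldots, K.</math>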
 
=== Table of activation functions ===