Revision as of 14:55, 15 August 2025 by Hooman Mallahzadeh (edit summary: Linking.)
Revision as of 06:37, 18 August 2025 by Hooman Mallahzadeh (edit summary: →Comparison of activation functions)
Line 12:
; Nonlinear: When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator.<ref>{{Cite journal|author1-link=George Cybenko|last=Cybenko|first=G.|date=December 1989|title=Approximation by superpositions of a sigmoidal function|journal=Mathematics of Control, Signals, and Systems|language=en|volume=2|issue=4|pages=303–314|doi=10.1007/BF02551274|bibcode=1989MCSS....2..303C |s2cid=3958369|issn=0932-4194|url=https://hal.archives-ouvertes.fr/hal-03753170/file/Cybenko1989.pdf }}</ref> This is known as the [[Universal approximation theorem|Universal Approximation Theorem]]. The identity activation function does not satisfy this property: when multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
; Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only a limited number of weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller [[learning rate]]s are typically necessary.{{citation needed|date=January 2016}}
; Continuously differentiable: This property is desirable for enabling gradient-based optimization methods ([[Rectifier (neural networks)|ReLU]] is not continuously differentiable and has some issues with gradient-based optimization, but training with it is still possible). The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it.<ref>{{cite book|url={{google books |plainurl=y |id=0tFmf_UKl7oC}}|title=Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms|last=Snyman|first=Jan|date=3 March 2005|publisher=Springer Science & Business Media|isbn=978-0-387-24348-1}}</ref>

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in [[Autoencoder#Variational autoencoder (VAE)|variational autoencoders]].
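
Three of the properties above can be checked numerically. The following Python snippet is an illustrative sketch only and is not taken from the cited sources: the scalar weights, the helper names (<code>two_layer_identity</code>, <code>collapsed</code>, <code>numerical_derivative</code>) and the central-difference step size are all assumptions chosen for the example. It shows that two identity-activated layers reduce to a single affine map, that the binary step has zero slope away from 0, and that softplus returns positive values.

```python
import math


def identity(x):
    return x


def binary_step(x):
    # Convention assumed here: the step outputs 1 at x = 0.
    return 1.0 if x >= 0 else 0.0


def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    # Mathematically > 0 for every real x; in floating point the result
    # can underflow to 0.0 for extremely negative inputs.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def numerical_derivative(f, x, h=1e-6):
    # Central-difference estimate of f'(x); h is an arbitrary small step.
    return (f(x + h) - f(x - h)) / (2 * h)


# Arbitrary scalar weights and biases for a two-layer toy network.
w1, b1 = 2.0, -1.0
w2, b2 = 0.5, 3.0


def two_layer_identity(x):
    # Both layers use the identity activation.
    return identity(w2 * identity(w1 * x + b1) + b2)


def collapsed(x):
    # The same map written as one affine layer:
    # weight w2*w1, bias w2*b1 + b2.
    return (w2 * w1) * x + (w2 * b1 + b2)
```

For any input, <code>two_layer_identity(x)</code> and <code>collapsed(x)</code> agree up to rounding, which is the single-layer equivalence described under ''Nonlinear''. Calling <code>numerical_derivative(binary_step, x)</code> at any point whose distance from 0 exceeds the step size returns 0, so a gradient step would leave the weights unchanged, while the estimate at 0 itself grows without bound as the step size shrinks.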

Activation function: Difference between revisions