Rectified linear unit

{{Short description|Type of activation function}}
{{Machine learning}}
[[Image:ReLU_and_GELU.svg|thumb|Plot of the ReLU (blue) and [[#Gaussian-error linear unit (GELU)|GELU]] (green) functions near ''x'' = 0]]
In the context of [[Neural network (machine learning)|artificial neural networks]], the '''rectifier''' or '''ReLU (rectified linear unit) activation function'''<ref>{{cite web |last1=Brownlee |first1=Jason |title=A Gentle Introduction to the Rectified Linear Unit (ReLU) |url=https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/ |website=Machine Learning Mastery |access-date=8 April 2021 |date=8 January 2019}}</ref><ref>{{cite web |last1=Liu |first1=Danqing |title=A Practical Guide to ReLU |url=https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7 |website=Medium |access-date=8 April 2021 |language=en |date=30 November 2017}}</ref> is an [[activation function]] defined as the non-negative part of its argument, i.e., the [[ramp function]]:
 
:<math>\operatorname{ReLU}(x) = x^+ = \max(0, x) = \frac{x+|x|}{2} = \begin{cases}
x & \text{if } x > 0, \\
0 & x \le 0
\end{cases}</math>
 
where <math>x</math> is the input to a [[Artificial neuron|neuron]]. This is analogous to [[half-wave rectification]] in [[electrical engineering]].
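
For illustration, the rectifier and a conventional choice of its derivative can be written directly in [[NumPy]]. This is a minimal sketch, not tied to any particular deep-learning framework; the function names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def relu(x):
    """Rectified linear unit: elementwise max(0, x)."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """Derivative of ReLU; the value at x == 0 is chosen here to be 0."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # -> 0, 0, 0, 0.5, 2.0
print(relu_derivative(x))  # -> 0, 0, 0, 1, 1
</syntaxhighlight>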
 
ReLU is one of the most popular activation functions for artificial neural networks,<ref>{{cite arXiv |last1=Ramachandran |first1=Prajit |last2=Barret |first2=Zoph |last3=Quoc |first3=V. Le |date=October 16, 2017 |title=Searching for Activation Functions |eprint=1710.05941 |class=cs.NE}}</ref> and finds application in [[computer vision]]<ref name="Yoshua Bengio-2011">{{cite conference |author1=Xavier Glorot |author2=Antoine Bordes |author3=[[Yoshua Bengio]] |year=2011 |title=Deep sparse rectifier neural networks |url=https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf |conference=AISTATS |quote=Rectifier and softplus activation functions. The second one is a smooth version of the first.}}</ref> and [[speech recognition]]<ref>{{cite conference |author=László Tóth |year=2013 |title=Phone Recognition with Deep Sparse Rectifier Neural Networks |conference=[[International Conference on Acoustics, Speech and Signal Processing|ICASSP]] |url=http://www.inf.u-szeged.hu/~tothl/pubs/ICASSP2013.pdf}}</ref><ref name="Andrew L">Andrew L. Maas, Awni Y. Hannun, Andrew Y. Ng (2014). [https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf Rectifier Nonlinearities Improve Neural Network Acoustic Models].</ref> using [[Deep learning|deep neural nets]] and [[computational neuroscience]].<ref>{{cite journal |first1=D. |last1=Hansel |first2=C. |last2=van Vreeswijk |title=How noise contributes to contrast invariance of orientation tuning in cat visual cortex |journal=[[J. Neurosci.]] |volume=22 |issue= 12|year=2002 |pages=5118–5128 |doi=10.1523/JNEUROSCI.22-12-05118.2002 |pmid= 12077207 |pmc=6757721 }}</ref><ref>{{Cite journal |doi = 10.1103/PhysRevX.5.041030 |volume = 5 |issue = 4 |page = 041030 |last1 = Kadmon |first1 = Jonathan |last2 = Sompolinsky |first2 = Haim |title = Transition to Chaos in Random Neuronal Networks |journal = Physical Review X |date = 2015-11-19 |arxiv = 1508.06486 |bibcode = 2015PhRvX...5d1030K |s2cid = 7813832}}</ref>
 
{{TOC limit}}
 
== History ==
The ReLU was first used by [[Alston Scott Householder|Alston Householder]] in 1941 as a mathematical abstraction of biological neural networks.<ref>{{Cite journal |last=Householder |first=Alston S. |date=June 1941 |title=A theory of steady-state activity in nerve-fiber networks: I. Definitions and preliminary lemmas |url=http://link.springer.com/10.1007/BF02478220 |journal=The Bulletin of Mathematical Biophysics |language=en |volume=3 |issue=2 |pages=63–69 |doi=10.1007/BF02478220 |issn=0007-4985|url-access=subscription }}</ref>
 
[[Kunihiko Fukushima]] in 1969 used ReLU in the context of visual feature extraction in hierarchical neural networks.<ref>{{cite journal |last1=Fukushima |first1=K. |date=1969 |title=Visual feature extraction by a multilayered network of analog threshold elements |journal=IEEE Transactions on Systems Science and Cybernetics |volume=5 |issue=4 |pages=322–333 |doi=10.1109/TSSC.1969.300225}}</ref><ref>{{cite book |last1=Fukushima |first1=K. |title=Competition and Cooperation in Neural Nets |last2=Miyake |first2=S. |date=1982 |publisher=Springer |isbn=978-3-540-11574-8 |series=Lecture Notes in Biomathematics |volume=45 |pages=267–285 |chapter=Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition |doi=10.1007/978-3-642-46466-9_18}}</ref> Thirty years later, Hahnloser et al. argued that ReLU approximates the biological relationship between neural firing rates and input current, in addition to enabling recurrent neural network dynamics to stabilise under weaker criteria.<ref>{{cite journal |last1=Hahnloser |first1=R. |last2=Sarpeshkar |first2=R. |last3=Mahowald |first3=M. A. |last4=Douglas |first4=R. J. |last5=Seung |first5=H. S. |year=2000 |title=Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit |journal=[[Nature (journal)|Nature]] |volume=405 |issue=6789 |pages=947–951 |bibcode=2000Natur.405..947H |doi=10.1038/35016072 |pmid=10879535 |s2cid=4399014}}</ref><ref>{{Cite journal |last1=Hahnloser |first1=Richard |last2=Seung |first2=H. Sebastian |date=2000 |title=Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks |url=https://proceedings.neurips.cc/paper/2000/hash/c8cbd669cfb2f016574e9d147092b5bb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=13}}</ref>

Rectifying activation functions were also used to separate specific excitation and unspecific inhibition in the neural abstraction pyramid (2003), which was trained in a supervised way to learn several computer vision tasks.<ref name=NeuralAbstractionPyramid>{{cite book|last=Behnke|first=Sven|year=2003|title=Hierarchical Neural Networks for Image Interpretation|url=https://www.researchgate.net/publication/220688219_Hierarchical_Neural_Networks_for_Image_Interpretation|series=Lecture Notes in Computer Science|volume=2766|publisher=Springer|doi=10.1007/b11963}}</ref>

Prior to 2010, most activation functions used were the [[Logistic function|logistic sigmoid]] (which is inspired by [[probability theory]]; see [[logistic regression]]) and its more numerically efficient<ref>{{cite encyclopedia |year=1998 |title=Efficient BackProp |encyclopedia=Neural Networks: Tricks of the Trade |publisher=Springer |url=http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf |editor1=G. Orr |author2=[[Leon Bottou]] |author3=Genevieve B. Orr |author4=[[Klaus-Robert Müller]] |author=[[Yann LeCun]] |editor2=K. Müller |access-date=2012-12-07 |archive-date=2018-08-31 |archive-url=https://web.archive.org/web/20180831075352/http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf |url-status=dead }}</ref> counterpart, the [[hyperbolic tangent]]. Around 2010, the use of ReLU became common again.
 
Jarrett et al. (2009) noted that rectification by either [[Absolute value|absolute]] or ReLU (which they called "positive part") was critical for object recognition in convolutional neural networks (CNNs), specifically because it allows [[Pooling layer#Average pooling|average pooling]] without neighboring filter outputs cancelling each other out. They hypothesized that the use of sigmoid or tanh was responsible for poor performance in previous CNNs.<ref>{{Cite book |last1=Jarrett |first1=Kevin |last2=Kavukcuoglu |first2=Koray |last3=Ranzato |first3=Marc'Aurelio |last4=LeCun |first4=Yann |chapter=What is the best multi-stage architecture for object recognition? |date=September 2009 |title=2009 IEEE 12th International Conference on Computer Vision |pages=2146–2153 |doi=10.1109/ICCV.2009.5459469|isbn=978-1-4244-4420-5 }}</ref>
 
Nair and Hinton (2010) made a theoretical argument that the [[softplus]] activation function should be used, in that the softplus function numerically approximates the sum of an exponential number of linear models that share parameters. They then proposed ReLU as a good approximation to it. Specifically, they began by considering a single binary neuron in a [[Boltzmann machine]] that takes <math>x</math> as input, and produces 1 as output with probability <math>\sigma(x) = \frac{1}{1 + e^{-x}}</math>. They then considered extending its range of output by making infinitely many copies of it <math>X_1, X_2, X_3, \dots</math>, that all take the same input, offset by an amount <math>0.5, 1.5, 2.5, \dots</math>, then their outputs are added together as <math>\sum_{i=1}^\infty X_i</math>. They then demonstrated that <math>\sum_{i=1}^\infty X_i</math> is approximately equal to <math>\mathcal N(\log(1+e^x), \sigma(x))</math>, which is also approximately equal to <math>\operatorname{ReLU}(\mathcal N( x, \sigma(x)))</math>, where <math>\mathcal N</math> stands for the [[Normal distribution|gaussian distribution]].
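
The expected value of this infinite sum can be checked numerically. The following sketch is an illustration of the argument rather than code from the paper; it uses NumPy and truncates the sum to finitely many copies.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

x = 2.0
# Expected total output of many shared-weight binary units whose inputs
# are offset by 0.5, 1.5, 2.5, ... (truncated to 100 copies here).
offsets = np.arange(100) + 0.5
expected_sum = sigmoid(x - offsets).sum()

print(expected_sum)   # ~2.12
print(softplus(x))    # ~2.13, close to the expected sum
print(max(0.0, x))    # 2.0 -- ReLU, which approximates softplus away from zero
</syntaxhighlight>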
 
They also argued for another reason to use ReLU: it provides "intensity equivariance" in image recognition, i.e., multiplying the input image by a constant <math>k</math> multiplies the output by <math>k</math> as well. This does not hold for other activation functions such as sigmoid or tanh. They found that ReLU activation gave good empirical performance in [[restricted Boltzmann machine]]s.<ref name=":0">Nair, Vinod, and Geoffrey E. Hinton. "[https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf Rectified linear units improve restricted boltzmann machines]." ''Proceedings of the 27th international conference on machine learning (ICML-10)''. 2010.</ref>
 
Glorot et al. (2011) argued that ReLU has the following advantages over sigmoid or tanh:

* ReLU is more similar to biological neurons' responses in their main operating regime.
* ReLU avoids vanishing gradients.
* ReLU is cheaper to compute.
* ReLU creates sparse representations naturally, because many hidden units output exactly zero for a given input.
 
They also found empirically that deep networks trained with ReLU can achieve strong performance ''without'' unsupervised pre-training, especially on large, purely supervised tasks.<ref name="Yoshua Bengio-2011" />
 
== Advantages ==
 
Advantages of ReLU include:
 
* [[Sparse matrix|Sparse]] activation: for example, in a [[Weight initialization|randomly initialized]] network, only about 50% of [[Hidden layer|hidden units]] are activated (i.e. have a non-zero output).
* Better gradient propagation: fewer [[Vanishing gradient problem|vanishing gradient]] problems compared to sigmoidal activation functions that saturate in both directions.<ref name="Yoshua Bengio-2011" />
* Efficiency: only requires comparison and addition.
* Scale-invariant ([[Homogeneous function|homogeneous]], or "intensity equivariance"<ref name=":0" />):
: <math>\max(0, ax) = a \max(0, x) \text{ for } a \geq 0</math>.
 
== Potential problems ==
 
Possible downsides can include:
 
* Non-differentiability at zero (however, it is differentiable anywhere else, and the value of the [[derivative]] at zero can be chosen to be 0 or 1 arbitrarily).
* Not zero-centered: ReLU outputs are always non-negative. This can make it harder for the network to learn during backpropagation, because gradient updates tend to push weights in one direction (positive or negative). [[Batch normalization]] can help address this.{{Citation needed|date=April 2024}}
* ReLU is unbounded.
* Redundancy of the parametrization: Because ReLU is scale-invariant, the network computes the exact same function by scaling the weights and biases in front of a ReLU activation by <math>k</math>, and the weights after by <math>1/k</math>.<ref name="Yoshua Bengio-2011" />
* {{anchor|Dying ReLU}}Dying ReLU: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state (it "dies"). This is a form of the [[vanishing gradient problem]]. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity and potentially even halting the learning process. This problem typically arises when the learning rate is set too high. It may be mitigated by using "leaky" ReLU instead, where a small positive slope is assigned for <math>x<0</math>. However, depending on the task, performance may be reduced.
 
== Variants ==
 
=== Piecewise-linear variants ===
 
'''Leaky ReLU''' (2014) allows a small, positive gradient when the unit is inactive,<ref name="Andrew L"/> helping to mitigate the vanishing gradient problem. This gradient is defined by a parameter <math>\alpha</math>, typically set to 0.01–0.3.<ref>{{Cite web |url = https://pytorch.org/docs/stable/generated/torch.nn.LeakyReLU.html |title = PyTorch Leaky ReLU docs}}</ref><ref>{{Cite web |url = https://www.tensorflow.org/api_docs/python/tf/keras/layers/LeakyReLU |title = TensorFlow Leaky ReLU docs}}</ref>
 
: <math>f(x) = \begin{cases}
x & x > 0, \\
\alpha x & x \le 0,
\end{cases} \qquad
f'(x) = \begin{cases}
1 & x > 0, \\
\alpha & x \le 0.
\end{cases}</math>
The same function can also be expressed without the piecewise notation as:
: <math> f(x) = \frac{1+\alpha}{2} x+\frac{1-\alpha}{2} |x| </math>

'''Parametric ReLU''' ('''PReLU''', 2015) takes this idea further by making <math>\alpha</math> a learnable parameter along with the other network parameters.<ref name="He-2015">{{cite arXiv |eprint=1502.01852 |last1=He |first1=Kaiming |title=Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |class=cs.CV |year=2015}}</ref>

Note that for <math>\alpha \le 1</math>, this is equivalent to
: <math>f(x) = \max(x, \alpha x)</math>
and thus has a relation to "maxout" networks.<ref name="He-2015"/>
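
A minimal NumPy sketch of both variants follows. Here <math>\alpha</math> is passed as an ordinary argument; in an actual network the PReLU coefficient would be a trainable parameter.

<syntaxhighlight lang="python">
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x > 0, slope alpha for x <= 0."""
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    """Parametric ReLU: same form, but alpha is learned during training."""
    return np.maximum(x, alpha * x)   # valid for alpha <= 1

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))        # ~[-0.03, -0.01, 0, 2]
print(prelu(x, alpha=0.2))  # ~[-0.6, -0.2, 0, 2]
</syntaxhighlight>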
 
'''Concatenated ReLU (CReLU, 2016)''' preserves positive and negative phase information by returning two values:<ref>{{Cite journal |last1=Shang |first1=Wenling |last2=Sohn |first2=Kihyuk |last3=Almeida |first3=Diogo |last4=Lee |first4=Honglak |date=2016-06-11 |title=Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units |url=https://proceedings.mlr.press/v48/shang16.html |journal=Proceedings of the 33rd International Conference on Machine Learning |language=en |publisher=PMLR |pages=2217–2225|arxiv=1603.05201 }}</ref>
 
: <math>f(x) = [\operatorname{ReLU}(x), \operatorname{ReLU}(-x)].</math>

=== Noisy ReLU ===
Rectified linear units can be extended to include [[Gaussian noise]], making them noisy ReLUs:<ref name=":0" />

: <math>f(x) = \max(0, x + Y), \qquad Y \sim \mathcal{N}(0, \sigma(x))</math>

Noisy ReLUs have been used with some success in [[restricted Boltzmann machine]]s for computer vision tasks.<ref name=":0" />

=== Smooth variants ===
 
====Softplus====
{{main|Softplus}}
[[File:Softplus.svg|thumb|320x320px|Plot of the softplus function and the [[ramp function]]]]
A smooth approximation to the rectifier is the [[analytic function]]
 
: <math>f(x) = \ln(1 + e^x),\qquad
f'(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}</math>
 
which is called the ''softplus'' (2000)<ref>{{Cite journal |last1=Dugas |first1=Charles
|last2=Bengio |first2=Yoshua
|last3=Bélisle |first3=François
|last4=Nadeau |first4=Claude
|last5=Garcia |first5=René
|date=2000-01-01 |title=Incorporating second-order functional knowledge for better option pricing
|url=http://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf
|journal=Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS'00)
|publisher=MIT Press
|pages=451–457
|quote=Since the sigmoid ''h'' has a positive first derivative, its primitive, which we call softplus, is convex.
}}</ref><ref name="Yoshua Bengio-2011" /> or ''SmoothReLU'' function.<ref>{{Cite web |url=https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-FAC73B9B-A597-4F7D-A5C4-46707E4A92A0.htm
|title=Smooth Rectifier Linear Unit (SmoothReLU) Forward Layer
|date=2017
|website=Developer Guide for Intel Data Analytics Acceleration Library
|language=en-US
|access-date=2018-12-04
}}</ref> For large negative <math>x</math> it is roughly <math>\ln 1</math>, so just above 0, while for large positive <math>x</math> it is roughly <math>\ln(e^x)</math>, so just above <math>x</math>.
 
This function can be approximated as:
 
: <math>\ln\left(1 + e^x \right) \approx \begin{cases} \ln2, & x=0,\\[6pt] \frac x {1-e^{-x/\ln2}}, & x\neq 0 \end{cases}</math>
 
By making the change of variables <math>x = y\ln(2)</math>, this is equivalent to
 
: <math>\log_2(1 + 2^y) \approx \begin{cases} 1,& y=0,\\[6pt] \frac{y}{1-e^{-y}}, & y\neq 0\end{cases}</math>
 
A sharpness parameter <math>k</math> may be included:
 
: <math>f(x) = \frac{\ln(1 + e^{kx})} k, \qquad f'(x) = \frac{e^{kx}}{1 + e^{kx}} = \frac{1}{1 + e^{-kx}}</math>
 
The derivative of softplus is the [[logistic function]]. This in turn can be viewed as a smooth approximation of the derivative of the rectifier, the [[Heaviside step function]].
 
The multivariable generalization of single-variable softplus is the [[LogSumExp]] with the first argument set to zero:
 
: <math>\operatorname{LSE_0}^+(x_1, \dots, x_n) := \operatorname{LSE}(0, x_1, \dots, x_n) = \ln(1 + e^{x_1} + \cdots + e^{x_n})</math>
 
The LogSumExp function is
 
: <math>\operatorname{LSE}(x_1, \dots, x_n) = \ln(e^{x_1} + \cdots + e^{x_n})</math>
 
and its gradient is the [[softmax function|softmax]]; the softmax with the first argument set to zero is the multivariable generalization of the logistic function. Both LogSumExp and softmax are used in machine learning.
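
In floating-point arithmetic the naive formula <math>\ln(1 + e^x)</math> overflows for large <math>x</math>, so implementations typically evaluate softplus through a log-sum-exp primitive. A short NumPy sketch (the helper names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def softplus_naive(x):
    return np.log(1.0 + np.exp(x))   # overflows for large x

def softplus_stable(x):
    # softplus(x) = LSE(0, x) = log(e^0 + e^x), evaluated stably
    return np.logaddexp(0.0, x)

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(softplus_stable(x))      # ~[0, 0.3133, 0.6931, 1.3133, 1000]
print(softplus_naive(1000.0))  # inf (with an overflow warning)
</syntaxhighlight>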
 
==== ELU ====
Exponential linear units (2015) smoothly allow negative values. This is an attempt to make the mean activations closer to zero, which speeds up learning. It has been shown that ELUs can obtain higher classification accuracy than ReLUs.<ref>{{Cite arXiv |eprint=1511.07289 |last1=Clevert |first1=Djork-Arné |title=Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) |last2=Unterthiner |first2=Thomas |last3=Hochreiter |first3=Sepp |class=cs.LG |year=2015}}</ref>
 
: <math>f(x) = \begin{cases}
x & x > 0, \\
\alpha \left(e^x - 1\right) & x \le 0
\end{cases} \qquad
f'(x) = \begin{cases}
1 & x > 0, \\
\alpha e^x & x \le 0
\end{cases}</math>
 
In these formulas, <math>\alpha</math> is a [[Hyperparameter (machine learning)|hyperparameter]] to be tuned with the constraint <math>\alpha \geq 0</math>.
 
Given the same interpretation of <math>\alpha</math>, ELU can be viewed as a smoothed version of a shifted ReLU (SReLU), which has the form <math>f(x) = \max(- \alpha, x)</math>.
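
A NumPy sketch of ELU alongside the shifted ReLU it smooths (again with <math>\alpha</math> as a plain argument rather than a tuned hyperparameter):

<syntaxhighlight lang="python">
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha*(e^x - 1) otherwise."""
    return np.where(x > 0, x, alpha * np.expm1(x))

def shifted_relu(x, alpha=1.0):
    """SReLU: max(-alpha, x); ELU is a smoothed version of this."""
    return np.maximum(-alpha, x)

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))           # ~[-0.9933, -0.6321, 0, 2]
print(shifted_relu(x))  # [-1, -1, 0, 2]
</syntaxhighlight>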
 
==== Gaussian-error linear unit (GELU) ====
GELU (2016) is a smooth approximation to the rectifier:
 
: <math>f(x) = x \Phi(x),</math>
: <math>f'(x) = x \Phi'(x) + \Phi(x)</math>
where <math>\Phi(x) = P(X \leqslant x)</math> is the [[cumulative distribution function]] of the standard [[normal distribution]].
 
This activation function is illustrated in the figure at the start of this article. It is non-monotonic: it has a "bump", with a negative derivative for sufficiently negative ''x''. It serves as the default activation for many transformer models such as [[BERT (language model)|BERT]].<ref name="Hendrycks-2016">{{Cite arXiv |eprint = 1606.08415 |title = Gaussian Error Linear Units (GELUs) |last1 = Hendrycks |first1 = Dan |last2 = Gimpel |first2 = Kevin |year = 2016 |class = cs.LG}}</ref>
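
GELU can be evaluated exactly through the Gaussian CDF, or via the tanh-based approximation proposed in the same paper. A sketch of both, assuming SciPy is available for the error function (the constant 0.044715 is the value quoted by Hendrycks and Gimpel):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import erf  # standard error function

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation from the original GELU paper."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu_exact(x))  # ~[-0.0455, -0.1543, 0, 0.3457, 1.9545]
print(gelu_tanh(x))   # very close to the exact values
</syntaxhighlight>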
 
==== SiLU ====
{{main|Swish function}}
[[File:Swish.svg|thumb|The swish function]]
The SiLU (sigmoid linear unit) or [[swish function]]<ref name="Diganta Misra-2019" /> is another smooth approximation which uses the [[logistic function|sigmoid (logistic) function]], first introduced in the 2016 GELU paper:<ref name="Hendrycks-2016" />
 
: <math>f(x) = x \cdot \operatorname{sigmoid}(x),</math>
 
: <math>f'(x) = x \cdot \operatorname{sigmoid}'(x) + \operatorname{sigmoid}(x)</math>
 
It is cheaper to calculate than GELU. It also has a "bump".
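
A minimal NumPy sketch of SiLU (illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def silu(x):
    """SiLU / swish: x multiplied by the logistic sigmoid of x."""
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(silu(x))  # ~[-0.2384, 0, 1.7616]
</syntaxhighlight>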
 
==== Mish ====
The mish function (2019) can also be used as a smooth approximation of the rectifier.<ref name="Diganta Misra-2019">{{citation |url=https://www.bmvc2020-conference.com/assets/papers/0928.pdf |title=Mish: A Self Regularized Non-Monotonic Activation Function |author1=Diganta Misra |arxiv=1908.08681v1 |date=23 Aug 2019 |access-date=26 March 2022}}.</ref> It is defined as
 
: <math>f(x) = x \tanh\big(\operatorname{softplus}(x)\big),</math>
 
where <math>\tanh(x)</math> is the [[hyperbolic tangent]], and <math>\operatorname{softplus}(x)</math> is the [[softplus]] function.
 
Mish was obtained by experimenting with functions similar to Swish (SiLU, see above). It is non-monotonic (has a "bump") like Swish. The main new feature is that it exhibits a "self-regularizing" behavior attributed to a term in its first derivative.<ref name="Diganta Misra-2019"/><ref name="Shaw-2020">{{Cite web |last=Shaw |first=Sweta |date=2020-05-10 |title=Activation Functions Compared with Experiments |url=https://wandb.ai/shweta/Activation%20Functions/reports/Activation-Functions-Compared-with-Experiments--VmlldzoxMDQwOTQ |access-date=2022-07-11 |website=W&B |language=en}}</ref>
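
A minimal NumPy sketch of mish, reusing a numerically stable softplus (illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def mish(x):
    """Mish: x * tanh(softplus(x)), with softplus computed stably."""
    return x * np.tanh(np.logaddexp(0.0, x))

x = np.array([-2.0, 0.0, 2.0])
print(mish(x))  # ~[-0.2525, 0, 1.9440]
</syntaxhighlight>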
 
==== Squareplus ====
 
Squareplus (2021)<ref>{{cite arXiv |last=Barron |first=Jonathan T. |eprint=2112.11687 |title=Squareplus: A Softplus-Like Algebraic Rectifier |class=cs.NE |date=22 December 2021}}</ref> is the function
:<math>f(x) = \frac{x + \sqrt{x^2 + b}}{2}</math>
where <math>b \geq 0</math> is a hyperparameter that determines the "size" of the curved region near <math>x = 0</math>. (For example, letting <math>b = 0</math> yields ReLU, and letting <math>b = 4</math> yields the [[metallic mean]] function.)
Squareplus shares many properties with softplus: It is [[monotonic function|monotonic]], strictly [[positive (mathematics)|positive]], approaches 0 as <math>x \to -\infty</math>, approaches the identity as <math>x \to +\infty</math>, and is <math>C^\infty</math> [[smooth function|smooth]]. However, squareplus can be computed using only [[algebraic functions]], making it well-suited for settings where computational resources or instruction sets are limited. Additionally, squareplus requires no special consideration to ensure numerical stability when <math>x</math> is large.
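
A minimal NumPy sketch, including the limiting case <math>b = 0</math>, which recovers ReLU exactly:

<syntaxhighlight lang="python">
import numpy as np

def squareplus(x, b=4.0):
    """Squareplus: an algebraic, softplus-like rectifier (b >= 0)."""
    return (x + np.sqrt(x * x + b)) / 2.0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(squareplus(x, b=4.0))  # ~[0.099, 0.618, 1, 1.618, 10.099], smooth near 0
print(squareplus(x, b=0.0))  # [0, 0, 0, 1, 10] -- identical to ReLU
</syntaxhighlight>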
 
==== DELU ====
ExtendeD Exponential Linear Unit (DELU, 2023) is an activation function that is smoother in the neighborhood of zero and sharper for larger values, which is intended to allow a better allocation of neurons during learning. It has been reported to obtain higher classification accuracy than ReLU and ELU on some tasks.<ref>{{cite journal |last1=Çatalbaş |first1=Burak |last2=Morgül |first2=Ömer |title=Deep learning with ExtendeD Exponential Linear Unit (DELU) |journal=Neural Computing and Applications |date=16 August 2023 |volume=35 |issue=30 |pages=22705–22724 |doi=10.1007/s00521-023-08932-z |url=https://link.springer.com/article/10.1007/s00521-023-08932-z |access-date=20 April 2025|url-access=subscription }}</ref>
 
: <math>f(x) = \begin{cases}
x & x > x_c, \\
(e^{ax} - 1)/b & x \le x_c
\end{cases} \qquad
f'(x) = \begin{cases}
1 & x > x_c, \\
(a / b) e^{ax} & x \le x_c
\end{cases}</math>
 
In these formulas, <math>a</math>, <math>b</math> and <math>x_c</math> are [[Hyperparameter (machine learning)|hyperparameter]]s that can be set to the default values <math>a = 1</math>, <math>b = 2</math> and <math>x_c = 1.25643</math>, as in the original work.
 
==See also==
*[[Sigmoid function]]
*[[Tobit model]]
*[[Layer (deep learning)]]
 
==References==
{{Reflist|30em}}
 
{{Artificial intelligence navbox}}
 
[[Category:Artificial neural networks]]