Autoencoder: Difference between revisions

OAbot (talk | contribs)
m Open access bot: doi added to citation with #oabot.
 
(199 intermediate revisions by 97 users not shown)
{{Short description|Neural network that learns efficient data encoding in an unsupervised manner}}
{{Distinguish|Autocoder|Autocode}}
{{Use dmy dates|date=March 2020|cs1-dates=y}}
[[File:Autoencoder schema.png|thumb|upright=1.15|A schema of an ''autoencoder''. An autoencoder has two main parts: an ''encoder'' that maps the message to a code, and a ''decoder'' that reconstructs the message from the code.]]
{{Machine learning bar}}
{{Machine learning|Artificial neural network}}
An '''autoencoder''' is a type of [[artificial neural network]] used to learn [[Feature learning|efficient data codings]] in an [[unsupervised learning|unsupervised]] manner.<ref>{{cite journal|doi=10.1002/aic.690370209|title=Nonlinear principal component analysis using autoassociative neural networks|journal=AIChE Journal|volume=37|issue=2|pages=233–243|date=1991|last1=Kramer|first1=Mark A.|url= https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf}}</ref> The aim of an autoencoder is to learn a [[Feature learning|representation]] (encoding) for a set of data, typically for [[dimensionality reduction]], by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. 
Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties.<ref name=":0" /> Examples are the regularized autoencoders (''Sparse'', ''Denoising'' and ''Contractive'' autoencoders), proven effective in learning representations for subsequent classification tasks,<ref name=":4" /> and ''Variational'' autoencoders, with their recent applications as generative models.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392}}</ref> Autoencoders are effectively used for solving many applied problems, from [[face recognition]]<ref>Hinton GE, Krizhevsky A, Wang SD. [http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf Transforming auto-encoders.] In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). Springer, Berlin, Heidelberg.</ref> to acquiring the semantic meaning of words.<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie}}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref>
 
An '''autoencoder''' is a type of [[artificial neural network]] used to learn [[Feature learning|efficient codings]] of unlabeled data ([[unsupervised learning]]). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for [[dimensionality reduction]], to generate lower-dimensional embeddings for subsequent use by other [[machine learning]] algorithms.<ref>{{Cite book|last1=Bank |first1=Dor |last2=Koenigstein |first2=Noam |last3=Giryes |first3=Raja |year=2023 |chapter=Autoencoders |editor-last1=Rokach |editor-first1=Lior |editor-last2=Maimon |editor-first2=Oded |editor-last3=Shmueli |editor-first3=Erez |title=Machine learning for data science handbook |chapter-url=https://link.springer.com/chapter/10.1007/978-3-031-24628-9_16 |language=en |pages=353–374 |doi=10.1007/978-3-031-24628-9_16|isbn=978-3-031-24627-2 }}</ref>
 
Variants exist which aim to make the learned representations assume useful properties.<ref name=":0" /> Examples are regularized autoencoders (''sparse'', ''denoising'' and ''contractive'' autoencoders), which are effective in learning representations for subsequent [[Statistical classification|classification]] tasks,<ref name=":4" /> and [[Variational autoencoder|''variational'' autoencoders]], which can be used as [[generative model]]s.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392|s2cid=174802445}}</ref> Autoencoders are applied to many problems, including [[Facial recognition system|facial recognition]],<ref>Hinton GE, Krizhevsky A, Wang SD. [http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf Transforming auto-encoders.] In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). 
Springer, Berlin, Heidelberg.</ref> [[Feature (computer vision)|feature detection]],<ref name=":2">{{Cite book|last=Géron|first=Aurélien|title=Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow|publisher=O’Reilly Media, Inc.|year=2019|location=Canada|pages=739–740}}</ref> [[anomaly detection]], and [[Word embedding|learning the meaning of words]].<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie|url=http://ntur.lib.ntu.edu.tw//handle/246246/155195 }}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref> In terms of [[Synthetic data|data synthesis]], autoencoders can also be used to randomly generate new data that is similar to the input (training) data.<ref name=":2" />
 
{{Toclimit|3}}
 
== Mathematical principles ==
An ''autoencoder'' is a [[neural network]] that learns to copy its input to its output. It has an internal (''hidden'') layer that describes a ''code'' used to represent the input, and it is constituted by two main parts: an encoder that maps the input into the code, and a decoder that maps the code to a reconstruction of the original input.
 
=== Definition ===
Performing the copying task perfectly would simply duplicate the signal, and this is why autoencoders usually are restricted in ways that force them to reconstruct the input approximately, preserving only the most relevant aspects of the data in the copy.
An autoencoder is defined by the following components: <blockquote>Two sets: the space of decoded messages <math>\mathcal X</math>; the space of encoded messages <math>\mathcal Z</math>. Typically <math>\mathcal X</math> and <math>\mathcal Z</math> are [[Euclidean space]]s, that is, <math>\mathcal X = \R^m, \mathcal Z = \R^n</math> with <math>m > n.</math> </blockquote><blockquote>Two [[parameterization|parametrized]] families of functions: the encoder family <math>E_\phi:\mathcal{X} \rightarrow \mathcal{Z}</math>, parametrized by <math>\phi</math>; the decoder family <math>D_\theta:\mathcal{Z} \rightarrow \mathcal{X}</math>, parametrized by <math>\theta</math>.</blockquote>For any <math>x\in \mathcal X</math>, we usually write <math>z = E_\phi(x)</math>, and refer to it as the code, the [[latent variable]], latent representation, latent vector, etc. Conversely, for any <math>z\in \mathcal Z</math>, we usually write <math>x' = D_\theta(z)</math>, and refer to it as the (decoded) message.
 
Usually, both the encoder and the decoder are defined as [[multilayer perceptron]]s (MLPs). For example, a one-layer-MLP encoder <math>E_\phi</math> is:
The idea of autoencoders has been popular in the field of neural networks for decades, and the first applications date back to the '80s.<ref name=":0" /><ref>{{Cite journal|last=Schmidhuber|first=Jürgen|date=January 2015|title=Deep learning in neural networks: An overview|journal=Neural Networks|volume=61|pages=85–117|doi=10.1016/j.neunet.2014.09.003|pmid=25462637|arxiv=1404.7828}}</ref><ref>Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In ''Advances in neural information processing systems 6'' (pp. 3-10).</ref> Their most traditional application was [[dimensionality reduction]] or [[feature learning]], but more recently the autoencoder concept has become more widely used for learning [[generative model]]s of data.<ref name="VAE">{{cite arxiv|eprint=1312.6114|author1=Diederik P Kingma|title=Auto-Encoding Variational Bayes|last2=Welling|first2=Max|class=stat.ML|date=2013}}</ref><ref name="gan_faces">Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 {{url|http://torch.ch/blog/2015/11/13/gan.html}}</ref> Some of the most powerful [[Artificial intelligence|AIs]] in the 2010s involved sparse autoencoders stacked inside of [[Deep learning|deep]] neural networks.<ref name="domingos">{{cite book|title=The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World|title-link=The Master Algorithm|last1=Domingos|first1=Pedro|publisher=Basic Books|date=2015|isbn=978-046506192-1|at="Deeper into the Brain" subsection|chapter=4|author-link=Pedro Domingos}}</ref>
 
:<math>E_\phi(\mathbf x) = \sigma(Wx+b)</math>
==Basic Architecture==
[[File:Autoencoder schema.png|thumb|Schema of a basic Autoencoder]]The simplest form of an autoencoder is a [[feedforward neural network|feedforward]], non-[[recurrent neural network]] similar to single layer perceptrons that participate in [[multilayer perceptron]]s (MLP) – having an input layer, an output layer and one or more hidden layers connecting them – where the output layer has the same number of nodes (neurons) as the input layer, and with the purpose of reconstructing its inputs (minimizing the difference between the input and the output) instead of predicting the target value <math>Y</math> given inputs <math>X</math>. Therefore, autoencoders are [[unsupervised learning]] models (do not require labeled inputs to enable learning).
 
where <math>\sigma</math> is an element-wise [[activation function]], <math>W</math> is a "weight" matrix, and <math>b</math> is a "bias" vector.
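
The one-layer encoder above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the logistic sigmoid is one possible choice of <math>\sigma</math>, and the dimensions, random weights, and the names <code>sigmoid</code> and <code>encoder</code> are hypothetical, not taken from the cited sources.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic activation (an assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    """One-layer MLP encoder E_phi(x) = sigma(W x + b), with phi = (W, b)."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
m, n = 8, 3                    # message dimension m and code dimension n, with m > n
W = rng.normal(size=(n, m))    # "weight" matrix, randomly initialized
b = np.zeros(n)                # "bias" vector
x = rng.normal(size=m)         # an arbitrary message in R^m
z = encoder(x, W, b)           # its code in R^n
print(z.shape)
```

A decoder of the same form would map the code back to <math>\R^m</math> with its own weight matrix of shape <code>(m, n)</code>.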
An autoencoder consists of two parts, the encoder and the decoder, which can be defined as transitions <math>\phi</math> and <math>\psi,</math> such that:
 
=== Training an autoencoder ===
:<math>\phi:\mathcal{X} \rightarrow \mathcal{F}</math>
An autoencoder, by itself, is simply a tuple of two functions. To judge its ''quality'', we need a ''task''. A task is defined by a reference probability distribution <math>\mu_{ref}</math> over <math>\mathcal X</math>, and a "reconstruction quality" function <math>d: \mathcal X \times \mathcal X \to [0, \infty]</math>, such that <math>d(x, x')</math> measures how much <math>x'</math> differs from <math>x</math>.
:<math>\psi:\mathcal{F} \rightarrow \mathcal{X}</math>
:<math>\phi,\psi = \underset{\phi,\psi}{\operatorname{arg\,min}}\, \|X-(\psi \circ \phi) X\|^2</math>
 
With those, we can define the loss function for the autoencoder as <math display="block">L(\theta, \phi) := \mathbb E_{x\sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]</math>The ''optimal'' autoencoder for the given task <math>(\mu_{ref}, d)</math> is then <math>\arg\min_{\theta, \phi}L(\theta, \phi)</math>. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by [[gradient descent]]. This search process is referred to as "training the autoencoder".
In the simplest case, given one hidden layer, the encoder stage of an autoencoder takes the input <math>\mathbf{x} \in \mathbb{R}^d = \mathcal{X}</math> and maps it to <math>\mathbf{h} \in \mathbb{R}^p = \mathcal{F}</math>:
 
In most situations, the reference distribution is just the [[Empirical measure|empirical distribution]] given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that<math display="block">\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math>
:<math>\mathbf{h} = \sigma(\mathbf{Wx}+\mathbf{b})</math>
 
where <math>\delta_{x_i}</math> is the [[Dirac measure]], the quality function is just L2 loss: <math>d(x, x') = \|x - x'\|_2^2</math>, and <math>\|\cdot\|_2</math> is the [[Norm (mathematics)#Euclidean norm|Euclidean norm]]. Then the problem of searching for the optimal autoencoder is just a [[Least squares|least-squares]] optimization:<math display="block">\min_{\theta, \phi} L(\theta, \phi),\qquad \text{where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2</math>
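
As a rough sketch of this least-squares problem, the following NumPy code minimizes the empirical L2 loss by plain gradient descent. It assumes a purely linear encoder and decoder without biases, which is a simplification chosen for brevity; the function names, the synthetic data, the learning rate and the step count are all hypothetical.

```python
import numpy as np

def loss(X, We, Wd):
    """Empirical L2 loss (1/N) * sum_i ||x_i - D(E(x_i))||_2^2 for linear E and D."""
    R = X @ We.T @ Wd.T - X              # reconstruction residuals, one row per x_i
    return np.mean(np.sum(R ** 2, axis=1))

def train(X, n, steps=500, lr=0.01, seed=0):
    """Gradient descent on the loss above; returns encoder and decoder matrices."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    We = 0.1 * rng.normal(size=(n, m))   # encoder parameters (phi)
    Wd = 0.1 * rng.normal(size=(m, n))   # decoder parameters (theta)
    for _ in range(steps):
        Z = X @ We.T                     # codes, shape (N, n)
        R = Z @ Wd.T - X                 # residuals, shape (N, m)
        grad_Wd = (2.0 / N) * R.T @ Z            # gradient of the loss w.r.t. Wd
        grad_We = (2.0 / N) * Wd.T @ R.T @ X     # gradient of the loss w.r.t. We
        Wd -= lr * grad_Wd
        We -= lr * grad_We
    return We, Wd

# Synthetic data lying in a 2-dimensional subspace of R^5 (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))

We0, Wd0 = train(X, n=2, steps=0)        # untrained parameters
We, Wd = train(X, n=2)                   # trained parameters
print(loss(X, We0, Wd0), loss(X, We, Wd))
```

In practice the gradients are usually obtained by automatic differentiation rather than written by hand, and nonlinear encoders and decoders are used.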
This image <math>\mathbf{h}</math> is usually referred to as ''code'', ''latent variables'', or ''latent representation''. Here, <math>\sigma</math> is an element-wise [[activation function]] such as a [[sigmoid function]] or a [[rectified linear unit]]. <math>\mathbf{W}</math> is a weight matrix and <math>\mathbf{b}</math> is a bias vector. Weights and biases are usually initialized randomly, and then updated iteratively during training through [[Backpropagation]]. After that, the decoder stage of the autoencoder maps <math>\mathbf{h}</math> to the reconstruction <math>\mathbf{x'}</math> of the same shape as <math>\mathbf{x}</math>:
 
=== Interpretation ===
:<math>\mathbf{x'} = \sigma'(\mathbf{W'h}+\mathbf{b'})</math>
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>.
 
The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space <math>\mathcal Z</math> usually has fewer dimensions than the message space <math>\mathcal{X}</math>.
where <math>\mathbf{\sigma'}, \mathbf{W'}, \text{ and }\mathbf{b'}</math> for the decoder may be unrelated to the corresponding <math>\mathbf{\sigma}, \mathbf{W}, \text{ and } \mathbf{b}</math> for the encoder.
 
Such an autoencoder is called ''undercomplete''. It can be interpreted as [[Data compression|compressing]] the message, or [[Dimensionality reduction|reducing its dimensionality]].<ref name=":12">{{cite journal |last1=Kramer |first1=Mark A. |date=1991 |title=Nonlinear principal component analysis using autoassociative neural networks |url=https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf |journal=AIChE Journal |volume=37 |issue=2 |pages=233–243 |bibcode=1991AIChE..37..233K |doi=10.1002/aic.690370209}}</ref><ref name=":7" />
Autoencoders are trained to minimise reconstruction errors (such as [[Mean squared error|squared errors]]), often referred to as the "[[Loss function|loss]]":
 
At the limit of an ideal undercomplete autoencoder, every possible code <math>z</math> in the code space is used to encode a message <math>x</math> that really appears in the distribution <math>\mu_{ref}</math>, and the decoder is also perfect: <math>D_\theta(E_\phi(x)) = x</math>. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code <math>z</math> and obtaining <math>D_\theta(z)</math>, which is a message that really appears in the distribution <math>\mu_{ref}</math>.
:<math>\mathcal{L}(\mathbf{x},\mathbf{x'})=\|\mathbf{x}-\mathbf{x'}\|^2=\|\mathbf{x}-\sigma'(\mathbf{W'}(\sigma(\mathbf{Wx}+\mathbf{b}))+\mathbf{b'})\|^2</math>
 
If the code space <math>\mathcal Z</math> has dimension larger than (''overcomplete''), or equal to, the message space <math>\mathcal{X}</math>, or the hidden units are given enough capacity, an autoencoder can learn the [[identity function]] and become useless. However, experimental results found that overcomplete autoencoders might still [[feature learning|learn useful features]].<ref name="bengio">{{Cite journal |last1=Bengio |first1=Y. |date=2009 |title=Learning Deep Architectures for AI |url=http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf |journal=Foundations and Trends in Machine Learning |volume=2 |issue=8 |pages=1795–7 |citeseerx=10.1.1.701.9550 |doi=10.1561/2200000006 |pmid=23946944|s2cid=207178999 }}</ref>
where <math>\mathbf{x}</math> is usually averaged over some input training set.
 
In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.<ref name=":0" />
As mentioned before, the training of an autoencoder is performed through [[Backpropagation|Backpropagation of the error]], just like a regular [[feedforward neural network]].
 
Should the [[feature (machine learning)|feature space]] <math>\mathcal{F}</math> have lower dimensionality than the input space <math>\mathcal{X}</math>, the feature vector <math>\phi(x)</math> can be regarded as a [[Data compression|compressed]] representation of the input <math>x</math>. This is the case of ''undercomplete'' autoencoders. If the hidden layers are larger than (''overcomplete autoencoders''), or equal to, the input layer, or the hidden units are given enough capacity, an autoencoder can potentially learn the [[identity function]] and become useless. However, experimental results have shown that autoencoders might still [[feature learning|learn useful features]] in these cases.<ref name="bengio">{{Cite journal|last1=Bengio|first1=Y.|date=2009|title=Learning Deep Architectures for AI|url=http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf|journal=Foundations and Trends in Machine Learning|volume=2|issue=8|pages=1795–7|doi=10.1561/2200000006|pmc=|pmid=23946944|citeseerx=10.1.1.701.9550}}</ref> In the ideal setting, one should be able to tailor the code dimension and the model capacity on the basis of the complexity of the data distribution to be modeled. One way to do so is to exploit the model variants known as ''Regularized Autoencoders''.<ref name=":0" />
 
==Variations==
 
===Variational autoencoder (VAE)===
=== Regularized Autoencoders ===
[[File:VAE Basic.png|thumb|300x300px|The basic scheme of a variational autoencoder. The model receives <math>x</math> as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces <math>{x'}</math> as similar as possible to <math>x</math>.]]
Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.
{{Main|Variational autoencoder}}
 
[[Variational autoencoder]]s (VAEs) belong to the families of [[variational Bayesian methods]]. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors.
====Sparse autoencoder (SAE)====
[[File:Autoencoder sparso.png|thumb|Simple schema of a single-layer sparse autoencoder. The hidden nodes in bright yellow are activated, while the light yellow ones are inactive. The activation depends on the input.]]
Recently, it has been observed that when [[Representation learning|representations]] are learnt in a way that encourages sparsity, improved performance is obtained on classification tasks.<ref name=":5">{{Cite journal|last1=Frey|first1=Brendan|last2=Makhzani|first2=Alireza|date=2013-12-19|title=k-Sparse Autoencoders|arxiv=1312.5663|bibcode=2013arXiv1312.5663M}}</ref> Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at once.<ref name="domingos" /> This sparsity constraint forces the model to respond to the unique statistical features of the input data used for training.
 
Given an input dataset <math>x</math> characterized by an unknown probability function <math>P(x)</math> and a multivariate latent encoding vector <math>z</math>, the objective is to model the data as a distribution <math>p_\theta(x)</math>, with <math>\theta</math> defined as the set of the network parameters so that <math>p_\theta(x) = \int_{z}p_\theta(x,z)dz </math>.
Specifically, a sparse autoencoder is an autoencoder whose training criterion involves a sparsity penalty <math>\Omega(\boldsymbol h)</math> on the code layer <math>\boldsymbol h</math>.
 
===Sparse autoencoder (SAE)===
<math>\mathcal{L}(\mathbf{x},\mathbf{x'}) + \Omega(\boldsymbol h)</math>
Inspired by the [[sparse coding]] hypothesis in neuroscience, ''sparse autoencoders'' (SAE) are variants of autoencoders, such that the codes <math>E_\phi(x)</math> for messages tend to be ''sparse codes'', that is, <math>E_\phi(x)</math> is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time.<ref name="domingos">{{cite book |last1=Domingos |first1=Pedro |author-link=Pedro Domingos |title=The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World |title-link=The Master Algorithm |date=2015 |publisher=Basic Books |isbn=978-046506192-1 |at="Deeper into the Brain" subsection |chapter=4}}</ref> Encouraging sparsity improves performance on classification tasks.<ref name=":1" /> [[File:Autoencoder sparso.png|thumb|Simple schema of a single-layer sparse autoencoder. The hidden nodes in bright yellow are activated, while the light yellow ones are inactive. The activation depends on the input.]]
There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the '''k-sparse autoencoder'''.<ref name=":1">{{cite arXiv |eprint=1312.5663 |class=cs.LG |first1=Alireza |last1=Makhzani |first2=Brendan |last2=Frey |title=K-Sparse Autoencoders |date=2013}}</ref>
 
The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:<math display="block">f_k(x_1, ..., x_n) = (x_1 b_1, ..., x_n b_n)</math>where <math>b_i = 1</math> if <math>|x_i|</math> ranks in the top k, and 0 otherwise.
Recalling that <math>\boldsymbol h=f(\boldsymbol W \boldsymbol x + \boldsymbol b)</math>, the penalty encourages the model to activate (i.e. output value close to 1) some specific areas of the network on the basis of the input data, while forcing all other neurons to be inactive (i.e. to have an output value close to 0).<ref name=":6" />
 
Backpropagating through <math>f_k</math> is simple: set gradient to 0 for <math>b_i = 0</math> entries, and keep gradient for <math>b_i=1</math> entries. This is essentially a generalized [[Rectifier (neural networks)|ReLU]] function.<ref name=":1" />
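
A minimal NumPy sketch of the k-sparse function and its backward pass, following the definition above. The function names are hypothetical, and ties between equal magnitudes are broken arbitrarily by the sort, which the definition leaves unspecified.

```python
import numpy as np

def k_sparse(x, k):
    """Return f_k(x) and the mask b, where b_i = 1 if |x_i| ranks in the top k."""
    x = np.asarray(x, dtype=float)
    b = np.zeros_like(x)
    top = np.argsort(-np.abs(x))[:k]   # indices of the k largest magnitudes
    b[top] = 1.0
    return x * b, b

def k_sparse_backward(grad_out, b):
    """Keep the gradient where b_i = 1 and set it to 0 where b_i = 0."""
    return grad_out * b

out, b = k_sparse([0.5, -2.0, 0.1, 1.5], k=2)
print(out)   # only the entries -2.0 and 1.5 survive
print(k_sparse_backward(np.ones(4), b))
```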
This sparsity of activation can be achieved by formulating the penalty terms in different ways.
 
The other way is a [[Relaxation (approximation)|relaxed version]] of the k-sparse autoencoder. Instead of forcing sparsity, we add a '''sparsity regularization loss''', then optimize for<math display="block">\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{\text{sparse}} (\theta, \phi)</math>where <math>\lambda > 0</math> measures how much sparsity we want to enforce.<ref name=":6" />
* One way to do this is to exploit the [[Kullback–Leibler divergence|Kullback-Leibler (KL) divergence]].<ref name=":5" /><ref name=":6">Ng, A. (2011). Sparse autoencoder. ''CS294A Lecture notes'', ''72''(2011), 1-19.</ref><ref>{{Cite journal|last1=Nair|first1=Vinod|last2=Hinton|first2=Geoffrey E.|date=2009|title=3D Object Recognition with Deep Belief Nets|url=http://dl.acm.org/citation.cfm?id=2984093.2984244|journal=Proceedings of the 22nd International Conference on Neural Information Processing Systems|series=NIPS'09|location=USA|publisher=Curran Associates Inc.|pages=1339–1347|isbn=9781615679119}}</ref><ref>{{Cite journal|last1=Zeng|first1=Nianyin|last2=Zhang|first2=Hong|last3=Song|first3=Baoye|last4=Liu|first4=Weibo|last5=Li|first5=Yurong|last6=Dobaie|first6=Abdullah M.|date=2018-01-17|title=Facial expression recognition via learning deep sparse autoencoders|journal=Neurocomputing|volume=273|pages=643–649|doi=10.1016/j.neucom.2017.08.043|issn=0925-2312}}</ref> Let
 
Let the autoencoder architecture have <math>K</math> layers. To define a sparsity regularization loss, we need a "desired" sparsity <math>\hat \rho_k</math> for each layer, a weight <math>w_k</math> for how much to enforce each sparsity, and a function <math>s: [0, 1]\times [0, 1] \to [0, \infty]</math> to measure how much two sparsities differ.
<math>\hat{\rho_j} = \frac{1}{m}\sum_{i=1}^{m}[h_j(x_i)]</math>
 
For each input <math>x</math>, let the actual sparsity of activation in each layer <math>k</math> be<math display="block">\rho_k(x) = \frac 1n \sum_{i=1}^n a_{k, i}(x)</math>where <math>a_{k, i}(x)</math> is the activation in the <math>i</math>-th neuron of the <math>k</math>-th layer upon input <math>x</math>.
be the average activation of the hidden unit <math>j</math> (averaged over the <math>m</math> training examples). Note that the notation <math>h_j(x_i)</math> makes explicit what the input affecting the activation was, i.e. it identifies which input value the activation is a function of. To encourage most of the neurons to be inactive, we would like <math>\hat{\rho_j}</math> to be as close to 0 as possible. Therefore, this method enforces the constraint <math>\hat{\rho_j} = \rho </math> where <math>\rho </math> is the sparsity parameter, a value close to zero, leading the activation of the hidden units to be mostly zero as well. The penalty term <math>\Omega(\boldsymbol h)</math> will then take a form that penalizes <math>\hat{\rho_j}</math> for deviating significantly from <math>\rho</math>, exploiting the KL divergence:
 
The sparsity loss upon input <math>x</math> for one layer is <math>s(\hat\rho_k, \rho_k(x))</math>, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:<math display="block">L_{\text{sparse}}(\theta, \phi) = \mathbb E_{x\sim\mu_{\text{ref}}}\left[\sum_{k\in 1:K} w_k s(\hat\rho_k, \rho_k(x)) \right]</math>Typically, the function <math>s</math> is either the [[Kullback–Leibler divergence|Kullback-Leibler (KL) divergence]], as<ref name=":1" /><ref name=":6">Ng, A. (2011). [https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf Sparse autoencoder]. ''CS294A Lecture notes'', ''72''(2011), 1-19.</ref><ref>{{Cite journal|last1=Nair|first1=Vinod|last2=Hinton|first2=Geoffrey E.|date=2009|title=3D Object Recognition with Deep Belief Nets|url=http://dl.acm.org/citation.cfm?id=2984093.2984244|journal=Proceedings of the 22nd International Conference on Neural Information Processing Systems|series=NIPS'09|location=USA|publisher=Curran Associates Inc.|pages=1339–1347|isbn=9781615679119}}</ref><ref>{{Cite journal|last1=Zeng|first1=Nianyin|last2=Zhang|first2=Hong|last3=Song|first3=Baoye|last4=Liu|first4=Weibo|last5=Li|first5=Yurong|last6=Dobaie|first6=Abdullah M.|date=2018-01-17|title=Facial expression recognition via learning deep sparse autoencoders|journal=Neurocomputing|volume=273|pages=643–649|doi=10.1016/j.neucom.2017.08.043|issn=0925-2312}}</ref>
<math>\sum_{j=1}^{s}KL(\rho || \hat{\rho_j}) = \sum_{j=1}^{s}[\rho \log \frac{\rho}{\hat{\rho_j}}+(1- \rho)\log \frac{1-\rho}{1-\hat{\rho_j}}]</math> where <math>j</math> is summing over the <math>s</math> hidden nodes in the hidden layer, and <math>KL(\rho || \hat{\rho_j}) </math> is the KL-divergence between a Bernoulli random variable with mean <math>\rho</math> and a Bernoulli random variable with mean <math>\hat{\rho_j}</math>.<ref name=":6" />
 
::<math>s(\rho, \hat\rho) = KL(\rho || \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}}+(1- \rho)\log \frac{1-\rho}{1-\hat{\rho}}</math>
or the L1 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|</math>, or the L2 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|^2</math>.

* Another way to achieve sparsity in the activation of the hidden units is to apply L1 or L2 regularization terms on the activation, scaled by a certain parameter <math>\lambda</math>.<ref>{{cite arxiv |eprint=1505.05561|last1=Arpit|first1=Devansh|last2=Zhou|first2=Yingbo|last3=Ngo|first3=Hung|last4=Govindaraju|first4=Venu|title=Why Regularized Auto-Encoders learn Sparse Representation?|class=stat.ML|date=2015}}</ref> For instance, in the case of L1 the [[loss function]] would become

<math>\mathcal{L}(\mathbf{x},\mathbf{x'}) + \lambda \sum_i |h_i|</math>

Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as <math display="block">L_{\text{sparse}}(\theta, \phi) = \mathbb E_{x\sim\mu_{\text{ref}}}\left[\sum_{k\in 1:K} w_k \|h_k\|\right]</math>where <math>h_k</math> is the activation vector in the <math>k</math>-th layer of the autoencoder. The norm <math>\|\cdot\|</math> is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).
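
The sparsity losses discussed in this section can be sketched numerically as follows. This is an illustration only: the function names, the example activations and the target sparsity of 0.05 are hypothetical, and the KL form assumes both arguments lie strictly between 0 and 1.

```python
import numpy as np

def mean_activation(a):
    """Actual sparsity of one layer: the average of its activations for one input."""
    return float(np.mean(a))

def kl_sparsity(target, actual):
    """KL divergence between Bernoulli variables with means `target` and `actual`."""
    return (target * np.log(target / actual)
            + (1 - target) * np.log((1 - target) / (1 - actual)))

def l1_sparsity(target, actual):
    """L1 loss between desired and actual sparsity."""
    return abs(target - actual)

def l2_sparsity(target, actual):
    """L2 loss between desired and actual sparsity."""
    return (target - actual) ** 2

def l1_norm_penalty(h):
    """L1 norm of an activation vector, for the variant without a desired sparsity."""
    return float(np.sum(np.abs(h)))

a = np.array([0.9, 0.1, 0.0, 0.2])   # hypothetical activations of one layer
rho = mean_activation(a)             # actual sparsity of this layer
print(kl_sparsity(0.05, rho), l1_sparsity(0.05, rho), l1_norm_penalty(a))
```

The KL loss is zero exactly when the actual sparsity equals the target and grows as the two diverge, whereas the norm penalty always pushes activations toward zero.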
 
===Denoising autoencoder (DAE)===
* A further proposed strategy to force sparsity in the model is that of manually zeroing all but the strongest hidden unit activations (''[[k-sparse autoencoder]]'').<ref name=":1">{{cite arxiv |eprint=1312.5663|last1=Makhzani|first1=Alireza|last2=Frey|first2=Brendan|title=K-Sparse Autoencoders|class=cs.LG|date=2013}}</ref> The k-sparse autoencoder is based on a linear autoencoder (i.e. with linear activation function) and tied weights. The identification of the strongest activations can be achieved by sorting the activities and keeping only the first ''k'' values, or by using ReLU hidden units with thresholds that are adaptively adjusted until the k largest activities are identified. This selection acts like the previously mentioned regularization terms in that it prevents the model from reconstructing the input using too many neurons.<ref name=":1" />
 
[[File:Denoising-autoencoder.png|thumb|A schema of a denoising autoencoder]]
''Denoising autoencoders'' (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" /><ref name=":4" />
 
A DAE, originally called a "robust autoassociative network" by Mark A. Kramer,<ref name=":13">{{Cite journal |last=Kramer |first=M. A. |date=1992-04-01 |title=Autoassociative neural networks |url=https://dx.doi.org/10.1016/0098-1354%2892%2980051-A |journal=Computers & Chemical Engineering |series=Neural network applications in chemical engineering |language=en |volume=16 |issue=4 |pages=313–328 |doi=10.1016/0098-1354(92)80051-A |issn=0098-1354|url-access=subscription }}</ref> is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution <math>\mu_T</math> over functions <math>T:\mathcal X \to \mathcal X</math>. That is, the function <math>T</math> takes a message <math>x\in \mathcal X</math>, and corrupts it to a noisy version <math>T(x)</math>. The function <math>T</math> is selected randomly, according to the probability distribution <math>\mu_T</math>.
Unlike sparse or undercomplete autoencoders, which constrain the representation, [[Denoising autoencoders|denoising autoencoders]] (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" />
 
Given a task <math>(\mu_{\text{ref}}, d)</math>, the problem of training a DAE is the optimization problem:<math display="block">\min_{\theta, \phi}L(\theta, \phi) = \mathbb E_{x\sim \mu_{\text{ref}}, T\sim\mu_T}[d(x, (D_\theta\circ E_\phi \circ T)(x))]</math>That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, hence the name "denoising".
Indeed, DAEs take a partially '''corrupted input''' and are trained to recover the original ''undistorted input''. In practice, the objective of denoising autoencoders is that of cleaning the corrupted input, or ''denoising''. Two underlying assumptions are inherent to this approach:
 
Usually, the noise process <math>T</math> is applied only during training and testing, not during downstream use.
*Higher level representations are relatively stable and robust to the corruption of the input;
* To perform denoising well, the model needs to extract features that capture useful structure in the distribution of the input.<ref name=":4">{{Cite journal|last1=Vincent|first1=Pascal|last2=Larochelle|first2=Hugo|date=2010|title=Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion|url=|journal=Journal of Machine Learning Research|volume=11|pages=3371–3408|via=}}</ref>
 
The use of DAE depends on two assumptions:
In other words, denoising is advocated as a training criterion for learning to extract useful features that will constitute better higher level representations of the input.<ref name=":4" />
* There exist representations to the messages that are relatively stable and robust to the type of noise we are likely to encounter;
* The said representations capture structures in the input distribution that are useful for our purposes.<ref name=":4">{{Cite journal|last1=Vincent|first1=Pascal|last2=Larochelle|first2=Hugo|date=2010|title=Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion|journal=Journal of Machine Learning Research|volume=11|pages=3371–3408}}</ref>
 
Example noise processes include:
The training process of a DAE works as follows:
 
* additive isotropic [[Additive white Gaussian noise|Gaussian noise]],
* The initial input <math>x</math> is corrupted into <math>\boldsymbol \tilde{x}</math> through stochastic mapping <math>\boldsymbol \tilde{x}\thicksim q_{D}(\boldsymbol \tilde{x}|\boldsymbol x)</math>.
* masking noise (a fraction of the input is randomly chosen and set to 0),
* The corrupted input <math>\boldsymbol \tilde{x}</math> is then mapped to a hidden representation by the same process as in the standard autoencoder, <math>\boldsymbol h=f_{\theta}(\boldsymbol \tilde{x})=s(\boldsymbol W\boldsymbol\tilde{x}+\boldsymbol b)</math>.
* salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).<ref name=":4" />
* From the hidden representation the model reconstructs <math>\boldsymbol z=g_{\theta'}(\boldsymbol h)</math>.<ref name=":4" />
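The three example noise processes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the default noise levels, and the choice to corrupt each entry independently with probability <code>fraction</code> (rather than corrupting an exact fraction of entries) are all hypothetical and not taken from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(x, sigma=0.1):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def masking_noise(x, fraction=0.3):
    """Set each entry to 0 independently with probability `fraction`."""
    mask = rng.random(x.shape) < fraction
    return np.where(mask, 0.0, x)

def salt_and_pepper(x, fraction=0.3, lo=0.0, hi=1.0):
    """Set each chosen entry to the minimum or maximum value with equal probability."""
    corrupt = rng.random(x.shape) < fraction
    salt = rng.random(x.shape) < 0.5
    return np.where(corrupt, np.where(salt, hi, lo), x)

# A small batch of made-up inputs with values in [0, 1).
x = rng.random((4, 8))
x_gauss = gaussian_noise(x)
x_masked = masking_noise(x)
x_sp = salt_and_pepper(x)
```

During training, a fresh corrupted copy would be drawn each time an example is presented, while the reconstruction target remains the clean <code>x</code>.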
 
=== Contractive autoencoder (CAE) ===
The model's parameters <math>\theta</math> and <math>\theta'</math> are trained to minimize the average reconstruction error over the training data, specifically, minimizing the difference between <math>\boldsymbol z</math> and the original uncorrupted input <math>\boldsymbol x</math>.<ref name=":4" /> Note that each time a random example <math>\boldsymbol x</math> is presented to the model, a new corrupted version is generated stochastically on the basis of <math>q_{D}(\boldsymbol \tilde{x}|\boldsymbol x)</math>.
A ''contractive autoencoder'' (CAE) adds the contractive regularization loss to the standard autoencoder loss:<math display="block">\min_{\theta, \phi}L(\theta, \phi) + \lambda L_{\text{cont}} (\theta, \phi)</math>where <math>\lambda > 0</math> controls how strongly contractiveness is enforced. The contractive regularization loss itself is defined as the expected squared [[Frobenius norm]] of the [[Jacobian matrix and determinant|Jacobian matrix]] of the encoder activations with respect to the input:<math display="block">L_{\text{cont}}(\theta, \phi) = \mathbb E_{x\sim \mu_{\text{ref}}} \|\nabla_x E_\phi(x) \|_F^2</math>To understand what <math>L_{\text{cont}}</math> measures, note that, to first order in <math>\delta x</math>,<math display="block">\|E_\phi(x + \delta x) - E_\phi(x)\|_2 \leq \|\nabla_x E_\phi(x) \|_F \|\delta x\|_2</math>for any message <math>x\in \mathcal X</math> and small variation <math>\delta x</math> in it. Thus, if <math>\|\nabla_x E_\phi(x) \|_F^2</math> is small, a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means that a small variation in the message leads to a small, perhaps even zero, variation in its code, just as two pictures may look the same even if they are not exactly the same.
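For a concrete case, the penalty can be written in closed form when the encoder is a single sigmoid layer. The sketch below assumes a hypothetical encoder <math>E_\phi(x) = s(Wx + b)</math> with random made-up weights; its Jacobian is <math>\operatorname{diag}(h(1-h))\,W</math>, and the closed form is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder(x, W, b):
    """Assumed single-layer sigmoid encoder h = s(W x + b)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian at x (closed form)."""
    h = encoder(x, W, b)
    # The Jacobian is diag(h * (1 - h)) @ W, so its squared Frobenius norm
    # factorizes into a per-unit gain times the squared row norms of W.
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_penalty(x, W, b, eps=1e-6):
    """Same quantity estimated with central finite differences."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((encoder(x + e, W, b) - encoder(x - e, W, b)) / (2 * eps))
    J = np.stack(cols, axis=1)
    return np.sum(J ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # made-up weights: 5 inputs, 3 code units
b = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(x, W, b)
numeric = numerical_penalty(x, W, b)
```

In training, this quantity would be averaged over the data and added to the reconstruction loss with weight <math>\lambda</math>.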
 
The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.
The above-mentioned training process could be developed with any kind of corruption process. Some examples might be ''additive isotropic Gaussian noise, Masking noise'' (a fraction of the input chosen at random for each example is forced to 0) or ''Salt-and-pepper noise'' (a fraction of the input chosen at random for each example is set to its minimum or maximum value with uniform probability).<ref name=":4" />
 
=== Minimum description length autoencoder (MDL-AE) ===
Finally, notice that the corruption of the input is performed only during the training phase of the DAE. Once the model has learnt the optimal parameters, in order to extract the representations from the original data no corruption is added.
A ''minimum description length autoencoder'' (MDL-AE) is an advanced variation of the traditional autoencoder, which leverages principles from information theory, specifically the [[Minimum description length|Minimum Description Length (MDL) principle]]. The MDL principle posits that the best model for a dataset is the one that provides the shortest combined encoding of the model and the data. In the context of [[autoencoders]], this principle is applied to ensure that the learned representation is not only compact but also interpretable and efficient for reconstruction.
 
The MDL-AE seeks to minimize the total description length of the data, which includes the size of the [[latent representation]] (code length) and the error in reconstructing the original data. The objective can be expressed as
==== Contractive autoencoder (CAE) ====
<math>L_{\text{code}} + L_{\text{error}}</math>, where <math>L_{\text{code}}</math> represents the length of the compressed latent representation and <math>L_{\text{error}}</math> denotes the reconstruction error.<ref name=":5">{{Cite journal |last1=Hinton |first1=Geoffrey E |last2=Zemel |first2=Richard |date=1993 |title=Autoencoders, Minimum Description Length and Helmholtz Free Energy |url=https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=6}}</ref>
Contractive autoencoders add an explicit regularizer to their objective function that forces the model to learn a function that is robust to slight variations of input values. This regularizer corresponds to the squared [[Frobenius norm]] of the [[Jacobian matrix and determinant|Jacobian matrix]] of the encoder activations with respect to the input. Since the penalty is applied to training examples only, this term forces the model to learn useful information about the training distribution. The final objective function has the following form:
:<math>\mathcal{L}(\mathbf{x},\mathbf{x'}) + \lambda \sum_i \| \nabla_x h_i \|^2</math>
 
=== Concrete autoencoder (CAE) ===
The name contractive comes from the fact that the CAE is encouraged to map a neighborhood of input points to a smaller neighborhood of output points.<ref name=":0" />
The ''concrete autoencoder'' is designed for discrete feature selection.<ref>{{cite arXiv|last1=Abid|first1=Abubakar|last2=Balin|first2=Muhammad Fatih|last3=Zou|first3=James|date=2019-01-27|title=Concrete Autoencoders for Differentiable Feature Selection and Reconstruction|eprint=1901.09346|class=cs.LG}}</ref> A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous [[Relaxation (approximation)|relaxation]] of the [[categorical distribution]] to allow gradients to pass through the feature selector layer, which makes it possible to use standard [[backpropagation]] to learn an optimal subset of input features that minimize reconstruction loss.
 
==Advantages of depth==
There is a connection between the denoising autoencoder (DAE) and the contractive autoencoder (CAE): in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized perturbations of the input, while CAEs make the extracted features resist infinitesimal perturbations of the input.
[[File:Autoencoder_structure.png|350x350px|Schematic structure of an autoencoder with 3 fully connected hidden layers. The code (z, or h for reference in the text) is the most internal layer.|thumb]]
Autoencoders are often trained with a single-layer encoder and a single-layer decoder, but using many-layered (deep) encoders and decoders offers many advantages.<ref name=":0" />
 
* Depth can exponentially reduce the computational cost of representing some functions.
===Variational autoencoder (VAE)===
* Depth can exponentially decrease the amount of training data needed to learn some functions.
{{Split section |Variational autoencoder |discuss={{TALKPAGENAME}}#Split proposed |date=May 2020}}
* Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.<ref name=":7" />
 
=== Training ===
Unlike classical (sparse, denoising, etc.) autoencoders, variational autoencoders (VAEs) are [[generative model]]s, like [[Generative adversarial network|generative adversarial networks]].<ref name=":2">An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. ''Special Lecture on IE'', ''2''(1).</ref> Their association with autoencoders derives mainly from their architectural affinity with the basic autoencoder (the final training objective has an encoder and a decoder), but their mathematical formulation differs significantly.<ref>{{cite arxiv |eprint=1606.05908|last1=Doersch|first1=Carl|title=Tutorial on Variational Autoencoders|class=stat.ML|date=2016}}</ref> VAEs are [[directed probabilistic graphical model]]s (DPGM) whose posterior is approximated by a neural network, forming an autoencoder-like architecture.<ref name=":2" /><ref name="1bitVAE">{{cite arxiv|eprint=1911.12410|author1=Khobahi, S.|title=Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding|last2=Soltanalian|first2=M.|class=eess.SP|date=2019}}</ref> Unlike discriminative modeling, which aims to learn a predictor given the observation, ''generative modeling'' tries to simulate how the data is generated, in order to understand the underlying causal relations. Causal relations have great potential to generalize.<ref name=":11" />
[[Geoffrey Hinton]] developed the [[deep belief network]] technique for training many-layered deep autoencoders. His method involves treating each neighboring set of two layers as a [[restricted Boltzmann machine]] so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.<ref name=":7">{{cite journal|last1=Hinton|first1=G. E.|last2=Salakhutdinov|first2=R.R.|title=Reducing the Dimensionality of Data with Neural Networks|journal=Science|date=28 July 2006|volume=313|issue=5786|pages=504–507|doi=10.1126/science.1127647|pmid=16873662|bibcode=2006Sci...313..504H|s2cid=1658773}}</ref>
 
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref>
Variational autoencoder models make strong assumptions concerning the distribution of ''latent variables''. They use a [[Variational Bayesian methods|variational approach]] for latent representation learning, which results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator.<ref name="VAE" /> The approach assumes that the data is generated by a directed [[graphical model]] <math>p_\theta(\mathbf{x}|\mathbf{h})</math> and that the encoder learns an approximation <math>q_{\phi}(\mathbf{h}|\mathbf{x})</math> to the [[Posterior probability|posterior distribution]] <math>p_{\theta}(\mathbf{h}|\mathbf{x})</math>, where <math>\mathbf{\phi}</math> and <math>\mathbf{\theta}</math> denote the parameters of the encoder (recognition model) and decoder (generative model) respectively. The probability distribution of the latent vector of a VAE typically matches that of the training data much more closely than that of a standard autoencoder. The objective of a VAE has the following form:
 
== History ==
:<math>\mathcal{L}(\mathbf{\phi},\mathbf{\theta},\mathbf{x})=D_{\mathrm{KL}}(q_{\phi}(\mathbf{h}|\mathbf{x})\Vert p_{\theta}(\mathbf{h}))-\mathbb{E}_{q_{\phi}(\mathbf{h}|\mathbf{x})}\big(\log p_{\theta}(\mathbf{x}|\mathbf{h})\big)</math>
(Oja, 1982)<ref>{{Cite journal |last=Oja |first=Erkki |date=1982-11-01 |title=Simplified neuron model as a principal component analyzer |url=https://link.springer.com/article/10.1007/BF00275687 |journal=Journal of Mathematical Biology |language=en |volume=15 |issue=3 |pages=267–273 |doi=10.1007/BF00275687 |pmid=7153672 |issn=1432-1416|url-access=subscription }}</ref> noted that [[Principal component analysis|PCA]] is equivalent to a neural network with one hidden layer with identity activation function. In the language of autoencoding, the input-to-hidden module is the encoder, and the hidden-to-output module is the decoder. Subsequently, (Baldi and Hornik, 1989)<ref name="auto">{{Cite journal |last1=Baldi |first1=Pierre |last2=Hornik |first2=Kurt |date=1989-01-01 |title=Neural networks and principal component analysis: Learning from examples without local minima |url=https://www.sciencedirect.com/science/article/abs/pii/0893608089900142 |journal=Neural Networks |volume=2 |issue=1 |pages=53–58 |doi=10.1016/0893-6080(89)90014-2 |issn=0893-6080|url-access=subscription }}</ref> and (Kramer, 1991)<ref name=":12" /> generalized PCA to autoencoders, a technique which they termed "nonlinear PCA".
 
Immediately after the resurgence of neural networks in the 1980s, it was suggested in 1986<ref>{{Cite book |last1=Rumelhart |first1=David E. |url=https://direct.mit.edu/books/book/4424/Parallel-Distributed-ProcessingExplorations-in-the |title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations |last2=McClelland |first2=James L. |last3=AU |date=1986 |publisher=The MIT Press |isbn=978-0-262-29140-8 |language=en |chapter=2. A General Framework for Parallel Distributed Processing |doi=10.7551/mitpress/5236.001.0001}}</ref> that a neural network be put in "auto-association mode". This was then implemented in (Harrison, 1987)<ref>Harrison TD (1987) A Connectionist framework for continuous speech recognition. Cambridge University Ph. D. dissertation</ref> and (Elman, Zipser, 1988)<ref>{{Cite journal |last1=Elman |first1=Jeffrey L. |last2=Zipser |first2=David |date=1988-04-01 |title=Learning the hidden structure of speech |url=https://pubs.aip.org/jasa/article/83/4/1615/826094/Learning-the-hidden-structure-of-speechLearning |journal=The Journal of the Acoustical Society of America |language=en |volume=83 |issue=4 |pages=1615–1626 |doi=10.1121/1.395916 |pmid=3372872 |bibcode=1988ASAJ...83.1615E |issn=0001-4966|url-access=subscription }}</ref> for speech and in (Cottrell, Munro, Zipser, 1987)<ref>{{Cite journal |last1=Cottrell |first1=Garrison W. |last2=Munro |first2=Paul |last3=Zipser |first3=David |date=1987 |title=Learning Internal Representation From Gray-Scale Images: An Example of Extensional Programming |url=https://escholarship.org/uc/item/2zs7w6z8 |journal=Proceedings of the Annual Meeting of the Cognitive Science Society |language=en |volume=9 }}</ref> for images.<ref name=":14" /> In (Hinton, Salakhutdinov, 2006),<ref name=":72">{{cite journal |last1=Hinton |first1=G. E. |last2=Salakhutdinov |first2=R.R. 
|date=28 July 2006 |title=Reducing the Dimensionality of Data with Neural Networks |journal=Science |volume=313 |issue=5786 |pages=504–507 |bibcode=2006Sci...313..504H |doi=10.1126/science.1127647 |pmid=16873662 |s2cid=1658773}}</ref> [[deep belief network]]s were developed. These train a pair of [[restricted Boltzmann machine]]s as an encoder-decoder pair, then train another pair on the latent representation of the first pair, and so on.<ref name="scholar">{{Cite journal |vauthors=Hinton G |year=2009 |title=Deep belief networks |journal=Scholarpedia |volume=4 |issue=5 |pages=5947 |bibcode=2009SchpJ...4.5947H |doi=10.4249/scholarpedia.5947 |doi-access=free}}</ref>
Here, <math>D_{\mathrm{KL}}</math> stands for the [[Kullback–Leibler divergence]]. The prior over the latent variables is usually set to be the centred isotropic multivariate [[Gaussian function|Gaussian]] <math>p_{\theta}(\mathbf{h})=\mathcal{N}(\mathbf{0,I})</math>; however, alternative configurations have been considered.<ref>{{Cite journal|last1=Partaourides|first1=Harris|last2=Chatzis|first2=Sotirios P.|date=June 2017|title=Asymmetric deep generative models|journal=Neurocomputing|volume=241|pages=90–96|doi=10.1016/j.neucom.2017.02.028|url=https://zenodo.org/record/3452902}}</ref>
 
The first applications of autoencoders date to the early 1990s.<ref name=":0" /><ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=January 2015 |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref><ref name=":5" /> Their most traditional application was [[dimensionality reduction]] or [[feature learning]], but the concept became widely used for learning [[generative model]]s of data.<ref name="VAE">{{cite arXiv |eprint=1312.6114 |class=stat.ML |author1=Diederik P Kingma |first2=Max |last2=Welling |title=Auto-Encoding Variational Bayes |date=2013}}</ref><ref name="gan_faces">Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 {{URL|http://torch.ch/blog/2015/11/13/gan.html}}</ref> Some of the most powerful [[Artificial intelligence|AI]] systems of the early 2020s use autoencoder modules as components, such as the VAE in [[Stable Diffusion]] and the discrete VAE in Transformer-based image generators like [[DALL-E|DALL-E 1]].
Commonly, the shape of the variational and the likelihood distributions are chosen such that they are factorized Gaussians:
 
During the early days, when the terminology was uncertain, the autoencoder was also called identity mapping,<ref name="auto"/><ref name=":12" /> auto-associating,<ref>{{Cite journal |last1=Ackley |first1=D |last2=Hinton |first2=G |last3=Sejnowski |first3=T |date=March 1985 |title=A learning algorithm for boltzmann machines |url=http://doi.wiley.com/10.1016/S0364-0213(85)80012-4 |journal=Cognitive Science |language=en |volume=9 |issue=1 |pages=147–169 |doi=10.1016/S0364-0213(85)80012-4}}</ref> [[self-supervised learning|self-supervised]] [[backpropagation]],<ref name=":12" /> or Diabolo network.<ref>{{Cite journal |last1=Schwenk |first1=Holger |last2=Bengio |first2=Yoshua |date=1997 |title=Training Methods for Adaptive Boosting of Neural Networks |url=https://proceedings.neurips.cc/paper/1997/hash/9cb67ffb59554ab1dabb65bcb370ddd9-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=10}}</ref><ref name="bengio" />
:<math>
\begin{align}
q_{\phi}(\mathbf{h}|\mathbf{x}) &= \mathcal{N}(\boldsymbol{\rho}(\mathbf{x}), \boldsymbol{\omega}^2(\mathbf{x})\mathbf{I}), \\
p_{\theta}(\mathbf{x}|\mathbf{h}) &= \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\sigma}^2(\mathbf{h})\mathbf{I}),
\end{align}
</math>
 
== Applications ==
where <math> \boldsymbol{\rho}(\mathbf{x}) </math> and <math>\boldsymbol{\omega}^2(\mathbf{x}) </math> are the encoder outputs, while <math> \boldsymbol{\mu}(\mathbf{h}) </math> and <math>\boldsymbol{\sigma}^2(\mathbf{h}) </math> are the decoder outputs.
The two main applications of autoencoders are [[dimensionality reduction]] and [[information retrieval]] (or [[Content-addressable memory|associative memory]]),<ref name=":0">{{Cite book|url=http://www.deeplearningbook.org|title=Deep Learning|last1=Goodfellow|first1=Ian|last2=Bengio|first2=Yoshua|last3=Courville|first3=Aaron|publisher=MIT Press|date=2016|isbn=978-0262035613}}</ref> but modern variations have been applied to other tasks.
This choice is justified by the simplifications<ref name="VAE" /> that it produces when evaluating both the KL divergence and the likelihood term in variational objective defined above.
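One of the simplifications is that the KL term of the objective has a closed form when <math>q_{\phi}(\mathbf{h}|\mathbf{x})</math> is a factorized Gaussian and the prior is <math>\mathcal{N}(\mathbf{0,I})</math>. The sketch below illustrates this; the function names and the numerical values are hypothetical, and the encoder outputs <math>\boldsymbol{\rho}</math> and <math>\boldsymbol{\omega}^2</math> are assumed to be given per latent dimension rather than produced by a trained network.

```python
import numpy as np

def kl_to_standard_normal(rho, omega_sq):
    """KL( N(rho, diag(omega_sq)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(rho ** 2 + omega_sq - 1.0 - np.log(omega_sq))

def sample_latent(rho, omega_sq, rng):
    """Draw h = rho + omega * eps with eps ~ N(0, I) (reparameterized sample)."""
    eps = rng.standard_normal(rho.shape)
    return rho + np.sqrt(omega_sq) * eps

rng = np.random.default_rng(0)

# Made-up encoder outputs for a 2-dimensional latent space.
rho = np.array([0.5, -1.0])
omega_sq = np.array([0.25, 2.0])

kl = kl_to_standard_normal(rho, omega_sq)
h = sample_latent(rho, omega_sq, rng)

# The divergence vanishes when the approximate posterior equals the prior.
kl_zero = kl_to_standard_normal(np.zeros(2), np.ones(2))
```

Writing the sample as a deterministic function of <code>rho</code>, <code>omega_sq</code> and independent noise is what allows gradients to flow through the sampling step in the SGVB estimator.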
 
=== Dimensionality reduction ===
VAEs have been criticized because they generate blurry images.<ref name="SigmaVAE2" /> However, researchers employing this model were showing only the mean of the distributions, <math> \boldsymbol{\mu}(\mathbf{h}) </math>, rather than a sample of the learned Gaussian distribution
[[File:PCA vs Linear Autoencoder.png|thumb|Plot of the first two Principal Components (left) and a two-dimension hidden layer of a Linear Autoencoder (Right) applied to the [[Fashion MNIST]] dataset.<ref name=":10">{{Cite web|url=https://github.com/zalandoresearch/fashion-mnist|title=Fashion MNIST|website=[[GitHub]]|date=2019-07-12}}</ref> The two models being both linear learn to span the same subspace. The projection of the data points is indeed identical, apart from rotation of the subspace. While PCA selects a specific orientation up to reflections in the general case, the cost function of a simple autoencoder is invariant to rotations of the latent space.]][[Dimensionality reduction]] was one of the first [[deep learning]] applications.<ref name=":0" />
 
For Hinton's 2006 study,<ref name=":7" /> he pretrained a multi-layer autoencoder with a stack of [[Restricted Boltzmann machine|RBMs]] and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.<ref name=":0" /><ref name=":7" />
:<math> \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\sigma}^2(\mathbf{h})\mathbf{I}) </math>.
 
Reducing dimensions can improve performance on tasks such as classification.<ref name=":0" /> Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other.<ref name=":3">{{Cite journal|last1=Salakhutdinov|first1=Ruslan|last2=Hinton|first2=Geoffrey|date=2009-07-01|title=Semantic hashing|journal=International Journal of Approximate Reasoning|series=Special Section on Graphical Models and Information Retrieval|volume=50|issue=7|pages=969–978|doi=10.1016/j.ijar.2008.11.006|issn=0888-613X|doi-access=free}}</ref>
These samples were shown to be overly noisy due to the choice of a factorized Gaussian distribution.<ref name="SigmaVAE2">{{cite arxiv |eprint=1804.01050|last1=Dorta|first1=Garoe|last2=Vicente|first2=Sara|last3=Agapito|first3=Lourdes|last4=Campbell|first4=Neill D. F.|last5=Simpson|first5=Ivor|title=Training VAEs Under Structured Residuals|class=stat.ML|date=2018}}</ref><ref name="SigmaVAE1">{{cite arxiv |eprint=1802.07079|last1=Dorta|first1=Garoe|last2=Vicente|first2=Sara|last3=Agapito|first3=Lourdes|last4=Campbell|first4=Neill D. F.|last5=Simpson|first5=Ivor|title=Structured Uncertainty Prediction Networks|class=stat.ML|date=2018}}</ref> Employing a Gaussian distribution with a full covariance matrix,
 
==== Principal component analysis ====
:<math>
[[File:Reconstruction autoencoders vs PCA.png|thumb|Reconstruction of 28x28pixel images by an Autoencoder with a code size of two (two-units hidden layer) and the reconstruction from the first two Principal Components of PCA. Images come from the [[Fashion MNIST|Fashion MNIST dataset]].<ref name=":10" />]]
p_{\theta}(\mathbf{x}|\mathbf{h}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\Sigma}(\mathbf{h})),
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to [[principal component analysis]] (PCA).<ref name=":14">{{Cite journal|last1=Bourlard|first1=H.|last2=Kamp|first2=Y.|date=1988|title=Auto-association by multilayer perceptrons and singular value decomposition|journal=Biological Cybernetics|volume=59|issue=4–5|pages=291–294|doi=10.1007/BF00332918|pmid=3196773|s2cid=206775335|url=http://infoscience.epfl.ch/record/82601}}</ref><ref>{{cite book|title=Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14|last1=Chicco|first1=Davide|last2=Sadowski|first2=Peter|last3=Baldi|first3=Pierre|date=2014|isbn=9781450328944|pages=533|chapter=Deep autoencoder neural networks for gene ontology annotation predictions|doi=10.1145/2649387.2649442|hdl=11311/964622|s2cid=207217210|url=http://dl.acm.org/citation.cfm?id=2649442}}</ref> The weights of an autoencoder with a single hidden layer of size <math>p</math> (where <math>p</math> is less than the size of the input) span the same vector subspace as the one spanned by the first <math>p</math> principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the [[singular value decomposition]].<ref>{{cite arXiv|last1=Plaut|first1=E|title=From Principal Subspaces to Principal Components with Linear Autoencoders|eprint=1804.10253|date=2018|class=stat.ML}}</ref>
</math>
 
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.<ref name=":7" />
could solve this issue, but is computationally intractable and numerically unstable, as it requires estimating a covariance matrix from a single data sample. However, later research<ref name="SigmaVAE2" /><ref name="SigmaVAE1" /> showed that a restricted approach where the inverse matrix <math> \boldsymbol{\Sigma}^{-1}(\mathbf{h}) </math> is sparse, could be tractably employed to generate images with high-frequency details.
 
=== Information retrieval ===
Large-scale VAE models have been developed in different domains to represent data in a compact probabilistic latent space. Examples include VQ-VAE<ref>Generating Diverse High-Fidelity Images with VQ-VAE-2 https://arxiv.org/abs/1906.00446</ref> for image generation and Optimus<ref>Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space https://arxiv.org/abs/2004.04092</ref> for language modeling.
[[Information retrieval]] benefits particularly from [[dimensionality reduction]] in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by [[Russ Salakhutdinov|Salakhutdinov]] and Hinton in 2007.<ref name=":3" /> By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a [[hash table]] mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding.
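The lookup scheme can be sketched with an ordinary dictionary. This is a minimal illustration, not the method of the cited paper: the binary codes below are made up rather than produced by a trained autoencoder, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_table(codes):
    """Map each binary code (stored as an int) to the entries that share it."""
    table = defaultdict(list)
    for entry_id, code in codes.items():
        table[code].append(entry_id)
    return table

def neighbours(code, n_bits, radius):
    """Yield the code itself and every code within the given Hamming radius."""
    yield code
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped

def retrieve(table, query, n_bits, radius=1):
    """Return all entries whose code differs from the query in at most `radius` bits."""
    hits = []
    for c in neighbours(query, n_bits, radius):
        hits.extend(table.get(c, []))
    return hits

# Made-up 4-bit codes standing in for autoencoder outputs.
codes = {"doc_a": 0b1010, "doc_b": 0b1011, "doc_c": 0b0101}
table = build_table(codes)

exact = retrieve(table, 0b1010, n_bits=4, radius=0)
near = retrieve(table, 0b1010, n_bits=4, radius=1)
```

With radius 0 only entries with an identical code are returned; increasing the radius trades lookup cost for recall, since the number of probed codes grows combinatorially with the number of flipped bits.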
 
=== Anomaly detection ===
==Advantages of Depth==
Another application for autoencoders is [[anomaly detection]].<ref name=":13" /><ref>{{Cite book |last1=Morales-Forero |first1=A. |last2=Bassetto |first2=S. |title=2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM) |chapter=Case Study: A Semi-Supervised Methodology for Anomaly Detection and Diagnosis |date=December 2019 |___location=Macao, Macao |publisher=IEEE |pages=1031–1037 |doi=10.1109/IEEM44572.2019.8978509 |isbn=978-1-7281-3804-6|s2cid=211027131 }}</ref><ref>{{Cite book |last1=Sakurada |first1=Mayu |last2=Yairi |first2=Takehisa |title=Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis |chapter=Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction |date=December 2014 |chapter-url=http://dl.acm.org/citation.cfm?doid=2689746.2689747 |language=en |___location=Gold Coast, Australia QLD, Australia |publisher=ACM Press |pages=4–11 |doi=10.1145/2689746.2689747 |isbn=978-1-4503-3159-3|s2cid=14613395 }}</ref><ref name=":8">An, J., & Cho, S. (2015). [http://dm.snu.ac.kr/static/docs/TR/SNUDM-TR-2015-03.pdf Variational Autoencoder based Anomaly Detection using Reconstruction Probability]. ''Special Lecture on IE'', ''2'', 1-18.</ref><ref>{{Cite book |last1=Zhou |first1=Chong |last2=Paffenroth |first2=Randy C. 
|title=Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |chapter=Anomaly Detection with Robust Deep Autoencoders |date=2017-08-04 |chapter-url=https://dl.acm.org/doi/10.1145/3097983.3098052 |language=en |publisher=ACM |pages=665–674 |doi=10.1145/3097983.3098052 |isbn=978-1-4503-4887-4|s2cid=207557733 }}</ref><ref>{{Cite journal|doi=10.1016/j.patrec.2017.07.016|title=A study of deep convolutional auto-encoders for anomaly detection in videos|year=2018|last1=Ribeiro|first1=Manassés|last2=Lazzaretti|first2=André Eugênio|last3=Lopes|first3=Heitor Silvério|journal=Pattern Recognition Letters|volume=105|pages=13–22|bibcode=2018PaReL.105...13R}}</ref> By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model's reconstruction performance should worsen. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set so that its contribution to the learned representation could be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data.<ref name=":8" /> Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies.<ref name=":8" />
[[File:Autoencoder_structure.png|350x350px|Schematic structure of an autoencoder with 3 fully connected hidden layers. The code (z, or h for reference in the text) is the most internal layer.|thumb]]
Typically, this means that on a validation set the empirical distribution of reconstruction errors is recorded and then (e.g.) the empirical 95th percentile <math>x_p</math> is taken as the threshold <math>t:=x_p</math> to flag anomalous data points: <math>\text{loss}(x, \text{reconstruction}(x))>t \implies \text{anomaly}</math>. Since the threshold is an empirical [[quantile]] estimate, there is an inherent difficulty with "correctly" setting this threshold: in many cases the distribution of the empirical quantile is asymptotically a normal distribution <math>\text{empirical p-quantile} \sim \mathcal{N}\left(\mu=x_p, \sigma^2=\frac{p( 1 - p )}{n f(x_p)^2}\right),</math> with <math>n</math> the number of validation points and <math>f(x_p)</math> the probability density at the quantile. The variance therefore grows if an extreme quantile is considered (because <math>f(x_p)</math> is small there), so there is potentially a large uncertainty in the right choice of threshold, since it is ''estimated'' from a validation set.

Autoencoders are often trained with only a single layer encoder and a single layer decoder, but using deep encoders and decoders offers many advantages.<ref name=":0" />
 
* Depth can exponentially reduce the computational cost of representing some functions.<ref name=":0" />
* Depth can exponentially decrease the amount of training data needed to learn some functions.<ref name=":0" />
* Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.<ref name=":7" />
 
=== Training Deep Architectures ===
[[Geoffrey Hinton]] developed a pretraining technique for training many-layered deep autoencoders. This method involves treating each neighbouring set of two layers as a [[restricted Boltzmann machine]] so that the pretraining approximates a good solution, then using a backpropagation technique to fine-tune the results.<ref name=":7">{{cite journal|last1=Hinton|first1=G. E.|last2=Salakhutdinov|first2=R.R.|title=Reducing the Dimensionality of Data with Neural Networks|journal=Science|date=28 July 2006|volume=313|issue=5786|pages=504–507|doi=10.1126/science.1127647|pmid=16873662|bibcode=2006Sci...313..504H}}</ref> This model takes the name of [[deep belief network]].
 
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arxiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A study published in 2015 empirically showed that the joint training method not only learns better data models, but also learns more representative features for classification than the layerwise method.<ref name=":9" /> However, their experiments highlighted how the success of joint training for deep autoencoder architectures depends heavily on the regularization strategies adopted in the modern variants of the model.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref>
 
== Applications ==
The two main applications of autoencoders since the 1980s have been ''dimensionality reduction'' and ''information retrieval'',<ref name=":0">{{Cite book|url=http://www.deeplearningbook.org|title=Deep Learning|last1=Goodfellow|first1=Ian|last2=Bengio|first2=Yoshua|last3=Courville|first3=Aaron|publisher=MIT Press|date=2016|isbn=978-0262035613|___location=|pages=}}</ref> but modern variations of the basic model have proven successful when applied to other domains and tasks.
 
=== Dimensionality Reduction ===
[[File:PCA vs Linear Autoencoder.png|thumb|Plot of the first two Principal Components (left) and a two-dimensional hidden layer of a Linear Autoencoder (right) applied to the [[Fashion MNIST dataset]].<ref name=":10">{{Cite web|url=https://github.com/zalandoresearch/fashion-mnist|title=Fashion MNIST|last=|first=|date=2019-07-12|website=|archive-url=|archive-date=|access-date=}}</ref> Both models, being linear, learn to span the same subspace. The projection of the data points is indeed identical, apart from a rotation of the subspace, to which PCA is invariant.]][[Dimensionality reduction|Dimensionality Reduction]] was one of the first applications of [[deep learning]], and one of the early motivations to study autoencoders.<ref name=":0" /> In a nutshell, the objective is to find a projection that maps data from a high-dimensional feature space to a low-dimensional one.<ref name=":0" />
 
One milestone paper on the subject was that of [[Geoffrey Hinton]] with his publication in [[Science (journal)|Science Magazine]] in 2006:<ref name=":7" /> in that study, he pretrained a multi-layer autoencoder with a stack of [[Restricted Boltzmann machine|RBMs]] and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 principal components of a [[Principal component analysis|PCA]], and learned a representation that was qualitatively easier to interpret, clearly separating clusters in the original data.<ref name=":0" /><ref name=":7" />
 
Representing data in a lower-dimensional space can improve performance on different tasks, such as classification.<ref name=":0" /> Indeed, many forms of [[dimensionality reduction]] place semantically related examples near each other,<ref name=":3">{{Cite journal|last1=Salakhutdinov|first1=Ruslan|last2=Hinton|first2=Geoffrey|date=2009-07-01|title=Semantic hashing|journal=International Journal of Approximate Reasoning|series=Special Section on Graphical Models and Information Retrieval|volume=50|issue=7|pages=969–978|doi=10.1016/j.ijar.2008.11.006|issn=0888-613X|doi-access=free}}</ref> aiding generalization.
 
==== Relationship with principal component analysis (PCA) ====
[[File:Reconstruction autoencoders vs PCA.png|thumb|Reconstruction of 28x28pixel images by an Autoencoder with a code size of two (two-units hidden layer) and the reconstruction from the first two Principal Components of PCA. Images come from the [[Fashion MNIST dataset]].<ref name=":10" />]]
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to [[principal component analysis]] (PCA).<ref>{{Cite journal|last1=Bourlard|first1=H.|last2=Kamp|first2=Y.|date=1988|title=Auto-association by multilayer perceptrons and singular value decomposition|journal=Biological Cybernetics|volume=59|issue=4–5|pages=291–294|doi=10.1007/BF00332918|pmc=|pmid=3196773|url=http://infoscience.epfl.ch/record/82601}}</ref><ref>{{cite book|title=Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14|last1=Chicco|first1=Davide|last2=Sadowski|first2=Peter|last3=Baldi|first3=Pierre|date=2014|isbn=9781450328944|pages=533|chapter=Deep autoencoder neural networks for gene ontology annotation predictions|doi=10.1145/2649387.2649442|hdl=11311/964622|url=http://dl.acm.org/citation.cfm?id=2649442}}</ref> The weights of an autoencoder with a single hidden layer of size <math>p</math> (where <math>p</math> is less than the size of the input) span the same vector subspace as the one spanned by the first <math>p</math> principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the [[singular value decomposition]].<ref>{{cite arxiv|last1=Plaut|first1=E|title=From Principal Subspaces to Principal Components with Linear Autoencoders|eprint=1804.10253|date=2018|class=stat.ML}}</ref>
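
The subspace relationship described above can be illustrated numerically. The following is a minimal sketch, not a trained model: it assumes synthetic Gaussian data and builds one optimal linear autoencoder by hand, mixing the top principal directions with an arbitrary invertible matrix <code>A</code> (all variable names are illustrative). It then checks that the reconstruction equals the PCA projection and that the singular value decomposition of the decoder weights recovers the principal subspace, even though the weights themselves are not orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centred data: 200 samples with 5 correlated features (assumption).
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

p = 2  # code size, smaller than the input size

# PCA: the top-p right singular vectors of the centred data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U_p = Vt[:p].T                      # shape (5, p), orthonormal columns

# One hand-built optimal linear autoencoder: the principal directions
# mixed by an arbitrary invertible matrix A (hypothetical choice).
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
W_dec = U_p @ A                     # decoder weights, shape (5, p)
W_enc = np.linalg.inv(A) @ U_p.T    # encoder weights, shape (p, 5)

X_ae = X @ W_enc.T @ W_dec.T        # autoencoder reconstruction
X_pca = X @ U_p @ U_p.T             # orthogonal projection onto the PCA subspace

# The decoder columns are not orthonormal ...
gram = W_dec.T @ W_dec

# ... but their left singular vectors span the principal subspace.
Q, _, _ = np.linalg.svd(W_dec, full_matrices=False)
P_from_weights = Q @ Q.T
P_pca = U_p @ U_p.T

print(np.allclose(X_ae, X_pca), np.allclose(P_from_weights, P_pca))
```

Comparing projection matrices rather than individual vectors is deliberate: without further constraints only the subspace is determined, not the individual components.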
 
However, the potential of autoencoders resides in their non-linearity, which allows the model to learn more powerful generalizations than PCA and to reconstruct the input with significantly lower information loss.<ref name=":7" />
 
=== Information Retrieval ===
[[Information retrieval|Information Retrieval]] benefits particularly from [[dimensionality reduction]] in that search can become extremely efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to '''[[semantic hashing]]''', proposed by [[Russ Salakhutdinov|Salakhutdinov]] and [[Geoffrey Hinton|Hinton]] in 2007.<ref name=":3" /> In a nutshell, the algorithm is trained to produce a low-dimensional binary code, so that all database entries can be stored in a [[hash table]] mapping binary code vectors to entries. This table then supports information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits of the query's encoding.
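
The lookup scheme can be sketched as follows. This is an illustrative sketch only: the 4-bit codes are hypothetical stand-ins for the binary codes an autoencoder would produce, and the function names are not taken from the cited work.

```python
from collections import defaultdict
from itertools import combinations


def build_table(codes):
    """Map each binary code (stored as an int) to the entries sharing it."""
    table = defaultdict(list)
    for entry_id, code in codes.items():
        table[code].append(entry_id)
    return table


def neighbours(code, n_bits, radius):
    """Yield every code within Hamming distance `radius` of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for pos in positions:
                flipped ^= 1 << pos   # flip one bit of the query code
            yield flipped


def lookup(table, query_code, n_bits, radius=0):
    """Return entries whose code is within `radius` bit flips of the query."""
    hits = []
    for candidate in neighbours(query_code, n_bits, radius):
        hits.extend(table.get(candidate, []))
    return hits


# Hypothetical 4-bit codes; in practice these would come from the encoder.
codes = {"doc_a": 0b1011, "doc_b": 0b1011, "doc_c": 0b1010, "doc_d": 0b0100}
table = build_table(codes)

exact = lookup(table, 0b1011, n_bits=4, radius=0)   # same code as the query
near = lookup(table, 0b1011, n_bits=4, radius=1)    # also codes one bit away
print(exact, near)
```

The number of candidate codes grows combinatorially with the radius, so in this sketch only small radii are practical.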
 
Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.<ref>{{cite arXiv|last1=Nalisnick|first1=Eric|last2=Matsukawa|first2=Akihiro|last3=Teh|first3=Yee Whye|last4=Gorur|first4=Dilan|last5=Lakshminarayanan|first5=Balaji|date=2019-02-24|title=Do Deep Generative Models Know What They Don't Know?|class=stat.ML|eprint=1810.09136}}</ref><ref>{{Cite journal|last1=Xiao|first1=Zhisheng|last2=Yan|first2=Qing|last3=Amit|first3=Yali|date=2020|title=Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder|url=https://proceedings.neurips.cc/paper/2020/hash/eddea82ad2755b24c4e168c5fc2ebd40-Abstract.html|journal=Advances in Neural Information Processing Systems|language=en|volume=33|arxiv=2003.02977}}</ref>
=== Anomaly Detection ===
Intuitively, this can be understood by considering those one-layer autoencoders which are related to PCA: in this case, too, there can be perfect reconstructions for points which are far away from the data region but which lie on a principal component axis.
Another field of application for autoencoders is [[anomaly detection]].<ref>Sakurada, M., & Yairi, T. (2014, December). Anomaly detection using autoencoders with nonlinear dimensionality reduction. In ''Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis'' (p. 4). ACM.</ref><ref name=":8">An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. ''Special Lecture on IE'', ''2'', 1-18.</ref><ref>Zhou, C., & Paffenroth, R. C. (2017, August). Anomaly detection with robust deep autoencoders. In ''Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining'' (pp. 665-674). ACM.</ref><ref>Ribeiro, M., Lazzaretti, A. E., & Lopes, H. S. (2018). A study of deep convolutional auto-encoders for anomaly detection in videos. ''Pattern Recognition Letters'', ''105'', 13-22.</ref><ref>{{Cite journal|last=Zavrak|first=Sultan|last2=Iskefiyeli|first2=Murat|date=2020|title=ANOMALY-BASED INTRUSION DETECTION FROM NETWORK FLOW FEATURES USING VARIATIONAL AUTOENCODER|url=https://ieeexplore.ieee.org/document/9113298/|journal=IEEE Access|pages=1–1|doi=10.1109/ACCESS.2020.3001350|issn=2169-3536|doi-access=free}}</ref> By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn how to precisely reproduce the most frequent characteristics of the observations. When facing anomalies, the model should worsen its reconstruction performance. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is so small compared to the whole population of observations, that its contribution to the representation learnt by the model could be ignored. 
After training, the autoencoder will reconstruct normal data very well, while failing to do so with anomaly data which the autoencoder has not encountered.<ref name=":8" /> Reconstruction error of a data point, which is the error between the original data point and its low dimensional reconstruction, is used as an anomaly score to detect anomalies.<ref name=":8" />
 
It is advisable to check whether the anomalies flagged by the autoencoder are true anomalies. For this purpose, all the metrics in [[Evaluation of binary classifiers]] can be considered. The fundamental challenge of the unsupervised (self-supervised) learning setting is that labels for rare events either do not exist (in which case the labels first have to be gathered and the data set will be imbalanced) or are very rare, which widens the [[confidence interval]]s of these performance estimates.
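
The quantile-threshold rule for anomaly detection can be sketched as follows. This is a minimal illustration under stated assumptions: the reconstruction errors are simulated from a gamma distribution rather than produced by a real autoencoder, the function names are hypothetical, and the bootstrap used to gauge the uncertainty of the threshold is an illustrative choice, not a method taken from the cited sources.

```python
import numpy as np


def fit_threshold(validation_errors, p=0.95):
    """Empirical p-quantile of the validation reconstruction errors."""
    return float(np.quantile(np.asarray(validation_errors, dtype=float), p))


def flag_anomalies(errors, threshold):
    """Flag points whose reconstruction error exceeds the threshold."""
    return np.asarray(errors, dtype=float) > threshold


def bootstrap_threshold_std(validation_errors, p=0.95, n_boot=500, seed=0):
    """Bootstrap estimate of the standard deviation of the threshold."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(validation_errors, dtype=float)
    resamples = rng.choice(errors, size=(n_boot, errors.size), replace=True)
    return float(np.quantile(resamples, p, axis=1).std())


# Simulated validation errors (assumption: stand-in for real model output).
rng = np.random.default_rng(42)
val_errors = rng.gamma(shape=2.0, scale=1.0, size=1000)

t = fit_threshold(val_errors, p=0.95)
t_std = bootstrap_threshold_std(val_errors, p=0.95)

# Hypothetical errors of three new points; only the last is far out.
new_errors = np.array([0.5, 2.0, 50.0])
flags = flag_anomalies(new_errors, t)
print(t, t_std, flags)
```

A non-negligible bootstrap spread relative to the threshold reflects the estimation uncertainty discussed for extreme quantiles.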
 
=== Image processing ===
The characteristics of autoencoders are useful in image processing.
 
One example can be found in lossy [[image compression]], where autoencoders outperformed other approaches and proved competitive against [[JPEG 2000]].<ref>{{cite arXiv |eprint=1703.00395|last1=Theis|first1=Lucas|last2=Shi|first2=Wenzhe|last3=Cunningham|first3=Andrew|last4=Huszár|first4=Ferenc|title=Lossy Image Compression with Compressive Autoencoders|class=stat.ML|date=2017}}</ref><ref>{{cite book |last1=Balle |first1=J |last2=Laparra |first2=V |last3=Simoncelli |first3=EP |chapter=End-to-end optimized image compression |title=International Conference on Learning Representations |date=April 2017 |arxiv=1611.01704}}</ref>
 
Another useful application of autoencoders in image preprocessing is [[image denoising]].<ref>Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ''International Conference on Machine Learning'' (pp. 432-440).</ref><ref>{{cite arXiv |eprint=1301.3468|last1=Cho|first1=Kyunghyun|title=Boltzmann Machines and Denoising Autoencoders for Image Denoising|class=stat.ML|date=2013}}</ref><ref>{{Cite journal|doi = 10.1137/040616024|title = A Review of Image Denoising Algorithms, with a New One |url=https://hal.archives-ouvertes.fr/hal-00271141 |year = 2005|last1 = Buades|first1 = A.|last2 = Coll|first2 = B.|last3 = Morel|first3 = J. M.|journal = Multiscale Modeling & Simulation|volume = 4|issue = 2|pages = 490–530|s2cid = 218466166 }}</ref>
 
Autoencoders found use in more demanding contexts such as [[medical imaging]] where they have been used for [[image denoising]]<ref>{{Cite book|last=Gondara|first=Lovedeep|title=2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) |chapter=Medical Image Denoising Using Convolutional Denoising Autoencoders |date=December 2016|___location=Barcelona, Spain|publisher=IEEE|pages=241–246|doi=10.1109/ICDMW.2016.0041|isbn=9781509059102|arxiv=1608.04667|bibcode=2016arXiv160804667G|s2cid=14354973}}</ref> as well as [[super-resolution]].<ref>{{Cite journal|last1=Zeng|first1=Kun|last2=Yu|first2=Jun|last3=Wang|first3=Ruxin|last4=Li|first4=Cuihua|last5=Tao|first5=Dacheng|s2cid=20787612|date=January 2017|title=Coupled Deep Autoencoder for Single Image Super-Resolution|journal=IEEE Transactions on Cybernetics|volume=47|issue=1|pages=27–37|doi=10.1109/TCYB.2015.2501373|pmid=26625442|bibcode=2017ITCyb..47...27Z |issn=2168-2267}}</ref><ref>{{cite book |last1=Tzu-Hsi |first1=Song |last2=Sanchez |first2=Victor |last3=Hesham |first3=EIDaly |last4=Nasir M. 
|first4=Rajpoot |title=2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) |chapter=Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images |date=2017 |pages=1040–1043 |doi=10.1109/ISBI.2017.7950694 |isbn=978-1-5090-1172-8 |s2cid=7433130 }}</ref> In image-assisted diagnosis, experiments have applied autoencoders for [[breast cancer]] detection<ref>{{cite journal |last1=Xu |first1=Jun |last2=Xiang |first2=Lei |last3=Liu |first3=Qingshan |last4=Gilmore |first4=Hannah |last5=Wu |first5=Jianzhong |last6=Tang |first6=Jinghai |last7=Madabhushi |first7=Anant |title=Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images |journal=IEEE Transactions on Medical Imaging |date=January 2016 |volume=35 |issue=1 |pages=119–130 |doi=10.1109/TMI.2015.2458702 |pmid=26208307 |pmc=4729702 |bibcode=2016ITMI...35..119X }}</ref> and for modelling the relation between the cognitive decline of [[Alzheimer's disease]] and the latent features of an autoencoder trained with [[MRI]].<ref>{{cite journal |last1=Martinez-Murcia |first1=Francisco J. |last2=Ortiz |first2=Andres |last3=Gorriz |first3=Juan M. |last4=Ramirez |first4=Javier |last5=Castillo-Barnes |first5=Diego |s2cid=195187846 |title=Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders |journal=IEEE Journal of Biomedical and Health Informatics |volume=24 |issue=1 |pages=17–26 |doi=10.1109/JBHI.2019.2914970 |pmid=31217131 |date=2020 |bibcode=2020IJBHI..24...17M |doi-access=free |hdl=10630/28806 |hdl-access=free }}</ref>
 
=== Drug discovery ===
In 2019 molecules generated with a special type of variational autoencoder were validated experimentally in mice.<ref>{{cite journal |last1=Zhavoronkov |first1=Alex|s2cid=201716327|date=2019|title=Deep learning enables rapid identification of potent DDR1 kinase inhibitors |journal=Nature Biotechnology |volume=37|issue=9|pages=1038–1040|doi=10.1038/s41587-019-0224-x |pmid=31477924}}</ref><ref>{{cite magazine |last1=Gregory |first1=Barber |title=A Molecule Designed By AI Exhibits 'Druglike' Qualities |url=https://www.wired.com/story/molecule-designed-ai-exhibits-druglike-qualities/ |magazine=Wired}}</ref>
 
=== Population synthesis ===
In 2019 a variational autoencoder framework was used for population synthesis by approximating high-dimensional survey data.<ref>{{Cite journal|last1=Borysov|first1=Stanislav S.|last2=Rich|first2=Jeppe|last3=Pereira|first3=Francisco C.|date=September 2019|title=How to generate micro-agents? A deep generative modeling approach to population synthesis|journal=Transportation Research Part C: Emerging Technologies|language=en|volume=106|pages=73–97|doi=10.1016/j.trc.2019.07.006|arxiv=1808.06910}}</ref> By sampling agents from the approximated distribution, new synthetic 'fake' populations with statistical properties similar to those of the original population were generated.
 
=== Popularity prediction ===
Recently, a stacked autoencoder framework produced promising results in predicting the popularity of social media posts,<ref>{{cite book |doi=10.1109/CSCITA.2017.8066548|chapter=Predicting the popularity of instagram posts for a lifestyle magazine using deep learning|title=2017 2nd IEEE International Conference on Communication Systems, Computing and IT Applications (CSCITA)|pages=174–177|date=2017|last1=De|first1=Shaunak|last2=Maity|first2=Abhishek|last3=Goel|first3=Vritti|last4=Shitole|first4=Sanjay|last5=Bhattacharya|first5=Avik|s2cid=35350962|isbn=978-1-5090-4381-1}}</ref> which is helpful for online advertising strategies.
 
=== Machine translation ===
Autoencoders have been successfully applied to the [[machine translation]] of human languages, which is usually referred to as [[neural machine translation]] (NMT).<ref>{{cite arXiv |eprint=1409.1259|last1=Cho|first1=Kyunghyun|author2=Bart van Merrienboer|last3=Bahdanau|first3=Dzmitry|last4=Bengio|first4=Yoshua|title=On the Properties of Neural Machine Translation: Encoder-Decoder Approaches|class=cs.CL|date=2014}}</ref><ref>{{cite arXiv |eprint=1409.3215|last1=Sutskever|first1=Ilya|last2=Vinyals|first2=Oriol|last3=Le|first3=Quoc V.|title=Sequence to Sequence Learning with Neural Networks|class=cs.CL|date=2014}}</ref> Unlike traditional autoencoders, the output does not match the input: it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. [[Language]]-specific autoencoders incorporate further [[linguistic]] features into the learning procedure, such as Chinese decomposition features.<ref>{{cite arXiv |eprint=1805.01565|last1=Han|first1=Lifeng|last2=Kuang|first2=Shaohui|title=Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level|class=cs.CL|date=2018}}</ref> Machine translation is now rarely done with autoencoders, due to the availability of more effective [[Transformer (machine learning model)|transformer]] networks.
 
=== Communication Systems ===
Autoencoders are important in communication systems because they help encode data into a representation that is more resilient to channel impairments, which is crucial for transmitting information while minimizing errors. In addition, autoencoder-based systems can optimize end-to-end communication performance. This approach can address several limitations of communication system design, such as the inherent difficulty of accurately modeling the complex behavior of real-world channels.<ref>{{cite arXiv |eprint=2412.13843|last1=Alnaseri|first1=Omar|last2=Alzubaidi|first2=Laith|last3=Himeur|first3=Yassine|last4=Timmermann|first4=Jens|title=A Review on Deep Learning Autoencoder in the Design of Next-Generation Communication Systems|class=eess.SP|date=2024}}</ref>
 
==See also==
* [[Representation learning]]
* [[Singular value decomposition]]
* [[Sparse dictionary learning]]
* [[Deep learning]]
 
== Further reading ==
 
* {{cite book |last1=Bank |first1=Dor |title=Machine Learning for Data Science Handbook |last2=Koenigstein |first2=Noam |last3=Giryes |first3=Raja |publisher=Springer International Publishing |year=2023 |isbn=978-3-031-24627-2 |publication-place=Cham |chapter=Autoencoders |doi=10.1007/978-3-031-24628-9_16}}
* {{Cite book |last1=Goodfellow |first1=Ian |title=Deep learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |date=2016 |publisher=The MIT press |isbn=978-0-262-03561-3 |series=Adaptive computation and machine learning |___location=Cambridge, Mass |chapter=14. Autoencoders |chapter-url=https://www.deeplearningbook.org/contents/autoencoders.html}}
 
==References==
{{Reflist|30em}}
 
{{Artificial intelligence navbox}}
{{Noise}}
 
[[Category:Neural network architectures]]
[[Category:Unsupervised learning]]
[[Category:Dimension reduction]]