Autoencoder: Difference between revisions

Lokfahrer (talk | contribs)
remove external dead link; version visible though https://web.archive.org/web/20210629102308/https://pythoncodingai.com/autoencoders-its-types-and-design/ seems to not contribute much that the article itself does not already cover.
Citation bot (talk | contribs)
Alter: pages, template type. Add: magazine, bibcode, website, authors 1-1. Removed parameters. Formatted dashes. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by AManWithNoPlan | #UCB_webform 250/1776
Line 5:
An '''autoencoder''' is a type of [[artificial neural network]] used to learn [[Feature learning|efficient codings]] of unlabeled data ([[unsupervised learning]]).<ref>{{cite journal|doi=10.1002/aic.690370209|title=Nonlinear principal component analysis using autoassociative neural networks|journal=AIChE Journal|volume=37|issue=2|pages=233–243|date=1991|last1=Kramer|first1=Mark A.|url= https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf}}</ref> The encoding is validated and refined by attempting to regenerate the input from the encoding. The autoencoder learns a [[Feature learning|representation]] (encoding) for a set of data, typically for [[dimensionality reduction]], by training the network to ignore insignificant data (“noise”).
 
Variants exist, aiming to force the learned representations to assume useful properties.<ref name=":0" /> Examples are regularized autoencoders (''Sparse'', ''Denoising'' and ''Contractive''), which are effective in learning representations for subsequent [[Statistical classification|classification]] tasks,<ref name=":4" /> and ''Variational'' autoencoders, with applications as [[generative model]]s.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392|s2cid=174802445}}</ref> Autoencoders are applied to many problems, from [[face recognition|facial recognition]],<ref>Hinton GE, Krizhevsky A, Wang SD. [http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf Transforming auto-encoders.] In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). 
Springer, Berlin, Heidelberg.</ref> feature detection,<ref name=":2">{{Cite book|last=Géron|first=Aurélien|title=Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow|publisher=O’Reilly Media, Inc.|year=2019|location=Canada|pages=739–740}}</ref> anomaly detection to acquiring the meaning of words.<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie}}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref> Autoencoders are also generative models: they can randomly generate new data that is similar to the input data (training data).<ref name=":2" />
 
{{Toclimit|3}}
Line 14:
The simplest way to perform the copying task perfectly would be to duplicate the signal. Instead, autoencoders are typically forced to reconstruct the input approximately, preserving only the most relevant aspects of the data in the copy.
 
The idea of autoencoders has been popular for decades. The first applications date to the 1980s.<ref name=":0" /><ref>{{Cite journal|last=Schmidhuber|first=Jürgen|date=January 2015|title=Deep learning in neural networks: An overview|journal=Neural Networks|volume=61|pages=85–117|doi=10.1016/j.neunet.2014.09.003|pmid=25462637|arxiv=1404.7828|s2cid=11715509}}</ref><ref>Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In ''Advances in neural information processing systems 6'' (pp. 3-10).</ref> Their most traditional application was [[dimensionality reduction]] or [[feature learning]], but the concept became widely used for learning [[generative model]]s of data.<ref name="VAE">{{cite arXiv|eprint=1312.6114|author1=Diederik P Kingma|title=Auto-Encoding Variational Bayes|last2=Welling|first2=Max|class=stat.ML|date=2013}}</ref><ref name="gan_faces">Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 {{url|http://torch.ch/blog/2015/11/13/gan.html}}</ref> Some of the most powerful [[Artificial intelligence|AIs]] in the 2010s involved autoencoders stacked inside [[Deep learning|deep]] neural networks.<ref name="domingos">{{cite book|title=The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World|title-link=The Master Algorithm|last1=Domingos|first1=Pedro|publisher=Basic Books|date=2015|isbn=978-046506192-1|at="Deeper into the Brain" subsection|chapter=4|author-link=Pedro Domingos}}</ref>

[[File:Autoencoder schema.png|thumb|Schema of a basic Autoencoder]]
The simplest form of an autoencoder is a [[feedforward neural network|feedforward]], non-[[recurrent neural network]] similar to single layer [[perceptrons]] that participate in [[multilayer perceptron]]s (MLP) – employing an input layer and an output layer connected by one or more hidden layers. The output layer has the same number of nodes (neurons) as the input layer. Its purpose is to reconstruct its inputs (minimizing the difference between the input and the output) instead of predicting a target value <math>Y</math> given inputs <math>X</math>. Therefore, autoencoders learn unsupervised.
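
The feedforward structure described above can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited works: the function names <code>train_autoencoder</code> and <code>reconstruct</code>, the tanh hidden layer, the mean-squared-error loss and the plain gradient-descent settings are all assumptions chosen for brevity.

```python
import numpy as np

def train_autoencoder(X, hidden=2, lr=0.05, epochs=300, seed=0):
    """Fit a one-hidden-layer autoencoder to X of shape (n_samples, n_features).

    Illustrative only: tanh code layer, linear output layer, MSE loss,
    full-batch gradient descent. Returns the parameters and the loss history.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden))   # encoder weights
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))   # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)             # code (hidden representation)
        X_hat = H @ W2 + b2                  # output has the same width as the input
        err = X_hat - X                      # the target is the input itself
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
        gW1, gb1 = X.T @ g_hid, g_hid.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), losses

def reconstruct(X, params):
    """Encode and then decode X with trained parameters."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2
```

No labels appear anywhere in the training loop: the only supervision signal is the input itself, which is why the procedure counts as unsupervised.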
 
An autoencoder consists of two parts, the encoder and the decoder, which can be defined as transitions <math>\phi</math> and <math>\psi,</math> such that:
Line 69:
:where <math>j</math> is summing over the <math>s</math> hidden nodes in the hidden layer, and <math>KL(\rho || \hat{\rho_j}) </math> is the KL-divergence between a [[Bernoulli distribution|Bernoulli random variable]] with mean <math>\rho</math> and a Bernoulli random variable with mean <math>\hat{\rho_j}</math>.<ref name=":6" />
 
* Another way to achieve sparsity is by applying L1 or L2 regularization terms on the activation, scaled by a certain parameter <math>\lambda</math>.<ref>{{cite arXiv |eprint=1505.05561|last1=Arpit|first1=Devansh|last2=Zhou|first2=Yingbo|last3=Ngo|first3=Hung|last4=Govindaraju|first4=Venu|title=Why Regularized Auto-Encoders learn Sparse Representation?|class=stat.ML|date=2015}}</ref> For instance, in the case of L1 the [[loss function]] becomes
 
::<math>\mathcal{L}(\mathbf{x},\mathbf{x'}) + \lambda \sum_i |h_i|</math>
 
* A further proposed strategy to force sparsity is to manually zero all but the strongest hidden unit activations (''k-sparse autoencoder'').<ref name=":1">{{cite arXiv |eprint=1312.5663|last1=Makhzani|first1=Alireza|last2=Frey|first2=Brendan|title=K-Sparse Autoencoders|class=cs.LG|date=2013}}</ref> The k-sparse autoencoder is based on a linear autoencoder (i.e. with linear activation function) and tied weights. The identification of the strongest activations can be achieved by sorting the activities and keeping only the first ''k'' values, or by using [[Rectifier (neural networks)|ReLU]] hidden units with thresholds that are adaptively adjusted until the k largest activities are identified. This selection acts like the previously mentioned regularization terms in that it prevents the model from reconstructing the input using too many neurons.<ref name=":1" />
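
The three sparsity mechanisms listed above can be sketched in NumPy as follows. This is an illustrative sketch rather than code from the cited papers: the function names, the default values of <code>rho</code> and <code>lam</code>, the batch averaging in the L1 term, and the choice to rank activations by signed value in <code>k_sparse</code> are assumptions. Each penalty would be added to the reconstruction loss during training.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """Sum over hidden units j of KL(rho || rho_hat_j).

    H holds activations of shape (batch, hidden), assumed to lie in (0, 1)
    (e.g. sigmoid outputs); rho_hat_j is the mean activation of unit j.
    """
    rho_hat = np.clip(H.mean(axis=0), eps, 1.0 - eps)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))))

def l1_sparsity_penalty(H, lam=1e-3):
    """lambda * sum_i |h_i|, averaged over the batch (the averaging is an assumption)."""
    return float(lam * np.abs(H).sum(axis=1).mean())

def k_sparse(H, k):
    """Keep the k largest activations in each row of H and zero the rest."""
    out = np.zeros_like(H)
    top = np.argpartition(H, -k, axis=1)[:, -k:]   # column indices of the k largest
    rows = np.arange(H.shape[0])[:, None]
    out[rows, top] = H[rows, top]
    return out
```

The KL term is zero when every unit's mean activation equals <code>rho</code> and grows as the units become more active on average, whereas <code>k_sparse</code> enforces sparsity directly on each sample instead of through a penalty.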
 
====Denoising autoencoder (DAE)====
Line 107:
 
=== Concrete autoencoder ===
The concrete autoencoder is designed for discrete feature selection.<ref>{{cite arXiv|last1=Abid|first1=Abubakar|last2=Balin|first2=Muhammad Fatih|last3=Zou|first3=James|date=2019-01-27|title=Concrete Autoencoders for Differentiable Feature Selection and Reconstruction|eprint=1901.09346|class=cs.LG}}</ref> A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous [[Relaxation (approximation)|relaxation]] of the [[categorical distribution]] to allow gradients to pass through the feature selector layer, which makes it possible to use standard [[backpropagation]] to learn an optimal subset of input features that minimizes reconstruction loss.
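
A rough sketch of such a selector layer is given below, assuming the relaxation takes the common Gumbel-softmax form (a softmax over logits perturbed by Gumbel noise and divided by a temperature). The function name <code>concrete_select</code> and its arguments are hypothetical, and only the forward pass is shown. In training, the logits would be learned by backpropagation and the temperature lowered over time.

```python
import numpy as np

def concrete_select(X, logits, temperature, rng):
    """Relaxed selection of k features out of d.

    X:       (n, d) input batch
    logits:  (k, d) one row of unnormalised log-probabilities per selected feature
    Returns the (n, k) softly selected features and the (k, d) selection weights.
    """
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / temperature
    z -= z.max(axis=1, keepdims=True)        # subtract row max for numerical stability
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)        # each row is a point on the simplex
    return X @ w.T, w
```

At high temperature each output is a smooth mixture of all input features. As the temperature approaches zero, each row of the weights approaches a one-hot vector, so each output unit copies a single input feature.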
 
===Variational autoencoder (VAE)===
Line 127:
[[Geoffrey Hinton]] developed the [[deep belief network]] technique for training many-layered deep autoencoders. His method involves treating each neighbouring set of two layers as a [[restricted Boltzmann machine]] so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.<ref name=":7">{{cite journal|last1=Hinton|first1=G. E.|last2=Salakhutdinov|first2=R.R.|title=Reducing the Dimensionality of Data with Neural Networks|journal=Science|date=28 July 2006|volume=313|issue=5786|pages=504–507|doi=10.1126/science.1127647|pmid=16873662|bibcode=2006Sci...313..504H|s2cid=1658773}}</ref>
 
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref>
 
== Applications ==
Line 133:
 
=== Dimensionality reduction ===
[[File:PCA vs Linear Autoencoder.png|thumb|Plot of the first two principal components (left) and the two-dimensional hidden layer of a linear autoencoder (right) applied to the [[Fashion MNIST dataset]].<ref name=":10">{{Cite web|url=https://github.com/zalandoresearch/fashion-mnist|title=Fashion MNIST|website=[[GitHub]]|date=2019-07-12}}</ref> Because both models are linear, they learn to span the same subspace. The projection of the data points is identical apart from a rotation of the subspace, to which PCA is invariant.]][[Dimensionality reduction]] was one of the first [[deep learning]] applications.<ref name=":0" />
 
For Hinton's 2006 study,<ref name=":7" /> he pretrained a multi-layer autoencoder with a stack of [[Restricted Boltzmann machine|RBMs]] and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.<ref name=":0" /><ref name=":7" />
Line 141:
==== Principal component analysis ====
[[File:Reconstruction autoencoders vs PCA.png|thumb|Reconstruction of 28×28-pixel images by an autoencoder with a code size of two (two-unit hidden layer) and the reconstruction from the first two principal components of PCA. Images come from the [[Fashion MNIST dataset]].<ref name=":10" />]]
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to [[principal component analysis]] (PCA).<ref>{{Cite journal|last1=Bourlard|first1=H.|last2=Kamp|first2=Y.|date=1988|title=Auto-association by multilayer perceptrons and singular value decomposition|journal=Biological Cybernetics|volume=59|issue=4–5|pages=291–294|doi=10.1007/BF00332918|pmid=3196773|s2cid=206775335|url=http://infoscience.epfl.ch/record/82601}}</ref><ref>{{cite book|title=Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14|last1=Chicco|first1=Davide|last2=Sadowski|first2=Peter|last3=Baldi|first3=Pierre|date=2014|isbn=9781450328944|pages=533|chapter=Deep autoencoder neural networks for gene ontology annotation predictions|doi=10.1145/2649387.2649442|hdl=11311/964622|s2cid=207217210|url=http://dl.acm.org/citation.cfm?id=2649442}}</ref> The weights of an autoencoder with a single hidden layer of size <math>p</math> (where <math>p</math> is less than the size of the input) span the same vector subspace as the one spanned by the first <math>p</math> principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the [[singular value decomposition]].<ref>{{cite arXiv|last1=Plaut|first1=E|title=From Principal Subspaces to Principal Components with Linear Autoencoders|eprint=1804.10253|date=2018|class=stat.ML}}</ref>
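
The subspace relationship can be checked numerically with a short NumPy sketch. Rather than training a network, the sketch constructs one optimal linear autoencoder by mixing the principal directions with an arbitrary invertible matrix <code>A</code>. The synthetic data, <code>A</code>, and all variable names are assumptions made for illustration, and only the principal ''subspace'' (not the individual components) is recovered here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)                           # PCA assumes centred data

p = 2
# First p principal directions: leading right-singular vectors of the data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:p].T                                  # shape (6, p), orthonormal columns

# Any invertible mixing of the code leaves the reconstruction unchanged,
# so optimal autoencoder weights need not be orthogonal.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
W_enc = V @ A                                 # encoder: code = x @ W_enc
W_dec = np.linalg.inv(A) @ V.T                # decoder: x_hat = code @ W_dec

# The autoencoder output equals the orthogonal projection onto the PCA subspace.
X_hat_ae = X @ W_enc @ W_dec
X_hat_pca = X @ V @ V.T

# An SVD of the decoder weights recovers an orthonormal basis of that subspace.
U, _, _ = np.linalg.svd(W_dec.T, full_matrices=False)
```

Here <code>W_enc.T @ W_enc</code> equals <code>A.T @ A</code>, which is not the identity, so the encoder columns are not orthonormal, while <code>U @ U.T</code> and <code>V @ V.T</code> describe the same projector.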
 
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.<ref name=":7" />
Line 149:
 
=== Anomaly detection ===
Another application for autoencoders is [[anomaly detection]].<ref>Morales-Forero, A., & Bassetto, S. (2019, December). Case Study: A Semi-Supervised Methodology for Anomaly Detection and Diagnosis. In ''2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM)'' (p. 4) (pp. 1031-1037). IEEE.</ref><ref>Sakurada, M., & Yairi, T. (2014, December). Anomaly detection using autoencoders with nonlinear dimensionality reduction. In ''Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis'' (p. 4). ACM.</ref><ref name=":8">An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. ''Special Lecture on IE'', ''2'', 1-18.</ref><ref>Zhou, C., & Paffenroth, R. C. (2017, August). Anomaly detection with robust deep autoencoders. In ''Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining'' (pp. 665-674). ACM.</ref><ref>{{Cite journal|doi=10.1016/j.patrec.2017.07.016|title=A study of deep convolutional auto-encoders for anomaly detection in videos|year=2018|last1=Ribeiro|first1=Manassés|last2=Lazzaretti|first2=André Eugênio|last3=Lopes|first3=Heitor Silvério|journal=Pattern Recognition Letters|volume=105|pages=13–22|bibcode=2018PaReL.105...13R}}</ref> By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model's reconstruction performance should worsen. In most cases, only data with normal instances are used to train the autoencoder; in others, anomalies are rare enough in the observation set that their contribution to the learned representation can be ignored. 
After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data.<ref name=":8" /> Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies.<ref name=":8" />
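
The scoring scheme can be sketched as follows. For brevity the sketch substitutes a linear autoencoder obtained in closed form from an SVD for a trained nonlinear network. The function names and the choice of squared error as the score are assumptions, not part of the cited methods.

```python
import numpy as np

def fit_linear_autoencoder(X_normal, code_size):
    """Fit a linear encoder/decoder pair on data assumed to contain only normal samples.

    Returns the training mean and a (n_features, code_size) matrix V used both
    to encode (x @ V) and to decode (code @ V.T).
    """
    mean = X_normal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_normal - mean, full_matrices=False)
    return mean, Vt[:code_size].T

def anomaly_score(X, mean, V):
    """Squared reconstruction error per sample; larger means more anomalous."""
    Xc = X - mean
    X_hat = (Xc @ V) @ V.T                    # encode to the code, then decode
    return np.sum((Xc - X_hat) ** 2, axis=1)
```

A detection threshold would typically be chosen from the scores of held-out normal data, for example a high percentile. That choice is application-specific and is not prescribed by the sources cited here.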
 
Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.<ref>{{cite arXiv|last1=Nalisnick|first1=Eric|last2=Matsukawa|first2=Akihiro|last3=Teh|first3=Yee Whye|last4=Gorur|first4=Dilan|last5=Lakshminarayanan|first5=Balaji|date=2019-02-24|title=Do Deep Generative Models Know What They Don't Know?|class=stat.ML|eprint=1810.09136}}</ref><ref>{{Cite journal|last1=Xiao|first1=Zhisheng|last2=Yan|first2=Qing|last3=Amit|first3=Yali|date=2020|title=Likelihood Regret: An Out-of-Distribution Detection Score For Variational Auto-encoder|url=https://proceedings.neurips.cc/paper/2020/hash/eddea82ad2755b24c4e168c5fc2ebd40-Abstract.html|journal=Advances in Neural Information Processing Systems|language=en|volume=33|arxiv=2003.02977}}</ref>
 
=== Image processing ===
The characteristics of autoencoders are useful in image processing.
 
One example can be found in lossy [[image compression]], where autoencoders outperformed other approaches and proved competitive against [[JPEG 2000]].<ref>{{cite arXiv |eprint=1703.00395|last1=Theis|first1=Lucas|last2=Shi|first2=Wenzhe|last3=Cunningham|first3=Andrew|last4=Huszár|first4=Ferenc|title=Lossy Image Compression with Compressive Autoencoders|class=stat.ML|date=2017}}</ref><ref>{{cite book |last1=Balle |first1=J |last2=Laparra |first2=V |last3=Simoncelli |first3=EP |chapter=End-to-end optimized image compression |title=International Conference on Learning Representations |date=April 2017 |arxiv=1611.01704}}</ref>
 
Another useful application of autoencoders in image preprocessing is [[image denoising]].<ref>Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ''International Conference on Machine Learning'' (pp. 432-440).</ref><ref>{{cite arXiv |eprint=1301.3468|last1=Cho|first1=Kyunghyun|title=Boltzmann Machines and Denoising Autoencoders for Image Denoising|class=stat.ML|date=2013}}</ref><ref>{{Cite journal|doi = 10.1137/040616024|title = A Review of Image Denoising Algorithms, with a New One |url=https://hal.archives-ouvertes.fr/hal-00271141 |year = 2005|last1 = Buades|first1 = A.|last2 = Coll|first2 = B.|last3 = Morel|first3 = J. M.|journal = Multiscale Modeling & Simulation|volume = 4|issue = 2|pages = 490–530}}</ref>
 
Autoencoders found use in more demanding contexts such as [[medical imaging]] where they have been used for [[image denoising]]<ref>{{Cite journal|last=Gondara|first=Lovedeep|date=December 2016|title=Medical Image Denoising Using Convolutional Denoising Autoencoders|journal=2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW)|location=Barcelona, Spain|publisher=IEEE|pages=241–246|doi=10.1109/ICDMW.2016.0041|isbn=9781509059102|arxiv=1608.04667|bibcode=2016arXiv160804667G|s2cid=14354973}}</ref> as well as [[super-resolution]].<ref>{{Cite journal|last1=Zeng|first1=Kun|last2=Yu|first2=Jun|last3=Wang|first3=Ruxin|last4=Li|first4=Cuihua|last5=Tao|first5=Dacheng|s2cid=20787612|date=January 2017|title=Coupled Deep Autoencoder for Single Image Super-Resolution|journal=IEEE Transactions on Cybernetics|volume=47|issue=1|pages=27–37|doi=10.1109/TCYB.2015.2501373|pmid=26625442|issn=2168-2267}}</ref><ref>{{cite journal |last1=Tzu-Hsi |first1=Song |last2=Sanchez |first2=Victor |last3=Hesham |first3=EIDaly |last4=Nasir M. 
|first4=Rajpoot |title=Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images |journal=2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) |date=2017 |pages=1040–1043 |doi=10.1109/ISBI.2017.7950694 |isbn=978-1-5090-1172-8 |s2cid=7433130 }}</ref> In image-assisted diagnosis, experiments have applied autoencoders for [[breast cancer]] detection<ref>{{cite journal |last1=Xu |first1=Jun |last2=Xiang |first2=Lei |last3=Liu |first3=Qingshan |last4=Gilmore |first4=Hannah |last5=Wu |first5=Jianzhong |last6=Tang |first6=Jinghai |last7=Madabhushi |first7=Anant |title=Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images |journal=IEEE Transactions on Medical Imaging |date=January 2016 |volume=35 |issue=1 |pages=119–130 |doi=10.1109/TMI.2015.2458702 |pmid=26208307 |pmc=4729702 }}</ref> and for modelling the relation between the cognitive decline of [[Alzheimer's disease]] and the latent features of an autoencoder trained with [[MRI]].<ref>{{cite journal |last1=Martinez-Murcia |first1=Francisco J. |last2=Ortiz |first2=Andres |last3=Gorriz |first3=Juan M. |last4=Ramirez |first4=Javier |last5=Castillo-Barnes |first5=Diego |s2cid=195187846 |title=Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders |journal=IEEE Journal of Biomedical and Health Informatics |volume=24 |issue=1 |pages=17–26 |doi=10.1109/JBHI.2019.2914970 |pmid=31217131 |date=2020 |doi-access=free }}</ref>
 
=== Drug discovery ===
In 2019 molecules generated with variational autoencoders were validated experimentally in mice.<ref>{{cite journal |last1=Zhavoronkov |first1=Alex|s2cid=201716327|date=2019|title=Deep learning enables rapid identification of potent DDR1 kinase inhibitors |journal=Nature Biotechnology |volume=37|issue=9|pages=1038–1040|doi=10.1038/s41587-019-0224-x |pmid=31477924}}</ref><ref>{{cite magazine |last1=Gregory |first1=Barber |title=A Molecule Designed By AI Exhibits 'Druglike' Qualities |url=https://www.wired.com/story/molecule-designed-ai-exhibits-druglike-qualities/ |magazine=Wired}}</ref>
 
=== Popularity prediction ===
Line 169:
 
=== Machine translation ===
Autoencoders have been applied to [[machine translation]], which is usually referred to as [[neural machine translation]] (NMT).<ref>{{cite arXiv |eprint=1409.1259|last1=Cho|first1=Kyunghyun|author2=Bart van Merrienboer|last3=Bahdanau|first3=Dzmitry|last4=Bengio|first4=Yoshua|title=On the Properties of Neural Machine Translation: Encoder-Decoder Approaches|class=cs.CL|date=2014}}</ref><ref>{{cite arXiv |eprint=1409.3215|last1=Sutskever|first1=Ilya|last2=Vinyals|first2=Oriol|last3=Le|first3=Quoc V.|title=Sequence to Sequence Learning with Neural Networks|class=cs.CL|date=2014}}</ref> Unlike traditional autoencoders, the output does not match the input - it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. [[Language]]-specific autoencoders incorporate further [[linguistic]] features into the learning procedure, such as Chinese decomposition features.<ref>{{cite arXiv |eprint=1805.01565|last1=Han|first1=Lifeng|last2=Kuang|first2=Shaohui|title=Incorporating Chinese Radicals into Neural Machine Translation: Deeper Than Character Level|class=cs.CL|date=2018}}</ref>
 
==See also==