[[File:T-SNE visualisation of word embeddings generated using 19th century literature.png|thumb|T-SNE visualisation of [[word embedding]]s generated using 19th century literature]]
[[File:T-SNE Embedding of MNIST.png|thumb|T-SNE embeddings of [[MNIST]] dataset]]
{{lowercase title}}
{{Data Visualization}}
'''t-distributed stochastic neighbor embedding''' ('''t-SNE''') is a [[statistical]] method for visualizing high-dimensional data by giving each datapoint a ___location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by [[Geoffrey Hinton]] and Sam Roweis,<ref name="SNE">{{cite conference |author1-last=Hinton |author1-first=Geoffrey |author2-last=Roweis |author2-first=Sam |date=January 2002 |title=Stochastic neighbor embedding |url=https://papers.nips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf |conference=[[Neural Information Processing Systems]]}}</ref> where Laurens van der Maaten and Hinton proposed the [[Student's t-distribution|''t''-distributed]] variant.<ref name=MaatenHinton>{{cite journal |last=van der Maaten |first=L.J.P. |author2=Hinton, G.E. |title=Visualizing Data Using t-SNE |journal=Journal of Machine Learning Research |volume=9 |date=Nov 2008 |pages=2579–2605 |url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> It is a [[nonlinear dimensionality reduction]] technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
 
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a [[probability distribution]] over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the [[Kullback–Leibler divergence]] (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the [[Euclidean distance]] between objects as the base of its similarity metric, this can be changed as appropriate. A [[Riemannian metric|Riemannian]] variant is [[Uniform manifold approximation and projection|UMAP]].
 
t-SNE has been used for visualization in a wide range of applications, including [[genomics]], [[computer security]] research,<ref>{{cite journal|last=Gashi|first=I.|author2=Stankovic, V. |author3=Leita, C. |author4=Thonnard, O. |title=An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines|journal=Proceedings of the IEEE International Symposium on Network Computing and Applications|year=2009|pages=4–11}}</ref> [[natural language processing]], [[music analysis]],<ref>{{cite journal|last=Hamel|first=P.|author2=Eck, D. |title=Learning Features from Music Audio with Deep Belief Networks|journal=Proceedings of the International Society for Music Information Retrieval Conference|year=2010|pages=339–344}}</ref> [[cancer research]],<ref>{{cite journal|last=Jamieson|first=A.R.|author2=Giger, M.L. |author3=Drukker, K. |author4=Lui, H. |author5=Yuan, Y. |author6=Bhooshan, N. |title=Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE|journal=Medical Physics |issue=1|year=2010|pages=339–351|doi=10.1118/1.3267037|pmid=20175497|volume=37|pmc=2807447}}</ref> [[bioinformatics]],<ref>{{cite journal|last=Wallach|first=I.|author2=Liliean, R. |title=The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding|journal=Bioinformatics |year=2009|pages=615–620|doi=10.1093/bioinformatics/btp035|volume=25|issue=5|pmid=19153135|doi-access=free}}</ref> geological ___domain interpretation,<ref>{{Cite journal|date=2019-04-01|title=A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data|url=https://www.sciencedirect.com/science/article/pii/S0098300418306010|journal=Computers & Geosciences|language=en|volume=125|pages=78–89|doi=10.1016/j.cageo.2019.01.011|issn=0098-3004|last1=Balamurali|first1=Mehala|last2=Silversides|first2=Katherine L.|last3=Melkumyan|first3=Arman|bibcode=2019CG....125...78B |s2cid=67926902|url-access=subscription}}</ref><ref>{{Cite book|last1=Balamurali|first1=Mehala|last2=Melkumyan|first2=Arman|date=2016|editor-last=Hirose|editor-first=Akira|editor2-last=Ozawa|editor2-first=Seiichi|editor3-last=Doya|editor3-first=Kenji|editor4-last=Ikeda|editor4-first=Kazushi|editor5-last=Lee|editor5-first=Minho|editor6-last=Liu|editor6-first=Derong|chapter=t-SNE Based Visualisation and Clustering of Geological Domain|chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-46681-1_67|title=Neural Information Processing|series=Lecture Notes in Computer Science|volume=9950|language=en|___location=Cham|publisher=Springer International Publishing|pages=565–572|doi=10.1007/978-3-319-46681-1_67|isbn=978-3-319-46681-1}}</ref><ref>{{Cite journal|last1=Leung|first1=Raymond|last2=Balamurali|first2=Mehala|last3=Melkumyan|first3=Arman|date=2021-01-01|title=Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering|url=https://doi.org/10.1007/s11004-019-09839-z|journal=Mathematical Geosciences|language=en|volume=53|issue=1|pages=105–130|doi=10.1007/s11004-019-09839-z|bibcode=2021MatGe..53..105L |s2cid=208329378|issn=1874-8953|url-access=subscription}}</ref> and biomedical signal processing.<ref>{{Cite book|last1=Birjandtalab|first1=J.|last2=Pouyan|first2=M. B.|last3=Nourani|first3=M.|title=2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI)|chapter=Nonlinear dimension reduction for EEG-based epileptic seizure detection|date=2016-02-01|pages=595–598|doi=10.1109/BHI.2016.7455968|isbn=978-1-5090-2455-1|s2cid=8074617}}</ref>
 
For a data set with ''n'' elements, t-SNE runs in {{math|O(''n''<sup>2</sup>)}} time and requires {{math|O(''n''<sup>2</sup>)}} space.<ref>{{cite arXiv|title=Approximated and User Steerable tSNE for Progressive Visual Analytics|last=Pezzotti|first=Nicola|date=2015 |class=cs.CV |eprint=1512.01655 }}</ref>
 
== Details ==
Given a set of <math>N</math> high-dimensional objects <math>\mathbf{x}_1, \dots, \mathbf{x}_N</math>, t-SNE first computes probabilities <math>p_{ij}</math> that are proportional to the similarity of objects <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>, as follows.
 
For <math>i \neq j</math>, define
: <math>p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}</math>
and set <math>p_{i\mid i} = 0</math>.
Note that the above denominator ensures <math>\sum_j p_{j\mid i} = 1</math> for all <math>i</math>.
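
The following sketch (in Python with [[NumPy]]; the function name and variables are illustrative, not taken from any published implementation) computes one row of conditional probabilities for a fixed bandwidth <math>\sigma_i</math>:

<syntaxhighlight lang="python">
import numpy as np

def conditional_probabilities(X, i, sigma_i):
    """Row p_{.|i} of conditional similarities for point i (illustrative sketch)."""
    # Squared Euclidean distances from x_i to every point.
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    # Unnormalized Gaussian affinities; p_{i|i} is defined to be zero.
    affinities = np.exp(-d2 / (2.0 * sigma_i ** 2))
    affinities[i] = 0.0
    # The denominator in the formula above makes the row sum to one.
    return affinities / affinities.sum()
</syntaxhighlight>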
 
As van der Maaten and Hinton explained: "The similarity of datapoint <math>x_j</math> to datapoint <math>x_i</math> is the conditional probability, <math>p_{j|i}</math>, that <math>x_i</math> would pick <math>x_j</math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at <math>x_i</math>."<ref name=MaatenHinton/>
 
Now define
: <math>p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}</math>
 
This is motivated as follows: estimating the marginals <math>p_i</math> and <math>p_j</math> from the <math>N</math> samples as <math>1/N</math>, the conditionals of a joint distribution satisfy <math>p_{j\mid i} = Np_{ij}</math> and <math>p_{i\mid j} = Np_{ji}</math>. Since a symmetric joint has <math>p_{ij} = p_{ji}</math>, substituting these into the right-hand side above recovers <math>p_{ij}</math>, which motivates the formula.
 
Also note that <math>p_{ii} = 0 </math> and <math>\sum_{i, j} p_{ij} = 1</math>.
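
Continuing the sketch above, the symmetrized affinities can be assembled as follows (again illustrative, assuming per-point bandwidths <code>sigmas</code> have already been chosen):

<syntaxhighlight lang="python">
import numpy as np

def joint_probabilities(X, sigmas):
    """Symmetrized affinities p_ij from per-point bandwidths (illustrative sketch)."""
    N = X.shape[0]
    P = np.zeros((N, N))
    for i in range(N):
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        row = np.exp(-d2 / (2.0 * sigmas[i] ** 2))
        row[i] = 0.0
        P[i] = row / row.sum()      # conditional p_{j|i}
    P = (P + P.T) / (2.0 * N)       # p_ij = (p_{j|i} + p_{i|j}) / 2N
    return P                        # P sums to one and has a zero diagonal
</syntaxhighlight>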
 
The bandwidth of the [[Gaussian kernel]]s <math>\sigma_i</math> is set in such a way that the [[Entropy (information theory)|entropy]] of the conditional distribution equals a predefined entropy, using the [[bisection method]]. As a result, the bandwidth is adapted to the [[density]] of the data: smaller values of <math>\sigma_i</math> are used in denser parts of the data space. The entropy increases with the [[perplexity]] of this distribution <math>P_i</math>; this relation is seen as
 
: <math>Perp(P_i) = 2^{H(P_i)}</math>
 
where <math>H(P_i)</math> is the Shannon entropy <math>H(P_i)=-\sum_jp_{j|i}\log_2p_{j|i}.</math>
 
The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."<ref name=MaatenHinton/>
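
A minimal sketch of this bandwidth search, assuming the <code>conditional_probabilities</code> helper from the sketch above and a user-chosen target perplexity, might look as follows:

<syntaxhighlight lang="python">
import numpy as np

def perplexity(p):
    """Perplexity 2^H(P_i) of one conditional distribution."""
    p = p[p > 0]                              # avoid log(0)
    return 2.0 ** (-np.sum(p * np.log2(p)))   # H is the Shannon entropy in bits

def find_sigma(X, i, target, lo=1e-10, hi=1e4, tol=1e-5, max_iter=64):
    """Bisect on sigma_i until the row perplexity matches the target."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = conditional_probabilities(X, i, mid)  # defined in the sketch above
        if perplexity(p) > target:
            hi = mid   # distribution too spread out: shrink the bandwidth
        else:
            lo = mid   # distribution too concentrated: widen the bandwidth
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
</syntaxhighlight>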
 
Since the Gaussian kernel uses the [[Euclidean distance]] <math>\lVert x_i-x_j \rVert</math>, it is affected by the [[curse of dimensionality]], and in high dimensional data when distances lose the ability to discriminate, the <math>p_{ij}</math> become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the [[intrinsic dimension]] of each point, to alleviate this.<ref>{{Cite conference|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2017-10-04|title=Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection|conference=SISAP 2017 – 10th International Conference on Similarity Search and Applications|pages=188–203|doi=10.1007/978-3-319-68474-1_13}}</ref>
 
t-SNE aims to learn a <math>d</math>-dimensional map <math>\mathbf{y}_1, \dots, \mathbf{y}_N</math> (with <math>\mathbf{y}_i \in \mathbb{R}^d</math> and <math>d</math> typically chosen as 2 or 3) that reflects the similarities <math>p_{ij}</math> as well as possible. To this end, it measures similarities <math>q_{ij}</math> between two points in the map <math>\mathbf{y}_i</math> and <math>\mathbf{y}_j</math>, using a very similar approach.

For <math>i \neq j</math>, define
: <math>q_{ij} = \frac{(1 + \lVert \mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert \mathbf{y}_k - \mathbf{y}_l\rVert^2)^{-1}}</math>
and set <math> q_{ii} = 0 </math>.
Herein a heavy-tailed [[Student t-distribution]] (with one degree of freedom, which is the same as a [[Cauchy distribution]]) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
 
The locations of the points <math>\mathbf{y}_i</math> in the map are determined by minimizing the (non-symmetric) [[Kullback–Leibler divergence]] of the distribution <math>P</math> from the distribution <math>Q</math>, that is:
: <math>\mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math>
The minimization of the Kullback–Leibler divergence with respect to the points <math>\mathbf{y}_i</math> is performed using [[gradient descent]].
The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
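
As an illustration, a bare-bones version of this stage can be written directly from the gradient given by van der Maaten and Hinton, <math>\tfrac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)(1 + \lVert \mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}</math>;<ref name=MaatenHinton/> real implementations add refinements such as momentum, early exaggeration, and Barnes-Hut approximation that this sketch omits:

<syntaxhighlight lang="python">
import numpy as np

def tsne_map(P, d=2, n_iter=500, learning_rate=100.0, seed=0):
    """Plain gradient descent on KL(P || Q) (illustrative sketch)."""
    N = P.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(N, d))   # small random initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]  # pairwise y_i - y_j, shape (N, N, d)
        inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))  # Student-t kernels
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                   # low-dimensional similarities q_ij
        # Gradient of the KL divergence with respect to each y_i.
        grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
        Y -= learning_rate * grad
    return Y
</syntaxhighlight>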
 
== Output ==
While t-SNE plots often seem to display [[cluster analysis|clusters]], the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering,<ref>{{Cite web |title=K-means clustering on the output of t-SNE |url=https://stats.stackexchange.com/a/264647 |access-date=2018-04-16 |website=Cross Validated}}</ref> and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters.<ref>{{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=http://distill.pub/2016/misread-tsne |journal=Distill |language=en |volume=1 |issue=10 |pages=e2 |doi=10.23915/distill.00002 |issn=2476-0757 |doi-access=free}}</ref> Thus, interactive exploration may be needed to choose parameters and validate results.<ref>{{Cite journal |last1=Pezzotti |first1=Nicola |last2=Lelieveldt |first2=Boudewijn P. F. |last3=Maaten |first3=Laurens van der |last4=Hollt |first4=Thomas |last5=Eisemann |first5=Elmar |last6=Vilanova |first6=Anna |date=2017-07-01 |title=Approximated and User Steerable tSNE for Progressive Visual Analytics |journal=IEEE Transactions on Visualization and Computer Graphics |language=en-US |volume=23 |issue=7 |pages=1739–1752 |arxiv=1512.01655 |doi=10.1109/tvcg.2016.2570755 |issn=1077-2626 |pmid=28113434 |s2cid=353336}}</ref><ref>{{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=https://distill.pub/2016/misread-tsne/ |journal=Distill |language=en |volume=1 |issue=10 |doi=10.23915/distill.00002 |doi-access=free}}</ref> It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of [[spectral clustering]].<ref>{{cite arXiv |eprint=1706.02582 |class=cs.LG |first1=George C. |last1=Linderman |first2=Stefan |last2=Steinerberger |title=Clustering with t-SNE, provably |date=2017-06-08}}</ref>
 
== Software ==
* A C++ implementation of Barnes-Hut t-SNE is available on the [https://github.com/lvdmaaten/bhtsne GitHub account] of one of the original authors.
* The R package [https://CRAN.R-project.org/package=Rtsne Rtsne] implements t-SNE in [[R (programming language)|R]].
* [[ELKI]] contains t-SNE, also with Barnes-Hut approximation.
* [[scikit-learn]], a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation; see the usage example after this list.
* TensorBoard, the visualization kit associated with [[TensorFlow]], also implements t-SNE ([https://projector.tensorflow.org/ online version]).
* The [[Julia (programming language)|Julia]] package [https://juliapackages.com/p/tsne TSne] implements t-SNE
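
For example, the scikit-learn implementation listed above can be used as follows (a minimal sketch; the dataset and parameter values are arbitrary choices for illustration):

<syntaxhighlight lang="python">
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 1797 64-dimensional points
embedding = TSNE(n_components=2, perplexity=30.0,
                 init="pca").fit_transform(X)  # map to two dimensions
print(embedding.shape)                         # (1797, 2)
</syntaxhighlight>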
 
== References ==
{{Reflist}}
 
== External links ==
* {{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=https://distill.pub/2016/misread-tsne/ |journal=Distill |language=en |volume=1 |issue=10 |pages=e2 |doi=10.23915/distill.00002 |issn=2476-0757|doi-access=free }}. Interactive demonstration and tutorial.
* [https://www.youtube.com/watch?v=RJVL80Gg3lA Visualizing Data Using t-SNE], Google Tech Talk about t-SNE
* [https://lvdmaaten.github.io/tsne/ Implementations of t-SNE in various languages], A link collection maintained by Laurens van der Maaten