[[File:T-SNE visualisation of word embeddings generated using 19th century literature.png|thumb|T-SNE visualisation of [[word embedding]]s generated using 19th century literature]]
[[File:T-SNE Embedding of MNIST.png|thumb|T-SNE embeddings of [[MNIST]] dataset]]
{{lowercase title}}
{{Data Visualization}}
'''t-distributed stochastic neighbor embedding''' ('''t-SNE''') is a [[statistical]] method for visualizing high-dimensional data by giving each datapoint a ___location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by [[Geoffrey Hinton]] and Sam Roweis,<ref name="SNE">{{cite conference |author1-last=Hinton |author1-first=Geoffrey |author2-last=Roweis |author2-first=Sam |date=January 2002 |title=Stochastic neighbor embedding |url=https://papers.nips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf |conference=[[Neural Information Processing Systems]]}}</ref> where Laurens van der Maaten and Hinton proposed the [[Student's t-distribution|''t''-distributed]] variant.<ref name=MaatenHinton>{{cite journal |last=van der Maaten |first=L.J.P. |author2=Hinton, G.E. |title=Visualizing Data Using t-SNE |journal=Journal of Machine Learning Research |volume=9 |date=Nov 2008 |pages=2579–2605 |url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> It is a [[nonlinear dimensionality reduction]] technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
 
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a [[probability distribution]] over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the [[Kullback–Leibler divergence]] (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the [[Euclidean distance]] between objects as the base of its similarity metric, this can be changed as appropriate. A [[Riemannian metric|Riemannian]] variant is [[Uniform manifold approximation and projection|UMAP]].
 
t-SNE has been used for visualization in a wide range of applications, including [[genomics]], [[computer security]] research,<ref>{{cite journal|last=Gashi|first=I.|author2=Stankovic, V. |author3=Leita, C. |author4=Thonnard, O. |title=An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines|journal=Proceedings of the IEEE International Symposium on Network Computing and Applications|year=2009|pages=4–11}}</ref> [[natural language processing]], [[music analysis]],<ref>{{cite journal|last=Hamel|first=P.|author2=Eck, D. |title=Learning Features from Music Audio with Deep Belief Networks|journal=Proceedings of the International Society for Music Information Retrieval Conference|year=2010|pages=339–344}}</ref> [[cancer research]],<ref>{{cite journal|last=Jamieson|first=A.R.|author2=Giger, M.L. |author3=Drukker, K. |author4=Lui, H. |author5=Yuan, Y. |author6=Bhooshan, N. |title=Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE|journal=Medical Physics |issue=1|year=2010|pages=339–351|doi=10.1118/1.3267037|pmid=20175497|volume=37|pmc=2807447}}</ref> [[bioinformatics]],<ref>{{cite journal|last=Wallach|first=I.|author2=Liliean, R. |title=The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding|journal=Bioinformatics |year=2009|pages=615–620|doi=10.1093/bioinformatics/btp035|volume=25|issue=5|pmid=19153135|doi-access=free}}</ref> geological ___domain interpretation,<ref>{{Cite journal|date=2019-04-01|title=A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data|url=https://www.sciencedirect.com/science/article/pii/S0098300418306010|journal=Computers & Geosciences|language=en|volume=125|pages=78–89|doi=10.1016/j.cageo.2019.01.011|issn=0098-3004|last1=Balamurali|first1=Mehala|last2=Silversides|first2=Katherine L.|last3=Melkumyan|first3=Arman|bibcode=2019CG....125...78B |s2cid=67926902|url-access=subscription}}</ref><ref>{{Cite book|last1=Balamurali|first1=Mehala|last2=Melkumyan|first2=Arman|date=2016|editor-last=Hirose|editor-first=Akira|editor2-last=Ozawa|editor2-first=Seiichi|editor3-last=Doya|editor3-first=Kenji|editor4-last=Ikeda|editor4-first=Kazushi|editor5-last=Lee|editor5-first=Minho|editor6-last=Liu|editor6-first=Derong|chapter=t-SNE Based Visualisation and Clustering of Geological Domain|chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-46681-1_67|title=Neural Information Processing|series=Lecture Notes in Computer Science|volume=9950|language=en|___location=Cham|publisher=Springer International Publishing|pages=565–572|doi=10.1007/978-3-319-46681-1_67|isbn=978-3-319-46681-1}}</ref><ref>{{Cite journal|last1=Leung|first1=Raymond|last2=Balamurali|first2=Mehala|last3=Melkumyan|first3=Arman|date=2021-01-01|title=Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering|url=https://doi.org/10.1007/s11004-019-09839-z|journal=Mathematical Geosciences|language=en|volume=53|issue=1|pages=105–130|doi=10.1007/s11004-019-09839-z|bibcode=2021MatGe..53..105L |s2cid=208329378|issn=1874-8953|url-access=subscription}}</ref> and biomedical signal processing.<ref>{{Cite book|last1=Birjandtalab|first1=J.|last2=Pouyan|first2=M. B.|last3=Nourani|first3=M.|title=2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI)|chapter=Nonlinear dimension reduction for EEG-based epileptic seizure detection|date=2016-02-01|pages=595–598|doi=10.1109/BHI.2016.7455968|isbn=978-1-5090-2455-1|s2cid=8074617}}</ref>
 
For a data set with ''n'' elements, t-SNE runs in {{math|O(''n''<sup>2</sup>)}} time and requires {{math|O(''n''<sup>2</sup>)}} space.<ref>{{cite arXiv|title=Approximated and User Steerable tSNE for Progressive Visual Analytics|last=Pezzotti|first=Nicola|date=2015 |class=cs.CV |eprint=1512.01655 }}</ref>
 
== Details ==
Given a set of <math>N</math> high-dimensional objects <math>\mathbf{x}_1, \dots, \mathbf{x}_N</math>, t-SNE first computes probabilities <math>p_{ij}</math> that are proportional to the similarity of objects <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>, as follows.
 
For <math>i \neq j</math>, define
: <math>p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}</math>
and set <math>p_{i\mid i} = 0</math>.
Note that the above denominator ensures <math>\sum_j p_{j\mid i} = 1</math> for all <math>i</math>.
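
The following sketch (in Python with [[NumPy]]; the function name and variables are illustrative, not taken from any published implementation) computes one row of conditional probabilities for a fixed bandwidth <math>\sigma_i</math>:

<syntaxhighlight lang="python">
import numpy as np

def conditional_probabilities(X, i, sigma_i):
    """Row p_{.|i} of conditional similarities for point i (illustrative sketch)."""
    # Squared Euclidean distances from x_i to every point.
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    # Unnormalized Gaussian affinities; p_{i|i} is defined to be zero.
    affinities = np.exp(-d2 / (2.0 * sigma_i ** 2))
    affinities[i] = 0.0
    # The denominator in the formula above makes the row sum to one.
    return affinities / affinities.sum()
</syntaxhighlight>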
 
As van der Maaten and Hinton explained: "The similarity of datapoint <math>x_j</math> to datapoint <math>x_i</math> is the conditional probability, <math>p_{j|i}</math>, that <math>x_i</math> would pick <math>x_j</math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at <math>x_i</math>."<ref name=MaatenHinton/>
 
Now define
: <math>p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}</math>
 
This is motivated as follows: estimating the marginals <math>p_i</math> and <math>p_j</math> from the <math>N</math> samples as <math>1/N</math>, the conditionals of a joint distribution satisfy <math>p_{j\mid i} = Np_{ij}</math> and <math>p_{i\mid j} = Np_{ji}</math>. Since a symmetric joint has <math>p_{ij} = p_{ji}</math>, substituting these into the right-hand side above recovers <math>p_{ij}</math>, which motivates the formula.
 
Also note that <math>p_{ii} = 0 </math> and <math>\sum_{i, j} p_{ij} = 1</math>.
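
Continuing the sketch above, the symmetrized affinities can be assembled as follows (again illustrative, assuming per-point bandwidths <code>sigmas</code> have already been chosen):

<syntaxhighlight lang="python">
import numpy as np

def joint_probabilities(X, sigmas):
    """Symmetrized affinities p_ij from per-point bandwidths (illustrative sketch)."""
    N = X.shape[0]
    P = np.zeros((N, N))
    for i in range(N):
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        row = np.exp(-d2 / (2.0 * sigmas[i] ** 2))
        row[i] = 0.0
        P[i] = row / row.sum()      # conditional p_{j|i}
    P = (P + P.T) / (2.0 * N)       # p_ij = (p_{j|i} + p_{i|j}) / 2N
    return P                        # P sums to one and has a zero diagonal
</syntaxhighlight>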
 
The bandwidth of the [[Gaussian kernel]]s <math>\sigma_i</math> is set in such a way that the [[Entropy (information theory)|entropy]] of the conditional distribution equals a predefined entropy, using the [[bisection method]]. As a result, the bandwidth is adapted to the [[density]] of the data: smaller values of <math>\sigma_i</math> are used in denser parts of the data space. The entropy increases with the [[perplexity]] of this distribution <math>P_i</math>; this relation is seen as
 
: <math>Perp(P_i) = 2^{H(P_i)}</math>
 
where <math>H(P_i)</math> is the Shannon entropy <math>H(P_i)=-\sum_jp_{j|i}\log_2p_{j|i}.</math>
 
The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."<ref name=MaatenHinton/>
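
A minimal sketch of this bandwidth search, assuming the <code>conditional_probabilities</code> helper from the sketch above and a user-chosen target perplexity, might look as follows:

<syntaxhighlight lang="python">
import numpy as np

def perplexity(p):
    """Perplexity 2^H(P_i) of one conditional distribution."""
    p = p[p > 0]                              # avoid log(0)
    return 2.0 ** (-np.sum(p * np.log2(p)))   # H is the Shannon entropy in bits

def find_sigma(X, i, target, lo=1e-10, hi=1e4, tol=1e-5, max_iter=64):
    """Bisect on sigma_i until the row perplexity matches the target."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = conditional_probabilities(X, i, mid)  # defined in the sketch above
        if perplexity(p) > target:
            hi = mid   # distribution too spread out: shrink the bandwidth
        else:
            lo = mid   # distribution too concentrated: widen the bandwidth
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
</syntaxhighlight>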
 
Since the Gaussian kernel uses the [[Euclidean distance]] <math>\lVert x_i-x_j \rVert</math>, it is affected by the [[curse of dimensionality]], and in high dimensional data when distances lose the ability to discriminate, the <math>p_{ij}</math> become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the [[intrinsic dimension]] of each point, to alleviate this.<ref>{{Cite conference|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2017-10-04|title=Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection|conference=SISAP 2017 – 10th International Conference on Similarity Search and Applications|pages=188–203|doi=10.1007/978-3-319-68474-1_13}}</ref>
 
t-SNE aims to learn a <math>d</math>-dimensional map <math>\mathbf{y}_1, \dots, \mathbf{y}_N</math> (with <math>\mathbf{y}_i \in \mathbb{R}^d</math> and <math>d</math> typically chosen as 2 or 3) that reflects the similarities <math>p_{ij}</math> as well as possible. To this end, it measures similarities <math>q_{ij}</math> between two points in the map <math>\mathbf{y}_i</math> and <math>\mathbf{y}_j</math>, using a very similar approach.

For <math>i \neq j</math>, define
: <math>q_{ij} = \frac{(1 + \lVert \mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert \mathbf{y}_k - \mathbf{y}_l\rVert^2)^{-1}}</math>
and set <math> q_{ii} = 0 </math>.
Herein a heavy-tailed [[Student t-distribution]] (with one degree of freedom, which is the same as a [[Cauchy distribution]]) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
 
The locations of the points <math>\mathbf{y}_i</math> in the map are determined by minimizing the (non-symmetric) [[Kullback–Leibler divergence]] of the distribution <math>P</math> from the distribution <math>Q</math>, that is:
: <math>\mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}</math>
The minimization of the Kullback–Leibler divergence with respect to the points <math>\mathbf{y}_i</math> is performed using [[gradient descent]].
The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
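
As an illustration, a bare-bones version of this stage can be written directly from the gradient given by van der Maaten and Hinton, <math>\tfrac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)(1 + \lVert \mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}</math>;<ref name=MaatenHinton/> real implementations add refinements such as momentum, early exaggeration, and Barnes-Hut approximation that this sketch omits:

<syntaxhighlight lang="python">
import numpy as np

def tsne_map(P, d=2, n_iter=500, learning_rate=100.0, seed=0):
    """Plain gradient descent on KL(P || Q) (illustrative sketch)."""
    N = P.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(N, d))   # small random initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]  # pairwise y_i - y_j, shape (N, N, d)
        inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))  # Student-t kernels
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                   # low-dimensional similarities q_ij
        # Gradient of the KL divergence with respect to each y_i.
        grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
        Y -= learning_rate * grad
    return Y
</syntaxhighlight>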
 
== Output ==
While t-SNE plots often seem to display [[cluster analysis|clusters]], the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering,<ref>{{Cite web |title=K-means clustering on the output of t-SNE |url=https://stats.stackexchange.com/a/264647 |access-date=2018-04-16 |website=Cross Validated}}</ref> and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters.<ref>{{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=http://distill.pub/2016/misread-tsne |journal=Distill |language=en |volume=1 |issue=10 |pages=e2 |doi=10.23915/distill.00002 |issn=2476-0757 |doi-access=free}}</ref> Thus, interactive exploration may be needed to choose parameters and validate results.<ref>{{Cite journal |last1=Pezzotti |first1=Nicola |last2=Lelieveldt |first2=Boudewijn P. F. |last3=Maaten |first3=Laurens van der |last4=Hollt |first4=Thomas |last5=Eisemann |first5=Elmar |last6=Vilanova |first6=Anna |date=2017-07-01 |title=Approximated and User Steerable tSNE for Progressive Visual Analytics |journal=IEEE Transactions on Visualization and Computer Graphics |language=en-US |volume=23 |issue=7 |pages=1739–1752 |arxiv=1512.01655 |doi=10.1109/tvcg.2016.2570755 |issn=1077-2626 |pmid=28113434 |s2cid=353336}}</ref><ref>{{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=https://distill.pub/2016/misread-tsne/ |journal=Distill |language=en |volume=1 |issue=10 |doi=10.23915/distill.00002 |doi-access=free}}</ref> It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of [[spectral clustering]].<ref>{{cite arXiv |eprint=1706.02582 |class=cs.LG |first1=George C. |last1=Linderman |first2=Stefan |last2=Steinerberger |title=Clustering with t-SNE, provably |date=2017-06-08}}</ref>
 
== Software ==
* A C++ implementation of Barnes-Hut t-SNE is available on the [https://github.com/lvdmaaten/bhtsne GitHub account] of one of the original authors.
* The R package [https://CRAN.R-project.org/package=Rtsne Rtsne] implements t-SNE in [[R (programming language)|R]].
* [[ELKI]] contains t-SNE, also with Barnes-Hut approximation.
* [[scikit-learn]], a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation; see the usage example after this list.
* TensorBoard, the visualization kit associated with [[TensorFlow]], also implements t-SNE ([https://projector.tensorflow.org/ online version]).
* The [[Julia (programming language)|Julia]] package [https://juliapackages.com/p/tsne TSne] implements t-SNE
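
For example, the scikit-learn implementation listed above can be used as follows (a minimal sketch; the dataset and parameter values are arbitrary choices for illustration):

<syntaxhighlight lang="python">
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 1797 64-dimensional points
embedding = TSNE(n_components=2, perplexity=30.0,
                 init="pca").fit_transform(X)  # map to two dimensions
print(embedding.shape)                         # (1797, 2)
</syntaxhighlight>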
 
== References ==
{{Reflist}}
 
== External links ==
* {{Cite journal |last1=Wattenberg |first1=Martin |last2=Viégas |first2=Fernanda |last3=Johnson |first3=Ian |date=2016-10-13 |title=How to Use t-SNE Effectively |url=https://distill.pub/2016/misread-tsne/ |journal=Distill |language=en |volume=1 |issue=10 |pages=e2 |doi=10.23915/distill.00002 |issn=2476-0757|doi-access=free }}. Interactive demonstration and tutorial.
* [https://www.youtube.com/watch?v=RJVL80Gg3lA Visualizing Data Using t-SNE], Google Tech Talk about t-SNE
* [https://lvdmaaten.github.io/tsne/ Implementations of t-SNE in various languages], A link collection maintained by Laurens van der Maaten