[[File:T-SNE visualisation of word embeddings generated using 19th century literature.png|thumb|T-SNE visualisation of [[word embedding]]s generated using 19th century literature]]
[[File:T-SNE Embedding of MNIST.png|thumb|T-SNE embeddings of [[MNIST]] dataset]]
{{lowercase title}}
{{Data Visualization}}
'''t-distributed stochastic neighbor embedding''' ('''t-SNE''') is a [[statistical]] method for visualizing high-dimensional data by giving each datapoint a ___location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by [[Geoffrey Hinton]] and Sam Roweis,<ref name=SNE>{{cite conference|author1-last=Hinton|author1-first=Geoffrey| author2-last=Roweis|author2-first=Sam|conference=[[Neural Information Processing Systems]]|title=Stochastic neighbor embedding|date= January 2002 |url=https://cs.nyu.edu/~roweis/papers/sne_final.pdf}}</ref> where Laurens van der Maaten and Hinton proposed the [[Student's t-distribution|''t''-distributed]] variant.<ref name=MaatenHinton>{{cite journal|last=van der Maaten|first=L.J.P.|author2=Hinton, G.E. |title=Visualizing Data Using t-SNE|journal=Journal of Machine Learning Research |volume=9|date=Nov 2008|pages=2579–2605|url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> It is a [[nonlinear dimensionality reduction]] technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
 
== Details ==
Given a set of <math>N</math> high-dimensional objects <math>\mathbf{x}_1, \dots, \mathbf{x}_N</math>, t-SNE first computes probabilities <math>p_{ij}</math> that are proportional to the similarity of objects <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>, as follows.
 
For <math>i \neq j</math>, define
: <math>p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}</math>
and set <math>p_{i\mid i} = 0</math>.
Note the above denominator ensures <math>\sum_j p_{j\mid i} = 1</math> for all <math>i</math>.
 
As van der Maaten and Hinton explained: "The similarity of datapoint <math>x_j</math> to datapoint <math>x_i</math> is the conditional probability, <math>p_{j|i}</math>, that <math>x_i</math> would pick <math>x_j</math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at <math>x_i</math>."<ref name=MaatenHinton/>
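As a concrete illustration, the conditional probabilities can be computed directly from the definition above. The following is a minimal pure-Python sketch, not code from the paper; the function name is illustrative, and the bandwidth <math>\sigma_i</math> is passed in as a fixed value, whereas real implementations tune it per point as described below.

```python
import math

def conditional_p(x, i, sigma_i):
    """Return the row p_{j|i} for datapoint i.

    x       : list of points, each a sequence of coordinates
    sigma_i : Gaussian bandwidth for point i (fixed here for simplicity)
    """
    w = []
    for j, xj in enumerate(x):
        if j == i:
            w.append(0.0)  # p_{i|i} = 0 by definition
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(x[i], xj))
            # unnormalized Gaussian affinity exp(-||x_i - x_j||^2 / 2 sigma_i^2)
            w.append(math.exp(-d2 / (2.0 * sigma_i ** 2)))
    z = sum(w)  # the denominator, which makes the row sum to 1
    return [wi / z for wi in w]
```

With three toy points, the row sums to 1, the diagonal entry is 0, and nearer neighbors receive larger probabilities.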
Now define
: <math>p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}</math>
 
 
This symmetrization is motivated by the fact that, from the <math>N</math> samples, the marginal probabilities are estimated as <math>p_i = p_j = 1/N</math>, so the conditional probabilities can be written as <math>p_{i\mid j} = Np_{ij}</math> and <math>p_{j\mid i} = Np_{ji}</math>. Since <math>p_{ij} = p_{ji}</math>, averaging these two expressions yields the formula above.
 
Also note that <math>p_{ii} = 0 </math> and <math>\sum_{i, j} p_{ij} = 1</math>.
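The symmetrization step can be sketched likewise (illustrative helper name; the input is assumed to be a full <math>N \times N</math> matrix of conditional probabilities, one row per point):

```python
def joint_p(p_cond):
    """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / 2N.

    p_cond : N x N matrix whose row i is the conditional distribution p_{.|i}
    """
    n = len(p_cond)
    return [[(p_cond[i][j] + p_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]
```

The result is symmetric, has zeros on the diagonal, and sums to 1 over all pairs.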
 
The bandwidth of the [[Gaussian kernel]]s <math>\sigma_i</math> is set, by means of the [[bisection method]], in such a way that the perplexity of the conditional distribution equals a predefined value. As a result, the bandwidth is adapted to the [[density]] of the data: smaller values of <math>\sigma_i</math> are used in denser parts of the data space. The perplexity of the conditional distribution <math>P_i</math> is defined as

: <math>\operatorname{Perp}(P_i) = 2^{H(P_i)},</math>

where <math>H(P_i)</math> is the [[Entropy (information theory)|Shannon entropy]]

: <math>H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i},</math>

so the perplexity increases monotonically with the entropy.
 
The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."<ref name=MaatenHinton/>
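The bisection search for the bandwidths can be sketched as follows. This is a simplified illustration, not the reference implementation: the search bounds, iteration count, and helper names are arbitrary choices, and the input row is assumed to hold the squared distances from point <math>i</math> to every ''other'' point.

```python
import math

def perplexity(p_row):
    """Perplexity 2^H(P_i) of one conditional distribution (entropy in bits)."""
    h = -sum(p * math.log2(p) for p in p_row if p > 0.0)
    return 2.0 ** h

def find_sigma(d2_row, target, lo=1e-4, hi=1e4, iters=60):
    """Bisect on sigma_i until the row's perplexity matches `target`.

    d2_row : squared Euclidean distances from point i to every other point
    Perplexity grows monotonically with sigma, which makes bisection valid.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = [math.exp(-d / (2.0 * mid ** 2)) for d in d2_row]
        z = sum(w)
        if perplexity([wi / z for wi in w]) > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

A larger <math>\sigma_i</math> flattens the conditional distribution and raises its perplexity, which is why the upper bound is lowered when the perplexity overshoots the target.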
 
Since the Gaussian kernel uses the Euclidean distance <math>\lVert x_i-x_j \rVert</math>, it is affected by the [[curse of dimensionality]]: in high-dimensional data, where distances lose their ability to discriminate, the <math>p_{ij}</math> become too similar (asymptotically, they would converge to a constant). Adjusting the distances with a power transform, based on the [[intrinsic dimension]] of each point, has been proposed to alleviate this.<ref>{{Cite conference|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2017-10-04|title=Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection|conference=SISAP 2017 – 10th International Conference on Similarity Search and Applications|pages=188–203|doi=10.1007/978-3-319-68474-1_13}}</ref>
t-SNE aims to learn a <math>d</math>-dimensional map <math>\mathbf{y}_1, \dots, \mathbf{y}_N</math> (with <math>\mathbf{y}_i \in \mathbb{R}^d</math> and <math>d</math> typically chosen as 2 or 3) that reflects the similarities <math>p_{ij}</math> as well as possible. To this end, it measures similarities <math>q_{ij}</math> between two points in the map, <math>\mathbf{y}_i</math> and <math>\mathbf{y}_j</math>, using a very similar approach. Specifically, for <math>i \neq j</math>, define

: <math>q_{ij} = \frac{(1 + \lVert\mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert\mathbf{y}_k - \mathbf{y}_l\rVert^2)^{-1}}</math>
 
and set <math> q_{ii} = 0 </math>.
Herein a heavy-tailed [[Student t-distribution|Student ''t''-distribution]] (with one degree of freedom, which is the same as a [[Cauchy distribution]]) is used to measure similarities between low-dimensional points, in order to allow dissimilar objects to be modeled far apart in the map.
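The low-dimensional similarities under the Student ''t'' kernel can be sketched as follows (illustrative helper name; a practical implementation would vectorize the double loop):

```python
def joint_q(y):
    """Low-dimensional similarities q_ij with a Student t (Cauchy) kernel."""
    n = len(y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(y[i], y[j]))
                w[i][j] = 1.0 / (1.0 + d2)  # heavy-tailed affinity (1 + d^2)^-1
    z = sum(sum(row) for row in w)  # normalize over all pairs k != l
    return [[wij / z for wij in row] for row in w]
```

Because the kernel decays polynomially rather than exponentially, moderately large map distances still receive non-negligible similarity, which is what lets dissimilar objects sit far apart without incurring a huge penalty.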
 
The locations of the points <math>\mathbf{y}_i</math> in the map are determined by minimizing the (non-symmetric) [[Kullback–Leibler divergence]] of the distribution <math>P</math> from the distribution <math>Q</math>, that is: