[[File:T-SNE visualisation of word embeddings generated using 19th century literature.png|thumb|T-SNE visualisation of [[word embedding]]s generated using 19th century literature]]
[[File:T-SNE Embedding of MNIST.png|thumb|T-SNE embeddings of [[MNIST]] dataset]]
{{lowercase title}}
{{Data Visualization}}
'''t-distributed stochastic neighbor embedding''' ('''t-SNE''') is a [[statistical]] method for visualizing high-dimensional data by giving each datapoint a ___location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by [[Geoffrey Hinton]] and Sam Roweis,<ref name=SNE>{{cite conference|author1-last=Hinton|author1-first=Geoffrey| author2-last=Roweis|author2-first=Sam|conference=[[Neural Information Processing Systems]]|title=Stochastic neighbor embedding|date= January 2002 |url=https://cs.nyu.edu/~roweis/papers/sne_final.pdf}}</ref> where Laurens van der Maaten and Hinton proposed the [[Student's t-distribution|''t''-distributed]] variant.<ref name=MaatenHinton>{{cite journal|last=van der Maaten|first=L.J.P.|author2=Hinton, G.E. |title=Visualizing Data Using t-SNE|journal=Journal of Machine Learning Research |volume=9|date=Nov 2008|pages=2579–2605|url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> It is a [[nonlinear dimensionality reduction]] technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
 
== Details ==
Given a set of <math>N</math> high-dimensional objects <math>\mathbf{x}_1, \dots, \mathbf{x}_N</math>, t-SNE first computes probabilities <math>p_{ij}</math> that are proportional to the similarity of objects <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>, as follows.
 
For <math>i \neq j</math>, define
: <math>p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}</math>
and set <math>p_{i\mid i} = 0</math>.
Note the above denominator ensures <math>\sum_j p_{j\mid i} = 1</math> for all <math>i</math>.
 
As van der Maaten and Hinton explained: "The similarity of datapoint <math>x_j</math> to datapoint <math>x_i</math> is the conditional probability, <math>p_{j|i}</math>, that <math>x_i</math> would pick <math>x_j</math> as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at <math>x_i</math>."<ref name=MaatenHinton/>
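As a concrete illustration, the conditional probabilities can be computed directly from the definition above. The following is a minimal pure-Python sketch, not code from the paper; the function name is illustrative, and the bandwidth <math>\sigma_i</math> is passed in as a fixed value, whereas real implementations tune it per point as described below.

```python
import math

def conditional_p(x, i, sigma_i):
    """Return the row p_{j|i} for datapoint i.

    x       : list of points, each a sequence of coordinates
    sigma_i : Gaussian bandwidth for point i (fixed here for simplicity)
    """
    w = []
    for j, xj in enumerate(x):
        if j == i:
            w.append(0.0)  # p_{i|i} = 0 by definition
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(x[i], xj))
            # unnormalized Gaussian affinity exp(-||x_i - x_j||^2 / 2 sigma_i^2)
            w.append(math.exp(-d2 / (2.0 * sigma_i ** 2)))
    z = sum(w)  # the denominator, which makes the row sum to 1
    return [wi / z for wi in w]
```

With three toy points, the row sums to 1, the diagonal entry is 0, and nearer neighbors receive larger probabilities.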
Now define
: <math>p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}</math>
 
 
This symmetrization is motivated by the fact that, from the <math>N</math> samples, the marginal probabilities are estimated as <math>p_i = p_j = 1/N</math>, so the conditional probabilities can be written as <math>p_{i\mid j} = Np_{ij}</math> and <math>p_{j\mid i} = Np_{ji}</math>. Since <math>p_{ij} = p_{ji}</math>, averaging these two expressions yields the formula above.
 
Also note that <math>p_{ii} = 0 </math> and <math>\sum_{i, j} p_{ij} = 1</math>.
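The symmetrization step can be sketched likewise (illustrative helper name; the input is assumed to be a full <math>N \times N</math> matrix of conditional probabilities, one row per point):

```python
def joint_p(p_cond):
    """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / 2N.

    p_cond : N x N matrix whose row i is the conditional distribution p_{.|i}
    """
    n = len(p_cond)
    return [[(p_cond[i][j] + p_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]
```

The result is symmetric, has zeros on the diagonal, and sums to 1 over all pairs.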
 
The bandwidth of the [[Gaussian kernel]]s <math>\sigma_i</math> is set, by means of the [[bisection method]], in such a way that the perplexity of the conditional distribution equals a predefined value. As a result, the bandwidth is adapted to the [[density]] of the data: smaller values of <math>\sigma_i</math> are used in denser parts of the data space. The perplexity of the conditional distribution <math>P_i</math> is defined as

: <math>\operatorname{Perp}(P_i) = 2^{H(P_i)},</math>

where <math>H(P_i)</math> is the [[Entropy (information theory)|Shannon entropy]]

: <math>H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i},</math>

so the perplexity increases monotonically with the entropy.
 
The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50."<ref name=MaatenHinton/>
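The bisection search for the bandwidths can be sketched as follows. This is a simplified illustration, not the reference implementation: the search bounds, iteration count, and helper names are arbitrary choices, and the input row is assumed to hold the squared distances from point <math>i</math> to every ''other'' point.

```python
import math

def perplexity(p_row):
    """Perplexity 2^H(P_i) of one conditional distribution (entropy in bits)."""
    h = -sum(p * math.log2(p) for p in p_row if p > 0.0)
    return 2.0 ** h

def find_sigma(d2_row, target, lo=1e-4, hi=1e4, iters=60):
    """Bisect on sigma_i until the row's perplexity matches `target`.

    d2_row : squared Euclidean distances from point i to every other point
    Perplexity grows monotonically with sigma, which makes bisection valid.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = [math.exp(-d / (2.0 * mid ** 2)) for d in d2_row]
        z = sum(w)
        if perplexity([wi / z for wi in w]) > target:
            hi = mid  # distribution too flat: shrink the bandwidth
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

A larger <math>\sigma_i</math> flattens the conditional distribution and raises its perplexity, which is why the upper bound is lowered when the perplexity overshoots the target.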
 
Since the Gaussian kernel uses the Euclidean distance <math>\lVert x_i-x_j \rVert</math>, it is affected by the [[curse of dimensionality]]: in high-dimensional data, where distances lose their ability to discriminate, the <math>p_{ij}</math> become too similar (asymptotically, they would converge to a constant). Adjusting the distances with a power transform, based on the [[intrinsic dimension]] of each point, has been proposed to alleviate this.<ref>{{Cite conference|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2017-10-04|title=Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection|conference=SISAP 2017 – 10th International Conference on Similarity Search and Applications|pages=188–203|doi=10.1007/978-3-319-68474-1_13}}</ref>
t-SNE aims to learn a <math>d</math>-dimensional map <math>\mathbf{y}_1, \dots, \mathbf{y}_N</math> (with <math>\mathbf{y}_i \in \mathbb{R}^d</math> and <math>d</math> typically chosen as 2 or 3) that reflects the similarities <math>p_{ij}</math> as well as possible. To this end, it measures similarities <math>q_{ij}</math> between two points in the map, <math>\mathbf{y}_i</math> and <math>\mathbf{y}_j</math>, using a very similar approach. Specifically, for <math>i \neq j</math>, define

: <math>q_{ij} = \frac{(1 + \lVert\mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \lVert\mathbf{y}_k - \mathbf{y}_l\rVert^2)^{-1}}</math>
 
and set <math> q_{ii} = 0 </math>.
Herein a heavy-tailed [[Student t-distribution|Student ''t''-distribution]] (with one degree of freedom, which is the same as a [[Cauchy distribution]]) is used to measure similarities between low-dimensional points, in order to allow dissimilar objects to be modeled far apart in the map.
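The low-dimensional similarities under the Student ''t'' kernel can be sketched as follows (illustrative helper name; a practical implementation would vectorize the double loop):

```python
def joint_q(y):
    """Low-dimensional similarities q_ij with a Student t (Cauchy) kernel."""
    n = len(y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(y[i], y[j]))
                w[i][j] = 1.0 / (1.0 + d2)  # heavy-tailed affinity (1 + d^2)^-1
    z = sum(sum(row) for row in w)  # normalize over all pairs k != l
    return [[wij / z for wij in row] for row in w]
```

Because the kernel decays polynomially rather than exponentially, moderately large map distances still receive non-negligible similarity, which is what lets dissimilar objects sit far apart without incurring a huge penalty.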
 
The locations of the points <math>\mathbf{y}_i</math> in the map are determined by minimizing the (non-symmetric) [[Kullback–Leibler divergence]] of the distribution <math>P</math> from the distribution <math>Q</math>, that is: