T-distributed stochastic neighbor embedding

'''t-distributed stochastic neighbor embedding''' ('''t-SNE''') is a [[statistical]] method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by [[Geoffrey Hinton]] and Sam Roweis,<ref name=SNE>{{cite conference|author1-last=Hinton|author1-first=Geoffrey| author2-last=Roweis|author2-first=Sam|conference=[[Neural Information Processing Systems]]|title=Stochastic neighbor embedding|date= January 2002 |url=https://cs.nyu.edu/~roweis/papers/sne_final.pdf}}</ref> for which Laurens van der Maaten proposed the [[Student's t-distribution|''t''-distributed]] variant.<ref name=MaatenHinton>{{cite journal|last=van der Maaten|first=L.J.P.|author2=Hinton, G.E. |title=Visualizing Data Using t-SNE|journal=Journal of Machine Learning Research |volume=9|date=Nov 2008|pages=2579–2605|url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}</ref> It is a [[nonlinear dimensionality reduction]] technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
 
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a [[probability distribution]] over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar objects are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the [[Kullback–Leibler divergence]] (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the [[Euclidean distance]] between objects as the base of its similarity metric, this can be changed as appropriate. A [[Riemannian metric|Riemannian]] variant is [[Uniform manifold approximation and projection|UMAP]].
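
The two stages can be illustrated with a minimal sketch in plain Python. It is not a full implementation: it only builds the two pairwise distributions and evaluates the KL divergence for a given map, without the gradient-descent step that moves the map points. The function names, the toy data, and the single fixed Gaussian bandwidth <code>sigma</code> are assumptions made for illustration; practical implementations choose a separate bandwidth per point from a user-supplied perplexity.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_affinities(points, sigma=1.0):
    """Stage 1: joint probabilities p_ij from Gaussian kernels.

    Assumption: one fixed bandwidth `sigma` for all points (real
    implementations tune a per-point bandwidth to match a perplexity).
    """
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])  # conditional p_{j|i}
    # Symmetrize so that p_ij = p_ji and all entries sum to 1.
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]


def low_dim_affinities(embedding):
    """Stage 2: joint probabilities q_ij from a Student-t kernel (1 d.o.f.)."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all pairs; terms with p_ij = 0 contribute 0."""
    return sum(
        pij * math.log(pij / max(qij, eps))
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0.0
    )


# Toy data (hypothetical): two tight clusters in three dimensions.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (3.0, 3.0, 3.0), (3.1, 3.0, 2.9)]
# A map that keeps each cluster together, and one that splits both clusters.
good_map = [(-1.0, 0.0), (-0.9, 0.0), (1.0, 0.0), (1.1, 0.0)]
bad_map = [(-1.0, 0.0), (1.0, 0.0), (-0.9, 0.0), (1.1, 0.0)]

P = high_dim_affinities(points)
Q_good = low_dim_affinities(good_map)
Q_bad = low_dim_affinities(bad_map)
kl_good = kl_divergence(P, Q_good)
kl_bad = kl_divergence(P, Q_bad)
print("KL for cluster-preserving map:", kl_good)
print("KL for cluster-splitting map:", kl_bad)
```

The map that keeps similar objects close yields the lower divergence, which is the quantity t-SNE reduces when it adjusts the locations of the map points.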
 
t-SNE has been used for visualization in a wide range of applications, including [[genomics]], [[computer security]] research,<ref>{{cite journal|last=Gashi|first=I.|author2=Stankovic, V. |author3=Leita, C. |author4=Thonnard, O. |title=An Experimental Study of Diversity with Off-the-shelf AntiVirus Engines|journal=Proceedings of the IEEE International Symposium on Network Computing and Applications|year=2009|pages=4–11}}</ref> [[natural language processing]], [[music analysis]],<ref>{{cite journal|last=Hamel|first=P.|author2=Eck, D. |title=Learning Features from Music Audio with Deep Belief Networks|journal=Proceedings of the International Society for Music Information Retrieval Conference|year=2010|pages=339–344}}</ref> [[cancer research]],<ref>{{cite journal|last=Jamieson|first=A.R.|author2=Giger, M.L. |author3=Drukker, K. |author4=Lui, H. |author5=Yuan, Y. |author6=Bhooshan, N. |title=Exploring Nonlinear Feature Space Dimension Reduction and Data Representation in Breast CADx with Laplacian Eigenmaps and t-SNE|journal=Medical Physics |issue=1|year=2010|pages=339–351|doi=10.1118/1.3267037|pmid=20175497|volume=37|pmc=2807447}}</ref> [[bioinformatics]],<ref>{{cite journal|last=Wallach|first=I.|author2=Liliean, R. |title=The Protein-Small-Molecule Database, A Non-Redundant Structural Resource for the Analysis of Protein-Ligand Binding|journal=Bioinformatics |year=2009|pages=615–620|doi=10.1093/bioinformatics/btp035|volume=25|issue=5|pmid=19153135|doi-access=free}}</ref> geological domain interpretation,<ref>{{Cite journal|date=2019-04-01|title=A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data|url=https://www.sciencedirect.com/science/article/pii/S0098300418306010|journal=Computers & Geosciences|language=en|volume=125|pages=78–89|doi=10.1016/j.cageo.2019.01.011|issn=0098-3004|last1=Balamurali|first1=Mehala|last2=Silversides|first2=Katherine L.|last3=Melkumyan|first3=Arman|bibcode=2019CG....125...78B |s2cid=67926902}}</ref><ref>{{Cite journal|last1=Balamurali|first1=Mehala|last2=Melkumyan|first2=Arman|date=2016|editor-last=Hirose|editor-first=Akira|editor2-last=Ozawa|editor2-first=Seiichi|editor3-last=Doya|editor3-first=Kenji|editor4-last=Ikeda|editor4-first=Kazushi|editor5-last=Lee|editor5-first=Minho|editor6-last=Liu|editor6-first=Derong|title=t-SNE Based Visualisation and Clustering of Geological Domain|url=https://link.springer.com/chapter/10.1007/978-3-319-46681-1_67|journal=Neural Information Processing|series=Lecture Notes in Computer Science|volume=9950|language=en|location=Cham|publisher=Springer International Publishing|pages=565–572|doi=10.1007/978-3-319-46681-1_67|isbn=978-3-319-46681-1}}</ref><ref>{{Cite journal|last1=Leung|first1=Raymond|last2=Balamurali|first2=Mehala|last3=Melkumyan|first3=Arman|date=2021-01-01|title=Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering|url=https://doi.org/10.1007/s11004-019-09839-z|journal=Mathematical Geosciences|language=en|volume=53|issue=1|pages=105–130|doi=10.1007/s11004-019-09839-z|bibcode=2021MaGeo..53..105L |s2cid=208329378|issn=1874-8953}}</ref> and biomedical signal processing.<ref>{{Cite book|last1=Birjandtalab|first1=J.|last2=Pouyan|first2=M. B.|last3=Nourani|first3=M.|title=2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI) |chapter=Nonlinear dimension reduction for EEG-based epileptic seizure detection |date=2016-02-01|pages=595–598|doi=10.1109/BHI.2016.7455968|isbn=978-1-5090-2455-1|s2cid=8074617}}</ref>