In the classification phase, ''k'' is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the ''k'' training samples nearest to that query point.
[[File:KNN decision surface animation.gif|thumb|alt=kNN decision surface|Application of a ''k''-NN classifier considering ''k'' = 3 neighbors. Left: given the test point "?", the algorithm seeks the 3 closest points in the training set and adopts the majority vote to classify it as "class red". Right: by iteratively repeating the prediction over the whole feature space (X1, X2), one can depict the "decision surface".]]
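A minimal sketch of this classification step in Python (the function <code>knn_classify</code> and its signature are illustrative examples, not part of any standard library):

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those k neighbors
    # (ties are broken arbitrarily by Counter)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: four labeled training points in a 2-D feature space
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array(["red", "red", "blue", "blue"])
print(knn_classify(X, y, np.array([0.2, 0.1]), k=3))  # -> "red"
</syntaxhighlight>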
A commonly used distance metric for [[continuous variable]]s is [[Euclidean distance]]. For discrete variables, such as for text classification, another metric can be used, such as the '''overlap metric''' (or [[Hamming distance]]). In the context of gene expression microarray data, for example, ''k''-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.<ref>{{cite journal |last1=Jaskowiak |first1=Pablo A. |last2=Campello |first2=Ricardo J. G. B. |title=Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data |journal=Brazilian Symposium on Bioinformatics (BSB 2011) |year=2011 |pages=1–8 |citeseerx=10.1.1.208.993 }}</ref> Often, the classification accuracy of ''k''-NN can be improved significantly if the distance metric is learned with specialized algorithms such as [[Large Margin Nearest Neighbor]] or [[Neighbourhood components analysis]].
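As a sketch, the overlap metric and a Pearson-correlation dissimilarity of the kind used for gene expression data might be written as follows (the function names are chosen for this example):

<syntaxhighlight lang="python">
import numpy as np

def overlap_distance(a, b):
    # Overlap metric (Hamming distance): the number of positions
    # at which two discrete vectors differ
    return sum(x != y for x, y in zip(a, b))

def pearson_distance(a, b):
    # Correlation-based dissimilarity: perfectly correlated
    # profiles (r = 1) have distance 0, anticorrelated ones (r = -1)
    # have distance 2
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 - r
</syntaxhighlight>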