In the classification phase, ''k'' is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the ''k'' training samples nearest to that query point.
[[File:KNN decision surface animation.gif|thumb|alt=kNN decision surface|Application of a ''k''-NN classifier considering ''k'' = 3 neighbors. Left: given the test point "?", the algorithm seeks the 3 closest points in the training set and adopts the majority vote to classify it as "class red". Right: by iteratively repeating the prediction over the whole feature space (X1, X2), one can depict the "decision surface".]]
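A minimal sketch of this classification step in Python (the function <code>knn_classify</code> and its signature are illustrative examples, not part of any standard library):

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those k neighbors
    # (ties are broken arbitrarily by Counter)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: four labeled training points in a 2-D feature space
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array(["red", "red", "blue", "blue"])
print(knn_classify(X, y, np.array([0.2, 0.1]), k=3))  # -> "red"
</syntaxhighlight>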
A commonly used distance metric for [[continuous variable]]s is [[Euclidean distance]]. For discrete variables, such as for text classification, another metric can be used, such as the '''overlap metric''' (or [[Hamming distance]]). In the context of gene expression microarray data, for example, ''k''-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.<ref>{{cite journal |last1=Jaskowiak |first1=Pablo A. |last2=Campello |first2=Ricardo J. G. B. |title=Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data |journal=Brazilian Symposium on Bioinformatics (BSB 2011) |year=2011 |pages=1–8 |citeseerx=10.1.1.208.993 }}</ref> Often, the classification accuracy of ''k''-NN can be improved significantly if the distance metric is learned with specialized algorithms such as [[Large Margin Nearest Neighbor]] or [[Neighbourhood components analysis]].
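As a sketch, the overlap metric and a Pearson-correlation dissimilarity of the kind used for gene expression data might be written as follows (the function names are chosen for this example):

<syntaxhighlight lang="python">
import numpy as np

def overlap_distance(a, b):
    # Overlap metric (Hamming distance): the number of positions
    # at which two discrete vectors differ
    return sum(x != y for x, y in zip(a, b))

def pearson_distance(a, b):
    # Correlation-based dissimilarity: perfectly correlated
    # profiles (r = 1) have distance 0, anticorrelated ones (r = -1)
    # have distance 2
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 - r
</syntaxhighlight>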