{{short description|Non-parametric classification method}}
{{DISPLAYTITLE:''k''-nearest neighbors algorithm}}
{{distinguish|Nearest neighbor search|Nearest neighbor interpolation|k-means clustering{{!}}''k''-means clustering}}
In [[statistics]], the '''''k''-nearest neighbors algorithm''' ('''''k''-NN''') is a [[Non-parametric statistics|non-parametric]] [[supervised learning]] method. It was first developed by [[Evelyn Fix]] and [[Joseph Lawson Hodges Jr.|Joseph Hodges]] in 1951,<ref>{{Cite report | last1=Fix | first1=Evelyn | last2=Hodges | first2=Joseph L. | title=Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties | issue=Report Number 4, Project Number 21-49-004 | year=1951 | url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | archive-url=https://web.archive.org/web/20200926212807/https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | url-status=live | archive-date=September 26, 2020 | publisher=USAF School of Aviation Medicine, Randolph Field, Texas}}</ref> and later expanded by [[Thomas M. Cover|Thomas Cover]].<ref name=":1" /> It is used for [[statistical classification|classification]] and [[regression analysis|regression]].

In ''k''-NN classification, the output is a class membership: an object is classified by a plurality vote of its neighbors, being assigned to the class most common among its ''k'' nearest neighbors (''k'' is a positive integer, typically small). If ''k'' = 1, the object is simply assigned to the class of its single nearest neighbor.

The ''k''-NN algorithm can also be generalized for [[regression analysis|regression]]. In ''{{mvar|k}}-NN regression'', also known as ''[[nearest neighbor smoothing]]'', the output is the property value for the object. This value is the average of the values of the ''k'' nearest neighbors. If ''k'' = 1, the output is simply the value of that single nearest neighbor, also known as ''[[nearest neighbor interpolation]]''.

In both cases, the input consists of the ''k'' closest training examples in a [[data set]]. The neighbors are taken from a set of objects for which the class (for ''k''-NN classification) or the object property value (for ''k''-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

''k''-NN is a type of [[instance-based learning]], or [[lazy learning]], in which the function is only approximated locally and all computation is deferred until function evaluation. Since the algorithm relies on distances between examples, if the features represent different physical units or come in vastly different scales, then [[Normalization (statistics)|normalizing]] the training data can improve its accuracy dramatically.<ref>{{Cite book|last=Hastie, Trevor.|title=The elements of statistical learning : data mining, inference, and prediction : with 200 full-color illustrations|date=2001|publisher=Springer|others=Tibshirani, Robert., Friedman, J. H. (Jerome H.)|isbn=0-387-95284-5|___location=New York|oclc=46809224}}</ref>

A peculiarity (sometimes even a disadvantage) of the ''k''-NN algorithm is its sensitivity to the local structure of the data.
==Statistical setting==
In the classification phase, ''k'' is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the ''k'' training samples nearest to that query point.
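In outline, this voting procedure can be sketched in a few lines of code. The following Python sketch (the data and variable names are purely illustrative) computes Euclidean distances from the query to every training example and takes a plurality vote among the ''k'' closest:

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    """Classify a query point by a plurality vote among its k nearest training examples."""
    distances = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance to every training example
    nearest = np.argsort(distances)[:k]                   # indices of the k closest examples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Illustrative data: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> red
</syntaxhighlight>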
[[File:KNN decision surface animation.gif|thumb|alt=kNN decision surface|Application of a ''k''-NN classifier with ''k'' = 3 neighbors. Left: given the test point "?", the algorithm finds the 3 closest points in the training set and classifies it by majority vote as "class red". Right: repeating the prediction over the whole feature space (X1, X2) depicts the "decision surface".]]
A commonly used distance metric for [[continuous variable]]s is [[Euclidean distance]]. For discrete variables, such as for text classification, another metric can be used, such as the '''overlap metric''' (or [[Hamming distance]]). In the context of gene expression microarray data, for example, ''k''-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.<ref>{{cite journal |last1=Jaskowiak |first1=Pablo A. |last2=Campello |first2=Ricardo J. G. B. |title=Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data |journal=Brazilian Symposium on Bioinformatics (BSB 2011) |year=2011 |pages=1–8 |citeseerx=10.1.1.208.993 }}</ref> Often, the classification accuracy of ''k''-NN can be improved significantly if the distance metric is learned with specialized algorithms such as [[Large Margin Nearest Neighbor]] or [[Neighbourhood components analysis]].
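For illustration, the two metrics mentioned above can be computed as follows (Python; the example vectors and strings are arbitrary):

<syntaxhighlight lang="python">
import numpy as np

# Euclidean distance for continuous features
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 5.0])
euclidean = np.linalg.norm(a - b)                 # sqrt(1 + 0 + 4) ≈ 2.236

# Overlap metric (Hamming distance) for discrete features,
# here two equal-length strings compared position by position
s, t = "karolin", "kathrin"
hamming = sum(c1 != c2 for c1, c2 in zip(s, t))   # 3 positions differ
</syntaxhighlight>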
[[File:Kmeans clustering WHR2023 data.gif|thumb|An animated visualization of [[k-means clustering|''k''-means clustering]] (a related but distinct, unsupervised method) with ''k'' = 3, grouping countries by life expectancy, GDP, and happiness. Click to view the animation.<ref>Helliwell, J. F., Layard, R., Sachs, J. D., Aknin, L. B., De Neve, J.-E., & Wang, S. (Eds.). (2023). World Happiness Report 2023 (11th ed.). Sustainable Development Solutions Network.</ref>]]
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the ''k'' nearest neighbors due to their large number.<ref name="Coomans_Massart1982">{{Cite journal | last1 = Coomans | first1 = Danny | last2 = Massart | first2 = Desire L. | year = 1982 | title = Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules | journal = Analytica Chimica Acta | volume = 136 | pages = 15–27 | doi = 10.1016/S0003-2670(01)95359-0 | bibcode = 1982AcAC..136...15C }}
</ref> One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its ''k'' nearest neighbors. The class (or value, in regression problems) of each of the ''k'' nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a [[self-organizing map]] (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. ''K''-NN can then be applied to the SOM.
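The distance-weighted vote described above can be sketched as follows (Python; the small constant <code>eps</code> is an implementation detail added here to avoid division by zero when a neighbour coincides with the query):

<syntaxhighlight lang="python">
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, query, k=3, eps=1e-12):
    """Vote with weights proportional to the inverse distance of each of the k nearest neighbours."""
    distances = np.linalg.norm(X_train - query, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(distances)[:k]:
        votes[y_train[i]] += 1.0 / (distances[i] + eps)   # closer neighbours contribute more
    return max(votes, key=votes.get)
</syntaxhighlight>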
==The 1-nearest neighbour classifier==
The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier that assigns a point {{mvar|x}} to the class of its closest neighbour in the feature space, that is <math>C_n^{1nn}(x) = Y_{(1)}</math>.
The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but it is computationally intensive for large training sets. Using an approximate [[nearest neighbor search]] algorithm makes ''k-''NN computationally tractable even for large data sets. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.
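Exact but sub-linear search structures such as [[k-d tree]]s, and approximate methods built on similar ideas, avoid computing every distance explicitly by pruning whole regions of the training set during a query. A sketch using SciPy's <code>cKDTree</code> (the data here are randomly generated, purely for illustration):

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))              # 10,000 stored examples in 3 dimensions

tree = cKDTree(X_train)                             # build the index once
distances, indices = tree.query(np.zeros(3), k=5)   # 5 nearest neighbours of the query point
</syntaxhighlight>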
''k-''NN has some strong [[consistency (statistics)|consistency]] results. As the amount of data approaches infinity, the two-class ''k-''NN algorithm is guaranteed to yield an error rate no worse than twice the [[Bayes error rate]] (the minimum achievable error rate given the distribution of the data).<ref name=":1">{{cite journal |url=http://ssg.mit.edu/cal/abs/2000_spring/np_dens/classification/cover67.pdf |doi=10.1109/TIT.1967.1053964 |author-link1=Thomas M. Cover |last1=Cover |first1=Thomas M. |author-link2=Peter E. Hart |last2=Hart |first2=Peter E. |title=Nearest neighbor pattern classification |journal=IEEE Transactions on Information Theory |year=1967 |volume=13 |issue=1 |pages=21–27 }}</ref>
For multi-class ''k-''NN classification, [[Thomas M. Cover|Cover]] and [[Peter E. Hart|Hart]] (1967) prove an upper bound error rate of
<math display="block">R^* \ \leq \ R_{k\mathrm{NN}} \ \leq \ R^* \left(2 - \frac{MR^*}{M-1}\right)</math>
where <math>R^*</math> is the Bayes error rate (which is the minimal error rate possible), <math>R_{kNN}</math> is the asymptotic ''k-''NN error rate, and {{mvar|M}} is the number of classes in the problem. This bound is tight in the sense that both the lower and upper bounds are achievable by some distribution.<ref>Devroye, L., Gyorfi, L. & Lugosi, G. A Probabilistic Theory of Pattern Recognition. Discrete Appl Math 73, 192–194 (1997).</ref> For <math>M=2</math> and as the Bayesian error rate <math>R^*</math> approaches zero, this limit reduces to "not more than twice the Bayesian error rate".
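For instance, with <math>M = 2</math> classes and a Bayes error rate of <math>R^* = 0.1</math>, the bound evaluates to
<math display="block">0.1 \ \leq \ R_{k\mathrm{NN}} \ \leq \ 0.1\left(2 - \frac{2 \cdot 0.1}{2-1}\right) = 0.18,</math>
so the asymptotic error lies between 0.1 and 0.18, slightly below twice the Bayes error rate of 0.2.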
==Error rates==
for some constants <math> B_1 </math> and <math> B_2 </math>.
The choice <math>k^* = \left\lfloor B n^{\frac 4 {d+4}} \right\rfloor</math> offers a trade-off between the two terms in the above display, for which the <math>k^*</math>-nearest neighbour error converges to the Bayes error at the optimal ([[minimax]]) rate <math>\mathcal{O}\left(n^{-\frac 4 {d+4}}\right)</math>.
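For example, taking <math>B = 1</math> for illustration, with dimension <math>d = 2</math> and <math>n = 10\,000</math> training points, this rule suggests <math>k^* = \left\lfloor 10\,000^{2/3} \right\rfloor = 464</math> neighbours, with the error converging at the rate <math>n^{-2/3}</math>.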
==Metric learning==
[[Feature extraction]] and dimension reduction can be combined in one step using [[Principal Component Analysis|principal component analysis]] (PCA), [[linear discriminant analysis]] (LDA), or [[Canonical correlation|canonical correlation analysis]] (CCA) techniques as a pre-processing step, followed by clustering by ''k''-NN on [[Feature (machine learning)|feature vectors]] in reduced-dimension space. This process is also called low-dimensional [[embedding]].<ref>{{citation |last1=Shaw |first1=Blake |last2=Jebara |first2=Tony |title=Structure preserving embedding |work=Proceedings of the 26th Annual International Conference on Machine Learning |year=2009 |pages=1–8 | publication-date=June 2009 |url=http://www.cs.columbia.edu/~jebara/papers/spe-icml09.pdf |doi=10.1145/1553374.1553494 |isbn=9781605585161 |s2cid=8522279 }}</ref>
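Such a pre-processing step is often arranged as a pipeline that first projects the data and then applies the ''k''-NN rule in the reduced space. A sketch with scikit-learn (the digits data set and the parameter values are chosen only for illustration):

<syntaxhighlight lang="python">
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                     # 64-dimensional images of handwritten digits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Project onto the first 15 principal components, then classify with 5-NN
model = make_pipeline(PCA(n_components=15), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))                          # accuracy on the held-out examples
</syntaxhighlight>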
For very-high-dimensional datasets (e.g. when performing a similarity search on live video streams, DNA data or high-dimensional [[time series]]), running a fast '''approximate''' ''k''-NN search using [[Locality Sensitive Hashing|locality sensitive hashing]], "random projections",<ref>{{cite book|doi=10.1145/502512.502546|chapter=Random projection in dimensionality reduction |title=Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '01 |year=2001 |last1=Bingham |first1=Ella |last2=Mannila |first2=Heikki |pages=245–250 |isbn=158113391X |s2cid=1854295 }}</ref> "sketches" or other high-dimensional similarity-search techniques may be the only feasible option.
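The random-projection idea can be illustrated by multiplying the data by a random Gaussian matrix before the neighbour search (a rough Python sketch; the dimensions are arbitrary and the query is taken from the data itself):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10_000))              # 1,000 very-high-dimensional vectors
R = rng.normal(size=(10_000, 50)) / np.sqrt(50)   # random Gaussian projection matrix
X_low = X @ R                                     # 50-dimensional "sketches" of the data

query_low = X[0] @ R                              # project a query (here, the first data point)
nearest = np.argsort(np.linalg.norm(X_low - query_low, axis=1))[:5]
</syntaxhighlight>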
== Decision boundary ==
Condensed nearest neighbor (CNN, the ''[[Peter E. Hart|Hart]] algorithm'') is an algorithm designed to reduce the data set for ''k''-NN classification.<ref>{{cite journal | last1 = Hart | first1 = Peter E. | author-link = Peter E. Hart | year = 1968 | title = The Condensed Nearest Neighbor Rule | journal = IEEE Transactions on Information Theory | volume = 18 | pages = 515–516 | doi = 10.1109/TIT.1968.1054155 }}</ref> It selects the set of prototypes ''U'' from the training data, such that 1NN with ''U'' can classify the examples almost as accurately as 1NN does with the whole data set.
[[File:BorderRAtio.PNG|thumb|130px|right|Calculation of the border ratio.]]
[[File:PointsTypes.png|thumb|130px|right|Three types of points: prototypes, class-outliers, and absorbed points.]]
Given a training set ''X'', CNN works iteratively:
# Scan all elements of ''X'', looking for an element ''x'' whose nearest prototype from ''U'' has a different label than ''x''.
# Remove ''x'' from ''X'' and add it to ''U''.
# Repeat the scan until no more prototypes are added to ''U''.

Use ''U'' instead of ''X'' for classification. The examples that are not prototypes are called "absorbed" points; a sketch of the iteration is given below.
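One common formulation of the condensed nearest neighbour iteration can be sketched as follows (Python; seeding ''U'' with a single arbitrary example and the data layout are illustrative choices):

<syntaxhighlight lang="python">
import numpy as np

def one_nn_label(prototypes_X, prototypes_y, x):
    """Label assigned to x by the 1-NN rule on the current prototype set."""
    return prototypes_y[np.argmin(np.linalg.norm(prototypes_X - x, axis=1))]

def condensed_nn(X, y):
    """Select prototype indices U so that 1-NN on U classifies the training set correctly."""
    U = [0]                                  # seed U with an arbitrary training example
    changed = True
    while changed:                           # repeat full scans until nothing is added
        changed = False
        for i in range(len(X)):
            if i in U:
                continue
            if one_nn_label(X[U], y[U], X[i]) != y[i]:   # misclassified by current prototypes
                U.append(i)
                changed = True
    return np.array(U)
</syntaxhighlight>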
It is efficient to scan the training examples in order of decreasing border ratio.<ref name="MirkesKnn">Mirkes, Evgeny M.; [http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html ''KNN and Potential Energy: applet''] {{Webarchive|url=https://web.archive.org/web/20120119201715/http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html |date=2012-01-19 }}, University of Leicester, 2011</ref> The border ratio of a training example ''x'' is defined as
{{block indent | em = 1.5 | text = {{math|1= ''a''(''x'') = {{sfrac|{{norm|''x'-y''}}|{{norm|''x-y''}}}} }} }}
where {{norm|''x-y''}} is the distance to the closest example ''y'' having a different label than ''x'', and {{norm|''x'-y''}} is the distance from ''y'' to its closest example ''x''′ with the same label as ''x''.
== ''k''-NN regression ==
{{further|Nearest neighbor smoothing}}
In ''k''-NN regression, also known as ''k''-NN smoothing, the ''k''-NN algorithm is used for estimating [[continuous variable]]s.{{citation needed|date=September 2019}} One such algorithm uses a weighted average of the ''k'' nearest neighbors, weighted by the inverse of their distance. This algorithm, sketched below, works as follows:

# Compute the Euclidean or [[Mahalanobis distance]] from the query example to the labeled examples.
# Order the labeled examples by increasing distance.
# Calculate an inverse-distance-weighted average of the values of the ''k'' nearest labeled examples.
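A sketch of this inverse-distance-weighted estimate (Python; the array names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def knn_regress(X_train, y_train, query, k=5, eps=1e-12):
    """Inverse-distance-weighted average of the target values of the k nearest neighbours."""
    distances = np.linalg.norm(X_train - query, axis=1)   # distances to all labeled examples
    nearest = np.argsort(distances)[:k]                   # the k closest labeled examples
    weights = 1.0 / (distances[nearest] + eps)            # closer neighbours receive larger weights
    return np.sum(weights * y_train[nearest]) / np.sum(weights)
</syntaxhighlight>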
==See also==
* [[Closest pair of points problem]]
* [[Nearest neighbor graph]]
* [[Segmentation-based object categorization]]
{{Clear right}}
==Further reading==
* {{cite book |editor=Dasarathy, Belur V. |editor-link=Belur V. Dasarathy |year=1991 |title=Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques |publisher=IEEE Computer Society Press |isbn=978-0818689307}}
* {{cite book |title=Nearest-Neighbor Methods in Learning and Vision |editor=Shakhnarovich, Gregory |editor2=Darrell, Trevor |editor3=Indyk, Piotr |publisher=[[MIT Press]] |year=2005 |isbn=978-0262195478}}