K-nearest neighbors algorithm

{{distinguish|Nearest neighbor search|Nearest neighbor interpolation|k-means clustering{{!}}''k''-means clustering}}
 
In [[statistics]], the '''''k''-nearest neighbors algorithm''' ('''''k''-NN''') is a [[Non-parametric statistics|non-parametric]] [[supervised learning]] method. It was first developed by [[Evelyn Fix]] and [[Joseph Lawson Hodges Jr.|Joseph Hodges]] in 1951,<ref>{{Cite report | last1=Fix | first1=Evelyn | last2= Hodges | first2=Joseph L. | title=Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties | issue=Report Number 4, Project Number 21-49-004 | year=1951 | url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | archive-url=https://web.archive.org/web/20200926212807/https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | url-status=live | archive-date=September 26, 2020 | publisher=USAF School of Aviation Medicine, Randolph Field, Texas}}</ref> and later expanded by [[Thomas M. Cover|Thomas Cover]].<ref name=":1" /> It is used for [[statistical classification|classification]] and [[regression analysis|regression]]; the output depends on which of the two tasks ''k''-NN is used for.
Most often, it is used for [[statistical classification|classification]], as a '''''k''-NN classifier''', the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its ''k'' nearest neighbors (''k'' is a positive [[integer]], typically small). If ''k''&nbsp;=&nbsp;1, then the object is simply assigned to the class of that single nearest neighbor.
 
The ''k''-NN algorithm can also be generalized for [[regression analysis|regression]]. In ''k''-NN regression, also known as ''[[nearest neighbor smoothing]]'', the output is the property value for the object. This value is the average of the values of ''k'' nearest neighbors. If ''k''&nbsp;=&nbsp;1, then the output is simply assigned to the value of that single nearest neighbor, also known as ''[[nearest neighbor interpolation]]''.
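For illustration, the following minimal sketch applies both output rules to a toy data set; the function names, the Euclidean metric and the data are illustrative choices only, not part of any particular standard implementation.
<syntaxhighlight lang="python">
from collections import Counter
import math

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, examples, k):
    # the k training examples closest to the query point
    return sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]

def knn_classify(query, examples, k):
    # classification: plurality vote among the k nearest neighbors' labels
    votes = Counter(label for _, label in k_nearest(query, examples, k))
    return votes.most_common(1)[0][0]

def knn_regress(query, examples, k):
    # regression: unweighted average of the k nearest neighbors' values
    values = [value for _, value in k_nearest(query, examples, k)]
    return sum(values) / len(values)

# toy data sets: (feature vector, class label) and (feature vector, target value)
train_cls = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([0.9, 1.0], "B"), ([1.1, 0.8], "B")]
train_reg = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0), ([3.0], 4.0)]

print(knn_classify([0.2, 0.1], train_cls, k=3))  # "A" (two of the three nearest neighbors are "A")
print(knn_regress([1.4], train_reg, k=2))        # 2.5 (average of the two nearest values 2.0 and 3.0)
</syntaxhighlight>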
 
''k''-NN is a type of [[classification]] where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales, then feature-wise [[Normalization (statistics)|normalizing]] of the training data can greatly improve its accuracy.<ref>{{Cite book|last=Hastie, Trevor.|title=The elements of statistical learning : data mining, inference, and prediction : with 200 full-color illustrations|date=2001|publisher=Springer|others=Tibshirani, Robert., Friedman, J. H. (Jerome H.)|isbn=0-387-95284-5|___location=New York|oclc=46809224}}</ref>
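A minimal sketch of one such normalization, min-max rescaling of every feature to the interval [0, 1] (z-score standardization is an equally common alternative; the helper name and the toy data below are illustrative):
<syntaxhighlight lang="python">
def min_max_normalize(vectors):
    # rescale each feature column to [0, 1] so that no feature dominates the
    # distance computation purely because of its units or magnitude
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    spans = [hi - lo or 1.0 for lo, hi in zip(lows, highs)]  # guard against constant features
    return [[(x - lo) / s for x, lo, s in zip(v, lows, spans)] for v in vectors]

# heights in centimetres and incomes in dollars live on very different scales
raw = [[170.0, 30000.0], [180.0, 32000.0], [165.0, 90000.0]]
print(min_max_normalize(raw))  # every feature now spans [0, 1] across the data set
</syntaxhighlight>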
 
For both classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/''d'', where ''d'' is the distance to the neighbor.<ref>This scheme is a generalization of linear interpolation.</ref>
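A sketch of this 1/''d'' weighting for the regression case, on a self-contained one-dimensional toy example (the names and data are illustrative):
<syntaxhighlight lang="python">
def weighted_knn_regress(query, examples, k):
    # each of the k nearest neighbors contributes with weight 1/d, so that
    # closer neighbors pull the estimate more strongly toward their own value
    def dist(a, b):
        return abs(a - b)  # one-dimensional example; any metric works

    neighbors = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    eps = 1e-12  # avoid division by zero when a neighbor coincides with the query
    weights = [1.0 / (dist(query, x) + eps) for x, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

train = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # (position, value) pairs
print(weighted_knn_regress(1.4, train, k=2))  # about 2.4, pulled toward the closer neighbor's 2.0
</syntaxhighlight>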
 
The input consists of the ''k'' closest training examples in a [[data set]].
The neighbors are taken from a set of objects for which the class (for ''k''-NN classification) or the object property value (for ''k''-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
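The following toy class (an illustrative sketch, not a reference implementation) makes this explicit: the ''fit'' step merely stores the labelled examples, and all distance computations are deferred to prediction time.
<syntaxhighlight lang="python">
class LazyKNN:
    """Minimal illustration: 'training' only stores the labelled examples."""

    def __init__(self, k):
        self.k = k
        self.examples = []

    def fit(self, points, labels):
        # no parameters are estimated; the stored data set itself is the model
        self.examples = list(zip(points, labels))
        return self

    def predict(self, query):
        # all work happens at query time: rank stored examples by distance
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        nearest = sorted(self.examples, key=lambda ex: dist(query, ex[0]))[:self.k]
        labels = [label for _, label in nearest]
        return max(set(labels), key=labels.count)  # plurality vote

model = LazyKNN(k=3).fit([[0, 0], [0, 1], [5, 5], [6, 5]], ["A", "A", "B", "B"])
print(model.predict([0.5, 0.5]))  # "A"
</syntaxhighlight>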
 
A peculiarity (sometimes even a disadvantage) of the ''k''-NN algorithm is its sensitivity to the local structure of the data.
 
==Statistical setting==