{{distinguish|Nearest neighbor search|Nearest neighbor interpolation|k-means clustering{{!}}''k''-means clustering}}
In [[statistics]], the '''''k''-nearest neighbors algorithm''' ('''''k''-NN''') is a [[Non-parametric statistics|non-parametric]] [[supervised learning]] method. It was first developed by [[Evelyn Fix]] and [[Joseph Lawson Hodges Jr.|Joseph Hodges]] in 1951,<ref>{{Cite report | last1=Fix | first1=Evelyn | last2= Hodges | first2=Joseph L. | title=Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties | issue=Report Number 4, Project Number 21-49-004 | year=1951 | url=https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | archive-url=https://web.archive.org/web/20200926212807/https://apps.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf | url-status=live | archive-date=September 26, 2020 | publisher=USAF School of Aviation Medicine, Randolph Field, Texas}}</ref> and later expanded by [[Thomas M. Cover|Thomas Cover]].<ref name=":1" />
Most often, it is used for [[statistical classification|classification]], as a '''''k''-NN classifier''', the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its ''k'' nearest neighbors (''k'' is a positive [[integer]], typically small). If ''k'' = 1, then the object is simply assigned to the class of that single nearest neighbor.
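The classification rule can be illustrated with a short Python sketch, assuming a Euclidean distance metric; the helper names (<code>euclidean</code>, <code>classify</code>) are illustrative only and not part of any standard library.

<syntaxhighlight lang="python">
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, training_points, training_labels, k=3):
    # Rank the training examples by distance to the query point.
    neighbors = sorted(zip(training_points, training_labels),
                       key=lambda pair: euclidean(query, pair[0]))
    # Take the labels of the k nearest neighbors and return the most common one.
    k_labels = [label for _, label in neighbors[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Example: two classes in a two-dimensional feature space.
points = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.2), (3.1, 2.9)]
labels = ["A", "A", "B", "B"]
print(classify((1.1, 1.0), points, labels, k=3))  # prints "A"
</syntaxhighlight>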
''k''-NN is a type of [[classification]] where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales, then feature-wise [[Normalization (statistics)|normalizing]] of the training data can greatly improve its accuracy.<ref>{{Cite book |last1=Hastie |first1=Trevor |last2=Tibshirani |first2=Robert |last3=Friedman |first3=Jerome H. |title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction |date=2001 |publisher=Springer |___location=New York |isbn=0-387-95284-5 |oclc=46809224}}</ref>
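One common (though not the only) normalization is to rescale each feature independently to the range [0, 1]; the following sketch and its <code>min_max_normalize</code> helper are illustrative only.

<syntaxhighlight lang="python">
def min_max_normalize(points):
    # Rescale each feature independently to the range [0, 1] so that no
    # single feature dominates the distance computation.
    mins = [min(col) for col in zip(*points)]
    maxs = [max(col) for col in zip(*points)]
    return [tuple((x - lo) / (hi - lo) if hi > lo else 0.0
                  for x, lo, hi in zip(p, mins, maxs))
            for p in points]

# Example: height in centimetres and income in dollars differ by orders of
# magnitude; without rescaling, income differences dominate the distance.
raw = [(180.0, 30000.0), (165.0, 90000.0), (172.0, 45000.0)]
print(min_max_normalize(raw))
</syntaxhighlight>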
For both classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/''d'', where ''d'' is the distance to the neighbor.<ref>This scheme is a generalization of linear interpolation.</ref>
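A sketch of this 1/''d'' weighting scheme, again assuming Euclidean distance; the small epsilon added to the denominator is an implementation detail to guard against division by zero when the query coincides with a training point, and is not part of the scheme itself.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def classify_weighted(query, training_points, training_labels, k=3):
    # Rank the training examples by Euclidean distance to the query point.
    neighbors = sorted(zip(training_points, training_labels),
                       key=lambda pair: math.dist(query, pair[0]))
    votes = defaultdict(float)
    for point, label in neighbors[:k]:
        # Each neighbor votes for its class with weight 1/d,
        # so closer neighbors contribute more.
        votes[label] += 1.0 / (math.dist(query, point) + 1e-12)
    return max(votes, key=votes.get)

points = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.2), (3.1, 2.9)]
labels = ["A", "A", "B", "B"]
print(classify_weighted((2.9, 3.0), points, labels, k=4))  # prints "B"
</syntaxhighlight>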
The input consists of the ''k'' closest training examples in a [[data set]].
The neighbors are taken from a set of objects for which the class (for ''k''-NN classification) or the object property value (for ''k''-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
A peculiarity (sometimes even a disadvantage) of the ''k''-NN algorithm is its sensitivity to the local structure of the data.
In ''k''-NN [[regression analysis|regression]], the output is the property value for the object. This value is the average of the values of its ''k'' nearest neighbors. If ''k'' = 1, the output is simply the value of that single nearest neighbor.
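A corresponding sketch of the regression variant, under the same illustrative assumptions, simply averages the target values of the ''k'' nearest neighbors.

<syntaxhighlight lang="python">
import math

def predict(query, training_points, training_values, k=3):
    # Average the target values of the k training points closest to the query.
    neighbors = sorted(zip(training_points, training_values),
                       key=lambda pair: math.dist(query, pair[0]))
    k_values = [value for _, value in neighbors[:k]]
    return sum(k_values) / len(k_values)

# Example: predict a price from the two most similar (area, rooms) records.
points = [(50.0, 2.0), (80.0, 3.0), (120.0, 4.0)]
values = [150000.0, 230000.0, 330000.0]
print(predict((90.0, 3.0), points, values, k=2))  # prints 280000.0
</syntaxhighlight>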
==Statistical setting==