A commonly used distance metric for [[continuous variable]]s is [[Euclidean distance]]. For discrete variables, such as for text classification, another metric can be used, such as the '''overlap metric''' (or [[Hamming distance]]). In the context of gene expression microarray data, for example, ''k''-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.<ref>{{cite journal |last1=Jaskowiak |first1=Pablo A. |last2=Campello |first2=Ricardo J. G. B. |title=Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data |journal=Brazilian Symposium on Bioinformatics (BSB 2011) |year=2011 |pages=1–8 |citeseerx=10.1.1.208.993 }}</ref> Often, the classification accuracy of ''k''-NN can be improved significantly if the distance metric is learned with specialized algorithms such as [[Large Margin Nearest Neighbor]] or [[Neighbourhood components analysis]].
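A minimal illustration of the two metrics, using illustrative function names and toy inputs (not a canonical implementation):

<syntaxhighlight lang="python">
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance for continuous feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def overlap_distance(a, b):
    """Overlap (Hamming) distance for discrete features: number of mismatched positions."""
    return sum(int(x != y) for x, y in zip(a, b))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))            # 5.0
print(overlap_distance(["red", "small"], ["red", "large"]))  # 1
</syntaxhighlight>

Only the distance function changes between the continuous and discrete cases; the rest of the ''k''-NN procedure is identical.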
[[File:Kmeans clustering WHR2023 data.gif|thumb|An animated visualization of K-means clustering with ''k'' = 3, grouping countries based on life expectancy, GDP, and happiness—demonstrating how KNN operates in higher dimensions. Click to view the animation.<ref>Helliwell, J. F., Layard, R., Sachs, J. D., Aknin, L. B., De Neve, J.-E., & Wang, S. (Eds.). (2023). World Happiness Report 2023 (11th ed.). Sustainable Development Solutions Network.</ref>]]
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the ''k'' nearest neighbors due to their large number.<ref name="Coomans_Massart1982">{{Cite journal
| doi = 10.1016/S0003-2670(01)95359-0
| last1 = Coomans | first1 = D.
| last2 = Massart | first2 = D. L.
| title = Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules
| journal = Analytica Chimica Acta
| volume = 136
| pages = 15–27
| year = 1982
| bibcode = 1982AcAC..136...15C }}
</ref> One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its ''k'' nearest neighbors. The class (or value, in regression problems) of each of the ''k'' nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a [[self-organizing map]] (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. ''K''-NN can then be applied to the SOM.
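A sketch of such distance-weighted voting, assuming Euclidean distance and weights proportional to the inverse distance (the function name and toy data are illustrative only):

<syntaxhighlight lang="python">
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, query, k=5, eps=1e-12):
    """Classify `query` by inverse-distance weighted voting among its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest training points
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)  # closer neighbours carry more weight
    return max(votes, key=votes.get)

# Class "b" is over-represented, but the nearest points to the query belong to "a",
# so inverse-distance weighting resists the skew that plain majority voting would show.
X = [[0.0, 0.0], [0.1, 0.1], [2.0, 2.0], [2.1, 2.0], [2.0, 2.1], [1.9, 2.0]]
y = ["a", "a", "b", "b", "b", "b"]
print(weighted_knn_classify(X, y, query=[0.2, 0.2], k=5))  # "a"
</syntaxhighlight>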
==The 1-nearest neighbour classifier==
The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier that assigns a point {{mvar|x}} to the class of its closest neighbour in the feature space, that is <math>C_n^{1nn}(x) = Y_{(1)}</math>.
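A minimal sketch of the 1-nearest-neighbour rule, assuming Euclidean distance (names and toy data are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def one_nn_classify(X_train, y_train, x):
    """Return the label of the single closest training point (the 1-NN rule)."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    return y_train[int(np.argmin(dists))]   # Y_(1): label of the nearest neighbour

print(one_nn_classify([[0, 0], [5, 5]], ["blue", "red"], [1, 1]))  # "blue"
</syntaxhighlight>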
| url = https://archive.org/details/arxiv-1202.2194
| journal = International Journal of Remote Sensing
| volume = 32
| issue = 21
| pages = 6109–6132
| doi = 10.1080/01431161.2010.507795
| arxiv = 1202.2194
}}</ref>
The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but it is computationally intensive for large training sets. Using an approximate [[nearest neighbor search]] algorithm makes ''k-''NN computationally tractable even for large data sets. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.
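A sketch of the naive version, assuming Euclidean distance and simple majority voting (the function name and toy data are illustrative only):

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def brute_force_knn(X_train, y_train, query, k=3):
    """Naive k-NN: compute the distance from `query` to every stored example, then majority-vote."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)  # one distance per stored example
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
print(brute_force_knn(X, y, query=[4.5, 5.0], k=3))  # "b"
</syntaxhighlight>

Each query here costs a distance computation against every stored example, which is exactly the work that tree-based and approximate nearest-neighbour indexes aim to reduce.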
''k-''NN has some strong [[consistency (statistics)|consistency]] results. As the amount of data approaches infinity, the two-class ''k-''NN algorithm is guaranteed to yield an error rate no worse than twice the [[Bayes error rate]] (the minimum achievable error rate given the distribution of the data).<ref name=":1">{{cite journal |url=http://ssg.mit.edu/cal/abs/2000_spring/np_dens/classification/cover67.pdf |doi=10.1109/TIT.1967.1053964 |author-link1=Thomas M. Cover |last1=Cover |first1=Thomas M. |author-link2=Peter E. Hart |last2=Hart |first2=Peter E. |title=Nearest neighbor pattern classification |journal=IEEE Transactions on Information Theory |year=1967 |volume=13 |issue=1 |pages=21–27 }}</ref>
For multi-class ''k-''NN classification, [[Thomas M. Cover|Cover]] and [[Peter E. Hart|Hart]] (1967) prove an upper bound error rate of
:<math>R^* \ \leq\ R_{k\mathrm{NN}} \ \leq\ R^* \left(2 - \frac{MR^*}{M-1}\right)</math>
where <math>R^*</math> is the Bayes error rate (the minimum error rate achievable), <math>R_{k\mathrm{NN}}</math> is the ''k''-NN error rate, and {{mvar|M}} is the number of classes in the problem.
Use ''U'' instead of ''X'' for classification. The examples that are not prototypes are called "absorbed" points.
It is efficient to scan the training examples in order of decreasing border ratio.<ref name="MirkesKnn">Mirkes, Evgeny M.; [http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html ''KNN and Potential Energy: applet''] {{Webarchive|url=https://web.archive.org/web/20120119201715/http://www.math.le.ac.uk/people/ag153/homepage/KNN/KNN3.html |date=2012-01-19 }}, University of Leicester, 2011</ref> The border ratio of a training example ''x'' is defined as
{{block indent | em = 1.5 | text = {{math|1= ''a''(''x'') = {{sfrac|{{norm|''x'-y''}}|{{norm|''x-y''}}}} }} }}
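A sketch of how border ratios might be computed, assuming ''y'' denotes the closest example to ''x'' with a different label and ''x''&prime; the closest example to that ''y'' carrying the same label as ''x'' (function name and toy data are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def border_ratios(X, labels):
    """Border ratio a(x) = ||x' - y|| / ||x - y|| for each training example x,
    where y is the closest example to x with a different label and x' is the
    closest example to that y carrying the same label as x (so 0 < a(x) <= 1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ratios = []
    for i in range(len(X)):
        other = np.where(labels != labels[i])[0]                              # points of another class
        y_idx = other[np.argmin(np.linalg.norm(X[other] - X[i], axis=1))]     # y
        same = np.where(labels == labels[i])[0]                               # points of x's class
        xp_idx = same[np.argmin(np.linalg.norm(X[same] - X[y_idx], axis=1))]  # x'
        ratios.append(np.linalg.norm(X[xp_idx] - X[y_idx]) / np.linalg.norm(X[i] - X[y_idx]))
    return ratios

# Border points get a ratio near 1, interior points a smaller one, so scanning
# in decreasing order of a(x) visits the class border first.
X = [[0, 0], [1, 0], [2, 0], [3, 0]]
labels = ["a", "a", "b", "b"]
print(border_ratios(X, labels))   # [0.5, 1.0, 1.0, 0.5]
</syntaxhighlight>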
== ''k''-NN regression ==
{{further|Nearest neighbor smoothing}}
In ''k''-NN regression, also known as ''k''-NN smoothing, the ''k''-NN algorithm is used for estimating [[continuous variable]]s.{{citation needed|date=September 2019}} One such algorithm uses a weighted average of the ''k'' nearest neighbors, weighted by the inverse of their distance. This algorithm works as follows:
# Compute the Euclidean or [[Mahalanobis distance]] from the query example to the labeled examples.
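A minimal sketch of the inverse-distance weighted average described above, assuming Euclidean distance (function name and toy data are illustrative only):

<syntaxhighlight lang="python">
import numpy as np

def knn_regress(X_train, y_train, query, k=3, eps=1e-12):
    """Predict a continuous value as the inverse-distance weighted average
    of the k nearest training targets (Euclidean distance assumed)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]            # the k closest labeled examples
    weights = 1.0 / (dists[nearest] + eps)     # inverse-distance weights
    return float(np.dot(weights, y_train[nearest]) / weights.sum())

X = [[1.0], [2.0], [3.0], [10.0]]
y = [1.1, 2.1, 2.9, 10.2]
print(knn_regress(X, y, query=[2.5], k=3))  # closer neighbours dominate the average
</syntaxhighlight>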
==Further reading==
* {{cite book |editor=Dasarathy, Belur V. |editor-link=Belur V. Dasarathy |year=1991 |title=Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques |publisher=IEEE Computer Society Press |isbn=978-0818689307}}
* {{cite book |title=Nearest-Neighbor Methods in Learning and Vision |editor=Shakhnarovich, Gregory |editor2=Darrell, Trevor |editor3=Indyk, Piotr |publisher=[[MIT Press]] |year=2005 |isbn=978-0262195478}}