k Nearest Neighbors

Overview

  • Non-parametric method for classification & regression
  • Given k > 0 and an instance x_0, identify the k points in the training data closest to x_0 based on their covariates
  • Estimate the conditional probability P(Y = j | X = x_0) for each class j as the fraction of these k neighbors that fall in class j
  • Then apply the Bayes classifier and assign x_0 to the class with the largest estimated conditional probability (see the sketch after this list)
  • Choice of k has large impact on nature of classifier obtained. Small k => high variance, low bias; large k => low variance, high bias
  • Can weight the contribution of neighbors so that nearer neighbors contribute more than distant ones (e.g., weights inversely proportional to distance), with the constraint that the weights sum to 1 (see the weighted sketch after this list)
  • Typical distance metric for continuous covariates is Euclidean distance. For discrete covariates like text, can use Hamming distance. For gene expression data, correlation can also be used as metric.
  • Best value for k can be selected by cross-validation (see the cross-validation sketch after this list)
  • When k = 1 and training set size -> Inf, the k-NN error rate is at most twice the Bayes error rate
  • Performance of k-NN can be improved by learning a custom distance metric based on label info using methods like neighborhood components analysis and large margin nearest neighbor.
  • Distance to the kth nearest neighbor is a local density estimate (large distance = low local density) and can be used as an outlier score for anomaly detection (see the outlier-score sketch after this list)
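
The basic procedure is easy to sketch directly. Below is a minimal, illustrative NumPy version (the helper name knn_predict and the arrays X_train / y_train are hypothetical; it assumes a 2-D array of covariates and a 1-D array of class labels):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Classify a single point x0 by the k-NN rule described above."""
    # Euclidean distance from x0 to every training point
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Estimate P(Y = j | X = x0) as the fraction of the k neighbors in class j
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    probs = counts / k
    # Bayes classifier: assign x0 to the class with the largest estimated probability
    return classes[np.argmax(probs)]
```

For example, `knn_predict(X_train, y_train, x0, k=5)` returns the majority class among the 5 training points nearest to x0.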
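
A distance-weighted variant (same hypothetical names as above) replaces the raw counts with normalized inverse-distance weights, so nearer neighbors contribute more:

```python
import numpy as np

def knn_predict_weighted(X_train, y_train, x0, k=5, eps=1e-12):
    """Weighted k-NN: neighbors vote with weight proportional to 1/distance."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)  # eps avoids division by zero
    weights /= weights.sum()                # constrain weights to sum to 1
    # Sum the normalized weights per class instead of counting neighbors
    scores = {}
    for label, w in zip(y_train[nearest], weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```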
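
One common way to select k is a grid search over candidate values with cross-validated accuracy. A sketch assuming scikit-learn is available (X_train and y_train are the hypothetical training arrays from above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try a range of candidate k values and keep the one with the best
# 5-fold cross-validated accuracy.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_["n_neighbors"], search.best_score_)
```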
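
The outlier-score idea in the last bullet can be sketched the same way: score each point by its distance to its kth nearest neighbor, so points in sparse regions get large scores (hypothetical helper, brute-force for clarity):

```python
import numpy as np

def kth_nn_distance(X, k=5):
    """Outlier score: distance from each row of X to its k-th nearest neighbor."""
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        dists = np.linalg.norm(X - x, axis=1)
        dists[i] = np.inf                  # exclude the point itself
        scores[i] = np.sort(dists)[k - 1]  # distance to the k-th neighbor
    return scores
```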

Advantages

  • Very simple algorithm to implement
  • No explicit training step; training data just needs to be stored
  • No assumptions about data or classes
  • Robust to noisy training data (especially if neighbor contributions are weighted by inverse distance)

Disadvantages

  • Computationally expensive to find nearest neighbors in large data sets
  • Resulting model is hard to interpret; there is no explicit description of the classes, only the stored training points
  • k needs to be specified ahead of time and the best value may not be known in advance. Workaround is to determine k via cross-validation (see the sketch in the Overview section)
  • Computational cost for large data sets is high if the entire training set is used. Workaround is to keep only the training points that determine the class boundaries (condensed nearest neighbor)
  • The choice of distance metric strongly affects the results
  • If one class in the training data has many more points than the other classes, it can dominate predictions simply because more points in any neighborhood belong to it. Weighting neighbors by distance is one workaround; another is to use a different data representation (e.g., a self-organizing map) and run k-NN on that representation
  • Accuracy of k-NN can be degraded by noisy, irrelevant, or redundant features and by inconsistent feature scales. Workaround is to apply scaling and dimensionality reduction before running k-NN (see the scaling sketch after this list)
  • Ties, where a point is equally close to (or receives equal votes from) 2 or more classes, must be broken arbitrarily
  • Cannot handle categorical covariates well
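
As a workaround for the scaling issue above, features can be standardized before distances are computed. A sketch assuming scikit-learn is available (X_train, y_train, and X_test are hypothetical arrays):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean / unit variance before computing
# distances, so no feature dominates purely because of its scale.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```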

Ref

  • ISLR, Section 2.2 (Bayes classifier)
  • Wikipedia