Authored by Tony Feng
Created on May 18th, 2022
Last Modified on May 19th, 2022
Intro
K-nearest neighbors (KNN) is a simple, easy-to-understand machine learning algorithm. It examines the labels of a chosen number of data points surrounding a target data point in order to predict the class that the target falls into.
It is a supervised learning algorithm used for both regression and classification. 1) In classification, it predicts the class of a test example by a majority vote among its K nearest neighbors. 2) In regression, the prediction is the mean of the K nearest training points.
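As a quick illustration of both modes, here is a minimal usage sketch with scikit-learn; the library choice, the synthetic datasets, and the parameter values below are assumptions made for illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: the prediction is the majority vote of the K nearest neighbors.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: the prediction is the mean target of the K nearest neighbors.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = KNeighborsRegressor(n_neighbors=5).fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```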
Characteristics
Non-parametric
KNN is non-parametric, which means it makes no assumption about the underlying data distribution (i.e., the model structure is determined from the dataset).
Lazy
It is called a lazy algorithm because it does not build a model from the training data; all training data is used in the testing phase. This makes training faster, but it makes the testing phase slower and costlier.
Similarity Measure
KNN stores all the available cases and predicts the new data based on a similarity measure.
KNN Details
Steps
- Choosing a K and a distance metric
- Calculating the distance between a test example and each training example
- Sorting the calculated distances
- Returning a prediction
- Classification: Assigning the most frequent label among the top K neighbors
- Regression: Outputting the mean (average) of the K nearest neighbors
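The steps above can be sketched end to end as follows. This is a minimal from-scratch version that assumes NumPy, Euclidean distance, and function/variable names of my own choosing (not from the original notes).

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, task="classification"):
    # Steps 1-2: Euclidean distance from the query to every training example.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: sort the calculated distances and keep the K nearest neighbors.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Step 4: return a prediction.
    if task == "classification":
        # Majority vote over the K nearest labels.
        return Counter(nearest_labels).most_common(1)[0][0]
    # Regression: mean of the K nearest targets.
    return nearest_labels.mean()

# Toy usage: the query point sits next to the two class-0 examples.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```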
K Selection
There is no pre-defined statistical method to find the most favorable value of K. Sometimes we can apply domain knowledge to select a K. Additionally, we can plot the error rate against K over a defined range of values and pick the K with the lowest error, as sketched in the code below.
A smaller K tends to lead to overfitting, while a larger K results in smoother decision boundaries and, eventually, underfitting.
In some cases, the majority vote ends in a tie. One solution, which will be discussed later, is to change to another decision rule. Besides that, we can consider the parity of the number of classes and of the value of K.
- If the number of classes is even, then K should be odd;
- If the number of classes is odd, then K should be even.
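Below is one way such an error-rate-vs-K plot might be produced; scikit-learn, matplotlib, the synthetic data, and the K range of 1 to 30 are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ks = range(1, 31)
errors = []
for k in ks:
    # Cross-validated accuracy for each candidate K, turned into an error rate.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    errors.append(1 - acc)

plt.plot(ks, errors, marker="o")
plt.xlabel("K")
plt.ylabel("cross-validated error rate")
plt.title("Error rate vs. K")
plt.show()

best_k = ks[errors.index(min(errors))]
print("best K:", best_k)
```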
Distance Metric
There are various methods for calculating this distance, of which the most commonly known are the Euclidean and Manhattan distances (for continuous variables) and the Hamming distance (for categorical variables).
Euclidean Distance
$$ ED = \sqrt{\sum_{i=1}^{k}(x_{i}-y_{i})^{2}} $$
Manhattan Distance
$$ MD = \sum_{i=1}^{k}|x_{i}-y_{i}| $$
Hamming Distance
$$ HD = \sum_{i=1}^{k} M_i $$
where $ M_i = 0 $ if $ x_i = y_i $, and $ M_i = 1 $ otherwise.
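The three metrics above could be implemented as short NumPy helpers along these lines (the function names are mine, not from the original notes).

```python
import numpy as np

def euclidean_distance(x, y):
    # ED = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # MD = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def hamming_distance(x, y):
    # HD = number of positions i where x_i != y_i
    return int(np.sum(x != y))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean_distance(a, b))   # sqrt(1 + 0 + 4) ~= 2.236
print(manhattan_distance(a, b))   # 1 + 0 + 2 = 3
print(hamming_distance(np.array(["red", "S"]), np.array(["blue", "S"])))  # 1
```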
Decision Rules
When combining the class labels, the simplest method is to take the majority vote, but this can be a problem if the nearest neighbors vary widely in their distances, since the closest neighbors may indicate the class of the object more reliably.
To overcome this issue, we can assign weights to the K nearest points. The intuition behind weighted KNN is to give more weight to the points that are nearby and less weight to the points that are farther away.
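One common weighting scheme is to let each of the K neighbors vote with weight 1/distance; the sketch below assumes that scheme, NumPy, and a small epsilon to avoid division by zero.

```python
from collections import defaultdict
import numpy as np

def weighted_knn_classify(X_train, y_train, x_query, k=5, eps=1e-8):
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest_idx = np.argsort(distances)[:k]
    votes = defaultdict(float)
    for i in nearest_idx:
        # Closer neighbors cast larger votes; eps avoids division by zero.
        votes[y_train[i]] += 1.0 / (distances[i] + eps)
    return max(votes, key=votes.get)

# With a plain majority vote and k=3 the two distant class-0 points would win,
# but distance weighting lets the single very close class-1 point dominate.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(weighted_knn_classify(X_train, y_train, np.array([2.6, 2.9]), k=3))  # -> 1
```

For reference, scikit-learn's KNeighborsClassifier exposes the same idea through its weights="distance" option.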
Evaluation
Pros
- KNN is straightforward and easy to implement.
- No “training” process is needed in the traditional sense.
- KNN makes no assumptions about the underlying data.
- KNN can quickly adapt to new data.
- This algorithm can be used for both regression and classification.
Cons
- The KNN algorithm is computationally expensive and slow at prediction time.
- It needs a large memory storage.
- The efficiency of KNN decreases as the number of features grows.
- It is highly susceptible to the curse of dimensionality.
- It cannot function properly with missing values.
- Outliers are especially impactful for KNN.