Authored by Tony Feng
Created on May 18th, 2022
Last Modified on May 19th, 2022
Intro
K-nearest neighbors (KNN) is a simple, easy-to-understand machine learning algorithm. It examines the labels of a chosen number of data points surrounding a target data point in order to predict the class that the target falls into.
It is a supervised learning algorithm used for both regression and classification. 1) In classification, it predicts the class of a test example by a majority vote among its K nearest neighbors. 2) In regression, the prediction is the mean of the K nearest training points.
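As a quick illustration of both modes, here is a minimal usage sketch with scikit-learn; the library choice, the synthetic datasets, and the parameter values below are assumptions made for illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: the prediction is the majority vote of the K nearest neighbors.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: the prediction is the mean target of the K nearest neighbors.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = KNeighborsRegressor(n_neighbors=5).fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```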
Characteristics
Non-parametric
KNN is non-parametric, which means it makes no assumption about the underlying data distribution (i.e., the model structure is determined from the dataset).
Lazy
It is called a lazy algorithm because it does not build a model from the training data; all training data is used in the testing phase. This makes training faster, but it makes the testing phase slower and costlier.
Similarity Measure
KNN stores all the available cases and predicts the new data based on a similarity measure.
KNN Details
Steps
- Choosing a K and a distance metric
- Calculating the distance between a test example and each training example
- Sorting the calculated distances
- Returning a prediction
- Classification: Assigning the most frequent label among the top K neighbors
- Regression: Outputting the mean (average) of the K nearest neighbors
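The steps above can be sketched end to end as follows. This is a minimal from-scratch version that assumes NumPy, Euclidean distance, and function/variable names of my own choosing (not from the original notes).

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, task="classification"):
    # Steps 1-2: Euclidean distance from the query to every training example.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: sort the calculated distances and keep the K nearest neighbors.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Step 4: return a prediction.
    if task == "classification":
        # Majority vote over the K nearest labels.
        return Counter(nearest_labels).most_common(1)[0][0]
    # Regression: mean of the K nearest targets.
    return nearest_labels.mean()

# Toy usage: the query point sits next to the two class-0 examples.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```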
K Selection
There is no pre-defined statistical method to find the most favorable value of K. Sometimes we can apply domain knowledge to select a K. Additionally, we can plot the error rate against K over a defined range of values and pick the K with the lowest error, as sketched in the code below.
A smaller K tends to lead to overfitting, while a larger K results in smoother decision boundaries and, eventually, underfitting.
In some cases, the majority vote ends in a tie. One solution, which will be discussed later, is to change to another decision rule. Besides that, we can consider the parity of the number of classes and of the value of K.
- If the number of classes is even, then K should be odd;
- If the number of classes is odd, then K should be even.
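Below is one way such an error-rate-vs-K plot might be produced; scikit-learn, matplotlib, the synthetic data, and the K range of 1 to 30 are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ks = range(1, 31)
errors = []
for k in ks:
    # Cross-validated accuracy for each candidate K, turned into an error rate.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    errors.append(1 - acc)

plt.plot(ks, errors, marker="o")
plt.xlabel("K")
plt.ylabel("cross-validated error rate")
plt.title("Error rate vs. K")
plt.show()

best_k = ks[errors.index(min(errors))]
print("best K:", best_k)
```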
Distance Metric
There are various methods for calculating this distance, of which the most commonly known are the Euclidean and Manhattan distances (for continuous variables) and the Hamming distance (for categorical variables).
Euclidean Distance
$$ ED = \sqrt{\sum_{i=1}^{k}(x_{i}-y_{i})^{2}} $$
Manhattan Distance
$$ MD = \sum_{i=1}^{k}|x_{i}-y_{i}| $$
Hamming Distance
$$ HD = \sum_{i=1}^{k} M_i $$
where $ M_i = 0 $ if $ x_i = y_i $, and $ M_i = 1 $ otherwise.
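The three metrics above could be implemented as short NumPy helpers along these lines (the function names are mine, not from the original notes).

```python
import numpy as np

def euclidean_distance(x, y):
    # ED = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # MD = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def hamming_distance(x, y):
    # HD = number of positions i where x_i != y_i
    return int(np.sum(x != y))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean_distance(a, b))   # sqrt(1 + 0 + 4) ~= 2.236
print(manhattan_distance(a, b))   # 1 + 0 + 2 = 3
print(hamming_distance(np.array(["red", "S"]), np.array(["blue", "S"])))  # 1
```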
Decision Rules
When combining the class labels, the simplest method is to take the majority vote, but this can be a problem if the nearest neighbors vary widely in their distances, since the closest neighbors may indicate the class of the object more reliably.
To overcome this issue, we can assign weights to the K nearest points. The intuition behind weighted KNN is to give more weight to the points that are nearby and less weight to the points that are farther away.
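One common weighting scheme is to let each of the K neighbors vote with weight 1/distance; the sketch below assumes that scheme, NumPy, and a small epsilon to avoid division by zero.

```python
from collections import defaultdict
import numpy as np

def weighted_knn_classify(X_train, y_train, x_query, k=5, eps=1e-8):
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest_idx = np.argsort(distances)[:k]
    votes = defaultdict(float)
    for i in nearest_idx:
        # Closer neighbors cast larger votes; eps avoids division by zero.
        votes[y_train[i]] += 1.0 / (distances[i] + eps)
    return max(votes, key=votes.get)

# With a plain majority vote and k=3 the two distant class-0 points would win,
# but distance weighting lets the single very close class-1 point dominate.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(weighted_knn_classify(X_train, y_train, np.array([2.6, 2.9]), k=3))  # -> 1
```

For reference, scikit-learn's KNeighborsClassifier exposes the same idea through its weights="distance" option.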
Evaluation
Pros
- KNN is straightforward and easy to implement.
- No “training” process is needed in the traditional sense.
- KNN makes no assumptions about the underlying data.
- KNN can quickly adapt to new data.
- This algorithm can be used for both regression and classification.
Cons
- The KNN algorithm is computationally expensive and slow at prediction time.
- It needs a large memory storage.
- The efficiency of KNN decreases as the number of features grows.
- It is highly susceptible to the curse of dimensionality.
- It cannot function properly with missing values.
- Outliers are especially impactful for KNN.