K-nearest neighbours (KNN) is one of the most basic classification algorithms. It rests on the assumption that similar things exist in close proximity to each other.
The KNN algorithm calculates the distance between a new data point and the existing points whose categories are known, and then looks at the points lying at the shortest distances. It is a supervised learning algorithm, which means the classes are predefined and already available in the training data.
The K in the KNN algorithm refers to the number of neighbours considered, and it can be any positive whole number.
Consider an example of deciding whether a certain product is normal or has an anomaly, where the blue squares show normal products, the red triangles show anomalies, and the green circle is the product we want to classify. If we take K = 3 (the inner circle), the object is classified as an anomaly because the triangles outvote the squares, but if we take K = 5 the product is classified as normal because the blue squares are now in the majority.
To conclude, in simpler terms, KNN takes the nearest neighbours of a point and classifies it according to the majority label among those neighbours.
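As a minimal sketch of this voting idea (the toy coordinates, labels and query point below are made up purely for illustration), we can sort the known points by their distance to the new point and take a majority vote among the K closest:

from collections import Counter
import math

# toy training data: (x, y) coordinates with known labels (illustrative values only)
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9)]
labels = ["normal", "normal", "normal", "anomaly", "anomaly"]
query = (1.1, 1.0)   # the new point we want to classify
k = 3

# sort the known points by Euclidean distance to the query point
by_distance = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))

# majority vote among the labels of the K nearest points
nearest_labels = [label for _, label in by_distance[:k]]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)   # "normal", since all three nearest points are labelled normal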
How Do We Measure Distances Between Points?
There are a few distance metrics that are commonly used with the KNN algorithm.
- Euclidean distance: for two points $(x_1, y_1)$ and $(x_2, y_2)$, the Euclidean distance between them is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. This is the shortest (straight-line) distance between the two points, and it is also known as the $L_2$ norm.
- Manhattan distance: the sum of the absolute differences between the coordinates of the points, $|x_1 - x_2| + |y_1 - y_2|$, also known as the $L_1$ norm.
- Minkowski distance: a generalisation of the two; for the distance between two points it is calculated as $\left( |x_1 - x_2|^p + |y_1 - y_2|^p \right)^{1/p}$.
If we set p = 1 it becomes the Manhattan distance, and when p = 2 it becomes the Euclidean distance.
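As a quick sketch of these formulas (the two points below are arbitrary), each distance can be computed directly with NumPy, which keeps the expressions visible:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences (L2 norm)
euclidean = np.sqrt(np.sum((a - b) ** 2))        # 5.0

# Manhattan distance: sum of absolute differences (L1 norm)
manhattan = np.sum(np.abs(a - b))                # 7.0

# Minkowski distance with a general p (p = 1 gives Manhattan, p = 2 gives Euclidean)
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)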
How to select the value of K?
Before we dive into how to select the optimal value of K, we need to understand what is meant by a decision boundary.
Consider a data set consisting of circle and plus labels. If we try to generalise the pattern of classification, we can draw a line or a curve (the blue line in the figure) that separates the majority of the plus points from the circles. This curve acts as a guide to the algorithm when classifying a point, and it is known as the decision boundary.
Now consider another data set whose points have the following distribution. If we take K = 1, the decision boundary will look something like the curve shown.
If a test point lands close to the circle at the centre, it will be classified as a circle rather than as a cross, even though crosses are in the majority around the centre. Such a model is known as an overfitted model.
In the other case, if we choose a very large value of K, we may end up classifying everything as a square, because the decision boundary tends towards the overall majority label, which in this case is the square. Such a model is known as an underfitted model. It is therefore important to select an optimal value of K.
Now we may ask: how can we calculate the perfect K?
The answer is that there is no step-by-step formula that yields the perfect value of K.
To find the value of K we have to work by approximation, trying candidate values and keeping the one that fits best.
Let’s understand this with an example
As usual, let’s start with importing the libraries and then reading the dataset:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
dataset = datasets.load_breast_cancer()
This data set contains measurements of breast tumours together with a label indicating whether each tumour is malignant or benign.
X_train, X_test, Y_train, Y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=0)
We then split the data into training and testing sets in an 80:20 ratio.
clf = KNeighborsClassifier()
clf.fit(X_train, Y_train)
Next we create a KNeighborsClassifier and fit it using the X_train and Y_train data.
clf.score(X_test, Y_test)
0.9385964912280702
Upon scoring the algorithm we get an accuracy of about 0.94.
By default, KNeighborsClassifier uses K = 5 (n_neighbors=5).
Now we might argue that, to improve the score, we could try a different value of K and rerun the model over the test data. In reality, however, that would be fitting the model to the test data, which defeats the purpose of having a test set.
As an alternative, we might train the algorithm on the training data and then pass that same training data back in as test data to decide the optimal value of K. In that case, though, we would get a misleadingly high accuracy on the training data (100% when K = 1) while the model could still fail and give low scores on the real test data.
Cross-Validation and K-Fold Cross-Validation
Now, to find the optimum value of K, we can take our training data and split it into two parts in a 60:40 ratio: we use 60% of the data to train our model and the remaining 40% of the training data to test and determine the value of K, without compromising the integrity of the test data. This process is known as cross-validation.
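As a minimal sketch of this idea (the 60:40 ratio follows the text above, while the candidate values of K tried here are arbitrary), we can carve a validation set out of X_train and compare a few values of K on it, assuming X_train and Y_train come from the earlier split:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# split the existing training data 60:40 into a smaller training set and a validation set
X_sub, X_val, Y_sub, Y_val = train_test_split(X_train, Y_train, test_size=0.4, random_state=0)

# try a few candidate values of K and score each on the validation set
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_sub, Y_sub)
    print(k, clf.score(X_val, Y_val))

The test data (X_test, Y_test) is never touched while choosing K.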
However, it is also true that the more training data is available, the better the model, and we have already lost 40% of our training data to validation. To overcome this, we use K-fold cross-validation.
Consider a box denoting our training data.
K-fold cross-validation says that we can split our training data into P parts, use P - 1 of those parts for training, and hold out the remaining part for testing.
In simpler words, we use one part at a time for testing and the rest for training: if the 1st part is used for testing, then parts 2 onwards are used for training; next, the 2nd part is used for testing and the remaining parts for training, and so on.
After computing the score on each of the parts, we take the average of those scores and treat it as the score for that particular value of K. A sketch of this procedure is shown below.
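As a rough illustration of this fold-by-fold procedure (using 5 parts and the default K = 5 purely as an example), sklearn's KFold can generate the splits and we can average the per-fold scores ourselves:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # P = 5 parts
k = 5                                                  # the value of K being evaluated
fold_scores = []

# each iteration holds out one part for testing and trains on the other P - 1 parts
for train_idx, test_idx in kf.split(X_train):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train[train_idx], Y_train[train_idx])
    fold_scores.append(clf.score(X_train[test_idx], Y_train[test_idx]))

# the average across the folds is the score for this value of K
print(np.mean(fold_scores))

In practice sklearn's cross_val_score wraps exactly this loop, which is what we use next.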
Let’s see this using an example
Continuing on the above example where the breast cancer dataset was chosen.
Now, to implement K-fold CV, we use a for loop that repeatedly scores the model for various values of K with cross_val_score and stores the results in arrays, so they are easy to plot using matplotlib.
from sklearn.model_selection import cross_val_score

x_axis = []
y_axis = []

for i in range(1, 26, 2):
    clf = KNeighborsClassifier(n_neighbors=i)
    # cross_val_score runs K-fold CV on the training data and returns one score per fold
    score = cross_val_score(clf, X_train, Y_train)
    x_axis.append(i)
    y_axis.append(score.mean())
We use only odd values of K because, if an unknown point had an equal number of label-A and label-B neighbours, it would be difficult to classify. Choosing an odd number of neighbours avoids such ties in the two-class case.
import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis)
plt.show()
We import matplotlib and plot how the cross-validation score varies with the value of K.
From this plot we see that the best accuracy occurs near K = 7, so we will take K to be 7.
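Rather than reading the value off the plot by eye, a small snippet like the following picks the K with the highest mean cross-validation score from the arrays we just filled:

import numpy as np

# index of the highest mean CV score, mapped back to the corresponding value of K
best_k = x_axis[int(np.argmax(y_axis))]
print(best_k)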
Now let’s try to implement this algorithm from scratch by ourselves
Using K-fold CV we found that the best result comes at K = 7, so let's first run sklearn's implementation with K = 7, scored on the same random state, to get a baseline.
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
Importing libraries and dataset from sklearn
dataset = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=0)
We load the dataset and split it into training and testing data in an 80:20 ratio. Since we will be implementing the algorithm ourselves, we split with a fixed random state so that we get the same split every time.
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, Y_train)
We create the classifier with the value of K set to 7 and then call fit to train the model.
clf.score(X_test, Y_test)
Upon scoring the algorithm we get a score of 0.94736842105263153.
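The from-scratch version below is only a sketch of the same idea, not a reference implementation: it uses Euclidean distance and a simple majority vote with K = 7, and the helper names (predict_one, knn_predict) are our own. Run on the same split, it should score close to the sklearn result above:

import numpy as np
from collections import Counter

def predict_one(x, X_ref, Y_ref, k=7):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((X_ref - x) ** 2, axis=1))
    # indices of the K nearest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbours
    return Counter(Y_ref[nearest]).most_common(1)[0][0]

def knn_predict(X_query, X_ref, Y_ref, k=7):
    return np.array([predict_one(x, X_ref, Y_ref, k) for x in X_query])

predictions = knn_predict(X_test, X_train, Y_train, k=7)
print(np.mean(predictions == Y_test))   # fraction of correct predictions on the test set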
Limitations of KNN
KNN is an extremely strong algorithm. It is also known as a “lazy learner”, because it defers all computation until prediction time. It does, however, have the following limitations:
- Doesn’t function well with huge datasets: Because KNN is a distance-based method, the cost of computing the distance between a new point and each old point is quite high, degrading the algorithm’s speed.
- Doesn’t function well with a large number of dimensions: Same as above. The cost of calculating distance grows expensive in higher dimensional space, affecting performance.
- Sensitive to outliers and missing data: Because KNN is sensitive to outliers and missing values, we must first impute missing values and remove outliers before using the KNN method.