K-Nearest Neighbours
K-Nearest Neighbours is an extremely simple algorithm, where the prediction for a new instance
is the plurality class among its k closest neighbours.
To make it easy, suppose you want to go on a holiday and cannot decide between Italy and Paris. You ask 100 people about these options (unrealistic, I know, but bear with me), then pick the 5 people whose preferences are closest to yours and go with their majority opinion.
More formally, for a new instance Xi, its distance (Euclidean, Manhattan, Minkowski, etc.) to every point in the training data is calculated. The k points with the smallest distance to Xi are considered its 'closest neighbours'. The class that holds the majority among these k neighbours is the predicted output for Xi.
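To make the distance measures concrete, here is a minimal sketch of the three mentioned above. The function names and the two sample points are my own choices for illustration, not part of the worked example.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Manhattan distance: the Minkowski distance with p = 1."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical points with three variables each (X1, X2, X3)
point_a = (1, 2, 3)
point_b = (4, 6, 3)

print(euclidean(point_a, point_b))     # 5.0
print(manhattan(point_a, point_b))     # 7
print(minkowski(point_a, point_b, 2))  # 5.0
```

Euclidean and Manhattan are just special cases of Minkowski, so in practice the order p becomes one more choice to make alongside k.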
In the following example we have 15 observations and 3 variables: X1, X2, and X3. Decision is our dependent variable: Italy or Paris.
Now we have a new instance, so we calculate the Euclidean distance from this new instance to each of the 15 points.
Rank the observations in ascending order (last column).
Choose the 'k' closest neighbours, i.e. those with the smallest Euclidean distance (say for k = 5, we choose ranks 1-5).
Get the value of outcome variable for these 'k' neighbours (highlighted in orange).
Among these 'k = 5' neighbours, 3 observations have the value Italy and 2 have Paris. Thus, by majority voting, the prediction for the new instance is Italy.
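The steps above (compute distances, rank, keep the k closest, vote) can be sketched in a few lines of Python. Please note that the observations below are made up for illustration and are not the 15 rows of the example; the function name knn_predict is also my own. I have only arranged the made-up data so that the vote comes out 3 to 2, as in the walkthrough.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """Return the majority label among the k nearest training points, plus the vote counts."""
    # Step 1: Euclidean distance from the new instance to every observation
    distances = [(math.dist(features, new_point), label) for features, label in train]
    # Steps 2 and 3: rank in ascending order of distance and keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: collect the labels of those neighbours and take the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical observations: ((X1, X2, X3), Decision)
train = [
    ((4, 5, 4), "Italy"),
    ((5, 4, 4), "Paris"),
    ((3, 3, 4), "Italy"),
    ((4, 4, 6), "Italy"),
    ((6, 5, 4), "Paris"),
    ((1, 1, 1), "Paris"),
    ((8, 8, 8), "Paris"),
    ((7, 2, 1), "Italy"),
]
new_instance = (4, 4, 4)

prediction, votes = knn_predict(train, new_instance, k=5)
print(prediction)   # Italy
print(dict(votes))  # {'Italy': 3, 'Paris': 2}
```

With two classes, an odd k avoids a tied vote. If a tie does happen in this sketch, the label that was seen first among the nearest neighbours wins, which is an arbitrary choice on my part.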
Standardising the variables
The distance can be dominated by differences in the scales of the variables (say one is in the thousands while another always takes values below 100), so the variables are standardised to have mean 0 and variance 1 before KNN is applied.
In the above example, all of our variables took values between 1 and 8, so we did not standardise them, but in real life we usually standardise the data before applying KNN.
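Here is a minimal sketch of standardisation using only the Python standard library. The two columns (an income in the thousands and an age below 100) are invented for illustration, and the helper names fit_scaler and standardise are my own. The sketch assumes no column is constant, otherwise the standard deviation would be zero.

```python
import statistics

def fit_scaler(rows):
    """Compute the mean and (population) standard deviation of each column."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def standardise(row, means, stds):
    """Rescale one observation so each variable has mean 0 and variance 1."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

# Hypothetical training data: (income, age) on very different scales
rows = [(30000, 25), (50000, 35), (70000, 45)]

means, stds = fit_scaler(rows)
scaled = [standardise(row, means, stds) for row in rows]
print([[round(value, 3) for value in row] for row in scaled])
# [[-1.225, -1.225], [0.0, 0.0], [1.225, 1.225]]
```

After scaling, both variables contribute equally to the distance. A new instance should be rescaled with the same means and standard deviations that were computed from the training data, not with its own.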
How to choose the best value of 'k'
The value of 'k' can strongly affect the prediction, so it needs to be tuned as a hyperparameter. We can find the best value of 'k' by using K-Fold cross validation.
For more information on K-Fold cross validation, you can refer to this article: K-Fold cross validation made easy.
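As a rough sketch of how this tuning could look, the code below tries a few candidate values of 'k' and scores each one with 5-fold cross validation. Everything here is an assumption for illustration: the synthetic two-cluster data, the candidate values, the number of folds, and the helper names knn_predict and k_fold_accuracy. In a real project you would more likely reach for a library than write this by hand.

```python
import math
import random
from collections import Counter

def knn_predict(train, new_point, k):
    """Majority label among the k training points closest to new_point."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def k_fold_accuracy(data, k_neighbours, n_folds=5):
    """Average accuracy of KNN when each fold is held out once for testing."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += sum(knn_predict(train, x, k_neighbours) == y for x, y in held_out)
    return correct / len(data)

# Synthetic data: two noisy clusters, one per class (not real survey data)
random.seed(0)
data = [((random.gauss(2, 1), random.gauss(2, 1)), "Italy") for _ in range(30)]
data += [((random.gauss(5, 1), random.gauss(5, 1)), "Paris") for _ in range(30)]
random.shuffle(data)

# Odd candidates avoid tied votes when there are two classes
scores = {k: k_fold_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The 'k' with the highest cross-validated accuracy is the one we keep. If several values tie, this sketch simply keeps the smallest of them, since max returns the first maximum it finds.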