The popular K-Nearest Neighbors Algorithm is used for regression and classification in many applications such as recommender systems, image classification, and financial data forecasting. It is the basis of many advanced machine learning techniques (e.g. in information retrieval). There is no doubt that understanding KNN is an important building block of your proficient computer science education.
Watch the article as a video:
The K-Nearest Neighbors (KNN) Algorithm is a robust, extremely simple, and very popular machine learning method. It’s simple to implement but still a competitive and performant machine learning technique.
The simple idea is the following: The whole data set is your model.
Yes, you read that right. The machine learning model is nothing more than a set of observations. Every single instance of your training data is one part of your model.
A great advantage is that you can use the KNN Algorithm for prediction or classification – as you like. You execute the following strategy, given your input vector x.
- Find the K nearest neighbors of x (according to a predefined similarity metric).
- Aggregate the K nearest neighbors into a single “prediction” or “classification” value. You can use any aggregator function such as average, mean, max, min, etc.
That’s it. Very simple, isn’t it?
Check out the following graphic:
Suppose, your company sells homes for clients. It has acquired a large database of customers and experienced house prices.
One day, your client asks how much he can expect to pay for a house with 52 square meters. You query your KNN “model” and it immediately gives you the response $33,167. And indeed, your client finds a home for $33,489 the same week. How did the KNN system come to this surprisingly accurate prediction?
It simply calculated the K=3 nearest neighbors to the query “D=52 square meters” from the model with regards Euclidean distance. The three nearest neighbors are A, B, and C with prices $34,000, $33,500, and $32,000, respectively. In the final step, the KNN aggregates the three nearest neighbors by calculating the simple average. As K=3 in this example, we denote the model as “3NN”.
Of course, you can vary the similarity functions, the parameter K, and the aggregation method to come up with more sophisticated prediction models.
Another advantage of KNN is that it can be easily adapted as new observations are made. This is not generally true for any machine learning model. A weakness in this regard is obviously that the computational complexity becomes harder and harder, the more points you add. To accommodate for that, you can continuously remove “stale” values from the system.
As I mentioned above, you can also use KNN for classification problems. Instead of averaging over the K nearest neighbors, you can simply use a voting mechanism where each nearest neighbor votes for its class. The class with the most votes wins.
## Dependencies from sklearn.neighbors import KNeighborsRegressor import numpy as np ## Data (House Size (square meters) / Hous Price ($)) X = np.array([[35, 30000], [45, 45000], [40, 50000], [35, 35000], [25, 32500], [40, 40000]]) ## One-liner KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1)) ## Result & puzzle res = KNN.predict([]) print(res)
Let’s dive into how to use KNN in Python – in a single line of code.
Take a guess: what’s the output of this code snippet?
To help you see the result, let’s plot the housing data from the code:
Can you see the general trend? With growing size of your house, you can expect a linear growth of its market price. Double the square meters and the price will double, too.
In the code, the client requests your price prediction for a house with 30 square meters. What does KNN with K=3 (in short: 3NN) predict?
Beautifully simple, isn’t it? The KNN algorithm finds the three closest houses with respect to house size and averages the predicted house price as the average of the K=3 nearest neighbors.
Thus, the result is $32,500.
Maybe you were confused by the data conversion part within the one-liner. Let me quickly explain what happened here:
## One-liner KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1))
First, we create a new machine learning model called “KNeighborsRegressor”. If you would like to take KNN for classification, you would take the model “KNeighborsClassifier”.
Second, we “train” the model using the fit function with two parameters. The first parameter defines the input (the house size) and the second parameter defines the output (the house price). The shape of both parameters must be so that each observation is an array-like data structure. For example, you wouldn’t use “30” as an input but “”. The reason is that, in general, the input can be multi-dimensional rather than one-dimensional. Therefore, we reshape the input:
print(X[:,0]) "[35 45 40 35 25 40]"
If we would use this 1D NumPy array as an input to the fit function, the function would not work properly because it expects an array of (array-like) observations – and not an array of integers.
Therefore, we convert the array accordingly using the reshape function:
print(X[:,0].reshape(-1,1)) """ [     ] """
Now, we have six array-like observations. The negative index -1 in the reshape function call is our “laziness” expression: we want NumPy to determine the number of rows automatically – and only specify how many columns we need (i.e., 1 column).
Where to go from here?
Understanding algorithms is hard enough.
Why do so many people struggle with algorithms?
Yes, complexity may be an issue from time to time. But in so many cases, the real problem is a lack of your quick and confident understanding of the very basics of code.
Proof: have you ever observed that you can easily understand algorithms visually but not in code?
There is only one solution: master the basics until you don’t have to think about them. Only then can your brain handle the higher-level complexity of algorithms.
To help you achieve this, I invest most of my time and effort in creating the best free Python email course