K-nearest neighbor (k-NN) is a non-parametric technique that can be used for classification and regression predictive problems. However, it is more widely used in classification problems.
It is also a supervised learning algorithm. This means that the training data is known and the algorithm outputs a prediction based on what it has learned from past data.
In classification and regression cases, the input is made up of the k closest training data of the things being observed. The output is determined differently when k-NN is used for classification or regression:
Classification: Output will be a class membership. This means that an object is classified based on its neighbors. It is then assigned to a group that recurs among its k-nearest neighbors
Regression: Output will be a value. The value is found by averaging values of its k-nearest neighbors.
K-Nearest Neighbor vs. K-Means
K-nearest neighbor is not the same as k-means. However, the two are commonly confused. K-means is an unsupervised learning algorithm that is used for clustering problems. Unsupervised learning means the data is not known or labeled. Therefore, this technique groups observations into clusters of means.
How K-Nearest Neighbor Works
Now that we have an understanding of what
Step 1 – Input known training data. Then, cluster the data and label it based on its categories.
Step 2 – Add the new data that has an unknown category. The goal is to figure out which other data it is most similar to and call label it by that category.
Step 3 – Classify the data by looking at the nearest categorized data – or the nearest neighbor.
- If k is equal to one, then we only need to look at the first nearest neighbor.
- If it’s equal to nine, then we would use the nine nearest neighbors to classify the data.
- Lastly, if the nine nearest neighbors consist of multiple categories, we pick the nearest neighbor that occurs most frequently.
Typically, determining the value of k requires testing a few different values before making a choice. Do this by pretending some of the training data is unknown. Then, use the k-NN algorithm to classify the ‘unknown’ data and see if the new categories match what you already know. Low k values can be less accurate and noisy, so larger k values are better. However, don’t make it so large that a category with fewer data points will always lose to other categories.
An Example of K-Nearest Neighbor
Let’s look at a simple example that uses colored circles to explain the k-NN algorithm. Let’s say that this is what we know about the circles: 25 red circles, 18 green circles, and 21 blue circles. We input the training data. As a result, the circles are clustered and labeled based on their color.
Now, let’s input our unknown circle in the mix and for simplicity, assume k=2. Our unknown circle falls between the green cluster and the blue cluster. We would classify the circle as blue because there are more blue (21) circles than green (18) circles.
K-Nearest Neighbor in Action
If finding ‘things that are similar’ to ‘something else’ is what you need, k-NN is an algorithm worth getting to know. While it’s incredibly powerful, given the enormous scope of consideration, it is, at least for now, something where humans will continue to need to assert some opinions. That being said, investing some time in training a k-NN model for classification purposes can yield pretty powerful results when you’re looking to build segmentation groups ‘similar’ to your best customers, or even your best friends!