Algorithms From Scratch: K-Nearest Neighbors Classifier
To me, K-nearest neighbors is the most intuitive algorithm for classification. If you want to find what group a certain data point belongs to, look at the groups of the points around it and assign it the most common one.
Today, we’ll build a classifier using KNN that determines a flower’s species from among three options.
Iris Dataset
The Iris dataset is the simple go-to for classification problems. First import numpy, matplotlib, and load_iris. Then, load the dataset into “df”.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
df = load_iris()
First, let’s check out the dataset.
df.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
df.target_names
['setosa', 'versicolor', 'virginica']
df.feature_names
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
So, we’re gonna use the sepal length and width — the stiffer part below the petal — and the petal length and width to determine whether the flower is an iris setosa, versicolor, or virginica.
df.data holds the feature data we want, and df.target has the flower type.
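A quick look at the shapes, assuming the standard 150-sample Iris data from sklearn:
print(df.data.shape)   # (150, 4): 150 flowers, 4 features each
print(df.target[:5])   # [0 0 0 0 0]: species encoded as integers 0, 1, 2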
X = df.data  # the feature matrix used by the plots below
figure, axis = plt.subplots(2, 2, figsize=(10,10))
axis[0,0].scatter(X[:,0],X[:,1])
axis[0,0].set_title("Sepal Length x Sepal Width")
axis[0,0].set_xlabel("Sepal Length")
axis[0,0].set_ylabel("Sepal Width")
axis[0,1].scatter(X[:,2],X[:,3])
axis[0,1].set_title("Petal Length x Petal Width")
axis[0,1].set_xlabel("Petal Length")
axis[0,1].set_ylabel("Petal Width")
axis[1,0].scatter(X[:,0],X[:,2])
axis[1,0].set_title("Sepal Length x Petal Length")
axis[1,0].set_xlabel("Sepal Length")
axis[1,0].set_ylabel("Petal Length")
axis[1,1].scatter(X[:,1],X[:,3])
axis[1,1].set_title("Sepal Width x Petal Width")
axis[1,1].set_xlabel("Sepal Width")
axis[1,1].set_ylabel("Petal Width")
plt.show()
As you can see, some of these plots show two very distinct groups. We have three separate species, though, so the data isn’t as cleanly separable as it first looks.
Running a correlation matrix on this data gives some more insight.
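I don’t have the original figure here, but a minimal sketch of that check with numpy (np.corrcoef is my choice of tool, not necessarily what the original used):
corr = np.corrcoef(df.data, rowvar=False)  # 4x4 correlation between the features
print(np.round(corr, 2))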
If you read through the documentation on this dataset from sklearn’s website, you’ll see that “one class is linearly separable from the other 2; the latter are NOT linearly separable from each other.” KNN, however, is not a linear classifier.
KNN
To be more precise about K-nearest neighbors, let’s break it down into steps for one individual data point.
- Create an empty array to store the distance from our test point to each training point and its corresponding index
- Loop through the training set
- Calculate the distance between our point and the training point
- Store that distance and the training point’s index in the array from step 1
- Once done looping, sort the array by distance in ascending order
- Isolate the k nearest neighbors
- Get the corresponding flower classes from the label array
- Find the most commonly occurring class and return it
That’s really all there is to making a classification.
Code
The first thing I want to do is prepare the dataset. I want to keep some test points to try our KNN on, so I’m gonna use train_test_split from sklearn.model_selection with a test size of .2 on df.data and df.target.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.data,df.target,test_size=.2,train_size=.8, random_state=42, shuffle=True)
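To sanity-check the split sizes (assuming the standard 150-sample dataset, this gives a 120/30 split):
print(X_train.shape, X_test.shape)  # expected: (120, 4) (30, 4)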
Now onto the KNN class.
class KNN:
    def __init__(self, x_train, y_train, x_test, y_test, num_neighbors):
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.num_neighbors = num_neighbors
Now to create an individual classifier. Following the steps above, we first create the empty array and then loop through the training set. Using enumerate to get both the index and value easily, I take the Euclidean distance between our individual point and the training example, then append that to the distance array along with the index.
dist_and_ind = []
for ix, ex in enumerate(self.x_train):
    dist = euclidean_distance(x_test_individual, ex)
    dist_and_ind.append([dist, ix])
By the way, this is the Euclidean distance function; it’s very simple.
def euclidean_distance(x1, x2):
    x1 = np.array(x1)
    x2 = np.array(x2)
    return np.sqrt(np.sum(np.square(x1 - x2)))
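A quick sanity check with made-up points:
print(euclidean_distance([0, 0], [3, 4]))  # 5.0, the classic 3-4-5 right triangle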
Now we’ve got all of our distances for that test point. To find the closest points, I use np.argsort on the distance column of dist_and_ind. Take the first num_neighbors rows of the sorted array and keep just the indices.
dist_and_ind = np.array(dist_and_ind)
dist_and_ind = dist_and_ind[dist_and_ind[:,0].argsort()]
desired_indices = dist_and_ind[:self.num_neighbors,1].astype(np.int32)
Finally, we can use those indices of the close points to find their corresponding flower classes. We want the most commonly occurring class, and rounding the mean of the neighbor labels works as a quick stand-in for that here (more on this below).
return np.round(np.mean(self.y_train[desired_indices]))
That looks good! Here’s the full thing.
def classify_individual(self, x_test_individual):
    # Distance from the test point to every training point, plus that point's index
    dist_and_ind = []
    for ix, ex in enumerate(self.x_train):
        dist = euclidean_distance(x_test_individual, ex)
        dist_and_ind.append([dist, ix])
    # Sort by distance (column 0) and keep the indices of the k closest training points
    dist_and_ind = np.array(dist_and_ind)
    dist_and_ind = dist_and_ind[dist_and_ind[:,0].argsort()]
    desired_indices = dist_and_ind[:self.num_neighbors,1].astype(np.int32)
    # Rounded mean of the neighbors' labels stands in for a majority vote
    return np.round(np.mean(self.y_train[desired_indices]))
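One aside: rounding the mean only approximates a true majority vote. With labels 0, 1, and 2, neighbors like [0, 0, 2, 2, 2] average to 1.2 and round to class 1, even though class 2 has the most votes. It works on this dataset, but if you want an actual vote, here’s a sketch of my own variation using np.bincount (not part of the original class):
def majority_vote(neighbor_labels):
    # Count occurrences of each class and return the most common one;
    # ties go to the lower class index (bincount/argmax default).
    labels = np.asarray(neighbor_labels, dtype=np.int64)
    return np.bincount(labels).argmax()

# Inside classify_individual, this would replace the rounded mean:
#     return majority_vote(self.y_train[desired_indices])
print(majority_vote([0, 0, 2, 2, 2]))  # 2, whereas round(mean) gives 1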
All that’s left to do is apply this function to all the test points.
def classify(self):
    pred_labels = []
    for i in self.x_test:
        pred_label = self.classify_individual(i)
        pred_labels.append(pred_label)
    return pred_labels
Here’s the full class:
class KNN:
    def __init__(self, x_train, y_train, x_test, y_test, num_neighbors):
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.num_neighbors = num_neighbors

    def classify_individual(self, x_test_individual):
        # Distance from the test point to every training point, plus that point's index
        dist_and_ind = []
        for ix, ex in enumerate(self.x_train):
            dist = euclidean_distance(x_test_individual, ex)
            dist_and_ind.append([dist, ix])
        # Sort by distance and keep the indices of the k closest training points
        dist_and_ind = np.array(dist_and_ind)
        dist_and_ind = dist_and_ind[dist_and_ind[:,0].argsort()]
        desired_indices = dist_and_ind[:self.num_neighbors,1].astype(np.int32)
        return np.round(np.mean(self.y_train[desired_indices]))

    def classify(self):
        pred_labels = []
        for i in self.x_test:
            pred_label = self.classify_individual(i)
            pred_labels.append(pred_label)
        return pred_labels

    def error(self):
        # Fraction of test points whose predicted label doesn't match the true label
        pred_labels = self.classify()
        return np.mean(pred_labels != self.y_test)
The last error function just determines the fraction of wrong guesses.
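As a reminder of why the mean of the comparison gives that fraction, with made-up labels:
pred = np.array([0, 1, 2, 2])
true = np.array([0, 1, 1, 2])
print(np.mean(pred != true))  # 0.25: one wrong guess out of four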
The Moment of Truth
Initialize our KNN object.
knn = KNN(X_train,y_train,X_test,y_test,5)
Run the error.
knn.error()
0.0
Success! Out of 30 test points, our KNN classifier didn’t make a single mistake, even with those tricky non-linear groups that seemed difficult to separate.
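If you want to double-check against a library implementation, scikit-learn’s KNeighborsClassifier with the same k should land in the same place on this split (my own sanity check, not from the original post):
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(1 - clf.score(X_test, y_test))  # error rate on the same 30 test points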