The Best ‘K’ Classifier

For this reflection I read and analyzed a dataset about penguins in Antarctica, which can be found here: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data

The objective of this assignment was to build a KNN classifier for every k value from 1 to 40, compute the mean F1 score for each k, and find the best k. As with any other KNN classifier, I started by splitting the data into train and test sets, with a function for each. In the train function I used the StandardScaler class from sklearn to transform the data and set the number of neighbors (KNeighbors) to k, which in this case iterates from 1 to 40. In the test function I took the test split and also called StandardScaler to transform it. One difference between test and train is that the test function builds a predictions array that holds all of my predicted values, so that I can later compare the predictions with the actual values.
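The scale-then-classify step described above can be sketched without any libraries. This is a minimal from-scratch illustration, not the sklearn code used for the assignment: the helper names (fit_scaler, transform, knn_predict) are hypothetical, the measurements are made-up toy values, and it assumes the scaling statistics are computed from the training rows only.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Return per-column (means, stds) computed from the training rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def transform(rows, means, stds):
    """Standardize each value to (value - mean) / std."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_predict(train_X, train_y, test_X, k):
    """Predict each test row by majority vote among its k nearest training rows."""
    preds = []
    for x in test_X:
        nearest = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], x))[:k]
        votes = Counter(label for _, label in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds

# Toy rows: [culmen depth (mm), flipper length (mm)]. Values are invented.
train_X = [[17.0, 180.0], [18.0, 185.0], [17.5, 182.0],
           [14.0, 215.0], [15.0, 220.0], [14.5, 218.0]]
train_y = ["Dream", "Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"]
test_X = [[17.2, 183.0], [14.8, 217.0]]

means, stds = fit_scaler(train_X)
predictions = knn_predict(transform(train_X, means, stds), train_y,
                          transform(test_X, means, stds), k=3)
print(predictions)
```

Scaling matters here because flipper length spans tens of millimetres while culmen depth spans only a few, so an unscaled distance would be dominated by flipper length.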

Within the KNN function I pass in the k value, the data frame, the features to use for prediction, and the target value. In this case my features were the culmen depth in mm and the flipper length in mm, and the target value to predict was the island on which the penguin resided. The KNN function is essentially a for loop that iterates over the train and test indices of each fold and calls the test function, which returns the predictions for that fold. Those predictions, along with the true values, are passed into the evaluation function, which calculates the F1 score. The F1 score combines precision and recall into a single measure of how good the predictions are. As I said above, I wanted the mean of these F1 scores, so within the KNN function I assigned the result of the evaluation function to a variable, stored it in an array after each fold, and took the mean of the array after the loop.

To figure out the best k value, I created a list of k values from 1 to 40 and wrote a for loop that runs the KNN function once per k, using ten folds each time. After each run the loop compares the mean score with the highest one seen so far and keeps the larger, and at the very end it prints the best k and the best F1 score.
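The evaluation step can be illustrated with a small, library-free sketch. It assumes the F1 score is macro-averaged (one F1 per island, then the unweighted mean), which the write-up does not state. The function name macro_f1 and the per-fold labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Compute F1 per class from precision and recall, then average the classes."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Invented (true, predicted) island labels for two folds.
folds = [
    (["Biscoe", "Biscoe", "Dream", "Dream", "Torgersen", "Torgersen"],
     ["Biscoe", "Dream", "Dream", "Dream", "Torgersen", "Biscoe"]),
    (["Biscoe", "Dream", "Torgersen"],
     ["Biscoe", "Dream", "Torgersen"]),
]

# One score per fold, then the mean across folds for this k.
fold_scores = [macro_f1(y_true, y_pred) for y_true, y_pred in folds]
mean_f1 = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_f1)
```

In the first fold the per-island F1 scores are 0.5, 0.8, and 2/3, so one weak class pulls the macro average down even though most predictions are correct.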

Here is an example of the output:

KNN for k = 1 is: 0.6447385428322892
KNN for k = 2 is: 0.6096944091093646
KNN for k = 3 is: 0.587228533127487
KNN for k = 4 is: 0.6329413513901605
KNN for k = 5 is: 0.6399729864160182
KNN for k = 6 is: 0.6173035863111862
KNN for k = 7 is: 0.6479615526860656
KNN for k = 8 is: 0.6559180911112491
KNN for k = 9 is: 0.6355883552135018
KNN for k = 10 is: 0.6367094597526258
KNN for k = 11 is: 0.6601321836835505
KNN for k = 12 is: 0.6874331167593228
KNN for k = 13 is: 0.673744989312876
KNN for k = 14 is: 0.65714994642037
KNN for k = 15 is: 0.6587411579652839
KNN for k = 16 is: 0.6509181156347517
KNN for k = 17 is: 0.6773768997965366
KNN for k = 18 is: 0.6730539237034099
KNN for k = 19 is: 0.6763714345274077
KNN for k = 20 is: 0.6736193550832233
KNN for k = 21 is: 0.6685218227558851
KNN for k = 22 is: 0.6802164940267407
KNN for k = 23 is: 0.6828297467048432
KNN for k = 24 is: 0.6755306619112366
KNN for k = 25 is: 0.6778217033972156
KNN for k = 26 is: 0.6825008768259028
KNN for k = 27 is: 0.6798186745498981
KNN for k = 28 is: 0.6689349852542978
KNN for k = 29 is: 0.6784235175815404
KNN for k = 30 is: 0.6878112656848122
KNN for k = 31 is: 0.6863575370058603
KNN for k = 32 is: 0.6875116682480806
KNN for k = 33 is: 0.6768914742286175
KNN for k = 34 is: 0.6755678601425782
KNN for k = 35 is: 0.6842258588233825
KNN for k = 36 is: 0.6824704306576266
KNN for k = 37 is: 0.6812679573836695
KNN for k = 38 is: 0.6880329625844984
KNN for k = 39 is: 0.6882488474406613
KNN for k = 40 is: 0.675187658176038
The best value for k is: 39
The best f1 score is: 0.6882488474406613

So the best k value for this run was 39: the mean F1 score was highest when the KNN classifier used 39 neighbors.
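The comparison that picks the best k can be sketched as follows. The scores are a handful copied from the output above, and the variable names are hypothetical. Because the comparison is strict, the first k to reach the top score wins if there is a tie.

```python
# A few mean F1 scores copied from the run above, keyed by k.
mean_f1_by_k = {
    12: 0.6874331167593228,
    30: 0.6878112656848122,
    38: 0.6880329625844984,
    39: 0.6882488474406613,
    40: 0.675187658176038,
}

best_k = None
best_score = float("-inf")
for k, score in mean_f1_by_k.items():
    # Keep the running maximum and the k that produced it.
    if score > best_score:
        best_k, best_score = k, score

print(f"The best value for k is: {best_k}")
print(f"The best f1 score is: {best_score}")
```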