Classification is a type of supervised learning. Classification algorithms recognise which category a new observation belongs to based on the training dataset. In supervised learning there are independent variables and a dependent variable. Here, the dependent variable is the category, and each category's features are the independent variables. The categories are distinct and pre-defined, such as True or False: is the email "spam" or "not", does the picture contain "trees" or "not", etc.
The diagram above shows a case of successful classification, where the circles and triangles fall into separate classes based on their features.
Common examples of classification problems are speech recognition, handwriting recognition and face detection. These can be binary classification or multi-class classification problems.
Classification is a two-step process: first training the model, then testing its accuracy.
In the first step, a classifier is built from the training dataset. The classifier analyses the dataset and its associated labels, and from that analysis it derives rules for prediction.
In the second step, these rules are tested on unseen data, also called the testing data. The accuracy of the classifier is then calculated; if the accuracy on the test data is not good enough, the model is retrained until it is optimised.
The accuracy is determined by checking the percentage of samples in the dataset that the classifier labels correctly.
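For intuition, here is a minimal sketch of that calculation (the accuracy helper below is illustrative, not part of the article's code; it mirrors what sklearn's accuracy_score does):

def accuracy(y_true, y_pred):
    # count how many predictions match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(['M', 'R', 'M', 'R'], ['M', 'R', 'R', 'R']))   # 0.75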
Note: Before blindly applying a classifier to a dataset, check the ratio of each category in the training dataset; if any category is under-represented, the learned rules may be skewed towards the majority category. In an ideal case, the dataset contains a roughly equal amount of data for each category.
Let’s take an example with code to understand this better.
SONAR Dataset: Rocks Vs Mine Classification
To run and check the code:
Notebook Link: https://www.kaggle.com/code/tanavbajaj/sonar-rock-vs-mine-classifier
SONAR is a device that uses sound waves to detect objects in the ocean. An object is identified based on the different frequencies it reflects.
The dataset here contains 60 frequency readings detected by the SONAR and one column indicating whether the object is a rock or a mine.
For ships, especially in wartime, this information is crucial: if a rock is present at a location it is safe to pass over it, while a mine is potentially dangerous.
For every Python project, the first step is to import the required modules.
Importing Python Modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Here, NumPy and Pandas are used to read the data and help with preprocessing.
Sklearn is the library that provides train_test_split, LogisticRegression and accuracy_score.
train_test_split, as the name suggests, is used to split the dataset into training and testing data.
Logistic Regression is the classification algorithm being used in this example. (More detail on this in further articles.)
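In brief, logistic regression turns a weighted sum of the input features into a probability using the sigmoid function, and assigns the class based on whether that probability crosses a threshold (0.5 by default). A minimal sketch of the sigmoid, just for intuition:

import numpy as np

def sigmoid(z):
    # squashes any real-valued score into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))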
Reading Data:
sonar_data = pd.read_csv('../input/connectionist-bench-sonar-mines-vs-rocks/sonar.all-data.csv', header=None)
Pandas is used to read the data from the csv file.
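As a quick optional sanity check (not part of the original code), the shape of the DataFrame can be inspected; per the description above, it should have 208 rows and 61 columns (60 frequency readings plus the label column):

# expected shape: (208, 61)
print(sonar_data.shape)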
sonar_data[60].value_counts()
As mentioned above, it is important to check whether there is enough data for each category.
Output:
M    111
R     97
This shows that 111 of the 208 rows belong to the mine category and 97 to the rock category.
So roughly 46% of the data is rocks and 54% is mines, which is balanced enough to move forward.
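These ratios can also be read off directly with pandas' normalize flag; an optional one-liner that is not in the original notebook:

# fractions instead of raw counts; roughly 0.53 for M and 0.47 for R
sonar_data[60].value_counts(normalize=True)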
Some preprocessing is done next: the label column (column 60) is separated out as Y, and an unnecessary row is dropped from both X and Y.
X = sonar_data.drop(columns=60, axis=1)
Y = sonar_data[60]
X = X.drop([0])
Y = Y.drop([0])
Train Test Split:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
As seen in the code, the dataset is split into training and testing datasets.
test_size=0.25 means that 75% of the dataset will be used for training and 25% for testing.
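An optional check (not in the original code) to confirm the split sizes:

# the training set should hold about 75% of the rows, the test set about 25%
print(X_train.shape, X_test.shape)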
Training the model:
model = LogisticRegression()
model.fit(X_train, Y_train)
Sklearn makes it very easy to train the model; only one line of code is required to do so.
Checking Accuracy:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
This code checks the accuracy on the training dataset.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
This code checks the accuracy on the testing dataset.
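To display both scores, the values computed above can simply be printed:

print('Accuracy on training data:', training_data_accuracy)
print('Accuracy on test data:', test_data_accuracy)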
An accuracy of 83% for training and 80% for testing datasets is pretty good for a basic classification algorithm.
Now let's add some random data and check what the model predicts.
input_data = (0.0307,0.0523,0.0653,0.0521,0.0611,0.0577,0.0665,0.0664,0.1460,0.2792,0.3877,0.4992,0.4981,0.4972,0.5607,0.7339,0.8230,0.9173,0.9975,0.9911,0.8240,0.6498,0.5980,0.4862,0.3150,0.1543,0.0989,0.0284,0.1008,0.2636,0.2694,0.2930,0.2925,0.3998,0.3660,0.3172,0.4609,0.4374,0.1820,0.3376,0.6202,0.4448,0.1863,0.1420,0.0589,0.0576,0.0672,0.0269,0.0245,0.0190,0.0063,0.0321,0.0189,0.0137,0.0277,0.0152,0.0052,0.0121,0.0124,0.0055)
# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)
Here, the random input data is converted to a NumPy array and reshaped so that it can be used as input for the model, which expects a 2-D array with one row per instance.
For this input the model predicts a mine, meaning a ship at that location should not move over it.
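Since the raw prediction is just the label 'R' or 'M' (matching the label column seen earlier), a short illustrative block, not part of the original code, can turn it into a readable message:

if prediction[0] == 'R':
    print('The object is a Rock')
else:
    print('The object is a Mine')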
To summarise: a logistic regression model was trained on the training dataset and evaluated on the testing dataset, achieving an accuracy of 83% on training data and 80% on testing data, and it can predict from new input data whether the object is a mine or a rock.