Under supervised learning, there is a type of problem called classification. Classification algorithms recognize which category a new observation belongs to based on the training dataset. In supervised learning there are independent variables and a dependent variable. Here, the dependent variable is the category, and the features of each observation are the independent variables. The categories are distinct and pre-defined, such as True or False: is the email "spam" or not, does the picture contain trees or not, and so on.
[Figure: a successful classification, with the circles and triangles separated into distinct classes based on their features.]
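As a minimal sketch with made-up values (the feature names and numbers below are purely illustrative), the features of each observation are the independent variables and its category label is the dependent variable:

# Hypothetical spam example: each row of features describes one email,
# and the label is the pre-defined category it belongs to.
features = [[620, 1],   # email length, contains-a-link flag (made-up values)
            [45, 0],
            [310, 1]]
labels = ["spam", "not spam", "spam"]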
The most common classification applications include speech recognition, handwriting recognition, and face detection. These can be binary classification or multi-class classification problems.
Examples of classification:
Classification of Email into “spam” and “not spam.”
Classification of vegetables and groceries
Classification is a two-step process: training the model, then testing its accuracy. First, a classifier is built from the training dataset. The classifier analyzes the data and the associated labels, and from this analysis it derives prediction rules.
Second, these rules are tested on unseen data, also called testing data, and the accuracy of the classifier is calculated. If the accuracy on the test data is poor, the model is adjusted and retrained until its performance is acceptable.
The accuracy is the percentage of the dataset that the classifier labels correctly.
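For instance, accuracy can be computed by hand as the fraction of correct predictions; here is a small sketch with made-up labels:

# Hypothetical true labels and classifier predictions
y_true = ["spam", "not spam", "spam", "spam"]
y_pred = ["spam", "not spam", "not spam", "spam"]
correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))   # 3/4 = 0.75, i.e. 75% accuracy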
Note: before blindly applying a classifier to a dataset, check the ratio of each category in the training data; if any category is underrepresented, the learned rules may be skewed toward the majority class. In the ideal case, the dataset contains a roughly equal amount of data for each category.
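A quick way to spot such an imbalance (using hypothetical labels) is simply to count the categories:

from collections import Counter

# Hypothetical, heavily imbalanced labels: rules learned from this data
# would be skewed toward the "spam" class.
labels = ["spam"] * 90 + ["not spam"] * 10
print(Counter(labels))   # Counter({'spam': 90, 'not spam': 10})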
Let’s take an example with code to understand this better.
SONAR Dataset: Rocks vs. Mines Classification
SONAR is a device that uses sound waves to detect objects in the ocean; the pattern of frequencies in the returned signal helps identify what kind of object was detected.
The dataset used here contains 60 columns of frequency readings captured by the SONAR, plus one column indicating whether the object is a rock (R) or a mine (M).
This information is crucial for naval vessels: if the object is a rock, it is safe to pass over it, but if it is a mine, it is potentially deadly.
The first step in every Python project is importing the required modules.
Importing Python Modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Here NumPy and pandas are used to read the data and help with preprocessing.
scikit-learn (sklearn) is the library that provides train_test_split, LogisticRegression, and accuracy_score.
train_test_split, as the name suggests, is used to split the dataset into training and testing data.
Logistic Regression is the classification algorithm being used in this example. (More detail on this in further articles.)
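At a high level, logistic regression passes a weighted sum of the features through the sigmoid function to get a probability between 0 and 1; the sketch below illustrates the idea and is not the library's internal implementation:

import numpy as np

def sigmoid(z):
    # maps any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# hypothetical weighted sums of features (w . x + b)
print(sigmoid(2.0))    # ~0.88, leans toward the positive class
print(sigmoid(-1.5))   # ~0.18, leans toward the negative class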
Reading Data:
sonar_data = pd.read_csv('../input/connectionist-bench-sonar-mines-vs-rocks/sonar.all-data.csv', header=None)
pandas reads the data from the CSV file; header=None is passed because the file has no header row.
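It is worth taking a quick look at what was loaded; for example:

print(sonar_data.shape)   # expected (208, 61): 60 feature columns plus 1 label column
print(sonar_data.head())  # first few rows of the raw data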
sonar_data[60].value_counts()
As mentioned above, it is important to check whether there is enough data for each category.
Output:
M    111
R     97
This shows that 111 of the 208 rows are in the mine category and 97 are in the rock category.
So roughly 46% of the data is for rocks and 54% for mines, which is balanced enough to move forward.
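Those percentages can be reproduced directly from the counts:

counts = sonar_data[60].value_counts()
print(counts / counts.sum() * 100)   # roughly 53% for M and 47% for R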
Next, some light preprocessing: the label column (column 60) is separated out as Y, and the first row is dropped from both X and Y.
# separate the features (X) from the labels in column 60 (Y)
X = sonar_data.drop(columns=60, axis=1)
Y = sonar_data[60]
# drop the first row from both
X = X.drop([0])
Y = Y.drop([0])
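A quick sanity check after this step (the expected shapes assume the 208-row dataset described above):

print(X.shape)   # expected (207, 60) after dropping one row
print(Y.shape)   # expected (207,)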
Train Test Split:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
As seen in the code, the dataset is split into training and testing datasets. test_size=0.25 means that 75% of the dataset will be used for training and 25% for testing, and random_state=1 makes the split reproducible.
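Because the classes are slightly imbalanced, one common refinement (not used in this example) is to stratify the split so that the train and test sets keep the same mine/rock ratio:

# stratify=Y preserves the category proportions in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=1)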
Training the model:
model = LogisticRegression()
model.fit(X_train, Y_train)
scikit-learn makes it very easy to train the model; a single call to fit() does the job.
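One practical note: if scikit-learn warns that the solver did not converge (which can happen with the default settings), raising the iteration limit is a common fix:

# max_iter defaults to 100; a larger value gives the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)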
Checking Accuracy:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
This code checks the model's accuracy on the training dataset.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
And this code checks the accuracy on the testing dataset.
An accuracy of about 83% on the training dataset and 80% on the testing dataset is pretty good for a basic classification algorithm.
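Printing both numbers side by side makes the comparison easy; a much larger training accuracy than testing accuracy would suggest overfitting:

print('Accuracy on training data:', training_data_accuracy)
print('Accuracy on test data:', test_data_accuracy)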
Now let’s add some random data and check what the model predicts.
input_data = (0.0307,0.0523,0.0653,0.0521,0.0611,0.0577,0.0665,0.0664,0.1460,0.2792,0.3877,0.4992,0.4981,0.4972,0.5607,0.7339,0.8230,0.9173,0.9975,0.9911,0.8240,0.6498,0.5980,0.4862,0.3150,0.1543,0.0989,0.0284,0.1008,0.2636,0.2694,0.2930,0.2925,0.3998,0.3660,0.3172,0.4609,0.4374,0.1820,0.3376,0.6202,0.4448,0.1863,0.1420,0.0589,0.0576,0.0672,0.0269,0.0245,0.0190,0.0063,0.0321,0.0189,0.0137,0.0277,0.0152,0.0052,0.0121,0.0124,0.0055)
# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)
Here some sample input data was taken and reshaped with NumPy into a single-row array so it can be fed to the model.
For this input the model predicts a mine, meaning the ship is near a mine and should not move over that location.
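Since the labels in this dataset are 'M' for mine and 'R' for rock, the prediction can be turned into a readable message:

# prediction is an array with one label for the single input instance
if prediction[0] == 'M':
    print('The object is a mine, do not pass over it!')
else:
    print('The object is a rock, it is safe to pass.')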
To summarize: a logistic regression model was trained on the training dataset and evaluated on the testing dataset, achieving roughly 83% accuracy on training data and 80% on testing data, and given new input data it can predict whether the object is a mine or a rock.