In machine learning, classification is the task of assigning a label (class) to a data point based on its input variables. A dataset with known labels (the training dataset) is used to train a model so that the model can predict labels for data that has not yet been labeled.
Classification problems fall into two types:
- Binary Classification
- Multi-Class Classification
Here let’s discuss Binary classification in detail.
Binary classification assigns each element to one of exactly 2 groups (also called classes or categories) using a classification algorithm. Examples of binary classification include filtering email as spam or not spam, and deciding whether a medical test result is positive or negative.
Algorithms that support binary classification include:
- Logistic Regression
- K-Nearest Neighbors
- Decision Trees
- Support Vector Machines
- Naive Bayes
Certain algorithms are not specific to binary classification and can also be used for multi-class classification. Support Vector Machines (SVM), for example.
The statistical counterpart of binary classification is the binomial probability distribution.
Bernoulli Probability Distribution:
Random Variable: a numerical description of the outcome of a statistical experiment.
For a simple coin toss, the outcome can be either heads or tails.
We can define a random variable X
X(Heads) = 1
X(Tails) = 0
The assignment could equally be reversed: X(Heads) = 0 and X(Tails) = 1.
Here X can be either 1 or 0.
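This random variable is easy to simulate. Below is a minimal sketch (not from the original post) that uses NumPy to draw coin tosses, with 1 for heads and 0 for tails; the seed is fixed only so the run is reproducible:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Simulate 10 fair coin tosses: X = 1 for heads, 0 for tails
p_heads = 0.5
tosses = rng.binomial(n=1, p=p_heads, size=10)

print(tosses)         # an array containing only 0s and 1s
print(tosses.mean())  # empirical fraction of heads
```

With many more tosses, the empirical fraction of heads approaches p_heads = 0.5.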
Bernoulli Trial: an experiment that can have only 2 outcomes, i.e. Success or Failure.
"Success" ordinarily suggests winning, but in a Bernoulli trial it simply refers to the required condition, the outcome we are looking for.
For example, a student either passes an exam or fails it. If we define a random variable X with X(Pass) = 1 and X(Fail) = 0, then for the student, X = 1 is the desired condition.
Bernoulli Distribution: the distribution of the outcome of a single Bernoulli trial. It is a special case of the binomial distribution in which there is only one trial.
Binomial Distribution: the distribution of the number of successes when more than one Bernoulli trial occurs.
For a random variable X:
X ~ Binomial(n, p), where n is a positive integer and 0 ≤ p ≤ 1
Here n is the number of Bernoulli trials, and p is the probability of success for each trial.
In mathematical terms, the probability of observing exactly k successes in the n trials is:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

where C(n, k) = n! / (k! (n − k)!) counts the ways to choose which k of the n trials are successes.
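This probability can be computed directly with the standard library; the sketch below (the function name and the n = 5, p = 0.5 example are my own) evaluates the binomial formula:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 successes in 5 fair-coin trials
print(binom_pmf(3, 5, 0.5))  # 0.3125
```

Summing binom_pmf(k, n, p) over k = 0 … n gives 1, as it must for a probability distribution.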
Traditionally, binary classification follows this pattern: the model estimates the probability of success for each input and, based on that probability, assigns it to Category 1 (Success) or Category 2 (Failure).
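As a minimal sketch of this thresholding step (the probabilities and the 0.5 cutoff below are illustrative assumptions, not taken from the original example):

```python
import numpy as np

# Hypothetical predicted probabilities of success from some model
probs = np.array([0.91, 0.33, 0.74, 0.08])

# Assign Category 1 (success) when P(success) >= 0.5, else Category 2 (failure)
labels = (probs >= 0.5).astype(int)
print(labels)  # [1 0 1 0]
```

Sklearn classifiers expose these probabilities via `predict_proba` and apply a 0.5-style cutoff internally in `predict`.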
Evaluation Metrics for Binary Classification:
The predicted values can be divided into True Positives, True Negatives, False Positives, and False Negatives.
True Positives (TP): the model predicts success for a particular case, and it really is a success.
True Negatives (TN): the model predicts failure for a particular case, and it really is a failure.
False Positives (FP): the model predicts success for a particular case, but it is actually a failure.
False Negatives (FN): the model predicts failure for a particular case, but it is actually a success.
Just to note, "success" and "failure" here refer to the Bernoulli-trial framing, not to any actual achievement: the output is simply Category 1 or "not" Category 1 (which is Category 2).
After getting these values, accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
A confusion matrix is created to organize and analyze these four counts.
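A short sketch of both ideas, using hypothetical true and predicted labels (the label values below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels (1 = success, 0 = failure)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                        # 0.75
print(accuracy_score(y_true, y_pred))  # same value, computed by sklearn
```

The manual formula and `accuracy_score` agree, which is a useful sanity check when reporting results.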
Let's understand this properly with an example and sample code:
Loan Status Prediction:
To check and run the code: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction
As usual, let’s start with importing the libraries and then reading the dataset:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

```python
loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')
```
This dataset requires some preprocessing because it contains text categories instead of numbers, as well as some null values.
```python
# dropping the rows with missing values
loan_dataset = loan_dataset.dropna()

# numbering the labels: N -> 0, Y -> 1
loan_dataset = loan_dataset.replace({'Loan_Status': {'N': 0, 'Y': 1}})

# replacing the value of 3+ (in Dependents) with 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)

# convert the remaining categorical columns to numerical values
# (one consistent integer encoding; the exact mapping is a choice)
loan_dataset = loan_dataset.replace({'Gender': {'Male': 1, 'Female': 0},
                                     'Married': {'Yes': 1, 'No': 0},
                                     'Education': {'Graduate': 1, 'Not Graduate': 0},
                                     'Self_Employed': {'Yes': 1, 'No': 0},
                                     'Property_Area': {'Urban': 2, 'Semiurban': 1, 'Rural': 0}})
```
Here, the first line drops the rows with NULL values; then the Y and N labels (representing Yes and No) are replaced by 1 and 0, and the other categorical values are likewise mapped to numbers.
```python
# separating the data and the label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'], axis=1)
Y = loan_dataset['Loan_Status']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
```
As seen in the code, the dataset is split into training and testing sets.
The test_size=0.25 means that 75% of the dataset will be used for training and 25% for testing.
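As a quick sanity check on those proportions, here is a self-contained sketch with a dummy DataFrame (the real X and Y come from the cells above; the dummy columns are my own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 dummy rows stand in for the real loan data
X = pd.DataFrame({'a': range(100), 'b': range(100)})
Y = pd.Series([0, 1] * 50)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
print(len(X_train), len(X_test))  # 75 25
```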
```python
model = LogisticRegression()
model.fit(X_train, Y_train)
```

Sklearn makes it very easy to train the model: a single `fit` call is all that is required.
Evaluating the model:
```python
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
```
This shows that the accuracy on the training data is about 82%.
```python
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
```
This shows that the accuracy on the test data is about 78.3%.
```python
# input_data holds the feature values for a single applicant,
# in the same column order as X

# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
```
Here, when new input data is given to the trained model, it outputs whether the loan is approved.