In machine learning, classification is the task of assigning a label (class) to a data point based on its input variables. A dataset with known labels (the training dataset) is used to train a model so that the model can predict labels for data that has not yet been labeled.
Classification problems fall into two types:
- Binary Classification
- Multi-Class Classification
Here let’s discuss Binary classification in detail.
Binary classification assigns each element to one of exactly 2 groups (also called classes or categories) using a classification algorithm. Examples of binary classification include filtering email as spam or not spam, and deciding whether a medical test result is positive or negative.
Algorithms that support binary classification include:
- Logistic Regression
- K-Nearest Neighbors
- Decision Trees
- Support Vector Machines
- Naive Bayes
Certain algorithms are not specific to binary classification and can also be used for multi-class classification. Support Vector Machines (SVM), for example.
The statistical counterpart of binary classification is the binomial probability distribution.
Bernoulli Probability Distribution:
Random Variable: a numerical description of the outcome of a statistical experiment.
For a simple coin toss, the outcome can be either heads or tails.
We can define a random variable X
X(Heads) = 1
X(Tails) = 0
The assignment could equally be reversed: X(Heads) = 0 and X(Tails) = 1.
Here X can be either 1 or 0.
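This random variable is easy to simulate. Below is a minimal sketch (not from the original post) that uses NumPy to draw coin tosses, with 1 for heads and 0 for tails; the seed is fixed only so the run is reproducible:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Simulate 10 fair coin tosses: X = 1 for heads, 0 for tails
p_heads = 0.5
tosses = rng.binomial(n=1, p=p_heads, size=10)

print(tosses)         # an array containing only 0s and 1s
print(tosses.mean())  # empirical fraction of heads
```

With many more tosses, the empirical fraction of heads approaches p_heads = 0.5.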
Bernoulli Trial: an experiment that can have only 2 outcomes, i.e. Success or Failure.
"Success" ordinarily suggests winning, but in a Bernoulli trial it simply refers to the required condition, the outcome we are looking for.
For example, a student either passes an exam or fails it. If we define a random variable X with X(Pass) = 1 and X(Fail) = 0, then for the student, X = 1 is the desired condition.
Bernoulli Distribution: the distribution of the outcome of a single Bernoulli trial. It is a special case of the binomial distribution in which there is only one trial.
Binomial Distribution: the distribution of the number of successes when more than one Bernoulli trial occurs.
For a random variable X:
X ~ Binomial(n, p), where n is a positive integer and 0 ≤ p ≤ 1
Here n is the number of Bernoulli trials, and p is the probability of success for each trial.
In mathematical terms, the probability of observing exactly k successes in the n trials is:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

where C(n, k) = n! / (k! (n − k)!) counts the ways to choose which k of the n trials are successes.
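This probability can be computed directly with the standard library; the sketch below (the function name and the n = 5, p = 0.5 example are my own) evaluates the binomial formula:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 successes in 5 fair-coin trials
print(binom_pmf(3, 5, 0.5))  # 0.3125
```

Summing binom_pmf(k, n, p) over k = 0 … n gives 1, as it must for a probability distribution.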
Traditionally, binary classification follows this pattern: the model estimates the probability of success for each input and, based on that probability, assigns it to Category 1 (Success) or Category 2 (Failure).
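As a minimal sketch of this thresholding step (the probabilities and the 0.5 cutoff below are illustrative assumptions, not taken from the original example):

```python
import numpy as np

# Hypothetical predicted probabilities of success from some model
probs = np.array([0.91, 0.33, 0.74, 0.08])

# Assign Category 1 (success) when P(success) >= 0.5, else Category 2 (failure)
labels = (probs >= 0.5).astype(int)
print(labels)  # [1 0 1 0]
```

Sklearn classifiers expose these probabilities via `predict_proba` and apply a 0.5-style cutoff internally in `predict`.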
Evaluation Metrics for Binary Classification:
The predicted values can be divided into True Positives, True Negatives, False Positives, and False Negatives.
True Positives (TP): the model predicts success for a particular case, and it really is a success.
True Negatives (TN): the model predicts failure for a particular case, and it really is a failure.
False Positives (FP): the model predicts success for a particular case, but it is actually a failure.
False Negatives (FN): the model predicts failure for a particular case, but it is actually a success.
Just to note, "success" and "failure" here refer to the Bernoulli-trial framing, not to any actual achievement: the output is simply Category 1 or "not" Category 1 (which is Category 2).
After getting these values, accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
A confusion matrix is created to organize and analyze these four counts.
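A short sketch of both ideas, using hypothetical true and predicted labels (the label values below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels (1 = success, 0 = failure)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                        # 0.75
print(accuracy_score(y_true, y_pred))  # same value, computed by sklearn
```

The manual formula and `accuracy_score` agree, which is a useful sanity check when reporting results.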
Let's understand this properly with an example and sample code:
Loan Status Prediction:
To check and run the code: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction
As usual, let’s start with importing the libraries and then reading the dataset:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

```python
loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')
```
This dataset requires some preprocessing because it contains text categories instead of numbers, as well as some null values.
```python
# dropping the rows with missing values
loan_dataset = loan_dataset.dropna()

# numbering the labels: N -> 0, Y -> 1
loan_dataset = loan_dataset.replace({'Loan_Status': {'N': 0, 'Y': 1}})

# replacing the value of 3+ (in Dependents) with 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)

# convert the remaining categorical columns to numerical values
# (one consistent integer encoding; the exact mapping is a choice)
loan_dataset = loan_dataset.replace({'Gender': {'Male': 1, 'Female': 0},
                                     'Married': {'Yes': 1, 'No': 0},
                                     'Education': {'Graduate': 1, 'Not Graduate': 0},
                                     'Self_Employed': {'Yes': 1, 'No': 0},
                                     'Property_Area': {'Urban': 2, 'Semiurban': 1, 'Rural': 0}})
```
Here, the first line drops the rows with NULL values; then the Y and N labels (representing Yes and No) are replaced by 1 and 0, and the other categorical values are likewise mapped to numbers.
```python
# separating the data and the label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'], axis=1)
Y = loan_dataset['Loan_Status']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
```
As seen in the code, the dataset is split into training and testing sets.
The test_size=0.25 means that 75% of the dataset will be used for training and 25% for testing.
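As a quick sanity check on those proportions, here is a self-contained sketch with a dummy DataFrame (the real X and Y come from the cells above; the dummy columns are my own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 dummy rows stand in for the real loan data
X = pd.DataFrame({'a': range(100), 'b': range(100)})
Y = pd.Series([0, 1] * 50)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
print(len(X_train), len(X_test))  # 75 25
```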
```python
model = LogisticRegression()
model.fit(X_train, Y_train)
```

Sklearn makes it very easy to train the model: a single `fit` call is all that is required.
Evaluating the model:
```python
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
```
This shows that the accuracy on the training data is about 82%.
```python
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
```
This shows that the accuracy on the test data is about 78.3%.
```python
# input_data holds the feature values for a single applicant,
# in the same column order as X

# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
```
Here, when new input data is given to the trained model, it outputs whether the loan is approved.