Binary Classification

In machine learning, classification is the task of assigning data to categories based on certain input variables. A dataset with known labels (the training dataset) is used to train the model so that it can predict labels for data that has not yet been labelled.

Under classification, there are 2 types of classifiers:

  1. Binary Classification
  2. Multi-Class Classification

Here let’s discuss Binary classification in detail.

Binary classification is the method of classifying the elements into 2 groups (also called classes or categories) using classification algorithms.

Algorithms that support binary classification include:

  1. Logistic Regression
  2. K-Nearest Neighbors
  3. Decision Trees
  4. Support Vector Machines
  5. Naive Bayes

Some of these algorithms are not specific to binary classification and can also be used for multi-class classification, for example Support Vector Machines (SVM).

The statistical counterpart of binary classification is the Binomial probability distribution.

Bernoulli Probability Distribution:

Random Variable: A numerical description of the outcome of a statistical experiment.

For example:

For a simple Coin toss, the outcome can either be heads or tails.

We can define a random variable X

X(Heads) = 1

X(Tails) = 0

The assignment could equally be the other way round: X(Tails) = 1 and X(Heads) = 0.

Here X can be either 1 or 0.

Bernoulli Trial: An experiment that can have only 2 outcomes, i.e. Success or Failure.

Success usually suggests winning, so in a Bernoulli trial it refers to the outcome that is desired.

For example,

A student passes the exam or fails it.

For a random variable X:

X(Pass)= 1

X(Fails)= 0

So, for the student X=1 is the desired condition.

Bernoulli Distribution: The probability distribution of the outcome of a single Bernoulli trial: X = 1 with probability p and X = 0 with probability 1 - p. It is a special case of the Binomial Distribution (where there is only a single trial).
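
To get a feel for a Bernoulli trial, here is a minimal simulation sketch using only the Python standard library. The pass probability of 0.7, the seed and the helper name bernoulli_trial are assumptions made up for illustration; they do not come from any dataset in this post.

```python
import random

def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
p_pass = 0.7            # assumed probability that the student passes

# Repeat the same trial many times and record each outcome (1 = pass, 0 = fail)
outcomes = [bernoulli_trial(p_pass, rng) for _ in range(10_000)]

# The share of 1s should land close to p_pass
pass_rate = sum(outcomes) / len(outcomes)
print(f"observed pass rate: {pass_rate:.3f}")
```

With enough repetitions the observed pass rate settles near the assumed p, which is what "probability of success" means in practice.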

Binomial Distribution: The probability distribution of the number of successes in a fixed number of independent Bernoulli trials that all share the same probability of success.

For a random variable X:

X ~ Binomial(n, p), where n is a positive integer and 0 ≤ p ≤ 1

Here n is the number of Bernoulli trials and p is the probability of success for each trial.

In mathematical terms, the probability that X takes the value k, i.e. that exactly k of the n trials are successes, is:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
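
As a rough check of the Binomial probabilities, the sketch below computes the chance of exactly k successes in n trials with the standard library. The function name binomial_pmf and the values n = 10 and p = 0.5 are chosen here only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # assumed values: ten fair coin tosses

# Probability of every possible number of successes, from 0 to n
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print("P(X = 5):", pmf[5])
print("sum over all k:", sum(pmf))  # the probabilities should add up to 1
```

Setting n = 1 reduces this to the Bernoulli case: binomial_pmf(1, 1, p) is just p.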

Traditionally, binary classification follows this idea: the model estimates the probability of success for each observation and, based on that probability, classifies it into Category 1 (Success) or Category 2 (Failure).
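
The step from a probability to a category can be sketched as a simple threshold rule. The cut-off of 0.5 and the probabilities below are assumptions for illustration, not outputs of any model in this post.

```python
def classify(prob_success, threshold=0.5):
    """Return 1 (Category 1) if the probability reaches the threshold, else 0."""
    return 1 if prob_success >= threshold else 0

# Hypothetical probabilities of success produced by some model
probabilities = [0.91, 0.35, 0.50, 0.08]

labels = [classify(prob) for prob in probabilities]
print(labels)
```

A threshold of 0.5 is a common default, but it can be moved when one kind of mistake is more costly than the other.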

Evaluation Metrics for Binary Classification:

Accuracy:

The predicted values can be divided into True Positives, True Negatives, False Positives and False Negatives.

True Positives (TP): When the model states that the experiment for a particular case is a success and it really is a success.

True Negatives (TN): When the model states that the experiment for a particular case is a failure and it really is a failure.

False Positives (FP): When the model states that the experiment for a particular case is a success but it actually is a failure.

False Negatives (FN): When the model states that the experiment for a particular case is a failure but it is actually a success.

Note that success and failure here refer to Bernoulli trials, not to whether a category is a good or bad result. They simply mean that the output is either Category 1 or "not" Category 1 (which is Category 2).

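After getting these values, accuracy is the fraction of all predictions that are correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$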

A confusion matrix is created to analyse these parameters.
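
To see how the four counts and the accuracy fit together, here is a small sketch in plain Python. The label lists y_true and y_pred are invented for illustration, with 1 standing for Category 1 and 0 for Category 2.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = success, 0 = failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

# Accuracy is the share of predictions that match the true label
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Confusion matrix laid out as rows = actual, columns = predicted
print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")
print("accuracy:", accuracy)
```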

Let us make this concrete with an example and sample code:

Loan Status Prediction:

To check and run the code: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction

As usual, let's start by importing the libraries and then reading the dataset:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

 

loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')

This dataset requires some preprocessing because it contains text values instead of numbers, as well as some null values.

# drop the rows with missing values
loan_dataset = loan_dataset.dropna()
# encode the label: N -> 0, Y -> 1
loan_dataset.replace({'Loan_Status': {'N': 0, 'Y': 1}}, inplace=True)
# replace the value '3+' with 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)
# convert categorical columns to numerical values
loan_dataset.replace({'Married': {'No': 0, 'Yes': 1}, 'Gender': {'Male': 1, 'Female': 0}, 'Self_Employed': {'No': 0, 'Yes': 1},
                      'Property_Area': {'Rural': 0, 'Semiurban': 1, 'Urban': 2}, 'Education': {'Graduate': 1, 'Not Graduate': 0}}, inplace=True)

In the first line, rows with NULL values are dropped. Then the N and Y values (representing No and Yes) in the Loan_Status column are replaced by 0 and 1, and the other categorical values are given numbers in the same way.

# separating the data and label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'])
Y = loan_dataset['Loan_Status']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)

As seen in the code, the dataset is split into training and testing datasets.

The test_size=0.25 argument means that 75% of the dataset will be used for training and 25% for testing.
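
The idea behind a 75/25 split can be sketched in plain Python: shuffle the rows with a fixed seed, then cut off the last quarter for testing. This is only an illustration of the concept, not how scikit-learn implements train_test_split; the helper simple_train_test_split and the 100 dummy rows are hypothetical.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=2):
    """Shuffle a copy of rows, then hold out the last test_size fraction."""
    rng = random.Random(seed)  # fixed seed gives the same split every run
    shuffled = list(rows)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 rows of a dataset
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Every row ends up in exactly one of the two parts, so the model is evaluated on data it has not seen during training.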

model = LogisticRegression()
model.fit(X_train, Y_train)

Sklearn makes it very easy to train the model. Only 1 line of code, the call to fit, is required to do so.

Evaluating the model:

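X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on training data:', training_data_accuracy)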

This shows that the accuracy on the training data is 82%.

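X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy on test data:', test_data_accuracy)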

This shows that the accuracy on the test data is 78.3%.

input_data= (1,1,0,1,0,3033,1459.0,94.0,360.0,1.0,2)

# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = model.predict(input_data_reshaped)
print(prediction)

Here, when a sample input is given to the trained model, it outputs whether the loan is approved (1) or not (0).
