In machine learning, classification is the task of assigning data to categories based on a set of input variables. A labelled dataset (the training dataset) is used to train a model so that it can predict labels for data that has not yet been labelled.

There are 2 types of classification:

- Binary Classification
- Multi-Class Classification

Here let’s discuss Binary classification in detail.

Binary classification is the method of classifying the elements into 2 groups (also called classes or categories) using classification algorithms.

Algorithms that support binary classification include:

- Logistic Regression
- K-Nearest Neighbors
- Decision Trees
- Support Vector Machines
- Naive Bayes

Some of these algorithms are not specific to binary classification and can also be used for multi-class classification, for example the Support Vector Machine (SVM).

In statistics, binary classification is closely related to the binomial probability distribution, which is built from Bernoulli trials.

**Bernoulli Probability Distribution:**

**Random Variable:** A numerical description of the outcome of a statistical experiment.

For example:

For a simple Coin toss, the outcome can either be heads or tails.

We can define a random variable X

X(Heads) = 1

X(Tails) = 0

The assignment could equally be reversed: X(Heads) = 0 and X(Tails) = 1.

Here X can be either 1 or 0.

**Bernoulli Trial:** An experiment that can have only 2 outcomes, i.e. success or failure.

"Success" here does not necessarily mean winning; in a Bernoulli trial it simply refers to the outcome of interest.

For example,

A student passes the exam or fails it.

For a random variable X:

X(Pass) = 1

X(Fail) = 0

So, for the student X=1 is the desired condition.
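
A Bernoulli trial like this can be simulated in a few lines of Python. The sketch below is only an illustration: the function name `bernoulli_trial`, the pass probability of 0.7 and the seed are assumptions chosen for the example, not values taken from any dataset.

```python
import random


def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0


# Hypothetical pass probability, chosen only for illustration.
p_pass = 0.7
rng = random.Random(42)  # fixed seed so the run is repeatable

# Repeat the trial many times and look at the observed share of successes.
outcomes = [bernoulli_trial(p_pass, rng) for _ in range(10_000)]
pass_rate = sum(outcomes) / len(outcomes)
print("Observed fraction of passes:", pass_rate)
```

With many repetitions the observed fraction should settle close to the assumed probability of success.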

**Bernoulli Distribution:** The probability distribution of the outcome of a single Bernoulli trial is called the Bernoulli distribution. It is a special case of the binomial distribution in which there is only a single trial.

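**Binomial Distribution:** The distribution of the number of successes across more than one Bernoulli trial, where the trials are independent and share the same probability of success, is called the binomial distribution.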

For a random variable X:

X ~ Binomial(n, p), where n is a positive integer and 0 ≤ p ≤ 1

Here n is the number of Bernoulli trials and p is the probability of success for each trial.

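In mathematical terms, the probability of X taking the value k, i.e. of observing exactly k successes in n trials, is:

$$P(X = k) = \binom{n}{k} \, p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$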
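
As a minimal sketch, the binomial probability of exactly k successes in n trials, C(n, k) · p^k · (1 − p)^(n − k), can be evaluated with Python's standard library. The helper name `binomial_pmf` and the example values of n and p are assumptions made for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: 10 tosses of a fair coin (p = 0.5 is an assumed value).
n, p = 10, 0.5
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4))

# With n = 1 the binomial reduces to a single Bernoulli trial.
print(binomial_pmf(1, 1, 0.3))
```

The probabilities over k = 0, ..., n add up to 1, which is a quick sanity check on the function.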

Traditionally, binary classification follows this idea: the model estimates the probability of success for each observation and, based on that probability, assigns it to Category 1 (Success) or Category 2 (Failure).
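
Turning a probability into a class can be sketched as follows. This is an illustration rather than how any particular library implements it: the logistic (sigmoid) function is one common way to obtain a probability from a score, and the 0.5 cut-off, the function names and the example scores are all assumptions.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability, threshold=0.5):
    """Return 1 (Category 1, success) if probability >= threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical model scores for three observations.
for score in (-2.0, 0.0, 1.5):
    prob = sigmoid(score)
    print(f"score={score:+.1f}  p(success)={prob:.3f}  class={classify(prob)}")
```

Moving the threshold trades one kind of error for the other, so 0.5 is a convention rather than a requirement.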

**Evaluation Metrics for Binary Classification:**

**Accuracy:**

The predicted values can be divided into True Positives, True Negatives, False Positives and False Negatives.

True Positives (TP): When the model states that the experiment for a particular case is a success and it really is a success.

True Negatives (TN): When the model states that the experiment for a particular case is a failure and it really is a failure.

False Positives (FP): When the model states that the experiment for a particular case is a success but it actually is a failure.

False Negatives (FN): When the model states that the experiment for a particular case is a failure but it is actually a success.

*Note that success and failure here refer to Bernoulli trials and not to whether a particular category is desirable. They simply mean that the output is either Category 1 or "not" Category 1 (which is Category 2).*

After getting these values, accuracy is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

A confusion matrix is created to tabulate and analyse these four counts.
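
To make the four counts concrete, here is a small sketch in plain Python that tallies them and computes accuracy as the share of correct predictions, (TP + TN) divided by all four counts. The function names and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for labels encoded as 1 (success) / 0 (failure)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("Confusion matrix (rows = actual, columns = predicted):")
print(f"actual 0: [{tn}, {fp}]")
print(f"actual 1: [{fn}, {tp}]")
print("Accuracy:", accuracy(tp, tn, fp, fn))
```

The row and column layout used here (actual 0 first, then actual 1) is the same one scikit-learn's `confusion_matrix` returns for labels 0 and 1.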

**Understanding properly with an example and sample code:**

**Loan Status Prediction:**

**To check and run the code**: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction

As usual, let's start by importing the libraries and then reading the dataset:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # read the loan dataset into a DataFrame
    loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')

This dataset requires some preprocessing because it contains words instead of numbers and some null values.

    # dropping the missing values
    loan_dataset = loan_dataset.dropna()

    # numbering the labels
    loan_dataset.replace({'Loan_Status': {'N': 0, 'Y': 1}}, inplace=True)

    # replacing the value of 3+ to 4
    loan_dataset = loan_dataset.replace(to_replace='3+', value=4)

    # convert categorical columns to numerical values
    loan_dataset.replace({'Married': {'No': 0, 'Yes': 1},
                          'Gender': {'Male': 1, 'Female': 0},
                          'Self_Employed': {'No': 0, 'Yes': 1},
                          'Property_Area': {'Rural': 0, 'Semiurban': 1, 'Urban': 2},
                          'Education': {'Graduate': 1, 'Not Graduate': 0}},
                         inplace=True)

In the first line, rows with NULL values are dropped. Then the N and Y values (representing No and Yes) in the Loan_Status column are replaced by 0 and 1 respectively, and the other categorical values are likewise mapped to numbers.

    # separating the data and label
    X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'])
    Y = loan_dataset['Loan_Status']

    # hold out 25% of the rows for testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)

As seen in the code, the dataset is split into training and testing datasets.

The **test_size=0.25** argument means that 75% of the dataset will be used for training and 25% for testing.
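
To see what a 75/25 split means in practice, here is a minimal sketch in plain Python. It is not scikit-learn's implementation, so it will not reproduce the exact rows that `train_test_split` selects. The helper name `split_indices` and the row count of 100 are assumptions for illustration.

```python
import random


def split_indices(n_samples, test_size=0.25, seed=2):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


# 100 rows is an arbitrary example size.
train_idx, test_idx = split_indices(100)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Fixing the seed plays the same role as `random_state`: the same rows end up in each part on every run.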

    model = LogisticRegression()
    model.fit(X_train, Y_train)

Sklearn makes it very easy to train the model. Only 1 line of code, the call to fit, is required to do so.

Evaluating the model:

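    # accuracy on the training data (true labels first, predictions second)
    X_train_prediction = model.predict(X_train)
    training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
    print('Accuracy on training data:', training_data_accuracy)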

This shows that the accuracy on the training data is 82%.

    # accuracy on the held-out test data
    X_test_prediction = model.predict(X_test)
    test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
    print('Accuracy on test data:', test_data_accuracy)

This shows that the accuracy on the test data is 78.3%.

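    input_data = (1, 1, 0, 1, 0, 3033, 1459.0, 94.0, 360.0, 1.0, 2)

    # changing the input_data to a numpy array
    input_data_as_numpy_array = np.asarray(input_data)

    # reshape the np array as we are predicting for one instance
    input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

    prediction = model.predict(input_data_reshaped)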

Here, when a single example input is given to the trained model, it predicts whether the loan is approved (1) or not (0).