In machine learning, classification is the task of assigning data to categories based on input variables. A labelled dataset (the training dataset) is used to train a model so that it can predict labels for data that has not yet been labelled.
There are 2 types of classification:
- Binary Classification
- Multi-Class Classification
Here, let's discuss binary classification in detail.
Binary classification is the method of classifying the elements into 2 groups (also called classes or categories) using classification algorithms.
Algorithms that support binary classification include:
- Logistic Regression
- K-Nearest Neighbors
- Decision Trees
- Support Vector Machines
- Naive Bayes
Some of these algorithms are not limited to binary classification and can also be used for multi-class classification, for example Support Vector Machines (SVM).
In statistics, a binary outcome is modelled by the Bernoulli distribution and, over repeated trials, by the Binomial distribution.
Bernoulli Probability Distribution:
Random Variable: A numerical description of the outcome of a statistical experiment.
For a simple coin toss, the outcome can be either heads or tails.
We can define a random variable X:
X(Heads) = 1
X(Tails) = 0
The assignment could equally be reversed: X(Heads) = 0 and X(Tails) = 1.
In either case, X can only be 1 or 0.
Bernoulli Trial: This is an experiment that can have only 2 outcomes, i.e. Success or Failure.
Success usually suggests winning, but in a Bernoulli trial it simply refers to the outcome we are interested in.
For example, a student either passes an exam or fails it.
For a random variable X, let X = 1 if the student passes and X = 0 if the student fails.
So, for the student, X = 1 is the desired condition.
Bernoulli Distribution: The probability distribution of the outcome of a single Bernoulli trial. It is a special case of the Binomial distribution (where there is only a single trial).
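To make the idea of a Bernoulli trial concrete, here is a minimal sketch that simulates repeated trials using only Python's standard library. The function name `bernoulli_trial`, the success probability of 0.7, the number of trials and the fixed seed are all illustrative assumptions and are not part of the original notebook.

```python
import random


def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0


# Assumed values for illustration: p = 0.7, 10,000 trials, fixed seed.
rng = random.Random(42)
p = 0.7
outcomes = [bernoulli_trial(p, rng) for _ in range(10_000)]

# Every outcome is 0 or 1, and the share of successes should land close to p.
success_rate = sum(outcomes) / len(outcomes)
print(f"Observed success rate: {success_rate:.3f}")
```

With enough trials, the observed success rate should settle near the chosen p, which is what the Bernoulli distribution describes for a single trial.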
Binomial Distribution: The probability distribution of the number of successes across several independent Bernoulli trials that share the same probability of success.
For a random variable X:
X ~ Binomial(n, p) such that n is a positive integer and 0 ≤ p ≤ 1
Here n is the number of Bernoulli trials and p is the probability of success for each trial.
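In mathematical terms, the probability of observing exactly k successes in n trials is:
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), for k = 0, 1, ..., n, where C(n, k) = n! / (k! (n − k)!)
For a single trial (n = 1) this reduces to P(X = 1) = p.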
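As a rough numerical check of the binomial probability P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), the sketch below evaluates it with the standard library's `math.comb`. The helper name `binomial_pmf` and the example of three fair coin tosses are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed example: 3 fair coin tosses (n = 3, p = 0.5).
n, p = 3, 0.5
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(probabilities)  # [0.125, 0.375, 0.375, 0.125]

# With n = 1 the binomial reduces to the Bernoulli case: P(X = 1) = p.
single_trial = binomial_pmf(1, 1, 0.7)
print(single_trial)
```

The probabilities over k = 0, ..., n sum to 1, as they must for any probability distribution.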
Binary classification traditionally follows the same idea: the model estimates the probability of success for each case and, based on that probability, assigns the case to Category 1 (Success) or Category 2 (Failure).
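The step from a probability to a category can be sketched as a simple threshold rule. The function name `classify`, the cut-off of 0.5 and the sample probabilities below are assumptions for illustration; a real model may use a different threshold.

```python
def classify(probability_of_success, threshold=0.5):
    """Map a predicted probability to Category 1 (success) or Category 2 (failure)."""
    return 1 if probability_of_success >= threshold else 2


# Hypothetical predicted probabilities for four cases.
predicted_probabilities = [0.91, 0.42, 0.50, 0.08]
categories = [classify(prob) for prob in predicted_probabilities]
print(categories)  # [1, 2, 1, 2]
```

Raising or lowering the threshold trades one kind of error for the other, which is why the evaluation metrics discussed next matter.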
Evaluation Metrics for Binary Classification:
The predicted values can be divided into True Positives, True Negatives, False Positives and False Negatives.
True Positives (TP): When the model states that the experiment for a particular case is a success and it really is a success.
True Negatives (TN): When the model states that the experiment for a particular case is a failure and it really is a failure.
False Positives (FP): When the model states that the experiment for a particular case is a success but it actually is a failure.
False Negatives (FN): When the model states that the experiment for a particular case is a failure but it is actually a success.
Note that success and failure are used here in the Bernoulli-trial sense and do not imply that one category is better than the other. They simply mean that the output is either Category 1 or “Not” Category 1 (which is Category 2).
After getting these values, accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
A confusion matrix tabulates these four counts so that they can be analysed together.
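The four counts and the accuracy built from them can be sketched in plain Python as follows. The helper name `confusion_counts` and the two small label lists are made up for illustration, with 1 standing for success and 0 for failure. The matrix layout (rows for actual values, columns for predicted values) is one common convention.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for labels encoded as 1 (success) and 0 (failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
    tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
    fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
    fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
    return tp, tn, fp, fn


# Hypothetical actual labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Confusion matrix: rows are actual 0 / actual 1, columns are predicted 0 / predicted 1.
confusion_matrix = [[tn, fp], [fn, tp]]
print(confusion_matrix)  # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```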
Let's understand this properly with an example and sample code:
Loan Status Prediction:
To check and run the code: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction
As usual, let's start by importing the libraries and then reading the dataset:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')
This dataset requires some preprocessing because it contains words instead of numbers and some null values.
# dropping the missing values
loan_dataset = loan_dataset.dropna()
# numbering the labels
# replacing the value of 3+ to 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)
# convert categorical columns to numerical values
Here, the first line drops the rows with NULL values. The N and Y labels (representing No and Yes) are then replaced by 0 and 1, the value 3+ is replaced by 4, and the other categorical values are given numbers in the same way.
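The preprocessing described above can be sketched on a tiny stand-in table. This is not the notebook's actual code: the DataFrame `toy`, its rows, and the column names `Married` and `Dependents` are assumptions for illustration (only `Loan_Status` is named in the code here), and the choice of which number each category receives is arbitrary.

```python
import pandas as pd

# Hypothetical miniature of the loan data; rows and most column names are assumed.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Dependents": ["0", "3+", "1"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Replace the open-ended "3+" bucket with 4, then make the column numeric.
toy["Dependents"] = toy["Dependents"].replace("3+", "4").astype(int)

# Map each categorical column to integers with an explicit dictionary.
toy["Married"] = toy["Married"].map({"No": 0, "Yes": 1})
toy["Loan_Status"] = toy["Loan_Status"].map({"N": 0, "Y": 1})

print(toy)
```

Writing the mapping out as a dictionary keeps the encoding visible, so it is easy to check later which number stands for which category.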
# separating the data and label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'])
Y = loan_dataset['Loan_Status']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
As seen in the code, the dataset is split into training and testing datasets.
Setting test_size=0.25 means that 25% of the dataset is held out for testing while the remaining 75% is used for training. Fixing random_state makes the split reproducible.
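To see the effect of test_size on the split sizes, here is a small sketch on made-up data. The arrays `X_demo` and `y_demo` are hypothetical and only stand in for the loan features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with 3 features each, and a binary label.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)
print(X_tr.shape, X_te.shape)  # (75, 3) (25, 3)
```

If the two classes are imbalanced, passing the labels through the `stratify` parameter of `train_test_split` is one way to keep their proportions similar in both parts.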
model = LogisticRegression()
model.fit(X_train, Y_train)
Sklearn makes it very easy to train the model. Once the model object is created, only 1 line of code (the call to fit) is required to do so.
Evaluating the model:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on training data:', training_data_accuracy)
This shows that the accuracy on the training data is 82%.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy on test data:', test_data_accuracy)
This shows that the accuracy on the test data is 78.3%.
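# input_data holds one applicant's feature values, in the same column order as X
# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)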
Here, when the data for a single new applicant is given to the trained model, it outputs whether the loan is approved or not.