Logistic regression is one of the most popular algorithms for classification problems. It is called "regression" even though it is not a regression algorithm because the underlying technique is similar to that of linear regression. The term "logistic" comes from the type of statistical model it uses (the logit model).
As seen earlier, classification algorithms are used to sort a dataset into various classes; logistic regression is typically used as a binary classification algorithm.
Logistic regression is an extension of the linear regression model. Although linear regression is good for regression, it does not work well for classification because the linear model does not output probabilities; it simply treats the labels as the numbers 0 and 1 (Class A or Class B). It fits the dataset as points in a plane and then tries to find the line that minimizes the distances between the points and that line. Given data like this, the linear model ends up forcing an unnatural structure onto the relationship between the independent and dependent variables.
The line L1 is the best-fit line when only the red points are considered. It classifies points to the right of the line as 1 and points to the left as 0, which gives a decent indication of the data for classification, even though quite a few points are classified incorrectly.
Now, if one extreme point (an outlier) is added, the best-fit line shifts to L2, which misclassifies many more points, to the extent that all the points that should be classified as 1 are now classified as 0. This drastic change is caused by a single outlier.
To deal with such situations, the linear regression algorithm was modified, resulting in the famous logistic regression.
Instead of fitting a straight line in the plane, the logistic regression model uses the logistic function to squash the output of a linear equation between 0 and 1. Just by looking at the diagram above, it is evident that the S-curve created by logistic regression follows the data points much more closely.
Logistic function:
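$$\sigma(x) = \frac{1}{1 + e^{-x}}$$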
When drawn on a 2-D plane it looks like:
As X (the independent variable) goes to positive infinity, Y (the dependent variable) approaches 1, and as X goes to negative infinity, Y approaches 0.
This logistic function is also called the sigmoid function. It takes any real number and converts it into a probability between 0 and 1, which makes it great for binary classification.
So far there was only one independent variable (X). What happens when there is more than one independent variable?
Then the linear equation becomes, written in terms of log odds:
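$$\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

Solving for p gives back the sigmoid form:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$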
Here x1, x2, …, xp are the independent variables. The β values are estimated using maximum likelihood estimation: the method tries candidate values of β over multiple iterations and finds the coefficients whose log odds best fit the data, i.e. the ones that maximize the likelihood function. Once this function is maximized and the optimal coefficients are found, they are plugged into the sigmoid function to obtain the probability.
The blue line here is y = 0.5: all points above it are classified as class 1 and all points below it as class 0.
The probabilities found by the aforementioned formula are used here.
If p ≥ 0.5, class = 1
If p < 0.5, class = 0
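As a minimal sketch of this rule in plain NumPy (the coefficient and input values here are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # squash any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# illustrative coefficients (beta_0 as intercept, beta_1 for a single feature)
beta_0, beta_1 = 0.5, 2.0
x = -1.3                             # one value of the independent variable
p = sigmoid(beta_0 + beta_1 * x)     # predicted probability of class 1
predicted_class = 1 if p >= 0.5 else 0
print(p, predicted_class)            # ~0.109, class 0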
So, the good thing about the logistic regression algorithm is that it not only classifies but also provides probabilities. Knowing that an instance has a 90%+ probability of belonging to a class, as compared to one with 51%, is a big advantage.
Cost function of Logistic regression:
The cost function helps us understand how well the machine learning model works. It calculates the difference between the actual and the predicted values and measures how wrong the algorithm's predictions were. By minimizing the value of the cost function, the most optimised result is found.
In logistic regression, the Log loss function is used.
Log Loss function:
Mathematically, the log loss is the negative average of the logs of the corrected predicted probabilities of the instances.
By default logistic regression gives probabilities with respect to the hypothesis.
For example, the hypothesis is “Probability that a person sleeps more than 10 hours a day”
Here 1 represents a person who sleeps more than 10 hours a day and 0 represents a person who sleeps less than 10 hours.
| ID | Class | Probability |
|----|-------|-------------|
| 1  | 1     | 0.93        |
| 2  | 1     | 0.76        |
| 3  | 0     | 0.2         |
| 4  | 0     | 0.4         |
| 5  | 1     | 0.78        |
Probability here refers to the probability of the class being 1, i.e. the probability that the person sleeps more than 10 hours a day.
In the case of IDs 3 and 4, the probabilities are 0.2 and 0.4 respectively, but these refer to class 1; they need to be converted into the probability that the instance belongs to its own class. This is done with corrected probabilities: wherever the class is 0, corrected probability = 1 − predicted probability.
| ID | Class | Probability | Corrected Probability |
|----|-------|-------------|-----------------------|
| 1  | 1     | 0.93        | 0.93                  |
| 2  | 1     | 0.76        | 0.76                  |
| 3  | 0     | 0.2         | 0.8                   |
| 4  | 0     | 0.4         | 0.6                   |
| 5  | 1     | 0.78        | 0.78                  |
Now it's time to take the log of the corrected probabilities:
| ID | Class | Probability | Corrected Probability | Log (base 10) |
|----|-------|-------------|-----------------------|---------------|
| 1  | 1     | 0.93        | 0.93                  | -0.0315       |
| 2  | 1     | 0.76        | 0.76                  | -0.1192       |
| 3  | 0     | 0.2         | 0.8                   | -0.0969       |
| 4  | 0     | 0.4         | 0.6                   | -0.2218       |
| 5  | 1     | 0.78        | 0.78                  | -0.1079       |
Since the log of a number less than 1 is negative, the negative of the average is taken so that the final loss value is positive.
Thus the final formula becomes:
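$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(p_i^{\text{corrected}}\right)$$

where N is the number of instances and $p_i^{\text{corrected}}$ is the corrected probability of instance i. This is equivalent to the more commonly written form $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$, where $\hat{p}_i$ is the predicted probability of class 1.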
To summarise, the steps for the log loss function (sketched in code after this list) are:
- Find corrected probability
- Take the log of corrected probabilities
- Convert to the negative average of the values
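As a small sketch in plain NumPy, the three steps can be checked against the five example instances from the table above (base-10 logs are used to match the table; note that sklearn's log_loss uses the natural log):

import numpy as np

y = np.array([1, 1, 0, 0, 1])                 # actual classes
p = np.array([0.93, 0.76, 0.2, 0.4, 0.78])    # predicted probability of class 1

# Step 1: corrected probability (probability of the instance's own class)
p_corrected = np.where(y == 1, p, 1 - p)

# Step 2: log of the corrected probabilities (base 10, as in the table)
logs = np.log10(p_corrected)

# Step 3: negative average of the logs
log_loss_value = -np.mean(logs)
print(p_corrected)     # [0.93 0.76 0.8  0.6  0.78]
print(log_loss_value)  # ~0.115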
The following code implements logistic regression using the sklearn library.
Loan Status Prediction:
To check and run the code: https://www.kaggle.com/code/tanavbajaj/loan-status-prediction
As usual, let's start with importing the libraries and then reading the dataset:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
loan_dataset = pd.read_csv('../input/loan-predication/train_u6lujuX_CVtuZ9i (1).csv')
This dataset requires some preprocessing because it contains words instead of numbers and some null values.
# dropping the missing values
loan_dataset = loan_dataset.dropna()

# numbering the labels
loan_dataset.replace({"Loan_Status": {'N': 0, 'Y': 1}}, inplace=True)

# replacing the value of 3+ with 4
loan_dataset = loan_dataset.replace(to_replace='3+', value=4)

# convert categorical columns to numerical values
loan_dataset.replace({'Married': {'No': 0, 'Yes': 1},
                      'Gender': {'Male': 1, 'Female': 0},
                      'Self_Employed': {'No': 0, 'Yes': 1},
                      'Property_Area': {'Rural': 0, 'Semiurban': 1, 'Urban': 2},
                      'Education': {'Graduate': 1, 'Not Graduate': 0}}, inplace=True)
Here, in the first line, rows with NULL values are dropped. Then the N and Y values (representing No and Yes) in the Loan_Status column are replaced by 0 and 1, and the other categorical values are similarly mapped to numbers.
# separating the data and label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'], axis=1)
Y = loan_dataset['Loan_Status']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
As seen in the code, the dataset is split into training and testing sets.
test_size=0.25 means that 75% of the dataset is used for training and 25% for testing.
model = LogisticRegression()
model.fit(X_train, Y_train)
Sklearn makes it very easy to train the model: only one line of code (the call to fit) is required to do so.
Evaluating the model:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print(training_data_accuracy)
This shows that the accuracy on the training data is about 82%.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print(test_data_accuracy)
This shows that the accuracy on the test data is about 78.3%.
input_data = (1, 1, 0, 1, 0, 3033, 1459.0, 94.0, 360.0, 1.0, 2)

# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the np array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
print(prediction)
Here, when a sample input is given to the trained model, it outputs whether the loan is approved or not.
Logistic regression code written without sklearn behaves like a small neural network: it requires forward and backward propagation of the loss function to set the weights and biases in a way that produces the optimised result.
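As a rough sketch of what that looks like (plain NumPy gradient descent on the log loss; the learning rate, iteration count, and toy data below are assumptions for illustration, not the original notebook's code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_iters):
        # forward propagation: linear combination passed through the sigmoid
        p = sigmoid(X @ weights + bias)
        # backward propagation: gradients of the log loss w.r.t. weights and bias
        dw = (X.T @ (p - y)) / n_samples
        db = np.mean(p - y)
        # gradient descent update
        weights -= lr * dw
        bias -= lr * db
    return weights, bias

# toy data: class 1 when the feature values are positive, class 0 otherwise
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.5, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w, b = train_logistic_regression(X, y)
print(sigmoid(X @ w + b))   # probabilities close to the true labels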