There are situations in daily life where we want to know the relationship between various factors. For example, would an increase in the price of petrol affect the sales of cars, or would changing the location of a house affect its price? The process of finding this relationship between multiple factors is known as regression.
In more formal words, regression refers to the study of the relationship between variables so that one can predict the unknown value of one variable from a known value of another variable.
To better understand regression, we need to be familiar with two terms:
- Independent variable: These are the factors or variables on which the analysis is based. Their values are taken as given, and they carry information about the dependent variable.
- Dependent variable: This is the result, e.g., the sales of cars or the house price. In formal terms, it is the factor whose value is affected by the independent variables.
Types of regression
Image source https://static.javatpoint.com/tutorial/machine-learning/images/types-of-regression.png
So what is linear regression?
Linear regression is used when the end goal is to find a linear relationship between the independent variables and the dependent variable of a dataset, so that a prediction can be made when an unknown point is given. When there is only one independent feature, it is called univariate linear regression; when there are multiple features, it is called multiple linear regression.
In univariate linear regression, the model is a line on a 2-D plane that shows the linear relationship between the independent and dependent variables. In multiple linear regression, the model is a hyperplane.
A regression model tries to find a function that suitably fits the training data and predicts unseen data accurately.
In the case of linear regression, it is a linear function of the form:
$$y = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
- Where $y$ denotes the predicted value (dependent variable)
- $b$ is called the bias term or the offset
- $w_1, \dots, w_n$ are called the model parameters (these are the values that our model tries to learn using the training data)
- $x_1, \dots, x_n$ are called the feature values (independent variables)
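As a rough sketch of the linear form described above, the prediction for one sample is the bias term plus each model parameter multiplied by its feature value. The function name `predict_linear` and all the numbers are made up for illustration.

```python
def predict_linear(features, weights, bias):
    """Return bias + sum of weight_j * feature_j for a single sample."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# One sample with two features; the values are illustrative only.
y_hat = predict_linear(features=[1.0, 2.0], weights=[3.0, 4.0], bias=0.5)
print(y_hat)  # 0.5 + 3*1 + 4*2 = 11.5
```

With a single feature this reduces to a straight line, and with several features it describes a hyperplane.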
Understanding linear regression
The best way to understand anything is to understand how it works under the hood. Let’s start by generating a random data set to test our linear regression algorithm.
import numpy as np
import matplotlib.pyplot as plt
# generate random data-set
np.random.seed(7)
The above code sets the random seed to 7, i.e., whenever numpy’s rand function generates random numbers, it produces the same numbers on every run. This ensures that the values of x are the same each time.
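The effect of a fixed seed can be seen in a small sketch. It uses the standard-library `random` module rather than NumPy so that it runs without any dependencies; `np.random.seed` plays the same role for NumPy’s generator.

```python
import random

random.seed(7)                       # fix the generator's starting state
first_run = [random.random() for _ in range(3)]

random.seed(7)                       # reset to the same starting state
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)       # True: same seed, same sequence
```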
Next, let the value of y at each x be given by
If we plot the generated data, it looks like this:
Intuitively, a line that follows the trend of the data fairly closely will look something like this:
The line we just imagined, i.e., the one that runs through the middle of the data, is known as the best-fit line.
How do we decide the best-fit line?
What does the best-fit line represent? It is the line that tells us, for a particular value of x, what the value of y is likely to be.
So the best-fit line is $y = mx + c$, and the regression model needs to calculate the values of $m$ and $c$ using the training data set.
The actual value of y might not be equal to the predicted value. In that case, the difference between the predicted value and the actual value is called the residual. So the best-fit line is the line that passes through the middle of the data with the minimum residuals. The sum of all residuals is called the cost function.
How to calculate the cost function?
Suppose we have a point $(x_i, y_i)$, where $x_i$ may contain several features and $\hat{y}_i$ is the value predicted for it. At first thought, since the error is caused by the residuals, we may simply add them:
$$E = \sum_{i=1}^{N} (y_i - \hat{y}_i)$$
Since the calculation for multiple features would be complex, let us consider x to have one feature, so our error becomes
$$E = \sum_{i=1}^{N} \left(y_i - (m x_i + c)\right)$$
This method is inaccurate because residuals on opposite sides of the line cancel each other out.
Let’s take the absolute value of the error, so our new function becomes
$$E = \sum_{i=1}^{N} \left|y_i - (m x_i + c)\right|$$
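The cancellation problem, and how absolute values avoid it, can be illustrated with a small sketch. The four points and both candidate lines are made-up values chosen so that the arithmetic is easy to follow.

```python
# Four illustrative points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def residuals(xs, ys, m, c):
    """Actual value minus predicted value (m*x + c) for every point."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]


good = residuals(xs, ys, m=2.0, c=1.0)   # the line the points were built from
bad = residuals(xs, ys, m=0.0, c=4.0)    # a flat line through the mean of y

print(sum(good), sum(bad))               # 0.0 0.0 -> signed sums cannot tell them apart
print(sum(abs(r) for r in good))         # 0.0
print(sum(abs(r) for r in bad))          # 8.0 -> absolute values expose the bad line
```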
Consider another scenario.
Let the best-fit line shift a bit towards one point. The error function will not change, because the error is reduced in one term and increased by the same amount in another. To overcome this issue, we must square the errors.
Consider an example: let the original errors be 2 and 2, and let the errors after shifting be 1 and 3.
Using the old formula, our error would be
$$|1| + |3| = 4$$
which is the same as the original error, $|2| + |2| = 4$. After the change in the formula, the shift does affect the error. In the original case, it would be
$$2^2 + 2^2 = 8$$
whereas afterward, it will become
$$1^2 + 3^2 = 10$$
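The arithmetic in this example can be checked with a short sketch. The helper names `absolute_error` and `squared_error` are illustrative, not part of any library.

```python
def absolute_error(errors):
    """Sum of the absolute values of the individual errors."""
    return sum(abs(e) for e in errors)


def squared_error(errors):
    """Sum of the squares of the individual errors."""
    return sum(e ** 2 for e in errors)


original = [2, 2]   # errors before the line is shifted
shifted = [1, 3]    # errors after the line moves toward one point

print(absolute_error(original), absolute_error(shifted))  # 4 4
print(squared_error(original), squared_error(shifted))    # 8 10
```

The absolute error cannot distinguish the two lines, while the squared error penalizes the larger individual error.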
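So our error function would be
$$E = \sum_{i=1}^{N} \left(y_i - (m x_i + c)\right)^2$$
To find the best-fit line, the error or cost function needs to be minimized, so we
differentiate the cost function with respect to $m$ and $c$ and set each derivative to zero.
So our final derivative with respect to $m$ will be
$$\frac{\partial E}{\partial m} = -2 \sum_{i=1}^{N} x_i \left(y_i - (m x_i + c)\right) = 0$$
Upon dividing by $-2$ and by the number of training samples, $N$,
$$\frac{1}{N}\sum_{i=1}^{N} x_i y_i - m\,\frac{1}{N}\sum_{i=1}^{N} x_i^2 - c\,\frac{1}{N}\sum_{i=1}^{N} x_i = 0$$
Each term is effectively a mean, as one-dimensional data is taken for this case.
So, this equation can be written as:
$$\overline{xy} - m\,\overline{x^2} - c\,\bar{x} = 0$$
Similarly, differentiating with respect to $c$:
$$\frac{\partial E}{\partial c} = -2 \sum_{i=1}^{N} \left(y_i - (m x_i + c)\right) = 0$$
Which can be rewritten as
$$\sum_{i=1}^{N} y_i - m \sum_{i=1}^{N} x_i - N c = 0$$
Again dividing by $N$,
$$\bar{y} - m\,\bar{x} - c = 0$$
So finally, we get both equations, which are
$$\overline{xy} - m\,\overline{x^2} - c\,\bar{x} = 0 \quad\text{and}\quad c = \bar{y} - m\,\bar{x}$$
Upon substituting the value of $c$ in the first equation, we get
$$m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$$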
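As a numerical sanity check of this derivation, the sketch below computes the slope and intercept from means (the same expressions that the `fit` function in this article uses) and confirms that both partial derivatives of the squared-error cost are zero there, up to floating-point error. The data points and the helper names `mean`, `fit_line`, and `gradient` are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Slope and intercept computed from means of x, y, x*y and x**2."""
    xy_mean = mean([x * y for x, y in zip(xs, ys)])
    x2_mean = mean([x * x for x in xs])
    m = (xy_mean - mean(xs) * mean(ys)) / (x2_mean - mean(xs) ** 2)
    c = mean(ys) - m * mean(xs)
    return m, c


def gradient(xs, ys, m, c):
    """Partial derivatives of sum((y - (m*x + c))**2) w.r.t. m and c."""
    d_m = sum(-2 * x * (y - (m * x + c)) for x, y in zip(xs, ys))
    d_c = sum(-2 * (y - (m * x + c)) for x, y in zip(xs, ys))
    return d_m, d_c


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

m, c = fit_line(xs, ys)
d_m, d_c = gradient(xs, ys, m, c)
print(m, c)
print(abs(d_m) < 1e-9, abs(d_c) < 1e-9)  # True True: the gradient vanishes
```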
Now that we know the values of m and c for the best-fit line, we can try to code linear regression ourselves, using the same data we generated above.
Importing train_test_split and splitting the data:
from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=0.3)
Define a fit function that computes the values of m and c for the training data. It takes the training data as input and uses the formula we derived above.
def fit(X_train, Y_train):
    # fit the model: slope m and intercept c from the means of the training data
    numerator = (X_train * Y_train).mean() - X_train.mean() * Y_train.mean()
    denominator = (X_train ** 2).mean() - X_train.mean() ** 2
    m = numerator / denominator
    c = Y_train.mean() - m * X_train.mean()
    return m, c
Define a predict function that takes the test data, m, and c. It returns the predictions for the test data.
def predict(X_test, m, c):
    # apply the fitted line y = m * x + c to every test point
    Y_pred = m * X_test + c
    return Y_pred
m, c = fit(X_train, Y_train)
y_pred = predict(X_test, m, c)
The training data is passed to fit to obtain m and c, which are then passed to predict to compute the predictions for the test data.
print("The value of m and c are:", m, c)
The value of m and c are: 2.1667657711146124, 15.357261307201359
How to know the accuracy of the algorithm?
To determine the algorithm’s accuracy, we use the coefficient of determination ($R^2$).
Similar to the way we calculated the error in our predictions, the coefficient is determined as
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
- Where $y_i$ denotes the actual value
- $\hat{y}_i$ denotes the predicted value
- And $\bar{y}$ denotes the mean of the actual values
In other words, the score compares the squared errors of the predictions with the squared deviations of the actual values from their mean, i.e., it measures how much better the model does than always predicting the mean of the actual data.
def score(Y_test, Y_pred):
    # u: sum of squared residuals, v: total sum of squares around the mean
    u = ((Y_test - Y_pred) ** 2).sum()
    v = ((Y_test - Y_test.mean()) ** 2).sum()
    r2 = 1 - u / v
    return r2
We add this score function to our previous algorithm and generate the score, which results in a score of 81%.
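To build some intuition for the coefficient of determination, here is a minimal dependency-free sketch on made-up values. The function name `r_squared` is hypothetical. Perfect predictions give a score of 1, while always predicting the mean of the actual values gives 0.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    u = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    v = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - u / v


actual = [1.0, 2.0, 3.0, 4.0]   # illustrative values only

perfect = r_squared(actual, actual)
mean_only = r_squared(actual, [2.5, 2.5, 2.5, 2.5])

print(perfect)    # 1.0: every prediction matches the actual value
print(mean_only)  # 0.0: no better than predicting the mean
```

A model that does worse than predicting the mean produces a negative score under this definition.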
That was an example of how linear regression works from scratch.
The same work can be done using sklearn with the following code:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
# fit the model on the training data before predicting
lm.fit(X_train, Y_train)
Y_pred = lm.predict(X_test)
This gives us an accuracy score of 90% for test cases and 88% for training cases.
To test and run the code used, check out: https://www.kaggle.com/code/tanavbajaj/linear-regression-from-scratch.