“Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y.”
Stated in formal terms, this definition of polynomial regression can seem overwhelming, so let’s break it down in simpler terms.
First, let’s understand the assumptions for polynomial regression:
- The dependent variable can be explained by a curvilinear relationship with the independent variables.
- The independent variables are completely independent of each other.
Real-life situations where polynomial dependence is present are:
- Change in atmospheric pressure to predict the weather
- The salary of employees relative to their position in the company.
To understand it, let’s start by generating some random example data and build up our understanding of polynomial regression as we go.
To run and check the code:
https://www.kaggle.com/tanavbajaj/polynomial-regression
import numpy as np
import matplotlib.pyplot as plt
Importing required libraries
np.random.seed(69)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
Next, we set a seed so that NumPy generates the same sequence of random numbers on every run, and then generate the values for x and y.
Upon plotting it, we see
plt.scatter(x, y)
plt.show()
Now let’s try to model it using linear regression and plot the best-fit line.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)

plt.scatter(x, y, s=10)
plt.plot(x, y_pred, color='r')
plt.show()
If we observe the plot, the best-fit line tries to capture the points but is unable to follow the curvature of the data.
If we calculate the root mean squared error (RMSE), we get:
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)
We get an error of
24.188793024005992
This is quite high. Now, if we convert the features to their polynomial counterparts using the sklearn library, we will get a far lower error. Let’s see that.
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)

rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
print(rmse)
We get an output of
9.51210712644908
This time the error is significantly reduced. If we try to plot the curve:
import operator

plt.scatter(x, y, s=10)

# sort the values of x (and the corresponding predictions) before the line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_poly_pred), key=sort_axis)
x, y_poly_pred = zip(*sorted_zip)

plt.plot(x, y_poly_pred, color='m')
plt.show()
We will get a curve that looks more like this:
We import the operator library so that we can sort the values of x along with their related y values, and the curve is plotted in an easy-to-visualize manner. If we try to plot it without sorting, we might get a curve that resembles something like this.
The low root mean squared error indicates that the polynomial function fits our data better.
So, in simpler words, polynomial regression finds the best-fit curve for the given nonlinear data, one that tries to generalize the data as well as possible.
In other words, polynomial regression finds the coefficients for higher-degree features, so the feature columns are converted from lower degrees to higher degrees. For example, suppose the data has a column for age and a column for crime; then a higher-degree feature might be the product of the crime and age columns.
So the equation y = mx + b used in linear regression converts into a higher-degree equation, for example y = ax² + bx + c.
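As a quick illustration of this feature conversion, here is a minimal sketch of how sklearn’s PolynomialFeatures could expand two hypothetical columns, age and crime, into higher-degree and interaction features (the values below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# hypothetical data: two feature columns, "age" and "crime"
X = np.array([[25, 3],
              [40, 7],
              [60, 1]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# each row now contains: age, crime, age^2, age*crime, crime^2
print(poly.get_feature_names_out(['age', 'crime']))
print(X_poly)

The age*crime column is exactly the kind of product feature described above, and the squared columns are the higher-degree versions of the original ones.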
Maths behind Polynomial Regression:
The maths behind linear and polynomial regression is quite similar. In our linear regression release, m and b were found for the equation y = mx + b using differentiation. Similarly, in polynomial regression, the values of a, b, and c are found for the equation y = ax² + bx + c.
The partial derivatives for a, b and c, respectively, are:
For finding the minima, equate the derivatives to 0
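As a sketch of the standard derivation (assuming the error E is the sum of squared residuals over the n data points, E = Σᵢ (yᵢ − axᵢ² − bxᵢ − c)²), the three conditions are:

\begin{aligned}
\frac{\partial E}{\partial a} &= -2\sum_{i=1}^{n} x_i^2\left(y_i - a x_i^2 - b x_i - c\right) = 0\\
\frac{\partial E}{\partial b} &= -2\sum_{i=1}^{n} x_i\left(y_i - a x_i^2 - b x_i - c\right) = 0\\
\frac{\partial E}{\partial c} &= -2\sum_{i=1}^{n} \left(y_i - a x_i^2 - b x_i - c\right) = 0
\end{aligned}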
Representing these equations in matrix form looks like this:
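Under the same sum-of-squared-errors assumption, a standard way to write that system in matrix form is:

\begin{pmatrix}
\sum x_i^4 & \sum x_i^3 & \sum x_i^2\\
\sum x_i^3 & \sum x_i^2 & \sum x_i\\
\sum x_i^2 & \sum x_i & n
\end{pmatrix}
\begin{pmatrix} a\\ b\\ c \end{pmatrix}
=
\begin{pmatrix}
\sum x_i^2 y_i\\
\sum x_i y_i\\
\sum y_i
\end{pmatrix}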
This was an example of how it works for an equation with the power of 2. Similar matrices can be created for all powers, and a generalized equation looks like this:
Here m represents the degree of the polynomial (highest power of the equation), and n represents the number of known data points.
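As a rough sketch of the generalized equation in code (the helper fit_polynomial below is ours for illustration, and it assumes the same least-squares setup), the system can be built and solved with NumPy:

import numpy as np

def fit_polynomial(x, y, m):
    """Fit a degree-m polynomial to n data points by solving the normal equations."""
    # A[j][k] = sum of x_i^(j+k), B[j] = sum of x_i^j * y_i
    A = np.array([[np.sum(x ** (j + k)) for k in range(m + 1)] for j in range(m + 1)])
    B = np.array([np.sum((x ** j) * y) for j in range(m + 1)])
    # coefficients c_0 ... c_m of y = c_0 + c_1 x + ... + c_m x^m
    return np.linalg.solve(A, B)

# usage: compare against NumPy's built-in polyfit
x = np.linspace(-3, 3, 20)
y = 1 - 2 * x + 0.5 * x ** 3 + np.random.normal(0, 0.1, 20)
print(fit_polynomial(x, y, 3))
print(np.polyfit(x, y, 3)[::-1])

Note that np.polyfit returns coefficients with the highest power first, so it is reversed here to match the ordering used by fit_polynomial.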
Limitations of Polynomial Regression:
One of the main problems of polynomial regression is that one or two outliers can significantly impact the equation. There are also far fewer model validation methods for detecting outliers in a non-linear regression problem than in a linear regression one.
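As a quick, self-contained sketch of this sensitivity (using synthetic data similar in spirit to the example above), shifting a single point noticeably changes the fitted coefficients:

import numpy as np

np.random.seed(0)
x = np.linspace(-3, 3, 20)
y = x - 2 * x ** 2 + 0.5 * x ** 3 + np.random.normal(0, 1, 20)

# fit a degree-3 polynomial to the clean data
clean_coeffs = np.polyfit(x, y, 3)

# introduce a single large outlier and refit
y_outlier = y.copy()
y_outlier[10] += 50
outlier_coeffs = np.polyfit(x, y_outlier, 3)

print("clean fit:   ", np.round(clean_coeffs, 3))
print("with outlier:", np.round(outlier_coeffs, 3))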