Ridge regression


In layman’s terms, Ridge regression adds one more term to linear regression’s cost function in order to reduce overfitting and, with it, the total error.

Ridge regression is a model-tuning method used to analyze data that suffer from multicollinearity. It performs L2 regularization. When multicollinearity occurs, the least-squares estimates remain unbiased, but their variances are large, so the predicted values can end up far from the actual values.

Before we jump into bias and variance and the relation between them, we first need to understand overfitting. An overfit model captures the noise in the data, i.e., it memorizes the trend in the training data almost perfectly, but the drawback is that it tends to perform badly on the test dataset, because the model fits the training dataset too closely. So, if a model ever shows 100% accuracy on the training dataset, be suspicious and check for mistakes.

To counter this, we can control the coefficients by using regularization, i.e., remove the features that are not necessary or penalize the algorithm for wrong predictions and large errors.
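
As a rough illustration of overfitting (not from the original article; it assumes scikit-learn and a synthetic dataset), an unrestricted decision tree can score 100% on the training set while doing noticeably worse on held-out data:

# Sketch: a model that memorizes the training data (assumes scikit-learn is installed)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0)            # no depth limit -> can memorize
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))   # typically 1.0
print("test accuracy: ", tree.score(X_test, y_test))     # noticeably lower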

To define bias formally: “the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated.” If that sounds confusing, in simpler terms bias is said to be high when a model performs poorly even on the training dataset.

Variance, in the same simplified terms, is said to be high when the model performs poorly on the test dataset.

In most scenarios, there is a trade-off between bias and variance in any model we build.

Picture a graph of bias and variance against model complexity: high bias and low variance characterize a low-complexity (simple) model, whereas low bias and high variance characterize a highly complex model. Also keep in mind that at both extremes, the very simple model and the highly complex model, the total error (bias + variance) will be at its highest.

Low bias and low variance give a balanced model, whereas high bias leads to underfitting and high variance leads to overfitting.
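
A minimal sketch of this trade-off, assuming scikit-learn and a made-up sine-plus-noise dataset: as the polynomial degree (model complexity) grows, the training error keeps falling while the test error eventually rises again.

# Sketch of the bias-variance trade-off: training vs. test error as complexity grows
# (the synthetic data and degree range are illustrative assumptions)
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 120)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=120)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 10, 20):                                            # low -> high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))      # keeps falling
    test_mse = mean_squared_error(y_test, model.predict(X_test))         # typically falls, then rises
    print(degree, round(train_mse, 3), round(test_mse, 3))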

What is regularization?


When we fit a regression model, it estimates a coefficient for each feature of the dataset. Regularization pushes those coefficients towards 0, which reduces the complexity of the model and makes it generalize better to unseen data points. This reduces the model’s overfitting and brings down the total error.

For example, if we consider OLS (ordinary least squares), its cost is the residual sum of squares, given by

RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)²

But notice that this cost accounts only for the fit on the training data (the bias) and not for the variance, so a flexible model will keep driving the training error down and overfit the data.
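
A small NumPy-only sketch of fitting OLS and computing its RSS; the data and numbers are illustrative assumptions, not taken from the article.

# Sketch: ordinary least squares and its residual sum of squares (NumPy only)
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

X1 = np.column_stack([np.ones(len(X)), X])       # add an intercept column
beta = np.linalg.lstsq(X1, y, rcond=None)[0]     # OLS coefficient estimates
rss = np.sum((y - X1 @ beta) ** 2)               # residual sum of squares
print(beta, rss)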

How does regularization help in reducing bias?


In regularization, we add a penalty term to the cost function, which reduces the model complexity; that is, the cost function of the model becomes OLS + penalty.
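
A tiny sketch of that combined cost, written as a plain Python function with an L2 penalty as a concrete (assumed) choice; the function name and toy numbers are illustrative, not from any library.

# Sketch: regularized cost = RSS + penalty (here an L2 penalty on the coefficients)
import numpy as np

def regularized_cost(beta, X, y, lam):
    rss = np.sum((y - X @ beta) ** 2)      # ordinary least-squares term
    penalty = lam * np.sum(beta ** 2)      # penalty term on the coefficient sizes
    return rss + penalty

# Toy usage with made-up numbers
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, 0.25])
print(regularized_cost(beta, X, y, lam=1.0))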

What is ridge regression?


In ridge regression, the penalty term is the product of the lambda parameter (λ) and the sum of the squared weights, that is,

Penalty = λ Σⱼ βⱼ²

This penalty term is known as the shrinkage penalty, and lambda is known as the tuning parameter. If the penalty is zero (λ = 0), ridge reduces to ordinary least squares, so the variance stays high and the model can still overfit. If the penalty is made too high, the penalty term dominates and the model effectively ignores the features: just as adding 1 to an enormous number changes it negligibly, when λ is huge the feature coefficients contribute negligibly to the cost, so the model no longer takes them into consideration.
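
A sketch of this behaviour using scikit-learn’s Ridge, where the tuning parameter is called alpha; the data and alpha values are arbitrary assumptions. As alpha grows, the coefficients shrink towards zero.

# Sketch: how the tuning parameter (alpha in scikit-learn) shrinks ridge coefficients
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for alpha in (0.01, 1.0, 100.0, 10000.0):    # near-zero alpha behaves like OLS; huge alpha drowns out the features
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_.round(2))       # coefficients move towards 0 as alpha grows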

Ridge regression does not cause sparsity. 

Sparsity refers to a feature’s coefficient dropping exactly to 0, that is, the feature no longer plays any part in the model.
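
A quick way to see this, again assuming scikit-learn and synthetic data: ridge leaves essentially no coefficients at exactly zero, while lasso (previewed here only for contrast) sets several of them to zero.

# Sketch: ridge shrinks coefficients but rarely makes them exactly zero; lasso does
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)
print("ridge zeros:", (ridge.coef_ == 0).sum())   # usually 0 - no sparsity
print("lasso zeros:", (lasso.coef_ == 0).sum())   # several coefficients dropped to exactly 0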

Ridge regression is also known as L2 regression, since it uses the L2 norm for regularization. In ridge regression, we strive to minimize the following function with respect to the coefficients βⱼ:

Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)² + λ Σⱼ βⱼ²

The first term is the RSS, or residual sum of squares; the second term is the penalty term, which penalizes the model based on the sum of the squared coefficients.

For ease of solving, let p = 1 (a single predictor) and drop the intercept, so the objective becomes

Σᵢ (yᵢ − βxᵢ)² + λβ²

Expanding the square, the equation we get is

Σᵢ yᵢ² − 2β Σᵢ xᵢyᵢ + β² Σᵢ xᵢ² + λβ²

For a minimum, we set the derivative with respect to β to zero:

−2 Σᵢ xᵢyᵢ + 2β Σᵢ xᵢ² + 2λβ = 0

which, upon solving, results in

β = (Σᵢ xᵢyᵢ) / (Σᵢ xᵢ² + λ)

But for sparsity we would need β = 0, and this will happen only when Σᵢ xᵢyᵢ = 0, i.e., when the feature is completely uncorrelated with the target. For any finite λ the denominator only grows larger, so ridge shrinks β towards zero but never forces it exactly to zero.
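
As a sanity check of this closed-form result (assuming scikit-learn and made-up data), the formula matches Ridge fitted without an intercept on a single feature:

# Sketch: verify the one-feature, no-intercept ridge solution beta = sum(x*y) / (sum(x^2) + lambda)
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
lam = 5.0

beta_closed_form = np.sum(x * y) / (np.sum(x ** 2) + lam)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(x.reshape(-1, 1), y).coef_[0]
print(beta_closed_form, beta_sklearn)    # the two agree up to numerical precision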

Another overfitting-prevention method, lasso regression, will be discussed in another release.

The main difference between the two is that lasso regression is preferred when only a few of the variables, i.e., a small number of predictors, actually affect the output, whereas ridge regression is effective when there are many parameters and most of them are significant.

This is because lasso regression eliminates some variables from the model, while ridge regression introduces a bias into the parameters to reduce variance.

There also exists a model called Elastic Net, a combination of Lasso and Ridge regression.
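
A brief sketch of Elastic Net in scikit-learn (the parameter values are arbitrary): l1_ratio mixes the two penalties, so some coefficients shrink ridge-style while others drop to zero lasso-style.

# Sketch: Elastic Net mixes the lasso (L1) and ridge (L2) penalties via l1_ratio
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # l1_ratio=1 -> pure lasso, l1_ratio=0 -> pure ridge
print(enet.coef_.round(2))                             # some coefficients shrink, some drop to zero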
