The decision tree is one of the most commonly used algorithms in supervised learning. It can be used for both regression and classification; when we use it for regression problems, it is called a regression tree.
In real life, decision trees are very popular algorithms: they are used to predict high-occupancy dates for hotels, estimate the gross margins of companies, select which flight to travel on, and the list goes on and on.
Now, what is a decision tree? Thinking intuitively, we might conjure up an image like this.
In certain situations in life, we already think in decision trees. For example: handling a midnight craving for food.
This is a perfect intuition for a decision tree.
A decision tree is a hierarchical structure of nodes and branches that reaches a conclusion by asking a series of questions at different stages.
To understand the regression tree, we first need to understand a few terms; a small code sketch after the list makes them concrete:
- Node – In the figure above, the yellow circles denote nodes. Nodes play an integral part in decision-making: a node is where the algorithm decides which pathway to follow next.
- Root node – The root node is the starting node or the topmost node from where the tree originates. It acts as a starting point for the algorithm.
- Leaf node – The final node or the output node, which acts as the termination point in the algorithm, is known as the leaf node.
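To make these terms concrete, here is a minimal, hypothetical sketch of how such a node could be represented in code. It is only for intuition and is not how scikit-learn actually stores its trees.

```python
class Node:
    """A single decision node: either asks a question or holds a prediction."""

    def __init__(self, threshold=None, left=None, right=None, value=None):
        self.threshold = threshold  # question asked here: "is x <= threshold?"
        self.left = left            # branch followed when the answer is yes
        self.right = right          # branch followed when the answer is no
        self.value = value          # prediction stored in a leaf node

    def is_leaf(self):
        # a node with no children is a leaf (i.e. an output node)
        return self.left is None and self.right is None
```

The root node is simply the topmost `Node` object, and a prediction is made by walking from it down to a leaf.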
If the decision tree asks logical questions at each node, how can it predict the continuous values?
We know regression algorithms use the RSS (residual sum of squares, i.e. RSS = Σ(yᵢ − ŷᵢ)²) to measure how far the predictions are from the actual values, so at each node the regression tree asks whether a split reduces the RSS or not.
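A small NumPy sketch of this formula (the function name `rss` is just an illustrative helper, not something from a library):

```python
import numpy as np

def rss(y_actual, y_predicted):
    """Residual sum of squares: the sum of squared prediction errors."""
    return np.sum((y_actual - y_predicted) ** 2)
```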
Let’s consider an example. Let the data be:
| X  | Y    |
|----|------|
| 1  | 1    |
| 2  | 1.2  |
| 3  | 1.3  |
| 4  | 1.5  |
| 5  | 1.5  |
| 6  | 5    |
| 7  | 5.7  |
| 8  | 5.6  |
| 9  | 5.4  |
| 10 | 5.9  |
| 11 | 15   |
| 12 | 15.4 |
| 13 | 15.2 |
| 14 | 15.8 |
| 15 | 15.6 |
| 16 | 7.7  |
| 17 | 7.5  |
| 18 | 7.11 |
| 19 | 7.4  |
| 20 | 7.9  |
If we try to plot this data:
Notebook to run and test code: https://www.kaggle.com/code/tanavbajaj/decision-tree/notebook
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array(range(1, 21))
y = np.array([1, 1.2, 1.3, 1.5, 1.2, 5, 5.7, 5.6, 5.4, 5.9,
              15, 15.4, 15.2, 15.8, 15.6, 7.7, 7.5, 7.11, 7.4, 7.9])
plt.scatter(x, y)
plt.show()
```
Now let us try to apply linear regression to this data:
```python
from sklearn.linear_model import LinearRegression

x = x.reshape(-1, 1)             # sklearn expects a 2-D feature array
reg = LinearRegression().fit(x, y)

plt.scatter(x, y)                # original data points
plt.plot(x, reg.predict(x))      # fitted regression line
plt.show()
```
We can observe that the best line does not capture all the points well; in this scenario, a decision tree would be a better option.
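One way to back up this visual impression is to compute the RSS of the linear fit, using the `rss` helper sketched earlier (this assumes the `reg` object fitted above):

```python
# total squared error of the linear model on the full dataset
residuals = y - reg.predict(x)
print((residuals ** 2).sum())
```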
How does the regression tree work?
It starts by taking the average of the first two values and splits the data at that point. It then predicts every value using this split and calculates the RSS for it. Next, it takes the average of the second and third values and recalculates the RSS. It repeats this process for every pair of adjacent data points and finally selects the split that gives the least RSS.
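To make the search concrete, here is a minimal sketch of that procedure. It is a simplified illustration, not scikit-learn's actual implementation, and it expects plain 1-D arrays, so `x` is flattened before the call (it was reshaped for sklearn above):

```python
import numpy as np

def best_split(x, y):
    """Try the midpoint between every pair of adjacent x values and
    return the threshold whose split gives the lowest total RSS."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(x_sorted)):
        threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # each side of the split is predicted by its own mean value
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_threshold, best_rss = threshold, rss
    return best_threshold, best_rss

print(best_split(x.ravel(), y))
```

On our data this should pick a threshold of 10.5, right where the y values jump from around 5 to around 15.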
Ultimately, our tree will end up creating a split for every single value in the dataset. Is this a good thing? No. If we visualize this, it will look something like the first plot, which traces every data point and is far from the best fit shown in the second plot.
Since the first graph is far too overfitted and might lead to wrong results, how can we avoid overfitting?
We can stop the overfitting by constraining the tree, for example by specifying its maximum depth, i.e. how many levels are allowed between the root node and the leaf nodes, or the minimum number of samples a node must contain before a split is attempted.
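In scikit-learn these constraints correspond to parameters such as `max_depth` and `min_samples_split`; the values below are only illustrative, not tuned:

```python
from sklearn.tree import DecisionTreeRegressor

# limit the tree to 3 levels and require at least 4 samples in a node
# before it is allowed to split further (illustrative values)
constrained_tree = DecisionTreeRegressor(max_depth=3, min_samples_split=4)
constrained_tree.fit(x, y)
```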
Now let’s see how a regression tree performs better than linear regression in some cases.
If we split the data into training and test sets, retrain the linear regression, and plot its predictions against the actual test values, we get a plot like this:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=55)

reg = LinearRegression().fit(X_train, y_train)

plt.scatter(X_test, y_test, color="red")                 # actual test values
plt.scatter(X_test, reg.predict(X_test), color="blue")   # linear regression predictions
plt.show()
```
The red dots are the actual test values, and the blue dots are the linear regression predictions.
On the same graph, if we import the decision tree regressor from sklearn, fit it on the training data, and plot its predictions against the test data, we might get a plot like this:
```python
from sklearn.tree import DecisionTreeRegressor

# a shallow tree (max_depth=3) to keep the model from overfitting
regressor = DecisionTreeRegressor(random_state=4, max_depth=3)
regressor.fit(X_train, y_train)
```
We can see that the predicted results are closer to the actual values.
In this graph, red dots denote the actual values, green dots the values predicted by the regression tree, and blue dots the values predicted by linear regression.
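The code for that combined plot is not shown above; a minimal sketch that reproduces it from the objects already defined (`reg` for linear regression, `regressor` for the regression tree) could look like this:

```python
import matplotlib.pyplot as plt

plt.scatter(X_test, y_test, color="red", label="actual values")
plt.scatter(X_test, regressor.predict(X_test), color="green", label="regression tree")
plt.scatter(X_test, reg.predict(X_test), color="blue", label="linear regression")
plt.legend()
plt.show()
```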
Now let’s visualize the decision tree built by the algorithm
```python
from sklearn.tree import export_graphviz

# export the fitted tree to a .dot file for visualization
export_graphviz(regressor, out_file='tree.dot', feature_names=['x value'])
```
We get the decision tree used by the algorithm; the image shows the question asked and the value predicted at every node.
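The .dot file needs Graphviz to be rendered as an image; if Graphviz is not installed, scikit-learn's built-in `plot_tree` can draw the same fitted tree directly with matplotlib (a small sketch, assuming the `regressor` fitted above):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(regressor, feature_names=['x value'], filled=True)
plt.show()
```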
A combination of decision trees gives us a random forest algorithm, which is the topic for the next release.