Learning Data Science: Day 12 - Decision Tree

Illustration of Decision Tree

I In the previous story, we have talked about Support Vector Machine. SVM categorized as supervised learning. Today we are going to talk about Decision Tree which is also another supervised learning technique.

Decision Tree

Decision tree is a type of supervised learning algorithm. Mostly used in classification problems. It works for both categorical and continuous input and output variables. We split the population or sample into two or more homogeneous sets based on splitters. In classification, decision tree only checks one feature at a time. Some advantages of the decision tree are it is fast to train and predict, it is also easy to interpret and understand.

Plotting decision tree

In a decision tree, we always have the first node that available on the top of the decision tree called root node. We split nodes based on certain condition, this process is known as splitting. The nodes that indicate splitting process are called decision nodes. The splitting processes will come to the end when the node can’t be split anymore. The node on the bottom of the tree will define what the class for the particular condition it is usually called leaf or terminal node.

If we plot the decision tree on the left we can get a plot that looks like the figure on the right. The boundary lines are the splitter to decide which data points classified to which class.

Node Purity

Since the decision tree only able to work on one feature at a time, we will need a way to choose the feature to be the query and the threshold to choose. In extreme cases, the decision tree will end up with leaf nodes for every point in the sample, but that may cause overfitting.

Gini Impurity

Gini impurity is the measurement of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The equation for gini impurity available below.

For example, if we have 4 red points, 3 green points, and 3 blue points. The class probabilities for each class would be 4/10 for red, 3/10 for green, and 3/10 for blue. Then the misclassification for red points would be 4/10 * (3/10 + 3/10) or equal to 0.24, and for green and blue points would be 3/10 * (4/10 + 3/10) or equal to 0.21. When we add all of them it would be counted as gini impurity or in this case 0.24 + 0.21 + 0.21 or 0.66.

Node Purity Gain

Basically, the giny impurity only compute the purity of a single node. But, what we need is to understand does splitting this node increase the purity of the two nodes we got. Node purity gain is comparing the giny impurity of the parent node and its child. Judging one node is not enough and we need to judge whether a split is a good split or not. The illustration would be something like this.

Equation for Node Purity Gain

To calculate it, we use the giny impurity of the parent (A), and subtract it with the giny impurity of the children (B and C). We weight them according to the number of data points classified to each node (B and C).

Tree Pruning

As we know, decision trees are prone to overfitting. To reduce overfitting we can use tree pruning. To put it more simply tree pruning is the opposite of splitting. So, the idea is that we grow the whole tree and then look at the classification and we remove the decision from the tree. Basically, we are moving backward. Then we need to make a prediction for the merged cell because it’s not pure anymore. What we usually do is to take the majority for the merged cell or probability.

Left: before tree pruning. Right: after tree pruning

Final Words

Today we only covered the basic of Decision Tree. There will be more topic about the improvement of basic decision tree such as random forest. Let me know what do yo think about this story on the response below or if you have suggestions for future improvement of this series. Thanks for reading!

Half Data Engineer, Half Software Engineer