Learning Data Science: Day 13 - Random Forest
In the previous story, we learned about the Decision Tree and several of its advantages: it trains and predicts fast, which SVM and kNN don't, and it is easy to understand and interpret. But what if a single decision tree doesn't perform well? We can take the decision tree to a different level by learning multiple trees, each trained on slightly different training data, so that every tree turns out slightly different from the others. How? By applying Bootstrap Aggregating.
Bootstrap
The problem with learning multiple trees is the training data. In some cases it is easy to find a big dataset, but sometimes it is not. If collecting the data requires research that takes a long time or costs a lot, that becomes a problem. One good alternative is the bootstrap. Basically, the bootstrap tries to make the most of the dataset we already have.
The illustration looks like the figure above. Let's say we have N data points in our dataset. From that dataset we can create bootstrap samples by drawing data points from the original training sample. Even though we take them from the training sample, we pick the data points randomly and with replacement. We repeat this until we have the number of bootstrap samples we want.
A fairly simple example: let's say we have
Training sample: 1, 2, 3
Now we want to create two bootstrap samples, and they might end up with the values
First bootstrap sample: 1, 1, 2
Second bootstrap sample: 2, 3, 3
If you look closely, all the values in the bootstrap samples are drawn randomly from the training sample, so some values occur more than once within a sample and some appear in more than one bootstrap sample. In this case, the value that appears in both bootstrap samples is 2. That is perfectly fine, as the sketch below shows.
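As a rough sketch, here is how bootstrap samples could be drawn with NumPy (the array contents are just the toy values above, and the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
training_sample = np.array([1, 2, 3])

# Draw two bootstrap samples: sample N points *with replacement*
# from the original N training points.
n_bootstrap_samples = 2
bootstrap_samples = [
    rng.choice(training_sample, size=len(training_sample), replace=True)
    for _ in range(n_bootstrap_samples)
]

for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i}: {sample}")
```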
Some of you might wonder whether we can still do cross-validation. The problem is that the bootstrap datasets overlap a lot, so ordinary cross-validation doesn't apply cleanly. As a piece of background, the probability that a given data point appears at least once in N draws is about 0.632.
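To see where 0.632 comes from: the probability that a specific point is not picked in one draw is 1 − 1/N, so the chance it never appears in N draws is (1 − 1/N)^N, which approaches 1/e for large N. A quick check (the value of N here is arbitrary):

```python
import math

N = 1000  # arbitrary sample size, just for illustration
p_at_least_once = 1 - (1 - 1 / N) ** N
print(round(p_at_least_once, 3))   # ~0.632
print(round(1 - 1 / math.e, 3))    # the limit as N grows: 0.632
```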
Bootstrap Aggregating
Bootstrap aggregating, often shortened to bagging, basically means using bootstrap samples to learn multiple decision trees. Because the bootstrap datasets differ slightly from one another, the trees will never be exactly the same; there will always be differences. This makes the probability of getting good trees higher than if we had only a single decision tree.
With different decision trees, we also get different class boundaries. The picture below is an example of the bagging result.
The light green lines indicate the individual boundary lines created by each decision tree. From those multiple boundary lines we take the average, which gives a single, smoother boundary line, indicated by the bold green line.
By using bagging, we reduce the overfitting of the model. Bagging is not only applicable to decision trees; it can be used with other kinds of classifiers as well. The catch is that bagging does not improve linear models, since averaging linear models just gives us the same kind of linear line. Another advantage of bagging is that it is easy to parallelize, because the bootstrap samples don't depend on each other. A sketch of bagging in code follows below.
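Here is an illustrative sketch using scikit-learn's BaggingClassifier on a synthetic dataset (both the dataset and the parameter values are my own choices for demonstration, not something prescribed by bagging itself):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, just for demonstration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree is trained on its own bootstrap sample, and
# predictions are aggregated by majority vote. BaggingClassifier
# uses a decision tree as its default base estimator.
bagging = BaggingClassifier(
    n_estimators=50,   # number of trees
    bootstrap=True,    # draw bootstrap samples
    n_jobs=-1,         # trees are independent, so they parallelize easily
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```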
Random Forest
Random forest takes bagging to a different level and is one of the best classification algorithms. The basic concept is that a group of weak learners can form a strong learner. Rather than relying on one deep tree, a random forest aggregates the output of many trees, each trained on its own bootstrap sample and evaluating only a random subset of the features at each split. This extra degree of randomness makes the probability of getting good trees even higher.
In a random forest, all trees are fully grown and there is no pruning. The main parameters we can tune are the number of trees and the number of features considered at each split, as in the sketch below.
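A minimal sketch with scikit-learn's RandomForestClassifier (again, the synthetic dataset and parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The two main knobs: the number of trees and the number of features
# considered at each split. Trees are grown fully (no pruning).
forest = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_features="sqrt",  # random subset of features per split
    max_depth=None,       # fully grown trees, no pruning
    n_jobs=-1,
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```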
Out of Bag Error
It is a technique to measure the error of a random forest. Each tree is trained on a bootstrap sample, which leaves roughly one third of the data points out of that tree's construction. While building the forest, we test each tree on the data points it was not trained on. It is very similar to cross-validation, except that it is measured during training.
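In scikit-learn this out-of-bag estimate is available via oob_score=True; a short sketch, again on a synthetic dataset of my own choosing:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Each tree is evaluated on the ~1/3 of points left out of its
# bootstrap sample, giving a validation-like score during training.
forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    random_state=42,
)
forest.fit(X, y)
print("OOB score:", forest.oob_score_)
print("OOB error:", 1 - forest.oob_score_)
```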
Wrap Up
Today, we discussed an improvement over an ordinary decision tree called bagging, and we took it to another level with random forest. If you know more about this topic that I haven't talked about yet, please let me know in the responses below, or give me suggestions for my future writing. Happy learning!