Ekta Aggarwal

Random Forests Explained


Have you ever made a decision because of peer pressure? For example, suppose you are in a group of 10 people deciding whether to go on a trip. If 7 out of 10 vote to go, you base your own plan on the majority vote. Random Forests work on the same principle of majority voting.


Random Forests is an ensemble technique under supervised learning, used for both regression and classification problems.


Random Forests makes use of multiple decision trees. The outputs of these decision trees are gathered, and the final prediction is made by majority voting (for classification problems) or by averaging (for regression problems).

Each decision tree in a Random Forest is independent of the others, so the trees can be built in parallel. This makes Random Forest a bagging algorithm. To learn more about bagging and boosting you can refer to this tutorial: Bagging and boosting


To learn how to implement Random Forests in Python you can refer to this article: Random Forests in Python


Key Steps:

Let us suppose we have our data D, comprising n rows and m variables. The key steps for a Random Forest of 500 decision trees are:

  1. Randomly choose a subset S of the data D, with p (< m) variables chosen at random. Generally S contains two-thirds of the rows in D.

  2. Train a decision tree on this subset S with the p variables.

  3. Use the decision tree to make a prediction on the new instance.

  4. Repeat steps 1-3, say, 500 times. In this way, we get 500 predictions for the same instance.

  5. Use majority voting (for classification problems) or the average (for regression problems) to make the final prediction for the new instance.
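The steps above can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn's DecisionTreeClassifier on a synthetic dataset; note that scikit-learn's own RandomForestClassifier samples rows with replacement and picks random features at every split rather than once per tree, so this sketch follows the article's description, not the library internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
n, m = X.shape

n_trees = 500
p = int(np.sqrt(m))            # variables per tree (sqrt of m = 3 here)
subset_size = (2 * n) // 3     # two-thirds of the rows

x_new = X[:1]                  # treat the first row as a "new" instance
preds = []
for _ in range(n_trees):
    rows = rng.choice(n, size=subset_size, replace=False)  # step 1: random rows
    cols = rng.choice(m, size=p, replace=False)            # step 1: random variables
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[np.ix_(rows, cols)], y[rows])               # step 2: train the tree
    preds.append(int(tree.predict(x_new[:, cols])[0]))     # step 3: predict

final = np.bincount(preds).argmax()                        # step 5: majority vote
print("majority-vote prediction:", final)
```

Each tree sees a different random slice of rows and variables, which is exactly where the "random" in Random Forests comes from, as discussed next.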


What is 'Random' about Random Forests?

There are 2 aspects of randomness in Random Forests:

Randomness in subset selection: In step 1, we build each decision tree on a subset S of the data. The rows in this subset S are chosen at random.

Randomness in variable selection: In step 1, we select p variables at random out of the m independent variables, and these are used to grow the tree. The value of p is kept the same for all the trees in the forest. Generally, the default value of p is the square root of the number of variables m. For example, if there are 9 variables, then 3 variables are chosen at random for building each tree in the forest. This value can be tuned using grid search or cross validation.


This is done because if every tree were built on the entire dataset with all the variables, the trees would be nearly identical and highly correlated, and the forest would behave much like a single decision tree.


Hypertuning the parameters of Random Forests

While tuning a Random Forest, the main parameters to consider are:

  • Number of trees: The number of trees in a Random Forest affects both its accuracy and its training time. A forest with 50 trees runs quickly but may not be very accurate, while a forest with 500 trees may gain a lot of accuracy at the cost of a longer run time. Thus, finding a suitable number of trees is essential.

  • Number of variables in a tree: While building a decision tree in a Random Forest, we do not use all 'm' explanatory variables; rather, we use a subset of 'p' of them. By default 'p' = square root of 'm', but one can experiment with other values.

  • Maximum depth of the tree: It is essential to choose an appropriate depth for each decision tree. Trees with a small depth can underfit, while extremely deep trees can overfit the data.

  • Minimum number of samples to split: A node is split only if it contains at least this many samples. Too small a value can overfit the data, while too large a value can underfit it.

  • Minimum number of samples per leaf: When a node is split, one branch might end up with almost all the instances and the other with only one. Thus, we also tune the minimum number of samples that must be present in each child node after the parent node is split.
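All five parameters above can be tuned together with a grid search. The sketch below uses scikit-learn's GridSearchCV on a synthetic dataset; the grid values are illustrative choices, not recommendations, while the parameter names (n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf) are scikit-learn's actual names for the parameters discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Illustrative grid over the five parameters listed above
param_grid = {
    "n_estimators": [25, 100],      # number of trees
    "max_features": ["sqrt", 0.5],  # number of variables considered per split
    "max_depth": [3, None],         # maximum depth of each tree
    "min_samples_split": [2, 10],   # minimum samples needed to split a node
    "min_samples_leaf": [1, 5],     # minimum samples required in each leaf
}

# 3-fold cross validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Grid search is exhaustive, so with larger grids a randomized search over the same parameters is a common cheaper alternative.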


Out of Bag (OOB) Samples and OOB Error


For simplicity, let us say we have a dataset D with 1000 rows and 50 variables. Let us assume we are trying to build a Random Forest model with 500 trees.

Let's say the Random Forest uses 670 (two-thirds) rows, chosen at random, to build one decision tree. The remaining 330 (one-third) rows are that tree's out-of-bag (OOB) sample. Predictions are made on this OOB sample, and the error is calculated. Similarly, the OOB error is calculated for the other 499 decision trees.

Finally, we average these errors over all 500 trees. This average is known as the Out of Bag (OOB) error.
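scikit-learn can report this quantity directly. A small sketch with the 1000-row, 50-variable setup from the example (synthetic data here); note that scikit-learn bootstraps rows with replacement, so roughly 63% of rows land in each tree's bag rather than an exact two-thirds, but the OOB idea is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1000 rows and 50 variables, matching the example above
X, y = make_classification(n_samples=1000, n_features=50, random_state=1)

# oob_score=True makes scikit-learn score each tree on its out-of-bag rows
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)

# oob_score_ is the OOB accuracy, so OOB error = 1 - accuracy
oob_error = 1 - rf.oob_score_
print(f"OOB error: {oob_error:.3f}")
```

Because the OOB rows were never seen by the tree that predicts them, the OOB error behaves like a built-in validation estimate and needs no separate hold-out set.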


Feature Importance


Random Forests can be used to understand the importance of variables.

Assume there are p input features for each tree in the Random Forest.

  1. Take a variable, say Xi, and note the model's original OOB error rate.

  2. Now randomly permute, i.e., shuffle the order of, the values of Xi. For example, if Xi contains 0, 0, 0, 1, 1, then after permutation it might contain 1, 0, 0, 1, 0 — the same values in a different order.

  3. Calculate out-of-bag predictions using these permuted values of Xi (keeping values of other variables unchanged).

  4. Calculate the new OOB Error after these permuted values of Xi.

  5. Variable importance for Xi is calculated as:

OOB Error (after permuting the values of Xi) - original OOB Error


Similarly, this is done for all the p variables.


The larger the increase in error when a variable's values are permuted, the more important that variable is.
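A hedged sketch of this idea with scikit-learn: the library's permutation_importance helper shuffles each feature and measures the drop in score, as in steps 2-5 above. One difference from the description here is that it scores on a held-out test set you supply rather than on the OOB samples, but the principle — permute one variable, keep the rest fixed, compare errors — is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=9, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# For each feature: shuffle its values, re-score the model, and record the
# drop in accuracy, repeated n_repeats times to average out shuffle noise
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```

Features whose permutation barely changes the score get importances near zero, while the informative features rise to the top of the ranking.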

