A decision tree builds a model in the form of a tree structure, much like a flow chart. To predict an outcome, it passes each input through a series of decision points and, based on the result at each one, places the input in a bucket. In this article, we’ll talk about classification and regression decision trees, along with random forests.
A classification decision tree buckets (or classifies) each input into a category. In the example below, we decide whether to approve or reject a loan application.
The input data would be:
This would output a decision tree like the below:
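A classification tree of this kind behaves like a set of nested if/else checks. Here is a minimal sketch of that idea; the feature names (`income`, `has_defaulted`) and the 30,000 threshold are hypothetical and chosen only for illustration, not taken from the example data.

```python
def classify_loan(income: float, has_defaulted: bool) -> str:
    """Walk a tiny hand-written decision tree (hypothetical features and thresholds)."""
    # Decision point 1: a prior default leads straight to a "rejected" leaf
    if has_defaulted:
        return "rejected"
    # Decision point 2: hypothetical income threshold
    if income >= 30_000:
        return "approved"
    return "rejected"


print(classify_loan(income=45_000, has_defaulted=False))  # approved
```

In practice the decision points and thresholds are learned from the training data rather than written by hand.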
In decision tree regression, we want a numeric output from our decision tree rather than a categorical one. We might have a table like the one below, with various criteria that help determine the appropriate loan amount.
The partial decision tree below shows how this might work. If we follow the path below, we reach an outcome of 15,000. In the table, the rows that match these input parameters have actual outcomes of 10,000 and 20,000, so the model takes the average of the two as the predicted loan value.
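The averaging step can be written out directly. This is a small sketch of the arithmetic only, assuming the leaf holds exactly the two outcomes mentioned above; the variable names are illustrative.

```python
from statistics import mean

# Actual loan amounts of the training rows that end up in the same leaf
# (the 10,000 and 20,000 from the example)
leaf_outcomes = [10_000, 20_000]

# The leaf's prediction is the mean of those outcomes
predicted_loan = mean(leaf_outcomes)
print(predicted_loan)  # 15000
```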
A random forest is a collection of decision trees. Each input is run through every tree in the forest, and each tree provides a classification. We do this because, while single decision trees are easy to read, they tend to overfit: they perform well on their training data, but aren’t flexible enough to accurately predict outcomes for new data.
Each tree predicts the same thing, but each is built from a different bootstrapped dataset. Because we build lots of different trees from different samples of the data, the forest as a whole is much less dependent on the quirks of any one sample.
What is a bootstrapped dataset?
Bootstrapping is a method of sampling: we build a new dataset by randomly drawing values from the original one, putting each value back after it is drawn. So, let’s say we have a dataset: 55, 31, 42, 88, 12, 44, 19, 91.
Here, we would randomly draw a few numbers (say 55, 12 and 91), repeating the draw N times, where N depends on the size of the original dataset. Because each number is selected from, but not removed from, the original dataset, the same number can be drawn several times, so your bootstrapped sample may contain duplicates. That’s OK.
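Drawing with replacement can be sketched with Python’s standard library. The helper name `bootstrap_sample` is made up for this example, and setting the number of draws to the dataset size is an assumption (a common convention), since the text leaves N open.

```python
import random

data = [55, 31, 42, 88, 12, 44, 19, 91]


def bootstrap_sample(values, n_draws, rng):
    """Draw n_draws values with replacement; the original list is left unchanged."""
    return rng.choices(values, k=n_draws)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sample = bootstrap_sample(data, n_draws=len(data), rng=rng)
print(sample)  # values come from data and may include duplicates
```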
We don’t want the random draws to capture every row of our dataset: we need some rows left out, known as out-of-bag samples. We use these to validate our random forest. Without them, we would have no data to test the forest with that it had not already seen.
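One way to see which rows end up out-of-bag is to sample row positions rather than values, then collect the positions that were never drawn. This is a sketch under the assumption that we draw as many times as there are rows; the variable names are illustrative.

```python
import random

data = [55, 31, 42, 88, 12, 44, 19, 91]
rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Sample row indices (not values) with replacement so each row can be tracked
drawn_indices = [rng.randrange(len(data)) for _ in range(len(data))]
drawn_set = set(drawn_indices)

# In-bag rows are used to build the tree; duplicates are allowed
in_bag = [data[i] for i in drawn_indices]

# Rows whose index was never drawn are out-of-bag for this tree
out_of_bag = [data[i] for i in range(len(data)) if i not in drawn_set]

print("in-bag:", in_bag)
print("out-of-bag:", out_of_bag)
```

Each tree in the forest gets its own draw, so a row that is out-of-bag for one tree may be in-bag for another.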
How many trees do I need?
Generally speaking, the more trees we have in our forest, the more accurate our model will be, although the gains tend to level off as trees are added while computation time keeps growing. We can use the accuracy on our out-of-bag samples to decide how many trees we need. You can choose between a few trees built from relatively large samples and lots of trees built from smaller samples by comparing the accuracy of the model in both scenarios.
How do we know what the prediction is if each tree can provide a different output?
Once each tree has provided a classification, we take the classification with the most votes as the forest’s prediction. So imagine we run our loan application through our 3-tree forest.
We count the approved and rejected votes across all the trees, and the majority wins. So, in this case, the loan would be approved.
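The vote count can be sketched with a `Counter`. The individual votes below are assumed for illustration (two trees approve, one rejects), which matches the approved outcome described above.

```python
from collections import Counter

# Hypothetical votes from a 3-tree forest for one loan application
tree_votes = ["approved", "rejected", "approved"]

# Tally the votes and take the most common classification
vote_counts = Counter(tree_votes)
prediction, n_votes = vote_counts.most_common(1)[0]
print(prediction, n_votes)  # approved 2
```

Using an odd number of trees for a two-class problem avoids ties.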
How do we test our forest model?
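We can then test our model using our out-of-bag samples. Each out-of-bag sample is run through the trees that did not see it during training, and the accuracy of the model is the proportion of out-of-bag samples that were correctly classified. In the above example, if the correct result was approved and one of the 3 trees voted rejected, then 67% of the trees were right, but the majority vote is still approved, so the forest classified that sample correctly.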
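The out-of-bag accuracy calculation can be sketched as follows. The votes and true labels are made up for illustration, and the sketch assumes each sample is voted on only by the trees that did not train on it, which is why the vote lists differ in length.

```python
from collections import Counter


def majority_vote(votes):
    """Return the classification with the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical out-of-bag samples:
# (votes from the trees that did not train on the sample, true label)
oob_results = [
    (["approved", "approved", "rejected"], "approved"),  # forest is right
    (["rejected", "rejected"], "rejected"),              # forest is right
    (["approved", "rejected", "rejected"], "approved"),  # forest is wrong
    (["approved"], "approved"),                          # forest is right
]

# Accuracy = correctly classified out-of-bag samples / all out-of-bag samples
correct = sum(majority_vote(votes) == truth for votes, truth in oob_results)
oob_accuracy = correct / len(oob_results)
print(f"{oob_accuracy:.0%}")  # 75%
```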