When you attend a data mining course, even from one of the leading data mining software vendors (like SPSS, SAS or Salford Systems), you learn how to push the buttons of the tool and the basics of the algorithms. An advanced course consists of more advanced algorithms like neural nets or SVMs.
But one of the oldest algorithms is decision trees. And yes, in the advanced course you briefly learn about newer ways of working with decision trees, like bagging or boosting.
But there is more.
In this post, the advantages and drawbacks of decision trees will show you why I love working with them; in a second part I will show you the secrets of how to get better predictions from decision tree models.
Advantages :
- overfitting (but you can control it and use it to enhance the model quality !)
- readability : simple if-then statements
- simple algorithm
- robustness : preprocessing and cleansing is a cumbersome job, but decision trees work well without it !!
- flexibility : the settings allow you to build anything from very small to very large trees
Drawbacks :
- overfitting : if properly controlled
- simple powerful settings
- large models (the size grows exponentially with deeper trees) that give very “detailed” results, see next post.
- weak learner, so there is a lot of gain in ensemble techniques like bagging.
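The readability point above can be made concrete with any tool that prints the fitted tree as rules. A minimal sketch using scikit-learn (my choice for illustration, not one of the vendor tools named above); the dataset and feature names are just the iris columns:

```python
# Sketch of the "readability" advantage: a fitted tree printed as
# plain if-then rules. Dataset choice is illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if-then rules
rules = export_text(
    tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
print(rules)
```

Each printed line is a threshold test on one input, so the whole model reads as a handful of nested if-then statements.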
Getting better predictions.
Numerous studies show that bagging (bootstrap aggregating) dramatically improves the quality of decision tree models (DTREG; Kristína Machová, František Barčák, Peter Bednár; Breiman). Why ? Decision trees are weak learners: with different samples from the same dataset you can easily obtain very different models (but of equal quality). By averaging the results of several decision tree models, you obtain an average model. What does this mean ? Compare it to the mean, the standard deviation and the standard error of a normal distribution. If we see the best possible model as the mean, then a single decision tree is like a single observation: there is about one chance out of three that it lies more than one standard deviation away from the mean. With bagging, the averaged model has about one chance out of three to lie more than one standard error away from the mean. One standard error equals one standard deviation divided by the square root of the number of trees, so one standard error is much smaller than one standard deviation, or in short : one bagging model is much better than one tree model.
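The averaging gain can be sketched with scikit-learn's BaggingClassifier; the dataset, sample sizes and number of trees below are my own illustrative assumptions, not taken from the studies cited above:

```python
# Sketch: bagging many deliberately overfit trees vs. one single tree.
# Dataset and parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# one fully grown (overfit) tree
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 100 equally overfit trees, each fitted on a bootstrap sample, averaged
bagged = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

print("single tree accuracy :", single.score(X_te, y_te))
print("bagged trees accuracy:", bagged.score(X_te, y_te))
```

On hold-out data the bagged ensemble typically scores clearly above the single tree, which is exactly the standard-error-versus-standard-deviation effect described above.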
But : you not only need a weak learner, you also need overfitting !
What is overfitting ? Normally it is described as “the model learns the noise”. According to Wikipedia : the model has too much complexity compared to the amount of data available. But since no model is 100% perfect, there is always some noise, and hence every model is somewhat overfit. What matters is the degree of overfitting. With decision trees it is very simple to control this degree, by controlling the size of the tree.
What are the possible settings in this respect ? You can define the number of layers (depth), the minimal terminal node size and the minimal split size.
Forget about the number of layers and always set it as large as possible. It should never be a stopping criterion if you want a good quality model. (Only if you want a simple, small, readable model can you use it, but then it serves documentation purposes only.)
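In scikit-learn terms (the parameter names below are scikit-learn's; vendor tools label the same settings differently), the three settings map as follows. The leaf and split values are illustrative, not recommendations:

```python
# How the three settings from the text map onto scikit-learn's tree
# parameters; the numeric values are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=None,        # number of layers: unlimited, per the advice above
    min_samples_leaf=25,   # minimal terminal node size (illustrative)
    min_samples_split=50,  # minimal split size (illustrative)
)
print(tree.get_params()["min_samples_leaf"], tree.get_params()["min_samples_split"])
```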
With the minimal terminal node size and split size you control the degree of overfitting. Set them too large and you obtain small trees with little overfitting but poor quality. Set them too small and you obtain huge trees with too much overfitting. Some experimentation is necessary: test the quality against a hold-out dataset. A good way to assess overfitting is to look at how the lift of your bagging model behaves for several selection sizes. Normally the lift increases as your selections get smaller. If at a given point the lift starts to decrease for smaller selections, or becomes erratic, you have too much overfitting.
Ideally, your lift versus selection size plot looks like this (not real data, but very much like my real data).
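Such a lift check can be sketched as follows; the data comes from a synthetic generator, and the selection sizes, leaf setting and number of trees are my own assumptions:

```python
# Sketch: lift of a bagged-tree model on a hold-out set, at several
# selection sizes; dataset and settings are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def lift(y_true, scores, fraction):
    """Positive rate in the top `fraction` of scores / overall positive rate."""
    n_sel = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[::-1][:n_sel]
    return y_true[top].mean() / y_true.mean()

# roughly 10% positives, so the maximum possible lift is about 10
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = BaggingClassifier(
    DecisionTreeClassifier(min_samples_leaf=25),  # leaf size: illustrative
    n_estimators=50, random_state=0,
).fit(X_tr, y_tr)

scores = model.predict_proba(X_te)[:, 1]
for f in (0.05, 0.10, 0.20, 0.50):
    print(f"lift at top {int(f * 100):2d}% : {lift(y_te, scores, f):.2f}")
```

For a healthy model the printed lift should rise steadily as the selection shrinks; a dip or erratic jumps at the smallest selections is the overfitting warning sign described above.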
So why do you need overfitting ?
With bigger trees you have a larger degree of overfitting. But with bagging, you average not only the real predictive value of the model, but also the overfitting part. And because it is overfitting, by definition it is random noise that tends to zero as the number of trees becomes larger. So you have to tune your settings in such a way that the overfitting is averaged away and the real patterns remain.
Good luck !
In my next post I intend to write about : How good can a model be ?
Other posts you might enjoy reading :
Data mining for marketing campaigns : interpretation of lift
How many inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes