Posted by: zyxo | January 14, 2009

data mining with decision trees : what they never tell you


[Image : SVG version of Overfitting.png by Dake, via Wikipedia]

When you attend a data mining course, even from one of the leading data mining software vendors (like SPSS, SAS or Salford Systems), you learn how to push the buttons of the tool and the basics of the algorithms. An advanced course consists of more advanced algorithms like neural nets or SVMs.
But one of the oldest algorithms is decision trees. And yes, in the advanced course you briefly learn the newer ways of working with decision trees, like bagging or boosting.
But there is more.
In this post, the disadvantages and advantages of decision trees will show you why I love working with them, and in a second part I will show you the secrets of how to get better predictions from decision tree models.

Disadvantages :
– overfitting (but you can control it and use it to enhance the model quality !)

Advantages :
– readability : simple if-then statements
– simple algorithm
– robustness : preprocessing and cleansing is a cumbersome job, but decision trees work well without it !!
– flexibility : the settings allow you to build from very small to very large trees
– overfitting : if properly controlled
– simple powerful settings
– large models (the number of nodes grows exponentially with tree depth) that give very “detailed” results, see next post.
– weak learner, so there is a lot of gain in ensemble techniques like bagging.

Getting better predictions.

BAGGING !
Numerous studies show that bagging (bootstrap aggregating) dramatically improves the quality of decision tree models (DTREG, Kristína Machová, František Barčák, Peter Bednár, Breiman). Why ? Decision trees are weak learners. This means that with different samples from the same dataset you can easily obtain very different models (but of equal quality). By averaging the results of several decision tree models, you obtain an average model. What does this mean ? You can compare it to the mean, the standard deviation and the standard error of a normal distribution. If we see the best possible model as the mean, then you can look at one single decision tree as an observation that deviates from that mean with a standard deviation (about one chance out of three that it is more than one standard deviation away from the mean). With bagging, the average model has one chance out of three to be more than one standard error away from the mean. One standard error equals one standard deviation divided by the square root of the number of trees, so one standard error is much smaller than one standard deviation, or : one bagging model is much better than one tree model.
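
As a minimal sketch of the idea, assuming Python with scikit-learn (the dataset, tree settings and number of trees are placeholders) :

```python
# Bagging sketch : fit each tree on a bootstrap sample of the training data
# and average the predicted probabilities across trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 50  # placeholder value
trees = []
for i in range(n_trees):
    # bootstrap sample : drawn with replacement, same size as the training set
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=i)
    trees.append(tree.fit(X_boot, y_boot))

# the bagged prediction is simply the average of the individual tree predictions
avg_proba = np.mean([t.predict_proba(X_test)[:, 1] for t in trees], axis=0)
```

scikit-learn’s BaggingClassifier wraps essentially the same loop.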

But : you not only need a weak learner, you also need overfitting !
What is overfitting ? Normally it is described as “the model learns the noise”. According to Wikipedia : the model has too much complexity compared to the amount of data available. But since no model is 100% perfect, there is always some noise, and hence each and every model is somewhat overfit. What matters is the degree of overfitting. With decision trees it is very simple to determine this degree of overfitting by controlling the size of the tree.
What are the possible settings in this respect ? You can define the number of layers (depth), the minimal terminal node size and the minimal split size.
Forget about the number of layers and always set it as big as possible. It should never be a stopping criterion if you want a good quality model. (Only if you want a simple, small, readable model can you use it, but then it only serves documentation purposes.)
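
As a sketch of how those three settings map onto scikit-learn’s decision tree parameters (an assumption about tooling ; the numbers are placeholders to be tuned as described below) :

```python
from sklearn.tree import DecisionTreeClassifier

# The three size settings discussed above, in scikit-learn's vocabulary.
tree = DecisionTreeClassifier(
    max_depth=None,         # number of layers : leave unlimited
    min_samples_leaf=50,    # minimal terminal node size
    min_samples_split=100,  # minimal split size
)
```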

With the minimal terminal node size and split size you control the degree of overfitting. Set them too large and you obtain small trees with little overfitting but poor quality. Set them too small and you obtain huge trees with too much overfitting. It is necessary to do some experimentation and test the quality against a hold-out dataset. A good way to assess overfitting is to look at how the lift of your bagging model behaves for several selection sizes. Normally the lift increases as your selections get smaller. If at a given point the lift starts to decrease with smaller selections, or becomes erratic, you have too much overfitting.
Ideally, your lift versus selection size plot looks like this (not real data, but very much like my real data).
[Figure : lift versus selection size]
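
As a sketch of how such a lift-versus-selection-size check could be computed on a hold-out set (lift = response rate in the top-scored selection divided by the overall response rate ; avg_proba and y_test are the placeholder names from the bagging sketch above) :

```python
import numpy as np

def lift_at(scores, y_true, fraction):
    """Lift of the top `fraction` of cases ranked by model score."""
    n_select = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[::-1][:n_select]   # highest-scored cases first
    return y_true[top].mean() / y_true.mean()

# smaller and smaller selections on the hold-out set
for frac in (0.50, 0.20, 0.10, 0.05, 0.02, 0.01):
    print(f"top {frac:.0%} : lift = {lift_at(avg_proba, y_test, frac):.2f}")
```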

So why do you need overfitting ?
With bigger trees you have a larger degree of overfitting. But with bagging, you do not only average the real predictive value of the model, but also the overfitting part. And because it is overfitting, by definition it is random noise that tends to zero as the number of trees becomes larger. So you have to tune your settings in such a way that the overfitting is averaged away and the real patterns remain.
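
A toy simulation makes this concrete, under the simplifying assumption that each tree’s prediction is the true signal plus its own independent noise :

```python
import numpy as np

rng = np.random.default_rng(0)
signal, noise_sd, n_cases = 0.3, 0.2, 10_000   # arbitrary toy values
for n_trees in (1, 10, 100):
    # each "tree" = true signal + its own random overfitting noise
    preds = signal + noise_sd * rng.standard_normal((n_trees, n_cases))
    bagged = preds.mean(axis=0)                # averaging over the trees
    # residual noise shrinks roughly like noise_sd / sqrt(n_trees)
    print(n_trees, round(float(np.std(bagged - signal)), 4))
```
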
Good luck !

In my next post I intend to write about : How good can a model be ?

Other posts you might enjoy reading :
Data mining for marketing campaigns : interpretation of lift
How many inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes



Responses

  1. Very good post! I took the liberty to paraphrase (and translate to Spanish) many parts of it in my own blog (http://parramining.blogspot.com).

    I’ll add some other considerations:

    – For a decision tree the ideal target is binary: POSITIVE / NEGATIVE. If the target is categorical, it can be useful to build as many models as there are categories, taking one category at a time as positive and the rest as negative. The resulting models should then be ensembled into one (another tree or some other model); see the sketch after this list. Trees are not the best suited tool for numeric targets.
    – They can handle missing values. When dividing a node by one of the variables, the algorithm simply assigns the missing cases to one of the branches of the division. But beware: this means that all missings are treated as if they had the same value, something that in some circumstances is not acceptable and requires more elaborate data preparation beforehand.
    – They can handle a good amount of imbalance in the target values.
    – They give a result for any evaluated case, even if the tree wasn’t trained on anything similar.
    – Beware of variable interactions. The algorithm takes one variable at a time, splitting the solution space into “rectangles”, so any interaction between variables that defines an important region with a non-rectangular shape will be poorly represented. There are techniques to detect this and create additional variables reflecting these interactions so the tree can use them.
    – There is not only one tree as the best solution. Small variations in the dataset (such as a different split into training and testing) can lead to completely different trees.
    – Another parameter to bear in mind is the first variable used to split the root node. Sometimes by changing it manually you can get better results (or results more meaningful to the domain).
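
    A minimal sketch of that one-vs-rest idea, assuming scikit-learn (an illustration only; the parameter value is a placeholder):

    ```python
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.tree import DecisionTreeClassifier

    # One binary tree per category: each tree takes one class as POSITIVE
    # and all the others as NEGATIVE, and the per-class scores are combined.
    model = OneVsRestClassifier(DecisionTreeClassifier(min_samples_leaf=50))
    # usage: model.fit(X_train, y_multiclass); model.predict(X_test)
    ```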

  2. Two more things:

    – They can handle a mix of categorical and numeric input variables.

    – They are unaffected by outliers.

