Posted by: zyxo | July 4, 2011

How to use the Settings to control the size of Decision Trees?


Here is my very personal view on some settings of decision trees.

Maximum depth

Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.

My opinion is straight and simple: NEVER use maximum depth to limit the further splitting of nodes.  In other words: use the largest possible value.

I suppose some explanation is necessary.

When you grow a decision tree, the leaves produced by the splits normally contain different numbers of observations.  A tree depth limit totally disregards these differences.  It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, whereas on the other side, which contains far fewer observations, a leaf with only 30 observations could still get split.  This makes absolutely no sense!
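To make the objection concrete, here is a minimal sketch in Python that compares a depth-based stopping rule with a size-based one on the two leaves from the example above. The `Node` class, the function names, and the limits (a maximum depth of 5, a minimum splitsize of 100) are assumptions chosen for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Node:
    depth: int   # depth of the node in the tree (root = 0)
    n_obs: int   # number of observations that ended up in this node


def may_split_by_depth(node: Node, max_depth: int) -> bool:
    """Depth rule: only the position in the tree counts, the node size is ignored."""
    return node.depth < max_depth


def may_split_by_size(node: Node, min_split_size: int) -> bool:
    """Size rule: only the number of observations in the node counts."""
    return node.n_obs >= min_split_size


# Hypothetical leaves: a large one that already sits at the depth limit,
# and a tiny one that is one level higher in the tree.
big_leaf = Node(depth=5, n_obs=25_000)
small_leaf = Node(depth=4, n_obs=30)

depth_rule = {
    "big_leaf": may_split_by_depth(big_leaf, max_depth=5),
    "small_leaf": may_split_by_depth(small_leaf, max_depth=5),
}
size_rule = {
    "big_leaf": may_split_by_size(big_leaf, min_split_size=100),
    "small_leaf": may_split_by_size(small_leaf, min_split_size=100),
}

print("depth rule:", depth_rule)
print("size rule: ", size_rule)
```

Under these assumed limits the depth rule freezes the leaf with 25,000 observations and keeps splitting the leaf with 30, while the size rule does the opposite.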

Minimum splitsize

Minimum splitsize is a limit to stop further splitting of nodes when the number of observations in the node is lower than the minimum splitsize.

This is a good way to limit the growth of the tree.  When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).

Now the capital question: at what number should we set the limit?

Answer: it depends.

  • Are you just growing one tree or do you want to create an ensemble (bagging, boosting …)?  If you create an ensemble, overfitting of the individual trees is permitted, because the ensemble will take care of it: it takes the mean or some other aggregate of the individual predictions.
  • How many independent variables (predictors) do you have?  The more variables you have, the greater the chance of some accidental relationship between one of the variables and the target.  So with a lot of variables you should stop earlier.
  • How many observations do you have?  With a limited number of observations you do not have the luxury of stopping early, or you will end up with no tree at all.  With a lot of observations you can stop early and still obtain a large enough decision tree.

With hundreds of variables I normally use a minimum splitsize in the range of the number of observations divided by a few hundred.
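As a rough sketch of that rule of thumb in Python: the function name `min_split_size`, the default divisor of 300 (one possible reading of "a few hundred") and the lower bound of 2 are my own assumptions, so treat the numbers as a starting point for experiments rather than a recommendation.

```python
def min_split_size(n_obs: int, divisor: int = 300, floor: int = 2) -> int:
    """Rule-of-thumb minimum splitsize: observations divided by a few hundred.

    divisor=300 is an assumed reading of "a few hundred";
    floor=2 is an assumed lower bound, since a node needs at least
    two observations before it can be split at all.
    """
    if n_obs < 0 or divisor <= 0:
        raise ValueError("n_obs must be >= 0 and divisor must be > 0")
    return max(floor, n_obs // divisor)


# Hypothetical data set sizes, for illustration only.
for n_obs in (5_000, 45_000, 1_000_000):
    print(n_obs, "observations -> minimum splitsize", min_split_size(n_obs))
```

Changing `divisor` to 200 or 500 stays within the same rule of thumb, so it is worth trying a few values.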

Minimum leaf size

Minimum leafsize is a limit that prevents a split when the number of observations in one of the resulting child nodes would be lower than the minimum leafsize.

Splitting a node into two or more child nodes has to make some statistical sense.  What, for example, is the sense of splitting a node with 100 observations into the following two child nodes: one with 99 observations and one with 1 observation?  It is all a bit like doing a chi-squared test.  A good rule of thumb says that you should never have fewer than five observations in any one of the cells.  I would say the same goes for decision trees, as long as you deal with the same number of observations you normally use for chi-squared tests.  It is known that i) with a large number of observations chi-squared tests are no longer appropriate (almost any difference becomes statistically significant) and ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500).  So you should set the minimum leafsize larger than 5.  I usually take 10% of the minimum splitsize (in a bagging ensemble).
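The two ideas in this section can be sketched in a few lines of Python: a check that rejects a split when any child would be smaller than the minimum leafsize, and the "10% of the minimum splitsize" habit with a lower bound above 5. The function names, the `percent` parameter and the lower bound of 6 (the smallest integer larger than 5) are assumptions made for illustration.

```python
def split_allowed(child_sizes: list[int], min_leaf_size: int) -> bool:
    """A split is allowed only if every child keeps at least min_leaf_size observations."""
    return all(n >= min_leaf_size for n in child_sizes)


def min_leaf_size_from_split(min_split_size: int, percent: int = 10,
                             lower_bound: int = 6) -> int:
    """Minimum leafsize as a percentage of the minimum splitsize.

    percent=10 follows the habit described in the text;
    lower_bound=6 is an assumed way to encode "larger than 5".
    Integer arithmetic avoids floating-point rounding surprises.
    """
    return max(lower_bound, min_split_size * percent // 100)


leaf_size = min_leaf_size_from_split(150)   # hypothetical minimum splitsize of 150

lopsided = split_allowed([99, 1], leaf_size)    # the 99 / 1 split from the text
balanced = split_allowed([60, 40], leaf_size)   # a hypothetical, more even split

print("minimum leafsize:", leaf_size)
print("99/1 split allowed: ", lopsided)
print("60/40 split allowed:", balanced)
```

With these assumed values the 99/1 split is rejected and the 60/40 split is accepted.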

Conclusions

There is only one way to know the best settings: try, try and try again!  This is because all projects and data sets are different.  Do you have your own rules of thumb?  Please don't hesitate to let me know!
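If you want to organise the trying, one option is to list the candidate settings up front and evaluate each of them on held-out data. The Python sketch below only builds such a list; the function name, the divisors, the leaf percentages and the lower bounds are all assumptions for illustration, and the model fitting and evaluation are left to whatever tool you use.

```python
from itertools import product


def candidate_settings(n_obs, divisors=(100, 200, 300, 500), leaf_percents=(5, 10, 20)):
    """Enumerate (minimum splitsize, minimum leafsize) combinations to try.

    The divisors and percentages are assumed example values around the
    rules of thumb "observations / a few hundred" and "10% of the splitsize".
    """
    grid = []
    for divisor, percent in product(divisors, leaf_percents):
        split_size = max(2, n_obs // divisor)             # assumed lower bound of 2
        leaf_size = max(6, split_size * percent // 100)   # assumed "larger than 5"
        grid.append({"min_split_size": split_size, "min_leaf_size": leaf_size})
    return grid


grid = candidate_settings(45_000)   # hypothetical data set of 45,000 observations
for settings in grid:
    print(settings)
```

Each entry would then be used to grow a tree (or an ensemble) and scored on data that was not used for growing it.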

Responses

  1. Thanks for your information... I want to ask: how do I determine the minimal size for split, minimal leaf size, minimal depth, minimal gain and confidence? Does it depend on me, or is there a statistic to determine them?

    • Hi Erna,
      It is difficult to answer your question without any knowledge about the dataset. Could you give me some details like number of observations, number of variables, proportion of positive/negative cases of the dependent variable (the target variable)? Feel free to send me an email.
      Zyxo

  2. Thanks for this nicely written and informative article. It helped me make some decisions in my modelling dilemmas.

    • Hi David,
      Thanks for visiting my blog.
      Glad it was helpful to you.
      Zyxo

  3. Hi,
    This is very informative and helps me to understand SAS EM settings a lot better. Could you explain a bit about the exhaustive option in Split Search, if you happen to know? Thank you.

  4. Thank you, a very interesting article. Are you aware of any references which recommend a minimum split size etc. based on the number of observations and variables? I am struggling to find something to base my decision on and to reference in an article. I am working with approx. 45,000 observations and 44 variables, and if I do not limit the tree size, I end up with over 1000 nodes, which is way too big to manage. I want to limit the tree size without compromising the accuracy (too much). Many thanks

    • Hi Julie,
      First of all, thanks for visiting my blog.
      Unfortunately I do not know of any reference.
      My instinct tells me that splitsize should be independent of the number of observations, because that way the tree adjusts automatically to whatever number of observations you start with.
      You should ask yourself (or look it up): what is the minimum number of observations I need to do a chi-squared test and still obtain a statistically meaningful result? That is your answer.
      Greetings
      Zyxo

  5. […] fitting the model we could control the testtype, the size of the tree, the splitting criterias etc. as well with a list of parameters in the control = ctree_control() […]

