Here is my very personal view on some settings of decision trees.
Maximum depth
Maximum tree depth is a limit that stops the further splitting of nodes once the specified depth has been reached while the initial decision tree is being built.
My opinion is plain and simple: NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
I suppose some explanation is necessary.
When you grow a decision tree, the leaves produced by the splits normally contain different numbers of observations. A depth limit totally disregards these differences. It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, while on the other side a leaf with only 30 observations could still get split. This makes absolutely no sense!
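A minimal sketch of this advice, assuming scikit-learn (the library and its parameter names, such as max_depth and min_samples_split, are my choice of illustration, not something specified above): grow the tree without a depth cap and let a node-size criterion do the stopping.

```python
# Sketch: no depth limit, stop on node size instead (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=25_000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=None,         # no depth cap: "use the largest possible value"
    min_samples_split=100,  # stop on node size instead (next section)
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth())  # the depth the data asked for, not a preset limit
```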
Minimum splitsize
Minimum splitsize is a limit that stops the further splitting of a node when the number of observations in the node is lower than the minimum splitsize.
This is a good way to limit the growing of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of the noise in the data).
Now the crucial question: at what number should we set the limit?
Answer: it depends.
- Are you growing just one tree, or do you want to create an ensemble (bagging, boosting, …)? If you create an ensemble, overfitting is permitted, because the ensemble will take care of it: it will take the mean or some other aggregate of the individual trees.
- How many independent variables (predictors) do you have? The more variables you have, the bigger the chance of some accidental relationship between one of the variables and the target. So with a lot of variables you should stop splitting earlier.
- How many observations do you have? With a limited number of observations you do not have the luxury of stopping early, or you will end up with no tree at all. With a lot of observations you can stop early and still obtain a large enough decision tree.
With hundreds of variables I normally use a minimum splitsize in the range of the number of observations divided by a few hundred.
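As a sketch of that rule of thumb, again assuming scikit-learn (where minimum splitsize corresponds to min_samples_split; the divisor of 300 and the data set size are my illustrative choices, not fixed recommendations):

```python
# Rule of thumb: minimum splitsize ~ number of observations / a few hundred.
from sklearn.tree import DecisionTreeClassifier

n_observations = 60_000                    # hypothetical data set size
min_split = max(2, n_observations // 300)  # "a few hundred" -> here 300, giving 200

tree = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
```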
Minimum leaf size
Minimum leafsize is a limit that blocks the splitting of a node when the number of observations in one of the resulting child nodes would be lower than the minimum leafsize.
Splitting a node into two or more child nodes has to make some statistical sense. What, for example, is the sense of splitting a node with 100 observations into the following two child nodes: one with 99 observations and one with 1 observation? It is all a bit like doing a chi-squared test. A good rule of thumb says that you should never have fewer than five observations in any cell. I would say the same goes for decision trees, as long as you are dealing with the same number of observations you would normally use for a chi-squared test. It is known that (i) with a large number of observations chi-squared tests are no longer appropriate, and (ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500). So you should set the minimum leafsize larger than 5. I usually take 10% of the minimum splitsize (in a bagging ensemble).
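A sketch of the leaf-size rule under the same scikit-learn assumption (min_samples_leaf plays the role of minimum leafsize; the numbers are illustrative, and note that in scikit-learn versions before 1.2 the BaggingClassifier parameter is called base_estimator rather than estimator):

```python
# Leaf size at 10% of the split size, never below the chi-squared-style floor of 5.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

min_split = 200                      # from the earlier rule of thumb
min_leaf = max(5, min_split // 10)   # 10% of the minimum splitsize -> 20

base = DecisionTreeClassifier(
    min_samples_split=min_split,
    min_samples_leaf=min_leaf,
)
# These leaf settings are meant to be used inside a bagging ensemble:
bagger = BaggingClassifier(estimator=base, n_estimators=100, random_state=0)
```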
There is only one way to know the best settings: try, try and try again! This is because all projects and data sets are different. Do you have your own rules of thumb? Please, don't hesitate to let me know!
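One way to organize the trying (a sketch only; GridSearchCV is my choice of tool, assuming scikit-learn, and the grid values are illustrative, not recommendations):

```python
# "Try, try and try again": a small grid search over the two node-size settings.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, n_features=50, random_state=0)

grid = {
    "min_samples_split": [50, 100, 200, 500],
    "min_samples_leaf": [5, 10, 20, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the winning combination for this data set
```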