Posted by: zyxo | April 7, 2010

## Categorical variables : Solving the overfitting problem in decision trees.

A simple search on google delivers tons of articles on overfitting, combined with categorical (nominal) variables.
It is a fact that categorical variables with a lot of values (extreme examples are zip codes or customer ID’s) lead to severe overfitting.

In this post :

• What is overfitting ?
• Why do they lead to overfitting ?
• What is the consequence for our model ?
• What can we do about it ?

What is overfitting ?
The information in your data has two origins : the first is what you like : patterns, relationships between variables that exist in your real world. You want to catch those patterns into your model.
The second is what you don’t like : noise, random variation, sampling errors, unpredictable values, rubbish. You certainly don’t want them in your model.

Overfitting occurs when your modeling algorithm includes information of the second kind in your model.
Look at following chart : it is clear that the real pattern corresponds to a smooth, u-shaped line. It is the noise, the random variation that causes the erratic zig-zagging.

Why do categorical variables lead to overfitting ?
The decision tree algorithm is very simple and straightforward for continuous variables :
if greater than => left branch,
if equal or smaller than => right branch.
For categorical variables it is much more complex. You cannot compare their values in terms of greater than or smaller than. It goes more like : let’s calculate the proportion of positive targets for each individual value of the categorical variable, order them from large to small and calculate the optimal split in terms of target proportion.
When you have a lot of values in your categorical variable this means that you divide your data into al lot of small samples, which means : a lot of noise compared to the real pattern. As the algorithm uses all the info (target proportion), the more values you have, the more noise will be incorporated into your model. The limit is when you use the customer ID as categorical variable : each value represents only one record, and the result is that all you have left in your model is noise.

What is the consequence for our model ?
When we do a proper job, we calculate two performance measures for our model : the training performance (how well does our model fit the training data ?) and the test performance (how well does our model fit the test data, data not used to train our model).
Categorical variables with a lot of values will certainly increase our training performance, since more info (noise!) will be used by the model. A model based on customer ID’s will perfectly match our training data !
Alas ! The only thing that really matters is our test performance, which will severely decrease when we use those multi-value categorical variables.

What can we do about it ?

I see two things we can do about it :

1. converting them to continuous variables (I prefer this one)
2. diminishing the number of values

Some examples how we can convert our categorical variable into a continuous one :

i) customer ID’s. If they are numeric, then it’s simple : use them as numeric (continuous) variables. Chances are great that the customer ID’s are some sort of sequential numbers, which can contain the additional meaning that small numbers indicate people who were customer for a long time, and large numbers indicate more recent customers.
If they really are alphanumeric you could try to convert the alphanumeric part into something numeric. If it is really erratic, you should ask yourself if there is any sense in using them. Can they contain any real life meaning ?
ii) date or time -related. OK, real dates or timestamps are continuous, no problem. But what about week number, day of the week ? month of the year ? hour of the day ?
These variables have one thing in common : they are circular. So the solution is obvious : a circle is two-dimensional, and consequently you need two variables to represent them. For hour of the day : the first variable can start at zero (midnight), move up to 12 (noon) and goes all the way back to 0. The second can start at six p.m. with a value of zero, move up to 12 at six a.m. and than back to zero. This works very well for everything that goes round and round and round …

iii) zip codes or any other geographical indications. The most straightforward thing to do is to convert them to coordinates, the things you use to locate something on a map. Two variables (longitude, latitude) will do the job. Just get rid of the degrees, minutes, seconds and substitute them by one single number (everything expressed in seconds, for example).
iv) colour : why not use wavelength ? or their sequence number in the rainbow ? I am sure you are creative enough to find a solution !

When it is impossible to convert your multi-value categorical variable to a continuous one you must definitely try to reduce the number of values to at most 10 (my rule of thumb). OK, it all depends on the number of observations in the minority class of your target. If you have 100.000’s of them you can afford a few more, but never exaggerate, for it quickly degrades the quality of your model.
And most important : whatever method you use to reduce the number of values : NEVER use the relation to your target variable ! Otherwise you are putting the noise into your model. You should just try to group them according to the information in the categorical variable itself.
Example : automobile brand/type. Why not replacing it by two or three variables : one for the size (horsepowers, volume, whatever), another for luxury (on a scale, say, from 1 to 5), and still another for usage type/emotional type (like family car, sportscar, … limit the number of types!).

Did you had your own experiences with categorical variables ? How did you manage them ? Feel free to share your examples.