In a presentation on Slideshare, Dr Sven F. Crone of the Lancaster Center for Forecasting puts his finger right on the wound when he talks about the myth of the best algorithm (Note : his talk begins at slide 116 !).
With real-life examples he draws attention to the fact that the preprocessing of the data and the method of sampling are much more decisive for the quality of the resulting data mining model than the modeling algorithm.
I totally agree with him on this. What I want to do here is comment on his conclusion that oversampling is always better than undersampling.
He IS right in HIS examples.
What happens when you do oversampling or undersampling ?

Oversampling : you duplicate the observations of the minority class to obtain a balanced dataset.
Undersampling : you drop observations of the majority class to obtain a balanced dataset, see illustration.
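To make the two definitions concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the function names, the toy rows and the fixed seed are hypothetical, and both functions balance the classes to exactly 50/50 by simple random selection.

```python
import random

def oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority

def undersample(minority, majority, seed=0):
    """Keep a random subset of the majority rows, as large as the minority class."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

# Toy data (hypothetical): 5 minority rows and 95 majority rows.
minority = [("pos", i) for i in range(5)]
majority = [("neg", i) for i in range(95)]

over_min, over_maj = oversample(minority, majority)
under_min, under_maj = undersample(minority, majority)

print(len(over_min), len(over_maj))    # 95 95
print(len(under_min), len(under_maj))  # 5 5
```

Oversampling ends up with 190 rows and no new information, only copies. Undersampling ends up with 10 rows and throws away 90 of the 95 majority rows.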
As far as the illustration goes, it is perfectly understandable that oversampling is better, because you keep all the information in the training dataset. With undersampling you drop a lot of information. Even if this dropped information belongs to the majority class, it is useful information for a modeling algorithm.
But nowadays in big enterprises, first, there is plenty of data and, second, the data mining algorithms, software and hardware are often limited in the amount of data they can analyse.
Ever tried training a model with a training dataset of 50Gb ?
So if you have that amount of data, maybe undersampling still leaves you with too much data and you have to use only a fraction of it.
In that case undersampling is the better choice and oversampling is useless: it only adds rows to a dataset that is already too big to handle.
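As a rough sketch of that situation, the function below undersamples to a chosen class ratio while respecting a maximum number of rows. Everything in it is an assumption made for illustration: the name undersample_to_share, the max_rows budget, the 25% minority share and the toy row counts are not taken from any particular tool.

```python
import random

def undersample_to_share(minority, majority, minority_share=0.25,
                         max_rows=None, seed=0):
    """Sample both classes so that the minority makes up roughly
    `minority_share` of the result and the total stays within `max_rows`."""
    if not 0 < minority_share < 1:
        raise ValueError("minority_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    n_min = len(minority)
    if max_rows is not None:
        # Cap the minority count so the whole sample fits in the row budget.
        n_min = min(n_min, int(max_rows * minority_share))
    # Number of majority rows needed to reach the requested share.
    wanted = round(n_min * (1 - minority_share) / minority_share)
    n_maj = min(len(majority), wanted)
    return rng.sample(minority, n_min), rng.sample(majority, n_maj)

# Toy data (hypothetical): 2% minority in 100,000 rows.
minority = list(range(2000))
majority = list(range(2000, 100000))

mn, mj = undersample_to_share(minority, majority, 0.25, max_rows=4000)
print(len(mn), len(mj))  # 1000 3000
```

With a budget of 4000 rows, even the minority class has to be sampled down, which is exactly the case where duplicating minority rows makes no sense.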
And one last remark: I do not believe that simple oversampling is a good idea, even with a heavily imbalanced dataset. In that case you should choose a modeling algorithm that can handle the imbalance, such as a decision tree.

Hi,
The part where you say “Ever tried training a model with a training dataset of 50Gb ?” made me laugh out loud because it is so true!
I always undersample because I usually have enough cases of the minority. Even just 1 or 2% of the minority is many thousands of rows. I don’t often sample to 50/50% though. I usually undersample to 75/25% for model building.
Cheers
TimManns
By: Tim Manns on January 12, 2009
at 3:23 am
TimManns,
It is not clear to me which part you find so true: that you use datasets of 50 Gb, or that it is not possible to use them as training sets ?
Zyxo
By: zyxo on January 12, 2009
at 6:14 pm
What ratio should I use for balancing datasets?
Say I have just 1 or 2% of minority instances. Should I sample to 50/50% or 80/20%?
What does the answer to this actually depend on?
By: Martin on April 29, 2009
at 2:29 pm
Martin,
you should probably first read my more recent post on this subject (http://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/#comments).
It depends a lot on the total number of observations in the minority set and the number of variables.
If the minority set contains thousands of observations and you have few variables, you should only worry about the size of the data sets. Normally all methods should handle even a 1%/99% split fairly well.
If you have a small minority set compared to the number of variables (say 200 cases and 50 variables), then you need as much as possible of the majority set and you should use bagging to obtain a good model.
By: zyxo on April 30, 2009
at 6:05 pm