In a presentation on Slideshare, Dr Sven F. Crone of the Lancaster Center for Forecasting puts his finger right on the wound when he talks about the myth of the best algorithm (note: his talk begins at slide 116!).
With real-life examples he draws attention to the fact that the preprocessing of the data and the method of sampling are much more decisive for the quality of the resulting data mining model than the modeling algorithm.
I totally agree with him on this. But what I want to do here is comment on his conclusion that oversampling is always better than undersampling.
He IS right in HIS examples.
What happens when you do oversampling or undersampling ?
Oversampling : you duplicate the observations of the minority class to obtain a balanced dataset.
Undersampling : you drop observations of the majority class to obtain a balanced dataset, see illustration.
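To make the two definitions concrete, here is a minimal sketch in plain Python. The function names, the toy rows and the 5/95 class split are made up for illustration, and the sketch assumes simple random sampling, which is only one of several ways to over- or undersample.

```python
import random

def oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority rows until both classes have the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority

def undersample(minority, majority, seed=0):
    """Keep a random subset of the majority rows, as large as the minority class."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

# Hypothetical toy dataset: 5 positive rows, 95 negative rows.
minority = [("pos", i) for i in range(5)]
majority = [("neg", i) for i in range(95)]

over_min, over_maj = oversample(minority, majority)
under_min, under_maj = undersample(minority, majority)

print(len(over_min), len(over_maj))    # 95 95
print(len(under_min), len(under_maj))  # 5 5
```

Note that the oversampled minority class contains no new information, only copies of the same 5 rows, while the undersampled majority class has lost 90 of its 95 rows.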
As far as the illustration goes, it is perfectly understandable that oversampling is better, because you keep all the information in the training dataset. With undersampling you drop a lot of information. Even if this dropped information belongs to the majority class, it is useful information for a modeling algorithm.
But nowadays in big enterprises, first, there is plenty of data and, second, the data mining algorithms, software and hardware are often limited in the amount of data they can analyse.
Ever tried training a model with a training dataset of 50 GB?
So if you have that amount of data, maybe undersampling still leaves you with too much data and you have to use only a fraction of it.
In that case undersampling is better, oversampling is useless.
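A little arithmetic shows why. The sketch below uses invented numbers (100 million rows, 2% minority class, and a hypothetical `max_rows` limit on what the tool can handle) and assumes simple random sampling; it is an illustration of the argument, not a recipe.

```python
import random

def balanced_sizes(n_minority, n_majority):
    """Row counts of the balanced training set under each strategy."""
    return {"oversampling": 2 * n_majority, "undersampling": 2 * n_minority}

def capped_undersample(minority, majority, max_rows, seed=0):
    """Balanced random sample that never exceeds max_rows in total."""
    rng = random.Random(seed)
    per_class = min(len(minority), len(majority), max_rows // 2)
    return rng.sample(minority, per_class), rng.sample(majority, per_class)

# Hypothetical enterprise dataset: 100 million rows, 2% minority class.
sizes = balanced_sizes(n_minority=2_000_000, n_majority=98_000_000)
print(sizes)  # {'oversampling': 196000000, 'undersampling': 4000000}

# Scaled-down stand-in for the same situation: the tool accepts at most 1,000 rows.
small_minority = list(range(2_000))
small_majority = list(range(98_000))
kept_min, kept_maj = capped_undersample(small_minority, small_majority, max_rows=1_000)
print(len(kept_min), len(kept_maj))  # 500 500
```

With these numbers oversampling nearly doubles a dataset that was already too big, while even the undersampled set has to be cut down further, so both classes end up being sampled.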
And one last remark: I do not believe that simple oversampling is a good idea, even with a heavily unbalanced dataset. In that case you should choose a modeling algorithm that can handle the imbalance, for example a decision tree.
Did you like this post? Then you might be interested in the following:
The top-10 data mining mistakes
Text mining : Reading at Random
data mining with decision trees : what they never tell you
Toddlers are data miners