Posted by: zyxo | December 30, 2008

Oversampling or undersampling?


In a presentation on Slideshare, Dr Sven F. Crone of the Lancaster Center for Forecasting puts his finger right on the sore spot when he talks about the myth of the best algorithm (note: his talk begins at slide 116!).
With real-life examples he draws attention to the fact that the preprocessing of the data and the sampling method are far more decisive for the quality of the resulting data mining model than the modeling algorithm.
I totally agree with him on this. But what I want to do here is comment on his conclusion that oversampling is always better than undersampling.
He IS right in HIS examples.

What happens when you do oversampling or undersampling?

[Illustration: oversampling vs. undersampling]

Oversampling: you duplicate the observations of the minority class to obtain a balanced dataset.
Undersampling: you drop observations of the majority class to obtain a balanced dataset, see illustration. (A small code sketch of both follows below.)
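
To make the difference concrete, here is a minimal sketch of both options in Python with numpy, on a made-up imbalanced dataset (the arrays and sizes are illustrative assumptions, not taken from the presentation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 950 majority rows (label 0), 50 minority (label 1).
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Oversampling: duplicate minority observations (draw with replacement)
# until both classes have the same number of rows. No information is lost.
over_min = rng.choice(min_idx, size=len(maj_idx), replace=True)
over_idx = np.concatenate([maj_idx, over_min])
X_over, y_over = X[over_idx], y[over_idx]        # 950 + 950 rows

# Undersampling: drop majority observations (draw without replacement)
# until both classes have the same number of rows. Majority rows are lost.
under_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([under_maj, min_idx])
X_under, y_under = X[under_idx], y[under_idx]    # 50 + 50 rows
```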

As far as the illustration goes, it is perfectly understandable that oversampling is better, because you keep all the information in the training dataset. With undersampling you drop a lot of information. Even if this dropped information belongs to the majority class, it is useful information for a modeling algorithm.

But nowadays in big enterprises there is, first, plenty of data, and second, the data mining algorithms, software and hardware are often limited in the amount of data they can analyse.
Ever tried training a model with a training dataset of 50Gb?
So if you have that amount of data, maybe undersampling still leaves you with too much data and you have to use only a fraction of it.
In that case undersampling is better and oversampling is useless.

And one last remark: I do not believe that simple oversampling is a good idea, even with a largely unbalanced dataset. In that case you should choose a modeling algorithm that can handle the imbalance, like for example a decision tree.
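
To illustrate that last point: a decision tree implementation that supports class weights can compensate for the imbalance internally, without any resampling. A minimal sketch, assuming scikit-learn and made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = (rng.random(10000) < 0.02).astype(int)   # roughly 2% minority class

# class_weight="balanced" reweights the classes inversely to their
# frequencies, so the minority class is not drowned out by the majority.
tree = DecisionTreeClassifier(class_weight="balanced", max_depth=5)
tree.fit(X, y)
```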

Did you like this post? Then you might be interested in the following:
The top-10 data mining mistakes
Text mining : Reading at Random
data mining with decision trees : what they never tell you
Toddlers are data miners


Responses

  1. Hi,

    The part where you say “Ever tried training a model with a training dataset of 50Gb?” made me laugh out loud because it is so true!

    I always undersample because I usually have enough cases of the minority class. Even just 1 or 2% of the minority is many thousands of rows. I don’t often sample to 50/50 though; I usually undersample to 75/25 for model building.

    Cheers

    TimManns
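
A minimal sketch of the 75/25 undersample TimManns describes, assuming numpy and a made-up dataset with a 2% minority class:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 9800 + [1] * 200)            # ~2% minority class
maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Keep every minority row; draw three times as many majority rows,
# which gives a 75% majority / 25% minority training mix.
maj_sample = rng.choice(maj_idx, size=3 * len(min_idx), replace=False)
idx_7525 = np.concatenate([maj_sample, min_idx])
```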

  2. TimManns,
    Not clear to me whether you mean that you actually use datasets of 50 Gb, or that it is not possible to use them as training sets?
    Zyxo

  3. What ratio should I use for balancing datasets?

    Say I have just 1 or 2% of the minority instances. Should I sample to 50/50 or 80/20?

    What does the answer to this actually depend on?

  4. Martin,
    you should probably first read my more recent post on this subject (https://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/#comments).
    It depends largely on the total number of observations in the minority set and on the number of variables.
    If the minority set contains thousands of observations and you have few variables, you should only worry about the size of the datasets; normally all methods should handle even a 1%/99% split fairly well.
    If you have a small minority set compared to the number of variables (say 200 cases and 50 variables), then you need as much of the majority set as possible and you should use bagging to obtain a good model (one possible setup is sketched below).
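
One possible setup for that bagging approach, sketched here under assumed scikit-learn and made-up data (not necessarily Zyxo's exact procedure): every bag keeps all minority rows plus a fresh random draw of majority rows, so the whole majority set is used across the ensemble while each individual model trains on balanced data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 50))             # many variables...
y = np.array([0] * 9800 + [1] * 200)         # ...small minority set
maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Train 25 trees, each on all minority rows plus an equally sized,
# freshly drawn sample of majority rows (undersampling per bag).
models = []
for _ in range(25):
    bag_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    bag = np.concatenate([bag_maj, min_idx])
    models.append(DecisionTreeClassifier(max_depth=5).fit(X[bag], y[bag]))

# Bagged score: average the trees' minority-class probabilities.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```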

  5. Does someone have references about oversampling?

  6. […] is also called to duplicate the content – you should check that out at zyxos Blog. We will stick to the quite simple view of […]

  7. […] increased in the total by removing the non-target variable, it can also mean multiplying the target variable by copying it (redundancy). We will stick here to the […]

  8. Teach me how to use undersampling or oversampling if I want to use breast cancer as the dataset. Thanks a lot for your help.

