Posted by: zyxo | March 28, 2009

## Mining highly imbalanced data sets with logistic regression

Two questions come up regularly:
1) can logistic regression handle imbalanced datasets,
and
2) is it possible to adjust the regression equation so that it truly represents the actual situation?

The answer to the first question is both “YES” and “NO”. I will explain this further in this post.

For the second question : I really do not know. But somehow it seems to me that this would be very unreliable.

Can we mine a highly imbalanced dataset with logistic regression?
Let me first say this: to mine highly imbalanced datasets, you should preferably NOT use logistic regression, but something like decision trees. If you insist on using logistic regression, then I advise the following:

If you use the entire imbalanced dataset I am convinced that, if the regression eventually converges, you will end up with a very poor model. So don’t.

Now for sampling. Let us assume that the minority class is the positives.
Since you have many more negatives than positives, a first approach would be to take all the positives and the same number of negatives, which gives you a perfectly balanced sample. (A dataset is generally considered balanced when the minority class contains at least 30% of the observations.)
There is a good thing and a bad thing to that approach.
The good thing is that logistic regressions do very well on small datasets (much better than for example decision trees). So with that approach you should already get an acceptable model.
The bad thing is that you have no way left to test the quality of your model, since you have no positives left over. (Such testing should always be done on a sample that was not used to build the model.)
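In code, that first approach could look something like this (a toy sketch in Python; the data and the `balanced_undersample` helper are made up for illustration, not from the post):

```python
import random

def balanced_undersample(positives, negatives, seed=0):
    """Keep all positives and draw an equal-size random sample of negatives."""
    rng = random.Random(seed)
    return positives + rng.sample(negatives, len(positives))

# toy data: 5 positives, 100 negatives
pos = [("p", i) for i in range(5)]
neg = [("n", i) for i in range(100)]
balanced = balanced_undersample(pos, neg)
print(len(balanced))  # 10 rows, half of them positives
```

Note that every positive ends up in the training sample, which is exactly why nothing is left for testing.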

Consequently I suggest an alternative approach, which yields a model of the same or, more likely, better quality AND an indication of the quality of the model.
This approach is called BAGGING (bootstrap averaging). It is mostly used and shows the largest gains on weak classifiers like decision trees, but in this particular case it can be very useful too.

For example, take 90% of the positives and about the same number of negatives (you may take a slightly higher number of negatives to obtain, for example, a distribution of 40% positives and 60% negatives).
Then you build your model and test it on the 10% of positives that you did not use in the model and, of course, on whatever number of negatives you want.

Then you start all over again with another random sample of 90% of the positives and a sample of the negatives, preferably not overlapping with the first sample of negatives.
And after that you start all over again … and again … and again … The more the better, but 10 to 20 times will normally already do a great job.
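The repeated sampling could be sketched like this (a toy Python sketch; the `bagging_splits` name, the 1.5 negative ratio and the row labels are my own illustration):

```python
import random

def bagging_splits(positives, negatives, n_rounds=10, neg_ratio=1.5, seed=0):
    """For each round: 90% of the positives for training plus a fresh,
    non-overlapping slice of negatives; the remaining 10% of positives
    are held out for testing."""
    rng = random.Random(seed)
    neg_pool = negatives[:]
    rng.shuffle(neg_pool)                       # so the slices are random
    n_train_pos = int(0.9 * len(positives))
    n_train_neg = int(neg_ratio * n_train_pos)  # ~40/60 pos/neg mix
    splits = []
    for r in range(n_rounds):
        pos = positives[:]
        rng.shuffle(pos)
        train_pos, test_pos = pos[:n_train_pos], pos[n_train_pos:]
        start = r * n_train_neg                 # disjoint negative slices
        train_neg = neg_pool[start:start + n_train_neg]
        splits.append((train_pos + train_neg, test_pos))
    return splits

# toy example: 100 positives, 5000 negatives
splits = bagging_splits([f"p{i}" for i in range(100)],
                        [f"n{i}" for i in range(5000)])
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 225 10
```

Each round trains on 90 positives and 135 negatives, and each round's test set contains 10 positives the model never saw.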

Suppose you repeated the sampling and modeling 10 times: then you get 10 models, and test results for 10 times 10% = 100% of the positives, and a lot more for the negatives, because they are the majority class.
With the test results you can plot the calculated probabilities against the real probabilities to relate the models to the real world.
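One simple way to relate calculated probabilities to real ones is to bin the pooled test scores and look at the observed positive rate in each bin (a sketch; the toy scores and labels are made up):

```python
def calibration_points(scores, labels, n_bins=5):
    """Bin test points by model score; within each bin compare the average
    score (the model's claim) to the observed fraction of positives."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    return [(round(sum(s for s, _ in b) / len(b), 2),   # mean predicted score
             round(sum(y for _, y in b) / len(b), 2))   # observed positive rate
            for b in bins if b]

# toy pooled test results from the 10 models
scores = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
labels = [0,   0,   0,   1,   1,   1,   1,   1]
print(calibration_points(scores, labels))
# [(0.1, 0.0), (0.3, 0.5), (0.7, 1.0), (0.9, 1.0)]
```

Plotting these pairs against the diagonal shows at a glance whether the models over- or under-estimate the real probabilities.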

But there is something even better: if you use the 10 models to score a new dataset, then you obtain 10 different scores for each data point. These 10 scores will not be the same, since they are generated by different models. What you have to do is calculate the average of these 10 scores to obtain a more accurate score than any of the 10 individual scores.
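Averaging the scores is straightforward (a minimal sketch with three made-up models scoring three new points):

```python
def averaged_scores(model_scores):
    """model_scores[m][i] = probability model m assigns to point i;
    return the per-point average across all models."""
    n_models = len(model_scores)
    return [sum(col) / n_models for col in zip(*model_scores)]

# hypothetical scores from 3 models on the same 3 new points
scores = [[0.9, 0.2, 0.5],
          [0.7, 0.4, 0.5],
          [0.8, 0.3, 0.5]]
print([round(s, 2) for s in averaged_scores(scores)])  # [0.8, 0.3, 0.5]
```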

And that is what you needed: a more or less reliable model and an indication of its quality.
And to end, a little paradox: you made a nicely “balanced” model with an unbalanced training set, since you used practically 100% of the positives, but about 12 times as many negatives!

Did you like this post? Then you might be interested in the following:
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Text mining : Reading at Random

## Responses

1. Thank you very much for the post. It looks like an interesting thing to try out.

2. “If you insist on using logistic regression then I advise the following :

If you use the entire imbalanced dataset I am convinced that, if the regression eventually converges, you will end up with a very poor model.”

Can you explain why you have drawn this conclusion?

3. I’m preparing a post on the subject of sampling as well!

I’ve built models with 90/10, 95/5 or worse without resampling with good success whether using logistic regression, neural networks, or some kinds of trees. The key is thresholding the posterior probability estimate from the model at the level of the a priori probability (if you want to compute classification accuracy or use confusion matrices). If you build trees, for CART make sure you have “Equal priors”, and if using C5 you will have to use misclassification costs or you won’t get any tree to build.
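In code, that thresholding amounts to something like this (a minimal sketch; the scores and the 5% prior are made up):

```python
def classify_at_prior(scores, prior):
    """Call a case positive when its score reaches the base rate of the
    positive class, instead of the usual 0.5 cutoff."""
    return [1 if s >= prior else 0 for s in scores]

# with a 5% base rate, a score of 0.10 is well above the prior
print(classify_at_prior([0.02, 0.10, 0.60], prior=0.05))  # [0, 1, 1]
```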

Bagging (FYI–short for Bootstrap Aggregating rather than Averaging) is a nice method if the under-represented class has small counts relative to the complexity of the problem to be solved. I have presented on Bagging methods several times (see presentations at http://abbottanalytics.com/data-mining-resources-abbott.php#papers — I need to post the actual presentations rather than just the references!) and they work quite well when one needs better accuracy, and when one just needs to reduce risk of deploying a “bad” model.

Thanks for the interesting and provocative post!

4. @Will : Dean pinpointed the problem I was thinking about when writing this post : “…if the under-represented class has small counts relative to the complexity of the problem …”
Highly unbalanced datasets for me means a minority class of 0.5 percent or less (I once had to work with about 150 cases in class A versus 5 million in class B, and several hundred variables). When the minority class is really that sparse, the chance of having too few instances of it relative to the number of variables can make it difficult to get a trustworthy model.
In that case, I agree completely with Dean that bagging (“Bootstrap Aggregating”, thanks for the correction, Dean 🙂 ) is a very powerful solution.