Recently someone asked me:
1) can logistic regression handle imbalanced datasets?
2) is it possible to adjust the regression equation so that it truly represents the actual situation?
The answer to the first question is “YES” and “NO”. I will explain this further in this post.
For the second question: I really do not know, but it seems to me that this would be very unreliable.
Can we mine a highly imbalanced dataset with logistic regression?
Let me say this first: to mine highly imbalanced datasets, you should preferably NOT use logistic regression, but something like decision trees. If you insist on using logistic regression, then I advise the following:
If you use the entire imbalanced dataset, I am convinced that, even if the regression eventually converges, you will end up with a very poor model. So don’t.
Now for the sampling. Let us assume that the minority class is the positives.
Since you have many more negatives than positives, a first approach would be to take all the positives and an equal number of negatives, which gives you a perfectly balanced sample. (A dataset is generally considered balanced when the minority class contains at least 30% of the observations.)
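A minimal sketch of this balanced undersampling on a made-up dataset (all array names and sizes here are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced dataset: 1000 negatives, 50 positives.
X = rng.normal(size=(1050, 3))
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(50, dtype=int)])

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep all positives and an equal-sized random sample of the negatives.
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_sample])

X_bal, y_bal = X[keep], y[keep]  # perfectly balanced: 50 positives, 50 negatives
```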
There is a good thing and a bad thing to that approach.
The good thing is that logistic regressions do very well on small datasets (much better than for example decision trees). So with that approach you should already get an acceptable model.
The bad thing is that you have no way left to test the quality of your model, since you have no positives left over. (This testing should always be done on a sample that was not used to build the model.)
Consequently I suggest an alternative approach, which yields a model of the same or, more likely, better quality AND an indication of the quality of the model.
This approach is called BAGGING (bootstrap aggregating). It is mostly used, and shows the largest gains, on weak classifiers like decision trees, but in this particular case it can be very useful too.
For example, take 90% of the positives and about the same number of negatives (you may take a slightly higher number of negatives to obtain, for example, a distribution of 40% positives and 60% negatives).
Then you build your model and test it on the 10% of positives that you did not use in the model, and of course on whatever number of negatives you want.
Then you start all over again with another random sample of 90% of the positives and a fresh sample of the negatives, preferably not overlapping with the first sample of negatives.
And after that you start all over again… and again… and again. The more the better, but 10 to 20 repetitions will normally already do a great job.
Suppose you repeated the sampling and modeling 10 times: then you get 10 models, plus test results covering 10 times 10% = 100% of the positives, and many more of the negatives, because they are the majority class.
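The sampling-and-modeling loop above could be sketched like this with scikit-learn; the toy data, the 40/60 mix, and all variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Made-up imbalanced data: 2000 negatives, 100 positives, shifted classes.
n_neg, n_pos = 2000, 100
X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 2)),
               rng.normal(1.5, 1.0, (n_pos, 2))])
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.permutation(np.flatnonzero(y == 0))  # shuffled once, so the
                                                   # negative slices below
                                                   # do not overlap

n_rounds = 10
n_train_pos = int(0.9 * len(pos_idx))      # 90% of the positives per round
n_train_neg = int(n_train_pos * 60 / 40)   # ~40/60 positive/negative mix

models, test_scores, test_labels = [], [], []
for i in range(n_rounds):
    train_pos = rng.choice(pos_idx, size=n_train_pos, replace=False)
    train_neg = neg_idx[i * n_train_neg:(i + 1) * n_train_neg]
    train = np.concatenate([train_pos, train_neg])

    model = LogisticRegression().fit(X[train], y[train])
    models.append(model)

    # Test on the held-out 10% of positives plus some unused negatives.
    held_pos = np.setdiff1d(pos_idx, train_pos)
    held_neg = neg_idx[n_rounds * n_train_neg:][:200]
    test = np.concatenate([held_pos, held_neg])
    test_scores.append(model.predict_proba(X[test])[:, 1])
    test_labels.append(y[test])
```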
With the test results you can plot the calculated probabilities against the real (observed) probabilities, to relate the models to the real world.
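One way to sketch that check is scikit-learn's `calibration_curve`, which bins the predicted probabilities and returns the observed positive rate per bin (the pooled test scores below are made up for illustration):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)

# Made-up pooled test results: predicted probabilities and outcomes,
# drawn so that the toy scores are roughly calibrated.
y_prob = rng.uniform(size=500)
y_true = (rng.uniform(size=500) < y_prob).astype(int)

# For each of 10 probability bins: the real (observed) positive rate
# and the mean predicted probability.
real_p, predicted_p = calibration_curve(y_true, y_prob, n_bins=10)
# Plotting predicted_p against real_p relates the scores to the real
# world: points near the diagonal indicate well-calibrated scores.
```

Any systematic distortion introduced by the rebalanced training samples shows up as a consistent offset from the diagonal in this plot.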
But there is something even better: if you use the 10 models to score a new dataset, then you obtain 10 different scores for each point. These 10 scores will not be identical, since they are generated by different models. What you do is average these 10 scores, to obtain a more accurate score than any of the 10 individual scores.
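The averaging step might look like this; the 10 models are stand-ins fitted on a toy dataset, and every name here is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Stand-ins for the 10 models from the bagging rounds, each fitted on a
# different random subsample of a toy two-class dataset.
X_toy = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),
                   rng.normal(1.0, 1.0, (20, 2))])
y_toy = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int)])

models = []
for _ in range(10):
    idx = rng.choice(len(X_toy), size=30, replace=False)
    models.append(LogisticRegression().fit(X_toy[idx], y_toy[idx]))

# Score new points with every model, then average per point.
X_new = rng.normal(size=(5, 2))
per_model = np.stack([m.predict_proba(X_new)[:, 1] for m in models])
bagged_score = per_model.mean(axis=0)  # one averaged score per new point
```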
And that is what you needed: a more or less reliable model and an indication of its quality.
And to end with a little paradox: you made a nicely “balanced” model with an unbalanced training set, since you used practically 100% of the positives, but about 12 times more negatives!
Did you like this post? Then you might be interested in the following:
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Text mining : Reading at Random