How many records do you need to make a decent data mining model?

Let us first look at a data mining definition (you find dozens of them on the web, I just took one at random).

The automatic extraction of useful, often previously unknown information from large databases or data sets.

In most definitions we find something like “*large database*” or “*lots of data*”, which implies that we need a huge amount of data to enjoy our data mining hobby.

Is this so ?

Anyway it is a tough question.

Let us start simple. It is all about getting information out of the data. So let us take three points in a plane (x-y plot). If they fall exactly on a straight line, the correlation coefficient is 1 (or -1) and statistically significant. This means that you do not necessarily need a lot of data to extract information from it.
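The three-point case can be checked with a few lines of Python. This is only a sketch: the sample points and the helper names `pearson_r` and `p_value_three_points` are made up for illustration. It relies on the fact that with three points the t-test for a correlation has one degree of freedom, which gives a closed-form two-sided p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def p_value_three_points(r):
    """Two-sided p-value for H0: no correlation, valid only for n = 3.

    With n = 3 the test statistic follows a t-distribution with 1 degree
    of freedom (a Cauchy distribution), so p = 1 - (2/pi) * asin(|r|).
    """
    return 1.0 - (2.0 / math.pi) * math.asin(min(1.0, abs(r)))

xs = [1.0, 2.0, 3.0]
ys_line = [2.0, 4.0, 6.0]   # exactly on a straight line (illustrative values)
ys_near = [2.0, 4.5, 6.0]   # middle point nudged slightly off the line

r_line = pearson_r(xs, ys_line)
r_near = pearson_r(xs, ys_near)
p_line = p_value_three_points(r_line)
p_near = p_value_three_points(r_near)

print(f"on the line:   r = {r_line:.4f}, p = {p_line:.4f}")
print(f"near the line: r = {r_near:.4f}, p = {p_near:.4f}")
```

Points that lie exactly on the line give a p-value of zero. With the middle point nudged off the line, the correlation stays close to 1 but the p-value rises above 0.05, so the three-point argument only holds when the effect is essentially perfect.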

But data mining was invented to overcome the problems statistics have with huge amounts of data and variables.

Candidate factors that play a role in determining the optimal number of observations are:

– **dimensionality** : the number of variables (preferably convert categorical variables into dummies before counting!). As a rule of thumb you should have at least as many observations as roughly the square of the number of variables. (I forgot where I read or heard this.)

But what about a dataset with 10,000 variables of which only 2 are really related to the target variable ? In that case there is no “curse of dimensionality”. The only problem is the storage space and computing power to find the two significant ones.

– **power** : This is a difficult one and often overlooked in statistics. Large power here means a clear and large effect of the independent variables on the target variable (what statisticians call a large effect size). Small power means that there is an effect but it is very small and hence difficult to detect … unless you have a lot of observations. Let us return to the three points on a straight line: they present a huge power, so three points are sufficient to establish that there is a significant correlation. But what if the population is an almost circular cloud of points? With 10,000 points in that plane you could calculate a correlation coefficient of 0.04, which is highly significant but has low power! With data mining we often want to include even the smallest effects in our model to increase the prediction quality (read “marketing campaign return”) as much as possible. So we need lots of observations to detect them.

– **modeling method** : decision trees can handle a huge number of observations. So can logistic regressions. But since you obviously want to perform some selection of variables, you want a stepwise regression, and that will take ages. Random forests can handle a lot of variables but relatively few observations. This is something you have to test on your own system.
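The correlation of 0.04 mentioned under power can be checked numerically. The sketch below is stdlib-only and makes one assumption that is mine, not the post's: for large samples the t-statistic of a correlation is treated as standard normal, which avoids needing a statistics package. The function name `approx_p_value` is hypothetical.

```python
import math

def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r from n points.

    Assumption: n is large enough that the t-statistic
    t = r * sqrt(n - 2) / sqrt(1 - r^2) is roughly standard normal.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return math.erfc(abs(t) / math.sqrt(2.0))

r = 0.04
explained_variance = r * r              # share of variance explained by the effect
p_large = approx_p_value(r, 10_000)     # the 10,000 points from the text
p_small = approx_p_value(r, 100)        # same effect, far fewer observations

print(f"explained variance: {explained_variance:.4%}")
print(f"n = 10000: p = {p_large:.6f}")
print(f"n = 100:   p = {p_small:.3f}")
```

The same tiny effect is clearly significant with 10,000 observations and undetectable with 100, even though it explains well under one percent of the variance in both cases.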

The **one solution I propose** to get an estimate of how many observations are sufficient but not too many: **try it out!**

**Too many** :

– if your tool/system cannot handle them any more (neural networks, logistic regressions …)

– for decision trees : if the model quality does not improve any more (tested on a hold-out dataset). Be aware that decision trees grow larger and larger as long as you feed them more observations, but do not necessarily get better (unless you force them to stop at a fixed number of splits, which I do not think is a good idea!)

**Too few** :

– poor model

So what should you do ? Make a lot of models with increasing numbers of observations and test them against a hold-out dataset. Continue adding observations as long as the model quality improves.
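The procedure above can be sketched in plain Python. Everything specific here is an assumption for illustration: the synthetic data, the simple least-squares line standing in for "the model", the hold-out mean squared error as quality measure, and the helper names `learning_curve` and `pick_size`. Swap in your own model and metric.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def holdout_mse(model, holdout):
    """Mean squared error of the fitted line on the hold-out set."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in holdout) / len(holdout)

def learning_curve(train, holdout, sizes):
    """Fit one model per training size and score each on the same hold-out set."""
    return [(n, holdout_mse(fit_line(train[:n]), holdout)) for n in sizes]

def pick_size(curve, tolerance):
    """Return the last size before the hold-out error stopped improving by at least `tolerance`."""
    for (prev_n, prev_err), (_, err) in zip(curve, curve[1:]):
        if prev_err - err < tolerance:
            return prev_n
    return curve[-1][0]  # still improving: use everything you have

# Synthetic stand-in data: y = 2x + noise (purely illustrative).
rng = random.Random(0)

def make_point():
    x = rng.uniform(0, 10)
    return x, 2.0 * x + rng.gauss(0, 1)

train = [make_point() for _ in range(2000)]
holdout = [make_point() for _ in range(1000)]
sizes = [10, 50, 250, 1000, 2000]

curve = learning_curve(train, holdout, sizes)
for n, err in curve:
    print(f"{n:5d} observations -> hold-out MSE {err:.4f}")
print("suggested size:", pick_size(curve, tolerance=0.01))
```

The hold-out set stays fixed while the training set grows, so the scores are comparable across sizes. The tolerance is a judgment call: it expresses how much improvement is still worth the cost of collecting and processing more observations.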

As someone said before: it is 5% inspiration and 95% perspiration …

*Did you like this post? Then you might be interested in the following:*

Oversampling or undersampling ?

data mining with decision trees : what they never tell you

The top-10 data mining mistakes

Mining highly imbalanced data sets with logistic regressions

Based on my data set with 12000+ observations, I decided to create 11 models (right now I use an SVM model) based on 1000, 2000, … up to 11000 observations and validated them on a validation set of 1000 cases. The results (I used the model/output correlation) were as expected: the training results decreased while the validation results increased. Since I can collect only 600 new cases each month, this is a work in progress 😉

Thanks for sharing your ideas.

My results: http://img268.imageshack.us/img268/3271/shot20090515233750872.png

By: Jim on May 16, 2009 at 6:02 pm