Posted by: zyxo | May 24, 2009

Good enough / data quality

Detail on a bottle of Ardbeg whisky.
Image via Wikipedia

Data quality: when is it sufficient?

Leave out the data, let us talk about quality.

First of all here are some examples of quality “problems”.

Obviously we have to make choices, and those choices often fall short of the best possible quality.
So it is with data: when is the quality good enough?

It depends: what do you want to do with it?

— If it is reporting: the numbers had better be correct. In a large enterprise I bet there will be two sources for the same numbers. The results will be compared, and if they differ there will be trouble.
— If it is descriptive data mining, like clustering or descriptive classification: the data had better be as correct as possible. Errors are acceptable within reasonable limits, as long as the picture “fits”.
— If it is data mining for targeting purposes: the data has to be stable over time. Correct? I do not care. Does this sound crazy? Perhaps. But really: I do not care! If they put someone's shoe size in the “Birthday” variable, this poses no problem, because the data mining algorithm does not take the meaning of the variable names into account. “var1”, “var2”, “var3”, etc. do equally well. The only thing that matters is: how good is the predictive quality of the targeting model? You can only obtain a good predictive model with variables that have predictive power (are related to the target) and that are stable, meaning that what the variable contains does not change over time. That is why I do not like it when IT people correct flaws in the data: the correction changes what the variable contains, which diminishes the quality of the existing models, and I have to rebuild them.
So better use your time to build targeting models than to try to get the data to be perfect. Just use the GIQO principle I just invented: GARBAGE IN, QUALITY OUT! (a bit like the urine-to-water machine at the space station)
— If it is web analysis: this is yet another story, neatly explained by Avinash Kaushik in this post.
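The targeting argument can be tried out on toy data. What follows is only a minimal sketch under my own assumptions, not a real targeting model: the data is synthetic, the "model" is a hypothetical one-variable threshold rule (`fit_stump`), and the column names and the "IT fix" scenario are invented for illustration. It shows a model that works fine with shoe sizes stored in “birthday”, works equally well when the columns are renamed “var1” and “var2”, and breaks once the field is corrected to hold real birth years.

```python
import random

def fit_stump(rows, target):
    """Return the (column, threshold) rule 'value > threshold' with the best training accuracy."""
    best = None
    for col in rows[0]:
        for thr in sorted({r[col] for r in rows}):
            acc = sum((r[col] > thr) == t for r, t in zip(rows, target)) / len(rows)
            if best is None or acc > best[0]:
                best = (acc, col, thr)
    return best[1], best[2]

def accuracy(model, rows, target):
    """Share of rows where the rule agrees with the target."""
    col, thr = model
    return sum((r[col] > thr) == t for r, t in zip(rows, target)) / len(rows)

rng = random.Random(0)
n = 200

# Synthetic assumption: the target depends on shoe size, and the shoe size
# has (wrongly) been stored in the "birthday" field. "var2" is pure noise.
shoe_size = [36 + i % 12 for i in range(n)]
target = [s > 42 for s in shoe_size]
rows = [{"birthday": s, "var2": rng.random()} for s in shoe_size]

# 1. The "wrong" data predicts perfectly well.
best_named = fit_stump(rows, target)
acc_named = accuracy(best_named, rows, target)

# 2. Renaming the columns changes nothing: the algorithm ignores the names.
renamed = [{"var1": r["birthday"], "var2": r["var2"]} for r in rows]
acc_renamed = accuracy(fit_stump(renamed, target), renamed, target)

# 3. IT "corrects" the field: it now holds real birth years. The content of
#    the variable has changed, so the existing model no longer works.
fixed = [{"birthday": 1950 + i % 50, "var2": rng.random()} for i in range(n)]
acc_after_fix = accuracy(best_named, fixed, target)

print("rule learned:", best_named)
print("accuracy with shoe sizes in 'birthday':", acc_named)
print("accuracy with renamed columns:", acc_renamed)
print("accuracy after the data was 'corrected':", acc_after_fix)
```

The third number is the one that hurts: the data is now correct, and the model has to be rebuilt.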

Did you like this post? Then you might be interested in the following:
Howmany inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
