Data quality: when is it sufficient?
Leave the data aside for now; let us first talk about quality in general. Here are some examples of quality "problems":
- Microsoft: how many bugs do they ship with their software? There is no such thing as a completely bug-free software product, so waiting for the last bug to be found and fixed would mean they stop selling software.
- Heinz ketchup or something cheaper? I prefer Heinz.
- Johnnie Walker Red Label, Black Label, Green Label, Gold Label, Blue Label... or Ardbeg? Blue Label and Ardbeg are delicious, but a bit too expensive for me.
- Skoda or Rolls-Royce? I drive something in between: good enough.
Obviously we constantly make choices that settle for less than the best possible quality.
So it is with data: when is the quality good enough?
It depends: what do you want to do with it?
- If it is reporting: the numbers had better be correct. In a large enterprise I bet there will be two sources for the same numbers; the results will be compared, and when they differ there will be trouble.
- If it is descriptive data mining, like clustering or descriptive classification: the data had better be as correct as possible. Errors are acceptable within reasonable limits, as long as the overall picture "fits".
- If it is data mining for targeting purposes: the data has to be stable over time. Correct? I do not care. Does this sound crazy? Perhaps. But really: I do not care! If someone's shoe size ends up in the "Birthday" variable, that poses no problem, because the data mining algorithm does not take the meaning of the variable names into account. "var1", "var2", "var3", etc. do equally well, as the sketch below shows. The only thing that matters is: how good is the predictive quality of the targeting model? You can only obtain a good predictive model with variables that have predictive power (are related to the target) and that are stable, meaning that what the variable measures does not change over time (a stability check is sketched at the end of this list). That is why I do not like it when IT people correct flaws in the data: the "corrected" variable is no longer the one the model learned from, so the model quality drops and I have to rebuild my models.
So you are better off spending your time building targeting models than trying to get the data perfect. Just use the GIQO principle I just invented: GARBAGE IN, QUALITY OUT! (a bit like the urine-to-water machine on the space station)
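Here is a minimal sketch of that claim, in Python with pandas and scikit-learn (my choice of tools; the column names and data are made up for illustration). The same model is trained twice: once with "meaningful" column names, once with everything renamed to var1 ... var5. The AUC is identical, because the algorithm never looks at the names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: 5 inputs, one binary target.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

# "Meaningful" names -- including shoe size miscoded into "Birthday".
named = pd.DataFrame(X, columns=["Age", "Income", "Birthday", "Tenure", "Visits"])
# Exactly the same columns under anonymous names.
anonymous = pd.DataFrame(X, columns=[f"var{i}" for i in range(1, 6)])

for frame in (named, anonymous):
    X_tr, X_te, y_tr, y_te = train_test_split(frame, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(list(frame.columns), "-> AUC =", round(auc, 4))
```

Both runs print exactly the same AUC: renaming the columns changes nothing the algorithm sees.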
- If it is web analysis: that is yet another story, neatly explained by Avinash Kaushik in this post.
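As for the stability check I keep insisting on: here is a minimal sketch in Python (the psi function, the monthly series, and the 0.25 alert threshold are illustrative assumptions of mine). The population stability index compares a variable's distribution in two time periods and flags the month where, say, shoe sizes start landing in "Birthday".

```python
import numpy as np
import pandas as pd

def psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """Population stability index between two samples of the same variable."""
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from the reference period
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)                   # avoid log(0) in empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
january = pd.Series(rng.normal(40, 10, 5000))       # the variable as usual
february_ok = pd.Series(rng.normal(40, 10, 5000))   # new month, same meaning
february_bad = pd.Series(rng.normal(8, 2, 5000))    # shoe sizes crept in

# A common rule of thumb flags a variable when PSI goes above 0.25.
print("stable month:  PSI =", round(psi(january, february_ok), 3))   # close to 0
print("drifted month: PSI =", round(psi(january, february_bad), 3))  # far above 0.25
```

A variable that fails this check has changed meaning, and a targeting model that uses it should be rebuilt; a variable that passes can stay "wrong" forever as far as I am concerned.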
Did you like this post? Then you might be interested in the following:
How many inputs do data miners need?
Oversampling or undersampling?
Data mining with decision trees: what they never tell you
The top-10 data mining mistakes