Traditional research is about:
1. knowing a bit about a subject,
2. formulating a hypothesis about the subject,
3. gathering data to verify that hypothesis,
4. doing some statistical tests to see if the hypothesis can be accepted or has to be rejected,
5. judging if the obtained result is worthwhile writing an article about,
6. if yes, writing the article,
7. submitting the article,
8. judging (by the reviewers) whether the article is good enough to be published,
9. publishing the article.
Points 1. to 4. are perfectly OK.
But then begins the non-scientific part of scientific research: people have to judge, and judging is obviously a subjective activity. So when does an article get published (point 8)? When the reviewers find it good enough, and mostly (>95% of the time) that means it has to contain some conclusion based on a statistically significant outcome.
Result 1: non-significant results, which are in themselves just as valuable as the significant ones, are discriminated against in the scientific literature.
Result 2: since by definition 5% of statistical tests on data with no pattern whatsoever in it will show a statistically significant outcome, a lot of published significant findings are rubbish. Just because the researchers and reviewers rely on statistics.
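You can check that 5% figure yourself with a minimal simulation: run many t-tests on two groups drawn from the exact same distribution (so there is genuinely nothing to find) and count how often the test comes out "significant" at the usual 0.05 level. All names and parameters here are my own illustration, not from the post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 10_000
alpha = 0.05
false_positives = 0

for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: there is no real pattern.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# The fraction of "significant" findings on pure noise hovers around alpha.
print(f"significant results on pure noise: {false_positives / n_experiments:.3f}")
```

Run it and you get roughly 5% "discoveries" out of nothing, which is exactly the point: significance alone tells you little about whether a finding is real.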
That's the problem of uncertainty: statistical outcomes follow some distribution with a lot of uncertainty in it. Unless the research is done over and over again, and the previous findings are confirmed, those first findings are worthless.
Good data mining practice.
One of the basic habits of any good data miner is using a hold-out sample to verify whether the new model really captures real patterns, or whether it was just a coincidence. That's how data miners deal with uncertainty (and with a lot of other data mining issues, like information leaks, multi-level categorical variables and the like). See the red line (training result) and the blue line (validation result on the hold-out sample).
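The habit is easy to sketch. Below, a deliberately overfitted model is trained on data that contains no real signal at all: it looks brilliant on the training sample and falls apart on the hold-out sample. The data, model choice and metric are my own assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic "customer" data: 20 noise features, none predictive of the target.
X = rng.normal(size=(2_000, 20))
y = rng.integers(0, 2, size=2_000)

# Set 30% of the data aside as a hold-out sample BEFORE any modelling.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree will happily memorise the noise in the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_hold = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

print(f"train AUC:    {auc_train:.2f}")   # near 1.0: a "prominent" pattern
print(f"hold-out AUC: {auc_hold:.2f}")    # near 0.5: the pattern was noise
```

The gap between the two numbers is precisely what the red and blue lines show: the hold-out sample is the data miner's built-in protection against fooling himself.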
Difference between data mining and statistics.
Statistics is about testing a hypothesis; data mining is about calculating the hypothesis and then testing afterwards, with a hold-out sample, whether the hypothesis still stands.
Why shouldn't a data miner test his calculated hypothesis with a statistical test? Because for that statistical test you still need another sample of data. The data mining algorithm calculated the pattern (hypothesis) that stood out the most, so testing it statistically on the same data is spurious: it will always be highly significant.
Yes, you could do the statistical test on the hold-out sample. And yes, the findings would be reliable. But what would they be worth? If you have a very prominent pattern in the training sample and only a very weak pattern in the hold-out sample, then due to the large amount of data the weak pattern would still be statistically significant. What data miners want to know is whether the model performance is good enough to be used in, for example, some marketing campaign.
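Here is what that looks like in numbers. Take a hypothetical hold-out sample where the model's top-scored prospects respond at 5.5% versus 5.0% for the rest: a weak pattern by any business standard. A standard two-proportion z-test (the counts below are invented for illustration) declares it overwhelmingly significant, simply because the sample is large.

```python
import math
from scipy import stats

# Hypothetical hold-out results: high-scored half vs low-scored half.
n1, resp1 = 100_000, 5_500   # 5.5% response among high scores
n2, resp2 = 100_000, 5_000   # 5.0% response among low scores

p1, p2 = resp1 / n1, resp2 / n2
p_pool = (resp1 + resp2) / (n1 + n2)

# Pooled two-proportion z-test.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"lift: {p1 - p2:.3%}, z = {z:.1f}, p = {p_value:.2g}")
# Wildly significant -- yet a half-point lift may never pay for the campaign.
```

The p-value answers "is there any pattern at all?", while the data miner's question is "is the lift big enough to make money?". Those are different questions, and with enough data the first one stops being informative.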
So what’s the risk ?
Uncertainty and risk are two different things. Uncertainty is about the possibility of ending up with a result in the wrong part of the statistical distribution of your significance test. Risk, on the other hand, is about things changing around your subject that change the underlying pattern itself.
The best illustration of risk we all saw in recent years, when the global financial system collapsed. The models dealt with uncertainty, but the risks in our economic ecosystems were forgotten.
In commercial data mining you have the same risks. You can make a great model predicting which of your prospects will buy your product xyz. But if, at the very moment you launch your campaign, a competitor does exactly the same, with the same product but at a much lower price, then there is a good chance that your campaign will suck, no matter how good your data mining model was. That's the risky part of your business.
My inspiration for this post: