Ever seen a bunch of children play happily in a forest for a couple of hours and return home nice and clean? No way. If they are healthy, they should return covered in dirt and mud. Only then have they really played.
“Returning home nice and clean”: that’s the feeling I get when I read Scott Levine’s “7 steps of knowledge discovery in databases”. What he writes is correct, but it all seems so uncomfortably clean.
That’s why I am writing down my own version here:
“Se7en steps in finding knowledge nuggets”.
Step 1. Try to understand what your business user really needs (not what he/she asks for).
Know for sure that your business user almost never asks for what he needs. No, somewhere he has a problem, he figures out some sort of solution for himself, and based on that solution he asks you to do some mining work. If you just do it like that, I guarantee you that the result will never be satisfactory. Instead you’ll have to challenge him: ask him what he wants to accomplish with the result of your work, ask him what problem he wants to solve, and together find a (usually better) solution.
Step 2. Figure out what new data you need and which data-mining algorithms and/or statistical tests you need to use.
Now the creative part begins: once you know what your user needs, figure out how you can deliver. Which data? Which algorithms? You should perform the entire analysis in your head, or sketch it on a piece of paper. Some time ago I even got into the paradoxical habit of starting the analysis by writing the final report (of course without any results). It forces you to think the whole thing through beforehand, and not afterwards when it’s too late.
Step 3. Play detective to find the data tables and the right selection criteria to get the specific data that you need.
In the real world you do not have all the data laid out in front of you, nicely aligned and ordered according to some obvious criteria. No, often all you know is that it is there somewhere, in some table. Now the dirty work starts: call/email people to ask if they know someone who knows …
When you finally get the info, you access the data, only to find out that something is clearly not right. So you call/email people to ask if they know … until finally you are pretty sure you have what you wanted.
Step 4. Merge the newly found data with what you want to use from your existing datamart
This is the easier part, and pretty straightforward.
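To make the merge concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the field names (customer_id, age, segment, complaints), the values, and the left_join helper are hypothetical, and I assume each customer appears at most once in the new data. In practice you would probably do this in SQL or in your mining tool.

```python
# Hypothetical example data: field names and values are illustrative only.
datamart = [
    {"customer_id": 1, "age": 34, "segment": "retail"},
    {"customer_id": 2, "age": 51, "segment": "business"},
    {"customer_id": 3, "age": 27, "segment": "retail"},
]
new_data = [
    {"customer_id": 1, "complaints": 2},
    {"customer_id": 3, "complaints": 0},
]


def left_join(left, right, key):
    """Keep every row of `left`; add the matching `right` fields where they exist."""
    # Assumes `key` is unique in `right`; with duplicates the last row wins.
    right_by_key = {row[key]: row for row in right}
    # Collect the right-hand field names so unmatched rows get explicit None values.
    right_fields = sorted({f for row in right for f in row if f != key})
    merged = []
    for row in left:
        match = right_by_key.get(row[key], {})
        combined = dict(row)
        for field in right_fields:
            combined[field] = match.get(field)
        merged.append(combined)
    return merged


merged = left_join(datamart, new_data, "customer_id")
```

A left join keeps every customer from the datamart, so the customers for whom no new data was found show up with an empty (None) value instead of silently disappearing. Those gaps are exactly what you have to deal with in the next step.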
Step 5. Get this data “mining-ready”
Depending on the algorithms you want to use, your data has to meet some criteria: e.g. no nominal variables, a normal distribution, no missing values, etc. It can be pretty tough to get everything in order.
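As a rough illustration of two of these criteria, here is a small sketch in plain Python that fills missing values and turns a nominal variable into 0/1 columns. The data, the field names and the helpers impute_median and one_hot are hypothetical, and filling with the median is just one possible choice, not a recommendation. Getting a variable closer to a normal distribution is not covered here.

```python
from statistics import median

# Hypothetical example data: field names and values are illustrative only.
rows = [
    {"age": 34, "segment": "retail"},
    {"age": None, "segment": "business"},
    {"age": 27, "segment": "retail"},
    {"age": 51, "segment": "business"},
]


def impute_median(rows, field):
    """Replace missing (None) values of `field` with the median of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def one_hot(rows, field):
    """Replace a nominal `field` with one 0/1 column per category."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            new_row[f"{field}_{c}"] = 1 if r[field] == c else 0
        encoded.append(new_row)
    return encoded


ready = one_hot(impute_median(rows, "age"), "segment")
```

Which of these preparations you actually need depends on the algorithm you picked in step 2. Some cope fine with nominal variables or missing values, so check before you start transforming.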
Step 6. Mine that data (run algorithms, draw your conclusions)
That’s the exciting part. Not really difficult or dirty, because you had it all prepared. The part where you actually see what happens, the part where you discover the knowledge nugget, the part where you shout “YESS!”, or the part where you realise: “Shit, is that it? That’s all?”.
Step 7. Convince your business user that this result is all you can get out of it, even if it looks (in hindsight) ridiculously obvious.
Now you come out of the woods covered in mud, holding your nugget high, or almost empty-handed, or somewhere in between (gold dust?).
You have to explain to your business user what your model is worth, what he can do with it and how it can influence his marketing campaign results, and eventually withstand his somewhat accusing look of “that’s all you got for me?”.
(Step 8. Afterwards just get a shower and prepare to find your next nugget.)