Tough question: what is a good data miner?
One way of finding out is to look at job descriptions, for example this one: Credit Suisse Data Miner Job Description
M.George distinguished five areas of expertise necessary to be a good data miner:
- techniques: to be able to do it
- analytics: to be able to decide what to do and how to do it
- business: to understand your customers
- communication: to make your findings clear to others
- project management: to manage everything and everyone from start to end
But all that remains a bit abstract.
In what follows I will try to be somewhat more concrete.
Let us start with the data.
You have to be a bit of a detective just to find your data. Find the people who know where the data is. Find out how you can access the data. Find out who can give you access rights to the data. Find out which key variables let you join the various tables into one flat file …
Then you have to be a programmer to put all that information to use: SQL, SAS, BI tools, R, whatever, not only to get your raw data but also to make it usable. What will you do with missing values? Which derived variables will you calculate? Etc.
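To make the join, missing-value and derived-variable steps a bit more tangible, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `customers` and `orders` tables, the `customer_id` key, and the choice to fill missing ages with the median. Your own data and choices will differ.

```python
from statistics import median

# Hypothetical raw tables; in practice these come from SQL, SAS, or a BI export.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},  # missing value
    {"customer_id": 3, "age": 51},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 10.0},
]


def build_flat_file(customers, orders):
    """Join orders onto customers by customer_id, one row per customer."""
    # Aggregate the many-side table by the join key.
    totals, counts = {}, {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
        counts[key] = counts.get(key, 0) + 1

    # Fill missing ages with the median of the known ages (one choice among many).
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    age_fill = median(known_ages)

    flat = []
    for customer in customers:
        key = customer["customer_id"]
        n_orders = counts.get(key, 0)
        total = totals.get(key, 0.0)
        flat.append({
            "customer_id": key,
            "age": customer["age"] if customer["age"] is not None else age_fill,
            "age_was_missing": customer["age"] is None,  # keep track of imputation
            "n_orders": n_orders,                        # derived variable
            "total_amount": total,                       # derived variable
            "avg_amount": total / n_orders if n_orders else 0.0,
        })
    return flat


flat = build_flat_file(customers, orders)
```

Keeping an `age_was_missing` flag next to the filled-in value is a deliberate choice here: the fact that a value was missing is sometimes informative in itself.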
A lot of technical skills needed.
But there is not only the data, there is also the problem to solve. So you need to be an analyst.
As an analyst you have to make decisions about doing things right and doing the right things:
- take a step back, know where to start and where to stop
- question everything: always ask yourself where you are wrong, not good enough, too complicated, not efficient enough, …
- question everything: when they ask for numbers, ask them to explain their problem and how these numbers will solve it. Propose better, cheaper, nicer solutions …
And now comes the fun part: you have to be a number cruncher.
You love data, charts, statistics (not the theory, but what you can do with it). You love to explain to people why something happens, to show them relationships between numbers, the conclusions that you derive from your numbers …
You know the data mining techniques and the statistical techniques, what you can and cannot do with them, their advantages and drawbacks, how to interpret the results, and how to present the results in an understandable way (remember: the others are stupid and lazy, so you have to make it simple and easy).
Unfortunately there is also the business (profits, costs, ROI …)
They expect you to deliver usable results in a short time. An accountant must deliver numbers that are correct; a data miner is lucky: nothing has to be absolutely correct. When it is good enough, deliver! (Think “Microsoft software quality”!)
They sometimes say that a data mining model is never finished, only the data miner stopped working on it. This is very true, so keep it in mind and know when to stop and deliver!
Of course every data mining project is, well … a project. So you have to be a project manager too.
As a project is by definition something with a start and an end, you should have somewhere a description (accepted by all involved parties) of “WHEN CAN YOU CONSIDER THE PROJECT AS FINISHED”. This description is the only thing you need, because it has to contain all the conditions that have to be fulfilled (goals, deliverables, quality metrics …).
What helps you to deliver more quickly is to stick to the following rule: do the same thing twice, but never do it three times. This means that for anything you will have to do more than twice, you should find a way to get it done automatically: write a program, download a program, write an Excel macro, anything.
This means you also have to be a bit of a software engineer!
This automation/industrialisation holds for anything: data extraction, modelling, model result reporting, monitoring of your model quality, monitoring of the data quality, etc.
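As an illustration of automating something you would otherwise redo by hand for every new batch, here is a minimal sketch of a data quality check in plain Python. The function names, the column names and the 30% threshold are all hypothetical; the only point is that the check is written once and then reused.

```python
def missing_rates(rows, columns):
    """Return the fraction of missing (None) values per column."""
    n = len(rows)
    return {
        col: (sum(1 for row in rows if row.get(col) is None) / n if n else 0.0)
        for col in columns
    }


def quality_alerts(rows, columns, max_missing=0.3):
    """Return the columns whose missing rate exceeds max_missing (assumed threshold)."""
    rates = missing_rates(rows, columns)
    return sorted(col for col, rate in rates.items() if rate > max_missing)


# Hypothetical batch of incoming data.
batch = [
    {"age": 34, "income": None},
    {"age": None, "income": None},
    {"age": 51, "income": 3200},
    {"age": 47, "income": 2800},
]

alerts = quality_alerts(batch, ["age", "income"])
print(alerts)
```

Run on every new extraction, a check like this replaces a manual look at the data; the same idea applies to model quality metrics.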
And last but not least: you have to be a learner.
Never think you know it all. Always look for new ways, read articles, go to symposia, find out how others do it, and look for ways to deliver as much quantity and quality as possible without working too much 🙂