About the problem
Here is my very personal list of what a man should have to be happy. I am aware that a lot of you will not agree on all 10 items, and I am OK with that. I can easily imagine a dozen other things that could add to my happiness, but somehow I feel the 10 below are, well, my top 10.
(To avoid any misunderstanding: items 1 and 2 I do not consider “things”, but persons.)
In random order.
5. hair or hat
And now in a bit more detail.
Have other suggestions? Let me know.
This is what “they” are trying to sell us:
This leads me to some questions:
What has changed?
Do we all have that much data?
Are we all concerned?
Other people will certainly have other opinions on this. If you do, do not hesitate to start the discussion.
Ever wanted to lift a heavy weight, or to open a door that is really stuck or to pull a large nail out of a beam?
The simplest thing you can do when you cannot get it done by yourself is to get help. More persons can deliver more force than one single person. Or a group of persons can accomplish more when you add some people to it.
However, if you are on your own to pull out that nail, which type of help would you choose: two other men like you, or one single person with a crowbar?
The choice is obvious. The one with the crowbar will unleash more force than the two other persons together, due to the huge leverage effect of the crowbar.
It is no different in marketing.
You can keep adding people to your team, but if you want to really give a boost to your campaigns, you will need someone with a crowbar.
This data miner is more than just another person. It’s the one who easily doubles, triples, quadruples the result of your campaign by optimizing the selection criteria for your target group. An optimized selection will do two things:
According to Wikipedia, “A professional is a person who is paid to undertake a specialized set of tasks and to complete them for a fee”.
A more exhaustive definition in the same Wikipedia article contains words like “expert”, “specialized knowledge”, “excellent skills”, “master in a specific field”.
In my long career and in my non-work time I saw (and still see) a lot of “professionals” who only meet the first line of this blog post: they are paid for what they do, but they are far from doing what they are supposed to do, or from delivering the quality you could expect from a real professional.
Some real life examples?
My own definition of a professional: “someone who wants to get paid for his work or his time”. The most difficult part for the one who has to pay is to find out whether that so-called professional is worth the money.
I am sure most of you will remember the moment in the movie “Titanic” when someone asks “Why doesn’t she turn?”. It happens after they discover that they are steaming right into an iceberg and try to pull over.
Why did she not turn?
The iceberg was spotted, the captain had given the order to turn, the guys in the machine room had made their adjustments, the first officer had turned the steering wheel.
So why did she not turn?
In the company I work for, I have seen this question asked many times. The CEO and board of directors had spotted the problem (loss of market share, diminishing profits, unexpected changes in the business landscape, or whatever can go wrong with a company). They had ordered a change of course, they had moved a lot of senior managers from one place in the organization to another. They had published some new company charter. But still the company kept heading straight ahead. No quick turn left or right. Just keeping to the old ways of doing things.
How can you expect a speeding 46,000-ton Titanic to quickly change its direction by simply changing the position of its mere 500 kilos of rudder?
How can you expect a speeding thousands-of-employees enterprise to swiftly change course by making a mere handful of senior managers switch chairs?
As long as the multitude of employees just keep doing what they have been doing for years, the same way they have been doing it, nothing will change significantly.
And still, there always are some people who would love to change (all day long they sing “My heart will go on” :-). There also are a whole lot of people who would accept change. There even are some people who would go along with the change when all the others do. But then there is this huge bulk of inertia: people who will not or cannot change, no matter what. As long as you keep this inert mass aboard, the Titanic will keep its momentum in the direction it was going before, and nothing will change significantly. Not fast enough, in any case.
So get rid of the inertia. Build a lighter boat with the changers, and then you will be able to take that quick turn to avoid the iceberg.
Perhaps you will even be able to change the course of the iceberg itself…
Here I use some quotes (in italics) from the original article Everything You Wanted to Know About Data Mining but Were Afraid to Ask and add some personal thoughts. So, yes, it will be “more”!
“We know that data is powerful and valuable.” …
“Data mining allows companies and governments to use the information you provide to reveal more than you think.”
Do not be fooled. Not everything is data mining. To know more about the difference between, for instance, mere data gathering and real data mining, you should read “is reading a newspaper data mining”.
“To most of us data mining goes something like this: tons of data is collected, then quant wizards work their arcane magic, and then they know all of this amazing stuff. But, how? “
How? That is exactly what I described in Se7en steps in finding knowledge nuggets. Oh yes, and not the polite textbook stuff, but how it happens in real life.
“And these days, there’s always more data.”
“The sheer scale of this data has far exceeded human sense-making capabilities.”
Yes, but in most cases it is not necessary to use all of this data. Most companies are only beginning to mine small parts of their own data. This stands in striking contrast to what large software and hardware vendors want us to believe. After all, they want to sell their products and the accompanying consultancy. Usually a free software package like Weka or R is largely sufficient.
“Data mining is used to … allow us to infer things about specific cases based on the patterns we have observed.”
That is the core task of data mining: detecting patterns. And this can be any kind of pattern, as will become clear in the following paragraphs. But in fact it is relatively simple. There are two kinds of patterns: those we detect by unsupervised learning and those we detect by supervised learning.
Supervised means that we decide what we want to reveal, we have a specific problem to solve. Examples: how can we select customers who are highly likely to buy product X? How can we identify customers who will not be able to make their mortgage loan payments? How can we identify fraudulent tax returns?
Unsupervised means that we will not decide anything. We will let the data speak and just see which patterns emerge.
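The contrast between the two can be sketched in a few lines of Python. The spend figures and “bought product X” labels below are invented for illustration: the supervised part uses the given labels to learn a decision threshold, while the unsupervised part has no labels and simply lets the data speak by splitting at the largest natural gap.

```python
# Toy data: yearly spend per customer, and (for the supervised case) labels.
spend = [5, 8, 10, 90, 95, 120]
bought_x = [0, 0, 0, 1, 1, 1]

# Supervised: labels are given, so learn a decision threshold from them,
# here the midpoint between the highest non-buyer and the lowest buyer.
non_buyers = [s for s, y in zip(spend, bought_x) if y == 0]
buyers = [s for s, y in zip(spend, bought_x) if y == 1]
threshold = (max(non_buyers) + min(buyers)) / 2

# Unsupervised: no labels; let a pattern emerge by splitting the sorted
# data at its widest gap.
ordered = sorted(spend)
gaps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
_, split = max(gaps)

print(threshold, split)  # both land at 50.0 on this toy data
```

On this tiny example both approaches find the same boundary, but only the supervised one knows what the boundary means (buyers vs. non-buyers).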
“Anomaly detection : in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.”
Another use for anomaly detection is for example to detect errors in the data.
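A minimal sketch of the idea, with invented deduction amounts standing in for tax returns: model the typical case with a mean and standard deviation, then flag whatever is notably different from that pattern.

```python
from statistics import mean, stdev

# Invented deduction amounts; one return is clearly atypical.
deductions = [1200, 1350, 1100, 1280, 1320, 9800]
mu, sigma = mean(deductions), stdev(deductions)

# Flag anything more than 2 standard deviations from the mean for review.
flagged = [x for x in deductions if abs(x - mu) > 2 * sigma]
print(flagged)  # [9800]
```

Real anomaly detectors are more sophisticated, but the core move is the same: learn what “typical” looks like, then measure distance from it.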
“Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. “
In most cases the buying patterns are not the only variables that are used. The other variables, like age, amount of money spent, frequency of buying etc., prevent the modeler from using association learning. Instead he uses some other predictive algorithm (classification) where the actual purchases are just one variable type among the various others.
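The association idea itself is simple to sketch. The baskets below are invented; the code counts how often items are bought together and computes the confidence of the rule “cocktail shaker implies martini glasses”.

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"shaker", "recipe book", "martini glasses"},
    {"shaker", "martini glasses"},
    {"shaker", "lime juice"},
    {"wine", "corkscrew"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

# Confidence of "shaker -> martini glasses":
# how many shaker baskets also contain glasses?
pair = frozenset({"shaker", "martini glasses"})
confidence = pair_counts[pair] / item_counts["shaker"]
print(confidence)  # 2 of the 3 shaker baskets also contain glasses
```

In a real model the purchases would, as noted above, be just one variable type among many.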
“Cluster detection: it is possible to let the data itself determine the groups. … in a simple example we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.”
This is one of the most difficult topics in data mining, not because the algorithms are difficult, but because in most cases the results have no actual business meaning, unless you take good care of the difficult preparatory work. I described this problem in Distances, the biggest challenge in clustering. Perhaps you are also interested in the difference between clustering and segmentation.
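One concrete piece of that preparatory work is making distances meaningful at all. The customers below (age in years, income in euros) are invented: without rescaling, income dominates every distance completely, so any clustering would in effect ignore age.

```python
# Invented (age, income) records.
customers = [(25, 20000), (30, 22000), (60, 90000)]

def minmax(column):
    """Rescale a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ages = minmax([c[0] for c in customers])
incomes = minmax([c[1] for c in customers])
scaled = list(zip(ages, incomes))

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# After scaling, both variables contribute comparably to the distance;
# before scaling, the 2000-euro income gap would swamp the 5-year age gap.
print(dist(scaled[0], scaled[1]))
```

How to weight the variables after scaling is exactly the business question that makes clustering hard.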
“Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. … Spam filters are a great example of this … to notice differences in word usage between legitimate and spam messages”
Other applications are: classify customers into those who will buy product X vs. those who will not; classify customers into those who will be able to make their mortgage loan payments vs. those who will not; classify tax returns into fraudulent ones vs. good ones; classify customers into those who are going to churn and those who are not; classify stocks into those that are going to rise vs. those that are going to drop.
There are a lot of algorithms to build classification models: decision trees (and the more complex decision-tree-based algorithms), support vector machines, (logistic) regressions, ant colony optimization, genetic algorithms.
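The simplest member of the decision-tree family is a one-split “decision stump”, which already shows the principle: try candidate thresholds and keep the one with the fewest training errors. The churn data (account balance, churned yes/no) is invented.

```python
# Invented training data: (balance, churned?).
data = [(100, 0), (250, 0), (300, 0), (900, 1), (1200, 1)]

def fit_stump(points):
    """Return the threshold with the fewest misclassifications."""
    best = None
    values = sorted({x for x, _ in points})
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2  # candidate split between two observed values
        errors = sum((x > t) != bool(y) for x, y in points)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

threshold = fit_stump(data)
predict = lambda x: int(x > threshold)
print(threshold, predict(1000))  # clean split at 600.0, so 1000 -> 1
```

A real decision tree just applies this split search recursively, over many variables.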
“Data mining, in this way, can grant immense inferential power. … it is how most successful Internet companies make their money and from where they draw their power.”
Not only internet companies, but banks, credit card companies, and every company that is big enough to afford people and software to at least try some data mining and see what comes out of it.
When you are up to a data mining project, you have a lot of decisions to make:
Which data mining software will you use?
Which algorithms will you use?
Which hardware will you use?
Which data sources will you use?
What will be the size of your hold-out samples?
Will you calculate derived variables? Which ones?
How will you measure the quality of your model(s)?
How will you deploy your model(s)?
Did you notice? I forgot one. I forgot the most important one.
If you are a data miner, you should know already : The target.
THE TARGET !
Yes, but is that not something that they order you to predict? Is it YOUR decision?
It is a fact that a prediction model of the right target is much better than a good prediction model of the wrong or suboptimal target.
Let me give some examples of decisions to make:
Can you think of other examples? Let me know and I will gladly add them to the list.
Admit that it is not the task of your marketer alone to decide about these things. You should decide, together with him. Be very aware that these decisions are more important for the commercial result of your marketing campaigns than your choice of the best algorithm!
How come we know what we know?
We can distinguish several stages of acquiring knowledge, which have evolved from what monocellular organisms know into what we, humans, know.
Stage 1. Knowing without learning.
An earthworm does not learn. It just does what it feels it must do: digging tunnels, eating soil, copulating, etc. It’s all in its genes. It never has to learn anything.
This holds for thousands of other animal species. They never learn. They just do what they have to do, without even consciously knowing what they are doing. Let’s say it’s all just biochemistry.
The necessary information to do the right thing at the right moment has grown in their tiny brains just like the other organs in their body. Evolution has taken care of that.
Stage 2. Knowing by learning from experience.
Ever seen a mother cat bringing a living mouse to her kittens? She lets them play with it, to let them experience what it means to handle a living mouse. The little ones have to try, fail, try again and learn by doing. Just like you and I have to learn to drive a car or to play the guitar. We have to feel it. We have to experience it, we have to fail and fail again to become better and better at it. Even Jimi Hendrix began that way.
The necessary information to do the right thing at the right moment is stored in our brains bit by bit, by experiencing the right and wrong moves.
Our environment provides the necessary conditions to experience. And what we experience is the real world, the true world.
Stage 3. Knowing by learning from others.
This is a totally different sort of learning. You just listen to what someone else says (or you read what he/she has written) and then you know. If someone tells you that you have to push the red button, not the green one, to open a particular box, then without ever having seen the box you know how to open it.
The necessary information was put in our brain just by listening, or by reading.
So far so good. But the third form of knowing is a dangerous one.
What if that other person told you it is the green button, instead of the red one? You would push the wrong button.
Learning from others means believing, without having experienced it yourself. It even opens the possibility for learning and believing erroneous information.
Do you believe that an iPhone is actually better than a Samsung Galaxy? Why? Have you tested it? Is it because someone told you so? Is it because the person with the iPhone was more enthusiastic than the one with the Samsung Galaxy? Is it because you saw more ads for the iPhone than for the Samsung Galaxy? Do you actually have any clue why you believe it (or not)?
Learning from others also opened the door to religion: if you believe in a God, why is this? Did you actually meet a God? Did you see him? Or is it just because someone told you that everything there is was created by a super creature, like God, Allah, Zeus, Tenochtitlan, or whatever?
More on the evolutionary origin of religion can be found in this New Scientist article.
I remember our first radio (when I was a little kid), our first television set, our first computer, my first cell phone, my first smartphone …
During my whole life, technology has rushed forward at an ever increasing speed.
Where do we go from here?
It seems I will have some future doctor put a device in my head that can read my thoughts and communicate directly with the outside world. Of course this device will also be connected to the internet via some wireless protocol (WiFi, 3G, 4G, 5G…).
As with Dropbox, I would have some account in the cloud where I pay for extra brain capacity, perhaps doubling or tripling my natural brain capacity (nice! finally I will be smart).
So forget about Alzheimer’s. My brain extension in the cloud will make up for each and every loss of memory or thinking capacity.
Not only will I be able to connect directly via the internet to my friends and relatives, but my cloudbrain (when I am sleeping, for example) will connect directly to the cloudbrains of my “friends” in the cloud. When my biological me wakes up in the morning, I will synchronize with my cloudbrain and be instantly up to date with what happened in the world overnight.
Eventually, when / if my biological body dies (but why should it? I will replace used parts …), my cloud self will continue to survive … as long as someone keeps paying my bill.
Just some thoughts. Possibilities are endless…
And there would be some side-effects:
With the built-in GPS I would always know where I am and how to get somewhere. I would have to buy some decent antivirus, anti-spam and firewall. I would never forget anything I see, hear or read. Even more: I would be able to search (Google cloudbrain search?) in my memory and “think” a URL to someone else.
Ah yes, social media would be entirely different. Not only sharing thoughts and images, but also real feelings, smells, senses …
“At work” would not exist any more. No need for PCs, phones, desks, meeting rooms. Everything would happen on the servers of the company, with direct input from our brains.
I dare not imagine what terrorists would do, or how wars would look/feel/smell …
Is there a brave new world coming??
This post is about the expected longevity of European men and women.
But it is also about the interpretation of a simple linear relationship.
First the figures.
The chart shows the relationship between the expected longevity of men and women in the various European countries.
As we can see, they are highly correlated. Which is a good thing: it means that in countries with favourable living conditions, both sexes benefit.
We also observe that women live longer than men on average. In no country does men’s average life span surpass 80 years, whereas for women it surely does in most countries.
The next chart shows the ratio (in %) between the life expectancy of women and men, as related to the average life expectancy of men.
Three things are obvious:
Question: is this just a “men” effect? Let us look at the same ratio in relation to the average age of women:
We see approximately the same relationship, although the points are somewhat more scattered around the line. The equal-life-expectancy point is situated here at about 91 years. Not exactly the same as in the previous graph, but let’s not quarrel about trifles.
Conclusion: if living conditions continued improving to the point where women (or men) reach an average age of about 90 years, the two sexes would have an equal life expectancy.
We could state the above otherwise: as women seem to have an advantage over men, an improvement in living conditions favours men more than women, such that at an average age of 90, the disadvantage of men would be wiped out.
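The reasoning behind the charts can be sketched numerically: fit a straight line through points of (men’s life expectancy, women/men ratio in %) and solve for where the fitted line crosses 100%. The five data points below are invented stand-ins for the country figures in the chart, constructed to behave like them.

```python
# Invented stand-ins: men's life expectancy vs. (women/men) ratio in %.
men = [68, 72, 76, 79, 81]
ratio = [111, 109, 107, 105.5, 104.5]

# Ordinary least-squares fit of ratio = slope * men + intercept.
n = len(men)
mx, my = sum(men) / n, sum(ratio) / n
slope = sum((x - mx) * (y - my) for x, y in zip(men, ratio)) / \
        sum((x - mx) ** 2 for x in men)
intercept = my - slope * mx

# Where does the fitted line cross 100% (equal life expectancy)?
equal_age = (100 - intercept) / slope
print(round(equal_age, 1))  # 90.0 on this constructed data
```

On real country data the crossing point is, as noted above, somewhere around 90-91 years; extrapolating a linear fit that far beyond the observed range is of course exactly the leap of faith the conclusion rests on.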
There remains the question: what is the actual advantage that women have over men? Why do they live longer?
This article on Time Health gives reasons such as “women develop cardiovascular diseases 10 years later than men”, “women have two X chromosomes which give them more genetic material and hence more diversity, resulting in an advantage”, and “men have something like a testosterone storm when they are around their 20s which makes them often behave dangerously”.
Another article in the Daily Mail speaks about a genetic advantage of women over men because men are more disposable. Women (especially long ago, when we were hunter-gatherers) had to live long enough to raise their children.
And this article from Harvard University points the finger at menopause, which causes women to stop giving birth. This gives them time to take care of their children and grandchildren.
So the next question: why does the age advantage of women decrease in countries where it is better to live? Does their genetic advantage decrease?
I believe it is something else. In the better countries, social structures, health care etc. are so much better that the environmental dangers decrease: better health care, safer cars, safer toys etc. diminish not only the genetic advantage of longer-living mothers and grandmothers but also the danger caused by the testosterone storm. And if there is no danger any more, the danger countermeasures that women have become worthless.
What happens then, when in some country both men and women reach an average age of 90? That would perhaps be the indication that all avoidable dangers, accidents, crimes, diseases, or whatsoever have been removed by adding the necessary infrastructure and countermeasures. This would then be an ideal country (from the point of view of longevity), where the only deaths would be from old age or incurable diseases that make no difference between the two sexes.
Any other ideas, interpretations? Do not hesitate, I will gladly read your comment.
Other posts on the differences between men and women :
Further reading on possibilities for living longer : http://ieet.org/index.php/IEET/more/brin20120108
Yesterday I got an interesting comment on a previous post on evolution.
I thought my answer would be too elaborate for a reply, hence this reply-post.
Tias Dailey writes the following (bolds are mine):
“You wrote that in one winter, a population of birds could be affected by natural selection because the small birds die off, leaving the larger birds. The thing is, natural selection always has a narrowing effect on the variation in a population. Understand that in your scenario, large birds did in fact exist before the natural selection. So that in itself is not evolution, but only narrowing of the gene pool. So that scenario doesn’t show that evolution can occur quickly.
To show that evolution can occur quickly, you would need to show that new features can arise quickly—features that were not present before.”
In fact, Tias makes 2 statements here :
Once upon a time, when the descendants of the Neanderthals had invented something called object-oriented programming (and when I worked in IT), one of the qualities of a good object-oriented design was “loose coupling”.
Loose coupling in object orientation means that software objects are loosely coupled with one another such that you can easily modify one of them without influencing the rest. The opposite of spaghetti code, where you pull on one spaghetti string and it starts to move on the other side of your plate.
Obviously loosely coupled designs are much more stable than tightly coupled ones.
One of the more interesting unsupervised data mining algorithms consists in finding clusters in a data cloud such that the clusters themselves are tight, but far away from one another. In other words: the average distance between observations from the same cluster is far less than the distance between observations from different clusters. In some way, clusters are loosely coupled.
Obviously a segmentation based on really loosely coupled clusters is much more relevant than one with very near or ill-separated clusters.
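One simple way to put a number on “loosely coupled” is to compare the average within-cluster distance to the average between-cluster distance. The two one-dimensional clusters below are invented; a ratio far below 1 means tight, well-separated clusters.

```python
from itertools import combinations, product

# Two invented, well-separated 1-D clusters.
a = [1.0, 1.2, 1.4]
b = [9.0, 9.3, 9.6]

def avg(xs):
    return sum(xs) / len(xs)

# Average distance between points of the same cluster ...
within = avg([abs(x - y) for c in (a, b) for x, y in combinations(c, 2)])
# ... versus between points of different clusters.
between = avg([abs(x - y) for x, y in product(a, b)])

print(within / between)  # far below 1: loosely coupled clusters
```

Standard cluster-quality measures such as the silhouette coefficient formalise the same within-versus-between comparison.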
Evolution occurs when populations of organisms change under environmental pressure. When organisms of the same species live under different environmental circumstances they will change in different directions without influencing other populations a lot. That’s the way they can evolve into different species. Some sort of loose coupling.
Obviously different populations that are loosely coupled can much more easily evolve in different directions than populations that are still connected, with a high exchange rate of individuals.
Apparently Greece is NOT loosely coupled financially and economically from the rest of Europe and the rest of the world. Though it should be! The world is becoming one tightly coupled system such that when you or I fart, they smell it in Japan or the Bahamas.
That’s why in some places they start much smaller, with what they call “local currencies”. And here “local” means exactly what it says: local, in one town or even one neighbourhood. This allows these towns or neighbourhoods to do their thing more efficiently than when they are tightly connected to the rest of the country via the national currency.
Obviously the EURO was not such a good idea after all ?
Ever seen a bunch of children playing happily in some forest for a couple of hours and returning home nice and clean? No way. When they are healthy they should return covered with dirt and mud. Only then have they really played.
“Returning home nice and clean”: that’s the feeling I get when I read Scott Levine’s “7 steps of knowledge discovery in databases”. What he writes is correct, but it all seems so uncomfortably clean.
That’s why I write down my own version here:
“Se7en steps in finding knowledge nuggets”.
Step 1. Try to understand what your business user really needs (not what he/she asks).
Know for sure that your business user almost never asks for what he needs. No, somewhere he has a problem, he figures out some sort of solution for himself, and based on that solution he asks you to do some mining work. If you just do it like that, I guarantee you that the result will never be satisfactory. Instead you’ll have to challenge him: ask him what he wants to accomplish with the result of your work, ask him what the problem is that he wants to solve, and together find a (usually better) solution.
Step 2. Figure out what new data you need and which data-mining algorithms and/or statistical tests you need to use.
Now begins the creative part: when you know what your user needs, figure out how you can deliver. Which data? Which algorithms? You should perform the entire analysis in your head, or sketch it on a piece of paper. Some time ago I even got into the habit of paradoxically starting the analysis by writing down the end report (of course without any results). It forces you to think the whole thing over beforehand, not afterwards when it’s too late.
Step 3. Playing detective in trying to find the data tables, and the right selection criteria to get the specific data that you need.
The real world is not one where all the data is exposed in front of you, nicely aligned and ordered according to some obvious criteria. No, often all you know is that it is out there somewhere, in some table. Now the dirty work starts: call/email people to ask if they know someone who knows …
When you finally get the info, you access the data, just to find out that something is clearly not right. So you call/email people to ask if they know … until finally you are pretty sure you have what you wanted.
Step 4. Merge the newly found data with what you want to use from your existing datamart
This is the easier part and pretty straightforward.
Step 5. Get this data “mining-ready”
Depending on the algorithms you want to use, your data has to meet some criteria: e.g. no nominal variables, a normal distribution, no missing values, etc. It can be pretty tough to get everything in order.
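Two of the cleaning chores named above can be sketched in a few lines. The records are invented: missing numeric values are filled with the column mean, and a nominal variable is turned into dummy (one-hot) columns so that algorithms which cannot handle nominal variables can use it.

```python
from statistics import mean

# Invented records: a numeric column with a hole, and a nominal column.
ages = [34, None, 51, 45]
region = ["north", "south", "north", "east"]

# Mean imputation for the numeric variable.
known = [a for a in ages if a is not None]
filled = [a if a is not None else mean(known) for a in ages]

# One-hot (dummy) encoding for the nominal variable.
levels = sorted(set(region))
dummies = [[int(r == level) for level in levels] for r in region]

print(filled[1], dummies[0])
```

Mean imputation is the crudest option; in practice you would also consider modelling the missing values or adding a missing-indicator column.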
Step 6. Mine that data (run algorithms, draw your conclusions)
That’s the exciting part. Not really difficult or dirty, because you had it all prepared. The part where you actually see what happens, the part where you discover the knowledge nugget, the part where you shout “YESS!”, or the part where you realise: “Shit, is that it? That’s all?”.
Step 7. Convince your business user that this result is all you can get out of it, even if it looks (afterwards) ridiculously obvious.
Now you come out of the woods, covered in mud, holding your nugget up high, or almost empty-handed, or somewhere in between (gold dust?).
You have to explain to your business user what your model is worth, what he can do with it, and how it can influence his marketing campaign results, and eventually withstand his somewhat accusing look of “that’s all you got for me?”.
(Step 8. Afterwards just get a shower and prepare to find your next nugget.)
As I am typing this, I suppose that about every second someone gets killed or injured in some accident. And still, I keep typing, without much sorrow. Not that it means nothing to me, but when somebody I do not know dies on the other side of the world, well, it is a sad reality, but I do not care very much.
How does it affect you when someone dies or gets severely injured in an accident?
An obvious answer to that question is: “it depends”. On what?
There are several factors that influence the effect an accident has on somebody:
When we combine these factors, we can say that the impact on someone (I) of an accident/disaster is positively correlated with the number of casualties (N), the familiarity(F) and the health (H) and negatively with distance (D), time (T) and age (A).
So a simple equation would be: I = (N + F + H) / (D + T + A)
But now remains the problem of units. Distance can be measured in meters or kilometers and amount to thousands of them, whilst age can reach at most about 100 years. It seems attractive to put every factor on a scale of 0 to 100.
Age is the simplest one and can be used as such.
Number of victims: a nice measure is 10 times the logarithm of 1 + the number of victims. The table below shows how nicely it goes from zero when there are no victims to nearly 100 when the entire world population is extinguished.
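The proposed victim score, 10 × log10(1 + number of victims), is easy to check: it is exactly 0 for no victims and climbs only slowly, staying just under 100 even for billions of casualties.

```python
from math import log10

def victim_score(n):
    """Score the number of victims on a roughly 0-100 scale."""
    return 10 * log10(1 + n)

# A few sample values, from no victims up to the world population.
for n in (0, 1, 100, 250_000, 7_000_000_000):
    print(n, round(victim_score(n), 1))
```

The 250,000 row reproduces the N = 54 used in the tsunami example further down.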
A similar approach can be used for distance. Here I use 20 thousand kilometers as the maximum; it’s the other side of the world.
Time is also a rather easy one. I took the first table as a basis and adjusted it somewhat. X is the number of days, and I added a number of years column. We reach 100 after 27 thousand years.
This leaves us with familiarity and health. It is not my purpose to elaborate on these in detail. So I suggest that a perfectly healthy person has a score of 100, someone who is already dead scores zero, and we use our gut feeling to assign the intermediate values. For familiarity we can use a similar approach: a value of 100 represents the persons we love the most, like our children, our husband or wife; 0 stands for people we do not know at all and do not care about.
And now let us look at some examples.
1.The Banda Aceh tsunami.
The impact now, after 7 years, on somebody on the other side of the world is :
Our formula : I=(N+F+H)/(D+T+A)
Let us assume that those 250 thousand victims were fairly healthy and on average 35 years old, and that we care a little bit about those people.
If we do the math, we find that the impact on you = (54+3+97)/(100+49+35) = 0.84
A low figure, is it not? But frankly, how often do you still think of that disaster, unless you live in that unlucky part of the world?
2. Now something entirely different: suppose your perfectly healthy 1-year-old child who lives with you dies in an accident (I sincerely wish this never happens!).
If we do the math, we find for the first day that the impact on you = (3.01+100+100)/(0+0+1) = 203.
Take a look at the formula. What would happen if that child was a newborn one? The resulting value would be infinite! Hence I propose to limit the Impact Value to a maximum of 100.
The formula then becomes : I=min ( 100 , (N+F+H)/(D+T+A) ) which means that if our calculated value is smaller than 100, we accept it, otherwise we just take 100.
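The capped formula can be sketched directly, using the two worked examples above. One extra guard is needed that the prose only implies: for a newborn at zero distance and zero elapsed time the denominator is zero, so the cap of 100 is applied directly instead of dividing.

```python
def impact(n, f, h, d, t, a):
    """I = min(100, (N+F+H)/(D+T+A)), all factors on 0-100 scales."""
    denominator = d + t + a
    if denominator == 0:
        return 100.0  # the newborn case: infinite impact, capped at 100
    return min(100.0, (n + f + h) / denominator)

# Example 1: the tsunami, seen from the other side of the world, years later.
tsunami = impact(54, 3, 97, 100, 49, 35)
# Example 2: the healthy 1-year-old child living with you.
child = impact(3.01, 100, 100, 0, 0, 1)

print(round(tsunami, 2), round(child, 2))  # 0.84 and 100.0 (capped from 203)
```

The cap turns the raw 203 of the second example into the maximum impact of 100, as proposed above.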
I know there is not much science in the above, it was just an interesting (but superficial) thought exercise.
Any suggestion for improvement is welcome.
(Some further reading on this subject : The new problem of distance in morality)
“After that the computer froze a few times over the course of a couple days, so I assumed… So, I have no clue what is going on”.
“My computer randomly freezes,… What might be the problem?”
“Your computer was working fine, but then suddenly started locking up… Sometimes random lockups can be attributed to the computer memory…”
When you google “computer freezes” you get thousands of desperate people asking for help. Mostly it can be solved by checking hardware, software, etc.
But occasionally something goes wrong for no reason whatsoever, and then it never happens again. Why?
At work we had such a problem: less than once a year our SAS software refused to run our programs. Exactly the same programs we ran daily, weekly, monthly without any problems. Googling the error message was no help. Obviously the software was on strike. Temporarily, because the following morning everything was back to normal.
What happened?
After deep thought, eliminating all impossible possibilities, I came up with the only plausible explanation I could find.
This is what I wrote to my colleagues:
“I now know the reason for the problems: it’s what we call the IT Ghost, a species of creatures from the 5th dimension which migrate this time of the year from the Betelgeuse area to the Crab Nebula and eventually teleport through the earth. On this occasion they can influence the spin of some charm quarks, causing computer processes to behave erratically, for no obvious reason.
A positronic energy field of 5,000 trillion petavolts around the earth should solve the problem.”
Do you have any better explanation? :-)
Customer satisfaction is a hot topic. Numerous studies are continuously going on to identify the enhancers and/or dissatisfiers. Depending on the sector you work in (banking, retail, internet book shop, etc.), these enhancers/dissatisfiers can be very different.
Nevertheless, if we take a step back and do some abstraction, it seems that we can distinguish different levels, analogous to Maslow's pyramid.
In “Maslow's hierarchy of customer service”, Naumi Haque distinguishes three levels:
Well, it should be no surprise that below I present my own “customer satisfaction pyramid”, which is slightly different from the two above, and for sure is put in less cryptic language.
The hierarchy is the following:
Basis: deliver what you promise; give the customer what you make him think you should give him. This corresponds with the first level of the two pyramids above.
Second: do it fast, don’t keep your customer waiting, and do it properly: deliver it to him the way he would like it.
Third: see to it that there are no problems for the customer. OK, nothing is always perfect, so if something goes wrong, make it your own problem as soon as possible, not the problem of the customer. Make it easy for the customer to get problems solved. Make sure that when the customer complains or asks for help, you give him a reassuring, easy, satisfied feeling. Keep it easy for him, and do the hard work yourself to make him happy.
(This one was not mentioned in the two pyramids above.)
Finally: create a WOW effect.
In short: optimise in this order: the WHATs, the HOWs, the CUREs and the WOWs.
However, there is only one problem: not everything IS data mining.
To clear up this mess a bit, in what follows I list and explain several activities that are sometimes (mistakenly) called “data mining”.
“Data extraction software can enable agencies to collect data on the race, gender, and ethnicity for the person(s) owning the majority of rights, equity, or interest in a business.” (Mozenda)
My definition is simple: you get the data from somewhere with some data extraction program. What you do afterwards with that data is not relevant.
Making a report: “A report is a piece of information describing, or an account of, certain events given or presented to someone.” (Wikipedia)
“Reporting is just a genre of writing, alongside essays and stories, and bloggers most certainly fall into that genre. Imho, when they talk about reporting on a show like Frontline, they mean the process a reporter goes through.” (Scripting.com)
This seems a bit more complicated than data extraction. I would say: “extracting from whatever sources of data/information those pieces of information that are sufficiently important, and structuring/presenting them to be communicated to your audience, customers, boss or whatever other party”.
My definition: reporting is not showing raw data, but some communicable description. This can be in the form of tables, charts, structured drawings, or simply words.
“methods to collect, analyze and interpret data” (Nebraska university)
This is a very broad definition, and it obviously has a lot to do with data.
For me, apart from “data”, the words that are most important here are “science”, “methods”, “interpretation”. Statistics is not just extracting data or reporting; no, here we have to do better.
Hence my definition: we use some mathematical method(s) to extract the right data, to interpret the data, to draw conclusions based on mathematics, and to present these results/conclusions.
This is the most difficult one, and most misunderstood.
“the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” (Wikipedia)
“the process of analyzing data from different perspectives and summarizing it into useful information” (UCLA Anderson)
Here is my very personal view on some settings of decision trees.
Maximum depth:
Maximum tree depth is a limit that stops further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.
My opinion is straightforward and simple: NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
I suppose some explanation is necessary.
When you grow a decision tree, different leaves in the splits normally contain different numbers of observations. Using the tree depth totally disregards these differences. It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, whereas on the other side a leaf with only 30 observations could still get split. This makes absolutely no sense!
Minimum split size
Minimum split size is a limit that stops further splitting of a node when the number of observations in the node is lower than the minimum split size.
This is a good way to limit the growth of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).
Now the capital question : at what number should we set the limit ?
Answer : it depends.
With hundreds of variables I normally use a minimum split size in the range of the number of observations divided by a few hundred.
Minimum leaf size
Minimum leaf size prevents the splitting of a node when the number of observations in one of the resulting child nodes would be lower than the minimum leaf size.
Splitting a node into two or more child nodes has to make some statistical sense. What is, for example, the sense of splitting a node with 100 observations into the two following child nodes: one with 99 observations and one with 1 observation? It is all a bit like doing a chi-squared test. A good rule of thumb says that you should never have fewer than five observations in one of the cells. I would say the same goes for decision trees, as long as you deal with the same number of observations you would normally use for a chi-squared test. It is known that i) with a large number of observations chi-squared tests are no longer appropriate, and ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500). So you should set the minimum leaf size larger than 5. I usually take 10% of the minimum split size (in a bagging ensemble).
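As an illustration (a minimal sketch, not my original SAS setup: using scikit-learn is my assumption, and its parameters max_depth, min_samples_split and min_samples_leaf map onto the three settings discussed above), the rules of thumb could be applied like this:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real campaign data set.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

n = len(X)
min_split = max(2, n // 200)        # minimum split size: observations / a few hundred
min_leaf = max(1, min_split // 10)  # minimum leaf size: ~10% of the minimum split size

clf = DecisionTreeClassifier(
    max_depth=None,                 # never limit growth by depth
    min_samples_split=min_split,
    min_samples_leaf=min_leaf,
    random_state=42,
)
clf.fit(X, y)
print(clf.get_depth(), clf.tree_.node_count)
```

With max_depth left unlimited, the tree's growth is controlled only by the two size-based limits, so a large node on one side of the tree can keep splitting while a small node elsewhere stops, exactly as argued above.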
There is only one way to know the best settings: try, try and try again! This is because all projects and data sets are different. Do you have your own rules of thumb? Please don’t hesitate to let me know!
The internet is full of reactions and opinions about data mining and the corresponding privacy issues. Even insults like the example below, directed at data miners or top executives of data mining enterprises, are no exception.
But is data mining always so bad?
Domains like medical applications, where data mining could save your life, fall without any doubt on the good side of the picture.
But even marketing can be a justified reason to use data mining results:
You might also want to read:
In “Six myths about data analysts” I was struck by number two:
Number 2: “Fact: It’s all about impact”
According to Ken Rudin, president of analytics at Zynga: “Analytics is about impact. In our company, if you have brilliant insight and you did great research and no one changes, you get zero credit.”
Dear reader: what do you think about that?
For me it is simple. They want the lower-level employee to do everything. Not only the lower-level work, but also the management. If you, as a data analyst, discover something interesting, make sure you do not just communicate it to your manager. OH NO! They expect you to do the work of your managers, i.e. decide who should know about it, pass them the information, show them how it can be profitable to their work and to the enterprise, convince them to change (isn’t that change MANAGEMENT?), etc.
I thought that managing was all about:
I know, managers want an easy life:
So that’s why they think you are only a good data analyst if you do a good job of analyzing data AND a great job of doing the work they are supposed to do.
OK. You have a job opening for a data miner.
Now what are you going to write as job description?
If you want to hire a real data miner, I suppose any good candidate knows what it is like to be a data miner. He does not need a job description.
You just tell him which department he will work for: marketing, credit risk, the DNA-analytics lab, …
Take for instance this :
Experience – Familiarity with major database and statistical packages; experience with statistical and database applications in a particular area such as biology (biostatistics), physical science, economics, or marketing (from maxizip.com).
If you do not already know this, why would you go for a data mining job?
or this :
Right! It means executing and reporting data mining work, for somebody, and you are not alone. So WTF?
The feeling I have with each and every job description I find is the same : boooooooooooring !
Why not simply write the truth ?
For example :
“The people of our marketing department do a very nice job, but we want it to be better. We want them to be more data-driven. They are able to add, subtract, divide and multiply. They can deal with the gender and age of their clients. But we have a feeling that’s not enough! We want to take it a long way further. And that will be YOUR responsibility. When it comes to figures, you hold their hands. You explain. You provide the charts. You feed them numerical insights. You perform rocket science they don’t understand, convince them to use your models, and prove to them that they were wrong if they didn’t. YOUR ultimate goal is to make THEM shine with high-return campaigns. And silently you hope they will show some gratitude, but you know very well that at least half of them will hate your guts, because you are the one who forces them to change the way they are used to doing their job.”