Posted by: zyxo | August 26, 2011

Why computers go bananas without any reason

“After that the computer froze a few times over the course of a couple days, so I assumed… So, I have no clue what is going on”.

“My computer randomly freezes,… What might be the problem?”

“Your computer was working fine, but then suddenly started locking up… Sometimes random lockups can be attributed to the computer memory…”

When you google “computer freezes” you get thousands of desperate people asking for help. Mostly it can be solved by checking hardware, software etc.

But occasionally something goes wrong for no reason whatsoever, and then it never happens again. Why?

At work we had such a problem: less than once a year our SAS software refused to run our programs. Exactly the same programs we were used to running daily, weekly, monthly without any problems. Googling the error message was no help. Obviously the software was on strike. Temporarily, because the following morning everything was back to normal.
What happened ?

After deep thought, eliminating all impossible possibilities, I came up with the only plausible explanation I could find.
This is what I wrote to my colleagues :

“Dear colleagues,
I now know the reason for the problems: it’s what we call the IT Ghost, a species of creatures from the 5th dimension which migrate this time of the year from the Betelgeuse area to the Crab Nebula and may teleport through the earth on the way. On this occasion they can influence the spin of some charm quarks, causing computer processes to behave erratically for no obvious reason.
A positronic energy field of 5.000 trillion petavolt around the earth should solve the problem.”

Image: the Crab Nebula (via Wikipedia)

Do you have a better explanation ? 🙂

Posted by: zyxo | August 5, 2011

The 2.5 ways to segment your customer base

Terabytes have been filled with books and articles about segmentation. And we should by now expect that the most basic knowledge about it is, well … known.
Forget it !
First : what is this most basic knowledge that each and every marketeer should know?

“What can you do with it” ?
Or, stated otherwise : how should you use it ?

Is the answer obvious ? Not at all !

Take for example the SAS white paper “A Marketer’s Guide to Analytics”. You could reasonably expect SAS, as a major vendor of analytics software and consultancy, to know how to use segmentation.

Well, I seriously have my doubts.
They describe as “the first two enablers of the analytical framework” :
1) analytically driven, granular segmentation: enables you to identify how different customer segments are most likely to respond to specific campaigns or marketing actions.
2) predictive modeling: enables you to identify the specific target population likely to respond positively to a specific campaign or other marketing activity.
I get an odd feeling when I read these two “different” descriptions. Whether I identify how different customer segments will respond to my campaign, or identify the target population that will respond in a particular way (“respond positively”), does not seem very different to me. In both cases you want to predict the behaviour of each customer or customer group in response to your campaign.
So let us forget about software or algorithms. Let’s think marketing.

1.  First, you want to sell your product or service.

This means you have to find out who is likely to buy it. You call on any tool or algorithm that can use the data in your customer base: logistic or linear regression, neural networks, support vector machines, genetic algorithms, Bayes learners, decision trees, and all sorts of segmentations. Use whatever you like, know or have, as long as it delivers satisfactory results.

OK, let’s say you have done this and you know who to target : you have your customer group or best segment or whatever. Perhaps you have a lift chart or the like, so you know what you can expect from your campaign. (In my earlier post “datamining for marketing campaigns: interpretation of lift” you will find a lot more about this topic.)
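For readers who have never built a lift chart, here is a minimal sketch of the numbers behind one. It assumes you already have a model score per customer and a 0/1 response flag from a past campaign; the function name `lift_by_bucket` and its parameters are my own illustration, not taken from any particular tool.

```python
def lift_by_bucket(scores, responses, n_buckets=10):
    """Return the lift of each bucket, best-scored customers first.

    Lift = response rate inside the bucket / overall response rate.
    `scores` and `responses` (0 or 1) are hypothetical inputs.
    """
    if len(scores) != len(responses) or not scores:
        raise ValueError("scores and responses must be non-empty and equally long")
    overall_rate = sum(responses) / len(responses)
    if overall_rate == 0:
        raise ValueError("no responders at all, lift is undefined")

    # Rank customers from the highest to the lowest model score.
    ranked = [r for _, r in sorted(zip(scores, responses),
                                   key=lambda pair: pair[0], reverse=True)]

    lifts = []
    for b in range(n_buckets):
        start = b * len(ranked) // n_buckets
        stop = (b + 1) * len(ranked) // n_buckets
        bucket = ranked[start:stop]
        if bucket:  # skip empty buckets when there are few customers
            lifts.append((sum(bucket) / len(bucket)) / overall_rate)
    return lifts
```

A lift of 2 in the first bucket would mean that the best-scored customers respond twice as often as the average customer.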

2. Second, you have, one way or another, to speak to those people.

And if there is one important issue about communication it’s that you have to send the right message to the right person.
OK, you want to sell them all your world-changing superb product. But I’m not talking about the what, but about the how ! I’m not talking about the content of the message box you will send, but about the wrapping paper, the flavour of your message. Will you use the same words, the same communication channel, the same colours for young women, for old men, for internet-savvy whizzkids, for grandmas who never touched a computer ?

Did you notice ? I gave some examples of customer SEGMENTS. So that’s your second assignment : find the segments that match your communication alternatives.
A simple, but not easy, way to do this is to think, brainstorm, use your imagination and common sense, and use what you know about the people you identified in step one : look who’s in the selection, what their age distribution is, etc …
Now you have your second segmentation.

Lastly I owe you another half segmentation : in case you are not satisfied with your “communication segmentation”, you can always test it first. Use your various communication alternatives at random on part of the people in your selected target group. Evaluate the results, and calculate which communication flavour you should use with which customer. For this calculation you can use whatever you like, know or have, as long as it delivers satisfactory results. Then use the findings to optimise subsequent campaigns.
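As a rough sketch of such a test, assuming you can record for each contacted customer a segment, the flavour used and whether he responded: assign the flavours at random, then simply compare response rates per segment. The names `assign_flavours` and `best_flavour_per_segment` are hypothetical, and a real evaluation would also check whether the differences are statistically meaningful.

```python
import random
from collections import defaultdict


def assign_flavours(customer_ids, flavours, seed=0):
    """Randomly assign one communication flavour to each test customer."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {cid: rng.choice(flavours) for cid in customer_ids}


def best_flavour_per_segment(results):
    """results: iterable of (segment, flavour, responded) tuples.

    Returns {segment: (best_flavour, response_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # (segment, flavour) -> [responders, contacted]
    for segment, flavour, responded in results:
        counts[(segment, flavour)][0] += int(responded)
        counts[(segment, flavour)][1] += 1

    best = {}
    for (segment, flavour), (responders, contacted) in counts.items():
        rate = responders / contacted
        if segment not in best or rate > best[segment][1]:
            best[segment] = (flavour, rate)
    return best
```

The random assignment is the important part: without it you cannot tell whether a flavour worked or whether it just happened to be sent to the easier customers.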
Posted by: zyxo | July 26, 2011

The customer satisfaction hierarchy

Customer satisfaction is a hot topic. Numerous studies are continuously going on to get to know the enhancers and/or dissatisfiers. Depending on the branch you work in (bank, retail, internet book shop, etc), these enhancers/dissatisfiers can be very different.
Nevertheless, if we take a step back and do some abstraction, it seems that we can distinguish different levels, analogous to Maslow’s pyramid.

In “Maslow’s hierarchy of customer service” Naumi Haque distinguishes three levels :

  1. Meeting the customers’ expectations
  2. Meeting the customers’ desires
  3. Meeting the customers’ unrecognized needs

At frankwatching they present a four-level pyramid :

  1. trust, reliability, value
  2. timeliness, knowledgeable, responsible
  3. Caring, concerned, helpful
  4. Fun, friendly, enjoyable, entertaining

Well, it should be no surprise : below I present my own “customer satisfaction pyramid”, which is slightly different from the two above and is certainly put in less cryptic language.

The hierarchy is the following :

Basis : deliver what you promise, give the customer what you made him think you would give him. This corresponds to the first level of the two pyramids above.

Second : do it fast, don’t keep your customer waiting, and do it properly, deliver it to him the way he would like it.

Third : see to it that there are no problems for the customer. OK, nothing is always perfect, so if something goes wrong, make it your own problem as soon as possible, not the problem of the customer. Make it easy for the customer to get problems solved. Make sure that when the customer complains or asks for help, you give him a reassuring, easy, satisfied feeling. Keep it easy for him, and do the hard work yourself to make him happy.
(This one was not mentioned in the two pyramids above.)

Finally : create a WOW effect

In short : optimise in this order : the WHAT’s, the HOW’s, the CURES and the WOW’s

Posted by: zyxo | July 10, 2011

Is reading a newspaper “Data mining” ?

Data mining is a hype. As a result everything is called data mining. I suppose reading a newspaper to find some interesting information is called “data mining” by some people too.

However there is only one problem : not everything IS data mining.

To clear this mess a bit, in what follows I list and explain several activities that are sometimes (mistakenly) called “data mining”.

Data extraction 

“the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing” (wikipedia)

“Data extraction software can enable agencies to collect data on the race, gender, and ethnicity for the person(s) owning the majority of rights, equity, or interest in a business.” (Mozenda)

My definition is simple : you get the data from somewhere with some data extraction program.  What you do afterwards with that data is not relevant.


Reporting

Is making a report : “A Report is a piece of information describing, or an account of certain events given or presented to someone”. (wikipedia)

“Reporting is just a genre of writing, alongside essays and stories, and bloggers most certainly fall into that genre. Imho, when they talk about reporting on a show like Frontline, they mean the process a reporter goes through.” (

This seems a bit more complicated than data extraction. I would say : “extracting from whatever sources of data/information those pieces of information that are sufficiently important and structuring/presenting them to be communicated to your audience, customers, boss or whatever other party”.

My definition: reporting is not showing raw data, but some communicable description. This can be in the form of tables, charts, structured drawings, or simply words.


Statistics

“statistics is … a distinct mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.” (wikipedia)

“methods to collect, analyze and interpret data” (Nebraska university)

“collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting and then drawing conclusions” (Akila)

This is a very broad definition, and it obviously has a lot to do with data.

For me, apart from “data”, the words that are most important here are “science”, “methods”, “interpretation”. Statistics is not just extracting data or reporting, no, here we have to do better.

Hence my definition : we use some mathematical method(s) to extract the right data, to interpret the data, to draw conclusions based on mathematics and to present these results/conclusions.

Data mining

This is the most difficult one, and most misunderstood.

Some definitions:

“the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” (wikipedia)

“the process of analyzing data from different perspectives and summarizing it into useful information” (UCLAAnderson)

“Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items.” (
“Data mining is the discovery of hidden knowledge, unexpected patterns and new rules in large databases.” (E.Thomas)

The most important words or expressions here are : “extracting patterns”, “analyzing data”, “uncover relationships”, “discovery of knowledge”.

So my definition is: searching in data collections (databases, the internet) for information that was not put there deliberately, but nevertheless can be derived.

And one more thing : reading a newspaper is definitely NOT data mining 🙂

Here is my very personal view on some settings of decision trees.

Maximum depth :

Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.

My opinion is straight and simple : NEVER use maximum depth to limit the further splitting of nodes.  In other words : use the largest possible value.

I suppose some explanation is necessary.

When you grow a decision tree, different leaves normally contain different numbers of observations. Using the tree depth totally disregards these differences. It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, whereas on the other side, which contains far fewer observations, a leaf with only 30 observations could still get split. This makes absolutely no sense!

Minimum splitsize

Minimum splitsize is a limit to stop further splitting of nodes when the number of observations in the node is lower than the minimum splitsize.

This is a good way to limit the growing of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).

Now the capital question : at what number should we set the limit ?

Answer : it depends.

  • are you just growing one tree or do you want to create an ensemble (bagging, boosting …) ? If you create an ensemble, overfitting is permitted, because the ensemble will take care of it: it will take the mean or some other aggregate over the trees.
  • how many independent variables (predictors) do you have? The more variables you have, the bigger the chance of some accidental relationship between one of the variables and the target. So with a lot of variables you should stop earlier.
  • how many observations do you have? With a limited number of observations you do not have the luxury to stop early, or you will end up with no tree at all. With a lot of observations you can stop early and still obtain a large enough decision tree.

With hundreds of variables I normally use a minimum splitsize in the range of the number of observations divided by a few hundred.

Minimum leaf size

Minimum leafsize is a limit that prevents the splitting of a node when the number of observations in one of the resulting child nodes would be lower than the minimum leafsize.

Splitting a node into two or more child nodes has to make some statistical sense. What is for example the sense of splitting a node with 100 observations into the two following child nodes: one with 99 observations and one with 1 observation? It is all a bit like doing a chi-squared test. A good rule of thumb says that you should never have fewer than five observations in one of the cells. I would say : the same goes for decision trees, as long as you deal with the same number of observations you normally use to calculate chi-squared tests. It is known that i) with a large number of observations chi-squared tests are no longer appropriate and ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500). So you should set the minimum leafsize larger than 5. I usually take 10% of the minimum splitsize (in a bagging ensemble).
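To make my rules of thumb concrete, here is a small sketch that turns them into numbers. The divisor of 300 is just my assumption for “a few hundred”, the function name `rule_of_thumb_tree_settings` is made up for this post, and the last adjustment (a node must be able to hold two leaves) is an extra I added for consistency.

```python
def rule_of_thumb_tree_settings(n_observations, divisor=300, leaf_fraction=0.10):
    """Translate the rules of thumb above into concrete settings.

    divisor       : assumed value for "a few hundred"
    leaf_fraction : minimum leafsize as a fraction of the minimum splitsize
    """
    if n_observations <= 0:
        raise ValueError("n_observations must be positive")

    # Minimum splitsize: number of observations divided by a few hundred.
    min_split_size = max(2, n_observations // divisor)

    # Minimum leafsize: 10% of the splitsize, but always larger than 5.
    min_leaf_size = max(6, round(leaf_fraction * min_split_size))

    # Assumption: a node can only be split if it can hold two valid leaves.
    min_split_size = max(min_split_size, 2 * min_leaf_size)

    return {
        "max_depth": None,  # never limit the depth
        "min_split_size": min_split_size,
        "min_leaf_size": min_leaf_size,
    }
```

If you happen to use scikit-learn, these would roughly map to the `max_depth`, `min_samples_split` and `min_samples_leaf` parameters of its decision tree classes. Treat the output as a starting point for experiments, nothing more.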


There is only one way to know the best settings : try, try and try again ! This is because all projects and data sets are different. Do you have your own rules of thumb ? Please, don’t hesitate to let me know !

Posted by: zyxo | June 17, 2011

Datamining and privacy: don’t shoot the pianist

The internet is full of reactions and opinions about data mining and the corresponding privacy issues. Even insults, like the example below, towards data miners or top executives of data mining enterprises are no exception.

But is data mining always so bad?
Domains like medical applications where data mining could save your life fall without any doubt on the good side of the picture.

But even marketing can be a justified reason to use data mining results:

  • some customers explicitly want to stay informed about new products or services that are within their area of interest
  • in a lot of cases data mining is used to do less mailing instead of more: do not contact people who are not going to buy anyway.
  • some product/service offers can be so well targeted that the targeted people think: “Wow, right! why didn’t I think of that myself ?” Thanks to data mining, in that case we actually provide them with a free service, more or less reminding them not to forget the things they actually need. Of course this is the ideal situation.

Unfortunately there are all those less ethical initiatives out there, but that has nothing to do with data mining as such. Has a rifle ever been condemned for killing someone? No! It’s the shooter, the one who uses the rifle, who is the criminal. The same goes for data mining. We data miners are only the pianists. We play music. The ones that record our music and broadcast it much too loud are the ones to be blamed.

Posted by: zyxo | June 11, 2011

Does your boss want you to do HIS work ?

In “Six myths about data analysts” I was struck by number two :

  1. Myth #1: Data analysts are geeks. / Fact: Analysts are good communicators.
  2. Myth #2: Analysis is all about insight. / Fact: It’s all about impact.
  3. Myth #3: Data analysis is easy. / Fact: Data analysis takes time to learn.
  4. Myth #4: Statistics is the most important skill. / Fact: Business smarts are more important.
  5. Myth #5: Analysts work at the “speed of thought.” / Fact: Thought is often a slow, non-linear process.
  6. Myth #6: Analysts are a rare breed. / Fact: We’re all data analysts.

Number 2 : “Fact: It’s all about impact”

According to president of analytics Ken Rudin at Zynga : “Analytics is about impact. In our company, if you have brilliant insight and you did great research and no one changes, you get zero credit.”

Dear reader : what do you think about that ?

For me it is simple. They want the lower level employee to do everything. Not only the lower level work, but also the management. If you, as a data analyst, discover something interesting, make sure you do not just communicate it to your manager. OH NO ! They expect you to do the work of your managers, i.e. decide who should know about it, pass them the information, show them how it can be profitable to their work and to the enterprise, convince them to change (isn’t that change MANAGEMENT ?) etc…

I thought that managing was all about :

  • making sure you stay informed, means : talk to your data analysts, be interested in what they do and read their reports, ask them for new insights
  • using that information to choose the way you want to go
  • performing the necessary actions to get everyone with you along that way
  • making sure that you get feedback about the results of the change process
  • adding corrections if the results are not satisfactory
  • etc.

I know, managers want an easy life:

  • showing up unprepared at meetings
  • making decisions about necessary changes, not data driven, but more on gut feeling
  • (eventually) communicating these necessary changes
  • several months later by coincidence finding out that the changes never took place, not realizing they themselves did absolutely nothing to make it happen

So that’s why they think you are only a good data analyst if you do a good job at analyzing data AND a great job at doing the work they are supposed to do.

Posted by: zyxo | May 11, 2011

Honest Job Description of a Data Miner

OK. You have a job opening for a data miner.
Now what are you going to write as job description?

If you want to hire a real data miner, I suppose any good candidate knows what it is like to be a data miner. He does not need a job description.
You just tell him which department he will work for : marketing, credit risk, DNA-analytics Lab, …

Take for instance this :

Experience – Familiarity with major database and statistical packages; experience with statistical and database applications in a particular area such as biology (biostatistics), physical science, economics, or marketing (from
If you do not already know this, why do you go for a data mining job ?

or this :

Job description:

  • Participation in analytical projects from the area of data analysis, processing and Data Mining
  • Preparing documentation and presenting work results
  • Cooperating with team of StatConsulting data analysts and experts
  • Actively participating in business meetings with StatConsulting clients

(from Statconsulting)

Right ! It means executing and reporting data mining work, for somebody, and you are not alone. So WTF ?

The feeling I have with each and every job description I find is the same : boooooooooooring !

Why not simply write the truth ?

For example :
“The people of our marketing department do a very nice job, but we want it to be better. We want them to be more data-driven. They are able to add, subtract, divide and multiply. They can deal with the gender and age of their clients. But we have a feeling that’s not enough ! We want to take it a long way further. And that will be YOUR responsibility. When it comes to figures, you hold their hands. You explain. You provide the charts. You feed them numerical insights. You perform rocket science they don’t understand, convince them to use your models and prove to them that they were wrong if they did not. YOUR ultimate goal is to make THEM shine with high-return campaigns. And silently you hope they will show some gratitude, but you know very well that at least half of them will hate your guts because you are the one who forces them to change the way they are used to doing their job.”

Posted by: zyxo | March 27, 2011

Why do we pay banks ?

Posted by: zyxo | March 4, 2011

Distances : the biggest challenge in clustering

Clustering seems easy : you throw your data into a clustering algorithm (like the popular k-means clustering) and see what comes out of it.

What is clustering ? Here is one definition (picked at random from a Google search) : “Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.”

So it’s about clusters, “related groups”. Well, not related groups, I suppose, but groups containing related individuals. More commonly we see explanations like “putting observations in groups so that the distances within the groups are as small as possible and the distances between the groups as large as possible.”

In other words : the groups come into existence by playing around with distances. And that’s what it’s all about : distances.

Distances seem easy : if you work with continuous variables you calculate the Euclidean distance, if you work with categorical variables you can calculate a Jaccard distance.
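For those who want to see how “easy” these two really are, here is a minimal sketch in plain Python. The function names are mine; the Jaccard version assumes each customer’s categorical data is represented as a set of items (for example the products he owns).

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """Jaccard distance between two collections of categories, treated as sets.

    1 - (size of intersection / size of union); two empty sets are
    considered identical (distance 0) by convention here.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Two customers who share one product out of three in total are at a Jaccard distance of 2/3, whatever those products are worth to the business.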

With a bit of standardizing and creativity you can even mix the two.

But that is not what this post is about.

An example will make it more clear.

Suppose you want to cluster the customers of a bank.  You will use variables like account balances, number of investment transactions, credit card transactions, mortgage loan balance etc.

It is perfectly possible to combine them into one Euclidean distance. But does a difference of $1,000 between the current account balances of two customers represent the same distance as, for example, a difference of 15 credit card transactions during the previous month ? How do you compare these two ? How do you decide that the one is more important than the other ?

Eureka !  Standardisation.

Indeed, you can standardize them, so the distributions of the variables become comparable.

But even then. Does a difference of one standard deviation (std) between current account balances represent, in the eyes of the business, the same as one std difference between numbers of credit card transactions ? Perhaps the increase in bank profit due to a current account balance increase of one std is only half of that of a one std increase in the number of credit card transactions ! In that case it may be appropriate to, for example, divide the first measure by 2.
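A minimal sketch of this idea, assuming the business has given you one importance weight per variable: standardize each variable first, then multiply each standardized difference by its weight before computing the Euclidean distance. The weight of 0.5 in the usage note below is just the “divide by 2” from the example; the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant variable carries no distance
    return [(x - mu) / sigma for x in column]


def weighted_distance(row_a, row_b, weights):
    """Euclidean distance on standardized values, one business weight per variable."""
    if not (len(row_a) == len(row_b) == len(weights)):
        raise ValueError("rows and weights must have the same length")
    return math.sqrt(sum((w * (x - y)) ** 2
                         for x, y, w in zip(row_a, row_b, weights)))
```

With the weights `(0.5, 1.0)` for (current account balance, credit card transactions), a difference of one std in balance counts for half as much as a difference of one std in transactions.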

In order to know that you have to talk to the business people for each and every comparison between two variables.

To do this perfectly, the result must be a matrix of pairwise comparisons where each row and each column represents one of your variables. Realise that in some cases this matrix can be huge, as well as the amount of time you will have to spend with your business expert to discuss each pair of variables and come up with some importance ratio, or distance ratio.

And after that you have to make sure that the whole matrix is somewhat consistent. Because if your business expert is sure that variable A is twice as important as variable B and variable C is only half as important as variable A, but still, variable B is far more important than variable C, well, then “Houston we have a problem” and you will have to negotiate.
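Such a consistency check is easy to automate. The sketch below assumes the expert’s answers are stored as ratios, `ratio[(a, b)]` meaning “a is this many times as important as b”, and flags every triple where the stated ratio differs too much from what the two other ratios imply. The 25% tolerance and the function name are my own arbitrary choices.

```python
from itertools import permutations


def inconsistent_triples(ratio, tolerance=0.25):
    """Find triples (a, b, c) where ratio[a, c] disagrees with ratio[a, b] * ratio[b, c].

    ratio     : dict mapping (variable, variable) -> importance ratio
    tolerance : allowed relative deviation before a triple is reported
    """
    variables = sorted({v for pair in ratio for v in pair})
    problems = []
    for a, b, c in permutations(variables, 3):
        if (a, b) in ratio and (b, c) in ratio and (a, c) in ratio:
            implied = ratio[(a, b)] * ratio[(b, c)]
            stated = ratio[(a, c)]
            if abs(implied - stated) > tolerance * stated:
                problems.append((a, b, c, implied, stated))
    return problems
```

In the example above, with “far more important” read as a factor of 5 (my assumption), A versus B = 2 and B versus C = 5 imply A versus C = 10, while the expert stated 2. That triple is reported, and you know where to start negotiating.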

So indeed, clustering may be rather easy, but getting a good view on the appropriate distance definitions is the hard part.

Posted by: zyxo | February 14, 2011

Google CEO Naked ?

What would you think about a company that uses a whole lot of personal info about you to sell you their stuff :

– your entire buying experience

– your way of living

– where you go on vacation

– details of your family

– how much you earn

– which car you drive, which car your wife drives

– where you went to school

– where you work

– your profession

– where you shop for groceries

– where you buy your clothes

– which sports clubs or social clubs you are part of

– in which house you live, if it is your property or if you rent it

– how much you pay for rent

Sounds horrible ! No ?

What if this company is the local grocery store, the owner of which is your friend, went to primary school with you, lives in your street, and from time to time offers you his new products when he thinks they will interest you ?

Aha, that’s an entirely different story, isn’t it ?

WHY would that be ?

At first sight it is exactly the same : he possesses personal info about you, just like your bank or your telco company, and uses it to make some customized offers. So what’s the big deal ?

I’ll tell you : Balance versus imbalance.

You know your grocer as well as he knows you. You know his wife and children, which car he drives, etc… That’s hardly the case for the CEO and the other members of the board of directors of your bank. Do you even know their names ?

And that is why it seems so unfair : they know all sorts of personal stuff about you, and use it to make profits, while you know nothing about them. Nada, zip.

So when they have some paparazzi at their backs all the time, it is only to restore a little bit of that balance.

I would even think a bit further (just as a mental exercise) :

As an example : someone like Mr. Eric Schmidt, until now CEO of Google, is sometimes seen as one of the biggest privacy (ab-?) users of our planet. It is impossible to create a balance in such a case. The nearest thing to a balance would be to take away ALL of his privacy, and give it to all of us, which means that cameras would be following him 7 days per week, 24 hours per day, 60 minutes per hour and 60 seconds per minute. Always, everywhere, from any angle. And of course broadcasting in realtime on the internet. And likewise for each and every one of his higher managers.

And in order to discriminate against nobody : shouldn’t this be the case for every company that uses data mining for targeted direct marketing?

Posted by: zyxo | February 9, 2011

Skills of a good data miner

What skills would a perfect data miner have ?

In short : Technical , Business knowledge, analytical, soft, creativity and practical

Technical : 

– Programming, because as we all know data mining is:

  • 95% digging into data with SQL, SAS, C++, Python, or whatever language you can use to manipulate data and create your training, validation and test data sets.
  • and only some 5% really generating models and putting them to work

– Statistics, because it makes no sense to calculate for example logistic or linear regressions without having any clue as to what you are doing or what it means

– Data mining techniques, because that’s what data mining is all about : actually digging up the treasures of information, of patterns, that are hidden in the data. You have to know when it is appropriate to use which data mining technique. Should you use a decision tree or a K-means clustering ? Or why not a logistic regression ?

Business knowledge 

Because we, as data miners, do not work with only numbers, but with data that have a business meaning. How could we interpret a model, detect an information leak, or spot impossible results that point to some mistake in our data set if we do not know what the data mean ?

Analytical skills

Because data miners do not just run data mining algorithms because someone tells them to. No, as data miners our customers come to us with a problem they want to solve. We must be able to analyze the situation, find out what our customer really wants (this is not always what he’s telling us), and create our way to cook a delicious solution to his problem.

Soft skills 

Presentations, because you have to convince your superiors and colleagues of the value of your models, explain what your models can and can’t do, and how they should use them in their marketing campaigns

Report-writing, because, like being able to give a decent presentation, you should be able to write a good, clear and concise report. Not only for anyone interested in your data mining work/art, but also for yourself a year and some dozens of models later, when you want to know what the heck you did back then to get to that particular model.

Creativity

Because that’s the “art” part of data mining. You cannot stupidly apply some algorithms. You have to have a feeling for what will happen with such or such algorithm in combination with some particular aspects of your data. You have to have some gut feeling of why you should try something else, in what direction …

Practical

Two feet on the ground. Never lose sight of your ultimate goal : business outcomes. As Avinash Kaushik puts it : “an absolute obsession, with outcomes is mandatory”.


Posted by: zyxo | January 30, 2011

Are people who believe in God stupid ?

Recently I saw this article where Richard Lynn, John Harvey and Helmuth Nyborg claim that “Average intelligence predicts atheism rates across 137 nations”.

This raises a number of questions :

Are people who believe in one or more Gods more likely to remain less intelligent than non-believers?

Are intelligent people more likely to lose their faith (if they ever had any)?

Is there some other factor which causes people to believe in God AND be less intelligent ?

  • what about GNP ?
  • what about education levels ?
  • what about the level of democracy ?
  • what about the level of press freedom ?
  • what about female emancipation ?

And last but not least : what is IQ ? how is this measured ?

So there would be a lot of work to do to investigate all this. Unfortunately I do not have the time to include all the above factors in an analysis for this blog post. But just to give an idea of what it could yield, I added the gross domestic product (GDP) per capita.

OK, let us start with IQ and % non-believers :

It is clear that in the countries with a large percentage of non-believers the IQ is on average higher. However, the deviations from the exponential model are considerable. So let us take these deviations and see whether they are related to the GDP per capita :

We see a positive correlation. This means that we can make a combined model where the % non-believers equals the sum of the two models :

% non-believers = 0.0003 * exp(0.1065 * IQ) + 1.2513 + 0.3026 * GDPcap
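If you want to play with this formula yourself, here it is as a tiny Python function. It assumes that the garbled exponent coefficient in the formula is 0.1065, and it takes GDP per capita in whatever unit was used for the fit, which I did not state above, so treat the outputs as illustrative only. The function name is made up for this post.

```python
import math


def predicted_non_believers(iq, gdp_per_capita):
    """Combined model: exponential term in IQ plus a linear term in GDP per capita.

    Returns the predicted percentage of non-believers for a country.
    Coefficients are copied from the fitted formula; the unit of
    gdp_per_capita is an assumption left to the reader.
    """
    iq_part = 0.0003 * math.exp(0.1065 * iq)       # exponential model on IQ
    gdp_part = 1.2513 + 0.3026 * gdp_per_capita    # linear model on the residuals
    return iq_part + gdp_part
```

Both terms push in the same direction: a higher average IQ or a higher GDP per capita gives a higher predicted percentage of non-believers.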

The combined model is already fairly good, explaining 85% of the variance of the percentage of non-believers, as illustrated in the following chart :

So does this explain anything ? Only that % non-believers, IQ, and GDP per capita show concurring trends among countries. All three move in the same direction. Take for example the relationship between GDP per capita and IQ.

What can we learn from this ?

Does this mean that if you increase the GDP per capita, the IQ will follow ?
Or does it mean that the IQ-tests measure the extent to which someone is adapted to the life of the rich countries ?

This would lead us to a discussion about the validity of IQ-tests.  But that is a totally different story.

To continue the discussion of whether religion is something for stupid people : is it not so that people who believe in some God can lean on their belief to cope with the problems in life ? Is that not one of the reasons that natural selection has favored religion ? So from that point of view it is relatively stupid not to firmly believe in your God.

One last remark : It would be interesting to do the math with data about people who live in similar (wealth) circumstances, who had the same education.  I think it is irrelevant to compare countries as different as France or Mali  on this subject.

Posted by: zyxo | January 17, 2011

When is it OK to kill people ?

Uma Thurman in "Kill Bill"

A first answer is simple : it is never OK to kill people.

But unfortunately the reality is never that simple.  In this post I list a number of cases to illustrate in what circumstances humanity nowadays accepts, or does not accept, the killing of people.

Important : this list is certainly not exhaustive, so feel free to suggest additions.

Killing people is NOT OK :

-When it concerns one specific person who gets killed :

  • killing a particular someone for personal reasons : “I hate your guts, so I will kill you“.
  • deciding not to help people who need help (gross negligence).  For example, not calling 911 after a car accident.

-when it concerns a lot of unknown persons, and you do not know how many and which of them will die because of your action, but you do know that if you stop your detrimental action they will all stay alive.

Killing people seems OK in the following cases :

Conclusion : just like the fact that life is a terminal disease (everybody who lives dies sooner or later…), to live is to influence other people’s lives.  In some cases the influence is way too much, in other cases it is negligible.  In between there is the gray zone of discussion : can we accept this behaviour or not ?

I assume that these discussions will never end … and remember that everybody is right from his own point of view !

Posted by: zyxo | January 9, 2011

Link list for december 2010

Hi, here are my links for last month.  Enjoy browsing !

The Rise of Analytics: …

Man makes living suing e-mail spammers

On customer segmentation


why you should care about missing tweets

Will posthumans be atheists ?

Individual Knowledge in the Internet Age

“most health researchers live in a world dominated by the fascism of the randomised controlled trial”

The most striking pictures of 2010

Humanity is devoting some of its best minds, from a wide diversity of fields, to helping software achieve consciousness

the rate of bird species discoveries

optimal solution to towers of hanoi found by … ants

12 most interesting food facts :

how wikileaks changes things for us all

Posted by: zyxo | December 28, 2010

Cheap wines are as good as expensive ones

It is always a difficult question : which wine should I buy ?  A cheap one ? An expensive one ?  Am I sure that the expensive one will be a good one ?

A consumer organisation tested 205 red wines with a maximum price of 12 euros (~15.70 us dollars).

What was tested ?

  1. 3 taste categories (blind tasting)
  2. One general quality score, taking into account taste and some lab measurements (alcohol, sugar, acidity, sorbic acid, SO2)

What I looked at is whether the price is an indication of the quality of the wine.  Do you get a better wine when you buy a more expensive one ?

The results clearly say: NO !

Relationship between wine price and taste category:

From the above chart it is obvious that there is no relationship whatsoever between taste category and price.  No statistical tests needed to see that.

Relationship between wine price and quality score:

There is a negligible positive correlation between the quality score of the wine and its price.
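For readers who want to check such a price/quality relationship on their own list of wines, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The prices and scores below are invented placeholders, not the results of the consumer organisation's test, and the function name is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented placeholder numbers, NOT the tested wines.
prices = [3.5, 4.0, 5.5, 7.0, 9.0, 11.5]   # euros per bottle
scores = [62, 55, 70, 58, 66, 61]          # general quality score
r = pearson_r(prices, scores)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```

The square of r is the share of the variance in quality that price accounts for, so a small r means the price tag tells you very little.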

Conclusion :

If you want a good wine, buy several cheap ones and taste them.  Eventually you will find a very good one for a very small price.

Or do as I do : buy the yearly results book of a consumer organisation that tests hundreds of different wines.  The best wine investment I ever made.  For the price of a handful of cheap bottles you get the book, containing the necessary info to buy good wines for a very reasonable price.

Related posts :

Drinkhacker : wine price versus quality

Price tag can change the way people experience wine

Posted by: zyxo | December 19, 2010

Uncertainty, Risk, Statistics and Data Mining

Uncertainty and false research findings.

Traditional research is about :

  1. knowing a bit about a subject,
  2. formulating a hypothesis about the subject,
  3. gathering data to verify that hypothesis,
  4. doing some statistical tests to see if the hypothesis can be accepted or has to be rejected,
  5. judging if the obtained result is worth writing an article about,
  6. if yes, writing the article,
  7. submitting the article,
  8. judging (by the reviewers) if the article is good enough to be published,
  9. publishing the article.

Points 1. to 4.  are perfectly OK.

But then begins the non-scientific part of scientific research : people have to judge, and obviously this is always a somewhat subjective activity.  So when does an article get published (point 8.) ?  If the reviewers find it good enough, and mostly (>95% of the time) this means that it has to contain some conclusion based on a statistically significant outcome.

Result 1 : non-significant results, which are in themselves as valuable as the significant ones, are kept out of the scientific literature.

Result 2 : since by definition 5% of the statistical tests on something with no pattern whatsoever in it will show a statistically significant outcome (at the usual 5% significance level), a lot of published significant findings are rubbish, bullshit.   Just because the researchers and reviewers rely on statistics.

That’s the problem of uncertainty : statistical outcomes follow some distribution with a lot of uncertainty in it.  Unless the research is done over and over again, and previous findings are confirmed, the first findings are worthless.
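Result 2 is easy to check with a small simulation. The sketch below runs many "studies" on pure noise and counts how often a two-sided test at the 5% level calls the result significant. It assumes a simple z-test with known standard deviation; the number of experiments, the sample size and the seed are arbitrary choices of mine.

```python
import math
import random

def false_positive_rate(n_experiments=2000, n_obs=50, seed=1):
    """Run many 'studies' on pure noise and count the significant ones."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        # Pure noise: standard normal values, so the true mean is exactly 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        mean = sum(sample) / n_obs
        # z-test with known standard deviation 1: z = mean / (1 / sqrt(n))
        z = mean * math.sqrt(n_obs)
        if abs(z) > 1.96:  # two-sided test at the 5% level
            significant += 1
    return significant / n_experiments

rate = false_positive_rate()
print(f"share of 'significant' findings on pure noise: {rate:.3f}")
```

The printed share should land close to 0.05, even though there is nothing at all to find in the data.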

Good data mining practice.

One of the basic habits of any good data miner is using a hold-out sample to verify whether the new model really contains information about real patterns, or whether it was just a coincidence.  That’s the data miners’ way of dealing with uncertainty (and with a lot of other data mining stuff like information leaks, multi-level categorical variables and the like).  See the red line : training result and the blue line : validation result on the hold-out sample.
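Here is a minimal sketch of why the hold-out sample matters. The data are pure noise (the label has nothing to do with the feature), and the "model" is a 1-nearest-neighbour rule that memorises the training sample. Both choices are mine, picked only because they make the effect very visible.

```python
import random

def nearest_label(x, training):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(training, key=lambda rec: abs(rec[0] - x))[1]

def accuracy(records, training):
    """Share of records whose label is predicted correctly."""
    hits = sum(1 for x, y in records if nearest_label(x, training) == y)
    return hits / len(records)

rng = random.Random(42)
# Noise only: the label is drawn independently of the feature.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(600)]
rng.shuffle(data)
training, hold_out = data[:400], data[400:]

train_acc = accuracy(training, training)
hold_out_acc = accuracy(hold_out, training)
print(f"training: {train_acc:.2f}  hold-out: {hold_out_acc:.2f}")
```

On the training sample the model looks perfect, because every point is its own nearest neighbour. On the hold-out sample it does about as well as flipping a coin, which is the honest answer for data without any pattern.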

Difference between data mining and statistics.

Statistics is about testing a hypothesis, data mining is about calculating the hypothesis, and testing afterwards, with a hold-out sample if the hypothesis still stands.

Why should a data miner not test his calculated hypothesis with a statistical test ?  Because for this statistical test you still need another sample of data.  The data mining algorithm calculated the pattern (hypothesis) which stood out the most.  Testing this statistically is spurious : it will always be highly significant.

Yes you could do the statistical test on the hold-out sample.  And yes, the findings would be reliable.  But what would they be worth ?  If you have a very prominent pattern in the training sample and only a very weak pattern in the hold-out sample, then, due to the large amount of data, it would still be statistically significant.  What data miners want to know is whether the model performance is good enough to be used in, for example, some marketing campaign.
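A small worked example of that last point, as a sketch: a two-proportion z-test on the same tiny difference in response rate, once with a small and once with a very large hold-out sample. The response rates (1.1% versus 1.0%) and the sample sizes are hypothetical numbers chosen for illustration.

```python
import math

def two_proportion_z(rate_a, rate_b, n_per_group):
    """z statistic for the difference between two response rates (equal group sizes)."""
    pooled = (rate_a + rate_b) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    return (rate_a - rate_b) / std_err

# Hypothetical response rates: 1.1% in the selected group, 1.0% in a random group.
for n in (1_000, 1_000_000):
    z = two_proportion_z(0.011, 0.010, n)
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n = {n:>9,}: z = {z:.2f} ({verdict} at the 5% level)")
```

With a million customers per group the difference is highly significant, yet a lift of 1.1 would hardly pay for a campaign. Significance says the pattern is real, not that it is useful.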

So what’s the risk ?

Uncertainty and risk are two different things.  Uncertainty is about the possibility of ending up with a result that’s in the wrong part of the statistical distribution of your significance test.  Risk, on the other hand, is about things changing around your subject that change the underlying pattern.

The best illustration of risk we all saw in recent years, when the global financial system collapsed.  The models dealt with uncertainty, but the risks of our economic ecosystems were forgotten.

In commercial data mining you have the same risks.  You can make a great model, predicting which of your prospects will buy your product xyz.  If at the very moment you launch your campaign a competitor does exactly the same, with the same product, but at a much lower price, then there is a good chance that your campaign will suck, no matter how good your data mining model was.  That’s the risky part of your business.

My inspiration for this post :

The Truth wears off

Uncertainty vs risk vs randomness vs risk

Posted by: zyxo | December 8, 2010

Mr. CEO : when you downsize, you kill people.

When a company gets into trouble, whatever the cause (lousy products, lousy employees, lousy management, better competitors, lousy investments…) one of the first things it does is downsize.  Fewer employees means a lighter payroll.  And admit it : don’t we all perform some unnecessary work ?  In a 5000-employee company surely there must be some people who can be missed.  Isn’t that so ?

So they decide to downsize.

What they want to happen, but does not happen :

  • companies that announce layoffs do not enjoy higher stock prices than peers
  • layoffs don’t increase individual company productivity
  • layoffs do not increase profits
  • layoffs do not reliably cut costs

The collateral damage, the effect on people, what they did not expect or at least what never showed up in their calculations  :

  • people leave the company, mostly the best ones who can easily find a job elsewhere
  • among the people who accept the layoff package, a lot of them are the ones the company does not want to lose => loss of “institutional memory”
  • layoffs reduce morale, trust, motivation, commitment, and increase fear in the workplace
  • When the recession ends, a lot of employees will look for another job
  • as more work has to be done with fewer resources, creativity and innovation go down
  • downsizing increases stress levels
  • increasing stress and worsening work conditions increase absenteeism due to depression.
  • increasing stress and worsening work conditions increase suicide levels.  Simply put : when you downsize, you kill people. (I saw this happen !)

Sources :

Stretching fewer employees to cover ever more work in our job-starved recovery is no way to run the future.
During major downsizing, suicide levels increase
Lay Off the Layoffs
Downsizing dangers

Posted by: zyxo | December 6, 2010

Link list for november 2010

Here is my link list of november 2010.  Enjoy browsing !

How to create bubble charts with R

R by example

Learning R

teaching ethics to a robot

Can crime be stopped before it starts using virtual intelligence?

ever saw a snake fly ?

Interactive data mining map

the origins of the alphabet

Is your business profitable because your customers are smart? or stupid?

how to fold a t-shirt in three seconds

The more you tell, the more you sell

how wise are crowds ?

Stretching fewer employees to cover ever more work in our job-starved recovery is no way to run the future.

The color of your pill affects its efficacy !

can science and religion peacefully coexist ?

7 Innovative uses of analytics

World taxi prices: What a 3-kilometer ride costs in 72 big cities:

Economics in one sentence

Are You Addicted to the Internet?

6 tools to quantify yourself. It’s only the beginning !

Top 10 analytics mistakes

Your browser choice may affect the price you get on loan offers

Difference between Data Mining And Screen-Scraping

Google Has Indexed Only 0.004% of All Data on the Internet

Researchers discover how to erase memory

alcohol more harmful than heroin or crack

Posted by: zyxo | November 15, 2010

Data mining, Text mining and social media analytics

In this post I want to give my very simplified view of what data mining, text mining and social media analytics is all about.

Data mining, Text mining and Social media analytics all essentially take 3 steps :
1. get the raw material
2. transform it into “minable” data
3. do the mining.

(Many times people mention “data mining” when they actually only mean getting the raw material.)

Let us view those three steps in reverse order.

3. Do the mining.
This is very straightforward : you have a data set, file or database at hand which is exactly structured to be used by your data mining algorithm. So run it, find the hidden information treasures or whatever information you want to find in your data. For more details about this, there is a lot to find on the internet or in good data mining books. Start for example on

Before that :

2. Transform the raw material into minable data.
This is a lot more interesting, and it depends on what type of raw material you have. We can distinguish different types, or even combinations of types :

  • “data”. This is the easiest. If your raw material is essentially data, then either you can go directly to step 3 or you can first engage in data preparation activities like imputing missing values, transforming variables, combining different data sources, generating derived data etc.
  • “text”. Gets a bit more difficult. We enter the kingdom of text mining. But essentially, text mining is transforming the text into data and then just doing data mining. The trick is to get the transformation done. Very simplistic : just make a variable for each word and fill it with the number of occurrences of that word in each of your texts. That way, each text is transformed into one record with a huge quantity of variables that are mostly equal to zero. Or you can do it the difficult way : moving from words to expressions, meanings or whatever transformations up-to-date text mining packages are capable of nowadays.
  • “sounds”. This is getting real fun. Either your sounds are just … sounds without any meaning (like bird songs) or they are very meaningful sounds, like conversations. Conversations you can transform to text and treat that way. Just sounds you can transform into a number of variables, like amplitude and frequency. In fact you could turn them into charts just like charts of financial stocks, and treat them likewise. But be aware that sounds other than conversations can be very meaningful : the sound of a car crash, of a door slammed, of a crying child etc… Anyway there are a lot of possibilities, and I am not sure if it is really doable to get everything automatically into a straightforward data format.
  • “images”. Here it is really getting nasty. With pictures you could try things like face recognition. In order to accomplish that task, you must find a way to quantify each meaningful “entity” on somebody’s face, like the corners of the eyes, the nose length, the distance between the two eyes. Simply put : find the interesting points, measure the distances and quantify some ratios.
    A lot more difficult are random pictures. How can you identify the various objects, people, locations, on a picture ? In other words : how can you transform a picture into a data set with variables that not only contain info of what can be seen on the picture, but also what is happening ? A picture with a glass of beer and a man is not necessarily the same as a picture of a man drinking a glass of beer.
  • “movies”. This is definitely hell. Combine lots of pictures in meaningful sequences with spoken words, music and noise and try to put information like “a video where a guy named zyxo talks about data mining, text mining and social media predictive analytics, and with some self-reference in it” (how much detail will you include ?) in a data record. Looks a bit like transforming the video into text and then transforming the text into data.
  • “Social media”.  Can be any combination of the above.  Simplest is of course twitter.  But in social media (or any web content) you can decide to limit yourself to the pure text content or to the text or picture content, or …
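The "very simplistic" text transformation from the list above (one variable per word, filled with the number of occurrences) can be sketched in a few lines of Python. The function name, the crude tokenisation rule and the two example texts are my own choices for illustration.

```python
from collections import Counter
import re

def bag_of_words(texts):
    """Turn each text into one record of word counts (one variable per word)."""
    counts = [Counter(re.findall(r"[a-z']+", text.lower())) for text in texts]
    vocabulary = sorted(set().union(*counts))
    records = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, records

texts = ["Data mining is fun", "Text mining is data mining on text"]
vocabulary, records = bag_of_words(texts)
print(vocabulary)
for record in records:
    print(record)
```

With real texts the vocabulary quickly runs into thousands of words, so each record consists mostly of zeros, exactly as described above.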

Before that :

1. Get the raw material.

Well, just get it …

I am well aware there is lots and lots more to say about these vast subjects. My only goal was to come up with a very simple basic structure.

Do know that any comments are welcome 🙂

Posted by: zyxo | October 31, 2010

Link list for october 2010

Here’s my list from october.  Enjoy browsing !
Correlation Is Not Causation
Without any data, the world would be simpler: we would simply believe what authorities tell us
how much overhead is too much ?
grunting tennis players slow down the reactions of their opponents !
U.S. people waste 27 percent of their food, an energy waste of about 350 million barrels of oil a year.
the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.
What would happen if you put your hand in the Large Hadron Collider?
The 7 Types of presentations to avoid
The Data Science Venn Diagram
boys & girls equally good at mathematics
There isn’t a googol of anything
How To Use Twitter For Personal Data Mining
A Taxonomy of Data Science
About mining employee data. (Must read the comment by Alex).
falling in love only takes about a fifth of a second
Mitochondria : the fuel of evolution
The secrets to negotiating a higher salary
it’s cheaper to build things right the first time than it is to fix them later
tiny bees solve the traveling salesman problem !
What If Lehman Brothers had been Lehman Sisters?
in the future, will you have a license to reproduce ?
Never attribute to malice which can be attributed to stupidity, but …
Are U a web analysis ninja? Think U truly understand analytics? Play the board game & find out!
How to sound like a social-media expert with a dozen easy-to-learn phrases
Scientists study the ‘DNA of perfect pop song

Posted by: zyxo | October 24, 2010

5 Ways to generate better data mining models

Better data mining models.

This implies that you can measure the quality of your models.  You know, only one quality measure really matters (whatever lift-adepts and AUC-adepts may tell you) : what’s in it for your business ($$$$$) ?

Disclaimer : this post is NOT a list of things you should do in order to avoid all the known data mining mistakes.  On the contrary, I suppose you know what you are doing as a data miner. Only there are some possibilities you might have overlooked.

1. There is no data like more data (I) : observations

Push your data mining tool to the limits.  The more data you use, the better your model.

– As you know the best models are “ensembles” of weak learners, like bagging.  Instead of feeding one data file to the algorithm and letting it do the sampling, learning and averaging, I prefer to make the samples myself and feed them one at a time to the algorithm.  That way it is possible to use a lot more data before the tool crashes.  For each individual model I push it to its limits.  The averaging can be done afterward.

– A second advantage of making the samples yourself is that you can choose to generate samples that overlap as little as possible.  That way the total number of different observations used in model building reaches much higher levels than by feeding only one file to the modeling tool.
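A minimal sketch of this do-it-yourself ensemble, in plain Python. It deals the data out into non-overlapping samples, fits one weak model per sample, and does the averaging afterwards. The weak learner here is a one-split "stump", and the data are synthetic; both are assumptions of mine, standing in for whatever algorithm and data your tool works with.

```python
import random

def fit_stump(sample):
    """Weak learner: one split on x, predicting the mean of y on each side."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x < threshold]
        right = [y for x, y in sample if x >= threshold]
        if not left or not right:
            continue
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - mean_l) ** 2 for y in left)
                 + sum((y - mean_r) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, mean_l, mean_r)
    _, threshold, mean_l, mean_r = best
    return lambda x: mean_l if x < threshold else mean_r

def disjoint_samples(data, n_samples, rng):
    """Shuffle once, then deal the records out so that no sample overlaps another."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_samples] for i in range(n_samples)]

rng = random.Random(7)
# Synthetic data: the chance that y equals 1 rises with x.
data = []
for _ in range(1000):
    x = rng.random()
    data.append((x, 1 if rng.random() < x else 0))

samples = disjoint_samples(data, 10, rng)
models = [fit_stump(sample) for sample in samples]  # one model per sample

def ensemble(x):
    """The averaging is done afterwards, outside the modelling step."""
    return sum(model(x) for model in models) / len(models)

print(round(ensemble(0.1), 2), round(ensemble(0.9), 2))
```

Each stump only ever sees 100 records, so the modelling step stays small, while the ensemble as a whole has used all 1000 different observations.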

2. There is no data like more data (II) : variables

– Calculate additional (derived) fields. This is fairly easy.  You can multiply, subtract, divide and add numbers.  OK, it has to have some business meaning, otherwise how will you explain it afterwards ?

– Find additional information, inside or outside your company.

3. Find the best algorithm (the very best actually is a combination of all: ensemble)

– It is tempting to state that for each problem there is probably one best algorithm.  So all you have to do is try a handful of really different algorithms to find out which one is the best for the problem-data-data miner combination at hand.  Surprised that the data miner plays a role in this ?  Different data miners will use the same algorithm differently, according to their taste, experience, mood 🙂 …

So find out which algorithm works best for you and your problem.

4. Zoom in on your targets

– When you want to use a data mining model to select your customers who are most likely to buy your outstanding product XYZ, it is reasonable to use your past buyers of XYZ as your positive targets in your model.  You get a model with an excellent lift and use it for a mailing.  Afterwards you proudly report to your executives that your model (you !) increased the mailing return by a factor of 3.  Great.  The logical thing is to move on to the next problem, product MNO …

Wait !  Zoom in on your targets !  When your mailing campaign is over, you now have all the data you need to create a new, better, model for product XYZ.  Your targets : your past buyers of XYZ in response to your mailing.  With this new model, you will not only take their “natural” propensity to buy into account, but also their willingness to respond to your mailing !

– If your databases contain far more observations than your data mining tool likes, the only thing you can do is use samples.  No problem.  Calculate your model, and you can use it.  But you can push it a bit further.  Zoom in !  Use your model to score the entire customer base.  And now zoom in on the customers with the best scores.  Let’s say the top 10%.  Use them to calculate a new, second model which will use the much finer differences in customer information to find the really promising ones.
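The "zoom in" step itself is little more than a sort and a cut. A minimal sketch, assuming the first model has already scored the entire base; the customer ids and scores below are generated placeholders, and the function name is hypothetical.

```python
def top_fraction(scored_customers, fraction=0.10):
    """Return the best-scoring share of the base, highest score first."""
    ranked = sorted(scored_customers, key=lambda rec: rec[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical (customer_id, first_model_score) pairs for the entire base.
scored = [(cid, (cid * 37 % 100) / 100) for cid in range(1, 201)]
zoomed = top_fraction(scored)
print(len(zoomed), "customers go into the training set of the second model")
```

The second model is then trained on `zoomed` only, where the easy differences have already been used up by the first model.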

5. Make it simple

I confess : the four previous points all went in the direction of making things more complicated.  But nevertheless, you have to keep your data mining work as simple as possible, because the guy who pays your bills wants you to deliver good models, on time for his campaigns.

So try :

– to automate as much as possible

– not to try out every possible algorithm in each data mining project.  If problem A was best solved with algorithm X, then probably problem B, which is very similar to A, should equally be tackled with algorithm X.  No need to waste time checking out other algorithms.

– not to make a model when you know that for the next campaign for product XYZ your marketeer will mail each and every customer.  Models are made for campaign selection.  When they don’t select, they do not need a model.

Good luck !

Posted by: zyxo | October 3, 2010

Link list for september 2010

Here are the most interesting links (according to myself) I saw in september.  Enjoy browsing !

Always make it simpler. A wise Tesco lesson

7 Sneaky Ways to Use Twitter to Spy on Your Competition

Objective “quality of life index” with elastic mapping

lonely island in the middle of the South Atlantic conceals Charles Darwin’s best-kept secret.

Gossip improves productivity

No less than 5% of your payroll should go toward data analysis

Hans Rosling on global population growth : Great talk !

How reliable is science?

Six Ways to Supercharge Your Productivity

Sizing samples – how much data is enough?

How incentives should work!

Is predictive analytics possible with web data ?

The world’s oldest living things: Including a 80,000 & 600,000 yrs old!

“Beyond BI & Analytics”

The new robots will feel with their skin !

Dilbert nails it with new Marketing Manager for Social Media

We tend to ignore random chance when results seem meaningful. The Texas Sharpshooter Fallacy:

finally there is a formula to understand women

“The greatest dashboard in the world has no numbers on it” (A.Kaushik)

“For my scientific work, I always use data, for the design I often trust my feelings. I was wrong, terribly wrong”

The best data mining algorithm ever : decision trees

Reconstructing minds from software mindfiles

do you want a mind-reading phone ?

How long until Artificial intelligence beats humans ?

A high Rsquared does not necessarily mean it’s a good model

a FREE A/B Test Significance Calculator ->

Do you know Socrates’ test of three ?

how to get cheap concert tickets

Are you protected against zombie cookies ?

you age faster when you stand a couple of steps higher on a staircase !

Posted by: zyxo | September 17, 2010

Why decision trees is the best data mining algorithm

Data miners who visited my blog in the past, already know that I like decision trees . They are without any doubt my favorite data mining tool.

Want to know why ?  Because it is simply the best data mining algorithm.

For a number of reasons :

  • Decision trees are white boxes, which means they generate simple, understandable rules.  You can look into the trees, clearly understand each and every split, see the impact of that split and even compare it to alternative splits.
  • Decision trees are non-parametric, which means no specific data distribution is necessary.  Decision trees easily handle continuous and categorical variables.
  • Decision trees handle missing values as easily as any normal value of the variable.
  • In decision trees elegant tweaking is possible.  You can choose to set the depth of the trees, the minimum number of observations needed for a split or for a leaf, the number of leaves per split (in case of multilevel target variables).  And many more.
  • Decision trees are among the best independent variable selection algorithms.  If you really want to make a model with logistic (or linear) regressions or with neural networks, but first you want to reduce the number of variables by selecting only the relevant ones : use decision trees.  They are fast, and, unlike calculating simple correlations with the target variable, they also take into account the interactions between variables.
  • Decision trees are weak learners.  At first sight this seems to be a disadvantage, but NO !  Weak learners are great when you want to use lots of them in ensembles, because ensembles, like bagging, boosting, random forests and treenets, become very powerful algorithms when the individual models are weak learners.
  • Decision trees identify subgroups.  Each terminal or intermediate node in a decision tree can be seen as a subgroup/segment of your population.
  • Decision trees run fast even with lots of observations and variables.
  • Decision trees can be used for supervised AND unsupervised learning.  Yes, even though a decision tree is by definition a supervised learning algorithm where you need a target variable, it can be used for unsupervised learning, like clustering.  For this, see one of my previous posts.
  • Decision trees are simple.  I mean : it is a simple algorithm.  No complicated mathematics needed to understand how they work.
  • Decision trees deliver high quality models and are able to squeeze pretty much all information out of the data, especially if you use them in ensembles.
  • Decision trees can easily handle unbalanced datasets.  If you have 0.1 % of positive targets and 99.9% of negative ones : no problem for decision trees ! (see one of my previous posts)
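To show how little mathematics is involved, here is a minimal sketch of the heart of a decision tree: finding the best single split on one variable. It uses Gini impurity as the split criterion, which is one common choice among several; the customer records and the names "age", "visits" and "buys" are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 means a perfectly pure group."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, variable):
    """Try every observed value of one variable as a threshold; keep the purest split."""
    best = None
    for threshold in sorted({row[variable] for row in rows}):
        left = [row["buys"] for row in rows if row[variable] < threshold]
        right = [row["buys"] for row in rows if row[variable] >= threshold]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or weighted < best[0]:
            best = (weighted, threshold)
    return best

# Hypothetical customers; "age", "visits" and "buys" are made-up names.
rows = [
    {"age": 23, "visits": 1, "buys": 0},
    {"age": 31, "visits": 4, "buys": 0},
    {"age": 35, "visits": 2, "buys": 0},
    {"age": 42, "visits": 5, "buys": 1},
    {"age": 51, "visits": 1, "buys": 1},
    {"age": 58, "visits": 3, "buys": 1},
]
for variable in ("age", "visits"):
    impurity, threshold = best_split(rows, variable)
    print(f"IF {variable} >= {threshold} ...  weighted Gini = {impurity:.3f}")
```

The output is a readable rule per variable, and comparing the impurities tells you which variable separates buyers from non-buyers best. A full tree repeats this search inside each of the two resulting groups.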

Reasons enough ?  Do you know other algorithms with such beautiful characteristics ?

Please do let me know !

Posted by: zyxo | September 1, 2010

Link list for august 2010

Enjoy browsing !

always make it simpler. A wise Tesco lesson
In defense of A/B testing
test your RQ (Risk Intelligence)
social media ninjas versus gurus
The A-Z List: How Twitter Can Make You A Better Blogger
an open letter to all of advertising and marketing
Job Interviews: 20 Questions to Ask (Or Be Ready to Answer!!)
Why Life Should Be Effortless
relaxing on the beach in China …
Dilbert pokes fun at Knowledge Management
Huge list of cognitive biases
The hardest thing to do in tests is to kill things
A/B testing ? don’t forget dayparting
can you look at your employees’ facebook pages ?
Six Fundamental Shifts in the Way We Work
Artificial intelligence: Riders on a swarm
Champagne tastes better if you pour it like beer
LOL : What does it look like in a hierarchical organization ?
The illustrated guide to a Ph.D.
It’s all about respecting the customer
Mankind must abandon earth or face extinction
convert data to pears to simplify your reports
hype cycle for emerging technologies 2010
The Periodic Table of Irrational Nonsense
Predicting Entrepreneurial Success Using Data Mining
dangerous technology : geoengineering !
The chinese bus that drives over the cars : must see !
To find out what happens to a system when you interfere with it, you have to interfere with it
gamers better find protein structures than sophisticated algorithms
Worst Practices in Data Mining
innovation is not creativity, it is the execution of ideas
five reasons NOT to CLONE yourself
four things leaders can’t give, and one they can
