Posted by: zyxo | January 15, 2012

Will men ever live as long as women ?

This post is about the expected longevity of European men and women.
But it is also about the interpretation of a simple linear relationship.

First the figures.

The chart shows the relationship between the expected longevity of men and women in the various European countries.

relationship between the age of men and women

As we can see, they are highly correlated. Which is a good thing: it means that in countries with favourable life conditions, both sexes can profit.
We also observe that women live longer on average than men.  In no country does men’s average life span exceed 80 years, whereas women’s does in most countries.

The next chart shows the ratio (in %) of the life expectancy of women to that of men, plotted against the average life expectancy of men.

Three things are obvious :

  1. the ratio is >100% in all countries, meaning that everywhere women live longer than men
  2. the ratio goes down as men live longer, meaning that the advantage of women decreases, that men catch up.
  3. the regression line crosses the 100% line at an average age of men of about 87 years
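The crossing point in item 3 can be computed from the fitted line. Below is a minimal sketch, assuming an ordinary least-squares fit. The function names and the three data points are made up for illustration; they are not the values from the chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def crossing_point(slope, intercept, level=100.0):
    """x value at which the fitted line reaches the given level."""
    return (level - intercept) / slope


# Hypothetical data: men's life expectancy (years) and women/men ratio (%).
men_age = [70.0, 75.0, 80.0]
ratio = [112.0, 109.0, 106.0]

slope, intercept = fit_line(men_age, ratio)
print(round(crossing_point(slope, intercept), 1))
```

Note that the crossing point lies outside the range of the observed data, so it is an extrapolation of the line.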

Question : is this just a “men” effect?  Let us look at the same ratio in relationship with the average age of women:

We see approximately the same relationship, although the points are somewhat more scattered around the line. Here the point of equal life expectancy lies at about 91 years. Not exactly the same as in the previous graph, but let’s not quarrel about trifles.

Conclusion : if life conditions continued to improve to the point that women (or men) reach an average age of about 90 years, both sexes would have an equal life expectancy.

We could state the above otherwise : as women seem to have an advantage over men, an improvement in life conditions favours men more than women, such that at an average age of 90, the disadvantage of men would be wiped out.

The question remains : what is the actual advantage that women have over men?  Why do they live longer?

This article on Time Health gives reasons such as “women develop cardiovascular diseases 10 years later than men”, “women have two X chromosomes which give them more genetic material and hence more diversity, resulting in an advantage”, “men have something like a testosterone storm when they are around their 20s which makes them often behave dangerously”.

Another article in the Daily Mail speaks about a genetic advantage of women over men because men are more disposable. Women (especially long ago when we were hunter-gatherers) had to live long enough to raise their children.

And this article from Harvard University points the finger at the menopause, which causes women to stop giving birth to children. This gives them time to take care of their children and grandchildren.

So the next question : why does the age advantage of women decrease in countries where it’s better to live?  Does their genetic advantage decrease?

I believe that it’s something else. In the better countries, social structures, health care etc. are so much better that the environmental dangers decrease: better health care, safer cars, safer toys etc. diminish not only the genetic advantage of longer living mothers and grandmothers but also diminish the danger caused by the testosterone storm.  So if there is no danger any more, the danger-countermeasures that women have, would become worthless.

What happens then, when in some country both men and women reach an average age of 90?  That would perhaps be the indication that all avoidable dangers, accidents, crimes, diseases, or whatsoever have been removed by adding the necessary infrastructure and countermeasures.  This would then be an ideal country (from the point of view of longevity) where the only deaths would be from old age or incurable diseases that make no difference between the two sexes.

Any other ideas, interpretations? Do not hesitate, I will gladly read your comment.

Other posts on the differences between men and women :

Are men and women different species ?
Imbalance of cheating

Further reading on possibilities for living longer : http://ieet.org/index.php/IEET/more/brin20120108


Yesterday I got an interesting comment on a previous post on evolution.
I thought my answer would be too elaborate for a reply, hence this reply-post.

Tias Dailey writes the following (bolds are mine):

“You wrote that in one winter, a population of birds could be affected by natural selection because the small birds die off, leaving the larger birds. The thing is, natural selection always has a narrowing effect on the variation in a population. Understand that in your scenario, large birds did in fact exist before the natural selection. So that in itself is not evolution, but only narrowing of the gene pool. So that scenario doesn’t show that evolution can occur quickly.
To show that evolution can occur quickly, you would need to show that new features can arise quickly—features that were not present before.”

In fact, Tias makes 2 statements here :

  1. Natural selection always has a narrowing effect on the variation in a population.
  2. Narrowing of the gene pool in itself is not evolution.
The first conclusion we draw from these 2 statements is purely logical : since 1) natural selection always has a narrowing effect and 2) a narrowing effect is NOT evolution, it follows that natural selection cannot be the cause of evolution.
In the above, we assume that both statements 1) and 2) are right.  [As many will know, it is always dangerous to assume (ASS U ME)].
So when does evolution occur ?  If it is not when natural selection occurs (as a result of some sort of more severe environmental pressure) then it must occur in the opposite situation : when the environmental pressure is relaxed.  Under those circumstances inheritance / mutation / recombination can do a lot more without being naturally-selected away.  In other words: the variation in the population increases and new features (e.g. a bird that’s larger than any previously existing individual of its species) can see the light. Aha, we have evolution.
But let us look at Tias’ 2 initial statements.  Are they correct ?
For the first one : OK, I agree.  Natural selection does weed out the non-fit outliers and narrows the population variation.
For the second one : NOK.  Why should narrowing of the gene pool in itself not be evolution ?  By the way… What IS evolution ?
Let’s look at some definitions:
  • Change in the genetic composition of a population during successive generations, as a result of natural selection acting on the genetic variation among individuals (the free dictionary)
  • Biological evolution … is change in the properties of populations of organisms that transcend the lifetime of a single individual. The ontogeny of an individual is not considered evolution; individual organisms do not evolve. The changes in populations that are considered evolutionary are those that are inheritable via the genetic material from one generation to the next. Biological evolution may be slight or substantial; it embraces everything from slight changes in the proportion of different alleles within a population (such as those determining blood types) to the successive alterations that led from the earliest protoorganism to snails, bees, giraffes, and dandelions. (talkorigins)
  • Biological evolution is defined as descent with modification.   Biological evolution occurs at different scales. These include small-scale evolution and broad-scale evolution. Small-scale evolution, also referred to as microevolution, is the change in gene frequencies within a population of organisms changes from one generation to the next. Broad-scale evolution, also referred to as macroevolution, refers to evolution at a grander scale. It focuses on the progression of species or entire clades from a common ancestor to descendent clades over the course of numerous generations. (animals.about)
  • Evolution is any change across successive generations in the heritable characteristics of biological populations. Evolutionary processes give rise to diversity at every level of biological organisation, including species, individual organisms and molecules such as DNA and proteins. (Wikipedia)
So what do we see ?  “change in genetic composition”, “changes in populations…that are inheritable”, “change in gene frequencies”, “change in the heritable characteristics of biological populations”.
So what about narrowing of the gene pool ?
  • This IS change in genetic composition.
  • This IS inheritable.
  • This DOES change gene frequencies
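The three points above can be illustrated with the bird scenario from the comment. This is only a sketch under simple assumptions: body size stands in for a heritable trait, and the sizes and the survival threshold are invented numbers.

```python
from statistics import mean, pstdev


def survive_winter(sizes, min_size):
    """Keep only the birds that are at least min_size."""
    return [s for s in sizes if s >= min_size]


# Hypothetical body sizes (grams) of a bird population before winter.
before = [18, 19, 20, 21, 22, 23, 24]
after = survive_winter(before, min_size=21)

print(mean(before), mean(after))      # average size before and after
print(pstdev(before), pstdev(after))  # spread before and after
print(max(before), max(after))        # largest bird before and after
```

The largest bird is the same before and after, so nothing new has appeared, yet the composition of the population has changed.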
So IMHO narrowing of the gene pool is evolution.  Evolution does not always add new features.   Losing capabilities as a result of evolution is called regressive evolution.  Example: the European mole, which started to live beneath the ground and lost its vision capabilities.
To finish let me make a comparison.  Is it necessary to walk from New York to Rio De Janeiro in order to prove you can walk ?  Nope!  If I can show you that I can walk 10 steps, you will believe I can walk.
Likewise, is it necessary to show the emergence of a totally new feature like for example wings in order to prove evolution ?  Nope! If we can show a change in the genetic composition of a population, then we have shown evolution at work.

Loose coupling
Once upon a time, when the descendants of the neanderthals had invented something called object-oriented programming (and when I worked in IT), one of the good qualities of a good object-oriented design was “loose coupling“.
Loose coupling in object orientation means that software objects are loosely coupled with one another such that you can easily modify one of them without influencing the rest. The opposite of spaghetti code, where you pull on one spaghetti string and it starts to move on the other side of your plate.
Obviously loosely coupled designs are much more stable than tightly coupled ones.

Clustering
One of the more interesting unsupervised data mining algorithms consists in finding clusters in a data cloud such that the clusters themselves are tight, but the clusters are far away from one another. In other words : the average distance between observations from the same cluster is far less than the distances between observations from different clusters. In some way, clusters are loosely coupled.
Obviously a segmentation based on really loosely coupled clusters is much more relevant than one with very near or ill-separated clusters.
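To make "tight clusters, far apart" concrete, the two averages can be compared directly. This is a minimal sketch with invented 2-D points and Euclidean distance; the function names are my own, and a real clustering algorithm would of course have to find the cluster assignment first.

```python
from itertools import combinations, product
from math import dist


def mean_within_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_between_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    ]
    return sum(pairs) / len(pairs)


# Hypothetical 2-D observations, already assigned to two clusters.
clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(10.0, 0.0), (10.0, 1.0)],
]

within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
print(within, between)
```

The larger the between-cluster average is compared to the within-cluster average, the more loosely coupled the clusters are.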

Evolution
Evolution occurs when populations of organisms change under environmental pressure. When organisms of the same species live under different environmental circumstances they will change in different directions without influencing other populations a lot. That’s the way they can evolve into different species. Some sort of loose coupling.
Obviously different populations that are loosely coupled can much more easily evolve in different directions than populations that are still connected, with a high exchange rate of individuals.

Greece
Apparently Greece is NOT loosely coupled financially and economically from the rest of Europe and the rest of the world. Though it should be! The world is becoming one tightly coupled system such that when you or I fart, they smell it in Japan or the Bahamas.
That’s why at some places they start much smaller with what they call “local currencies“. And here, “local” means exactly what it means : local, in one town or even one neighbourhood.  This allows these towns or neighbourhoods to do their thing more efficiently than when they are tightly connected to the rest of the country via the national currency.
Obviously the EURO was not such a good idea after all ?

Posted by: zyxo | September 13, 2011

Se7en steps in finding knowledge nuggets.


Ever seen a bunch of children play happily in some forest for a couple of hours and return home nice and clean ? No way. When they are healthy they should return covered with dirt and mud. Only then have they really played.

“Returning home nice and clean” that’s the feeling I get when I read Scott Levine‘s “7 steps of knowledge discovery in databases“. It is correct what he writes, but it all seems so uncomfortably clean.
That’s why here I write down my own version :

“Se7en steps in finding knowledge nuggets”.

 
Step 1.Try to understand what your business user really needs (not what he/she asks).
Know for sure that your business user almost never asks for what he needs. No, somewhere he has a problem, he figures out for himself some sort of solution and based on that solution he asks you to do some mining work. If you just do it like that, I guarantee you that the result will never be satisfactory. Instead you’ll have to challenge him, ask him what he wants to accomplish with the result of your work, ask him what the problem is he wants to resolve and together find a -usually better- solution.

Step 2. Figure out what new data you need and which data-mining algorithms and/or statistical tests you need to use.
Now begins the creative part : when you know what your user needs, figure out how you can deliver. Which data ? Which algorithms? You should perform the entire analysis in your head, or sketch it on a piece of paper. Some time ago I even got into the habit of paradoxically starting the analysis by writing down the end report (of course without any results). But it forces you to think the whole thing over beforehand and not afterwards when it’s too late.

Step 3. Playing detective in trying to find the data tables, and the right selection criteria to get the specific data that you need.
The real world is not like having all the data exposed in front of you, nicely aligned and ordered according to some obvious criteria. No, often all you know is that it is there somewhere in some table. Now starts the dirty work : call/email people to ask if they know someone who knows …
When finally you get the info, you access the data, just to find out something is clearly not right. So you call/email people to ask if they know … until finally you are pretty sure you have what you wanted.

Step 4. Merge the newly found data with what you want to use from your existing datamart
This is the easier part and pretty straightforward.

Step 5. Get this data “mining-ready”
Depending on the algorithms you want to use your data has to meet some criteria : e.g. no nominal variables, must have a normal distribution, no missing values etc … Can be pretty tough to get everything in order.

Step 6. Mine that data (run algorithms, draw your conclusions)
That’s the exciting part. Not really difficult or dirty, because you had it all prepared. The part where you actually see what happens, the part where you discover the knowledge nugget, the part where you shout “YESS!”, or the part where you realise :”Shit, is that it ? That’s all ?”.

Step 7. Convince your business user that this result is all you can get out of it, even if it looks (afterwards) ridiculously obvious.
Now you come out of the woods, covered in mud, holding high your nugget, or almost empty-handed, or somewhere in between (gold-dust?)
You have to explain to your business user what’s the worth of your model, what he can do with it, how it can influence his marketing campaign results and eventually withstand his somewhat accusing look of “that’s all you got for me?”.

(Step 8. Afterwards just get a shower and prepare to find your next nugget.)

Posted by: zyxo | September 9, 2011

A victim far away means less than one nearby.

As I am typing this, I suppose that about every second someone gets killed or injured by some accident.  And still, I am typing this, without a lot of sorrow.  Not that this means nothing to me, but when somebody whom I do not know dies on the other side of the world, well it is a sad reality, but I do not care very much.

How does it affect you when someone dies or gets severely injured by accident?

An obvious answer to that question is : “it depends”.  On what ?

There are several factors that influence the effect an accident has on somebody:

  • distance : how far is it from where you live ?
  • familiarity : is that person a close relative? Is he a friend, a colleague ?  Someone you know from the media?
  • Number of casualties : is it one person?  hundreds? (compare one person hit by a truck in your village to thousands of them killed in the 9/11 disaster)
  • time : how long is it since you first heard of it (time heals all wounds).
  • age : an old man dying seems less severe than a child
  • health : a very sick (already dying) person, killed in an accident seems less dramatic than a healthy one.

When we combine these factors, we can say that the impact on someone (I) of an accident/disaster is positively correlated with the number of casualties (N),  the familiarity(F) and the health (H) and negatively with distance (D), time (T) and age (A).

So a simple equation would be : I=(N+F+H)/(D+T+A)

But now remains the problem of the units.  Distance can be measured in meters or kilometers and amount to thousands of them, whilst age can reach at most about 100 years.  It seems attractive to try to put every factor on a scale of 0 to 100.

Age is the simplest one and can be used as such.

Number of victims : a nice measure is 10 times the base-10 logarithm of 1 + the number of victims. The table below shows us how nicely it goes from zero when there are no victims to nearly 100 when the entire world population is wiped out.
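The victim score is easy to reproduce. A small sketch follows; the function name is my own, and the figure of 7 billion for the world population is an assumption used only to show the upper end of the scale.

```python
from math import log10


def victims_score(n_victims):
    """Score N: 10 times the base-10 logarithm of (1 + number of victims)."""
    return 10 * log10(1 + n_victims)


# The last value (7 billion) is an assumed world population.
for n in (0, 1, 100, 250_000, 7_000_000_000):
    print(f"{n:>13,d}  {victims_score(n):6.2f}")
```

Adding 1 before taking the logarithm is what makes the score exactly zero when there are no victims.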

A similar approach can be used for distance.  Here I use 20 thousand kilometers as the maximum, it’s the other side of the world.

Time is also a rather easy one.  I took the first table as a basis and adjusted it somewhat. X is the number of days, and I added a number of years column.  We reach 100 after 27 thousand years.

This leaves us with familiarity and health.  It is not my purpose to elaborate that in detail. So I suggest that a perfectly healthy person has a score of 100, someone who is already dead is scored as zero and let us use our gut feeling to assign the intermediate values.  For familiarity we can use a similar approach: a value of 100 represents the persons we love the most, like our children, our husband or wife.  0 stands for people we do not know at all and do not care about at all.

And now let us look at some examples.


1. The Banda Aceh tsunami.

The impact now, after 7 years, on somebody on the other side of the world is :

Our formula : I=(N+F+H)/(D+T+A)

Let us assume that those 250 thousand victims were fairly healthy and on average 35 years old and we care a little bit about those people.

If we do the math then we find that the impact on you now, after 7 years, =(54+3+97)/(100+49+35) = 0.84

A low figure, is it not?  But frankly, how often do you still think of that disaster, unless you live in that unlucky  part of the world ?

2. Now entirely different : suppose your perfectly healthy child of 1 year old who lives with you dies in an accident (I sincerely wish this will never happen!)

If we do the math then we find for the first day that the impact on you =(3.01+100+100)/(0+0+1) = 203.

Take a look at the formula. What would happen if that child was a newborn one?  The resulting value would be infinite!  Hence I propose to limit the Impact Value to a maximum of 100.

The formula then becomes : I=min ( 100 , (N+F+H)/(D+T+A) )  which means that if our calculated value is smaller than 100,  we accept it, otherwise we just take 100.
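As a quick check, the capped formula and the two examples can be written out as follows. This is only a sketch: the function and argument names are my own, and the scores passed in are the ones used in the examples above.

```python
def impact(n, f, h, d, t, a, cap=100.0):
    """Impact I = min(cap, (N + F + H) / (D + T + A)), all inputs scored 0-100."""
    numerator = n + f + h
    denominator = d + t + a
    if denominator == 0:
        # Division by zero (e.g. a newborn, nearby, on the first day):
        # treat the result as the cap.
        return cap
    return min(cap, numerator / denominator)


# Example 1: the tsunami, seen from the other side of the world after 7 years.
print(round(impact(n=54, f=3, h=97, d=100, t=49, a=35), 2))
# Example 2: the one-year-old child; the raw value of about 203 is capped.
print(impact(n=3.01, f=100, h=100, d=0, t=0, a=1))
```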

I know there is not much science in the above, it was just an interesting (but superficial) thought exercise.

Any suggestion for improvement is welcome.

(Some further reading on this subject : The new problem of distance in morality)

Posted by: zyxo | August 26, 2011

Why computers go bananas without any reason

“After that the computer froze a few times over the course of a couple days, so I assumed… So, I have no clue what is going on”.

“My computer randomly freezes,… What might be the problem?”

“Your computer was working fine, but then suddenly started locking up… Sometimes random lockups can be attributed to the computer memory…”

When you google “computer freezes” you get thousands of desperate people asking for help. Mostly it can be solved by checking hardware, software etc.

But occasionally it occurs that something goes wrong for no reason whatsoever, and then it never happens again. Why?

At work we had such a problem: less than once a year our SAS software refused to run our programs. Exactly the same programs we ran daily, weekly, monthly without any problems. Googling the error message was no help. Obviously the software was on strike. Temporarily, because the following morning everything was back to normal.
What happened ?

After deep thought, eliminating all impossible possibilities, I came up with the only plausible explanation I could find.
This is what I wrote to my colleagues :

“Dear colleagues,
I now know the reason of the problems: it’s what we call the IT Ghost, a species of creatures from the 5th dimension which are migrating this time of the year from the Betelgeuse area to the Crab nebula and are eventually teleporting through the earth. On this occasion they can influence the spin of some Charm Quarks causing computer processes to behave erratically, with no obvious reason.
A positronic energy field of 5.000 trillion petavolt around the earth should solve the problem.”


Do you have any better explanation ?  :-)

Posted by: zyxo | August 5, 2011

The 2.5 ways to segment your customer base

Terabytes have been filled with books and articles about segmentation.  And we should by now expect that the most basic knowledge about it is, well … known.
Forget it !
First : what is this most basic knowledge that each and every marketeer should know?

“What can you do with it” ?
Or, stated otherwise : how should you use it ?

Is the answer obvious ? Not at all !

Take for example the SAS white paper “A Marketer’s Guide to Analytics“.  You could reasonably expect SAS, as a major vendor of analytics software and consultancy, to know how to use segmentation.
Right?

Well, I seriously have my doubts.
They describe as “the first two enablers of the analytical framework” :
1) analytically driven, granular segmentation: enables you to identify how different customer segments are most likely to respond to specific campaigns or marketing actions.
and
2) predictive modeling: enables you to identify the specific target population likely to respond positively to a specific campaign or other marketing activity.
I get an odd feeling when I read these two “different” descriptions.  Whether I can identify how different customer segments will respond to my campaign or identify the target population that will respond in a particular way (“respond positively”) does not seem very different to me.  In both cases you want to predict the behaviour of each customer or customer group in response to your campaign.
So let us forget about software or algorithms.  Let’s think marketing.

1.  First, you want to sell your product or service.

This means you have to find out who is likely to buy it.  You call for help any tool or algorithm that can use the data in your customer base: logistic or linear regression,  neural networks, support vector machines, genetic algorithms, bayes learners, decision trees, and all sorts of segmentations.  Use whatever you like, know, or have, as long as it delivers satisfactory results.

OK, let’s say you have done this and you know who to target, you have your customer group or best segment or whatever.  Perhaps you have a lift chart or the like so you know what you can expect from your campaign. (In my earlier post “datamining for marketing campaigns: interpretation of lift” you will find a lot more about this topic.)

2. Second, you have, one way or another, to speak to those people.

And if there is one important issue about communication it’s that you have to send the right message to the right person.
OK, you want to sell them all your world-changing superb product.  But I’m not talking about the what, but about the how !  I’m not talking about the content of the message box you will send, but about the wrapping paper, the flavour of your message.  Will you use the same words, the same communication channel, the same colours for young women, for old men, for internet savvy whizzkids, for grandmas who never touched a computer ?

Did you notice ?  I gave some examples of customer SEGMENTS.  So that’s your second assignment : find the segments that match your communication alternatives.
A simple, but not easy, way to do this is to think, brainstorm, use your imagination and common sense, and use what you know about the people you identified in step one : look who’s in the selection, what is their age distribution, etc …
Now you have your second segmentation.

Lastly I owe you another half segmentation: In case you are not satisfied with your “communication segmentation”, you can always test it first:  Use your various communication alternatives randomly on part of the people in your selected target group.  Evaluate the results, and calculate which communication flavour you should use with which customer.  For this calculation you can use whatever you like, know, or have, as long as it delivers satisfactory results.  Then use the findings to optimise subsequent campaigns.
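The test-first approach can be sketched in a few lines. Everything here is hypothetical: the segment names, the two flavours ("email" and "letter"), the test results and the function names are invented, and a plain response rate per segment is just one possible way to do the calculation.

```python
import random
from collections import defaultdict


def assign_randomly(customer_ids, flavours, seed=0):
    """Give each test customer one randomly chosen communication flavour."""
    rng = random.Random(seed)
    return {cid: rng.choice(flavours) for cid in customer_ids}


def best_flavour_per_segment(results):
    """Pick, per segment, the flavour with the highest response rate.

    results: iterable of (segment, flavour, responded) tuples from the test.
    """
    counts = defaultdict(lambda: [0, 0])  # (segment, flavour) -> [responses, contacts]
    for segment, flavour, responded in results:
        counts[(segment, flavour)][0] += int(responded)
        counts[(segment, flavour)][1] += 1

    best = {}
    for (segment, flavour), (responses, contacts) in counts.items():
        rate = responses / contacts
        if segment not in best or rate > best[segment][1]:
            best[segment] = (flavour, rate)
    return best


# Hypothetical test outcome: (segment, flavour received, responded?).
results = [
    ("young", "email", True), ("young", "email", True),
    ("young", "letter", False), ("young", "letter", True),
    ("senior", "email", False), ("senior", "email", False),
    ("senior", "letter", True), ("senior", "letter", False),
]
assignment = assign_randomly(range(8), ["email", "letter"])
best = best_flavour_per_segment(results)
print(best)
```

With real campaign volumes you would also want to check that the difference between flavours is larger than what chance alone would produce.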
Posted by: zyxo | July 26, 2011

The customer satisfaction hierarchy

Customer satisfaction is a hot topic. Numerous studies are continuously going on to get to know the enhancers and/or dissatisfiers. Depending on the branch you work in (bank, retail, internet book shop, etc), these enhancers/dissatisfiers can be very different.
Nevertheless, if we take a step back and do some abstraction, it seems that we can distinguish different levels, analogous to Maslow’s pyramid.

In “maslows hierarchy of customer service”  Naumi Haque distinguishes three levels :

  1. Meeting the customers’ expectations
  2. Meeting the customers’ desires
  3. Meeting the customers’ unrecognized needs

At frankwatching they present a four-level pyramid :

  1. trust, reliability, value
  2. timeliness, knowledgeable, responsible
  3. Caring, concerned, helpful
  4. Fun, friendly, enjoyable, entertaining

Well, it should be no surprise, below I will present my own “customer satisfaction pyramid” which is slightly different from the two above, and for sure is put in less cryptic language.

The hierarchy is the following :

Basis : deliver what you promise, give the customer what you make him think you should give him.  This corresponds with the first level of the two pyramids above.

Second : do it fast, don’t keep your customer waiting, and do it properly, deliver it to him the way he would like it.

Third : see to it that there are no problems for the customer.  OK, nothing is always perfect, so if something goes wrong, make it as soon as possible your own problem, not the problem of the customer.  Make it easy for the customer to get problems solved.  Make sure that when the customer complains or asks for help, you give him a reassuring, easy, satisfied feeling.    Keep it easy for him, and do the hard work yourself to make him happy.
(This one was not mentioned in the two pyramids above.)

Finally : create a WOW effect

In short : optimise in this order :  the WHAT’s,  the HOW’s, the  CURES and the WOW’s

Posted by: zyxo | July 10, 2011

Is reading a newspaper “Data mining” ?

Data mining is a hype.  As a result everything is called data mining.  I suppose reading a newspaper to find some interesting information is called “data mining” by some people too.

However there is only one problem : not everything IS data mining.

To clear this mess a bit, in what follows I list and explain several activities that are sometimes (mistakenly) called “data mining”.

Data extraction 

“the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing” (wikipedia)

“Data extraction software can enable agencies to collect data on the race, gender, and ethnicity for the person(s) owning the majority of rights, equity, or interest in a business.” (Mozenda)

My definition is simple : you get the data from somewhere with some data extraction program.  What you do afterwards with that data is not relevant.

Reporting

Is making a report : “Report is a piece of information describing, or an account of certain events given or presented to someone“. (wikipedia)

“Reporting is just a genre of writing, alongside essays and stories, and bloggers most certainly fall into that genre. Imho, when they talk about reporting on a show like Frontline, they mean the process a reporter goes through.” (Scripting.com)

This seems a bit more complicated than data extraction.  I would say : “extracting from whatever sources of data/information those pieces of information that are sufficiently important and structuring/presenting them to be communicated to your audience, customers, boss or whatever other party”.

My definition: reporting is not showing raw data, but some communicable description.  This can be in the form of tables, charts, structured drawings, or simply words.

Statistics

“statistics is … a distinct mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.” (wikipedia)

“methods to collect, analyze and interpret data” (Nebraska university)

“collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting and then drawing conclusions” (Akila)

This is a very broad definition, and it obviously has a lot to do with data.

For me, apart from “data”,  the words that are most important here are “science”, “methods”, “interpretation”.  Statistics is not just extracting data or reporting, no, here we have to do better.

Hence my definition : we use some mathematical method(s) to extract the right data, to interpret the data, to draw conclusions based on mathematics and to present these results/conclusions.

Data mining

This is the most difficult one, and most misunderstood.

Some definitions:

“the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” (wikipedia)

“the process of analyzing data from different perspectives and summarizing it into useful information” (UCLAAnderson)

“Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items.” (about.com)
“Data mining is the discovery of hidden knowledge, unexpected patterns and new rules in large databases.” (E.Thomas)
The most important words or expressions here are : “extracting patterns”, “analyzing data”, “uncover relationships”, “discovery of knowledge”.
So my definition  is: searching in data collections (databases, the internet) for information that was not put there deliberately, but nevertheless can be derived.
And one more thing : reading a newspaper is definitely NOT data mining :-)

Here is my very personal view on some settings of decision trees.

Maximum depth :

Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.

My opinion is straight and simple : NEVER use maximum depth to limit the further splitting of nodes.  In other words : use the largest possible value.

I suppose some explanation is necessary.

When you grow a decision tree, different leaves in the splits normally contain different numbers of observations.  Using the tree depth totally disregards these differences.  It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, whereas on the other side, which contains far fewer observations, a leaf with only 30 observations could still get split.  This makes absolutely no sense!

Minimum splitsize

Minimum splitsize is a limit to stop further splitting of nodes when the number of observations in the node is lower than the minimum splitsize.

This is a good way to limit the growing of the tree.  When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).

Now the capital question : at what number should we set the limit ?

Answer : it depends.

  • are you just growing one tree or do you want to create an ensemble (bagging, boosting …) ?  If you create an ensemble, overfitting is permitted, because the ensemble will take care of it : it averages the individual trees or combines them with some other aggregation measure.
  • how many independent variables (predictors) do you have ?  The more variables you have, the bigger the possibility of having some accidental relationship between one of the variables and the target.  So with a lot of variables you should stop earlier.
  • how many observations do you have ?  With a limited number of observations you do not have the luxury to stop early or you will end up with no tree at all.  With a lot of observations you can stop early and still obtain a large enough decision tree.

With hundreds of variables I normally use a minimum splitsize in the range of the number of observations divided by a few hundred.

Minimum leaf size

Minimum leafsize is a limit that prevents a node from being split when the number of observations in one of the child nodes would be lower than the minimum leafsize.

Splitting a node into two or more child nodes has to make some statistical sense.  What, for example, is the sense of splitting a node with 100 observations into the following two child nodes : one with 99 observations and one with 1 observation ?  It is all a bit like doing a chi-squared test.  A good rule of thumb says that you should never have fewer than five observations in one of the cells.  I would say : the same goes for decision trees, as long as you deal with the same number of observations you normally use to calculate chi-squared tests.  It is known that i) with a large number of observations chi-squared tests are no longer appropriate and ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500).   So you should set the minimum leafsize larger than 5.  I usually take 10% of the minimum splitsize (in a bagging ensemble).
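To make the two rules of thumb above concrete, here is a minimal sketch in Python.  The function name, the divisor of 300 (one possible reading of "a few hundred") and the floor of 6 (for "larger than 5") are my own assumptions for illustration, not fixed recommendations.

```python
def tree_size_limits(n_observations, split_divisor=300, leaf_fraction=0.10, leaf_floor=6):
    """Derive a minimum split size and a minimum leaf size from the data set size.

    split_divisor : assumed value for "a few hundred"
    leaf_fraction : 10% of the minimum split size, as in the text
    leaf_floor    : assumed lower bound, so the leaf size stays larger than 5
    """
    min_split = max(2, n_observations // split_divisor)
    min_leaf = max(leaf_floor, round(min_split * leaf_fraction))
    return min_split, min_leaf


# Example with a hypothetical data set of 150,000 observations.
min_split, min_leaf = tree_size_limits(150_000)
print(min_split, min_leaf)  # 500 50
```

In scikit-learn, for instance, these numbers would correspond to the `min_samples_split` and `min_samples_leaf` parameters of a decision tree, with `max_depth` left at `None`.  For small data sets the floor can exceed the split size, which fits the remark above that trees are not a good choice there anyway.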

Conclusions

There is only one way to know the best settings : try, try and try again ! This is because all projects and data sets are different. Do you have your own rules of thumb ? Please, don’t hesitate and let me know !

Posted by: zyxo | June 17, 2011

Datamining and privacy: don’t shoot the pianist

The internet is full of reactions and opinions about data mining and the corresponding privacy issues. Even insults like the example below, aimed at data miners or top executives of data mining enterprises, are no exception.

But is data mining always so bad?
Domains like medical applications where data mining could save your life fall without any doubt on the good side of the picture.

But even marketing can be a justified reason to use data mining results:

  • some customers explicitly want to stay informed about new products or services that are within their region of interest
  • in a lot of cases data mining is used to do less mailing instead of more: do not contact people who are not going to buy anyway.
  • some product/service offers can be so rightly targeted that the targeted people think: “Wow, right! why didn’t I think of that myself ?”  Because of data mining in that case we actually provide them with a free service, more or less reminding them not to forget the things they actually need.  Of course this is the ideal situation.
Unfortunately there are all those less ethical initiatives out there, but that has nothing to do with data mining as such.  Has a rifle ever been condemned for killing someone ?  No !  It’s the shooter, the one who uses the rifle, who is the criminal.  The same goes for data mining.  We data miners are only the pianists.  We play music.  The ones that record our music and broadcast it much too loud are the ones to be blamed.


Posted by: zyxo | June 11, 2011

Does your boss want you to do HIS work ?

In “Six myths about data analysts” I was struck by number two :

  1. Myth #1: Data analysts are geeks. / Fact: Analysts are good communicators.
  2. Myth #2: Analysis is all about insight. / Fact: It’s all about impact.
  3. Myth #3: Data analysis is easy. / Fact: Data analysis takes time to learn.
  4. Myth #4: Statistics is the most important skill. / Fact: Business smarts are more important.
  5. Myth #5: Analysts work at the “speed of thought.” / Fact: Thought is often a slow, non-linear process.
  6. Myth #6: Analysts are a rare breed. / Fact: We’re all data analysts.

Number 2 : “Fact: It’s all about impact”

According to Ken Rudin, president of analytics at Zynga : “Analytics is about impact. In our company, if you have brilliant insight and you did great research and no one changes, you get zero credit.”

Dear reader : what do you think about that ?

For me it is simple.  They want the lower level employee to do everything.  Not only the lower level work, but also the management.  If you, as a data analyst, discover something interesting, make sure you do not communicate it to your manager.  OH NO ! They expect you to do the work of your managers, i.e. decide who should know about it, pass them the information, show them how it can be profitable to their work and to the enterprise, convince them to change (isn’t that change MANAGEMENT ?) etc…

I thought that managing was all about :

  • making sure you stay informed, means : talk to your data analysts, be interested in what they do and read their reports, ask them for new insights
  • using that information to choose the way you want to go
  • performing the necessary actions to get everyone with you along that way
  • making sure that you get feedback about the results of the change process
  • adding corrections if the results are not satisfactory
  • etc.

I know, managers want an easy life:

  • showing up unprepared at meetings
  • making decisions about necessary changes, not data driven, but more on gut feeling
  • (eventually) communicating these necessary changes
  • several months later by coincidence finding out that the changes never took place, not realizing they themselves did absolutely nothing to make it happen

So that’s why they think you are only a good data analyst if you do a good job at analyzing data AND a great job at doing the work they are supposed to do.

Posted by: zyxo | May 11, 2011

Honest Job Description of a Data Miner

OK. You have a job opening for a data miner.
Now what are you going to write as job description?

If you want to hire a real data miner, I suppose any good candidate knows what it is like to be a data miner. He does not need a job description.
You just tell for which department he will work : marketing, credit risk, DNA-analytics Lab, …

Take for instance this :

Experience – Familiarity with major database and statistical packages; experience with statistical and database applications in a particular area such as biology (biostatistics), physical science, economics, or marketing (from maxizip.com).
If you do not already know this, why do you go for a data mining job ?

or this :

Job description:

  • Participation in analytical projects from the area of data analysis, processing and Data Mining
  • Preparing documentation and presenting work results
  • Cooperating with team of StatConsulting data analysts and experts
  • Actively participating in business meetings with StatConsulting clients

(from Statconsulting)

Right ! It means executing and reporting data mining work, for somebody, and you are not alone. So WTF ?

The feeling I have with each and every job description I find is the same : boooooooooooring !

Why not simply write the truth ?

For example :
“The people of our marketing department do a very nice job, but we want it to be better. We want them to be more data-driven. They are able to add, subtract, divide and multiply. They can deal with the gender and age of their clients. But we have a feeling that’s not enough ! We want to take it a long way further. And that will be YOUR responsibility. When it comes to figures, you hold their hands. You explain. You provide the charts. You feed them numerical insights. You perform rocket science they don’t understand, convince them to use your models and prove to them that they were wrong if they did not. YOUR ultimate goal is to make THEM shine with high-return campaigns. And silently you hope they will show some gratitude, but you know very well that at least half of them will hate your guts because you are the one who forces them to change the way they are used to doing their job.”

Posted by: zyxo | March 27, 2011

Why do we pay banks ?

Posted by: zyxo | March 4, 2011

Distances : the biggest challenge in clustering

Clustering seems easy : you throw your data into a clustering algorithm (like the popular k-means clustering) and see what comes out of it.
FORGET IT !

What is clustering ? Here is one definition (picked at random from a Google search) : “Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.”

So it’s about clusters, “related groups”.  Well, not related groups, I suppose, but groups containing related individuals.  Generally we see rather explanations like “putting observations in groups so that the distances within the groups are as small as possible and the distances between the groups as large as possible.”

In other words : the groups come into existence by playing around with distances.  And that’s what it’s all about : distances.

Distances seem easy : if you work with continuous variables you calculate the Euclidean distance, if you work with categorical variables you can calculate a Jaccard distance.

With a bit of standardizing and creativity you can even mix the two.
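For readers who want to see the two distances side by side, here is a minimal sketch in plain Python.  The function names and the example values are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """Jaccard distance between two sets of categories: 1 - |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# Continuous variables: two points that form a 3-4-5 triangle.
d_continuous = euclidean_distance([0.0, 3.0], [4.0, 0.0])

# Categorical variables: products held by two hypothetical customers.
d_categorical = jaccard_distance({"mortgage", "credit card"}, {"credit card", "savings"})

print(d_continuous)   # 5.0
print(d_categorical)  # one shared product out of three distinct ones: 1 - 1/3
```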

But that is not what this post is about.

An example will make it more clear.

Suppose you want to cluster the customers of a bank.  You will use variables like account balances, number of investment transactions, credit card transactions, mortgage loan balance etc.

It is perfectly possible to combine them into one Euclidean distance.  But does a difference of $1,000 between the current account balances of two customers represent the same distance as, for example, a difference of 15 credit card transactions during the previous month ?  How do you compare these two ?  How do you decide that the one is more important than the other ?

Eureka !  Standardisation.

Indeed, you can standardize them, so the distributions of the variables become comparable.

But even then : does a difference of one standard deviation (std) between current account balances represent, in the eyes of the business, the same as a one std difference between numbers of credit card transactions ?  Perhaps the increase in bank profit due to a current account balance increase of one std is only half that of a one std increase in the number of credit card transactions !   In that case it may be appropriate to, for example, divide the first measure by 2.

In order to know that you have to talk to the business people for each and every comparison between two variables.
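As a rough sketch of what this could look like in practice : standardize each variable, then multiply it by a business scale factor before computing the distance.  The customer figures and the factor of 0.5 for the balance (the "divide by 2" from the example) are made up, and the function names are my own.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]  # assumes the values are not all identical


def business_distance(a, b, scales):
    """Euclidean distance after multiplying each standardized variable by its business scale."""
    return math.sqrt(sum((s * (x - y)) ** 2 for x, y, s in zip(a, b, scales)))


# Made-up illustration data for four customers.
balances = [1200.0, 5400.0, 300.0, 9800.0]
card_transactions = [3, 18, 7, 25]

customers = list(zip(standardize(balances), standardize(card_transactions)))

# Same pair of customers, without and with the business scaling.
plain = business_distance(customers[0], customers[1], scales=(1.0, 1.0))
scaled = business_distance(customers[0], customers[1], scales=(0.5, 1.0))
print(plain, scaled)
```

The scaled distance is smaller because a balance difference now counts for only half as much as a difference in card transactions.  Which scale factors are right is exactly the question for the business people.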

To be perfectly good the result must be a matrix of pairwise comparisons where each row and each column represents one of your variables. Realise that in some cases this matrix can be huge, as can the amount of time you will have to spend with your business expert to discuss each pair of variables and come up with some importance ratio, or distance ratio.

And after that you have to make sure that the whole matrix is somewhat consistent.  Because if your business expert is sure that variable A is twice as important as variable B and variable C is only half as important as variable A (which together imply that B and C are about equally important), but still insists that variable B is far more important than variable C, well, then “Houston, we have a problem” and you will have to negotiate.
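A simple way to spot such contradictions is to check every triple of variables : the stated ratio between A and C should be close to the ratio between A and B multiplied by the ratio between B and C.  The sketch below does just that.  The 25% tolerance and the value 5 for "far more important" are arbitrary assumptions, and this is only a quick sanity check, not a formal consistency index.

```python
from itertools import permutations


def inconsistent_triples(ratio, tolerance=0.25):
    """Return the triples (a, b, c) where ratio[a][b] * ratio[b][c] contradicts ratio[a][c].

    ratio[x][y] is how many times more important x is than y, according to the expert.
    """
    problems = []
    for a, b, c in permutations(list(ratio), 3):
        implied = ratio[a][b] * ratio[b][c]
        stated = ratio[a][c]
        if abs(implied - stated) / stated > tolerance:
            problems.append((a, b, c, implied, stated))
    return problems


# The example from the text: A is twice as important as B, C is half as important as A,
# and yet B is said to be far more important than C (assumed here: 5 times).
ratio = {
    "A": {"A": 1.0, "B": 2.0, "C": 2.0},
    "B": {"A": 0.5, "B": 1.0, "C": 5.0},
    "C": {"A": 0.5, "B": 0.2, "C": 1.0},
}

for a, b, c, implied, stated in inconsistent_triples(ratio):
    print(f"{a}/{c}: stated {stated}, but {a}/{b} * {b}/{c} implies {implied}")
```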

So indeed, clustering may be rather easy, but getting a good view on the appropriate distance definitions is the hard part.

Posted by: zyxo | February 14, 2011

Google CEO Naked ?

What would you think about a company that uses a whole lot of personal info about you to sell you their stuff :

- your entire buying experience

- your way of living

- where you go on vacation

- details of your family

- how much you earn

- which car you drive, which car your wife drives

- where you went to school

- where you work

- your profession

- where you shop for groceries

- where you buy your clothes

- which sportsclubs or social clubs you are part of

- in which house you live, if it is your property or if you rent it

- how much you pay for rent

Sounds horrible ! No ?

What if this company is the local grocery store, the owner of which is your friend, went to primary school with you, lives in your street, and from time to time offers you his new products when he thinks they will interest you ?

Aha, that’s an entirely different story ? Isn’t it ?

WHY would that be ?

At first sight it is exactly the same : he possesses personal info about you, just like your bank or your telco company, and uses it to make some customized offers.  So what’s the big deal ?

I’ll tell you : Balance versus imbalance.

You know your grocer as well as he knows you.  You know his wife and children, which car he drives, etc…That’s hardly the case of the CEO and the other members of the board of directors of your bank.  Do you even know their names ? 

And that is why it seems so unfair : they know all sorts of personal stuff about you, and use it to make profits, while you know nothing of them.  Nada, zip.

So when they have some paparazzi at their backs all the time, it is only to restore a little bit of that balance.

I would even think a bit  further, (just as a mental exercise) :

As an example : Someone like Mr. Eric Schmidt, until now CEO of Google, is sometimes seen as one of the biggest privacy (ab-?) users of our planet.  It is impossible to create a balance in such a case.  The nearest thing to a balance would be to take away ALL of his privacy, and give it to all of us, which means that cameras would be following him 7 days per week, 24 hours per day, 60 minutes per hour and 60 seconds per minute.  Always, everywhere, from any angle.  And of course broadcasting in realtime on the internet.  And likewise for each and every one of his higher managers.

And in order to discriminate against nobody : shouldn’t this be the case for every company that uses data mining for targeted direct marketing ?

Posted by: zyxo | February 9, 2011

Skills of a good data miner

What skills would a perfect data miner have ?

In short : Technical , Business knowledge, analytical, soft, creativity and practical

Technical

- Programming, because as we all know data mining is:

  • 95% digging into data with sql, sas, C++, Python, or whatever language you can use to manipulate data and create your training, validation and test data sets.
  • and only some 5% really generating models and putting them to work

- Statistics, because it makes no sense to calculate for example logistic or linear regressions without having any clue as to what you are doing or what it means

- Data mining techniques, because that’s what data mining is all about : actually unearthing the treasures of information, the patterns that are hidden in the data.  You have to know when it is appropriate to use which data mining technique.  Should you use a decision tree or a K-means clustering ?  Or why not a logistic regression ?

Business knowledge 

Because we, as data miners, do not work with only numbers, but with data that have a business meaning.  How could we interpret a model, detect an information leak, or spot impossible results that point to some mistake in our data set if we do not know what the data mean ?

Analytical 

Because data miners do not just run data mining algorithms because someone tells them to.  No, as data miners we have customers who come to us with a problem they want to solve.  We must be able to analyze the situation, find out what our customer really wants (this is not always what he’s telling us), and create our way to cook a delicious solution to his problem.

Soft skills 

Presentations, because you have to convince your superiors and colleagues that you have their models, explain what your models can do and can’t do and how they should use them in their marketing campaigns

Report-writing, because just as you should be able to give a decent presentation, you should be able to write a good, clear and concise report.  Not only for anyone interested in your data mining work/art, but for yourself a year and some dozens of models later, when you want to know what the heck you did back then to arrive at that particular model.

Creativity 

 Because that’s the “art” part of data mining.  You cannot stupidly apply some algorithms.  You have to have a feeling for what will happen with such or such algorithm in combination with some particular aspects of your data.  You have to have some gut feeling of why you should try something else, in what direction …

Practical  

Two feet on the ground.  Never lose sight of your ultimate goal : Business outcomes.  As Avinash Kaushik puts it : “an absolute obsession, with outcomes is mandatory”.

Some further reading :

skills of a data miner

what kind of data mining tools and knowledge should I know ?

Posted by: zyxo | January 30, 2011

Are people who believe in God stupid ?

Recently I saw this article where  Richard Lynn, John Harvey and Helmuth Nyborg claim that Average intelligence predicts atheism rates across 137 nations

This raises a number of questions :

Are people who believe in one or more Gods more likely to remain less intelligent than non-believers?

Are intelligent people more likely to lose their faith (if they ever had any)?

Is there some other factor which causes people to believe in God AND be less intelligent ?

  • quid GNP ?
  • quid education levels ?
  • quid democracy level ?
  • quid freedom of press level ?
  • quid female emancipation ?

And last but not least : what is IQ ? how is this measured ?

So there would be a lot of work to do to investigate all this.  Unfortunately I do not have the time to include all the above factors in an analysis for this blog post. But just to give an idea of what it could give, I added the gross domestic product (GDP) per capita.
OK let us start with IQ and % non-believers :

It is clear that in the countries with a large percentage of non-believers the IQ is on average higher.  However, the deviations from the exponential model are considerable. So let us take these deviations and see whether they are related to the GDP per capita :

We see a positive correlation.    Which means that we can make a combined model where the % non-believers equals the sum of the two models :

% non-believers = 0.0003 * exp(0.1065*IQ) + 1.2513 + 0.3026*GDPcap
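For anyone who wants to play with the formula, here it is as a small Python function.  Two assumptions to flag loudly : I read the exponent coefficient as 0.1065, and the unit of GDPcap is not stated in this post, so the second argument is only meaningful in whatever unit the fit was made with.

```python
import math


def pct_non_believers(iq, gdp_per_capita):
    """Evaluate the combined model: exponential IQ term plus linear GDP-per-capita term.

    Assumptions: exponent coefficient read as 0.1065; the unit of gdp_per_capita
    is whatever unit was used in the original fit (not stated in the post).
    """
    iq_model = 0.0003 * math.exp(0.1065 * iq)
    gdp_model = 1.2513 + 0.3026 * gdp_per_capita
    return iq_model + gdp_model


# The IQ part alone grows quickly: compare an average IQ of 90 with one of 100.
print(pct_non_believers(90, 0.0), pct_non_believers(100, 0.0))
```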

This is already a fairly good model, explaining 85% of the variance of the %-age of non-believers, as illustrated in following chart :

So does this explain something ?  Only that % non-believers, IQ, and GDP per capita show concurring trends among countries.  All three move in the same direction.  Take for example the relationship between GDP per capita and IQ.

What can we learn from this ?

Does this mean that if you increase the GDP per capita, the IQ will follow ?
Or does it mean that the IQ-tests measure the extent to which someone is adapted to the life of the rich countries ?

This would lead us to a discussion about the validity of IQ-tests.  But that is a totally different story.

To continue the discussion of whether religion is something for stupid people : is it not so that people who believe in some God can lean on their belief to cope with the problems in life ? Is that not one of the reasons that natural selection has favored religion ? So from that point of view it is relatively stupid not to firmly believe in your God.

One last remark : It would be interesting to do the math with data about people who live in similar (wealth) circumstances, who had the same education.  I think it is irrelevant to compare countries as different as France or Mali  on this subject.

Posted by: zyxo | January 17, 2011

When is it OK to kill people ?

Uma Thurman in "Kill Bill"

A first answer is simple : it is never OK to kill people.

But unfortunately the reality is never that simple.  In this post I list a number of cases to illustrate in which circumstances humanity nowadays accepts, or does not accept, the killing of people.

Important : this list is certainly not exhaustive, so feel free to suggest additions.

Killing people is NOT OK :

-When it concerns one specific person who gets killed :

  • killing a particular someone for personal reasons : “I hate your guts, so I will kill you“.
  • deciding not to help people who need help (gross negligence).  For example, not calling 911 after a car accident.

-when it concerns a lot of unknown persons, and you do not know how many and which of them will die because of your action, but you do know that if you stop your detrimental action they will all stay alive.

Killing people seems OK in the following cases :

Conclusion : just like the fact that life is a terminal disease (everybody who lives dies sooner or later…),  to live is to influence other people’s lives.  In some cases the influence is way too much, in other cases it is negligible.  In between there is the gray zone of discussion : can we accept this behaviour or not ?

I assume that these discussions will never end … and remember that everybody is right from his own point of view !

Posted by: zyxo | January 9, 2011

Link list for december 2010

Hi, here are my links for last month.  Enjoy browsing !

The Rise of Analytics: … http://bit.ly/dJDirg

Man makes living suing e-mail spammers http://bit.ly/e5z6qs

On customer segmentation http://adage.com/cmostrategy/article?article_id=135961

THE DIGITAL STORY OF THE NATIVITY http://www.youtube.com/watch?v=GkHNNPM7pJA

why you should care about missing tweets http://bit.ly/e9EpvY

Will posthumans be atheists ? http://bit.ly/eHE46O

Individual Knowledge in the Internet Age http://bit.ly/hx3DZo

“most health researchers live in a world dominated by the fascism of the randomised controlled trial” http://bit.ly/e2ZmPG

The most striking pictures of 2010 http://bit.ly/e5Uo1M

Humanity is devoting some of its best minds, from a wide diversity of fields, to helping software achieve consciousness http://ur.ly/ynz8

the rate of bird species discoveries http://j.mp/fYSvno

optimal solution to towers of hanoi found by … ants http://ur.ly/yI5H

12 most interesting food facts : http://ur.ly/yFu2

how wikileaks changes things for us all http://ur.ly/xi7C

Posted by: zyxo | December 28, 2010

Cheap wines are as good as expensive ones

It is always a difficult question : which wine should I buy ?  A cheap one ? An expensive one ?  Am I sure that the expensive one will be a good one ?

A consumer organisation tested 205 red wines with a maximum price of 12 euros (~15.70 us dollars).

What was tested ?

  1. 3 taste categories (blind tasting)
  2. One general quality score (taking into account taste and some lab measurements like alcohol, sugar, acidity, sorbic acid, SO2)

What I looked at is whether the price is an indication of the quality of the wine.  Do you get a better wine when you buy a more expensive one ?

The results clearly say: NO !

Relationship between wine price and taste category:

From the above chart it is obvious that there is no relationship whatsoever between taste category and price.  No statistical tests needed to see that.

Relationship between wine price and quality score:

There is a negligible positive correlation between the quality score of the wine and its price.

Conclusion :

If you want a good wine, buy several cheap ones and taste them.  Eventually you will find a very good one for a very small price.

Or do as I do : buy the yearly results book of a consumer organisation that tests hundreds of different wines.  The best wine investment I ever made.  For the price of a handful of cheap bottles you get the book, containing the necessary info to buy good wines for a very reasonable price.

Related posts :

Drinkhacker : wine price versus quality

Price tag can change the way people experience wine

Posted by: zyxo | December 19, 2010

Uncertainty, Risk, Statistics and Data Mining

Uncertainty and false research findings.

Traditional research is about :

  1. knowing a bit about a subject,
  2. formulating a hypothesis about the subject,
  3. gathering data to verify that hypothesis,
  4. doing some statistical tests to see if the hypothesis can be accepted or has to be rejected,
  5. judging if the obtained result is worthwhile writing an article about,
  6. if yes, writing the article,
  7. submitting the article,
  8. judging (by the reviewers) if the article is good enough to be published,
  9. publishing the article.

Points 1. to 4.  are perfectly OK.

But then begins the non-scientific part of the scientific research : people have to judge, and obviously this is always a somewhat subjective activity.  So when does an article get published (point 8.) ?  If the reviewers find it good enough, and mostly (>95% of the time) this means that it has to contain some conclusion based on a statistically significant outcome.

Result 1 : non-significant results, which are in themselves as valuable as the significant ones, are excluded from the scientific literature.

Result 2 : since by definition 5% of the statistical tests on something with no pattern whatsoever in it will show a statistically significant outcome (at the usual 5% significance level), a lot of published significant findings are rubbish, bullshit.   Just because the researchers and reviewers rely on statistics.
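The 5% is easy to check with a small simulation.  The sketch below runs many "experiments" on pure noise and counts how often a simple two-sided z-test calls the result significant.  The sample size, the number of experiments and the seed are arbitrary choices of mine.

```python
import math
import random


def false_positive_rate(n_experiments=2000, sample_size=30, seed=42):
    """Fraction of pure-noise experiments that a two-sided z-test calls significant at 5%."""
    rng = random.Random(seed)
    critical_z = 1.96  # two-sided 5% threshold of the standard normal distribution
    significant = 0
    for _ in range(n_experiments):
        # Noise only: the true mean is 0, so there is no pattern to find.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        z = (sum(sample) / sample_size) * math.sqrt(sample_size)  # known std of 1
        if abs(z) > critical_z:
            significant += 1
    return significant / n_experiments


rate = false_positive_rate()
print(f"{rate:.1%} of the experiments on pure noise look significant")
```

The printed share should land close to 5%, even though there is nothing at all to discover in the data.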

That’s the problem of uncertainty : statistical outcomes follow some distribution with a lot of uncertainty in it.  Unless the research is done over and over again, and previous findings are confirmed, the first findings are worthless.

Good data mining practice.

One of the basic habits of any good data miner is using a hold-out sample to verify whether the new model really contains information about real patterns, or whether it was just a coincidence.  That is how data miners deal with uncertainty (and with a lot of other data mining issues like information leaks, multi-level categorical variables and the like).  See the red line : training result and the blue line : validation result on the hold-out sample.
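Setting a hold-out sample aside takes only a few lines.  Here is a minimal sketch; the 30% hold-out fraction and the function name are my own choices, not a rule.

```python
import random


def split_holdout(records, holdout_fraction=0.3, seed=0):
    """Shuffle the records and set a fraction aside that the model never sees during training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical data set of 1,000 customer ids.
training, holdout = split_holdout(range(1000))
print(len(training), len(holdout))  # 700 300
```

The model is built on `training` only.  Its performance is then measured on both parts : a large gap between the two results is the warning sign that the model has learned coincidences.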

Difference between data mining and statistics.

Statistics is about testing a hypothesis; data mining is about calculating the hypothesis and testing afterwards, with a hold-out sample, whether the hypothesis still stands.

Why should a data miner not test his calculated hypothesis with a statistical test ?  Because for this statistical test you still need another sample of data.  The data mining algorithm calculated the pattern (hypothesis) which stood out the most.  Testing this statistically is spurious : it will always be highly significant.

Yes, you could do the statistical test on the hold-out sample.  And yes, the findings would be reliable.  But what would they be worth ?  If you have a very prominent pattern in the training sample and only a very weak pattern in the hold-out sample, due to the high amount of data, statistically it would still be significant.  What data miners want to know is whether the model performance is good enough to be used in for example some marketing campaign.
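A small calculation shows the effect of sample size.  Take a weak pattern : a response rate of 5.2% in the group selected by the model versus 5.0% in a comparison group (made-up numbers), and compare them with a standard two-proportion z-test, once with 1,000 customers per group and once with a million.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic of the pooled two-proportion test for rate_a versus rate_b."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_a - rate_b) / standard_error


# Same weak pattern (5.2% versus 5.0%), two very different sample sizes.
z_small = two_proportion_z(52, 1_000, 50, 1_000)
z_large = two_proportion_z(52_000, 1_000_000, 50_000, 1_000_000)

print(f"1,000 per group:     z = {z_small:.2f}")
print(f"1,000,000 per group: z = {z_large:.2f}")
```

With the usual threshold of 1.96, the small sample is nowhere near significant and the large one is far beyond it, although the pattern is exactly as weak in both cases.  Significance says nothing about whether 0.2 percentage points is worth a campaign.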

So what’s the risk ?

Uncertainty and risk are two different things.  Uncertainty is about the possibility of ending up with a result that’s in the wrong part of the statistical distribution of your significance test.  Risk, on the other hand, is about things that change around your subject and thereby change the underlying pattern.

The best illustration of risk we all saw previous years, when the global financial system collapsed.  The models dealt with uncertainty, but the risks of our economic ecosystems were forgotten.

In commercial data mining you have the same risks.  You can make a great model, predicting which of your prospects will buy your product xyz.  If at the very moment when you launch your campaign there is a competitor who does exactly the same, with the same product, but at a much lower price, then there is a good chance that your campaign will suck, no matter how good your data mining model was.  That’s the risky part of your business.

My inspiration for this post :

The Truth wears off

Uncertainty vs risk vs randomness vs risk


Posted by: zyxo | December 8, 2010

Mr. CEO : when you downsize, you kill people.

When a company gets into trouble, whatever the cause (lousy products, lousy employees, lousy management, better competitors, lousy investments…), one of the first things it does is downsize.  Fewer employees means a lighter payroll.  And admit it : don’t we all perform some unnecessary work ?  In a 5000 employee company surely there must be some people that can be missed.  Isn’t that so ?

So they decide to downsize.

What they want to happen, but does not happen :

  • companies that announce layoffs do not enjoy higher stock prices than peers
  • Layoffs don’t increase individual company productivity
  • layoffs do not increase profits
  • Layoffs do not reliably cut costs

The collateral damage, the effect on people, what they did not expect or at least what never showed up in their calculations  :

  • people leave the company, mostly the best ones who can easily find a job elsewhere
  • among the people who accept the layoff package, a lot of them are the ones the company does not want to lose => loss of “institutional memory”
  • layoffs reduce morale, trust, motivation, commitment, and increase fear in the workplace
  • When the recession ends, a lot of employees will look for another job
  • as more work has to be done with fewer resources, creativity and innovation go down
  • downsizing increases stress levels
  • increasing stress and worsening work conditions increase absenteeism due to depression.
  • increasing stress and worsening work conditions increase suicide levels.  Simply put : when you downsize, you kill people. (I saw this happen !)

Sources :

Stretching fewer employees to cover ever more work in our job-starved recovery is no way to run the future.
During major downsizing, suicide levels increase
Lay Off the Layoffs
Downsizing dangers

Posted by: zyxo | December 6, 2010

Link list for november 2010

Here is my link list of november 2010.  Enjoy browsing !

How to create bubble charts with R

R by example

Learning R

teaching ethics to a robot http://ur.ly/vNjv

Can crime be stopped before it starts using virtual intelligence? http://tiny.cc/6zpgf

ever saw a snake fly ? http://ur.ly/w7hN

Interactive data mining map

the origins of the alphabet http://ur.ly/vR9s

Is your business profitable because your customers are smart? or stupid? http://ur.ly/vmM2

how to fold a t-shirt in three seconds http://ur.ly/v9YS

The more you tell, the more you sell http://kiss.ly/avSmFs

how wise are crowds ? http://ur.ly/vcjR

Stretching fewer employees to cover ever more work in our job-starved recovery is no way to run the future. http://ur.ly/uVFH

The color of your pill affects its efficacy ! http://ur.ly/uBGK

can science and religion peacefully coexist ? http://ur.ly/ulfq

7 Innovative uses of analytics http://bit.ly/bai20F

World taxi prices: What a 3-kilometer ride costs in 72 big cities: http://zqi.bo.lt/fvexb

Economics in one sentence http://bit.ly/auBPKj

Are You Addicted to the Internet? – http://tinyurl.com

6 tools to quantify yourself. It’s only the beginning ! http://bit.ly/94ngan

Top 10 analytics mistakes http://bit.ly/cwk2mT

Your browser choice may affect the price you get on loan offers http://bit.ly/axhg3F

Difference between Data Mining And Screen-Scraping http://ow.ly/35ch8

Google Has Indexed Only 0.004% of All Data on the Internet http://ow.ly/33Uu8

Researchers discover how to erase memory http://ow.ly/32wCS

alcohol more harmful than heroin or crack http://ow.ly/32vqc

Posted by: zyxo | November 15, 2010

Data mining, Text mining and social media analytics

In this post I want to give my very simplified view of what data mining, text mining and social media analytics are all about.

Data mining, Text mining and Social media analytics all essentially take 3 steps :
1. get the raw material
2. transform it into “minable” data
3. do the mining.

(Many times people mention “data mining” when they actually only mean getting the raw material.)

Let us view those three steps in reverse order.

3. Do the mining.
This is very straightforward : you have a data set, file or database at hand which is structured exactly as your data mining algorithm needs it. So run it, and find the hidden information treasures or whatever information you want to find in your data. For more details about this, there is a lot to find on the internet or in good data mining books. Start for example on www.kdnuggets.com.

Before that :

2. Transform the raw material into minable data.
This is a lot more interesting, and it depends on what type of raw material you have. We can distinguish different types, or even combinations of types :

  • “data”. This is the easiest. If your raw material is essentially data, then either you can go directly to step 3 or you can first engage in data preparation activities like imputing missing values, transforming variables, combining different data sources, generating derived data etc.
  • “text”. Gets a bit more difficult. We enter the kingdom of text mining. But essentially, text mining is transforming the text into data and then just doing data mining. The trick is to get the transformation done. Very simplistic : just make a variable for each word, fill it with the number of occurrences of that word in each of your texts. That way, each text is transformed into one record with a huge quantity of variables that are mostly equal to zero. Or you can do it the difficult way : moving from words to expressions, meanings or whatever transformations up-to-date text mining packages are capable of nowadays.
  • “sounds”. This is getting real fun. Either your sounds are just … sounds without any meaning (like bird songs) or they are very meaningful sounds, like conversations. If they are conversations, you can transform them to text and treat them that way. If they are just sounds, you can transform them into a number of variables, like amplitude and frequency. In fact you could turn them into charts just like charts of financial stocks, and treat them likewise. But be aware that sounds other than conversations can be very meaningful : the sound of a car crash, of a door slammed, of a crying child etc… Anyway there are a lot of possibilities, and I am not sure if it is really doable to get everything automatically into a straightforward data format.
  • “images”. Here it is really getting nasty. With pictures you could try things like face recognition. In order to accomplish that task, you must find a way to quantify each meaningful “entity” on somebody’s face, like the corners of the eyes, the nose length, the distance between the two eyes. Simply put : find the interesting points, measure the distances and quantify some ratios.
    A lot more difficult are random pictures. How can you identify the various objects, people and locations in a picture ? In other words : how can you transform a picture into a data set with variables that not only contain info about what can be seen in the picture, but also about what is happening ? A picture with a glass of beer and a man is not necessarily the same as a picture of a man drinking a glass of beer.
  • “movies”. This is definitely hell. Combine lots of pictures in meaningful sequences with spoken words, music and noise and try to put information like “a video where a guy named zyxo talks about data mining, text mining and social media predictive analytics, and with some self-reference in it” (how much detail will you include ?) in a data record. Looks a bit like transforming the video into text and then transforming the text into data.
  • “Social media”.  Can be any combination of the above.  Simplest is of course twitter.  But in social media (or any web content) you can decide to limit yourself to the pure text content, or to the text and picture content, or …
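
To make the “very simplistic” text transformation from the list above a bit more concrete, here is a minimal Python sketch: one variable per word, filled with the number of occurrences of that word in each text. The function names (tokenize, bag_of_words), the crude tokenizing rule and the two sample texts are my own assumptions for illustration only; real text mining packages do a lot more than this.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(texts):
    """Return (vocabulary, rows): one row of word counts per text.

    Each column corresponds to one word of the vocabulary, so every
    text becomes one record with mostly-zero variables.
    """
    counts = [Counter(tokenize(t)) for t in texts]
    # The vocabulary is every distinct word seen in any of the texts.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a text.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample texts, only for illustration.
texts = [
    "Data mining finds patterns in data.",
    "Text mining turns text into data.",
]
vocabulary, rows = bag_of_words(texts)
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row is one “record” that can go straight to step 3. With real texts the vocabulary quickly grows to thousands of words, which is exactly why most of the variables are equal to zero.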

Before that :

1. Get the raw material.

Well, just get it …

I am well aware there is lots and lots more to say about these vast subjects. My only goal was to come up with a very simple basic structure.

Do know that any comments are welcome :-)
