Posted by: zyxo | November 19, 2014

Will I die from EBOLA ?

Will I die from ebola?

Who knows?

Will ebola be stopped/eliminated?

Who knows?

Anyway: I do not want to die from ebola at all!  I would very much like doctors and other health workers to succeed in wiping out that terrible disease completely.

That is why I actually helped them.  Not by going into the field (I’m not that enthusiastic) but simply by donating some money.

If you also do not want to die from ebola: join me in helping them and donate a few dollars/pounds/euros.

And no, you do not have to be altruistic to act in that way.  Egoism is a very good reason to help others, and so protect yourself!

 Join the fight against ebola!

Posted by: zyxo | July 4, 2014

Soccer rules are unfair and illogical


 

As I type this, the German national soccer team has eliminated the French, and Brazil is about to play Colombia in the quarterfinals of the soccer World Cup in Brazil.

It is really weird that the match results of the most popular sport in the world are so strongly influenced by unfair, illogical rules and coincidences.

 

TIME.

Let us begin with time. A soccer game basically consists of two halves of 45 minutes. Thus: 90 minutes of playing time.

NO !

Soccer has two types of time:
1. playing time: this is the time between a) a whistle blow of the referee (indicating the beginning of actual playing time) or a player bringing the ball into play (e.g. throw-in) and b) something happening that causes the play to stop (ball going outside the field boundaries, a player committing a foul, a goal).

2. non-playing time, or “dead” time: This is the contrary of playing time: the time between something happening that causes the play to stop and a whistle blow of the referee or a player bringing the ball into play.

Since the total amount of time always equals 90 minutes(*), the players have the opportunity to stall during the non-playing time, which means acting in an unsportsmanlike way and stealing playing time from the opponent. This is typical behaviour of the leading team towards the end of the game.
Example: when the goalkeeper catches the ball he can choose to return the ball into play immediately, or to wait several seconds. There is a limit to that, but in practice it is not fixed. It depends entirely on the current mood of the referee.
Solution 1: the goalkeeper is only allowed to punch the ball, not to catch and immobilise it. So the game would not be halted.

Better and more general solution 2: only take the actual playing time into account and stop the clock during the non-playing time. The actual playing time could then be limited to two halves of 30 minutes.

Solutions 1 and 2 are not mutually exclusive and should preferably both be applied.

(*) There are exceptions such as injuries, for which extra time is added at the end, but the players never know up front how much this will be.

 

Throw-in.

I already mentioned this. When a player sends the ball outside the field boundaries, an opponent must bring the ball back into play by throwing it in with both hands, over his head.
In Europe, soccer is called “football”, meaning that it is a game where you predominantly use your feet to play the ball. In fact, you can use any part of your body except your arms and hands. So it is completely illogical to use hands to bring the ball back into play.

Solution 3: replace throw-ins by a simple free kick from the spot where the ball has left the field.

This solution would immediately remedy the next weird fact: if someone sends the ball outside the field across the goal line, close to the corner, the result is a corner kick for the opposing team.
But if the ball leaves the field on the other side of the corner flag, crossing the end of the sideline, the result is only a throw-in. With a free kick the result would be similar, in keeping with the similarity of what happens in both cases.

 

Penalty Kicks

 

When a player commits a relatively severe foul inside the penalty area, the opposing team can take a penalty kick from 11 meters from the goal. All field players must stay outside the penalty area. Penalty kicks result in a goal in 85% of the cases.
The weak point here is the referee’s judgment, which consists of two parts:
1) is the foul severe enough?
2) was the foul committed inside the penalty area?
So in the doubtful cases it is up to the referee to decide whether to give the opposing team an 85% chance of scoring a goal, which comes very close to deciding the game.

Solution 4: apply the same set of rules to every free kick, with no distinction for penalty kicks:
– always taken from the spot where the foul was committed
– no field player between the ball and the goal
– all opponents must stay at least 9.15 meters from the ball. Exception: if the ball is less than 9.15 meters from the goal, opponents may stay on the goal line.

 

Red cards


 

Now comes the complete unfairness of an important rule. If a player commits a severe foul, deliberately hurting or endangering an opponent, the referee can punish him with a yellow card (a warning) or a red card (a second yellow automatically means red). A red card means that he has to leave the game and his team continues with one player fewer.

The big problem is the following: if the red card is given during the first minute of the game, the receiving team loses a player for almost 90 minutes. If the red card is given during the last minute (for an equally severe foul), the punishment is practically non-existent.

Solution 5: each and every foul is punished with some non-playing time for the offending player. In order not to exaggerate the interruptions, we can use a threshold of, for example, three minutes. Depending on the severity of the foul, from one minute (for a minor foul) up to 10 minutes can be given for one single foul.
This does not entirely solve the problem of the most severe foul during the last minute, but this can be compensated by, for example, punishing two players instead of one.

 

And the last unfairness: referee judgment.


A lot of rules are fuzzy in soccer. Perhaps not in the written rules, but certainly in the way they are interpreted by the referee (“in a manner considered by the referee to be careless, reckless or using excessive force”).
In hockey they use video referrals to prevent the umpire from severely influencing the match result.
Solution 6: this should also be the case in the most important soccer matches.

Conclusion:

If FIFA applied these solutions, soccer would become much more fun to watch, much more logical and much fairer. The results would depend much less on accidental events, referee moods or errors of judgment.
But on the other hand, as the saying goes: soccer is emotion and passion. Most of these emotions occur after the game, when fans of both teams discuss what happened during the game. And because of the unfairness and the illogical rules and decisions there is always a lot to discuss. This part of soccer would not be the same any more. And perhaps it is exactly this part that is responsible for the enormous popularity of what, in the end, is only “some people playing with a ball”.

Posted by: zyxo | December 18, 2013

Real explanation of IT-problems

This really happened.  Due to privacy policy I only give you the final section of the problem history:

About the problem

The problem seems to be solved but we (at least I) don’t really know how.
The problem doesn’t come from any long comments in the code I sent you (we have used this macro on a daily basis for years, and it was still not working after I deleted the only comment in it).
When I got this message, I tried to relaunch SAS -> NOK
I tried to delete as many files as possible in different work directories on the unix machine -> NOK
The real explanation (as stated by this expert)
Dear colleagues,
I know the reason of the problems: it’s what we call the IT Ghost, a species of creatures from the 5th dimension which are migrating this time of the year from the Betelgeuse area to the Crab nebula and are eventually teleporting through the earth.  On this occasion they can influence the spin of some Charm Quarks causing computer processes to behave erratically, with no obvious reason.
A positronic energy field of 5.000 trillion petavolt around the earth should solve the problem.
Posted by: zyxo | April 13, 2013

10 items every man should have

Here is my very personal list of what a man should have to be happy.  I am aware that a lot of you will not agree on all 10 items and I am OK with that.  I can easily imagine a dozen other things that could add to my happiness, but somehow I feel the 10 below are, well, my top 10.
(to avoid any misunderstanding, 1. and 2. I do not consider as “things”, but persons)

In a random order.

1.wife/husband
2.daughter/son
3.leatherman/victorinox
4.computer/laptop/smartphone/tablet
5.hair or hat
6.car
7.bicycle
8.toolset
9.garden
10.job

And now a bit more elaborated.

  1. wife/husband: Well, a human being is not meant to be alone. You can do a lot of things together with friends or colleagues, but at night, just before going to sleep, you want to put your mask aside. And then you need someone to really talk to. Someone to be your true self with. Someone to share your real sorrows, your real happinesses.
  2. daughter/son: We are here because all our ancestors, each and every single one of them, had offspring. So if we do not have any son or daughter it finishes here. We can discuss whether this is a good or bad thing, but if we do not want to stop our ancestral line here, we should follow their example and pass our genes on to the next generation.
  3. leatherman/victorinox: There is always something to repair, some rope to cut, some screw to tighten, a can to open, or whatever. If you have some good quality multi-tool in your pocket like a leatherman or victorinox you feel a lot more powerful, a lot more master of the situation, a lot more of a man.
  4. computer/laptop/tablet: In the good old days people learned to read and write. Nowadays kids learn to “computer”, to “internet”.  If you do not want to be looked at as someone outdated and useless, you should be able to communicate with others, including kids, using their technology. “I have no email address” is not acceptable any more, even if you are older than 70.
  5. smartphone: There is always something to look up, some mail to read, some stock price to check, a picture to take, a place to locate, or whatever. If you have some good quality multi-tool in your pocket like an iPhone or Samsung Galaxy you feel a lot more powerful, a lot more master of the situation, a lot more of a man. For me, my smartphone is like an IT version of a leatherman.
  6. hair or hat: This is very personal. It does not mean I do not like bald people. It just means I do not want to be bald.  I still have hair, but the day all my hair has left me, I will definitely want to be able to cover my head.
  7. car: Unless you have plenty of time and nothing much to do, or you live in a big city where you get around faster on a bicycle than in a car, a car is very useful to save you a lot of time. You can also transport a lot more in your car than on your bicycle. And for some people a fancy car comes in handy to compensate for a small … you know what :-)
  8. bicycle: Unless you live in the country with little or no traffic you need a bicycle. For distances of less than 5 km a bicycle is faster than a car, especially in town.  Add up the time to walk to your car, possibly open your garage, drive and confront the traffic jams and circle around for a place to park: in that time you would already have reached your destination by bicycle.
  9. toolset: A leatherman comes in handy, but when you have big jobs to do you need the proper tools, otherwise you will ruin the job. A drill, a few types of power saw, some hammers, some screwdrivers (preferably a good electric one), an angle grinder, and many more. When you have been a Handy Harry for some decades, you need a big car to transport all your tools to help rebuild your kids’ place.
  10. garden: When I lived in an apartment, I hated having to leave my property in order to get some fresh air. When you step outside in your very own garden and hear the birds sing, see the green of the lawn, the flowers, the vegetables in your kitchen garden and know you are still at home, and all this is yours, it helps a lot to feel happy.
  11. job: OK, points 1 to 9 cost a lot of money.  Money that you inherited, or that’s just a gift from your parents, is worth as much as the money you actually worked for.  But the latter feels a lot better.  It means you are able to do something that other people value and are willing to pay for.  The money you earn does not only have its own value, but it also shows YOUR value.

Have other suggestions? Let me know.

Posted by: zyxo | September 4, 2012

Which body improvement would you prefer ?

Just for fun.

I’m curious what you will choose!

Posted by: zyxo | September 4, 2012

The Big Data Disease


Artificial intelligence, data mining, knowledge discovery…
And now there’s a new word on the block : big data

 

This is what “they” are trying to sell us :

The first page of a google search for “big data solutions” already contains these phrases:

 

“Learn how Vestas Wind Systems use IBM big data analytics software …”

 

“Learn how Oracle helps you acquire, organize, and analyze your big data…”

 

“Unlock actionable insights from all your structured and unstructured data with the Microsoft Big Data Solution…”

 

“NetApp offers Big Data Solutions that efficiently process, analyze, manage, and access data at scale…”

 

“The era for Big Data has arrived. EMC Isilon is leading the way with scale-out NAS solutions purpose-built to solve the performance, capacity, and application …”

 

“Think Big Analytics is the leading professional services firm for big data and advanced analytics…”

 

 

This leads me to some questions:

  • Which part is new, which is the same as before, but with just the words “big data” added ?
  • Do we all have that much data?
  • Are we all concerned?

What has changed?

This is simple: the volume of available data in the world really has become big.  But nevertheless, everything remains relative. I suppose in 5 or 10 years we will make fun of those guys who thought that a handful of petabytes already represented “big data”.

 
Do we all have that much data?

I am convinced the majority of enterprises have way below a petabyte of data in their databases. Of course there is always the internet, where we can extract as many petabytes of data as we want.  Which means that, so to speak, we all have that same amount of data at our disposal.

 
Are we all concerned ?

Not really, for now. First, not each and every enterprise has that amount of analysable data in its databases.  And not each and every enterprise has the need to extract those amounts of data from the internet.  As long as some gigabytes or maybe some terabytes are sufficient to analyse whatever you want to analyse, why bother?
The big data hype is advocated by enterprises that sell big data solutions.  But the fact that they tell us we need it does not mean that we do need it, or that we are ready to put it to proper use and generate an acceptable ROI.
I saw enough examples of expensive software that was just money thrown away.

Other people will certainly have other opinions on this. If you do, do not hesitate to start the discussion.

Posted by: zyxo | July 17, 2012

The 10 best free android apps

Here’s my list of favorites.  I use more than that, but those are the ones that are really useful and give me the greatest satisfaction.

(And yes, I listed 12 of them :-) because I did not know which ones to leave out.)

1.Google maps

Simply because it shows you where you are, and shows you the way to wherever you want to be.  The latest improvement is superb : you can locally store your maps on the smartphone. Very useful in case you do not have any internet connection when you are abroad.

2.My tracks

Because it shows you where you have been, some averages, and charts of your speed and altitude.  Great when you want to hike or bike.

3.Camscanner

Literally lets you use your smartphone camera as a scanner (remember the flatbed scanners?).  No more tilted pictures of documents, no more gray where there is supposed to be white.

4.Tweetcaster

The best smartphone twitter client. (I should instead have mentioned twicca here, but it kept freezing my HTC).   Tweetcaster is great! Way better than tweetdeck.

5.WordPress

Very handy when you have a wordpress blog.

6.Skype

The most used app to call someone over the internet

7.Where’s my droid

Yes, in case you have lost your smartphone.  You can have it make a lot of noise, which is useful in case you dropped it somewhere under a pillow, or you can have it send you its location, in case it is stolen.

8.Torch

Makes you see in the dark, by using the (flash) led light of your smartphone.

9.Dropbox

Your diskspace on the internet

10.Diskusage

I assume you want to know what’s on your sd-card?  This app is as simple as it’s useful.

11.Opera mini

The must-have browser

12.Es fileexplorer

The equivalent of the file explorer on your pc or laptop.

And yes, I installed a lot more, but almost never use them.

Posted by: zyxo | July 13, 2012

Put a Crowbar under your Marketing Campaign

Ever wanted to lift a heavy weight, or to open a door that is really stuck or to pull a large nail out of a beam?
The simplest thing you can do when you cannot get it done by yourself is to get help. More persons can deliver more force than one single person. Or a group of persons can accomplish more when you add some people to it.


However, if you are on your own to pull that nail, which type of help would you choose: two other men like you, or one single person with a crowbar?
The choice is obvious. The one with the crowbar will unleash more force than the two other persons together, due to the huge leverage effect of the crowbar.

It is not different in marketing
You can keep adding people to your team, but if you want to really give a boost to your campaigns, you will need someone with a crowbar.

In marketing such a person is called a data miner.

This data miner is more than just another person.  It’s the one who easily doubles, triples, quadruples the result of your campaign by optimizing the selection criteria for your target group.  An optimized selection will do two things:

  1. avoid sending letters to, or calling, people who are not likely to buy anyway => important reduction of the cost of your campaign
  2. contact only those people who are highly likely to buy => important increase of campaign success
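
To make the crowbar effect concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the purchase probabilities, the cost per contact and the revenue per sale are invented numbers, and `campaign_result` and `select_top` are hypothetical helper names, not part of any real tool.

```python
# Hypothetical illustration: every number below is invented.

def campaign_result(probabilities, cost_per_contact, revenue_per_sale):
    """Expected cost, sales and profit when contacting everyone in the list."""
    contacts = len(probabilities)
    expected_sales = sum(probabilities)
    cost = contacts * cost_per_contact
    profit = expected_sales * revenue_per_sale - cost
    return {"contacts": contacts, "expected_sales": expected_sales,
            "cost": cost, "profit": profit}


def select_top(probabilities, fraction):
    """Keep the given fraction of customers with the highest purchase probability."""
    keep = max(1, round(len(probabilities) * fraction))
    return sorted(probabilities, reverse=True)[:keep]


# Invented scores from a data mining model:
# 100 likely buyers (20% chance) and 900 unlikely ones (1% chance).
scores = [0.20] * 100 + [0.01] * 900

everyone = campaign_result(scores, cost_per_contact=1.0, revenue_per_sale=30.0)
targeted = campaign_result(select_top(scores, 0.10),
                           cost_per_contact=1.0, revenue_per_sale=30.0)

print("contact everyone:", everyone)
print("contact top 10% :", targeted)
```

With these invented numbers the targeted campaign contacts one tenth of the people and still keeps most of the expected sales, which is exactly the double effect described in the two points above.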
You can find a lot about the basics of how data mining works for marketing in some of my previous posts:

Posted by: zyxo | June 22, 2012

But what IS a professional ?

According to wikipedia, a “professional is a person who is paid to undertake a specialized set of tasks and to complete them for a fee”.

A more exhaustive definition in the same wikipedia article contains words like “expert”, “specialized knowledge”, “excellent skills”, “master in a specific field”.

In my long career and in my non-work time I saw (and still see) a lot of “professionals” who only meet the first line of this blog post: they are paid for what they do, but they are far from doing what they are supposed to do, or delivering the quality you could expect from a real professional.

Some real life examples ?

  • a cobol programmer (a hired consultant) who started the job by taking a two-week vacation to learn his first cobol syntax
  • a mason who had never heard of using a brick cut down to two thirds.  He only knew whole and half bricks.
  • an orthopedist (with a good reputation!) who erroneously concluded that I had whiplash while all I had was tendinitis in the upper arm.
  • bankers all over the world who caused the bankruptcy of their enterprises by neglecting the most basic asset-liability balance rules.

My own definition of professional : ” someone who wants to get paid for his work or his time”.  The most difficult part for the one who has to pay is to find out whether that so-called professional is worth the money.

Posted by: zyxo | June 14, 2012

The Titanic will go on: How to avoid the iceberg

Nature of the Titanic’s damage wrought by the iceberg. (Photo credit: Wikipedia)

I am sure most of you will remember the moment in the movie “Titanic” when someone asks “Why doesn’t she turn?”. It happens after they have discovered that they are steaming right into an iceberg and have tried to steer away.
Why did she not turn?

The iceberg was spotted, the captain had given the order to turn, the guys in the engine room had made their adjustments, the first officer had turned the steering wheel.
So why did she not turn?

In the company I work for I saw this question asked many times. The CEO and board of directors had spotted the problem (loss of market share, diminishing profits, unexpected changes in the business landscape, or whatever can go wrong with a company). They had ordered a change of course, they had moved a lot of senior managers from one place in the organization to another. They had published some new company charter. But still the company was heading straight ahead. No quick turn left or right. Just keeping the old ways of doing things.
Why? INERTIA.

How can you expect a speeding 46,000-ton Titanic to quickly change its direction by simply changing the position of its mere 500 kilos of rudder?
How can you expect a speeding thousands-of-employees enterprise to swiftly change course by having a mere handful of senior managers switch chairs?
As long as the multitude of employees just keep doing what they have been doing for years, the same way they have been doing it, nothing will change significantly.

And still, there always are some people who would love to change (all day long, they sing : “My heart will go on” :-). There also are a whole lot of people who would accept to change. There even are some people who would go along with the change when all the others do it. But then there is this huge bulk of inertia. People who will not or cannot change no matter what. As long as you keep this inert mass aboard, the Titanic will keep its momentum in the direction where it was going before and nothing will change significantly.  Not fast enough in any case.

So get rid of the inertia. Build a lighter boat with the changers and then you will be able to take that quick turn to avoid the iceberg.
Perhaps you will even be able to change the course of the iceberg itself…


Here I use some quotes (in italic) from the original article on Everything You Wanted to Know About Data Mining but Were Afraid to Ask and add some personal thoughts.  So, yes, it will be “more”!


“We know that data is powerful and valuable.” …  
“Data mining allows companies and governments to use the information you provide to reveal more than you think.”

Do not be fooled.  Not everything is data mining. To know more about the difference between for instance mere data gathering and real data mining, you should read “is reading a newspaper data mining“.


“To most of us data mining goes something like this: tons of data is collected, then quant wizards work their arcane magic, and then they know all of this amazing stuff. But, how? “

How? That is exactly what I described in Se7en steps in finding knowledge nuggets.  Oh yes, and not like the polite textbook stuff, but how it happens in real life.


“And these days, there’s always more data.”
  “The sheer scale of this data has far exceeded human sense-making capabilities.”

Yes, but in most cases it is not necessary to use all of this data.  Most companies are only beginning to mine small parts of their own data.  This stands in striking contrast to what large software and hardware vendors want us to believe.  After all they want to sell their products and the accompanying consultancy.  In most cases a free software package like Weka or R is more than sufficient.


“Data mining is used to … allow us to infer things about specific cases based on the patterns we have observed.”

That is the core task of data mining: detecting patterns.  And this can be any kind of pattern, as will become clear in the following paragraphs.  But in fact it is relatively simple.  There are two kinds of patterns: those which can be detected by unsupervised learning and those we detect using supervised learning.

Supervised means that we decide what we want to reveal, we have a specific problem to solve. Examples: how can we select customers who are highly likely to buy product X?  How can we identify customers who will not be able to make their mortgage loan payments?  How can we identify fraudulent tax returns?

Unsupervised means that we will not decide anything.  We will let the data speak and just see which patterns emerge.


“Anomaly detection : in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.”

Another use for anomaly detection is for example to detect errors in the data.
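
As a minimal sketch of the idea in the quote, the Python snippet below flags values that lie far from the typical case, using a simple z-score. This is only one of many possible techniques, and the data, the threshold of 2.0 and the function name `find_anomalies` are assumptions made up for this illustration.

```python
import statistics


def find_anomalies(values, threshold):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (a plain z-score rule)."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Invented example: declared amounts, one of which is clearly unusual.
declared = [100, 102, 98, 101, 99, 103, 97, 100, 101, 250]

print(find_anomalies(declared, threshold=2.0))
```

The same rule catches data errors just as well as fraud: a value of 250 among values around 100 may be a cheat, or simply a typing mistake.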


“Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses.”

In most cases the buying patterns are not the only variables that are used. The other variables, like age, amount of money spent, frequency of buying etc. prevent the modeler from using association learning.  Instead he uses some other predictive algorithm (classification) where the actual purchases are just one variable type among the various other variable types.
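
For the pure shopping-basket case from the quote, association learning boils down to counting. The Python sketch below computes the two usual measures, support and confidence, for the cocktail rule. The five baskets are invented and the function names are hypothetical; a real system would search all possible rules instead of checking a single one.

```python
def support(baskets, items):
    """Fraction of baskets that contain all the given items."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return support(baskets, antecedent | consequent) / support(baskets, antecedent)


# Invented baskets, each one a set of purchased products.
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book"},
    {"martini glasses"},
    {"recipe book"},
]

rule_if = {"cocktail shaker", "recipe book"}
rule_then = {"martini glasses"}

print("support:   ", support(baskets, rule_if | rule_then))
print("confidence:", confidence(baskets, rule_if, rule_then))
```

Note that nothing but the purchases themselves goes into this calculation, which is exactly why variables like age or amount spent push the modeler towards classification.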


“Cluster detection: it is possible to let the data itself determine the groups. … in a simple example we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.”

This is one of the most difficult topics in data mining, not because the algorithms are difficult, but because in most cases the results have no actual business meaning, unless you take good care of the difficult preparatory work.  I described this problem in Distances, the biggest challenge in clustering. Perhaps you are also interested in the difference between clustering and segmentation.


“Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. … Spam filters are a great example of this … to notice differences in word usage between legitimate and spam messages”

Other applications: classifying customers into those who will buy product X and those who will not; classifying customers into those who will be able to make their mortgage loan payments and those who will not; classifying tax returns into fraudulent ones and good ones; classifying customers into those who are going to churn and those who are not; classifying stocks into those that are going to rise and those that are going to drop.

There are a lot of algorithms to calculate classification models : decision trees (and the more complex decision tree based algorithms), support vector machines, (logistic) regressions, ant colony optimization, genetic algorithms.


“Data mining, in this way, can grant immense inferential power. … it is how most successful Internet companies make their money and from where they draw their power.”

Not only internet companies, but banks, credit card companies, each and every company that is big enough to afford to pay people and software to at least try some data mining and see what comes out of it.



***  If you liked what I wrote, perhaps you should consider clicking one of the banners on top of this post and become one of those who help make our world a better place to live?  ***



Posted by: zyxo | April 10, 2012

The most important decision in data mining

When you take on a data mining project, you have a lot of decisions to make:

Which data mining software will you use?

Which algorithms will you use?

Which hardware will you use?

Which data sources will you use?

What will be the size of your hold-out samples?

Will you calculate derived variables? Which ones?

How will you measure the quality of  your model(s)?

How will you deploy your model(s)?

Did you notice? I forgot one. I forgot the most important one.

If you are a data miner, you should know already : The target.

What ?

THE TARGET !

Yes, but is that not something that they order you to predict? Is it YOUR decision?

It is a fact that a prediction model of the right target, even a mediocre one, is much better than a good prediction model of the wrong or suboptimal target.

Let me give some examples of decisions to make:

  • Are you going to calculate the probability to buy, or are you going to calculate the probability to buy online?
  • Are you going to calculate the probability to buy beer, or are you going to calculate the probability to buy a particular kind of beer?
  • Are you going to calculate the probability to buy product XYZ, or are you going to calculate the probability to buy at least nnn items of product XYZ?
  • Are you going to calculate the probability to buy product XYZ, or are you going to calculate the probability to buy product XYZ after a purchase of product ABC?
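
To show how much the choice of target changes what a model learns from, here is a small Python sketch that builds three different targets from the same purchase records. The records, the customer names and the helper `build_target` are invented for this illustration, and the "at least 3 items" check looks at single purchases only, which is a simplification.

```python
# Invented purchase records; customer D bought nothing.
purchases = [
    {"customer": "A", "product": "XYZ", "channel": "online", "quantity": 1},
    {"customer": "B", "product": "XYZ", "channel": "shop", "quantity": 5},
    {"customer": "C", "product": "ABC", "channel": "online", "quantity": 2},
]
customers = ["A", "B", "C", "D"]


def build_target(customers, purchases, is_hit):
    """Return {customer: 0 or 1}; 1 if at least one purchase of that
    customer satisfies the target definition `is_hit`."""
    hits = {p["customer"] for p in purchases if is_hit(p)}
    return {c: int(c in hits) for c in customers}


# Three target definitions on exactly the same data.
bought_xyz = build_target(
    customers, purchases, lambda p: p["product"] == "XYZ")
bought_xyz_online = build_target(
    customers, purchases,
    lambda p: p["product"] == "XYZ" and p["channel"] == "online")
bought_3_xyz = build_target(
    customers, purchases,
    lambda p: p["product"] == "XYZ" and p["quantity"] >= 3)

print(bought_xyz)
print(bought_xyz_online)
print(bought_3_xyz)
```

Customers A and B are both "buyers" under the first definition, but each of the two narrower definitions keeps only one of them, so a model trained on one target will single out different people than a model trained on another.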

Can you think of other examples?  Let me know and I will gladly add them to the list.

You must admit that it is not up to your marketer alone to decide these things.  You should decide them together with him.  Be very aware of the fact that these decisions are more important for the commercial result of your marketing campaigns than your choice of the best algorithm!

Posted by: zyxo | April 2, 2012

Evolution of knowing=?=evolution of religion

How come we know what we know?

We can distinguish several stages of getting knowledge, which have evolved from what monocellular organisms know, into what we, humans, know.

Knowing (Image via RottenTomatoes.com)

Stage 1. Knowing without learning.

An earthworm does not learn.  It just does what it feels it must do: digging tunnels, eating soil, copulating, etc.  It’s all in its genes. It never has to learn anything.

This holds for thousands of other animal species.  They never learn.  They just do what they have to do, without even consciously knowing what they are doing.  Let’s say it’s all just biochemistry.

The necessary information to do the right thing at the right moment has grown in their tiny brains just like the other organs in their body.   Evolution has taken care of that.

Stage 2. Knowing by learning from experience.

Ever seen a mother cat bring a live mouse to her kittens?  She lets them play with it, to let them experience what it means to handle a live mouse.  The little ones have to try, fail, try again and learn by doing it.  Just like you and I have to learn to drive a car or to play the guitar.  We have to feel it.  We have to experience it, we have to fail and fail again and become better and better at it.  Even Jimi Hendrix began that way.

Cover of Jimi Hendrix

The necessary information to do the right thing at the right moment is stored in our brains bit by bit, by experiencing the right and wrong moves.

Our environment provides the necessary conditions to experience.  And what we experience is the real world, the true world.

Stage 3. Knowing by learning from others.

This is a totally different sort of learning.  You just listen to what someone else says (or you read what he/she has written) and then you know.  If someone tells you that you have to push the red button, not the green one, to open a particular box, well, without ever having seen the box you know how to open it.

The necessary information was put in our brain just by listening, or by reading.

So far so good.  But the third form of knowing is a dangerous one.

What if that other person told you it is the green button, instead of the red one?  You would push the wrong button.

Learning from others means believing, without having experienced it yourself.  It even opens the possibility for learning and believing erroneous information.


Do you believe that an iPhone is actually better than a Samsung Galaxy?  Why?  Have you tested it?  Is it because someone told you so?  Is it because the person with the iPhone was more enthusiastic than the one with the Samsung Galaxy?  Is it because you saw more ads about the iPhone than about the Samsung Galaxy?  Do you actually have any clue why you believe it (or not)?

Learning from others also opened the door to religion: if you believe in a God, why is this?  Did you actually meet a God?  Did you see him?  Or is it just because someone told you that everything there is was created by a super creature, like God, Allah, Zeus, Quetzalcoatl, or whatever?

More on the evolutionary origin of religion can be found in this New Scientist article.

Posted by: zyxo | February 16, 2012

Will we forget about Alzheimer ?

I remember our first radio (when I was a little kid), our first television set, our first computer, my first cell phone, my first smartphone …

During my whole life technology has rushed forward at an ever-increasing speed.

Where do we go from here?

Let us try just one possibility, namely to combine some recent developments like the internet, cloud computing and storage and mind-machine interfaces.

It seems I will have some future doctor put a device in my head that can read my thoughts and can communicate directly with the outside world. At the same time of course this device will be connected with some wireless protocol to the internet (Wifi, 3G,4G,5G…).

As with dropbox I would have some account in the cloud where I pay for extra brain capacity, eventually doubling, tripling my natural brain capacity (nice! finally I will be smart)

So forget about Alzheimer’s.  My brain extension in the cloud will make up for each and every loss of memory or thinking capacity.

Not only will I be able to connect directly via the internet to my friends and relatives, but eventually my cloudbrain (when I am sleeping for example) will directly connect to the cloudbrains of my “friends” in the cloud.  When my biological me wakes up in the morning, I will synchronize with my cloudbrain and be instantly up to date with what happened in the world overnight.

Eventually, when / if my biological body dies (but why should it ? I will replace used parts …)  my cloud self will continue to survive… if someone keeps paying my bill.

Just some thoughts. Possibilities are endless…

And there would be some side-effects :

With the built-in GPS I would always know where I am and how to get somewhere.   I would have to buy some decent antivirus, anti-spam and firewall.  I would never forget anything I see, hear or read. Even more: I will be able to search (Google cloudbrain search?) in my memory and “think” the URL to someone else.

Ah yes, social media would be entirely different. Not only sharing thoughts and images, but also real feelings, smells, senses …

“At work” would not exist any more.  No need for PCs, phones, desks, meeting rooms.  Everything would happen on the servers of the company with direct input from our brains.

I dare not imagine what terrorists will do or what wars will look/feel/smell like…

Is there a brave new world coming??

Posted by: zyxo | January 15, 2012

Will men ever live as long as women ?

This post is about the expected longevity of European men and women.
But it is also about the interpretation of a simple linear relationship.

First the figures.

The chart shows the relationship between the expected longevity of men and women in the various European countries.

relationship between the age of men and women

As we can see, they are highly correlated. Which is a good thing: it means that in countries with favourable life conditions, both sexes can profit.
We also observe that women live on average longer than men.  In no country does men’s average life span surpass 80 years, whereas women’s does in most countries.

The next chart shows the ratio (in %) between the life expectancy of women and that of men, plotted against the average life expectancy of men.

Three things are obvious :

  1. the ratio is >100% in all countries, meaning that everywhere women live longer than men
  2. the ratio goes down as men live longer, meaning that the advantage of women decreases, that men catch up.
  3. the regression line crosses the 100% line at an average age of men of about 87 years
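For readers who want to redo the third point with their own figures, here is a minimal sketch of how the crossing point of a regression line can be computed. The three data points are invented for illustration (they are NOT the European figures behind the chart), and the function names `fit_line` and `crossing_age` are my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def crossing_age(intercept, slope, level=100.0):
    """Age at which the fitted line reaches the given ratio (100% = equal)."""
    return (level - intercept) / slope


# Invented illustration data: average age of men, and women/men ratio in %.
men_age = [70.0, 75.0, 80.0]
ratio_pct = [117.0, 112.0, 107.0]

intercept, slope = fit_line(men_age, ratio_pct)
equal_age = crossing_age(intercept, slope)
print(equal_age)
```

Keep in mind that such a crossing point lies outside the range of the observed data, so it is an extrapolation and should be read with caution.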

Question : is this just a “men” effect?  Let us look at the same ratio in relationship with the average age of women:

We see approximately the same relationship, although the points are somewhat more scattered around the line. The equal-life-expectancy is situated here at about 91 years. Not exactly the same as in the previous graph, but let’s not quarrel about trifles.

Conclusion : if living conditions continued improving to the point where women (or men) reach an average age of about 90 years, both sexes would have an equal life expectancy.

We could state the above otherwise : as women seem to have an advantage over men, an improvement in living conditions favours men more than women, such that at an average age of 90, the disadvantage of men would be wiped out.

Remains the question : what is the actual advantage that women have over men?  Why do they live longer?

This article on Time Health gives reasons such as “women develop cardiovascular diseases 10 years later than men”, “women have two X chromosomes which give them more genetic material and hence more diversity, resulting in an advantage”, “men have something like a testosterone storm when they are around their 20’s which makes them often behave dangerously”.

Another article in the Daily Mail speaks about a genetic advantage of women over men because men are more disposable. Women (especially long ago when we were hunter-gatherers) had to live long enough to raise their children.

And this article from Harvard University points the finger at the menopause, which causes women to stop giving birth to children. This gives them time to take care of their children and grandchildren.

So the next question : why does the age advantage of women decrease in countries where it’s better to live?  Does their genetic advantage decrease?

I believe that it’s something else. In the better countries, social structures, health care etc. are so much better that the environmental dangers decrease: better health care, safer cars, safer toys etc. diminish not only the genetic advantage of longer living mothers and grandmothers but also diminish the danger caused by the testosterone storm.  So if there is no danger any more, the danger-countermeasures that women have, would become worthless.

What happens then, when in some country both men and women reach an average age of 90?  That would perhaps be the indication that all avoidable dangers, accidents, crimes, diseases, or whatsoever have been removed by adding the necessary infrastructures and countermeasures.  This would then be an ideal country (from the point of view of longevity) where the only deaths would be from old age or incurable diseases that make no difference between the two sexes.

Any other ideas, interpretations? Do not hesitate, I will gladly read your comment.

Other posts on the differences between men and women :

Are men and women different species ?
Imbalance of cheating

Further reading on possibilities for living longer : http://ieet.org/index.php/IEET/more/brin20120108


Yesterday I got an interesting comment on a previous post on evolution.
I thought my answer would be too elaborate for a reply, hence this reply-post.

Tias Dailey writes the following (bolds are mine):

“You wrote that in one winter, a population of birds could be affected by natural selection because the small birds die off, leaving the larger birds. The thing is, natural selection always has a narrowing effect on the variation in a population. Understand that in your scenario, large birds did in fact exist before the natural selection. So that in itself is not evolution, but only narrowing of the gene pool. So that scenario doesn’t show that evolution can occur quickly.
To show that evolution can occur quickly, you would need to show that new features can arise quickly—features that were not present before.”

In fact, Tias makes 2 statements here :

  1. Natural selection always has a narrowing effect on the variation in a population.
  2. Narrowing of the gene pool in itself is not evolution.
The first conclusion we draw from these 2 statements is purely logical : since 1) natural selection always has a narrowing effect and 2) a narrowing effect is NOT evolution, it follows that natural selection cannot be the cause of evolution.
In the above, we assume that both statements 1) and 2) are right.  [As many will know, it is always dangerous to assume (ASS U ME)].
So when does evolution occur ?  If it is not when natural selection occurs (as a result of some sort of more severe environmental pressure) then it must occur in the opposite situation : when the environmental pressure is relaxed.  Under those circumstances inheritance / mutation / recombination can do a lot more without being naturally-selected away.  In other words: the variation in the population increases and new features (e.g. a bird that’s larger than any previously existing individual of its species) can see the light. Aha, we have evolution.
But let us look at Tias’ 2 initial statements.  Are they correct ?
For the first one : OK, I agree.  Natural selection does weed out the non-fit outliers and narrows the population variation.
For the second one : NOK.  Why should narrowing of the gene pool in itself not be evolution ?  By the way… What IS evolution ?
Let’s look at some definitions:
  • Change in the genetic composition of a population during successive generations, as a result of natural selection acting on the genetic variation among individuals (the free dictionary)
  • Biological evolution … is change in the properties of populations of organisms that transcend the lifetime of a single individual. The ontogeny of an individual is not considered evolution; individual organisms do not evolve. The changes in populations that are considered evolutionary are those that are inheritable via the genetic material from one generation to the next. Biological evolution may be slight or substantial; it embraces everything from slight changes in the proportion of different alleles within a population (such as those determining blood types) to the successive alterations that led from the earliest protoorganism to snails, bees, giraffes, and dandelions. (talkorigins)
  • Biological evolution is defined as descent with modification.   Biological evolution occurs at different scales. These include small-scale evolution and broad-scale evolution. Small-scale evolution, also referred to as microevolution, is the change in gene frequencies within a population of organisms changes from one generation to the next. Broad-scale evolution, also referred to as macroevolution, refers to evolution at a grander scale. It focuses on the progression of species or entire clades from a common ancestor to descendent clades over the course of numerous generations. (animals.about)
  • Evolution is any change across successive generations in the heritable characteristics of biological populations. Evolutionary processes give rise to diversity at every level of biological organisation, including species, individual organisms and molecules such as DNA and proteins. (Wikipedia)
So what do we see ?  “change in genetic composition”, “changes in populations…that are inheritable”, “change in gene frequencies”, “change in the heritable characteristics of biological populations”.
So what about narrowing of the gene pool ?
  • This IS change in genetic composition.
  • This IS inheritable.
  • This DOES change gene frequencies
So IMHO narrowing of the gene pool is evolution.  Evolution does not always add new features.   Losing capabilities as a result of evolution is called regressive evolution.  Example: the European mole, which started to live beneath the ground and lost its vision capabilities.
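To make the “change in gene frequencies” argument concrete, here is a small sketch. It assumes a made-up population of four birds with one hypothetical gene that has a “large” allele (L) and a “small” allele (S); none of this comes from real data, it only illustrates the bookkeeping.

```python
from collections import Counter


def allele_frequencies(population):
    """Return the share of each allele in a list of genotypes (tuples of alleles)."""
    counts = Counter(allele for genotype in population for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical population before a harsh winter: two alleles per bird.
before = [("L", "L"), ("L", "S"), ("S", "S"), ("S", "S")]
# Hypothetical survivors: the two smallest birds (S, S) did not make it.
after = [("L", "L"), ("L", "S")]

print(allele_frequencies(before))
print(allele_frequencies(after))
```

No new allele appeared, yet the frequency of L doubled. According to the definitions above, that change in gene frequencies already counts as evolution.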
To finish let me make a comparison.  Is it necessary to walk from New York to Rio De Janeiro in order to prove you can walk ?  Nope!  If I can show you that I can walk 10 steps, you will believe I can walk.
Likewise, is it necessary to show the emergence of a totally new feature like for example wings in order to prove evolution ?  Nope! If we can show a change in the genetic composition of a population, then we have shown evolution at work.

Loose coupling
Once upon a time, when the descendants of the neanderthals had invented something called object-oriented programming (and when I worked in IT), one of the good qualities of a good object-oriented design was “loose coupling“.
Loose coupling in object orientation means that software objects are loosely coupled with one another such that you can easily modify one of them without influencing the rest. The opposite of spaghetti code, where you pull on one spaghetti string and it starts to move on the other side of your plate.
Obviously loosely coupled designs are much more stable than tightly coupled ones.

Clustering
One of the more interesting unsupervised data mining algorithms consists in finding clusters in a data cloud such that the clusters themselves are tight, but the clusters are far away from one another. In other words : the average distance between observations from the same cluster is far less than the average distance between observations from different clusters. In some way, clusters are loosely coupled.
Obviously a segmentation based on really loosely coupled clusters is much more relevant than one with very near or ill-separated clusters.
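As a rough illustration of “tight inside, far apart outside”, the sketch below compares the mean distance within clusters with the mean distance between clusters. The four points and their labels are invented, and the function name is my own; it is a simple check, not a clustering algorithm.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def mean_within_and_between(points, labels):
    """Mean pairwise distance inside clusters and across clusters."""
    within, between = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        (within if label_p == label_q else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Invented example: two tight clusters, ten units apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

within, between = mean_within_and_between(points, labels)
print(within, between)
```

The larger the second number is compared to the first, the more “loosely coupled” the clusters are.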

Evolution
Evolution occurs when populations of organisms change under environmental pressure. When organisms of the same species live under different environmental circumstances they will change in different directions without influencing other populations a lot. That’s the way they can evolve into different species. Some sort of loose coupling.
Obviously different populations that are loosely coupled can much more easily evolve in different directions than populations that are still connected, with a high exchange rate of individuals.

Greece
Apparently Greece is NOT loosely coupled financially and economically from the rest of Europe and the rest of the world. Though it should be! The world is becoming one tightly coupled system such that when you or I fart, they smell it in Japan or the Bahamas.
That’s why in some places they start much smaller with what they call “local currencies“. And here, “local” means exactly what it means : local, in one town or even one neighbourhood.  This allows these towns or neighbourhoods to do their thing more efficiently than when they are tightly connected to the rest of the country via the national currency.
Obviously the EURO was not such a good idea after all ?

Posted by: zyxo | September 13, 2011

Se7en steps in finding knowledge nuggets.


Have you ever seen a bunch of children play happily in some forest for a couple of hours and return home nice and clean ? No way. When they are healthy they should return covered with dirt and mud. Only then have they really played.

“Returning home nice and clean” that’s the feeling I get when I read Scott Levine‘s “7 steps of knowledge discovery in databases“. It is correct what he writes, but it all seems so uncomfortably clean.
That’s why here I write down my own version :

“Se7en steps in finding knowledge nuggets”.

 
Step 1.Try to understand what your business user really needs (not what he/she asks).
Know for sure that your business user almost never asks for what he needs. No, somewhere he has a problem, he figures out for himself some sort of solution and based on that solution he asks you to do some mining work. If you just do it like that, I guarantee you that the result will never be satisfactory. Instead you’ll have to challenge him, ask him what he wants to accomplish with the result of your work, ask him what the problem is he wants to resolve and together find a -usually better- solution.

Step 2. Figure out what new data you need and which data-mining algorithms and/or statistical tests you need to use.
Now begins the creative part : when you know what your user needs, figure out how you can deliver. Which data ? Which algorithms? You should perform the entire analysis in your head, or sketch it on a piece of paper. Some time ago I even got into the habit of paradoxically starting the analysis by writing down the end report (of course without any results). But it forces you to think the whole thing over beforehand and not afterwards when it’s too late.

Step 3. Playing detective in trying to find the data tables, and the right selection criteria to get the specific data that you need.
The real world is not like having all the data exposed in front of you, nicely aligned and ordered according to some obvious criteria. No, often all you know is that it is there somewhere in some table. Now starts the dirty work : call/email people to ask if they know someone who knows …
When finally you get the info, you access the data, just to find out something is clearly not right. So you call/email people to ask if they know … until finally you are pretty sure you have what you wanted.

Step 4. Merge the newly found data with what you want to use from your existing datamart
This is the more easy part and pretty straightforward.

Step 5. Get this data “mining-ready”
Depending on the algorithms you want to use your data has to meet some criteria : e.g. no nominal variables, must have a normal distribution, no missing values etc … Can be pretty tough to get everything in order.

Step 6. Mine that data (run algorithms, draw your conclusions)
That’s the exciting part. Not really difficult or dirty, because you had it all prepared. The part where you actually see what happens, the part where you discover the knowledge nugget, the part where you shout “YESS!”, or the part where you realise : ”Shit, is that it ? That’s all ?”.

Step 7. Convince your business user that this result is all you can get out of it, even if it looks (afterwards) ridiculously obvious.
Now you come out of the woods, covered in mud, holding high your nugget, or almost empty-handed, or somewhere in between (gold-dust?)
You have to explain to your business user what’s the worth of your model, what he can do with it, how it can influence his marketing campaign results and eventually withstand his somewhat accusing look of “that’s all you got for me?”.

(Step 8. Afterwards just get a shower and prepare to find your next nugget.)

Posted by: zyxo | September 9, 2011

A victim far away means less than one nearby.

As I am typing this, I suppose that about every second someone gets killed or injured by some accident.  And still, I am typing this, without a lot of sorrow.  Not that this means nothing to me, but when somebody I do not know dies on the other side of the world, well, it is a sad reality, but I do not care very much.

How does it affect you when someone dies or gets severely injured by accident?

An obvious answer to that question is : “it depends”.  On what ?

There are several factors that influence the effect an accident has on somebody:

  • distance : how far is it from where you live ?
  • familiarity : is that person a close relative? Is he a friend, a colleague ?  Someone you know from the media?
  • Number of casualties : is it one person?  hundreds? (compare one person hit by a truck in your village to thousands of them killed in the 9/11 disaster)
  • time : how long is it since you first heard of it (time heals all wounds).
  • age : an old man dying seems less severe than a child
  • health : a very sick (already dying) person, killed in an accident seems less dramatic than a healthy one.

When we combine these factors, we can say that the impact on someone (I) of an accident/disaster is positively correlated with the number of casualties (N),  the familiarity(F) and the health (H) and negatively with distance (D), time (T) and age (A).

So a simple equation would be : I=(N+F+H)/(D+T+A)

But now remains the problem of the units.  Distance can be measured in meters or kilometers and amount to thousands of them, whilst age can reach at most about 100 years.  It seems attractive to put every factor on a scale of 0 to 100.

Age is the most simple one and can be used as such.

Number of victims : a nice measure is 10 times the logarithm of 1+ the number of victims. The table below shows us how nicely it goes from zero when there are no victims to near to 100 when the entire world population is extinguished.

A similar approach can be used for distance.  Here I use 20 thousand kilometers as the maximum, it’s the other side of the world.

Time is also a rather easy one.  I took the first table as a basis and adjusted it somewhat. X is the number of days, and I added a number of years column.  We reach 100 after 27 thousand years.

This leaves us with familiarity and health.  It is not my purpose to elaborate that in detail. So I suggest that a perfectly healthy person has a score of 100, someone who is already dead is scored as zero and let us use our gut feeling to assign the intermediate values.  For familiarity we can use a similar approach: a value of 100 represents the persons we love the most, like our children, our husband or wife.  0 stands for people we do absolutely not know and do not care at all.

And now let us look at some examples.


1.The Banda Aceh tsunami.

The impact now, after 7 years, on somebody on the other side of the world is :

Our formula : I=(N+F+H)/(D+T+A)

Let us assume that those 250 thousand victims were fairly healthy and on average 35 years old, and that we care a little bit about those people.

If we do the math then we find that the impact on you now = (54+3+97)/(100+49+35) = 0.84

A low figure, is it not?  But frankly, how often do you still think of that disaster, unless you live in that unlucky  part of the world ?

2. Now entirely different : suppose your perfectly healthy child of 1 year old who lives with you dies in an accident (I sincerely wish this will never happen!)

If we do the math then we find for the first day that the impact on you =(3.01+100+100)/(0+0+1) = 203.

Take a look at the formula. What would happen if that child was a newborn one?  The resulting value would be infinite!  Hence I propose to limit the Impact Value to a maximum of 100.

The formula then becomes : I=min ( 100 , (N+F+H)/(D+T+A) )  which means that if our calculated value is smaller than 100,  we accept it, otherwise we just take 100.
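For those who want to play with the numbers, here is a minimal sketch of the capped formula. It assumes that “the logarithm” means the base-10 logarithm (which reproduces the 54 and the 3.01 used above) and that a denominator of zero should simply return the cap. The other five scores are passed in as ready-made values on the 0 to 100 scale, and the function names are my own.

```python
import math


def victim_score(victims):
    """10 times the base-10 logarithm of (1 + number of victims)."""
    return 10 * math.log10(1 + victims)


def impact(n, f, h, d, t, a, cap=100.0):
    """I = min(cap, (N + F + H) / (D + T + A)), all inputs scored 0..100."""
    denominator = d + t + a
    if denominator == 0:
        # The "newborn" case: the ratio would be infinite, so return the cap.
        return cap
    return min(cap, (n + f + h) / denominator)


# Example 1: the tsunami, seven years later, seen from the other side of the world.
tsunami = impact(victim_score(250_000), f=3, h=97, d=100, t=49, a=35)
# Example 2: a healthy one-year-old child living with you, on the first day.
child = impact(victim_score(1), f=100, h=100, d=0, t=0, a=1)

print(round(tsunami, 2), child)
```

With the cap in place, the second example no longer yields 203 but the maximum of 100.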

I know there is not much science in the above, it was just an interesting (but superficial) thought exercise.

Any suggestion for improvement is welcome.

(Some further reading on this subject : The new problem of distance in morality)

Posted by: zyxo | August 26, 2011

Why computers go bananas without any reason

“After that the computer froze a few times over the course of a couple days, so I assumed… So, I have no clue what is going on”.

“My computer randomly freezes,… What might be the problem?”

“Your computer was working fine, but then suddenly started locking up… Sometimes random lockups can be attributed to the computer memory…”

When you google “computer freezes” you get thousands of desperate people asking for help. Mostly it can be solved by checking hardware, software etc.

But occasionally something goes wrong for no reason whatsoever, and then it never happens again. Why?

At work we had such a problem: less than once a year our SAS software refused to run our programs. Exactly the same programs we were used to run daily, weekly, monthly without any problems. Googling the error message was no help. Obviously the software was on strike. Temporarily, because the following morning everything was back to normal.
What happened ?

After deep thought, eliminating all impossible possibilities, I came up with the only plausible explanation I could find.
This is what I wrote to my colleagues :

“Dear colleagues,
I now know the reason of the problems: it’s what we call the IT Ghost, a species of creatures from the 5th dimension which are migrating this time of the year from the Betelgeuse area to the Crab nebula and are eventually teleporting through the earth. On this occasion they can influence the spin of some Charm Quarks causing computer processes to behave erratically, with no obvious reason.
A positronic energy field of 5.000 trillion petavolt around the earth should solve the problem.”


Do you have any better explanation  :-)

Posted by: zyxo | August 5, 2011

The 2.5 ways to segment your customer base

Terabytes have been filled with books and articles about segmentation.  And we should by now expect that the most basic knowledge about it is, well … known.
Forget it !
First : what is this most basic knowledge that each and every marketeer should know?

 “What can you do with it” ?
Or, stated otherwise : how should you use it ?

Is the answer obvious ? Not at all !

Take for example the SAS white paper “A Marketer’s Guide to Analytics“.  You could reasonably expect SAS, as a major vendor of analytics software and consultancy, to know how to use segmentation.
Right?

Well, I seriously have my doubts.
They describe as “the first two enablers of the analytical framework” :
1) analytically driven, granular segmentation: enables you to identify how different customer segments are most likely to respond to specific campaigns or marketing actions.
and
2) predictive modeling: enables you to identify the specific target population likely to respond positively to a specific campaign or other marketing activity.
I get an odd feeling when I read these two “different” descriptions.  Whether I can identify how different customer segments will respond to my campaign or identify the target population that will respond in a particular way (“respond positively”) does not seem very different to me.  In both cases you want to predict the behaviour of each customer or customer group in response to your campaign.
So let us forget about software or algorithms.  Let’s think marketing.

1.  First, you want to sell your product or service.

This means you have to find out who is likely to buy it.  You call for help any tool or algorithm that can use the data in your customer base: logistic or linear regression,  neural networks, support vector machines, genetic algorithms, bayes learners, decision trees, and all sorts of segmentations.  Use whatever you like, know, have, and delivers satisfactory results.

OK, let’s say you have done this and you know who to target, you have your customer group or best segment or whatever.  Perhaps you have a lift chart or the like so you know what you can expect from your campaign. (In my earlier post “datamining for marketing campaigns: interpretation of lift” you’ll find a lot more about this topic.)

2. Second, you have, one way or another, to speak to those people.

And if there is one important issue about communication it’s that you have to send the right message to the right person.
OK, you want to sell them all your world-changing superb product.  But I’m not talking about the what, but about the how !  I’m not talking about the content of the message box you will send, but about the wrapping paper, the flavour of your message.  Will you use the same words, the same communication channel, the same colours for young women, for old men, for internet savvy whizzkids, for grandmas who never touched a computer ?

Did you notice ?  I gave some examples of customer SEGMENTS.  So that’s your second assignment : find the segments who match your communication alternatives.
A simple, but not easy, way to do this is to think, brainstorm, use your imagination and common sense, and use what you know about the people you identified in step one : look who’s in the selection, what is their age distribution, etc …
Now you have your second segmentation.

Lastly I owe you another half segmentation: In case you are not satisfied with your “communication segmentation”, you can always test it first:  Use your various communication alternatives randomly on part of the people of your selected target group.  Evaluate the results, and calculate which communication flavour you should use with which customer.  For this calculation you can use whatever you like, know, have, and delivers satisfactory results.  Then use the findings to optimise subsequent campaigns.
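A minimal sketch of such a test could look like the code below. Everything in it is hypothetical: the segment names, the two communication variants and the five response records are invented, and “best” is simply taken as the highest response rate, without any significance test.

```python
import random
from collections import defaultdict


def assign_variants(customer_ids, variants, seed=0):
    """Randomly assign one communication variant to each test customer."""
    rng = random.Random(seed)  # fixed seed so the assignment is repeatable
    return {cid: rng.choice(variants) for cid in customer_ids}


def best_variant_per_segment(records):
    """records: (segment, variant, responded) tuples from the test campaign."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for segment, variant, did_respond in records:
        sent[(segment, variant)] += 1
        responded[(segment, variant)] += int(did_respond)
    best = {}
    for (segment, variant), n in sent.items():
        rate = responded[(segment, variant)] / n
        if segment not in best or rate > best[segment][1]:
            best[segment] = (variant, rate)
    return best


# Invented test results, for illustration only.
results = [
    ("young", "email", True),
    ("young", "email", True),
    ("young", "letter", False),
    ("senior", "email", False),
    ("senior", "letter", True),
]
print(best_variant_per_segment(results))
```

In a real campaign you would want far more customers per cell, and some check that the differences are not just noise, before using the outcome in subsequent campaigns.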
Posted by: zyxo | July 26, 2011

The customer satisfaction hierarchy

Customer satisfaction is a hot topic. Numerous studies are continuously going on to get to know the enhancers and/or dissatisfiers. Depending on the branch you work in (bank, retail, internet book shop, etc), these enhancers/dissatisfiers can be very different.
Nevertheless, if we take a step back and do some abstraction, it seems that we can distinguish different levels, analogous to Maslow’s pyramid.

In “Maslow’s hierarchy of customer service”  Naumi Haque distinguishes three levels :

  1. Meeting the customers‘ expectations
  2. Meeting the customers’ desires
  3. Meeting the customers’ unrecognized needs

At frankwatching they present a four-level pyramid :

  1. trust, reliability, value
  2. timeliness, knowledgeable, responsible
  3. Caring, concerned, helpful
  4. Fun, friendly, enjoyable, entertaining

Well, it should be no surprise: below I will present my own “customer satisfaction pyramid”, which is slightly different from the two above, and for sure is put in less cryptic language.

the hierarchy is the following :

Basis : deliver what you promise, give the customer what you make him think you should give him.  This corresponds with the first level of the two pyramids above.

Second : do it fast, don’t keep your customer waiting, and do it properly, deliver it to him the way he would like it.

Third : see to it that there are no problems for the customer.  OK, nothing is always perfect, so if something goes wrong, make it as soon as possible your own problem, not the problem of the customer.  Make it easy for the customer to get problems solved.  Make sure that when the customer complains or asks for help, you give him a reassuring, easy, satisfied feeling.    Keep it easy for him, and do the hard work yourself to make him happy.
(This one was not mentioned in the two pyramids above.)

Finally : create a WOW effect

In short : optimise in this order :  the WHAT’s,  the HOW’s, the  CURES and the WOW’s

Posted by: zyxo | July 10, 2011

Is reading a newspaper “Data mining” ?

Data mining is a hype.  As a result everything is called data mining.  I suppose reading a newspaper to find some interesting information is called “data mining” by some people too.

However there is only one problem : not everything IS data mining.

To clear this mess a bit, in what follows I list and explain several activities that are sometimes (mistakenly) called “data mining”.

Data extraction 

“the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing” (wikipedia)

“Data extraction software can enable agencies to collect data on the race, gender, and ethnicity for the person(s) owning the majority of rights, equity, or interest in a business.” (Mozenda)

My definition is simple : you get the data from somewhere with some data extraction program.  What you do afterwards with that data is not relevant.

Reporting

Is making a report : “Report is a piece of information describing, or an account of certain events given or presented to someone“. (wikipedia)

“Reporting is just a genre of writing, alongside essays and stories, and bloggers most certainly fall into that genre. Imho, when they talk about reporting on a show like Frontline, they mean the process a reporter goes through.” (Scripting.com)

This seems a bit more complicated than data extraction.  I would say : “extracting from whatever sources of data/information those pieces of information that are sufficiently important and structuring/presenting them to be communicated to your audience, customers, boss or whatever other party”.

My definition: reporting is not showing raw data, but some communicable description.  This can be in the form of tables, charts, structured drawings, or simply words.

Statistics

“statistics is … a distinct mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.” (wikipedia)

“methods to collect, analyze and interpret data” (Nebraska university)

“collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting and then drawing conclusions” (Akila)

This is a very broad definition, and it obviously has a lot to do with data.

For me, apart from “data”, the words that are most important here are “science”, “methods” and “interpretation”.  Statistics is not just extracting data or reporting; here we have to do better.

Hence my definition : we use some mathematical method(s) to extract the right data, to interpret the data, to draw conclusions based on mathematics and to present these results/conclusions.

Data mining

This is the most difficult one, and most misunderstood.

Some definitions:

“the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.” (wikipedia)

“the process of analyzing data from different perspectives and summarizing it into useful information” (UCLAAnderson)

“Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items.” (about.com)
“Data mining is the discovery of hidden knowledge, unexpected patterns and new rules in large databases.” (E.Thomas)
The most important words or expressions here are : “extracting patterns”, “analyzing data”, “uncover relationships”, “discovery of knowledge”.
So my definition is: searching in data collections (databases, the internet) for information that was not put there deliberately, but nevertheless can be derived.
And one more thing : reading a newspaper is definitely NOT data mining :-)

Here is my very personal view on some settings of decision trees.

Maximum depth :

Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.

My opinion is straight and simple : NEVER use maximum depth to limit the further splitting of nodes.  In other words : use the largest possible value.

I suppose some explanation is necessary.

When you grow a decision tree, different leaves in the splits normally contain different numbers of observations.  Using the tree depth totally disregards these differences.  It could stop the splitting of a leaf containing 25,000 observations on one side of the tree, whereas on the other side, which contains far fewer observations, a leaf with only 30 observations could still get split.  This makes absolutely no sense!
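
To make the argument above concrete, here is a minimal sketch in plain Python. The two helper functions and the threshold values (a maximum depth of 5, a minimum splitsize of 100, and the depths of the two nodes) are my own illustrative assumptions, not taken from any particular decision tree package; only the node sizes 25,000 and 30 come from the example above.

```python
def may_split_by_depth(depth, max_depth):
    """Depth-based stopping rule: only the depth of the node matters."""
    return depth < max_depth


def may_split_by_size(n_observations, min_split_size):
    """Size-based stopping rule: only the number of observations matters."""
    return n_observations >= min_split_size


# Two hypothetical nodes in the same tree (depths are assumed values).
big_node = {"depth": 5, "n": 25_000}   # deep, but still holds a lot of data
small_node = {"depth": 3, "n": 30}     # shallow, but almost empty

MAX_DEPTH = 5          # assumed setting
MIN_SPLIT_SIZE = 100   # assumed setting

for name, node in (("big", big_node), ("small", small_node)):
    by_depth = may_split_by_depth(node["depth"], MAX_DEPTH)
    by_size = may_split_by_size(node["n"], MIN_SPLIT_SIZE)
    print(f"{name} node: split by depth rule = {by_depth}, by size rule = {by_size}")
```

With these assumed settings the depth rule blocks the node with 25,000 observations and still allows the node with 30 observations to split, while the size rule does the opposite.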

Minimum splitsize

Minimum splitsize is a limit to stop further splitting of nodes when the number of observations in the node is lower than the minimum splitsize.

This is a good way to limit the growing of the tree.  When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).

Now the key question : at what number should we set the limit ?

Answer : it depends.

  • are you just growing one tree or do you want to create an ensemble (bagging, boosting …) ?  If you create an ensemble, some overfitting of the individual trees is permitted, because the ensemble will take care of it: it combines the trees by taking the mean or some other aggregate measure.
  • how many independent variables (predictors) do you have?  The more variables you have, the bigger the chance of finding some accidental relationship between one of the variables and the target.  So with a lot of variables you should stop earlier.
  • how many observations do you have? With a limited number of observations you do not have the luxury of stopping early, or you will end up with no tree at all.  With a lot of observations you can stop early and still obtain a large enough decision tree.

With hundreds of variables I normally use a minimum splitsize in the range of the number of observations divided by a few hundred.
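
As a rough illustration of that rule of thumb, the sketch below turns it into a small function. The function name and the default divisor of 300 are assumptions on my part, standing in for “a few hundred”; the right value depends on your data and should be tuned by experiment.

```python
def rule_of_thumb_min_split_size(n_observations, divisor=300):
    """Number of observations divided by 'a few hundred'.

    divisor=300 is an assumed stand-in for 'a few hundred', not a fixed rule.
    """
    if n_observations <= 0 or divisor <= 0:
        raise ValueError("n_observations and divisor must be positive")
    # Never return less than 1, so the result is always a usable setting.
    return max(1, n_observations // divisor)


# Example: a hypothetical data set with 150,000 observations.
print(rule_of_thumb_min_split_size(150_000))               # 500
print(rule_of_thumb_min_split_size(150_000, divisor=200))  # 750
```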

Minimum leaf size

Minimum leafsize is a limit that prevents a node from being split when the number of observations in one of the child nodes would be lower than the minimum leafsize.

Splitting a node into two or more child nodes has to make some statistical sense.  What, for example, is the sense of splitting a node with 100 observations into the following two child nodes: one with 99 observations and one with 1 observation?  It is all a bit like doing a chi-squared test.  A good rule of thumb says that you should never have fewer than five observations in any one of the cases.  I would say the same goes for decision trees, as long as you deal with the same number of observations you normally use to calculate chi-squared tests.  It is known that i) with a large number of observations chi-squared tests are no longer appropriate and ii) decision trees are not a good algorithm for small numbers of observations (say fewer than 500).  So you should set the minimum leafsize larger than 5.  I usually take 10% of the minimum splitsize (in a bagging ensemble).
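
The leaf size rule can be sketched in the same way. The function names below are hypothetical, and the lower bound of 6 is just my reading of “larger than 5”; the 10% figure is the one I mention above.

```python
def min_leaf_size_from_split(min_split_size, percent=10, lower_bound=6):
    """Take a percentage of the minimum splitsize, but stay above 5.

    lower_bound=6 is an assumed encoding of 'larger than 5'.
    """
    return max(lower_bound, min_split_size * percent // 100)


def split_is_allowed(child_sizes, min_leaf_size):
    """A split is allowed only if every child node is large enough."""
    return all(n >= min_leaf_size for n in child_sizes)


min_leaf = min_leaf_size_from_split(500)
print(min_leaf)                               # 50
print(split_is_allowed([99, 1], min_leaf))    # False: one child is far too small
print(split_is_allowed([300, 200], min_leaf)) # True
```

The 99/1 split from the example above is rejected under any minimum leafsize larger than 1, which is exactly the kind of statistically meaningless split this setting is meant to prevent.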

Conclusions

There is only one way to know the best settings : try, try and try again ! This is because all projects and data sets are different. Do you have your own rules of thumb ? Please, don’t hesitate to let me know !

Posted by: zyxo | June 17, 2011

Datamining and privacy: don’t shoot the pianist

The internet is full of reactions to and opinions about data mining and the corresponding privacy issues. Even insults like the example below, aimed at data miners or top executives of data mining enterprises, are no exception.

But is data mining always so bad?
Domains like medical applications where data mining could save your life fall without any doubt on the good side of the picture.

But even marketing can be a justified reason to use data mining results:

  • some customers explicitly want to stay informed about new products or services that are within their area of interest
  • in a lot of cases data mining is used to do less mailing instead of more: do not contact people who are not going to buy anyway.
  • some product/service offers can be so rightly targeted that the targeted people think: “Wow, right! why didn’t I think of that myself ?”  Because of data mining in that case we actually provide them with a free service, more or less reminding them not to forget the things they actually need.  Of course this is the ideal situation.
Unfortunately there are all those less ethical initiatives out there, but that has nothing to do with data mining as such.  Has a rifle ever been condemned for killing someone?  No!  It’s the shooter, the one who uses the rifle, who is the criminal.  The same goes for data mining.  We data miners are only the pianists.  We play music.  The ones who record our music and broadcast it much too loud are the ones to be blamed.

