Posted by: zyxo | August 31, 2009

Link list for august 2009

Dear reader, enjoy my list of interesting articles I found in August.

Ten Great Ways to Crush Creativity
Top 10 Signs You Might Be a Geek
Social Media’s Top 10 Dirty Little Secrets
Piwik open source web analytics
the power and value of a fan (in social media)
open source text analytics
billion dollar gram
DNA computation gets logical
The Remotest place on earth
How to Ask a (Near) Stranger for a Favor


This is a post about the lift of a data mining model for marketing campaigns.

(But if you want to keep it really simple, read first this really simple explanation of lift)

The topics discussed are :

  • Definition
  • Wait ! Why lift and not AUC ?
  • Your selection size determines your lift
  • Your target class proportion determines your lift
  • How can you use lift for model comparison ?
  • Large lifts, still small profits?
  • To simplify : use returns instead of lift

Definition (Wikipedia) : … measure of the performance of a model… The lift of a subset of the population is the ratio of the predicted response rate for that subset to the predicted response rate for the population.

Wait ! Why lift and not AUC ?
The Area Under the ROC Curve is often cited as a better, general measure of model quality. It allows you to compare different models. OK, right, but 1) try explaining AUC to your HiPPO’s, and 2) when you use your model for marketing campaigns you are only interested in the performance of a small selection of your customers, the ones with the best scores.
So : lift is simple, and you can make it even simpler. Read on.

Your selection size determines your lift
If you read the definition, you saw : … “lift of a subset of the population” … Normally you take a subset of the population with the best scores, the ones you would use in a campaign.
The size of your selection determines the upper limit of the lift.

  • 100% of the population : lift is by definition = 1, meaning the performance of this selection = the performance of the total population. Of course, it is useless to build a data mining model and then use the entire population anyway.
  • 50% of the population : Upper limit = 2
  • 25% of the population : Upper limit = 4
  • 10% of the population : Upper limit = 10
  • 5% of the population : Upper limit = 20
  • 1% of the population : upper limit = 100
  • etc.

You see it makes no sense to say : “my model has a lift of 10”. On its own this means nothing : it depends to a great extent on the selection size.
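The upper limits in the list above are just the reciprocal of the selection fraction. A minimal sketch (the function name is mine, and it ignores the second cap discussed in the next section, the target class proportion) :

```python
def max_lift(selection_fraction):
    """Upper limit of lift for a selection covering this fraction of the
    population : even a perfect model can at best put 100% responders in
    the selection, i.e. 1 / fraction times the population response rate."""
    if not 0 < selection_fraction <= 1:
        raise ValueError("selection fraction must be in (0, 1]")
    return 1 / selection_fraction

# Reproduces the table : 100% -> 1, 50% -> 2, 25% -> 4, 10% -> 10, 5% -> 20, 1% -> 100
for fraction in (1.0, 0.5, 0.25, 0.10, 0.05, 0.01):
    print(f"{fraction:.0%} of the population : upper limit = {max_lift(fraction):g}")
```

This is only the geometric cap : a real model on real data will sit well below it.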

 

Your target class proportion determines your lift
Imagine you want to predict how many people in a selection will buy product A or B during, say, the next week. Let’s say that normally you sell 100 pieces of A and 5,000 pieces of B in a week, and that the two products are equally predictable, meaning that the two data mining models are of comparable quality. So which model will show the higher lift for equal selection sizes ?
Again the proportion of the target class determines the upper limit of the lift :

  • if you only have 5,000 customers in your database, the lift for product B will be … 1. Since every client buys product B in a week, you cannot get any higher with a model.
  • if you have 50,000 customers, the highest possible lift will be 10. Why ? The proportion of buyers in the entire population is 10%. If with a very good model you can make a selection of 5,000 (or fewer) where everyone buys, then you get 100% buyers in your selection, which is 10 times better => hence a lift of 10.
  • if you have 5,000,000 customers and your model enables you to make a selection of 5,000 (or fewer) where everyone buys, you compare 100% buyers with the 0.1% buyers in the entire population, which gives you a lift of 1,000 !
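The three scenarios above follow directly from the definition : the response rate in the selection divided by the response rate in the whole population. A small sketch (function and argument names are mine) :

```python
def lift(buyers_in_selection, selection_size, buyers_total, population_size):
    """Lift = response rate in the selection / response rate in the population."""
    selection_rate = buyers_in_selection / selection_size
    population_rate = buyers_total / population_size
    return selection_rate / population_rate

# A perfect selection of 5,000 customers who all buy product B :
print(lift(5_000, 5_000, 5_000, 5_000))      # -> 1.0    (everyone buys anyway)
print(lift(5_000, 5_000, 5_000, 50_000))     # -> 10.0   (base rate 10%)
print(lift(5_000, 5_000, 5_000, 5_000_000))  # -> 1000.0 (base rate 0.1%)
```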

How can you use lift for model comparison ?

The way I do this is rather straightforward. I take the lift of all my models for the same selection size and plot the lift against the proportion of the target class.

This gives something like this, with lift on the vertical axis and the target class proportion on the horizontal axis :

[Figure : lift plotted against target class proportion]

See how the star is lower than the flower ? Nevertheless, the “star” model is of better quality, because its lift is among the best compared with the other models at the same target class proportion, whereas the “flower” model is relatively poor.

Large lifts, small profits ?
What does a large lift mean for the return of a marketing campaign ? Absolutely nothing !
The return of a marketing campaign depends on (among others) :

  • the fixed costs of the campaign (building the model, administration, …)
  • the variable costs of the campaign (publicity, cost per letter when using snail mail, …)
  • the number of surplus sales (= the number of sales during the campaign minus the “normal” number of sales : the expected number if you ran no campaign)
  • the gain in $ per surplus sale

So where does the data mining model come in ? The number of surplus sales depends on the impact of the e-mail, letter or phone call on the client’s behaviour : will he/she buy, whereas without the e-mail he/she would not ? If, thanks to the data mining model, you selected a very good target group, the impact will be bigger.
And now the lift :
Case A : 1) normally you sell 1,000 pieces to 5% of your customers. 2) You select a target group of 5,000 customers with a sales rate of 10% (=> lift = 2). 3) The e-mail impact doubles the success rate, which means that you sell 1,000 pieces to that target group of 5,000 customers. Hence you get 500 surplus sales.
Case B : 1) normally you sell 20 pieces to 0.1% of your customers. 2) You select a target group of 1,000 customers with a sales rate of 1% (=> lift = 10). 3) The e-mail impact doubles the success rate, which means that you sell 20 pieces to that target group of 1,000 customers. Hence you get 10 surplus sales. If each sale of product A is worth the same amount of $ as product B, it is clear that the high lift in case B is worth much less than the lower lift of product A.
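A sketch of the surplus-sales arithmetic (function name mine; case B is taken as a target group of 1,000 at a 1% rate, the numbers consistent with 20 pieces sold and 10 surplus sales) :

```python
def surplus_sales(group_size, model_rate, uplift_factor):
    """Extra sales caused by the campaign : sales to the target group
    with the campaign, minus the sales you would have made anyway."""
    baseline = group_size * model_rate
    return baseline * uplift_factor - baseline

# Case A : 5,000 customers at a 10% rate (lift 2), e-mail doubles the rate
print(surplus_sales(5_000, 0.10, 2))  # ~500 surplus sales
# Case B : 1,000 customers at a 1% rate (lift 10), e-mail doubles the rate
print(surplus_sales(1_000, 0.01, 2))  # ~10 surplus sales
```

The lift enters only through `model_rate`; the profit depends just as much on the size of the group being lifted.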
Lift is just … lift. You have to lift something. It means more to lift a huge quantity a little bit than a tiny quantity a lot.

So do not waste your time to develop targeting models for products that do not sell !

To simplify : use returns instead of lift
For some HiPPO’s lift is still too complicated. In that case use a simple return chart : take the lift chart in this post, but replace the lift by the percentage of buyers. It is very easy and it is much closer to business language. You can tell directly whether using the model is worthwhile.

[Figure : %-age of positive targets in relation to the selection size]

Notice that the “lift curve” shows how much the %-age of positive targets is “lifted” above the baseline (random selection).
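Such a return chart can be built directly from model output : rank the customers by model score and report the percentage of buyers in each top fraction. A minimal sketch (function name and data layout are my own, not from the post) :

```python
def response_rate_chart(scores, outcomes, fractions=(0.01, 0.05, 0.10, 0.25, 0.50, 1.0)):
    """Percentage of positive targets (buyers) among the top-scoring fraction
    of the population -- the 'return chart' alternative to a lift chart."""
    # Sort outcomes by descending model score
    ranked = [y for _, y in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    chart = {}
    for f in fractions:
        k = max(1, int(n * f))          # size of the top-f selection
        chart[f] = sum(ranked[:k]) / k  # buyer rate in that selection
    return chart

# Toy example : 4 customers, the 2 with the best scores are the buyers
print(response_rate_chart([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0],
                          fractions=(0.5, 1.0)))  # -> {0.5: 1.0, 1.0: 0.5}
```

The value at fraction 1.0 is the baseline (random selection); dividing each point by it would give back the lift chart.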

Other posts you might enjoy reading :
How many inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Good enough / data quality

Posted by: zyxo | August 19, 2009

The direction of evolution : speed matters !

[Image via Wikipedia : Richard Dawkins' The Selfish Gene]

Evolution does not know any direction. Genes that have the highest proportion in the next generation stay in the race; the others gradually disappear. It is the (changing) environment that dictates the direction, not (Darwinian) evolution.

(But see this post : direction is inevitable towards two-legged two-armed humanlikes … : I’m not really a believer in this).

Why Darwinian evolution (variation, selection, reproduction) ? Is there another form ?

Let’s go back to when there was no life yet. Was there evolution ?
Assume you think : “no, there was no evolution”. So when did evolution start ? The very moment that the first living thing came into existence ?
Wait a minute ! Some dead thing evolved to become that first living thing ? Was that not evolution yet ?
I agree that it was not Darwinian evolution : reproduction was lacking.
But still, it was some sort of evolution : variation, selection, and production. A bit like the evolution of our cars nowadays. They do not reproduce, they are produced; there is variation, and they are selected by the consumers.

But before life, evolution was all very slow.
To speed things up, reproduction was … evolved, “invented” by evolution. It was a paradigm shift. Instead of being produced by chance, entities were constructed in such a way that, in the right environment, they were copied. Imagine the advantage of that speed gain (speed = number of copies produced per time unit). So Darwinian evolution was selected as an advantageous strategy. Numerous enhancements evolved, like proper genes, cell structures and entire organisms; you all know that.

But after a while, another strategy evolved : instead of relying on genes to carry all necessary information from one generation to the next, organisms evolved that passed on information directly to each other and to their offspring in the form of memes : culture, communication, education, whatever you call it. Numerous enhancements evolved, like writing, telephone, blogging and twitter; you all know that.
It’s where we stand nowadays.

But after a while, another strategy will evolve. As making predictions is difficult, especially about the future, the only thing I can offer is a guess : instead of relying on biological organisms to carry all necessary info from one generation to the next, artificial (non-biological) organisms will evolve and test new information in artificially intelligent programs, models, whatever, and select it before incorporating it into the next generation. Fully automatic scientists already exist (their inventors called them Adam and Eve). So the next paradigm shift will be something like evolution without biology.

Did you enjoy this post ? Then you might be interested in the following :
Human evolution : the future of men
top-10 lists on evolution
Evolution of minerals
Evolution in blue and red
The end of evolution

Posted by: zyxo | August 11, 2009

New laws of robotics

[Image via Wikipedia : ASIMO at Expo 2005 in Japan]

I am sure you all know the three laws of robotics, invented by Isaac Asimov :
* A robot may not injure a human being, or through inaction, allow a human being to come to harm.
* A robot must obey orders given to it by human beings, except where such orders would conflict with the First Law.
* A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Are these the only necessary laws ? Are these three (good) enough ? Are there alternatives ?

Recently David Woods and Robin Murphy made up three new laws (Want responsible robotics? Start with responsible humans and The 3 Laws of Robotics – Modified) :

* A human may not deploy a robot without the human-robot work system meeting the highest legal and professional standards of safety and ethics.
* A robot must respond to humans as appropriate for their roles.
* A robot must be endowed with sufficient situated autonomy to protect its own existence as long as such protection provides smooth transfer of control which does not conflict with the First and Second Laws.

Two differences from the “old” ones :
1) the first law is for humans;
2) the other laws are less specific and hence applicable in more situations.

But Woods and Murphy are not the only ones to rework, discuss or suggest alternatives to the three original laws. Some examples :

Asimov himself : The zeroth law : A robot may not harm humanity, or, by inaction, allow humanity to come to harm

Asimov’s Laws of Robotics
Implications for Information Technology

HOW will a robot identify a human being?

3 laws unsafe
Asimov’s Laws of Robotics Are Total BS

ten ethical laws of robotics

open the future : five laws of roboticists

New Laws of Robotics proposed for US kill-bots

30 laws (not to be taken seriously 🙂 ) :

And even a contest : Winner of the “Maker’s – Three laws of robotics” contest

Humans, robots or Cyborgs ?

The most interesting issue here is the clear distinction Asimov makes between humans and robots. In his books, humans were pure humans and robots were 100% robots.
But our future world is not the world of Asimov. Nowadays many people are not 100% human any more : artificial hips, knees, teeth, eyes, heart valves, pacemakers and whatnot are more and more part of our humanity.

On the other hand, experimental robotics does not limit itself to metals or plastic. A recent experiment involved a living rat brain steering a sort of simple robot.

If we continue this way, the differences between humans and robots will disappear. And not only between humans and robots, but also between humans and some animals with enhanced brains (= implants will make them as smart as most humans).

So what about the laws of robotics ?
IMHO lawmakers will have to make laws for everyone, without distinctions of species, technical specifications or whatever else you could use to discriminate between living beings.

Enjoyed this post ? Then you might be interested in the following :
– Web 5.0: The telepathic web
– Robotic insects or cyber-insects ?
– Self reassembling Robot
– Human brain copy protection by AnyMind Inc.
– Humans 2.0

Posted by: zyxo | August 2, 2009

evolution can occur in less than 10 years

[Image via Wikipedia : male and female young guppies]

How fast can evolution take place ?
We usually think of evolution as a process that takes thousands or millions of years.
WRONG !

OK, the really big changes need some time to show up. Evolving from a dinosaur into a bird cannot happen overnight.

But it does not take centuries to see evolution. These studies on guppies show that evolution can go a lot faster.

It all depends on your definition of evolution. For example, R. P. Worden, in a somewhat theoretical article, uses the “rate of increase of Genetic Information in the Phenotype” to measure evolution speed.
I find this a bit silly. If the genetic information does not increase but nevertheless changes, is that not evolution too ? It is as if you said that two different books of 350 pages each are the same book because they have the same number of pages !

I think evolution works on a much smaller scale : whenever the genetic content of a population has changed between two generations, there has been evolution !
This simply means that evolution takes place constantly, because there are always tiny changes between successive generations.

In the guppy article, the experiment consisted of putting guppies in two different environments. Everyone familiar with evolution knows that evolution is caused by changes in the environment.

Important changes cause fast evolution, like the guppy experiment, where they saw significant evolution in 10 years.

But it can go even faster : during a cold winter, on average the smallest birds in a population suffer most, because they lose more heat than the bigger ones (Bergmann’s rule). This means that after an extremely cold winter the average size of a bird population has increased, because the big ones survived better. Since size is inherited, it is obvious that the genetic content of the population has changed. IN ONLY ONE WINTER !

Did you enjoy this post ? Then you might be interested in the following :
Human evolution : the future of men
top-10 lists on evolution
Evolution of minerals
Evolution in blue and red
The end of evolution

Posted by: zyxo | August 2, 2009

link list of interesting articles (july 2009)

Dear reader, here is my list of interesting articles I found in July. Enjoy reading.

Twitter Better: 20 Ways to Filter Your Tweets

Britney Spears’ Guide to Semiconductor Physics

Bare essentials of safety from Air New Zealand

8 levels of analytics

The generation M manifesto

Born to cheat

Using bacteria as a computer

artificial brain 10 years away

Posted by: zyxo | July 25, 2009

mathematics of information

When I saw the site of the Center for the Mathematics of Information, I got a somewhat funny feeling : mathematics of information. Is that the same mathematics we use for calculations on, for example, money-related topics ? I doubt it, because in the past I have heard things like :

  • if you share information with someone, it doubles, because now both of you possess the information.
    Very unlike sharing money !
  • Drowning in information : if you have more than you can handle and you add still more information, you end up with less. This is also called “information overload”.
    With money it is simple : adding money makes you richer, even if you are Bill Gates.
  • Does negative information exist ? So that when you receive it, you know less than before ?
    (Here I do not mean information about unpleasant situations or the like.) I think it is possible : suppose you are convinced of something (your info is that X is true). Then you receive new info and as a result you are not sure any more; maybe it is not true after all.
    But there is a much more serious explanation of negative information : according to physicists, quantum information really can be negative.

Did you enjoy this post ? Then you might be interested in the following :
Information overload, filters and Web 3.
How many inputs do data miners need ?
Simplexity : new word about old situations
The family of PI
Is Google God ?

[Image via Wikipedia : Target Corporation]

A study by Duncan Irschick at the University of Massachusetts drew my attention. It says :

Men Are More Accurate than Women When Hitting a Target with Force in the Dark

The story in itself is interesting, but what particularly struck me was that it was, quote, “… a small study …”, end quote.
I totally agree with that, since they “…tested four male and three female adults”.

Yes, right : 4 men and 3 women.

The first reflection of somebody with a statistical/data mining background is : “How on earth can a self-respecting scientist publish results on differences between men and women with such a small sample ?”. And I do not just mean self-respecting : he is also respected by others, a first-class scientist.

So there must be something else. Could it be that with such a small sample you can indeed do some thorough statistics ?

Let us try it out.

The case at hand is men and women hitting something with a hammer. I do not know the details, but for the present purpose it is simple to use some fake data.
Let us take an extreme case :
if they must hit some target, suppose that the four men missed the target by 20, 18, 22 and 21 centimeters respectively. The women, being much more accurate, missed only by 3, 5 and 6 centimeters.
With a simple t-test we find that the two means of 20.25 for the men and 4.67 for the women are significantly different (p=0.0003; two-tailed).
So if all 3 women are far better than all 4 men, we have a proven case !

With one woman being a bit less accurate than one man we get the following : let us assume that the best man, instead of missing by 18 centimeters, misses by 10 centimeters, and that the worst woman misses by 11 centimeters instead of 6.
The difference is still significant (p=0.0227; two-tailed).

Let us try a third one : we take case 2 but make two of the three worst men somewhat better and the second-best woman somewhat less accurate (men : 10, 14, 15, 22; women : 3, 8, 11) : it is not significant any more (p=0.0664; two-tailed).
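The three cases can be checked with a hand-rolled pooled two-sample t statistic (a sketch; the original post does not show its calculation, and the data are the fake numbers above). With 4 + 3 − 2 = 5 degrees of freedom, the two-tailed 5% critical value is 2.571, so any |t| above that is significant :

```python
import math

def pooled_t(sample_a, sample_b):
    """Two-sample Student t statistic with pooled variance
    (what a standard 'simple t-test' on two small groups computes)."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    ssa = sum((x - ma) ** 2 for x in sample_a)  # sum of squared deviations
    ssb = sum((x - mb) ** 2 for x in sample_b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (ma - mb) / se

T_CRIT = 2.571  # two-tailed critical value at alpha = 0.05, df = 5

# Case 1 : men miss by 20, 18, 22, 21 cm; women by 3, 5, 6 cm
print(pooled_t([20, 18, 22, 21], [3, 5, 6]))   # ~12.5, far beyond 2.571
# Case 2 : best man improves to 10, worst woman worsens to 11
print(pooled_t([10, 20, 22, 21], [3, 5, 11]))  # ~3.09, still significant
# Case 3 : men 10, 14, 15, 22; women 3, 8, 11
print(pooled_t([10, 14, 15, 22], [3, 8, 11]))  # ~2.24, below 2.571
```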

So even with such small samples it can be perfectly acceptable to draw conclusions.

But for a data miner, who is used to working with millions of observations, it still feels a bit weird !

Did you enjoy this post ? Then you might be interested in the following :
How many inputs do data miners need ?
Oversampling or undersampling ?
are men and women different ?

Posted by: zyxo | July 8, 2009

Chromosome numbers, evolution and lies

[Image via Wikipedia : metaphase chromosomes]

A certain Kent Hovind has apparently turned a “spoof” into a serious matter. In “Opossums, Redwood Trees, and Kidney Beans” he writes (but obviously does not believe it himself) that evolution goes in the direction from few to many chromosomes, meaning that we started as a Penicillium with two chromosomes and are evolving in the direction of a fern with 480 chromosomes. Of course, total rubbish.

Here you can find other discussions by Kent Hovind on the subject and here the wikipedia description of the man.

The question is : is evolution following a certain direction like :
– getting bigger
– having more genes
– having a larger brain
– having a larger total length of the nervous system

I would say : NO

Evolution is simply an adaptation to changing environments. It is the environment that dictates the direction of evolution. If it becomes colder, individuals that better resist cold are at an advantage and consequently the mean cold resistance of the population increases. If afterwards it becomes warmer, evolution is forced in the opposite direction.

Remember : evolution has no purpose whatsoever; it is only the consequence of selection, which is not random but favors those individuals that are best adapted to the environment.

Did you enjoy this post ? Then you might be interested in the following :
top-10 lists on evolution
The pope believes in evolution
Human evolution : the future of men
Evolution towards Intelligent Design
The end of evolution

Posted by: zyxo | July 6, 2009

Simplexity : new word about old situations

[Image via Wikipedia : red 2 × 4 LEGO brick from the LDraw parts library]

What is simplexity ?
Until a few weeks ago I had never heard or seen the word. It seems cute, original, and most of all scientific and difficult.

What is it all about ?
According to wikipedia it is an ” emerging theory that proposes a possible complementary relationship between complexity and simplicity.”
Professor Peter Wippermann (also on wikipedia) proposed a social definition :
“Simplexity therefore stands for a balance between the growing complexity of daily life and our own personal satisfaction.”

But searching a little further I found this : ” Simplexity: in systems theory a term for the emergence of simple features as a direct (though possibly highly intricate) consequence of a system of rules. Jack Cohen and Ian Stewart. “The Collapse of Chaos: Discovering Simplicity in a Complex World.” New York: Penguin, 1994. p. 399.” on Simplexity.co.uk.
So the word is not all that new.

If you read some posts on my blog, or other writings about emerging patterns, hierarchies and the like, you will probably find that these definitions ring a bell : there is nothing new, only the name.
It is all about 1) making complex things simple (the Romans already knew : “divide et impera / divide and conquer”), which is what all our analysis methods are about : cut the complex monster into simple pieces and the whole becomes simple; and 2) making complex constructions with simple items or rules : think of Lego, fractals, swarm intelligence, a book written with 26 different letters, or our DNA blueprint written with 4 different nucleotides.

One of the most recent examples is twitter : how does such a simple system (messages of 140 characters) give birth to such a huge and complex hype, with hundreds of twitter tools and applications ?

The book

Perhaps the hype (?) about simplexity comes from a book by Jeffrey Kluger : “Simplexity : Why simple things become complex (and how complex things can be made simple)”. Kluger also wrote an article about it in Time.
I must confess : until now I have not read the book, but if you believe others, then here you can find more :

Did you enjoy this post ? Then you might be interested in the following :
A bunch of tools for twitter
Do stock traders show swarm intelligence ?
The end of emergence
Evolution towards intelligent design

Posted by: zyxo | June 29, 2009

Link list of interesting articles (june 2009)

[Image via Wikipedia : Santa Cruz de Barahona]

Enjoy reading !

Twitter’s Ten Rules For Radical Innovators
New Twitter Research: Men Follow Men and Nobody Tweets
Ground zero brandbuilding
Google vs. Bing: The Blind Taste Test
Google Analytics Learning
Half of your friends lost in seven years
Endless copyright free music
People who wear rose-colored glasses see more
Predictive powers: a robot that reads your intention?
Stress makes your hair go gray
Computing in the quantum dimension
Metadata floating around in the real world: always and everywhere online !
Baseball infographics and other visual treats
Website Analytics Toolbox
Find and share logins for websites that force you to register
192 Creative, Smart & Clever Advertisements
Complexity Papers Online
Asking a machine to spot threats human eyes miss
There is much more consensus among men about whom they find attractive than there is among women
Reading the brain without poking it

Posted by: zyxo | June 27, 2009

List of animal species with 46 chromosomes


Humans have 46 chromosomes. But what about other animal species ?

There are surprisingly few comparative lists of chromosome numbers to be found on the internet. I admit : it does not make a lot of sense. What would be the scientific value of that ?

Just out of curiosity I searched around, and as far as I know the list below is the only list of animal species with 46 chromosomes. I mean : at first it was the only list. Then many people bluntly put copies of it all over the internet without even mentioning the source. So feel free to use/copy small parts of this list. But do not copy the entire list; that’s just plain stealing. Instead, I would very much appreciate it if you provided your audience with a link to here.

[Images : the list of animal species with 46 chromosomes]

Did you enjoy this post ? Then you might be interested in the following :
Human evolution : the future of men
top-10 lists on evolution
Cost and benefits of complexity in evolution and data mining
Evolution of minerals
Evolution in blue and red

Posted by: zyxo | June 22, 2009

How to steal energy ?

Photovoltaic Solar Energy
Image by compujeramey via Flickr

On physorg.com I found this post about a supermarket that “taps” energy from the cars that drive into the parking lot.
It made me wonder : what are the other possibilities for letting others pay for your energy ?

Here are some examples I came up with. Feel free to add yours in a comment.

CARS :

  • passing cars in front of your house
  • traffic lights that tap energy from the passing cars to power themselves
  • a system that collects road taxes, in the form of energy, on high-intensity roads

PEDESTRIANS :

SPORTSMEN / -WOMEN :

  • fitness centers : how much energy could be produced if the fitness machines stored the energy generated by the people using them
  • tennis rackets that recharge batteries placed in the grip

OFFICES :

  • key presses on keyboards and mouse movements, to power the screen or webcam, …

I am sure there are a lot more examples to be found.

Another category is not the “stealing” of energy from people, but the recycling of energy that otherwise would be wasted. But that is for another post.

Did you enjoy this post ? Then you might be interested in the following :
Solar power ring : enough energy to fry the earth
What comes first ?
The limit of power
Science-fiction gadgets are near
No free will ?

Posted by: zyxo | June 9, 2009

Do it standing up !

[Image via Wikipedia : a rugby union scrum]

Years ago I had the pleasure of working in a team of excellent people who had the habit of organizing a meeting every evening as the last part of the daily work. It was a short, quick meeting where we went over what had been done that day, what problems had to be solved, and what had to be done tomorrow. Very simple but very efficient.
After this project I went back to the old rhythm of weekly, biweekly or monthly (depending on my bosses) long, boring, unproductive meetings, and I never experienced those short but extremely efficient daily meetings again.

Recently I stumbled upon this article by Martin Fowler on daily stand-up meetings. He gives an extensive description of how to organize these meetings, and they contain everything I missed from our daily evening meetings.
It is clear from the following that these daily “scrums”, as they are also called, come from the software development world.

The wikipedia definition :

“A stand-up meeting (or simply stand-up) is a daily team meeting held to provide a status update to the team members. The ‘semi-real-time’ status allows participants to know about potential challenges as well as coordinate efforts to resolve difficult and/or time-consuming issues. It has particular value in Agile software development processes, such as Scrum, but can be utilized in any development methodology.

The meetings are usually time boxed to 5-15 minutes and are held standing up to remind people to keep the meeting short and to the point. Most people usually refer to this meeting as just the stand-up, although it is sometimes also referred to as the morning rollcall or the daily scrum.

The meeting is usually held at the same time and place every working day. All team members are expected to attend, but the meetings are not postponed if some of the team members are not present. One of the crucial features is that the meeting is intended to be a status update to other team members and not a status update to the management or other stakeholders. Team members take turns speaking, sometimes passing along a token to indicate the current person allowed to speak. Each member talks about his progress since the last stand-up, the anticipated work until the next stand-up and any impediments they foresee.

Team members may sometimes ask for short clarifications but the stand-up does not usually consist of full fledged discussions.”

Here is what others say about daily stand-up meetings :

– The daily stand up meeting is not another meeting to waste people’s time. It will replace many other meetings giving a net savings several times its own length. (extremeprogramming.org)

– There are plenty of other things to improve, but a daily stand-up meeting is low-hanging fruit. It is easy to implement and returns immediate gains. (codebetter.com)

– Done properly, the daily Scrum will achieve it’s own results, however handled incorrectly it can become a time wasting social hour (David’s comment on mitchlacy.com)

– Daily Scrum is a powerful tool, but as any other tool it is good, when you know what it’s useful for and have some experience in using it. … The important part is the goal, not the method. (agilesoftwaredevelopment.com)

– … how the team can synchronize their work and progress by meeting every day for a quick (15-20 min) status update and report on impediments (intranet.5amsolutions.com)

– Projects get to be late one day at a time, so it seems logical to have a daily team meeting to ensure you are all on track (www.scrumlabs.com)

– The daily stand-up meeting is a crucial aspect of keeping projects moving without interruption (www.reformingprojectmanagement.com)

– … the ability to reprioritize is one of the key strengths to a fully functioning Agile process, and having this opportunity every 24 hours is a significant benefit. (talk.bmc.com)

– There has been several occasions where the stand up meetings saved us from troubles (specially in rush hours) (Hasith comment on railspikes.com)

– the daily stand up is often the first tool to be implemented because its low cost and management can see value in it quickly. (webascender.com)

– How Microsoft’s p&p Teams do Daily Standup Meetings (ademiller.com)

I wonder whether anyone uses this type of meeting in a context other than agile software development ?

Did you enjoy this post ? Then you might be interested in the following :
Top-10 lists on Knowledge management
Knowledge management = Change management
15 ways to use knowledge management software
The 10 most important failure factors of knowledge management.

Posted by: zyxo | June 1, 2009

Twitter, human evolution, and stock quotes

Image via Wikipedia

Look at the title of this post. Seems a rather silly combination, doesn't it ?
What do these three have in common ?

I got the idea from a post entitled “Twitter and Human evolution” by Trey Ratcliff.
Trey compares the communication between tweeple (people who tweet) with the communication between the cells of the human body, which send short messages to each other asking for and offering resources.
Seems interesting.

But he concludes that these short tweets could get humanity to act as a super-organism, where people get some sort of bottom-up decision making.

OK for the bottom-up decision making, not OK for the super-organism.
First of all : we will never know whether there is or will be a super-organism, just as our body cells do not know that there is a body.
Second, and not really objectively : I do not see how this could lead to a super-organism. Twitter being only a very small part of the internet, it would be more likely that the internet as a whole becomes a super-organism. But I find it highly improbable that one single organism (the internet) evolves into a super-organism with real mental capacities. Evolution uses large numbers of organisms and (natural) selection to end up with something meaningful. One internet is not really a large number …

And what about stock quotes ?

Bottom-up decision making due to Twitter is comparable to buying or selling stocks based on the information we find in discussion fora, in newspapers, and even on Twitter. But there is a huge difference : with stocks we also have the actual quotes, which are the real result of the combined buy-and-sell behaviour of thousands or millions of people.

With Twitter, we only have the tweets. There is no software running behind the scenes to analyse, for example, all the tweets concerning “evolution” and come up with a global picture of what people think about evolution, second by second. Of course it would be nice to have such a service !

Enjoyed this post ? Then you might be interested by the following :
Web 5.0 : the telepathic web
Do Stock Traders show Swarm Intelligence?
Swarm versus intelligence
Piqqem : Prediction market for prediction errors
swarm-information-transfer-techniques

Posted by: zyxo | June 1, 2009

Link list of interesting articles (may 2009)

Popular popularity
Software that maps dreams
Origins and evolution of altruism
Scale-free thinking
Curve ball visual illusion

Posted by: zyxo | May 31, 2009

Dangerous to click this link !

Hi, this is just a test to see whether you people can resist “dangerous” links.
Sorry for bothering you. Perhaps you can enjoy other posts of my blog. It’s free …
Zyxo

Posted by: zyxo | May 31, 2009

After GIGO comes GIQO

Manure, a field in Randers in Denmark
Image via Wikipedia

Garbage In, Quality Out.
Is that not a dream ?
Well no, it is reality.
Here is a list of examples where garbage comes in and some quality product is produced :
drinking water out of urine in the space station
a useful targeting data mining model out of a “dirty” database
perfectly healthy vegetables out of a garden enriched with manure (= shit!)
perfectly clear glass out of sand
useful products like construction blocks out of garbage
gasoline out of garbage

I am sure there are many other examples.

Enjoyed this post ? Then you might be interested by the following :
solar power ring : enough energy to fry the earth
Evolution towards Intelligent Design
should we invest in photovoltaic cells ?
Web 5.0 : the telepathic web

Posted by: zyxo | May 24, 2009

Good enough / data quality

Detail on a bottle of Ardbeg whisky.
Image via Wikipedia

Data quality : when is it sufficient ?

Leave out the data, let us talk about quality.

First of all, here are some examples of quality “problems”.

Obviously we have to make choices, which often means settling for less than the best possible quality.
So it is with data : when is the quality good enough ?

It depends : what do you want to do with it ?

— If it is reporting : the numbers had better be correct. In a large enterprise I bet there will be two sources for the same numbers. The results will be compared and there will be trouble.
— If it is descriptive data mining, like clusterings or descriptive classifications : the data had better be as correct as possible. Errors are acceptable within reasonable limits, as long as the overall picture “fits”.
— If it is data mining for targeting purposes : the data has to be stable over time. Correct ? I do not care. Does this sound crazy ? Perhaps. But really : I do not care ! If someone puts a customer's shoe size in the “Birthday” variable, this poses no problem, because the data mining algorithm does not take the meaning of the variable names into account : “var1”, “var2”, “var3”, etc. do equally well. The only thing that matters is : how good is the predictive quality of the targeting model ? You can only obtain a good predictive model with variables that have predictive power (are related to the target) and that are stable, meaning that what a variable measures does not change over time. I do not like it when IT people correct flaws in the data : it diminishes the model quality and forces me to rebuild the models.
So better spend your time building targeting models than trying to make the data perfect. Just use the GIQO principle I just invented : GARBAGE IN, QUALITY OUT ! (a bit like the urine-to-water machine on the space station)
— If it is web analysis : that is yet another story, neatly explained by Avinash Kaushik in this post.

Did you like this post ? Then you might be interested in the following :
How many inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes

Posted by: zyxo | May 12, 2009

How many inputs do data miners need ?

The scatterplot of Iris flower data set, colle...
Image via Wikipedia

How many records do you need to make a decent data mining model ?

Let us first look at a data mining definition (you find dozens of them on the web, I just took one at random).
The automatic extraction of useful, often previously unknown information from large databases or data sets.

In most definitions we find something like “large databases” or “lots of data”, which implies that we need a huge amount of data to enjoy our data mining hobby.
Is this so ?

Anyway it is a tough question.

Let us start simple. It is all about getting information out of the data. So take three points in a plane (an x-y plot). If they fall on a straight line, the correlation coefficient is statistically significant, meaning that you do not necessarily need a lot of data to extract information from it.

But data mining was invented to overcome the problems statistics have with huge amounts of data and variables.

Candidate factors that determine the optimal number of observations are :
dimensionality : the number of variables (preferably convert categorical variables to dummies before counting !). As a rule of thumb you should have at least something like the square of the number of variables as observations (I forgot where I read or heard this).
But what about a dataset with 10,000 variables of which only 2 are really related to the target variable ? In that case there is no “curse of dimensionality”. The only problem is the storage space and computing power needed to find the two significant ones.
power : this is a difficult one, and often overlooked in statistics. Large power means a clear and large effect of the independent variables on the target variable. Small power means that there is an effect but it is very small and hence difficult to detect … unless you have a lot of observations. Let us return to the three points on a straight line : they represent a huge effect, so three points are sufficient to establish that there is a significant correlation. But what if the population is an almost circular cloud of points ? With 10,000 points on that plane you could calculate a correlation coefficient of 0.04 that is highly significant, even though the effect is tiny ! With data mining we often want to include even the smallest effects in our model to increase the prediction quality (read : “marketing campaign return”) as much as possible. So we need lots of observations to detect them.
modeling method : decision trees can handle a huge number of observations. So can logistic regressions, but since you obviously want to perform some selection of variables you want a stepwise regression, and this will take ages. And random forests can handle a lot of variables but relatively few observations. This you have to test on your own system.
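The point about tiny-but-significant correlations can be simulated in a few lines. This is just a sketch with synthetic data; the 0.04 correlation mirrors the circular-cloud example above, and the p-value uses a normal approximation.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# An almost circular cloud of points : y is mostly noise plus a tiny linear part.
x = rng.normal(size=n)
y = 0.04 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
t = r * math.sqrt((n - 2) / (1 - r * r))
# two-sided p-value from the normal approximation to the t distribution
p = math.erfc(abs(t) / math.sqrt(2))
print(f"r = {r:.3f}, p = {p:.1e}")
```

With 10,000 points the correlation usually tests as significant, even though it explains only about r² (roughly 0.2%) of the variance: a tiny effect that only a lot of observations can detect.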

The one solution I propose to estimate how many observations are sufficient but not too many : try it out !

Too much :
– if your tool/system cannot handle them any more (neural networks, logistic regressions …)
– for decision trees : if the model quality does not improve any more (tested on a hold-out dataset). Be aware that decision trees grow larger and larger as you feed them more observations, but do not necessarily get better (unless you force them to stop at a fixed number of splits, which I do not find a good idea !)

Too few :
– poor model

So what should you do ? Build a series of models with increasing numbers of observations and test them against a hold-out dataset. Continue adding observations as long as the model quality improves.
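The try-it-out loop can be sketched like this, with a deliberately simple nearest-centroid classifier standing in for a real data mining model (everything here, data included, is synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary target : two overlapping Gaussian clouds in 5 dimensions.
def make_data(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + 0.8 * y[:, None]
    return X, y

X_hold, y_hold = make_data(2_000)  # fixed hold-out set, never trained on

def accuracy(c0, c1, X, y):
    d0 = ((X - c0) ** 2).sum(axis=1)
    d1 = ((X - c1) ** 2).sum(axis=1)
    return float(((d1 < d0) == y).mean())

# Grow the training set and watch the hold-out quality plateau.
for n in (50, 500, 5_000, 50_000):
    X_tr, y_tr = make_data(n)
    c0 = X_tr[y_tr == 0].mean(axis=0)  # "model" = one centroid per class
    c1 = X_tr[y_tr == 1].mean(axis=0)
    print(n, round(accuracy(c0, c1, X_hold, y_hold), 3))
```

The hold-out accuracy stops improving once the training set is large enough; that plateau is the point where adding more observations no longer pays.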

As someone said before : it is 5% inspiration and 95% perspiration …

Did you like this post ? Then you might be interested in the following :
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Mining highly imbalanced data sets with logistic regressions

Posted by: zyxo | May 8, 2009

Game complexity

Just some complexity numbers :

Number of possible combinations for the following games :

four in a row : 10^14
checkers : 10^23
chess : 10^50
game of Go : 10^171, the most complex of all ! That is why computers still cannot beat the human masters !

Posted by: zyxo | May 5, 2009

Simple physics of snooker

Still from Media:Snooker break.
Image via Wikipedia

These last few days I watched some games of the snooker world championship at the Crucible Theatre in Sheffield. I was amazed at the incredible quality of the play, and I started to wonder : what calculations are necessary to produce such magnificent shots ?

I could google surprisingly little on the physics of snooker, and what I found was either poor physics or poor snooker.
So I decided to give a brief overview of the physics of snooker as I see it. By the way, my snooker abilities are very limited :-), but I love the game ; and since my physics knowledge is limited to some practical insights, I leave out all the theory.

To shorten this post and to reduce the complexity I will limit this overview to one single shot.

What do we need ? A snooker table, a cue stick, a cue ball (the white one) and an object ball (red or other colour).

What do we want ? The object ball has to go in the pocket, the cue ball must come to rest on a very good spot to pot the next object ball.

What are our limits ? i) We can only play the cue ball : everything we want the object ball to do is caused by the collision with the cue ball. ii) With the cue stick we can give the cue ball a forward motion and a spin (follow, draw or side, where follow and draw correspond to top spin and back spin respectively, and of course combinations of follow or draw with side). As for the object ball, only a forward motion is practically possible : since there is very little friction when two balls collide, the limited spin of the object ball is of no practical value.

( See drawing at onekaraoke.com)

OK, how will we attain these two objectives ?

First objective : the object ball must go into the pocket. The direction of the object ball is totally determined by the point where it is touched by the cue ball. When the two balls collide, simply draw a line through the centers of the two balls : this line shows the direction the object ball will take, and it should point straight at the center of the pocket.
This is the simpler of the two objectives, but it is far from simple ! The direction of the object ball ranges from 180° (it continues in the same direction as the cue ball, when that direction goes through the center of the object ball) to a theoretical 90° when the cue ball “kisses” the object ball extremely thinly on its side.

Second objective : the cue ball must stop at the exact spot on the table where we want it to stop (approximately is usually good enough). Remember the cue ball has two moving properties, direction and spin, so this is a lot more complicated than simply potting the object ball.

– Right after the collision (the first nanoseconds !) : the cue ball direction is entirely determined by the touching point. This direction can be anything between 90° (right) and 270° (left), except for the circle segment behind the object ball. The size of this segment is determined by the original distance between the two balls.
At the moment of impact, momentum is conserved. With a completely central impact this means that the object ball will follow the direction of the cue ball at the speed of the cue ball, and the cue ball will remain motionless at the exact spot where it touched the object ball. The other extreme is that the cue ball misses the object ball ; no need to tell what happens then.
Then you have everything in between : touching the object ball very thinly will cause
i) a very small deviation and apparently almost no loss of speed of the cue ball, and
ii) a very slow movement of the object ball at an angle of just above 90°. So here there is not much the player can play with to get his cue ball to the right spot after the shot.
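This first instant can be sketched in a few lines under idealized assumptions (equal masses, no spin, no friction): the object ball takes the velocity component along the line of centers, the cue ball keeps the perpendicular component, so for any cut the two departure directions are at right angles.

```python
import numpy as np

def collide(cue_v, cue_pos, obj_pos):
    """Idealized equal-mass, frictionless collision of a spinless cue ball."""
    n = obj_pos - cue_pos
    n = n / np.linalg.norm(n)        # unit vector along the line of centers
    along = np.dot(cue_v, n) * n     # velocity component along that line
    obj_v = along                    # object ball takes this component
    cue_v_after = cue_v - along      # cue ball keeps the perpendicular part
    return cue_v_after, obj_v

# A cut shot : cue ball moving along +x, object ball up and to the right.
cue_after, obj_v = collide(np.array([1.0, 0.0]),
                           np.array([0.0, 0.0]),
                           np.array([1.0, 0.5]))

print(np.dot(cue_after, obj_v))  # ~0 : the departure directions are perpendicular
```

A full-ball contact makes `along` the whole velocity, leaving the cue ball motionless, exactly as described above; spin, treated next, is what breaks this simple picture.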

– Later on after the collision (the following nanoseconds, milliseconds, seconds …) something totally different comes into play : the spin of the cue ball. Until the cue ball hits a cushion the side spin does not play an important role (and by the way, that would be too difficult to elaborate here). The vertical spin, though, has a huge effect on the cue ball direction.

Let us start with a full-ball shot, straight at the center of the object ball.

There are three possibilities :
1) the cue ball has absolutely no spin when it touches the object ball : the cue ball immediately stops and stays where it hit the object ball.
2) the cue ball is struck with draw (back spin) : after the collision the back spin and the friction with the table force the cue ball to come back in the direction it came from. The speed and distance are determined by the amount of spin.
3) the cue ball is struck with follow (top spin) : after the collision the top spin and the friction with the table force the cue ball to continue in the same direction it had before the collision. The speed and distance are determined by the amount of spin. (Note that this can cause the cue ball to disappear into the pocket too !)

It is important to realise that in 2) and 3) there is an acceleration of the cue ball after the collision (remember : in the first nanoseconds after the collision it stopped every movement), until the spin has slowed down to the point where there is no more sliding friction and the cue ball simply rolls on, meaning that the spinning velocity equals the horizontal velocity.
It is this acceleration phase that is interesting in the other case, where the cue ball hits the object ball at an angle. In that case we have
i) a straight movement of the cue ball at an angle to its original direction, and at the same time
ii) an acceleration in either the same or the opposite direction of the original movement. This causes the cue ball to follow a curved trajectory until the complete rolling phase. With this draw and follow spin, an excellent player is able to force the cue ball in nearly any direction between 0° and 180° after the collision.

And finally the side spin comes into play : when the cue ball hits the cushion. Normally it leaves the cushion at the same angle as its arrival, but on the other side of the perpendicular. With side spin the player can increase or decrease this angle ; if the arrival angle is close to the perpendicular, it is even possible to leave on the same side as the arrival.

Did you enjoy this post ? Then you might be interested in the following :

The family of PI
Evolution in blue and red
Web 5.0 : the telepathic web
No free will ?

Posted by: zyxo | April 24, 2009

Why clustering is difficult

This image is part of a series of images showi...
Image via Wikipedia

Is clustering difficult ?
You just take your data, run it through a clustering algorithm like k-means clustering, and you have your result …

Of course you could do that, but what would be the quality of the result ?

For a good clustering you have to resolve three problems :
1. which clustering algorithm to use ?
2. what definition of distance to use ?
3. choosing your clusters

1. The choice of the clustering algorithm is in my opinion the easiest of the three. I will not go into a taxonomy of possible clustering algorithms ; you can find them everywhere.

2. The first hard problem is finding a good definition / calculation of distance. Clustering is based on distances (maximizing distances between clusters, minimizing distances within clusters).
I am not talking about geographical locations here ; that is too simple, since in that case distances are … well … distances : miles, kilometers or whatever.
But try to define a distance between two customers based on, for example, 500 continuous variables like age, account balances and time since last purchase, plus some handfuls of categorical variables like gender, the type of environment they live in, whether they are married or not, etc.

What is then the distance measure ?

With the continuous variables you could calculate a Euclidean distance after converting all (standardized) variables to principal components, which are orthogonal. But what is the business meaning of such a distance ?
Does a difference of one standard deviation along variable X (e.g. total purchase amount during the last month) have the same value for the business as a comparable difference along variable Y (e.g. age) ?

The same problem arises with categorical variables. You can simply count the number (or proportion) of non-matching categorical variables. But is the difference between married or not married equally important for your business as the difference between man and woman ?

The bulk of the hard labour comes at this stage : if you want to deliver a good clustering, you first have to talk for many hours and days with your business people to know
1) which variables are relevant to the clustering (what do they want to use the clusters for ?) and which to discard ;
2) what weight to give each selected variable. Variable X can be three or ten times more important for your business than variable Y. You should take this into account.

Only then can you go to the next stage : calculating the distances.
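As a small illustration of such a business-weighted distance over mixed variables, here is a sketch; the customers, variables and weights below are entirely hypothetical.

```python
import numpy as np

def distance(a, b, cont_w, cat_w):
    """Weighted distance : Euclidean over standardized continuous variables,
    plus a weighted mismatch count over categorical variables."""
    cont_a, cat_a = a
    cont_b, cat_b = b
    cont = np.sqrt(np.sum(np.asarray(cont_w) *
                          (np.asarray(cont_a) - np.asarray(cont_b)) ** 2))
    cat = sum(w for w, x, y in zip(cat_w, cat_a, cat_b) if x != y)
    return float(cont + cat)

# Each customer : ([age_z, balance_z], [gender, marital_status])
alice = ([0.5, -1.2], ["F", "married"])
bob   = ([0.1,  0.3], ["M", "married"])

# Business input : balance weighs 3x as much as age, gender 2x marital status.
d = distance(alice, bob, cont_w=[1.0, 3.0], cat_w=[2.0, 1.0])
print(round(d, 3))
```

The weights are exactly where the hours of business discussion end up : change them and the whole geometry of your customer space changes.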

Then comes the easy part : choosing and using the clustering algorithm. Based upon the characteristics of the algorithms and the known types of clusterings they generally produce, you should be able to make a decent choice.

The second really difficult part is selecting which result to choose.
Will you be satisfied with only one clustering ? I recommend using different samples of your data to check whether the calculated clusters are stable. Do you get a similar result each time ? Great ! Then you have to verify with your business people whether the result makes sense :
– is there any business logic that explains the clusters ? (If you did a good job selecting and weighing the variables this should be no problem !)
– is the number of clusters not too big ? too small ? Considering merging two adjacent clusters is a good option (thanks to Ned Kumar for pointing this out).

But what if not ? What if you end up with 15 totally different clusterings from 15 random samples ? This simply means that there are no clusters in your world and the “clusters” you found are only the products of random variation.
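The resampling stability check can be sketched like this, with a tiny k-means and a plain Rand index on synthetic data; a rough sketch, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same / different)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    return float((same_a == same_b).mean())

# Three well-separated clouds : different samples should give similar clusters.
X = np.concatenate([rng.normal(m, 0.3, size=(100, 2))
                    for m in ([0, 0], [4, 0], [0, 4])])

labelings = []
for _ in range(2):
    sample = X[rng.choice(len(X), size=len(X) // 2, replace=False)]
    centers = kmeans(sample, k=3)
    labelings.append(np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1))

agreement = rand_index(labelings[0], labelings[1])
print(round(agreement, 3))
```

High agreement between the labelings of different samples suggests real clusters; agreement near what random labelings would give suggests you found nothing but noise.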

In that case there is one simple solution left : a) calculate the distance matrix ; b) run a multidimensional scaling ; c) plot the result on some charts ; and finally d) let your business user choose where to cut.
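Steps a) and b) can be sketched in a few lines of numpy with classical multidimensional scaling; steps c) and d) are then just a scatter plot and a business discussion. The four example points are hypothetical.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed a distance matrix D into `dims` coordinates (classical MDS)."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dims]      # keep the largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# a) a distance matrix (here from known 2-D points, so we can check the result)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

# b) run the multidimensional scaling
coords = classical_mds(D, dims=2)

# The embedding reproduces the pairwise distances (up to rotation / reflection),
# so a scatter plot of `coords` is an honest 2-D picture for the business user.
D2 = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.allclose(D, D2))
```

With real customer data the input would be the weighted distance matrix discussed earlier, and the business user draws the cut lines on the plot.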

Did you like this post ? Then you might be interested in the following :
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Mining highly imbalanced data sets with logistic regressions

Posted by: zyxo | April 17, 2009

Finally a time machine !

A wormhole
Image via Wikipedia

The first time machine was constructed by some galactic super-creature about 50,000 years in our future. These creatures thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology existing some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would pose too great a risk of a time-technology mismatch, especially at the lightning speed at which technology evolves in this period.

So the first time machine was constructed by some galactic creature about 49,000 years in our future. This creature thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology existing some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would pose too great a risk of a time-technology mismatch, especially at the incredible speed at which technology evolves in this period.

So the first time machine was constructed by some galactic entity about 48,000 years in our future. This entity thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology existing some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would pose too great a risk of a time-technology mismatch, especially at the astonishing speed at which technology evolves in this period.
So the first time machine was constructed …
………………………………………………………………………………
… some 100 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would pose too great a risk of a time-technology mismatch, especially at the high speed at which technology evolves in this period.

So the first time machine was constructed by the first hyperbrained-and-connected-to-the-world-wide-brain-web cyborg about 100 years in our future. This cyborg thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the existing technology of the beginning of the 21st century.

So the first time machine is being constructed right now !

If you enjoyed this post, then you might also be interested in the following :
Web 5.0 : the telepathic web
The Human Cyborg
robotic insects or cyber insects ?
Is God the result of evolution?
Humans 2.0 ?

Posted by: zyxo | April 15, 2009

Delivering quality texts

Today I saw this post by David Silverman at Harvard Business Publishing about “how to revise an email so that people will read it“.
Here I only repeat his 10 points (why exactly 10 ?) ; do read the original post.

But there is more. We write not only emails (I hope so) but other texts too. We present slides, we give talks.
The point is : how much do we care about the quality we deliver to the others ?

I remember when we used to do quality inspections for software designs. It is so enlightening to see the resulting text after you rewrite it based upon the insights of three to five people who went over it in detail and pinned down every single “defect” (minors, majors, fatals).

Quality does not come like that. You have to invest effort to get it !

So here are the 10 points of David :
1. Delete redundancies.
2. Use numbers and specifics instead of adverbs and adjectives.
3. Add missing context.
4. Focus on the strongest argument.
5. Delete off-topic material.
6. Seek out equivocation and remove it.
7. Kill your favorites.
8. Delete anything written in the heat of emotion.
9. Shorten.
10. Give it a day.

Almost forgot : any comments, suggestions for improving the quality of this or my other posts are very welcome 🙂

Enjoyed this post ? Then you might be interested in the following :
– reducing my work email
– Micro-email = twitmail
– Email tricks

