Posted by: zyxo | July 8, 2009

Chromosome numbers, evolution and lies

A certain Kent Hovind has apparently turned a “spoof” into a serious matter. In “Opossums, Redwood Trees, and Kidney Beans” he writes (but obviously does not believe it himself) that evolution goes from few to many chromosomes, meaning that we started as a Penicillium with two chromosomes and are evolving in the direction of a fern with 480 chromosomes. Of course, this is total rubbish.

Here you can find other discussions by Kent Hovind on the subject and here the wikipedia description of the man.

The question is : does evolution follow a certain direction, such as :
- getting bigger
- having more genes
- having a larger brain
- having a larger total length of the nervous system

I would say : NO

Evolution is simply an adaptation to changing environments. It is the environment that dictates the direction of evolution. If it becomes colder, individuals that better resist cold are at an advantage and consequently the mean cold resistance of the population increases. If afterwards it becomes warmer, evolution is forced in the opposite direction.

Remember : evolution has no purpose whatsoever. It is only the consequence of selection, which is not random, but favors those individuals which are best adapted to the environment.
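To make the cold-then-warm argument concrete, here is a minimal sketch of selection following the environment. Everything in it is an assumption made for illustration: each individual is reduced to one number (its preferred temperature), the best-fitting half of the population reproduces, and offspring differ from their parent by a small random mutation.

```python
import random

def select_generation(population, temperature, rng, survivors=0.5):
    """Keep the individuals that fit the environment best, then refill.

    Each individual is a single number: its preferred temperature.
    The closer that number is to the current temperature, the fitter.
    """
    ranked = sorted(population, key=lambda trait: abs(trait - temperature))
    parents = ranked[: int(len(ranked) * survivors)]
    # Offspring inherit a parent's trait plus a small random mutation.
    return [rng.choice(parents) + rng.gauss(0, 1.0) for _ in population]

def mean(values):
    return sum(values) / len(values)

def run(temperatures, size=200, seed=1):
    """Return the mean trait of the population after each generation."""
    rng = random.Random(seed)
    population = [rng.gauss(15.0, 2.0) for _ in range(size)]
    history = [mean(population)]
    for temperature in temperatures:
        population = select_generation(population, temperature, rng)
        history.append(mean(population))
    return history

# 30 cold generations followed by 30 warm ones.
history = run([5.0] * 30 + [25.0] * 30)
print(f"start {history[0]:.1f}, after cold {history[30]:.1f}, after warm {history[60]:.1f}")
```

Nothing in the code knows where the population "should" go. The mean simply drifts towards whatever the environment currently rewards, and turns around when the environment does.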

Did you enjoy this post ? Then you might be interested in the following :
top-10 lists on evolution
The pope believes in evolution
Human evolution : the future of men
Evolution towards Intelligent Design
The end of evolution

Posted by: zyxo | July 6, 2009

Simplexity : new word about old situations

What is simplexity ?
Until a few weeks ago I had never heard or seen the word. It seems cute, original, and most of all : scientific and difficult.

What is it all about ?
According to Wikipedia it is an “emerging theory that proposes a possible complementary relationship between complexity and simplicity.”
Professor Peter Wippermann (also in Wikipedia) proposed a social definition :
“Simplexity therefore stands for a balance between the growing complexity of daily life and our own personal satisfaction.”

But searching a little further I found this : ” Simplexity: in systems theory a term for the emergence of simple features as a direct (though possibly highly intricate) consequence of a system of rules. Jack Cohen and Ian Stewart. “The Collapse of Chaos: Discovering Simplicity in a Complex World.” New York: Penguin, 1994. p. 399.” on Simplexity.co.uk.
So the word is not all that new.

If you read some posts on my blog, or other writings about emerging patterns, hierarchies and the like, you will probably find that these definitions ring a bell : there is nothing new, only the name.
It is all about 1) making complex things simple (the Romans already knew : “divide et impera” / divide and conquer), which is what all our analysis methods are about : cut the complex monster into simple pieces and then the whole becomes simple, and 2) how you can make complex constructions with simple items or rules : think of Lego, fractals, swarm intelligence, a book written with 26 different letters, or our DNA blueprint written with 4 different nucleotides.
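As a small illustration of the second point, here is a sketch of a one-line rule that produces a fractal. It is my own choice of example, not something taken from the sources above : the elementary cellular automaton known as rule 90, where each new cell is simply the XOR of its left and right neighbours.

```python
def rule90(width=31, steps=16):
    """Elementary cellular automaton, rule 90: new cell = left XOR right."""
    row = [0] * width
    row[width // 2] = 1  # start with a single live cell in the middle
    rows = [row]
    for _ in range(steps - 1):
        # Cells beyond the edges count as 0.
        padded = [0] + row + [0]
        row = [padded[i - 1] ^ padded[i + 1] for i in range(1, width + 1)]
        rows.append(row)
    return rows

rows = rule90()
for row in rows:
    print("".join("#" if cell else " " for cell in row))
```

Printed out, the rows form a Sierpinski triangle : a complex, self-similar construction out of one very simple rule.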

One of the most recent examples is twitter : how could such a simple system (messages of 140 characters) give birth to such a huge and complex hype, with hundreds of twitter tools and applications ?

The book

Perhaps the hype (?) about simplexity comes from a book written by Jeffrey Kluger : “Simplexity. Why simple things become complex (and how complex things can be made simple)”. J. Kluger also wrote an article about it in Time.
I must confess : I have not read the book yet, but if you trust others, here you can find more :

Did you like this post ? Then you might be interested in the following :
A bunch of tools for twitter
Do stock traders show swarm intelligence ?
The end of emergence
Evolution towards intelligent design

Posted by: zyxo | June 29, 2009

Link list of interesting articles (June 2009)

Posted by: zyxo | June 27, 2009

List of animal species with 46 chromosomes

Nilgai
Image via Wikipedia

Humans have 46 chromosomes. But what about other animal species ?

There are surprisingly few comparative lists of chromosome numbers to be found on the internet. I admit : it does not make a lot of sense. What would be the scientific value of that ?

Just out of curiosity I searched a bit, and as far as I know the list below is the only list of animal species with 46 chromosomes :

- Humans (Homo sapiens)
- Reeves's muntjac (Muntiacus reevesi)
- Black rat (Rattus rattus), but not all of them have 46
- European hare (Lepus europaeus)
- Merriam’s ground squirrel (Spermophilus canus)
- Southern short-tailed shrew (Blarina carolinensis)
- Mountain beaver (Aplodontia rufa)
- Beach vole (Microtus breweri)
- Nilgai (Boselaphus tragocamelus)
- Kirk’s dik-dik (Rhynchotragus kirki)
- Grey vole (Microtus arvalis)

Did you enjoy this post ? Then you might be interested in the following :
Human evolution : the future of men
top-10 lists on evolution
Cost and benefits of complexity in evolution
Evolution of minerals
Evolution in blue and red

Posted by: zyxo | June 22, 2009

How to steal energy ?

Photovoltaic Solar Energy
Image by compujeramey via Flickr

On physorg.com I found this post about a supermarket that “taps” energy from the cars that drive into its parking lot.
It made me wonder : what other ways are there to let others pay for your energy ?

Here are some examples I came up with. Feel free to add yours in a comment.

CARS :

  • passing cars in front of your house
  • traffic lights that tap energy from the passing cars to power the lights
  • a system on busy roads that collects road taxes in the form of energy

PEDESTRIANS :

SPORTSMEN / -WOMEN :

  • fitness centers : imagine how much energy could be collected if the fitness machines stored the energy produced by the people using them
  • tennis rackets that recharge batteries placed in the grip

OFFICES :

  • key presses and mouse movements, to power the screen or webcam, ...

I am sure there are a lot more examples to be found.

Another category is not the “stealing” of energy from people but just recycling energy that otherwise would be wasted. But that is for another post.

Did you like this post ? Then you might be interested in the following :
Solar power ring : enough energy to fry the earth
What comes first ?
The limit of power
Science-fiction gadgets are near
No free will ?

Posted by: zyxo | June 9, 2009

Do it standing up !

A rugby union scrum
Image via Wikipedia

Years ago I had the pleasure of working in a team of excellent people who had the habit of holding a meeting every evening as the last part of the day's work. It was a short, quick meeting where we went over what had been done that day, which problems had to be solved and what had to be done the next day. Very simple but very efficient.
After this project I went back to the old rhythm of weekly, biweekly, or monthly (depending on my bosses) long boring unproductive meetings and never experienced these short but extremely efficient daily meetings again.

Recently I stumbled upon this article by Martin Fowler on daily stand-up meetings. He gives an extensive description of how to organize these meetings, and they contain everything I have missed since our daily evening meetings.
It is clear from the following that these daily “scrums”, as they are also called, come from the software development world.

The wikipedia definition :

“A stand-up meeting (or simply stand-up) is a daily team meeting held to provide a status update to the team members. The ’semi-real-time’ status allows participants to know about potential challenges as well as coordinate efforts to resolve difficult and/or time-consuming issues. It has particular value in Agile software development processes, such as Scrum, but can be utilized in any development methodology.

The meetings are usually time boxed to 5-15 minutes and are held standing up to remind people to keep the meeting short and to the point. Most people usually refer to this meeting as just the stand-up, although it is sometimes also referred to as the morning rollcall or the daily scrum.

The meeting is usually held at the same time and place every working day. All team members are expected to attend, but the meetings are not postponed if some of the team members are not present. One of the crucial features is that the meeting is intended to be a status update to other team members and not a status update to the management or other stakeholders. Team members take turns speaking, sometimes passing along a token to indicate the current person allowed to speak. Each member talks about his progress since the last stand-up, the anticipated work until the next stand-up and any impediments they foresee.

Team members may sometimes ask for short clarifications but the stand-up does not usually consist of full fledged discussions.”

Here is what others say about daily stand-up meetings :

- The daily stand up meeting is not another meeting to waste people’s time. It will replace many other meetings giving a net savings several times its own length. (extremeprogramming.org)

- There are plenty of other things to improve, but a daily stand-up meeting is low-hanging fruit. It is easy to implement and returns immediate gains. (codebetter.com)

- Done properly, the daily Scrum will achieve it’s own results, however handled incorrectly it can become a time wasting social hour (David’s comment on mitchlacy.com)

- Daily Scrum is a powerful tool, but as any other tool it is good, when you know what it’s useful for and have some experience in using it. … The important part is the goal, not the method. (agilesoftwaredevelopment.com)

- … how the team can synchronize their work and progress by meeting every day for a quick (15-20 min) status update and report on impediments (intranet.5amsolutions.com)

- Projects get to be late one day at a time, so it seems logical to have a daily team meeting to ensure you are all on track (www.scrumlabs.com)

- The daily stand-up meeting is a crucial aspect of keeping projects moving without interruption (www.reformingprojectmanagement.com)

- … the ability to reprioritize is one of the key strengths to a fully functioning Agile process, and having this opportunity every 24 hours is a significant benefit. (talk.bmc.com)

- There has been several occasions where the stand up meetings saved us from troubles (specially in rush hours) (Hasith comment on railspikes.com)

- the daily stand up is often the first tool to be implemented because its low cost and management can see value in it quickly. (webascender.com)

- How Microsoft’s p&p Teams do Daily Standup Meetings (ademiller.com)

I wonder whether anyone is using this type of meeting in a context other than agile software development ?

Did you enjoy this post ? Then you might be interested in the following :
Top-10 lists on Knowledge management
Knowledge management = Change management
15 ways to use knowledge management software
The 10 most important failure factors of knowledge management.

Posted by: zyxo | June 1, 2009

Twitter, human evolution, and stock quotes

Look at the title of this post. It seems a rather silly combination, doesn't it ?
What do these three have in common ?

I got the idea from a post entitled “Twitter and Human evolution” by Trey Ratcliff.
Trey compares the communication between tweeple (people who tweet) with the communication between the cells of the human body, which send short messages to each other asking for stuff and offering stuff.
Seems interesting.

But he concludes that these short tweets could get humanity to act as a super-organism, where people get some sort of bottom-up decision making.

OK for the bottom-up decision making, not OK for the super-organism.
First of all : we will never know if there is or will be a super-organism, just like the body cells do not know that there is a body.
Second, and not really objectively : I do not see how this could lead to a super-organism. Twitter being only a very small part of the internet, it would be more likely that the internet as a whole becomes a super-organism. But I consider it highly improbable that one single organism (the internet) evolves into some super-organism with real mental capacities. Evolution uses large numbers of organisms, and (natural) selection, to end up with something meaningful. One internet is not really a large number …

And what about stock quotes ?

Bottom-up decision making due to twitter is comparable with buying or selling stocks based on the information we find in discussion fora, in newspapers, and even on twitter. But there is a huge difference : with stocks we also have the actual quotes, which are the real result of the combined buy-sell behaviour of thousands or millions of people.

With twitter, we only have the tweets. There is no software running behind the scenes to analyse, for example, all the tweets concerning “evolution” and come up with a global picture of what people think of evolution second by second. Of course it would be nice to have such a service !

Enjoyed this post ? Then you might be interested in the following :
Web 5.0 : the telepathic web
Do Stock Traders show Swarm Intelligence?
Swarm versus intelligence
Piqqem : Prediction market for prediction errors
swarm-information-transfer-techniques

Posted by: zyxo | June 1, 2009

Link list of interesting articles (May 2009)

Posted by: zyxo | May 31, 2009

Dangerous to click this link !

Hi, this is just a test to see if you people can or cannot resist “dangerous” links .
Sorry for bothering you. Perhaps you can enjoy other posts of my blog. It’s free …
Zyxo

Posted by: zyxo | May 31, 2009

After GIGO comes GIQO

Manure, a field in Randers in Denmark
Image via Wikipedia

Garbage In, Quality Out.
Is that not a dream ?
Well no, it is reality.
Here is a list of examples where garbage comes in and some quality product is produced :
- drinking water out of urine in the space station
- a useful targeting data mining model out of a “dirty” database
- perfectly healthy vegetables out of a garden enriched with manure (=shit!)
- perfectly clear glass, out of sand
- useful products like construction blocks out of garbage
- gasoline out of garbage

I am sure there are many other examples.

Enjoyed this post ? Then you might be interested in the following :
- solar power ring : enough energy to fry the earth
- Evolution towards Intelligent Design
- should we invest in photovoltaic cells ?
- Web 5.0 : the telepathic web

Posted by: zyxo | May 24, 2009

Good enough / data quality

Detail on a bottle of Ardbeg whisky.
Image via Wikipedia

Data quality : when is it sufficient ?

Leave out the data, let us talk about quality.

First of all here are some examples of quality “problems”.

Obviously we often have to make choices that settle for less than the best possible quality.
So it is with data : when is the quality good enough ?

It depends : what do you want to do with it ?

– if it is reporting : the numbers had better be correct. In a large enterprise I bet there will be two sources of the same numbers. The results will be compared and there will be trouble.
– if it is descriptive data mining, like clusterings or descriptive classifications : the data had better be as correct as possible. Errors are acceptable within reasonable limits, as long as the picture “fits”.
– if it is data mining for targeting purposes : the data has to be stable in time. Correct ? I do not care. Does this sound crazy ? Perhaps. But really : I do not care ! If they put someone's shoe size in the “Birthday” variable, this poses no problem, because the data mining algorithm does not take the meaning of the variable names into account. “var1”, “var2”, “var3”, etc. do equally well. The only thing that matters is : how good is the predictive quality of the targeting model ? You can only obtain a good predictive model with variables that have predictive power (are related to the target) and that are stable, meaning that the meaning of the variable does not change over time. I do not like it when IT people correct flaws in the data. It diminishes the quality of my models and I have to rebuild them.
So better use your time to build targeting models than to try to get the data perfect. Just use the GIQO principle I just invented : GARBAGE IN, QUALITY OUT ! (a bit like the urine-to-water machine at the space station)
– if it is web analysis : this is yet another story, neatly explained by Avinash Kaushik in this post.
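The targeting case can be sketched in a few lines. This is a toy illustration with made-up data, not a real targeting model : the “model” is a single cut-off learned on a column that is labelled as a birthday but actually holds shoe sizes. The function names and all numbers are hypothetical.

```python
import random

def fit_threshold(values, targets):
    """Return the cut-off that best separates target 1 from target 0.

    Toy one-variable "model": predict 1 when value > cut-off.
    The variable's name or meaning is never used, only its values.
    """
    best_cut, best_hits = None, -1
    for cut in sorted(set(values)):
        hits = sum((v > cut) == bool(t) for v, t in zip(values, targets))
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return best_cut

def accuracy(cut, values, targets):
    return sum((v > cut) == bool(t) for v, t in zip(values, targets)) / len(values)

rng = random.Random(0)
# The column is labelled "birthday" but really holds shoe sizes (made-up data).
shoe_sizes = [rng.uniform(36, 47) for _ in range(1000)]
# Synthetic target: buyers mostly have a shoe size above 42, with 10% noise.
bought = [int((s > 42) != (rng.random() < 0.1)) for s in shoe_sizes]

train_x, test_x = shoe_sizes[:700], shoe_sizes[700:]
train_y, test_y = bought[:700], bought[700:]
cut = fit_threshold(train_x, train_y)

stable = accuracy(cut, test_x, test_y)
# IT "corrects" the column: it now holds the day of the month instead.
corrected = [rng.randint(1, 31) for _ in test_x]
after_fix = accuracy(cut, corrected, test_y)
print(f"cut-off {cut:.1f}, stable data {stable:.2f}, after the fix {after_fix:.2f}")
```

The wrong label does no harm as long as the content of the column stays the same. The well-meant correction is what breaks the model, because the cut-off it learned no longer means anything.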

Did you like this post ? Then you might be interested in the following :
How many inputs do data miners need ?
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes

Posted by: zyxo | May 12, 2009

How many inputs do data miners need ?

How many records do you need to make a decent data mining model ?

Let us first look at a data mining definition (you find dozens of them on the web, I just took one at random).
The automatic extraction of useful, often previously unknown information from large databases or data sets.

In most definitions we find something like “large database” or “lots of data”, which implies that we need a huge amount of data to enjoy our data mining hobby.
Is this so ?

Anyway it is a tough question.

Let us start simple. It is all about getting information out of the data. So let us take three points in a plane (x-y plot). If they fall on a straight line, the correlation coefficient is statistically significant. Meaning that you do not necessarily need a lot of data to extract information from it.

But data mining was invented to overcome the problems statistics have with huge amounts of data and variables.

Candidate factors that play a role in determining the optimal number of observations are :
- dimensionality : the number of variables (preferably convert categorical variables into dummies before counting !). As a rule of thumb you should have at least as many observations as something like the squared number of variables. (I forgot where I read or heard this).
But what about a dataset with 10,000 variables of which only 2 are really related to the target variable ? In that case there is no “curse of dimensionality”. The only problem is the storage space and computing power needed to find the two significant ones.
- effect size (I used to call this “power”, but strictly speaking statistical power is the probability of detecting an effect that really exists) : This is a difficult one and often overlooked in statistics. A large effect size means : a clear and large effect of the independent variables on the target variable. A small effect size means that there is an effect but it is very small and hence difficult to detect … unless you have a lot of observations … Let us return to the three points on a straight line : they show a huge effect, so three points are sufficient to establish that there is a significant correlation. But what if the population is almost a circular cloud of points ? With 10,000 points on that plane you could calculate a correlation coefficient of 0.04, which is highly significant but reflects a very small effect ! With data mining we often want to include even the smallest effects in our model to increase the prediction quality (read “marketing campaign return”) as much as possible. So we need lots of observations to detect them.
- modeling method : decision trees can handle a huge number of observations. So can logistic regressions. But since you obviously want to perform some selection of variables you want a stepwise regression : this will take ages. And random forests can handle a lot of variables but relatively few observations. This you have to test on your own system.
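The two correlation examples can be checked with the usual t test for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below uses made-up points that lie almost, but not exactly, on a straight line, because with a perfect line r equals 1 and the formula divides by zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Three points very close to a straight line (made-up numbers).
r_small = pearson_r([1, 2, 3], [2.0, 4.1, 6.0])
t_small = t_statistic(r_small, 3)
print(r_small, t_small)

# A nearly circular cloud: r = 0.04 with 10,000 points.
t_large = t_statistic(0.04, 10_000)
print(t_large)
```

With 1 degree of freedom the two-sided 5% critical value is about 12.7, and with thousands of degrees of freedom it is about 1.96. Both examples clear their threshold : the first thanks to a huge effect in only three points, the second thanks to a huge number of points around a tiny effect.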

The one solution I propose to get an estimate of how many observations are sufficient but not too many : try it out !

Too many :
- if your tool/system cannot handle them any more (neural networks, logistic regressions …)
- for decision trees : if the model quality does not improve any more (tested on a hold-out dataset). Be aware of the fact that decision trees grow larger and larger as long as you feed them more observations, but not necessarily get better (unless you force them to stop at a fixed number of splits, which I do not find a good idea ! )

Too few :
- poor model

So what should you do ? Make a lot of models with increasing numbers of observations and test them against a hold-out dataset. Continue adding observations as long as the model quality improves.
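A minimal sketch of that procedure, on synthetic data and with a deliberately simple stand-in model (the observed response rate per bin of one variable). The data generator, the model and the sample sizes are all assumptions for illustration; with a real tool you would plug in your own model and quality measure.

```python
import random

def make_data(n, rng):
    """Synthetic rows (x, target): the response rate rises slowly with x."""
    rows = []
    for _ in range(n):
        x = rng.random()
        rows.append((x, int(rng.random() < 0.2 + 0.6 * x)))
    return rows

def fit(rows, bins=20):
    """Toy model: the observed response rate per bin of x."""
    hits, counts = [0] * bins, [0] * bins
    for x, target in rows:
        b = min(int(x * bins), bins - 1)
        hits[b] += target
        counts[b] += 1
    # An empty bin falls back to a neutral 0.5.
    return [h / c if c else 0.5 for h, c in zip(hits, counts)]

def holdout_error(model, rows):
    """Mean squared error between the predicted rate and the actual outcome."""
    bins = len(model)
    total = sum((model[min(int(x * bins), bins - 1)] - t) ** 2 for x, t in rows)
    return total / len(rows)

rng = random.Random(42)
holdout = make_data(5000, rng)
pool = make_data(20000, rng)

# Build the same model on ever larger samples and score each on the hold-out set.
errors = {}
for n in (50, 200, 1000, 5000, 20000):
    errors[n] = holdout_error(fit(pool[:n]), holdout)
    print(n, round(errors[n], 4))
```

The point where the hold-out error stops dropping noticeably is the point where extra observations no longer buy you anything for this model.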

As someone said before : it is 5% inspiration and 95% perspiration …

Did you like this post ? Then you might be interested in the following :
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Mining highly imbalanced data sets with logistic regressions

Posted by: zyxo | May 8, 2009

Game complexity

Chess game
Image by Ed Yourdon via Flickr

Just some complexity numbers :

Number of possible combinations for the following games :

four in a row : 10^14
checkers : 10^23
chess : 10^50
game of go : 10^171. The most complex of all ! That is why computers still cannot beat the human masters !

Posted by: zyxo | May 5, 2009

Simple physics of snooker

Still from Media:Snooker break.
Image via Wikipedia

Over the last few days I watched some games of the snooker world championship at The Crucible Theatre in Sheffield. I was amazed at the incredible quality of the play. Then I started to wonder : what calculations are necessary to come to such magnificent shots ?

I could google surprisingly little on the physics of snooker. And what I found was either poor physics or poor snooker.
So I decided to give a brief overview of the physics of snooker, as I see it. By the way, my snooker skills are very limited :-) , but I love the game, and my physics knowledge is limited to some practical insights, so I leave out all the theory.

To shorten this post and to reduce the complexity I will limit this overview to one single shot.

What do we need ? A snooker table, a cue stick, a cue ball (the white one) and an object ball (red or other colour).

What do we want ? The object ball has to go in the pocket, the cue ball must come to rest on a very good spot to pot the next object ball.

What are our limits ? i) We can only play the cue ball. Everything that we want the object ball to do is caused by the collision with the cue ball. ii) With the cue stick we can give the cue ball a forward motion and a spin (follow, draw, side, where follow and draw correspond to top spin and back spin respectively, and of course combinations of follow or draw with side). As for the object ball, only a forward motion is practically possible. Given the fact that there is very little friction when two balls collide, the limited object ball spin is of no practical value.

( See drawing at onekaraoke.com)

OK, how will we attain these two objectives ?

First objective : the object ball must go in the pocket. The direction of the object ball is totally determined by the point where it is touched by the cue ball. When the two balls collide, simply draw a line through the centers of the two balls. This line shows the direction of the object ball. It should point straight to the center of the pocket.
This is the simplest of the two, but it is far from simple ! The direction of the object ball is limited between 180° (goes in the same direction as the cue ball, when this direction goes through the center of the object ball) and a theoretical 90° when the cue ball “kisses” the object ball extremely thinly at its side.
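The “line through the two centers” rule translates directly into a small calculation. The sketch below is only the geometry, under the assumption of made-up table coordinates in millimetres and a standard 52.5 mm ball. Note that it measures the cut angle from 0° (a straight pot, which the text above calls 180°) up to a theoretical 90°.

```python
import math

BALL_DIAMETER = 52.5  # mm, standard snooker ball

def aim_point(object_ball, pocket, diameter=BALL_DIAMETER):
    """Where the cue ball's centre must be at the moment of contact.

    The object ball leaves along the line through both centres, so the cue
    ball centre has to sit on the pocket-to-object-ball line, one ball
    diameter behind the object ball.
    """
    dx, dy = pocket[0] - object_ball[0], pocket[1] - object_ball[1]
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length  # unit vector towards the pocket
    return (object_ball[0] - ux * diameter, object_ball[1] - uy * diameter)

def cut_angle(cue_ball, object_ball, pocket):
    """Angle in degrees between the cue ball's path and the object ball's path."""
    ghost = aim_point(object_ball, pocket)
    shot = (ghost[0] - cue_ball[0], ghost[1] - cue_ball[1])
    pot = (pocket[0] - object_ball[0], pocket[1] - object_ball[1])
    dot = shot[0] * pot[0] + shot[1] * pot[1]
    cos_angle = dot / (math.hypot(*shot) * math.hypot(*pot))
    # Clamp to guard against rounding just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

# Made-up positions: cue ball, object ball and pocket all on one line.
target = aim_point((1000.0, 500.0), (1000.0, 0.0))
angle = cut_angle((1000.0, 1500.0), (1000.0, 500.0), (1000.0, 0.0))
print(target, angle)
```

The calculation is trivial. Hitting that aim point to within a millimetre or so from a few metres away is the part that is far from simple.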

Second objective : the cue ball must stop at the exact (approximately is mostly good enough) spot of the table where we want it to stop. Remember the cue ball has two moving properties : direction and spin, so this is a lot more complicated than simply potting the object ball.

- Right after the collision (the first nanoseconds !) : the cue ball direction is entirely determined by the point of contact. This direction can be anything between 90° (right) and 270° (left), except for the circle segment behind the object ball. The size of this segment is determined by the original distance between the two balls.
At the moment of impact, the total momentum is conserved. With a completely central impact, this means that the object ball will continue in the direction of the cue ball at the speed of the cue ball, and the cue ball will remain motionless at the exact spot where it touched the object ball. The other extreme is that the cue ball misses the object ball; no need to tell what happens.
Then you have everything in between : touching the object ball very thinly will cause
i) a very small deviation and almost no loss of speed of the cue ball and
ii) a very slow movement of the object ball at an angle of just above 90°. So there is not much here the player can play with to get his cue ball on the right spot after the shot.
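Both extremes follow from the same idealised model : two equal balls, no friction between them and no loss of energy. In that model the object ball takes the part of the cue ball's velocity that points along the line of centres and the cue ball keeps the rest. A sketch with made-up speeds and positions :

```python
import math

def collide(cue_velocity, cue_pos, object_pos):
    """Velocities right after impact for two equal, frictionless, elastic balls.

    Idealised model: the object ball takes the part of the cue ball's
    velocity that points along the line of centres; the cue ball keeps
    the perpendicular part. Total momentum is unchanged.
    """
    nx, ny = object_pos[0] - cue_pos[0], object_pos[1] - cue_pos[1]
    length = math.hypot(nx, ny)
    nx, ny = nx / length, ny / length
    along = cue_velocity[0] * nx + cue_velocity[1] * ny
    object_velocity = (along * nx, along * ny)
    cue_after = (cue_velocity[0] - object_velocity[0],
                 cue_velocity[1] - object_velocity[1])
    return cue_after, object_velocity

# Full-ball contact: the cue ball stops dead, the object ball takes all the speed.
full = collide((2.0, 0.0), (0.0, 0.0), (1.0, 0.0))
print(full)

# Very thin contact: line of centres almost perpendicular to the shot.
angle = math.radians(85)
thin = collide((2.0, 0.0), (0.0, 0.0), (math.cos(angle), math.sin(angle)))
print(thin)
```

In the thin case the cue ball keeps almost all of its speed and the object ball crawls away, and the two balls always leave at right angles to each other. Spin, treated next, is what lets the player escape from that right angle.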

- Later on (the following nanoseconds, milliseconds, seconds …) after the collision there is something totally different in play : the spin of the cue ball. Until the cue ball hits a cushion the side spin does not play an important role (and by the way, this would be too difficult to elaborate). The vertical spin though has a huge effect on the cue ball direction.

Let us start with a straight, full-ball shot, right at the center of the object ball.

There are three possibilities :
1) the cue ball has absolutely no spin when it touches the object ball. In that case the cue ball immediately stops and stays where it hit the object ball.
2) the cue ball is struck with draw (back spin) : after the collision the back spin and the friction with the table force the cue ball to come back in the direction where it came from. The speed and distance are determined by the amount of spin.
3) the cue ball is struck with follow (top spin) : after the collision the top spin and the friction with the table force the cue ball to continue in the direction it had before the collision. The speed and distance are determined by the amount of spin. (Note that this can cause the cue ball to disappear in the pocket too !)

It is important to realise that in 2) and 3) the cue ball accelerates after the collision (remember : in the first nanosecond after the collision it had stopped completely), until the spin has slowed down so far that the ball no longer slides over the cloth and simply rolls on, meaning that the speed of the ball's surface due to the spin equals its horizontal speed.
It is this acceleration phase that is interesting in the other case, where the cue ball hits the object ball at an angle : in that case we have
i) a straight movement of the cue ball at an angle to its original direction and at the same time
ii) an acceleration in either the same or the opposite direction of the original movement. This causes the cue ball to follow a curved trajectory, until the complete rolling phase. With this draw and follow spin, an excellent player is able to force the cue ball to follow nearly any direction between 0 and 180° after the collision.
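For the straight shot, the acceleration phase can be worked out with textbook mechanics. The sketch below assumes a solid sphere sliding on the cloth with a constant friction coefficient (the value 0.2 is a guess, not a measurement). Friction speeds the ball up as v = friction * g * t while it slows the spin down as w = w0 - (5/2) * friction * g * t / radius, and sliding ends when v = w * radius.

```python
G = 9.81  # m/s^2

def settle_to_rolling(spin, radius=0.02625, friction=0.2, g=G):
    """Cue ball that has just stopped dead but still carries top or back spin.

    spin is the angular speed in rad/s; positive means top spin (follow),
    negative means back spin (draw). Assumes a solid sphere and a constant
    friction coefficient with the cloth (0.2 is a guess).
    Returns (final rolling speed in m/s, time in s until pure rolling).
    """
    # Solving friction*g*t = radius*spin - (5/2)*friction*g*t for t
    # gives the sliding time; the speed reached at that moment is final.
    final_speed = 2.0 / 7.0 * radius * spin
    time_sliding = 2.0 * radius * abs(spin) / (7.0 * friction * g)
    return final_speed, time_sliding

follow = settle_to_rolling(60.0)   # top spin: the ball moves forward
draw = settle_to_rolling(-60.0)    # back spin: the ball comes back
print(follow, draw)
```

In this model the final speed depends only on the amount of spin, not on the friction. A slicker cloth just makes the acceleration phase last longer, which also means a longer curved path on angled shots.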

And finally side spin comes into play : when the cue ball hits the cushion. Normally it leaves the cushion at the same angle as it arrived, but on the other side of the perpendicular. With side spin the player can increase or decrease this angle. If the angle is close to the perpendicular, it is even possible to leave on the same side as the arrival.

Did you enjoy this post ? Then you might be interested in the following :

The family of PI
Evolution in blue and red
Web 5.0 : the telepathic web
No free will ?

Posted by: zyxo | April 24, 2009

Why clustering is difficult

Is clustering difficult ?
You just take your data and run it through a clustering algorithm like k-means clustering, and you have your result …

Of course you could do that, but what will be the quality of the result ?

For a good clustering you have to resolve three problems :
1. which clustering algorithm to use ?
2. what definition of distance to use ?
3. choosing your clusters

1. the choice of the clustering algorithm is in my opinion the easiest of the three. I will not go into a taxonomy of possible clustering algorithms, you find them everywhere.

2. The first hard problem is finding a good definition / calculation of distance. Clustering is based on distances (maximizing distances between clusters, minimizing distances within clusters).
I am not talking of geographical locations here, that’s too simple since in that case distances are … well … distances : miles, kilometers or whatever.
But try to define a distance between two customers, based on for example 500 variables like age, account balances, time since last purchase, which are continuous variables, and a few handfuls of categorical variables like gender, type of environment they live in, are they married or not ? etc.

What is then the distance measure ?

With the continuous variables you could calculate a Euclidean distance after converting all (standardized) variables to principal components, which are orthogonal. But what is the business meaning of such a distance ?
Does a difference of one standard deviation along variable X (e.g. total purchase amount during the last month) have the same value for the business as a comparable difference along variable Y (e.g. age) ?

The same problem arises with categorical variables. You can simply count the number (or proportion) of non-matching categorical variables. But is the difference between married or not married equally important for your business as the difference between man and woman ?

The bulk of the hard labour comes at this stage : if you want to deliver a good clustering, you first have to talk for many hours and days with your business people to know
1) which variables are relevant to the clustering (what do they want to use the clusters for ?) and which to discard.
2) what weight to give to each selected variable. Variable X can be three or ten times more important for your business than variable Y. You should take this into account.

Only then can you go to the next stage : calculating the distances.
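Once the variables and weights are agreed, the distance itself is the easy bit. Here is a minimal sketch of one possible weighted distance for mixed data : standardized squared differences for the continuous variables, a simple mismatch count for the categorical ones. The variable names, standard deviations and weights are all hypothetical.

```python
def customer_distance(a, b, numeric, categorical, weights):
    """Weighted distance between two customers given as dicts.

    numeric maps a variable name to its standard deviation (used to
    standardise); categorical is a list of variable names; weights maps
    every variable name to the business weight agreed with the users.
    All names and numbers here are hypothetical.
    """
    total = 0.0
    for name, std in numeric.items():
        diff = (a[name] - b[name]) / std
        total += weights[name] * diff * diff
    for name in categorical:
        # A mismatch counts as one "unit" of squared distance.
        total += weights[name] * (a[name] != b[name])
    return total ** 0.5

numeric = {"age": 15.0, "balance": 2000.0}
categorical = ["gender", "married"]
weights = {"age": 1.0, "balance": 3.0, "gender": 0.5, "married": 2.0}

anna = {"age": 30, "balance": 1000.0, "gender": "f", "married": True}
bert = {"age": 45, "balance": 3000.0, "gender": "m", "married": True}
distance = customer_distance(anna, bert, numeric, categorical, weights)
print(distance)
```

Changing a weight changes which customers count as neighbours, which is exactly why the weights have to come from the business and not from the data miner.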

Then comes the easy part : choosing and using the clustering algorithm. Based upon the characteristics of the algorithms and the known types of clusterings these generally produce, you should be able to make a decent choice.

The second really difficult part is selecting which result to choose.
Will you be satisfied with only one clustering ? I recommend using different samples of your data to check whether the calculated clusters are stable. Do you get a similar result each time ? Great ! Then you have to verify with your business people whether the result makes some sense :
- is there any business logic that explains the clusters ? (If you did a good job selecting the variables and weighting them this should be no problem !).
- is the number of clusters not too big ? too small ? Considering merging two adjacent clusters is a good option. (thanks to Ned Kumar for pointing this out).

But what if not ? What if you end up with 15 totally different clusterings from 15 random samples ? This simply means that there are no clusters in your world and the “clusters” you found are only the products of random variation.
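The stability check can be sketched as follows. This is an illustration on made-up two-dimensional data with a bare-bones k-means, not a recipe for a production tool : several random samples are clustered, each clustering labels the same fixed set of reference points, and the Rand index (the share of point pairs on which two clusterings agree) measures how similar the results are.

```python
import random

def nearest(point, centres):
    """Index of the centre closest to the point."""
    return min(range(len(centres)),
               key=lambda i: (point[0] - centres[i][0]) ** 2 + (point[1] - centres[i][1]) ** 2)

def kmeans(points, k, rng, iterations=30):
    """Plain k-means on 2-D points; returns the list of cluster centres."""
    centres = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Keep the old centre if a cluster ends up empty.
        centres = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
                   for g, c in zip(groups, centres)]
    return centres

def rand_index(labels_a, labels_b):
    """Share of point pairs on which two clusterings agree (1.0 = identical)."""
    agree = total = 0
    for i in range(len(labels_a)):
        for j in range(i):
            agree += (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
            total += 1
    return agree / total

def stability(points, k, rng, runs=15, sample_size=300):
    """Cluster several random samples and compare them on fixed reference points."""
    reference = rng.sample(points, 100)
    labelings = []
    for _ in range(runs):
        centres = kmeans(rng.sample(points, sample_size), k, rng)
        labelings.append([nearest(p, centres) for p in reference])
    scores = [rand_index(labelings[0], other) for other in labelings[1:]]
    return sum(scores) / len(scores)

rng = random.Random(7)
# Made-up data: two clearly separated groups versus one round, structureless blob.
clustered = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
             + [(rng.gauss(10, 1), rng.gauss(0, 1)) for _ in range(1000)])
blob = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]

clustered_score = stability(clustered, 2, rng)
blob_score = stability(blob, 2, rng)
print(round(clustered_score, 2), round(blob_score, 2))
```

A score near 1 means every sample tells the same story. A clearly lower score means the algorithm is cutting the cloud in a different place each time, which is the situation described above.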

In that case there is one simple solution left : a) calculate the distance matrix. b) Run a multidimensional scaling, c) plot the result on some charts and finally d) let your business user choose where to cut.

Did you like this post ? Then you might be interested in the following :
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Mining highly imbalanced data sets with logistic regressions

Posted by: zyxo | April 17, 2009

Finally a time machine !

The first time machine was constructed by some galactic super-creature about 50,000 years in our future. These creatures thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology that existed some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would carry too much risk of a time-technology mismatch, especially given the lightning speed at which technology evolved in this period.

So the first time machine was constructed by some galactic creature about 49,000 years in our future. This creature thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology that existed some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would carry too much risk of a time-technology mismatch, especially given the incredible speed at which technology evolved in this period.

So the first time machine was constructed by some galactic entity about 48,000 years in our future. This entity thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the technology that existed some 1,000 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would carry too much risk of a time-technology mismatch, especially given the astonishing speed at which technology evolved in this period.
So the first time machine was constructed …
………………………………………………………………………………
… some 100 years before. They could not go further back, because the adapted design had to match the technology of that time exactly, and going further back would carry too much risk of a time-technology mismatch, especially given the high speed at which technology evolved in this period.

So the first time machine was constructed by the first hyperbrained-and-connected-to-the-world-wide-brain-web cyborg about 100 years in our future. This cyborg thought it would be interesting to speed up the whole process by sending the design back into the past, of course after adjusting it to the existing technology of the beginning of the 21st century.

So the first time machine is being constructed right now !

If you enjoyed this post, then you might also be interested in the following :
Web 5.0 : the telepathic web
The Human Cyborg
robotic insects or cyber insects ?
Is God the result of evolution?
Humans 2.0 ?

 

Posted by: zyxo | April 15, 2009

Delivering quality texts

Today I saw this post by David Silverman at Harvard Business Publishing about “how to revise an email so that people will read it”.
Here I only repeat his 10 points (why exactly 10 ?). Do read the original post.

But there is more. We write not only emails (I hope so) but other texts too. We present slides, we give talks.
The point is : how much do we care about the quality we deliver to the others ?

I remember when we used to do quality inspections for software designs. It is so enlightening to see the resulting text after you have rewritten it based upon the insights of three to five people who went over your text in detail and pinned down every single “defect” (minors, majors, fatals).

Quality does not come like that. You have to invest effort to get it !

So here are David’s 10 points :
1. Delete redundancies.
2. Use numbers and specifics instead of adverbs and adjectives.
3. Add missing context.
4. Focus on the strongest argument.
5. Delete off-topic material.
6. Seek out equivocation and remove it.
7. Kill your favorites.
8. Delete anything written in the heat of emotion.
9. Shorten.
10. Give it a day.

Almost forgot : any comments, suggestions for improving the quality of this or my other posts are very welcome :-)

Enjoyed this post ? Then you might be interested in the following :
- Reducing my work email
- Micro-email = twitmail
- Email tricks

Posted by: zyxo | April 14, 2009

No Free Will ?

Do we have a free will, or is it chemically determined ?

We make all our choices, but is the fact that we choose option A something we are FREE to do, or is it already chemically determined ? Or in other words : does our chemistry fool us ?
Brain-scanner research has indeed shown that simple decisions can be predicted up to seven seconds before you are aware of making them !
Nevertheless we are convinced of our free will. Does this mean that the idea of free will is only generated in our brain after the brain has made the choice ?

If this is really true, why ? What is the function of free will ? Could we not do without it ?
Life would be much easier. No free will, no more judgements about wrong choices, no more guilty feelings …
But evolution theory, Darwin, survival of the fittest etc. also have their say.
A small community with people who make the wrong choices is in danger ! So this thing called free will, the feeling that we are able to make choices and hence can try to adapt our behaviour to previous experiences and to feedback from our fellow tribe members, is needed to change behaviour in favour of the tribe.
And remember : the research only talks about 7 seconds. A lot of decisions take a long time to make. Just think of your wife choosing a new pair of shoes …

Enjoyed this post ? Then you might be interested in the following :
- A taxonomy of psychons
- Complex decision to make ?
- Job interview or brain scan ?
- Web 5.0: The telepathic web
- Human brain copy protection by AnyMind Inc.

Posted by: zyxo | April 10, 2009

Douglas Adams’ superb nonsense

Do not read this book for the story, but just for the unusual text !
I’m reading the third book in Douglas Adams’ Hitchhiker’s trilogy, namely “Life, the Universe and Everything”.
The story itself is not much, but the way he writes !
I’m asking myself whether he dug up a list of extremely scientific and other words somewhere, which he uses in more or less random combinations when he describes the weird environments, circumstances, gadgets, mindstates and whatever.
The guy is really astonishing. Just look at the list of ‘interesting’ words/expressions I took just from pages 48 and 49 :
“recipriversexclusons”, “somebody else’s problem fields”, “nonabsoluteness”, “subphenomenon”, “interactive subjectivity frameworks”.
Or look how poetically he describes the sunrise : “several billion trillion tons of superhot exploding hydrogen nuclei rose slowly above the horizon and managed to look small, cold and slightly damp”.
And these weird, unexpected twists of mind keep going on and on throughout the whole book.

Normally I do not advertise books, but for this exceptional one, I make an exception.
By the way, if you have not read “The Hitchhiker’s Guide” yet, you should start with that one !

Enjoyed this post ? Then you might be interested in the following :
- Gödel, Escher, Bach online course
- Continuity Gap in The Intelligent Universe
- Web 5.0: The telepathic web
- Human brain copy protection by AnyMind Inc.

Posted by: zyxo | April 6, 2009

Adam and Eve : Robot scientists

Adam is the first robot that has discovered new scientific knowledge. Eve will be the next.
Adam is still a prototype but it works, proving that in some future, science will be done by robots, leaving time for the humans to do useful stuff like politics, sports, management and warfare ??

Enjoyed this post ? Then you might be interested in the following :
- Web 5.0: The telepathic web
- Futurology : Top ten emerging technologies
- Robotic insects or cyber-insects ?
- Self reassembling Robot
- Human brain copy protection by AnyMind Inc.


Recently someone asked me
1) whether logistic regressions can handle imbalanced datasets
and
2) whether it is possible to adjust the regression equation so as to truly represent the actual situation.

The answer to the first question is “YES” and “NO”. I will explain this further in this post.

For the second question : I really do not know. But somehow it seems to me that this would be very unreliable.

Can we mine a highly imbalanced dataset with logistic regression ?
Let me first state this : to mine highly imbalanced datasets, you should preferably NOT use logistic regressions, but something like decision trees. If you insist on using logistic regression, then I advise the following :

If you use the entire imbalanced dataset I am convinced that, if the regression eventually converges, you will end up with a very poor model. So don’t.

Now sampling. Let us assume that the minority class are the positives.
Since you have many more negatives than positives, a first approach would be to take all the positives and the same number of negatives, which gives you a perfectly balanced sample. (A dataset is generally accepted as balanced when the minority class contains at least 30% of the observations.)
There is a good thing and a bad thing to that approach.
The good thing is that logistic regressions do very well on small datasets (much better than, for example, decision trees). So with that approach you should already get an acceptable model.
The bad thing is that you have no possibility left to test the quality of your model, since you have no positives left. (This testing should always be done on a sample that was not used to make the model.)

Consequently I suggest an alternative approach, which yields a model of the same or, more likely, better quality AND an indication of the quality of the model.
This approach is called BAGGING (bootstrap aggregating). It is mostly used with, and shows the largest gains on, unstable classifiers like decision trees, but in this particular case it can be very useful too.

For example, take 90% of the positives and about the same number of negatives (you may take a slightly higher number of negatives to obtain, for example, a distribution of 40% positives and 60% negatives).
Then you calculate your model and test it on the 10% of positives that you did not use in the model and, of course, on whatever number of negatives you want.

Then you start all over again with another random sample of 90% of the positives and a sample of the negatives, preferably not overlapping with the first sample of negatives.
And after that you start all over again … and again … and again … The more the better, but 10 to 20 times will normally already do a great job.
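The repeated sampling can be sketched as follows. This is only an illustration under my own assumptions : the function name and parameters are hypothetical, records are represented by plain identifiers, and the 1.5 negatives per positive simply reproduce the 40% / 60% example. Fitting the logistic regression on each sample is left to whatever tool you use.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def balanced_rounds(positives: Sequence[int],
                    negatives: Sequence[int],
                    rounds: int = 10,
                    pos_fraction: float = 0.9,
                    neg_per_pos: float = 1.5,
                    seed: int = 0) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train_positives, test_positives, train_negatives) for each round."""
    rng = random.Random(seed)
    neg_pool = list(negatives)
    rng.shuffle(neg_pool)
    n_train_pos = int(round(pos_fraction * len(positives)))
    n_train_neg = int(round(neg_per_pos * n_train_pos))
    for r in range(rounds):
        # A fresh random 90% / 10% split of the positives in every round.
        shuffled = list(positives)
        rng.shuffle(shuffled)
        train_pos, test_pos = shuffled[:n_train_pos], shuffled[n_train_pos:]
        # Take the next unused slice of negatives, so rounds do not overlap.
        start = r * n_train_neg
        train_neg = neg_pool[start:start + n_train_neg]
        if len(train_neg) < n_train_neg:
            # Not enough unused negatives left: fall back to a random draw,
            # which may overlap with earlier rounds.
            train_neg = rng.sample(neg_pool, min(n_train_neg, len(neg_pool)))
        yield train_pos, test_pos, train_neg


# Hypothetical data: 100 positives and 2,000 negatives, identified by number.
samples = list(balanced_rounds(list(range(100)), list(range(1000, 3000))))
```

Each round gives 90 positives and 135 negatives to train on (40% / 60%) and 10 held-out positives to test on; the negatives for testing can be taken from anything not used in that round.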

Suppose you repeated the sampling and modeling 10 times, then you get 10 models and test results for 10 times 10% = 100% of the number of positives, and a lot more for the negatives, because they are the majority class.
With the test results you can plot the calculated probabilities against the real probabilities to relate the models to the real world.

But there is something even better : if you use the 10 models to score a new dataset, then you obtain 10 different scores for each point. These 10 scores will not be the same since they are generated by different models. What you have to do is calculate the average of these 10 scores to obtain a more accurate score than each of the 10 individual scores.
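Averaging the scores is the easy bit. In the sketch below each model is represented simply as an intercept plus a list of coefficients, which is my own simplification and not the output format of any particular package; the coefficient values are made up.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

Model = Tuple[float, Sequence[float]]  # (intercept, coefficients)


def logistic_score(intercept: float, coefs: Sequence[float], x: Sequence[float]) -> float:
    """Probability from one logistic regression: 1 / (1 + exp(-z))."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    # Two algebraically equal forms, chosen to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bagged_score(models: List[Model], x: Sequence[float]) -> float:
    """Average of the individual model scores for one record."""
    return mean(logistic_score(b0, coefs, x) for b0, coefs in models)


# Two made-up models that disagree strongly on the same record.
toy_models = [(0.0, [1.0]), (0.0, [-1.0])]
average = bagged_score(toy_models, [2.0])
```

Keep in mind that the individual models were fitted on artificially balanced samples, so the averaged score is on that balanced scale; the plot of calculated against real probabilities is what relates it back to the real world.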

And that is what you needed : a more or less reliable model and an indication of its quality.
And to end a little paradox : you made a nicely “balanced” model with an unbalanced training set, since you used practically 100% of the positives, but 12 times more of the negatives!

Did you like this post ? Then you might be interested in the following :
Oversampling or undersampling ?
data mining with decision trees : what they never tell you
The top-10 data mining mistakes
Text mining : Reading at Random

Posted by: zyxo | March 20, 2009

The chicken or the egg ?

What was first ? The chicken or the egg ?
It is an often heard question. The answer, however, is never given.

Well, for a change : here is the answer.

First was … the egg.

But … did you not need a chicken first to lay the egg ?
The answer is : NO !

Before the birds were the reptiles, before the reptiles were other animal species etc. And that way we can go back to the monocellular organisms.
The monocellulars, you know, those living things that consisted of only one single cell. They reproduced simply by splitting themselves in two new cells.
It was simple.
But, perhaps because it was a bit more efficient, those simple cells became a bit more complex, a bit better at gathering food, staying alive and duplicating. And so gradually, instead of simply splitting in two, they built incredibly complicated structures (for example a chicken) to build THE new cell : the egg.
(If you did not know : a chicken egg is one single huge cell).

So if I put it a bit awkwardly : a chicken is the means by which an egg reproduces itself. (oh yes, there is also the rooster, which complicates the whole story even more).

But long before the chicken there was already the egg, the one that did not need a chicken to reproduce.

Did you like this post ? Then you might be interested in the following :
top-10 lists on evolution
Evolution in blue and red
Evolution towards Intelligent Design
Are men and women different species ?
Web 5.0 : the telepathic web

Posted by: zyxo | March 19, 2009

The pope believes in evolution

With his recent speech about still being against condoms, even in Africa, I believe that Pope Benedict implicitly admits that he believes in evolution.

Why ?
As he wants everyone to live by the rules of the Catholic Church, he wants to sort of punish those who do not, that is, the people who think more creatively about sex.
Underneath this, I think that he wants to forbid condoms in order to hasten the extinction of people who do not follow the rule of “go and multiply” (meaning that you should only have sex if you want to make a child).
Or, like the Germans some decades ago : select and you will end up with a modified population, which is what evolution is all about !

Enjoyed this post ? Then you might be interested in the following :
Does the Pope believe in aliens?
top-10 lists on evolution
Natural Selection : posthuman evolution
Evolution of minerals
Evolution in blue and red

Posted by: zyxo | March 15, 2009

Cloud thinking and scientific gaming

Gaming for the sake of science !

Seti@home (Search for Extraterrestrial Intelligence) was one of the first and most popular scientific calculation programs to make use of those millions of PCs doing nothing.

Now there is fold.it, a sort of game where you have to fold proteins. This is a very complex matter, and David Baker invented this game to put not only the PCs to work, but also the brains of all those humans behind the PCs.
So it is not only cloud computing, but also cloud thinking !

Besides that, there are also Games with a Purpose : games that serve some purpose for the people setting up the game by harnessing human abilities in an entertaining setting.

Google Image Labeler uses people (who have nothing better to do) to tag images.

In Galaxy Zoo you can take part in the search for new galaxies in space.

Posted by: zyxo | March 13, 2009

A second bunch of tools for twitter

be-a-magpie : convert your tweets into money

tweetsmarter : to add special characters to your tweet and to add a retweet link so that followers can retweet you in one click.

twollo : automatically find and follow fellow tweeple with interests similar to yours.

just tweet it : a directory for twitter users : users listed per subject.

retweetist : list of links and people that are retweeted the most in the last 24 hours.

microplaza : your personal micro-news agency

tweetbackup : free backup for your Twitter

twitterhawk : targeted marketing based on tweets

twttrip : Where will you go next?
Share your travel plans with your tweeple!

twalala : a client for Twitter that allows you to control what you see, and more importantly, what you don’t see in your twitterstream.

tweetag : wordcloud of most popular topics discussed on Twitter in the last 24h + search for topics

twitalyzer : The Twitalyzer is a tool to evaluate the activity of any Twitter user and report on relative influence, signal-to-noise ratio, generosity, velocity, clout, and other useful measures of success in social media.

Tweeterate : Tweeterate extends Twitter with the possibility to rate the tweets you get from your friends

TweeterGetter : Start Getting 1000’s Of Legitimate New Twitter Followers On Autopilot via a sort of waterfall / pyramid system

obamatwits : a mashup with tweets about Obama

tweetaways : an easy way to pick a random winner for your next twitter contest or giveaway

mycleenr : MyCleenr is a unique way to sort your friends by their last tweets. It allows you to get rid of all the inactive and useless accounts that you are following!

twit2do : a simple, online to-do list manager. Create and update to-do lists here or via twitter. No signup needed, just use your twitter login and away you go – happy twit2do-ing!

twtask : create tasks directly from twitter

twtvite : to invite people for tweetups

twittercounter : shows a graph with the number of followers

twimailer : replaces the shallow emails from twitter in your inbox when people follow you by emails with a lot of info on the new follower.

tweetvolume : how many times is a word found on twitter ?

tweetgrid : create a twitter search dashboard that updates in realtime.

twitterthoughts : charts and maps based on twitter

tweetoclock : shows on which day and at what time your friends tweet most, giving you an accurate idea of when they’ll be using Twitter

destroytoday : Twitter application built to run on Mac, Windows, and Linux using Adobe AIR. It consists of a series of canvases that constantly update to keep tweets current and up-to-date using notifications that appear immediately after a new tweet arrives.

tweetmeme : the most popular links that are shared on twitter

And at this point, dear reader, I gave up !

I also found the following sites with sometimes huge lists of twittertools, and I get the feeling that twittertools are created faster than I can write them down.
So to finish this second bunch of twittertools I add some links to other sources :

twitdom

alltwittertools

another list of twitter tools

pbwiki

If you enjoyed this post, then you might also be interested in the following :
A bunch of tools for twitter
Micro Email = twitmail

 
