Posted by: zyxo | September 16, 2008

Science or swarm intelligence?


[Image: Illustration of linear least squares, via Wikipedia]

You may already have read the nonsense from Chris Anderson, who claims that science and clear models will be totally replaced by swarm intelligence. I love the reply from Luke T: "Please have your massive amounts of data and applied mathematics provide us with nuclear fusion by this weekend."

But it is a fact that in data mining we are moving step by step in that direction. Why step by step? Why not go all the way at once?
It is all a matter of computing power and storage.
The once simple, straightforward scientific models, like linear regression equations, do not require a lot of computing power. I remember doing a lot of statistics on an HP calculator with a maximum of something like 225 instructions.
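Just to illustrate how cheap such a model really is, here is a minimal sketch in Python (the data points are made up for illustration, nothing from my own work) that fits a straight line by ordinary least squares, essentially the kind of calculation those old calculators did:

```python
import numpy as np

# Minimal ordinary-least-squares sketch: fit y = a*x + b.
# The data points are invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix: one column for x, one column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem min ||X*beta - y||.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = beta
print(f"slope={a:.3f}, intercept={b:.3f}")
```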
So on this 'cheap' side we find the traditional data mining algorithms like decision trees or logistic regressions. Even artificial neural networks, which have to be kept relatively simple to prevent overfitting, are only a collection of logistic regressions. More complex algorithms comprise the "ensembles", which use voting by a lot of simpler models like decision trees. The most elaborate one is Random Forests, invented by Leo Breiman and Adele Cutler. Its most important drawback is its limited capacity: it can handle a lot of variables, but only a relatively limited number of observations.
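As a rough illustration of the ensemble idea, here is a sketch using scikit-learn on synthetic data (the dataset sizes are arbitrary, not from my own work): a random forest is nothing more than many decision trees voting together.

```python
# Ensemble-by-voting sketch with scikit-learn on synthetic data.
# A random forest trains many decision trees on bootstrap samples
# and lets them vote on the prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))
```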
On the other hand, we find the genetic algorithms and, more recently, the ant-mining algorithms, which I already mentioned in a previous post. Especially the latter are even more limited in the number of variables and observations they can handle.
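For readers who have never seen one: a genetic algorithm keeps a population of candidate solutions, selects the fittest, and produces new candidates by crossover and mutation. Here is a toy sketch (the classic "OneMax" problem of evolving a bit string toward all ones; real GA-based miners evolve rules or models instead):

```python
import random

# Toy genetic algorithm: evolve bit strings toward all ones ("OneMax").
LENGTH, POP, GENERATIONS = 20, 30, 50

def fitness(bits):
    return sum(bits)  # number of ones

def mutate(bits, rate=0.05):
    return [b ^ (random.random() < rate) for b in bits]

def crossover(a, b):
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Keep the fitter half as parents, refill with mutated offspring.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(POP - len(parents))]
    population = parents + children

print("best fitness:", fitness(max(population, key=fitness)))
```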
These two families of algorithms are copied from nature. The question: does nature have all that computing power?
YES! And not only that, it also has TIME!
Evolution, which delivered the current brains, swarms, and replication mechanisms, worked on the project for millions of years.
As for computing power: solving a traveling salesman problem in a lab with biochemical molecules (DNA, proteins) takes less than the blink of an eye.
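To see why that is impressive, consider what a conventional computer has to do. Here is a brute-force sketch (illustrative Python, with made-up coordinates): with n cities there are on the order of (n-1)! tours to check, so the work explodes long before the problem looks big.

```python
from itertools import permutations
import math

# Brute-force traveling salesman on made-up coordinates:
# enumerate every tour and keep the shortest. Already with
# 8 cities this means checking 8! = 40,320 orderings.
cities = [(0, 0), (1, 5), (2, 3), (5, 2), (6, 6), (7, 1), (3, 7), (8, 4)]

def tour_length(order):
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

best = min(permutations(range(len(cities))), key=tour_length)
print("best tour:", best, "length:", round(tour_length(best), 2))
```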
The amount of information that can be stored in our brain is huge! (See this experiment at MIT.)

Even more interesting: in our brain everything seems to be connected with everything. We have thousands of "models" working together, connected, interwoven, to deliver answers to ill-defined problems. In data mining, a model can only handle one problem. It is theoretically possible to train models simultaneously on multiple problems, but only at the cost of a loss in quality.
In my day-to-day work I already have to limit the amount of data I want to use just to build targeting models with binary targets.
I would love to be able to train one huge model in an acceptable time (e.g. overnight) that contains all targets at once, something like the sketch below. I know how to do it; only the hardware/software combination cannot handle all the necessary data and still deliver acceptable quality.
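For the curious, this is the shape of what I mean, as a hedged sketch only: one model predicting several binary targets simultaneously, here with scikit-learn's MultiOutputClassifier on synthetic data (the sizes and targets are invented; my real data is far larger, which is exactly the problem).

```python
# Sketch of "all targets at once": one model, several binary targets.
# Synthetic data stands in for real customer data.
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=2000, n_features=30,
                                      n_classes=5, random_state=0)

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=50,
                                                     random_state=0))
model.fit(X, Y)
# score() here is subset accuracy: all five targets right at once.
print("subset accuracy:", model.score(X, Y))
```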


Responses

  1. First of all, I like your blog, it's very readable and practical!

    You wrote about the random forest algorithm: “It can handle a lot of variables, but only a relatively limited number of observations.” I use RF (the R package) on datasets (regression) with almost 100 variables and 10,000-15,000 observations with robust results, so please explain your conclusion.
    Thank you,

    Jim

  2. Jim,
    Thanks for visiting my blog.
    100 variables is not much. In DNA research, for example, they use thousands of them.
    10,000 to 15,000 observations is, in business terms, a small sample for a medium to large enterprise, and falls within the capacity of random forests (it also depends on the memory of your machine).
    A few hundred thousand observations with, say, 500 variables is achievable with a decision tree, but way out of reach for random forests.
    Zyxo

  3. Thanks for the insight, I'm not familiar with microarray analysis. Perhaps a recursive random forest can be useful; see http://www.cmj.org/periodical/PaperList.asp?id=LW20081217515449304870 .
    Your answer brings me to my next question. Is there a good way to decide whether a given dataset has enough inputs & cases based on the output distribution? (Could be a nice blog item 🙂)

