December 15, 2008

Cost and benefits of complexity in evolution and data mining

A recent study by Gunter Wagner and other researchers at Yale and Washington University shows that higher organisms do not have a “cost of complexity”, that is, a slowdown in the evolution of complex traits.

Cost in evolution: because of the complexity, the effect of a single mutation is diluted and has a smaller impact.
Benefits: a single mutation often affects more than one trait.
The benefits make up for the costs.

I see some correspondence with complex data mining models (for example, those produced by bagging, random forest or artificial neural network algorithms).
The cost in complex data mining models is not only the burden of training the model but also the burden of extracting all the variables that the model uses (sometimes one of my models uses more than 100 variables). And the benefit of adding new variables to an already complex model is generally relatively small.
The advantage is the dilution: when one variable contains errors or is missing for one client, the other variables, together with the complexity of the model, take over and the model still comes up with a good, usable score.
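To make the dilution idea concrete, here is a minimal sketch in Python. It is not one of my real models: it assumes a toy score that is simply the equal-weight average of all input variables, and the function names (score, shift_from_one_bad_variable) and the size of the injected error are made up for the illustration. It measures how far the score moves when one variable per client is wrong.

```python
import random

def score(values):
    """Toy equal-weight model (an assumption): the score is the mean of all variables."""
    return sum(values) / len(values)

def shift_from_one_bad_variable(n_variables, error=5.0, n_clients=200, seed=0):
    """Average absolute change in the score when one variable per client is corrupted."""
    rng = random.Random(seed)
    total_shift = 0.0
    for _ in range(n_clients):
        # Simulated clean client record: standard-normal variables.
        clean = [rng.gauss(0.0, 1.0) for _ in range(n_variables)]
        dirty = list(clean)
        # Inject an error into one randomly chosen variable.
        dirty[rng.randrange(n_variables)] += error
        total_shift += abs(score(dirty) - score(clean))
    return total_shift / n_clients

if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, "variables -> average score shift", round(shift_from_one_bad_variable(n), 3))
```

In this toy setup the shift is the error divided by the number of variables, so the same bad value moves a 100-variable score twenty times less than a 5-variable score. Real bagged or neural network models do not weight variables equally, so the effect there is less tidy, but the direction is the same.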
So instead of wasting my time getting all variables right and cleaning up all the data, I just throw the whole bunch of variables into the algorithm and let the computer do the dirty work. It usually comes up with very good and very robust models.
I love the data mining saying: “more is better”, and by that I mean more observations and more variables.

