From simple statistics we know that by taking larger samples, the calculated average tends to lie closer to the real average (the standard error of the mean equals the standard deviation divided by the square root of the number of observations).

The principle behind this is simple: by taking a small sample, you get a random error. By taking a lot of small samples you get a lot of random errors. The good thing is that, as long as these errors are independent and not biased in one direction, their average approaches zero as the number of small samples increases.
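To make the square-root rule concrete, here is a small simulation sketch. It assumes a normally distributed population with a standard deviation of 1; the function name `sample_mean_spread`, the sample sizes and the number of trials are illustrative choices of mine, not something taken from a particular data set.

```python
import random
import statistics

def sample_mean_spread(population_sd, n, trials=2000, seed=0):
    """Draw many samples of size n and return the spread of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, population_sd) for _ in range(n))
        for _ in range(trials)
    ]
    # The standard deviation of the sample means is the observed standard error.
    return statistics.stdev(means)

for n in (4, 16, 64):
    observed = sample_mean_spread(1.0, n)
    predicted = 1.0 / n ** 0.5  # standard deviation / sqrt(number of observations)
    print(f"n={n:3d}  observed={observed:.3f}  predicted={predicted:.3f}")
```

The observed and predicted columns should come out close to each other, and both shrink by half every time the sample size is multiplied by four.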

In data mining this use of multiple samples was introduced by Leo Breiman, who called it bagging (bootstrap aggregating), and it is a very powerful source of model improvement. But when using this technique you must not forget that each separate sample and its associated model **has to make a random error**. In data mining we call this ‘overfitting’. Without overfitting the gain of bagging largely diminishes.
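A minimal sketch of the idea, under assumptions of my own: the data are a noisy sine curve, and the base model is a 1-nearest-neighbour predictor, chosen only because it overfits on purpose (it copies the noise of the closest training point). None of the names or numbers below come from Breiman's work; they are purely illustrative.

```python
import math
import random

def nearest_neighbour_predict(train, x):
    # Overfitting base model: return the label of the closest training point, noise included.
    return min(train, key=lambda point: abs(point[0] - x))[1]

def bootstrap_samples(train, n_models, rng):
    # Each bootstrap sample is drawn with replacement and has the size of the original data.
    return [rng.choices(train, k=len(train)) for _ in range(n_models)]

def bagged_predict(samples, x):
    # Bagging: average the predictions of the models built on each bootstrap sample.
    return sum(nearest_neighbour_predict(s, x) for s in samples) / len(samples)

def mean_squared_error(predict, xs):
    # Error measured against the true, noise-free curve.
    return sum((predict(x) - math.sin(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(42)
train = []
for _ in range(150):
    x = rng.uniform(0.0, 6.0)
    train.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))  # true curve plus random error

test_xs = [i * 0.06 for i in range(101)]
samples = bootstrap_samples(train, 30, rng)

single_mse = mean_squared_error(lambda x: nearest_neighbour_predict(train, x), test_xs)
bagged_mse = mean_squared_error(lambda x: bagged_predict(samples, x), test_xs)
print(f"single model: {single_mse:.3f}  bagged: {bagged_mse:.3f}")
```

Each individual model is wrong in its own random way, and the average of the models is what cancels those errors out. If you replace the base model by something that barely overfits (a straight line, say), the individual errors are mostly shared and there is much less for the averaging to cancel.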

In the recent Netflix data mining competition, the top results were obtained by averaging the results of a large number of models, each with their own ‘random errors’.

But the number of samples is not the only thing that matters: the differences between the samples have to be as large as possible, which means that the diversity within the original data source has to be large enough.
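Why the differences matter can be shown with another small simulation. This is a sketch under a simplifying assumption of mine: every model's error is a mix of a component shared by all models and a component of its own, so that `correlation` is the correlation between the errors of any two models. The function name and the numbers are illustrative.

```python
import math
import random
import statistics

def spread_of_average(n_models, correlation, trials=4000, seed=1):
    """Spread of the averaged error of n_models whose errors are pairwise correlated."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # the part of the error every model has in common
        errors = [
            math.sqrt(correlation) * shared
            + math.sqrt(1.0 - correlation) * rng.gauss(0.0, 1.0)  # the model's own part
            for _ in range(n_models)
        ]
        averages.append(statistics.fmean(errors))
    return statistics.stdev(averages)

for correlation in (0.0, 0.5, 0.9):
    observed = spread_of_average(25, correlation)
    # Variance of the average of n equally correlated unit-variance errors.
    predicted = math.sqrt(correlation + (1.0 - correlation) / 25)
    print(f"correlation={correlation:.1f}  observed={observed:.3f}  predicted={predicted:.3f}")
```

Averaging only removes the part of the error that differs from model to model. The shared part stays, no matter how many models you add, which is why 25 nearly identical models are hardly better than one.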

Now the question: is bagging only useful in data mining?

NO!

The Google report on their prediction markets showed that people who sit close together show similar trading behaviour. So, to obtain the optimal result, you need multiple ‘samples’ of people, samples drawn from a large original data source: in other words, people from as many different locations in the enterprise as possible.

Now I come to the idea of Jenny Ambrozek in a post on the application gap. Enterprises with geographically dispersed staff have an advantage here: their larger diversity of thinking lets them deliver better solutions to problems.

Pursuing this line of thought we arrive at web 2.0 situations. If thinking, information exchange and information storage in an enterprise are as free as possible, not limited by a zillion internal rules or technical obstacles, the diversity of information increases and allows much more diverse ideas and, finally, better solutions, products, innovations, etc.

