Posted by: zyxo | April 20, 2012

More than Everything You Wanted to Know About Data Mining


Here I use some quotes (in italic) from the original article on Everything You Wanted to Know About Data Mining but Were Afraid to Ask and add some personal thoughts.  So, yes, it will be “more”!


“We know that data is powerful and valuable.” …  
“Data mining allows companies and governments to use the information you provide to reveal more than you think.”

Do not be fooled.  Not everything is data mining.  To learn more about the difference between, for instance, mere data gathering and real data mining, you should read “is reading a newspaper data mining”.


“To most of us data mining goes something like this: tons of data is collected, then quant wizards work their arcane magic, and then they know all of this amazing stuff. But, how?”

How? That is exactly what I described in Se7en steps in finding knowledge nuggets.  Oh yes, and not the polite textbook version, but how it happens in real life.


“And these days, there’s always more data.”
  “The sheer scale of this data has far exceeded human sense-making capabilities.”

Yes, but in most cases it is not necessary to use all of this data.  Most companies are only beginning to mine small parts of their own data.  This stands in striking contrast to what large software and hardware vendors want us to believe.  After all, they want to sell their products and the accompanying consultancy.  In most cases a free software package like Weka or R is sufficient.


“Data mining is used to … allow us to infer things about specific cases based on the patterns we have observed.”

That is the core task of data mining: detecting patterns.  And this can be any kind of pattern, as will become clear in the following paragraphs.  But in fact it is relatively simple.  There are two kinds of patterns: those which can be detected by unsupervised learning and those we detect using supervised learning.

Supervised means that we decide what we want to reveal, we have a specific problem to solve. Examples: how can we select customers who are highly likely to buy product X?  How can we identify customers who will not be able to make their mortgage loan payments?  How can we identify fraudulent tax returns?

Unsupervised means that we do not decide in advance what we are looking for.  We let the data speak and just see which patterns emerge.


“Anomaly detection: in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.”

Another use for anomaly detection is for example to detect errors in the data.
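As a rough illustration of the idea (not how the IRS or anyone else actually does it), here is a minimal Python sketch that flags values lying far from the mean, measured in standard deviations.  The deduction amounts, the function name find_anomalies and the threshold of 2.5 are all assumptions made up for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical deduction amounts; one of them is either a very unusual
# return or a plain data entry error.
deductions = [1200, 1350, 1100, 1280, 1420, 1310, 1250, 9800, 1190, 1330]

print(find_anomalies(deductions))
```

Note that a single extreme value also inflates the standard deviation it is measured against, so on real data a more robust measure (for example one based on the median) is often a better choice.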


“Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses.”

In most cases the buying patterns are not the only variables that are used.  The other variables, like age, amount of money spent, frequency of buying etc., do not fit into a pure association learning setup.  Instead, the modeler uses some other predictive algorithm (classification) where the actual purchases are just one variable type among the various other variable types.
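For the pure shopping basket case from the quote, the two numbers behind an association rule are easy to compute by hand.  The following Python sketch uses invented baskets and a hypothetical helper rule_stats to calculate the support and the confidence of the rule "cocktail shaker + recipe book => martini glasses".  It is only meant to show what those measures mean, not how a real recommendation system works.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Invented purchase baskets, one set of items per customer.
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "recipe book"},
    {"cocktail shaker", "martini glasses"},
    {"recipe book"},
    {"garden hose"},
]

support, confidence = rule_stats(
    baskets,
    antecedent={"cocktail shaker", "recipe book"},
    consequent={"martini glasses"},
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Support is the share of all baskets that contain every item of the rule; confidence is the share of baskets with the shaker and the book that also contain the glasses.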


“Cluster detection: it is possible to let the data itself determine the groups. … in a simple example we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.”

This is one of the most difficult topics in data mining, not because the algorithms are difficult, but because in most cases the results have no actual business meaning, unless you take good care of the difficult preparatory work.  I described this problem in Distances, the biggest challenge in clustering. Perhaps you are also interested in the difference between clustering and segmentation.
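A small Python sketch of why the preparatory work matters so much: with three invented customers described by age and yearly spend, the "nearest" customer changes as soon as the variables are put on a comparable scale.  The data, the helper names standardize and nearest, and the choice of z-score scaling are all assumptions for illustration.  Other scalings would give yet other answers, which is exactly the difficulty.

```python
from math import dist
from statistics import mean, stdev


def standardize(points):
    """Rescale every variable to mean 0 and standard deviation 1."""
    columns = list(zip(*points.values()))
    mus = [mean(c) for c in columns]
    sigmas = [stdev(c) for c in columns]
    return {
        name: tuple((x - m) / s for x, m, s in zip(p, mus, sigmas))
        for name, p in points.items()
    }


def nearest(points, name):
    """Return the other point with the smallest Euclidean distance to name."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: dist(points[name], points[other]))


# Invented customers: (age in years, yearly spend).
customers = {
    "A": (25, 50000),
    "B": (65, 50600),
    "C": (27, 50800),
}

print(nearest(customers, "A"))               # raw values: spend dominates
print(nearest(standardize(customers), "A"))  # after scaling: age counts too
```

On the raw values the 40 year age gap between A and B hardly matters because spend is measured in much larger numbers.  After scaling, customer C, who is almost the same age as A, becomes the closest one.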


“Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. … Spam filters are a great example of this … to notice differences in word usage between legitimate and spam messages”

Other applications are: classify customers into those who will buy product X vs. those who will not.  Classify customers into those who will not be able to make their mortgage loan payments vs. those who will be able to pay.  Classify tax returns into the fraudulent ones vs. the good ones.  Classify customers into those who are going to churn and those who are not.  Classify stocks into those that are going to rise vs. those that are going to drop.

There are a lot of algorithms to build classification models: decision trees (and the more complex decision tree based algorithms), support vector machines, (logistic) regression, ant colony optimization, genetic algorithms.
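To make this a bit more concrete, here is a Python sketch of the simplest possible decision tree: a single split on one variable, sometimes called a decision stump.  The debt-to-income percentages and default labels are invented, and the function names fit_stump and predict are hypothetical.  A real tree algorithm repeats this kind of split search over many variables and many levels.

```python
def fit_stump(xs, ys):
    """Find the threshold on x that misclassifies the fewest cases.

    Assumes the positive class (y == 1) lies above the threshold.
    """
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def predict(threshold, x):
    """Classify a new case: 1 if x is above the learned threshold, else 0."""
    return int(x > threshold)


# Invented training data: debt-to-income ratio in percent, and whether the
# customer later failed to make the mortgage payments (1) or not (0).
ratios = [10, 15, 22, 30, 35, 48, 52, 60]
defaulted = [0, 0, 0, 0, 0, 1, 1, 1]

threshold = fit_stump(ratios, defaulted)
print(threshold, predict(threshold, 20), predict(threshold, 55))
```

The learned threshold is then applied to new customers, which is the whole point of classification: the known cases teach the model, the unknown cases get scored.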


“Data mining, in this way, can grant immense inferential power. … it is how most successful Internet companies make their money and from where they draw their power.”

And not only internet companies: banks, credit card companies, and each and every company that is big enough to pay for people and software can at least try some data mining and see what comes out of it.





