Posted by: zyxo | August 4, 2008

The top-10 data mining mistakes

This list by John Elder, Ph.D., is not so recent any more, but it is worth revisiting every once in a while so as not to forget these mistakes:

  1. Lack Data
  2. Focus on Training
  3. Rely on One Technique
  4. Ask the Wrong Question
  5. Listen (only) to the Data
  6. Accept Leaks from the Future
  7. Discount Pesky Cases
  8. Extrapolate
  9. Answer Every Inquiry
  10. Sample Casually
  11. Believe the Best Model.

(The counting error is John Elder's, not mine!)
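
To make items 2 and 6 a bit more concrete, here is a minimal sketch of a chronological hold-out split: the model is judged on records it has not seen, and nothing dated after the cutoff can leak into training. The record layout `(day, features, label)`, the field name `visits` and the function name `chronological_split` are my own assumptions for illustration; they do not come from John Elder's paper.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Split (day, features, label) records at a cutoff date.

    Everything strictly before the cutoff goes to training, everything
    on or after it goes to the hold-out set, so no record from the
    "future" can end up in the training data.
    """
    train = [r for r in records if r[0] < cutoff]
    holdout = [r for r in records if r[0] >= cutoff]
    return train, holdout


# Hypothetical toy data: (observation date, features, target label).
records = [
    (date(2008, 1, 5), {"visits": 3}, 0),
    (date(2008, 3, 17), {"visits": 9}, 1),
    (date(2008, 6, 2), {"visits": 1}, 0),
    (date(2008, 7, 21), {"visits": 7}, 1),
]

cutoff = date(2008, 6, 1)
train, holdout = chronological_split(records, cutoff)
print(len(train), "training records,", len(holdout), "hold-out records")
```

Whatever model you build, fit it on `train` only and report its quality on `holdout`. A random split would be fine against mistake 2 alone, but it would not protect you against mistake 6.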

I also distilled a more recent list from this splendid paper by Doug Wielenga from the 2007 SAS forum:

  1. failing to consider enough variables
  2. incorrectly preparing or failing to prepare categorical predictors
  3. incorrectly preparing or failing to prepare continuous predictors
  4. ignoring or misusing time-dependent information
  5. inappropriate metadata
  6. inadequate or excessive input data
  7. inappropriate or missing target profile for categorical target
  8. target variable event levels occurring in different proportions
  9. differences in misclassification costs
  10. misunderstanding the roles of the partitioned data sets
  11. failing to consider changing the default partition
  12. failing to evaluate the variables before selection
  13. using only one selection method
  14. misunderstanding or ignoring variable selection options
  15. choosing settings in the chi-squared or R-squared mode
  16. failing to evaluate imputation method
  17. overlooking missing value indicators
  18. overusing stepwise regression
  19. inaccurately interpreting the results
  20. ignoring tree instability
  21. ignoring tree limitations
  22. failing to do variable selection
  23. failing to consider neural networks
  24. misinterpreting lift
  25. choosing the wrong assessment statistic
  26. generating inefficient score code
  27. ignoring the model performance
  28. building one cluster solution
  29. including (many) categorical variables
  30. failing to sort the (assessment) data set
  31. failing to manage the number of outcomes
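
As an illustration of item 24, here is a small sketch of how lift is usually computed: the response rate among the top-scored fraction of cases divided by the overall response rate. The function name `lift_at` and the toy scores are hypothetical, and this is the common textbook definition, not necessarily the exact statistic Doug Wielenga's paper discusses.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift in the top `fraction` of cases ranked by score.

    labels are 0/1 outcomes; lift is the response rate in the selected
    top group divided by the response rate in the whole data set.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(scores)
    k = max(1, round(n * fraction))  # size of the top group
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no events")
    return top_rate / base_rate


# Hypothetical toy data: 10 cases, 2 events, both ranked at the top.
scores = [0.95, 0.90, 0.70, 0.65, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(lift_at(scores, labels, fraction=0.2))
```

Note that lift can never exceed 1 divided by the overall response rate. With 20% events, as in this toy data, a lift of 5 is the maximum, while with 1% events a lift of 5 is far from it. Comparing lift values across data sets with different event proportions is one easy way to misinterpret it.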

Two totally different lists. Has data mining changed that much? No. People have learnt in the meantime, and the lists also differ in scope: the first is more general (do not misuse data mining), while the second is more technical (when you do it, do it well!).
Both papers are worth reading! Enjoy!

Did you like this post? Then you might be interested in the following:
Oversampling or undersampling?
Data mining with decision trees: what they never tell you
Mining highly imbalanced data sets with logistic regressions
How many inputs do data miners need?


  1. so true… along the similar lines @ HT blog

    The Most Difficult Task for Every Data Analyst is to…

    comments welcomed
