This list of data mining mistakes by John Elder, Ph.D., is not so recent any more, but it is worth looking at every once in a while so as not to forget them:

- Lack Data
- Focus on Training
- Rely on One Technique
- Ask the Wrong Question
- Listen (only) to the Data
- Accept Leaks from the Future
- Discount Pesky Cases
- Extrapolate
- Answer Every Inquiry
- Sample Casually
- Believe the Best Model.

(the counting error is John Elder's, not mine!)
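To make "Focus on Training" concrete, here is a minimal sketch of my own (it is not taken from Elder's paper). It assumes a toy data set whose labels are pure coin flips and a one-nearest-neighbour classifier; the helper names `nearest_neighbor_predict` and `accuracy` are made up for the illustration. The model looks perfect on the data it was trained on and is no better than guessing on held-out cases.

```python
import random


def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Share of rows whose label matches the 1-nearest-neighbour prediction."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in rows)
    return hits / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption for the sketch: labels are coin flips, unrelated to the input x,
# so no model can genuinely do better than about 50% on new cases.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, holdout = data[:200], data[200:]

train_acc = accuracy(train, train)      # every point is its own nearest neighbour
holdout_acc = accuracy(train, holdout)  # honest estimate on unseen cases

print(f"training accuracy: {train_acc:.2f}")
print(f"holdout accuracy:  {holdout_acc:.2f}")
```

The point of the toy is only that the training score says nothing about new data; judge a model on cases it has never seen.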

I also distilled a more recent list from this splendid paper by Doug Wielenga at the 2007 SAS forum:

- failing to consider enough variables
- incorrectly preparing or failing to prepare categorical predictors
- incorrectly preparing or failing to prepare continuous predictors
- ignoring or misusing time-dependent information
- inappropriate metadata
- inadequate or excessive input data
- inappropriate or missing target profile for categorical target
- target variable event levels occurring in different proportions
- differences in misclassification costs
- misunderstanding the roles of the partitioned data sets
- failing to consider changing the default partition
- failing to evaluate the variables before selection
- using only one selection method
- misunderstanding or ignoring variable selection options
- choosing settings in the chi-squared or R-squared mode
- failing to evaluate imputation method
- overlooking missing value indicators
- overusing stepwise regression
- inaccurately interpreting the results
- ignoring tree instability
- ignoring tree limitations
- failing to do variable selection
- failing to consider neural networks
- misinterpreting lift
- choosing the wrong assessment statistic
- generating inefficient score code
- ignoring the model performance
- building one cluster solution
- including (many) categorical variables
- failing to sort the (assessment) data set
- failing to manage the number of outcomes
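As an illustration of the "misinterpreting lift" item, here is a small sketch of lift as it is commonly defined: the event rate among the top-scored cases divided by the overall event rate. The function name `lift_at` and the ten toy cases are my own assumptions, not something from Wielenga's paper. Note that the cases have to be sorted by score before the top slice is taken, which also touches the item about sorting the assessment data set.

```python
def lift_at(scores, outcomes, fraction):
    """Lift = event rate among the top-scored `fraction` of cases / overall event rate."""
    # Rank cases from highest to lowest score before cutting off the top slice.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate


# Hypothetical toy data: model scores and observed events (1 = event occurred).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

top20_lift = lift_at(scores, outcomes, 0.2)
full_lift = lift_at(scores, outcomes, 1.0)

print(f"lift in the top 20%: {top20_lift:.2f}")
print(f"lift over all cases: {full_lift:.2f}")
```

A lift above 1 in the top 20% only says that events are that many times more frequent there than in the data as a whole. It does not say the model is that many times more accurate, and over all cases the lift is always 1.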

Two totally different lists. Has data mining changed that much? No. People have learnt in the meantime, and the two lists also aim at different things: the first is more general (do not misuse data mining), the second more technical (when you do it, do it well).

Both papers are worth reading! Enjoy!


so true… along the similar lines @ HT blog

The Most Difficult Task for Every Data Analyst is to…

comments welcomed

By: Highstone Tower on September 12, 2012 at 10:28 pm