This list by John Elder, Ph.D., is not so recent any more, but it is worth looking at every once in a while so as not to forget its items:
- Lack Data
- Focus on Training
- Rely on One Technique
- Ask the Wrong Question
- Listen (only) to the Data
- Accept Leaks from the Future
- Discount Pesky Cases
- Extrapolate
- Answer Every Inquiry
- Sample Casually
- Believe the Best Model.
(the counting error is John Elder's, not mine!)
I also distilled a more recent list from this splendid paper by Doug Wielenga at the 2007 SAS forum:
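To make one item concrete: "Accept Leaks from the Future" is commonly read as letting information that would not be available at prediction time slip into the training data. Here is a minimal sketch of one guard against it, splitting by time instead of at random. The helper name `time_ordered_split`, the record layout and the cutoff date are all assumptions made up for illustration; none of it comes from Elder's list.

```python
from datetime import date


def time_ordered_split(records, cutoff):
    """Split records into (train, test) around a cutoff date.

    Hypothetical helper: every training row is dated strictly before
    the cutoff, every test row on or after it, so the model never
    trains on rows that are later than the ones it is evaluated on.
    """
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Made-up example data, for illustration only.
records = [
    {"date": date(2007, 1, 15), "amount": 120.0, "churned": 0},
    {"date": date(2007, 2, 10), "amount": 80.0, "churned": 1},
    {"date": date(2007, 3, 5), "amount": 200.0, "churned": 0},
    {"date": date(2007, 4, 20), "amount": 50.0, "churned": 1},
]

train, test = time_ordered_split(records, cutoff=date(2007, 3, 1))
print(len(train), "training rows,", len(test), "test rows")
```

A date-based split only covers the row level. An input column that was computed after the outcome was known can still leak, so the inputs themselves need the same check.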
- failing to consider enough variables
- incorrectly preparing or failing to prepare categorical predictors
- incorrectly preparing or failing to prepare continuous predictors
- ignoring or misusing time-dependent information
- inappropriate metadata
- inadequate or excessive input data
- inappropriate or missing target profile for categorical target
- target variable event levels occurring in different proportions
- differences in misclassification costs
- misunderstanding the roles of the partitioned data sets
- failing to consider changing the default partition
- failing to evaluate the variables before selection
- using only one selection method
- misunderstanding or ignoring variable selection options
- choosing settings in the chi-squared or R-squared mode
- failing to evaluate imputation method
- overlooking missing value indicators
- overusing stepwise regression
- inaccurately interpreting the results
- ignoring tree instability
- ignoring tree limitations
- failing to do variable selection
- failing to consider neural networks
- misinterpreting lift
- choosing the wrong assessment statistic
- generating inefficient score code
- ignoring the model performance
- building one cluster solution
- including (many) categorical variables
- failing to sort the (assessment) data set
- failing to manage the number of outcomes
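For the "misinterpreting lift" item, it helps to recall what the number is. Under the usual definition, lift at a given depth is the response rate among the top-scored fraction of cases divided by the overall response rate. The sketch below assumes binary 0/1 labels; the function name `lift_at` and the toy data are hypothetical and not taken from Wielenga's paper.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift in the top `fraction` of cases when ranked by score.

    Assumes labels are 0/1 and that higher scores mean "more likely
    to be an event". Illustrative helper, not a library function.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no events")
    # Rank cases from highest to lowest score before cutting the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate


# Made-up scores and outcomes: 2 events among 10 cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(lift_at(scores, labels, fraction=0.2), 2))
```

In this toy case the top 20% holds one event out of two cases (a 50% response rate) against an overall rate of 20%, so the lift is about 2.5. The same model has a different lift at every depth, which is why a lift figure quoted without its depth and base rate is easy to misread.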
Two totally different lists. Has data mining changed that much? No, but people have learnt in the meantime. Also, the first list is more general (do not misuse data mining), while the second is more technical (when you do it, do it well!).
Both papers are worth reading! Enjoy!
Did you like this post? Then you might be interested in the following:
Oversampling or undersampling?
Data mining with decision trees: what they never tell you
Mining highly imbalanced data sets with logistic regressions
How many inputs do data miners need?