You are lazy, or you have not much time, or the boss does not want to hire another data miner … But still you want to do more…
For example, like me you use decision trees to calculate models for response prediction, because decision trees are easy : they are not parametric, which means you do not need to go trough all the cumbersome work of imputing missing values, coding dummy’s for categorical variables, do some transformations to get more normally distributed variables and so on.
But still you want to do more. You not only want to predict the probability of purchasing, but you want also to estimate the amount that your customer will spend.
Why, that’s a problem, Did you ever try to run a multiple regression with 500 variables where 95% of them have a missing value for at least one variable ? Right, you get nothing, nada, nil …
But there’s no time to to all the work of imputing missing values, coding dummy’s for categorical variables, do some transformations to get … !
Decision Trees ! YES ! Decision Trees !
The trick is simple :
- take all your customers who did some purchases in the past and calculate the amount (for example per week, per month …)
- calculate the distribution of these amounts to get an indea of what you are dealing with
- transform the amount into let’s say five categories (very low, low, moderate, high, very high) with a fair number of customers in each category. (What’s a fair number ? in data mining it is simple : “there’s no data like more data”)
- Now you can chose : either you calculate 5 decision trees with a binary target to predict the probability that your customer belongs to one of the 5 categories, or you can calculate one decision tree with a multi-level target of 5 values
- for each customer, you calculate the probability of each of the 5 amount categories. Keep the category with the highest probability as the right one.
You could even go further : in stead of your original 500 variables which where no good for a decent multivariate regression, you have now 5 new nice continuous variables which you can use in a multiple regression to model the original amount. Which means that the end result is not just a category, but a genuine estimated purchase amount :
estimated purchase amount = constant + a x probability_very_low + b x probability_low + c x probability_moderate + d x probability_high + e x probability_very_high
What do you think ? A silly idea ? Worth a try ? Please let me know.
Related articles by Zemanta
- Data mining at a higher level (zyxo.wordpress.com)
- Classification, Prior Probabilities and Soft Metrics (zyxo.wordpress.com)
- Data Mining : What is a good lift ? (zyxo.wordpress.com)