There has been a lot of buzz recently about how Target was able to predict which of their customers are pregnant. Here is how we would approach the problem.
Target started with a belief that women form brand allegiances when they shop in their third trimester. As such, they want to be able to predict when their female customers will enter that trimester. By sending relevant coupons at the end of the second trimester, they want to encourage their customers to visit Target and forge more of those long-lasting relationships.
We can think of approaching the problem in three steps: first predict which customers are pregnant, then predict the due dates, and finally figure out the best coupons to send to get the customer to come back to the store. In this post we’re going to look at the first problem, predicting which customers are pregnant.
The pregnancy prediction problem can be further broken apart as follows:
- Establish a training data set made up of pregnant and non-pregnant shoppers
- Create ‘market baskets’ of items purchased by these customers
- Choose a model, identify relevant features, and generate pregnancy prediction scores
- Determine which customers receive mailers
Creating a training data set
To predict pregnancy, we first need to develop a training data set for the models. We filter down the customer data set to women who shop regularly at Target. Target must either have a way of linking gender and guest ID directly, or be able to determine gender from the products that guests buy. These customers need to be fairly regular shoppers so that there is enough data for accurate predictions.
Target also has some data on which of these women are pregnant. The article says that they have due date information from guests who supply the information with Target’s gift registry. We can use this data as the training set for the model.
Defining ‘pregnant’ and ‘non-pregnant’ market baskets
We can establish a variety of ‘market baskets’ of products that are purchased by pregnant and non-pregnant women. We create the baskets of pregnancy products by looking at what guests buy in their first 26 weeks of pregnancy. We establish a baseline basket of products that non-pregnant women purchase by taking products that women purchase in a randomly selected 26-week period.
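As a rough illustration of how such baskets could be assembled, here is a minimal Python sketch. It assumes a pregnancy lasts about 40 weeks, so the first 26 weeks start roughly 280 days before the registry due date. The function names, the purchase records and the product names are all hypothetical and are not taken from the article.

```python
from datetime import date, timedelta

# Assumption: a pregnancy lasts about 40 weeks, so it starts ~280 days before the due date.
PREGNANCY_LENGTH = timedelta(weeks=40)
WINDOW = timedelta(weeks=26)


def basket_in_window(purchases, window_start):
    """Return the set of products bought in the 26 weeks starting at window_start.

    purchases: iterable of (purchase_date, product_name) pairs for one guest.
    """
    window_end = window_start + WINDOW
    return {product for day, product in purchases if window_start <= day < window_end}


def pregnancy_basket(purchases, due_date):
    """Basket for the first 26 weeks of a pregnancy with the given due date."""
    return basket_in_window(purchases, due_date - PREGNANCY_LENGTH)


# Hypothetical purchase history for one registry guest.
purchases = [
    (date(2011, 1, 10), "coffee"),             # before the pregnancy window
    (date(2011, 3, 1), "unscented lotion"),
    (date(2011, 5, 20), "prenatal vitamins"),
    (date(2011, 11, 1), "diapers"),            # after week 26
]

basket = pregnancy_basket(purchases, due_date=date(2011, 12, 1))
print(sorted(basket))
```

The same `basket_in_window` helper could build the baseline baskets by passing a randomly chosen start date instead of one derived from a due date.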
We are now armed with the data we need to predict pregnancy. The article says that:
[Target’s statistician] was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
Picking a model and approach to learn and predict
The first thing to do is feature selection, the process of picking which of the possible predictor variables are relevant. In this case the features are the purchase or lack of purchase of specific products. Target has tens of thousands of products, and in order to predict which customers are pregnant we need to determine the subset of products that are purchased more by pregnant women. To figure this out, we could code each product in their portfolio with a boolean indicator variable, so that each market basket is a collection of these variables. For n market baskets and m items in the store, we can encode the problem as an n x 1 vector of response variables, where a 1 indicates that the market basket was from a pregnant woman and a 0 indicates that it was from a randomly chosen female customer. We would also make an n x m matrix of predictor variables, where the rows are the individual market baskets and the m columns are the items in Target’s inventory. Each cell in the matrix is filled with a 1 if the item is present in the market basket and a 0 if it is absent.
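To make the encoding concrete, here is a small Python sketch that turns a handful of baskets into the response vector and the 0/1 predictor matrix. The function name `encode_baskets` and the toy baskets are made up for illustration. With tens of thousands of products, a real implementation would presumably use a sparse matrix rather than nested lists.

```python
def encode_baskets(baskets, labels):
    """Turn market baskets into a 0/1 predictor matrix X and response vector y.

    baskets: list of sets of product names, one set per market basket.
    labels:  list of 0/1 flags, 1 if the basket came from a pregnant guest.
    """
    products = sorted(set().union(*baskets))  # the m columns, in a fixed order
    X = [[1 if product in basket else 0 for product in products] for basket in baskets]
    y = list(labels)
    return products, X, y


# Hypothetical baskets: two from pregnant guests, one baseline basket.
baskets = [
    {"prenatal vitamins", "unscented lotion"},
    {"unscented lotion", "coffee"},
    {"coffee", "wine"},
]
labels = [1, 1, 0]

products, X, y = encode_baskets(baskets, labels)
print(products)
for row in X:
    print(row)
```

Each row of `X` is one basket and each column is one product, so the first basket above has ones only in the "prenatal vitamins" and "unscented lotion" columns.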
Then we could use a supervised learning algorithm to predict which baskets belong to pregnant individuals, and perform feature selection to figure out which products are the most predictive of pregnancy. The more popular supervised learning algorithms include logistic regression, neural nets, support vector machines and random forests. I would start with an L1-regularized logistic regression, which combines the prediction and feature selection steps (Tibshirani et al.). Regularization is a way to avoid overfitting, and it works through penalized maximum likelihood estimation. With an L1 penalty, the coefficients of unhelpful products are shrunk all the way to zero, so the regularization also determines which products are useful: we can pick a regularization parameter, and then keep all products that have non-zero prediction coefficients.
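Here is a minimal sketch of the idea in plain Python, assuming an L1 (lasso) penalty. Note that it uses simple proximal gradient descent rather than the coordinate descent algorithm from the cited paper, and the step size, iteration count, penalty value and toy data are arbitrary choices for illustration. In practice we would reach for an established library instead of hand-rolling the fit.

```python
import math


def soft_threshold(value, amount):
    """Shrink value toward zero by amount; weak coefficients land on exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, penalty, step=0.5, iterations=2000):
    """Fit logistic regression with an L1 penalty by proximal gradient descent.

    Returns (intercept, coefficients). The intercept is not penalized.
    """
    n, m = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * m
    for _ in range(iterations):
        # Predicted probability of pregnancy for each basket.
        probs = [sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) for row in X]
        errors = [p - target for p, target in zip(probs, y)]
        intercept -= step * sum(errors) / n
        for j in range(m):
            gradient = sum(e * row[j] for e, row in zip(errors, X)) / n
            # Gradient step on the likelihood, then shrink by the L1 penalty.
            coefs[j] = soft_threshold(coefs[j] - step * gradient, step * penalty)
    return intercept, coefs


# Toy data: vitamins appear only in pregnant baskets, coffee equally in both groups.
products = ["prenatal vitamins", "coffee"]
X = [[1, 1], [1, 0], [1, 1], [1, 0],   # baskets from pregnant guests
     [0, 1], [0, 0], [0, 1], [0, 0]]   # baseline baskets
y = [1, 1, 1, 1, 0, 0, 0, 0]

intercept, coefs = fit_l1_logistic(X, y, penalty=0.1)
selected = [p for p, c in zip(products, coefs) if c != 0.0]
print(selected)
```

In this toy data set coffee shows up equally often in both groups, so its coefficient stays at zero and only "prenatal vitamins" is selected. A customer's pregnancy prediction score is then the sigmoid of the intercept plus the coefficients of the selected products in her basket.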
Choosing to whom to send mailers
At this point we have a pregnancy prediction score for every customer, and need to figure out what the appropriate cut-off is. We do this by picking a false discovery rate (FDR). Since we will never be able to predict with 100% accuracy who is pregnant and who is not, we need a way to limit the error that we will make. We can set an FDR of 0.05, which is to say that when we send out the mailers we expect 95% of the women who receive them to be pregnant, and 5% to be false positives (Storey, 2002).
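One simple way to turn an FDR target into a score cut-off is to estimate it empirically on customers whose status we already know, such as held-out registry guests and baseline shoppers. The Python sketch below does that. It is a simplification and not the estimation procedure described in the cited paper, and the function name, the scores and the labels are invented for illustration. It also assumes the scores are distinct.

```python
def cutoff_for_fdr(scores, labels, target_fdr=0.05):
    """Return the lowest score cut-off whose observed FDR stays within target_fdr.

    scores: prediction scores for customers with known status (held-out data).
    labels: 1 if the customer was actually pregnant, 0 otherwise.
    Returns None if no cut-off meets the target.
    """
    ranked = sorted(zip(scores, labels), reverse=True)  # highest scores first
    best_cutoff, false_positives = None, 0
    for mailed, (score, label) in enumerate(ranked, start=1):
        false_positives += 1 - label
        # Observed FDR if we mailed everyone scoring at least this high.
        if false_positives / mailed <= target_fdr:
            best_cutoff = score
    return best_cutoff


# Hypothetical held-out scores and true pregnancy status.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 1, 0, 1, 1, 0, 0]

print(cutoff_for_fdr(scores, labels, target_fdr=0.2))
print(cutoff_for_fdr(scores, labels))
```

With a target of 0.2 the cut-off drops to 0.6, since mailing the top six customers includes only one false positive. With the stricter default of 0.05 only the top three qualify, giving a cut-off of 0.85. On a data set this small the estimate is very coarse, so a real analysis would need far more labeled customers.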
This is just one method that could be used to identify the pregnant customers. Obviously this is fairly speculative, and not necessarily how Target approached the problem.
Tibshirani, R., Hastie, T. and Friedman, J. H. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, American Statistical Association, 33(1).
Storey, J. D. (2002), A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64: 479–498. doi: 10.1111/1467-9868.00346