All images, data, and the R script can be found here.
This is a project I did in the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC.
Prompt
Dr. Sam Parameter, author of the textbook Statistics for Poets, has been contemplating starting a new magazine, Popular Statistics, and needs to know whether it will be profitable enough. You have augmented the original dataset of 50 magazines by measuring more characteristics of the magazines and their readers that may be useful in understanding the one-page advertisement costs. The variables are the following:
Magazine Name
Cost of a four-color, one-page ad
Circulation (projected, in thousands)
Percent male among the predicted readership
Median household income of readership
Market = Sold in USA or outside USA
Your goal is to analyze the data in R using multiple linear regression, choose the best model to explain the differences in advertising costs between the different titles, and then predict what Sam should be able to charge for advertisements in the new magazine.
Steps
Data Preparation
> mag_data <- read.csv("MagazineAds.csv")   # import the data
> sum(is.na(mag_data))                      # check for missing values
> head(mag_data)
                    C1.T pagecost  circ percmale medianincome   Market
1                Audubon    25315  1645     51.1        38787 Domestic
2 Better Homes & Gardens   198000 34797     22.1        41933 Domestic
3          Business Week   103300  4760     68.1        63667 Domestic
4           Cosmopolitan    94100 15452     17.3        44237 Domestic
5                   Elle    55540  3735     12.5        47211 Domestic
6           Entrepreneur    40355  2476     60.4        47579 Domestic
We need to transform the Market variable into a dummy variable.
> mag_data$domestic <- as.numeric(as.numeric(mag_data$Market) == 1)
> # create a new dummy variable (domestic) from the Market variable
> # (this relies on Market being read in as a factor, the pre-R 4.0 default of read.csv)
> mag_data1 <- mag_data[, c(-1, -6)]   # remove the magazine name and Market columns
Data Exploration
Histogram
> for (i in 1:5) {
    histi <- hist(mag_data1[, i], plot = FALSE)
    histi$density <- histi$counts / sum(histi$counts) * 100
    plot(histi, freq = FALSE,
         main = paste("Histogram of", colnames(mag_data1)[i]),
         xlab = colnames(mag_data1)[i])
  }
> # a loop that generates a histogram for each of the five variables
Skewness is noticeable in three variables: medianincome, pagecost, and circ. All three are skewed to the right, with the peak of the histogram to the left and a long right tail. There appear to be more magazines with a one-page ad cost below the average ad cost across all magazines, and the same seems true for median income and circulation: there are more magazines with lower circulation and lower median reader income.
> summary(mag_data1)   # descriptive statistics for the variables
    pagecost           circ          percmale      medianincome      domestic     
 Min.   : 20180   Min.   : 1645   Min.   : 3.60   Min.   :31835   Min.   :0.0000  
 1st Qu.: 48875   1st Qu.: 3634   1st Qu.:15.70   1st Qu.:41413   1st Qu.:0.0000  
 Median : 71968   Median : 5510   Median :47.40   Median :46251   Median :1.0000  
 Mean   : 82386   Mean   :10793   Mean   :42.88   Mean   :48958   Mean   :0.7045  
 3rd Qu.: 99895   3rd Qu.:14528   3rd Qu.:65.55   3rd Qu.:58038   3rd Qu.:1.0000  
 Max.   :198000   Max.   :51925   Max.   :88.50   Max.   :72493   Max.   :1.0000  
Some insights from the descriptive statistics:
The cost of one ad page varies mostly from $50,000 to $100,000
Projected circulation varies mostly from 3,500 to 15,000 (thousands), but over 50% of the observations have a circulation below 6,000
Over 50% of the magazines in the sample target readers with a median household income between $40,000 and $60,000
A significant number of magazines target mainly one gender: 25% of the observations have a male readership below 15%, and another 25% have a male readership above 65%
About 70% of the observations are domestic magazines
Correlation Matrix
> plot(mag_data1)   # scatterplot matrix of all variables
The relationship between pagecost and circ is strong and looks linear. This makes economic sense: the higher the circulation, the stronger the impact of an ad campaign, and therefore the higher the ad cost.
The relationship between percmale and medianincome is also noticeable, although it does not look linear. Unlike magazines that target mostly female readers, magazines that target mostly male readers also tend to reach readers with higher incomes, and probably a higher social class.
There is also a slight relationship between medianincome and pagecost, as well as medianincome and circ. The magazines that have higher ad page cost or higher circulation tend to target readers with lower median income.
The correlation table confirms very similar insights. The cost per ad page is highly correlated with the circulation number, and median income is strongly correlated with the percentage of male readers. Median income also has a slight negative correlation with both ad page cost and circulation.
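The correlation table itself is not shown in the excerpt; it was presumably produced with something like the following (a minimal sketch, assuming the base cor() function was used):

> round(cor(mag_data1), 2)   # pairwise correlations between all numeric variables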
Model 1 - Multiple Linear Regression analysis using all the predictor variables
> M1 = lm(pagecost ~ ., data = mag_data1)
> summary(M1)

Call:
lm(formula = pagecost ~ ., data = mag_data1)

Residuals:
   Min     1Q Median     3Q    Max 
-46196 -14665  -2316   3784  40272 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7179.4408 20738.1460   0.346    0.731    
circ             3.9633     0.3433  11.544 3.77e-14 ***
percmale       -32.4809   151.7827  -0.214    0.832    
medianincome     0.7001     0.4313   1.623    0.113    
domestic      -640.5486  7524.6056  -0.085    0.933    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21620 on 39 degrees of freedom
Multiple R-squared:  0.8013,    Adjusted R-squared:  0.7809
F-statistic: 39.32 on 4 and 39 DF,  p-value: 3.436e-13
This regression model is useful overall: the F-test p-value is 3.436e-13. An R-squared of 80.13% is decent, but there is room for further improvement. At a 5% significance level, however, three of the four explanatory variables are insignificant. The p-value for circ is very small, so circulation contributes significantly to the cost, and I recommend keeping it. The p-values of all other variables are greater than 0.05, so they are not significant at this level and are candidates for removal, although the dummy variable domestic is kept in the models that follow.
The VIF for each explanatory variable is low (< 5), so there does not appear to be multicollinearity.
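The VIF calculation is not shown in the excerpt; one common way to obtain it (an assumption, the original may have used a different package) is:

> library(car)   # for the vif() function
> vif(M1)        # variance inflation factors for the predictors in Model 1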
Predict the pagecost for one observation
Predict what Sam should be able to charge for advertisements in the new magazine if it has a projected circulation of 1,800,000 readers (circ = 1,800, in thousands), 60 percent of whom are male, with a median household income of $80,000, and it will be sold internationally.
> newdata <- data.frame(circ = 1800, percmale = 60, medianincome = 80000, domestic = 0)
> # create the data for Dr. Sam's magazine, Popular Statistics, for prediction
> predict(M1, newdata = newdata, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 68372.49 41879.83 94865.15

$se.fit
[1] 13097.73

$df
[1] 39

$residual.scale
[1] 21620.7
Based on the 95% confidence interval, the mean ad cost for magazines with these characteristics is between $41,879.83 and $94,865.15.
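Note that interval = "confidence" gives an interval for the mean ad cost of magazines with these characteristics. For the cost of Sam's single new magazine, a prediction interval, which is wider, may be more appropriate; it was not computed in the original analysis, but could be obtained with:

> predict(M1, newdata = newdata, interval = "prediction")   # wider interval for an individual magazine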
Residual Analysis
> res_data1 <- residuals(M1)
> plot(res_data1, type = "l")
> abline(h = 0, col = "red")
> # QQ plot to check normality of residuals
> qqnorm(res_data1)
> qqline(res_data1, col = "red")
> shapiro.test(res_data1)

        Shapiro-Wilk normality test

data:  res_data1
W = 0.98262, p-value = 0.7384
The residual plot shows homoscedasticity: the variance of the residuals seems constant. The QQ plot suggests the residuals are approximately normal, and the Shapiro-Wilk test (p = 0.7384) gives no evidence to reject the null hypothesis that the residuals are normally distributed.
> # plot residuals against predicted values
> predicted_data1 <- predict(M1)
> plot(predicted_data1, res_data1)
> abline(h = 0, col = "red")
> library(lmtest)
> dwtest(M1)

        Durbin-Watson test

data:  M1
DW = 2.1763, p-value = 0.7356
alternative hypothesis: true autocorrelation is greater than 0

> acf(res_data1)
The Durbin-Watson test gives a p-value of 0.7356, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows no autocorrelation. However, when we plot the residuals against the predicted values, we see a non-random pattern and probably some outliers, which violates the linearity assumption. The non-random pattern indicates that the deterministic portion of the model (the predictor variables) is not capturing some explanatory information that is "leaking" into the residuals. There are some possible solutions to this problem:
Adding another explanatory variable that helps explain the pattern in the residuals
Transforming a variable in the model to account for the curvature
Adding interactions between explanatory variables
Model 2 - Log transformation of the response variable
Often, when the dependent variable represents financial data (income, price, etc.), using the natural log of the dependent variable helps to alleviate these problems. We re-run the multiple regression analysis using the natural log of the page cost variable.
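The code for this step is not shown in the excerpt; based on the later references to mag_data2, ln.pagecost, and M2, it was presumably along these lines (a sketch, not the original code):

> mag_data2 <- mag_data1
> mag_data2$pagecost <- log(mag_data2$pagecost)   # natural log of the ad page cost
> colnames(mag_data2)[1] <- "ln.pagecost"
> M2 <- lm(ln.pagecost ~ ., data = mag_data2)
> summary(M2)   # reported R-squared 0.6521, adjusted R-squared 0.6164, p-value 1.573e-08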
With a p-value of 1.573e-08, this model is useful. However, it does not seem better than the previous one, because:
The R-squared and adjusted R-squared are worse, at only 0.6521 and 0.6164
The log transformation makes the model harder to interpret
The overall p-value is larger
However, the medianincome variable is relatively useful, with a p-value of 0.08. I recommend keeping three variables: circ, medianincome, and the dummy variable domestic.
Residual Analysis
> res_data2 <- residuals(M2)
> plot(res_data2, type = "l")
> abline(h = 0, col = "red")
> qqnorm(res_data2)
> qqline(res_data2, col = "red")
> predicted_data2 <- predict(M2)
> plot(predicted_data2, res_data2)
> abline(h = 0, col = "red")
> dwtest(M2)

        Durbin-Watson test

data:  M2
DW = 2.2303, p-value = 0.7915
alternative hypothesis: true autocorrelation is greater than 0

> acf(res_data2)
> shapiro.test(res_data2)

        Shapiro-Wilk normality test

data:  res_data2
W = 0.94891, p-value = 0.05011
The residual plot does not show heteroskedasticity, so the assumption of equal error variances is reasonable. The QQ plot suggests the residuals are approximately normal. The Durbin-Watson test gives a p-value of 0.7915, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows none. However, the residuals-versus-fitted plot still shows a non-random pattern and probably some outliers, which violates the linearity assumption.
The scatterplots again suggest a strong correlation between ln.pagecost and circulation: the higher the circulation, the higher the ln.pagecost. It is also noticeable that the relationship between the two is not linear, which suggests a further transformation. All other plots do not show any pattern and look random in nature.
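These scatterplots are presumably from a scatterplot matrix of the log-transformed data; the call is not shown in the excerpt, but would look like:

> plot(mag_data2)   # pairwise scatterplots, now with ln.pagecost as the response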
Model 3 - Transforming the response and explanatory variables
In this model we also take the natural log of the circ variable.
> mag_data3 <- mag_data2
> mag_data3$circ <- log(mag_data3$circ)
> head(mag_data3)
  ln.pagecost      circ percmale medianincome domestic
1    10.13915  7.405496     51.1        38787        1
2    12.19602 10.457286     22.1        41933        1
3    11.54539  8.468003     68.1        63667        1
4    11.45211  9.645494     17.3        44237        1
5    10.92486  8.225503     12.5        47211        1
6    10.60547  7.814400     60.4        47579        1
> colnames(mag_data3)[2] <- "ln.circ"   # rename the log-transformed circulation column
> M3 <- lm(ln.pagecost ~ ., data = mag_data3)
> summary(M3)

Call:
lm(formula = ln.pagecost ~ ., data = mag_data3)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46001 -0.15107  0.01354  0.20299  0.38204 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.795e+00  5.507e-01   8.707 1.11e-10 ***
ln.circ       6.301e-01  4.798e-02  13.131 6.77e-16 ***
percmale      5.698e-04  1.734e-03   0.329  0.74418    
medianincome  1.652e-05  4.927e-06   3.352  0.00179 ** 
domestic     -5.732e-02  8.556e-02  -0.670  0.50687    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2457 on 39 degrees of freedom
Multiple R-squared:  0.8295,    Adjusted R-squared:  0.812
F-statistic: 47.42 on 4 and 39 DF,  p-value: 1.801e-14
This model has a greater R-squared than all our previous models, and its p-value is smaller, so the model is useful. Besides ln.circ, the medianincome variable, which was not significant in the previous models, is statistically significant in this model. We recommend keeping three variables: ln.circ, medianincome, and the dummy variable domestic.
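The residual-diagnostics code for Model 3 is not shown in the excerpt; it presumably mirrored the earlier residual analyses, for example (assuming res_data3 is defined this way):

> res_data3 <- residuals(M3)
> predicted_data3 <- predict(M3)
> plot(predicted_data3, res_data3)   # residuals vs fitted values
> abline(h = 0, col = "red")
> qqnorm(res_data3)                  # QQ plot to check normality
> qqline(res_data3, col = "red")
> dwtest(M3)                         # Durbin-Watson test (lmtest already loaded)
> acf(res_data3)                     # autocorrelation of the residuals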
The residual plot does not show heteroskedasticity, so the assumption of equal error variances is reasonable. Except for a few extreme residuals, the QQ plot suggests approximate normality. The residuals show a random pattern, so this model meets the linearity assumption.

> for (i in 2:5) {
    plot(mag_data3[, i], res_data3,
         xlab = colnames(mag_data3)[i], ylab = "residuals")
  }
> # plot the residuals against each predictor variable
Plotting the residuals against the predictor variables, we do not notice any apparent pattern. If there is room for improvement, however, we may want to try a transformation of the percmale variable. The Durbin-Watson test gives a p-value of 0.5624, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise shows none.

> newdata2 <- newdata
> newdata2$circ <- log(newdata2$circ)
> colnames(newdata2)[1] <- "ln.circ"
> # transform newdata to feed the predict method
> predict(M3, newdata = newdata2, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 10.87309 10.58166 11.16453

$se.fit
[1] 0.144082

$df
[1] 39

$residual.scale
[1] 0.2457245
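The fitted values above are on the log scale. To express the interval in dollars, we can exponentiate (a rough back-transformation that ignores retransformation bias; the rounded figures in the comment are my own calculation, not part of the original output):

> exp(c(fit = 10.87309, lwr = 10.58166, upr = 11.16453))
> # roughly $52,700 (fit), $39,400 (lwr), $70,600 (upr)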
Outliers
I created a function that automatically removes outliers based on Cook's Distance. The function first fits a linear model with the response variable in the first column and the explanatory variables in the remaining columns, flags observations whose Cook's distance exceeds 4/n, removes them, refits the linear model, and extracts the summary. In practice, we should reassess the model each time we remove a single outlier; for the sake of time (and, partly, my laziness), I remove them all at once here. The function lets us assess a new model quickly without going through the same lines of code again and again. Let's try it with Model 3.
> removeoutlier <- function(sampledata) {
    # fit a linear model with the 1st column as the response and the remaining columns as predictors
    model <- lm(as.formula(paste(colnames(sampledata)[1], "~",
                                 paste(colnames(sampledata)[2:ncol(sampledata)], collapse = "+"),
                                 sep = "")),
                data = sampledata)
    # flag observations whose Cook's distance exceeds 4/n (n = 44)
    cook.distance <- cooks.distance(model)
    outlier.index <- cook.distance > (4/44)
    nonoutlier.index <- !outlier.index
    mag_data_outlier_remove <- sampledata[nonoutlier.index, ]
    # refit the model without the flagged observations and return the summary
    MX <- lm(as.formula(paste(colnames(mag_data_outlier_remove)[1], "~",
                              paste(colnames(mag_data_outlier_remove)[2:ncol(mag_data_outlier_remove)], collapse = "+"),
                              sep = "")),
             data = mag_data_outlier_remove)
    summary(MX)
  }
> removeoutlier(mag_data3)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45301 -0.11550 -0.00042  0.13560  0.44817 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.414e+00  5.215e-01  10.382 2.26e-12 ***
ln.circ       5.944e-01  4.430e-02  13.417 1.41e-15 ***
percmale     -4.438e-04  1.611e-03  -0.276   0.7845    
medianincome  1.102e-05  4.808e-06   2.293   0.0278 *  
domestic     -2.167e-02  8.226e-02  -0.263   0.7938    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2195 on 36 degrees of freedom
Multiple R-squared:  0.86,    Adjusted R-squared:  0.8445
F-statistic: 55.29 on 4 and 36 DF,  p-value: 7.021e-15
After removing the outliers, the adjusted R-squared improves to about 0.845 (multiple R-squared 0.86).
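To see exactly which observations the 4/n rule flags (not part of the original code), one could inspect the Cook's distances directly:

> cd <- cooks.distance(M3)
> which(cd > 4 / nrow(mag_data3))   # row indices of the flagged magazines
> plot(M3, which = 4)               # base-R Cook's distance plot for M3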
Further Transformation
In this step we try different transformations and assess the R-squared. Below are the major steps that lead to the model with the highest R-squared.
Model 4 - Adding interactions
> mag_data4 <- mag_data3
> mag_data4$dom.ln.circ <- mag_data4$domestic * mag_data4$ln.circ
> mag_data4$dom.percmale <- mag_data4$domestic * mag_data4$percmale
> mag_data4$dom.medianincome <- mag_data4$domestic * mag_data4$medianincome
> # add interaction terms between domestic and the other predictors
> M4 <- lm(ln.pagecost ~ ., data = mag_data4)
> summary(M4)

Call:
lm(formula = ln.pagecost ~ ., data = mag_data4)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47604 -0.11733  0.03252  0.12351  0.48747 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.355e+00  1.142e+00   2.938 0.005726 ** 
ln.circ           7.207e-01  1.055e-01   6.828 5.53e-08 ***
percmale          9.997e-04  2.418e-03   0.413 0.681693    
medianincome      2.996e-05  7.352e-06   4.075 0.000242 ***
domestic          2.057e+00  1.305e+00   1.576 0.123765    
dom.ln.circ      -1.181e-01  1.174e-01  -1.006 0.320916    
dom.percmale      8.733e-04  3.407e-03   0.256 0.799175    
dom.medianincome -2.333e-05  9.927e-06  -2.350 0.024374 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2363 on 36 degrees of freedom
Multiple R-squared:  0.8545,    Adjusted R-squared:  0.8262
F-statistic:  30.2 on 7 and 36 DF,  p-value: 3.083e-13

> removeoutlier(mag_data4)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46834 -0.11309  0.03458  0.13354  0.47953 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.910e+00  1.111e+00   1.719  0.09462 .  
ln.circ           8.486e-01  1.018e-01   8.333 9.97e-10 ***
percmale          3.021e-03  2.210e-03   1.367  0.18070    
medianincome      3.454e-05  6.565e-06   5.261 7.87e-06 ***
domestic          3.821e+00  1.246e+00   3.067  0.00422 ** 
dom.ln.circ      -2.745e-01  1.117e-01  -2.457  0.01927 *  
dom.percmale     -3.100e-03  3.138e-03  -0.988  0.33024    
dom.medianincome -2.702e-05  8.751e-06  -3.088  0.00400 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2046 on 34 degrees of freedom
Multiple R-squared:  0.8891,    Adjusted R-squared:  0.8663
F-statistic: 38.95 on 7 and 34 DF,  p-value: 2.005e-14
Model 5 - Squaring percmale
> mag_data5 <- mag_data4
> mag_data5$sqr.permale <- mag_data5$percmale * mag_data5$percmale
> # check what the r-squared would be after removing outliers
> removeoutlier(mag_data5)

Call:
lm(formula = as.formula(paste(colnames(mag_data_outlier_remove)[1],
    "~", paste(colnames(mag_data_outlier_remove)[c(2:ncol(mag_data_outlier_remove))],
        collapse = "+"), sep = "")), data = mag_data_outlier_remove)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40544 -0.06190  0.03097  0.08984  0.45945 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.753e+00  1.125e+00   3.334 0.002172 ** 
ln.circ           6.927e-01  9.910e-02   6.990 6.40e-08 ***
percmale         -2.206e-02  7.070e-03  -3.120 0.003816 ** 
medianincome      3.271e-05  6.408e-06   5.104 1.47e-05 ***
domestic          2.204e+00  1.206e+00   1.827 0.077037 .  
dom.ln.circ      -1.071e-01  1.079e-01  -0.993 0.328364    
dom.percmale      2.483e-03  3.095e-03   0.802 0.428188    
dom.medianincome -2.699e-05  8.142e-06  -3.314 0.002290 ** 
sqr.permale       2.457e-04  6.759e-05   3.634 0.000967 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1761 on 32 degrees of freedom
Multiple R-squared:  0.9192,    Adjusted R-squared:  0.899
F-statistic: 45.51 on 8 and 32 DF,  p-value: 2.521e-15
After adding interaction terms and squaring the percmale variable, we arrive at a new model with an improved R-squared and adjusted R-squared.
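The code defining mag_data6 and M6 is not shown in the excerpt. Judging from the variables checked in the prediction below (ln.circ, percmale, medianincome, domestic, dom.medianincome, sqr.permale) and the 37 residual degrees of freedom, it was presumably something like the following sketch (my reconstruction, not the original code):

> # drop the insignificant interaction terms, keeping dom.medianincome and sqr.permale
> mag_data6 <- mag_data5[, !(colnames(mag_data5) %in% c("dom.ln.circ", "dom.percmale"))]
> M6 <- lm(ln.pagecost ~ ., data = mag_data6)
> summary(M6)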
> newdata3 <- newdata2
> newdata3$dom.medianincome <- newdata3$domestic * newdata3$medianincome
> newdata3$sqr.permale <- newdata3$percmale * newdata3$percmale
> colnames(newdata3) == colnames(mag_data6)[-1]
[1] TRUE TRUE TRUE TRUE TRUE TRUE
> # transform the newdata to feed the predict method
> predict(M6, newdata = newdata3, se.fit = TRUE, interval = "confidence")
$fit
       fit      lwr      upr
1 11.18268 10.80804 11.55731

$se.fit
[1] 0.1848971

$df
[1] 37

$residual.scale
[1] 0.2185238
Summary
The last model fits the data best, since it explains over 90% of the variation in the one-page ad cost.
Model 3 is also a good candidate, with an R-squared of about 0.83 (0.86 after removing outliers).
Model 1 is also a decent model, with an R-squared over 80%, and it has the advantage of a simple form. The regression equation, pagecost = 7179.4408 + 3.9633 × circ − 32.4809 × percmale + 0.7001 × medianincome − 640.5486 × domestic, is very easy to explain: for example, by increasing the projected circulation by 1,000 (thousand), Dr. Sam can raise the ad cost by almost $4,000.
Models 3 and 6 are relatively complicated, with transformed variables and interaction terms, so it is harder to tell directly how the ad cost would change when the predictor variables change. In addition, Model 6 predicts a wider interval than Model 3, which should also be taken into account.